
Problem 2.6:

Question 3:

You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: "It's so simple that I can't believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?"

a) Who is right, the marketing director or his boss? If you answered his boss, what would you do to fix the measure of satisfaction?

His boss is right.  The measure uses only the raw number of complaints for each product, with no adjustment for how many units of each product were sold.  A best-selling product will naturally attract more complaints simply because more customers own it.  A more accurate measure is the ratio of complaints to units sold, for example complaints per 1,000 units.  This gives a fairer picture of customer satisfaction, although it is still not completely accurate, for example because not every dissatisfied customer files a complaint.
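
The normalization described above can be sketched in a few lines of Python. The product names and figures are hypothetical and chosen only to show how the ordering can flip once sales volume is taken into account; `complaint_rate` is an illustrative helper, not part of the exercise.

```python
def complaint_rate(complaints: int, units_sold: int) -> float:
    """Complaints per unit sold; a lower rate suggests higher satisfaction."""
    if units_sold <= 0:
        raise ValueError("units_sold must be positive")
    return complaints / units_sold

# Hypothetical figures, for illustration only.
products = {
    "best_seller": {"complaints": 120, "units_sold": 60000},
    "niche_item": {"complaints": 30, "units_sold": 1000},
}

rates = {
    name: complaint_rate(p["complaints"], p["units_sold"])
    for name, p in products.items()
}

# Worst product according to the raw count versus the normalized rate.
worst_by_count = max(products, key=lambda name: products[name]["complaints"])
worst_by_rate = max(rates, key=rates.get)

print(worst_by_count, worst_by_rate)
```

With these made-up numbers the best seller has the most complaints but the lowest complaint rate, which is the situation the boss was objecting to.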


b) What can you say about the attribute type of the original product satisfaction attribute?

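The original product satisfaction attribute records only the number of complaints for each product, for example: milk has 5 complaints and beef has 12 complaints. The count itself is a ratio attribute, but as a measure of satisfaction it does not reliably order the products, because products with different sales volumes are not comparable. It provides only enough information to distinguish one product from another, so as a satisfaction measure it is at best a nominal attribute.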


Question 4:

A few months later, you are again approached by the same marketing director as in Exercise 3. This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products. He explains, "When we develop new products, we typically create several variations and evaluate which one customers prefer. Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference. However, our test subjects are very indecisive, especially when there are more than two products. As a result, testing takes forever. I suggested that we perform the comparisons in pairs and then use these comparisons to get the rankings. Thus, if we have three product variations, we have the customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results. And my boss wants the latest product evaluations, yesterday. I should also mention that he was the person who came up with the old product evaluation approach. Can you help me?"

a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.

Yes, the marketing director is in trouble.  The pairwise evaluation will probably yield inconsistent results, because nothing forces a customer's pairwise preferences to agree with one another.  A customer may prefer variation 1 to 2 and 2 to 3, yet also prefer 3 to 1.  An ordinal attribute requires a consistent order over all of the items, and a set of comparisons like this cannot be turned into one.  Each product is only ever judged against one other product at a time, so the approach does not guarantee an ordinal ranking.
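
The inconsistency can be made concrete with a small Python check. This is only a sketch: `consistent_rankings` is a hypothetical helper that brute-forces every ordering, which is fine for three variations but would not scale to many products.

```python
from itertools import permutations

def consistent_rankings(prefers):
    """Return every full ordering that agrees with all pairwise preferences.

    `prefers` is a set of (winner, loser) pairs.
    """
    items = sorted({item for pair in prefers for item in pair})
    result = []
    for order in permutations(items):
        position = {item: i for i, item in enumerate(order)}
        # An ordering is consistent if every winner is placed before its loser.
        if all(position[w] < position[l] for w, l in prefers):
            result.append(order)
    return result

# Hypothetical answers from two test subjects.
transitive = {(1, 2), (2, 3), (1, 3)}   # 1 over 2, 2 over 3, 1 over 3
cyclic = {(1, 2), (2, 3), (3, 1)}       # 1 over 2, 2 over 3, but 3 over 1

print(consistent_rankings(transitive))
print(consistent_rankings(cyclic))
```

The first subject's answers admit exactly one ranking, while the second subject's answers admit none, which is the situation the employees conducting the tests are running into.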


b) Is there a way to fix the marketing director's approach? More generally, what can you say about trying to create an ordinal measurement scale based on pair wise comparisons?

To obtain more usable results, it is better to score the outcomes of the comparisons.  When products are compared in pairs, each win can earn the product a point, and the “winners” of the individual comparisons can then compete against each other.  The products are then ranked by score, which makes it easier to see which product customers prefer.  More generally, pairwise comparisons yield an ordinal scale only when the preferences happen to be consistent, so some scoring rule of this kind is needed to resolve the conflicts.
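
One simple way to score pairwise results is to count wins across all test subjects, sketched below in Python. The win-count rule and the data are assumptions made for illustration; the exercise does not prescribe a particular scoring scheme, and ties remain possible.

```python
from collections import Counter

def rank_by_wins(comparisons):
    """Rank products by the number of pairwise comparisons they won.

    `comparisons` is a list of (winner, loser) pairs pooled over all
    test subjects. Ties are broken alphabetically by product name.
    """
    wins = Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        wins[loser] += 0  # make sure products with no wins still appear
    return sorted(wins.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical results from three test subjects comparing A, B and C.
comparisons = [
    ("A", "B"), ("B", "C"), ("A", "C"),  # subject 1
    ("A", "B"), ("C", "B"), ("A", "C"),  # subject 2
    ("B", "A"), ("B", "C"), ("C", "A"),  # subject 3 (a cyclic set of answers)
]

ranking = rank_by_wins(comparisons)
print(ranking)
```

Even though the third subject's answers form a cycle, pooling the wins still produces an overall order for these data.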


c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?

I do not think that this is a reasonable approach.  The data object is the product, while each test subject acts as an attribute of that product.  In the original product evaluation, all attributes are treated equally, so the result is inaccurate because a plain average is taken across all attributes.  There is no differentiation between attributes that are more important and those that are less important, although in reality attributes differ in importance.  It would be better to weight the attributes and take a weighted average rather than an unweighted one.  In addition, ranks are ordinal values, so the median rank may be more defensible than the mean.


Question 5:

Can you think of a situation in which identification numbers would be useful for prediction?

I think identification numbers are useful for prediction whenever they carry information about the object beyond its identity.  The identification number is a unique, discrete property of the object.  If the numbers are assigned in sequence, as in production, an ID can indicate roughly when the object was produced.


Question 14:

The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

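Proximity measures can be used to quantify similarity and dissimilarity among objects.  Elephants with similar attribute values can be grouped into clusters, and the groups can then be compared.  For example, elephants of similar height would be placed close together.  All five attributes are continuous numeric measurements, so Euclidean distance is the most relevant measure in this case.  The special circumstance is that the attributes are measured in different units and have very different ranges (weight versus ear area, for instance), so they should be standardized before the distance is computed.  Otherwise the attribute with the largest values dominates the result.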
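
The effect of attribute scale on Euclidean distance can be seen in a short Python sketch. The elephant measurements are invented for illustration, and z-score standardization (mean 0, standard deviation 1 per attribute) is one common choice rather than the only one.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical herd: weight (kg), height (m), tusk length (cm),
# trunk length (cm), ear area (m^2).
e1 = [4000, 2.7, 150, 180, 0.9]
e2 = [4010, 2.0, 60, 150, 0.5]   # similar weight, otherwise quite different
e3 = [3500, 2.7, 150, 180, 0.9]  # lighter, otherwise identical to e1

raw_12, raw_13 = euclidean(e1, e2), euclidean(e1, e3)
z1, z2, z3 = standardize([e1, e2, e3])
std_12, std_13 = euclidean(z1, z2), euclidean(z1, z3)

print(raw_12 < raw_13)   # on raw values, weight dominates
print(std_13 < std_12)   # after standardizing, e3 is the closer match to e1
```

On the raw values the weight column decides everything, so e2 looks closer to e1. After standardization, e3, which differs from e1 only in weight, is the nearer neighbour.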


Question 15:
You are given a set of m objects that is divided into K groups, where the ith group is of size mi. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement).

a) We randomly select n * mi/m elements from each group.

This sampling method draws from every group, which ensures that all groups are represented.  The number of objects taken from each group is fixed and proportional to the size of the group, so the composition of the sample is the same every time.


b) We randomly select n elements from the data set, without regard for the group to which an object belongs.

Because group membership is ignored, the number of objects drawn from each group varies from sample to sample and equals n * mi/m only on average.  Every object is equally likely to be chosen, so large groups tend to dominate the sample, and a small group may be under-represented or missed entirely, along with its characteristics.
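
The two schemes can be compared with a small Python simulation. The group sizes, the seed, and the use of rounding for n * mi/m are assumptions made for this sketch; both functions sample with replacement, as the question states.

```python
import random
from collections import Counter

def stratified_sample(groups, n, rng):
    """Scheme (a): draw n * m_i / m objects (rounded) from each group."""
    m = sum(len(members) for members in groups.values())
    sample = []
    for label, members in groups.items():
        k = round(n * len(members) / m)
        sample.extend((label, rng.choice(members)) for _ in range(k))
    return sample

def simple_random_sample(groups, n, rng):
    """Scheme (b): draw n objects from the pooled data, ignoring groups."""
    pooled = [(label, obj) for label, members in groups.items() for obj in members]
    return [rng.choice(pooled) for _ in range(n)]

# Hypothetical data set: m = 1000 objects in K = 3 groups.
groups = {"A": list(range(900)), "B": list(range(90)), "C": list(range(10))}
rng = random.Random(0)

counts_a = Counter(label for label, _ in stratified_sample(groups, 100, rng))
counts_b = Counter(label for label, _ in simple_random_sample(groups, 100, rng))

print(counts_a)  # always 90, 9 and 1 objects from A, B and C
print(counts_b)  # varies with the seed; group C may not appear at all
```

Scheme (a) returns the same per-group counts on every run, while the counts from scheme (b) change from run to run and match the proportions only on average.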

Question 16:

Consider a document-term matrix, where tfij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

tf'ij = tfij * log (m / dfi)

where dfi is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

a) What is the effect of this transformation if a term occurs in one document? In every document?

(Reference: Kathy Macropol)

This is a very good answer to the question.


If a term appears in only one document, then dfi would be 1. This would then mean that we're taking the original tfij and multiplying it by the log of m.

If a term appears in every document, then we'll have a log (m / m), which is log 1, which is 0. This means the tf'ij will become 0.
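
Both cases can be checked numerically with a minimal Python sketch. The function name `idf_transform` and the sample values are illustrative, and the natural logarithm is an assumption, since the question does not state the base of the log.

```python
import math

def idf_transform(tf, m, df):
    """Apply tf' = tf * log(m / df) to a single term frequency.

    tf: frequency of the term in one document
    m:  total number of documents
    df: number of documents containing the term (1 <= df <= m)
    """
    return tf * math.log(m / df)

m = 100   # hypothetical collection size
tf = 3    # hypothetical raw term frequency

one_doc = idf_transform(tf, m, df=1)    # term appears in a single document
every_doc = idf_transform(tf, m, df=m)  # term appears in every document

print(one_doc, every_doc)
```

With a term in one document the frequency is scaled up by log m, and with a term in every document the result is exactly 0, whatever the original frequency was.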


b) What might be the purpose of this transformation?

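A variable transformation is a function that maps the entire set of values of a given attribute to a new set of replacement values, so that each old value can be identified with one of the new values.  The purpose of this particular transformation is to reweight the term frequencies.  A term that appears in every document receives a weight of 0, because it cannot help distinguish one document from another, while a term that appears in only a few documents receives a larger weight.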