Data Mining

**Assignment # 7**

**Association Analysis**

__Question 1:__

For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also describe whether such rules are subjectively interesting.

__Market basket domain:__

| TID | Items |
|---|---|
| 1 | {Bread, Milk} |
| 2 | {Bread, Diapers, Beer, Eggs} |
| 3 | {Milk, Diapers, Beer, Cola} |
| 4 | {Bread, Milk, Diapers, Beer} |
| 5 | {Bread, Milk, Diapers, Cola} |

a) A rule that has high support and high confidence.

Rule: {Beer} → {Diapers}

Support: 3/5 = 0.6

Confidence: 3/3 = 1

This rule is not subjectively interesting, because the amount of beer sold cannot be used to predict reliably how many packages of diapers are sold.

**b) A rule that has reasonably high support but low confidence.**

**Rule: None of the rules satisfy this condition.**

c) A rule that has low support and low confidence.

Rule: {Bread} → {Eggs}

Support: 1/5 = 0.2

Confidence: 1/4 = 0.25

This rule is subjectively interesting, although the statistics suggest the opposite. One would expect eggs to be bought together with bread, since the two are often eaten together at breakfast. The low support and confidence show that this is not the case here. Eggs likely have other uses, such as baking.

d) A rule that has low support and high confidence.

Rule: {Cola} → {Milk}

Support: 2/5 = 0.4

Confidence: 2/2 = 1

This rule is not subjectively interesting. The support is low: only two of the five transactions contain both cola and milk. Although the confidence is high, the combination most likely occurs by chance.
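The support and confidence values in parts (a), (c) and (d) can be checked with a short script. This is only a sketch: the function names `support` and `confidence` are chosen here for illustration, and the item name "Diapers" is used throughout, assuming the singular "Diaper" in the table refers to the same item.

```python
from fractions import Fraction

# Transactions copied from the market basket table in this question
# (assumption: "Diaper" and "Diapers" denote the same item).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return Fraction(hits, len(baskets))


def confidence(antecedent, consequent, baskets):
    """Support of (antecedent and consequent together) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rules = [({"Beer"}, {"Diapers"}), ({"Bread"}, {"Eggs"}), ({"Cola"}, {"Milk"})]
for lhs, rhs in rules:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} -> {sorted(rhs)}: support={float(s):.2f}, confidence={float(c):.2f}")
```

Exact fractions are used so that values such as 1/4 compare without floating-point rounding.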

__Question 2:__

Consider the data set shown in Table 6.22.

a) Compute the support for itemsets {e}, {b,d} and {b,d,e} by treating each transaction ID as a market basket.

Answer:

Support for {e} = 8/10 = 0.8

Support for {b,d} = 2/10 = 0.2

Support for {b,d,e} = 2/10 = 0.2

b) Use the results in part (a) to compute the confidence for association rules {b,d} → {e} and {e} → {b,d}. Is confidence a symmetric measure?

Answer:

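Confidence for {b,d} → {e} = support({b,d,e}) / support({b,d}) = 0.2/0.2 = 1

Confidence for {e} → {b,d} = support({b,d,e}) / support({e}) = 0.2/0.8 = 0.25

Confidence is not a symmetric measure: the two directions of the rule give different values.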

c) Repeat part (a) by treating each customer ID as a market basket. Each item should be treated as a binary variable (1 if item appears in at least one transaction bought by the customer, and 0 otherwise).

Answer:

The market baskets can be represented in a binary format as shown below:

| Customer ID | a | b | c | d | e |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 0 |
| 5 | 1 | 1 | 0 | 1 | 1 |

Support for {e} = 4/5 = 0.8

Support for {b,d} = 5/5 = 1

Support for {b,d,e} = 4/5 = 0.8

d) Use the results in part (c) to compute the confidence for the association rules {b,d} → {e} and {e} → {b,d}.

Answer:

Confidence for {b,d} → {e} = support({b,d,e}) / support({b,d}) = 0.8/1 = 0.8

Confidence for {e} → {b,d} = support({b,d,e}) / support({e}) = 0.8/0.8 = 1

Confidence is not a symmetric measure.
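As a cross-check of parts (c) and (d), the binary customer table can be evaluated directly. This is a minimal sketch; the names `items`, `customers` and `support` are illustrative and not part of the assignment.

```python
# Binary customer-by-item matrix copied from the table in part (c)
# (1 = the customer bought the item in at least one transaction).
items = ["a", "b", "c", "d", "e"]
customers = {
    1: [1, 1, 1, 1, 1],
    2: [1, 1, 1, 1, 1],
    3: [0, 1, 1, 1, 1],
    4: [1, 1, 1, 1, 0],
    5: [1, 1, 0, 1, 1],
}


def support(itemset):
    """Fraction of customers whose row has a 1 for every item in `itemset`."""
    columns = [items.index(item) for item in itemset]
    hits = sum(1 for row in customers.values() if all(row[c] for c in columns))
    return hits / len(customers)


s_e = support(["e"])
s_bd = support(["b", "d"])
s_bde = support(["b", "d", "e"])

# Confidence of a rule X -> Y is support(X and Y together) / support(X).
conf_bd_e = s_bde / s_bd
conf_e_bd = s_bde / s_e

print(s_e, s_bd, s_bde, conf_bd_e, conf_e_bd)
```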

e) Suppose s1 and c1 are the support and confidence values of an association rule r when treating each transaction ID as a market basket. Also let s2 and c2 be the support and confidence values of r when treating each customer ID as a market basket. Discuss whether there are any relationships between s1 and s2 or c1 and c2.

Answer:

For both criteria (transaction ID and customer ID), the formulas for rule r are the same, but the resulting support and confidence values differ, and the results above show no fixed relationship between them. The support of {b,d} rises from 0.2 to 1 while the support of {e} stays at 0.8, and the confidence of {b,d} → {e} falls from 1 to 0.8 while the confidence of {e} → {b,d} rises from 0.25 to 1.

Which basket definition to use depends on the target of the analysis, so the target must be set first. For example, if the goal is to find the most popular item in a set of transactions, one analyzes the items per transaction. If the goal is to find the item each customer demands most, one analyzes the customers, for example in order to send them promotions. If customer A purchases a certain brand of diapers, that customer can be targeted with marketing for another brand of diapers. The same applies to, say, customer B and a certain brand of milk.
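The effect of changing the basket definition can be illustrated with a small sketch. The four transactions below are hypothetical and are not the contents of Table 6.22; they only show that merging a customer's transactions into one basket can change the support of an itemset considerably.

```python
from collections import defaultdict

# Hypothetical (customer_id, items) pairs, invented for illustration only.
transactions = [
    (1, {"a", "b"}),
    (1, {"d", "e"}),
    (2, {"b", "d", "e"}),
    (2, {"a"}),
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


# View 1: every transaction is its own basket.
per_transaction = [basket for _, basket in transactions]

# View 2: all transactions of one customer are merged into a single basket.
merged = defaultdict(set)
for customer_id, basket in transactions:
    merged[customer_id] |= basket
per_customer = list(merged.values())

itemset = {"b", "d", "e"}
s1 = support(itemset, per_transaction)
s2 = support(itemset, per_customer)
print(f"s1 (per transaction) = {s1}, s2 (per customer) = {s2}")
```

In this toy example only one of the four transactions contains {b,d,e}, but both merged customer baskets do, so the two views give very different support values for the same itemset.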

__Question 14:__

Answer the following questions using the data sets shown in
Figure 6.34. Note that each data set contains 1000 items and 10,000
transactions. Dark cells indicate the presence of items and white cells
indicate the absence of items. We will apply the *Apriori* algorithm to
extract frequent itemsets with *minsup* = 10% (i.e., itemsets must be
contained in at least 1000 transactions).

a) Which data set(s) will produce the most number of frequent itemsets?

__Answer:__ By definition, frequent itemsets must
satisfy the *minsup* threshold of at least 1000 transactions. Data set (a)
will produce the most frequent itemsets, because its items appear in at
least 2000 transactions.

b) Which data set(s) will produce the fewest number of frequent itemsets?

__Answer:__

Data set (d) produces no frequent itemsets, because the number of transactions containing its items is below the 10% threshold. Data set (f) only just meets the 10% threshold.

c) Which data set(s) will produce the longest frequent itemsets?

__Answer:__ Data set (e) produces the longest
frequent itemsets, because the transactions that meet or exceed the
threshold span almost all of the items.

d) Which data set(s) will produce frequent itemsets with highest maximum support?

__Answer:__ Data set (b) has the highest maximum support, because item #100
appears in almost all transactions.

e) Which data set(s) will produce frequent itemsets containing items with wide-varying support levels (i.e., items with mixed support, ranging from less than 20% to more than 70%).

__Answer:__

Data set (e) produces frequent itemsets with a wide range of support levels, ranging from less than 10% to about 47% (around item #500).
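Figure 6.34 is not reproduced here, so the following is only a rough illustration of how a level-wise Apriori search applies a minimum support count. It runs on the five market basket transactions from Question 1, with an assumed threshold of 3 transactions (in the Question 14 setting the threshold would be 1000). The function name `apriori` and its parameters are chosen for this sketch and do not refer to any particular library.

```python
from itertools import combinations

# The five transactions from Question 1 (item names normalized to "Diapers").
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def apriori(baskets, min_count):
    """Return {itemset: support count} for all itemsets found in at least `min_count` baskets."""
    all_items = sorted({item for basket in baskets for item in basket})
    candidates = [frozenset([item]) for item in all_items]
    frequent = {}
    k = 1
    while candidates:
        # Count how many baskets contain each candidate k-itemset.
        counts = {c: sum(1 for basket in baskets if c <= basket) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what makes the search level-wise: a candidate is counted only if every smaller subset of it was already found to be frequent.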