Assignment 4
Question 2:
Consider the training examples shown in table 4.7 for a binary classification problem.
a) Compute the Gini index for the overall collection of the training examples.
Answer:
Class 0 has 10 counts
Class 1 has 10 counts
Total counts = 20
Gini = 1 - (10/20)^2 - (10/20)^2 = 0.5
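This node-level computation can be sketched in a few lines of Python (the function name `gini` is my own, not part of the assignment):

```python
def gini(counts):
    """Gini index of a node, given the class counts at that node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Overall collection: 10 records of Class 0 and 10 of Class 1.
print(gini([10, 10]))  # 0.5
```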
b) Compute the Gini index for the Customer ID attribute.
Answer:
When splitting the data by Customer ID we get 20 child nodes, one per record. Each node is pure, so each has a Gini index of 0.
The weighted average of the Gini index over the nodes is
Gini = (1/20)*0 + (1/20)*0 + ... + (1/20)*0 = 0
c) Compute the Gini index for the Gender attribute.
Answer:
If attribute Gender is selected to split the data then:
· Gini index for Male: 1 - (6/9)^2 - (3/9)^2 = 0.4444
· Gini index for Female: 1 - (4/11)^2 - (7/11)^2 = 0.4628

           M    F
Class 0    6    4
Class 1    3    7

The weighted average of the Gini index = (9/20)*0.4444 + (11/20)*0.4628 = 0.4545
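The weighted average over the children of a split can be sketched as a small helper (both function names here are my own):

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Weighted average Gini index over the child nodes of a split.
    children: one [Class 0 count, Class 1 count] list per child."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Gender split with the counts above: M = [6, 3], F = [4, 7].
print(round(weighted_gini([[6, 3], [4, 7]]), 4))  # 0.4545
```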
d) Compute the Gini index for the Car Type attribute using multiway split.
Answer:
If attribute Car Type is selected to split the data then:
· Gini index for Family: 1 - (1/4)^2 - (3/4)^2 = 0.375
· Gini index for Sport: 1 - (8/8)^2 - (0/8)^2 = 0
· Gini index for Luxury: 1 - (1/8)^2 - (7/8)^2 = 0.21875

             Family   Sport   Luxury
Class 0         1       8        1
Class 1         3       0        7
Gini index    0.375     0     0.21875

The weighted average of the Gini index
= (4/20)*0.375 + (8/20)*0 + (8/20)*0.21875 = 0.1625
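The same computation works for a multiway split; each child node simply contributes in proportion to its size. A minimal sketch using the Car Type counts above:

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Car Type multiway split, [Class 0, Class 1] counts per child node:
# Family = [1, 3], Sport = [8, 0], Luxury = [1, 7].
children = [[1, 3], [8, 0], [1, 7]]
n = sum(sum(child) for child in children)
avg = sum(sum(child) / n * gini(child) for child in children)
print(round(avg, 4))  # 0.1625
```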
e) Compute the Gini index for the Shirt Size attribute using multiway split.
Answer:
If attribute Shirt Size is selected to split the data then:
· Gini index for Small: 1 - (3/5)^2 - (2/5)^2 = 0.48
· Gini index for Medium: 1 - (3/7)^2 - (4/7)^2 = 0.4898
· Gini index for Large: 1 - (2/4)^2 - (2/4)^2 = 0.5
· Gini index for Extra Large: 1 - (2/4)^2 - (2/4)^2 = 0.5

             Small   Medium   Large   Extra Large
Class 0        3       3        2         2
Class 1        2       4        2         2
Gini index   0.48    0.4898    0.5       0.5

The weighted average of the Gini index
= (5/20)*0.48 + (7/20)*0.4898 + (4/20)*0.5 + (4/20)*0.5 = 0.4914
f) Which attribute is better, Gender, Car Type, or Shirt Size?
Answer: The smaller the degree of impurity, the purer the resulting class distribution. Car Type has the lowest weighted Gini index (0.1625), so it is the best of the three attributes.
g) Explain why customer ID should not be used as the attribute test condition even though it has the lowest Gini.
Answer: Customer ID is unique for each record (a primary key). Splitting on it produces pure nodes, but the resulting tree cannot generalize: a new customer will have an ID never seen during training, so the test condition has no predictive value.
Consider the training examples shown in table 4.8 for a binary classification problem.
a) What is the entropy of this collection of training samples with respect to the positive class?
Answer:

         Counts
Class +    4
Class -    5
Total      9

Entropy = -(4/9)log_{2}(4/9) - (5/9)log_{2}(5/9) = 0.9911
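The entropy of a node can be sketched as follows (a minimal helper of my own; the `if c > 0` guard skips the 0*log(0) term, which is taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node, given its class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 4 positive and 5 negative examples.
print(round(entropy([4, 5]), 4))  # 0.9911
```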
b) What are the information gains of a1 and a2 relative to the training examples?
Answer:
For a1:
a1        T   F
Class +   3   1
Class -   1   4
Total     4   5

T entropy = -(3/4)log_{2}(3/4) - (1/4)log_{2}(1/4) = 0.8113
F entropy = -(1/5)log_{2}(1/5) - (4/5)log_{2}(4/5) = 0.7219
Information gain of a1 = 0.9911 - (4/9)*0.8113 - (5/9)*0.7219 = 0.2294
For a2:
a2        T   F
Class +   2   2
Class -   3   2
Total     5   4

T entropy = -(2/5)log_{2}(2/5) - (3/5)log_{2}(3/5) = 0.9710
F entropy = -(2/4)log_{2}(2/4) - (2/4)log_{2}(2/4) = 1
Information gain of a2 = 0.9911 - (5/9)*0.9710 - (4/9)*1 = 0.0072
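Both gains can be checked with a short sketch (function names are my own):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# a1: T node has [3+, 1-], F node has [1+, 4-].
print(round(info_gain([4, 5], [[3, 1], [1, 4]]), 4))  # 0.2294
# a2: T node has [2+, 3-], F node has [2+, 2-].
print(round(info_gain([4, 5], [[2, 3], [2, 2]]), 4))  # 0.0072
```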
c) For a3, which is a continuous attribute, compute the information gain for every possible split.
Answer:
Split Point   Entropy   Gain
2.0           0.8484    0.1427
3.5           0.9885    0.0026
4.5           0.9183    0.0728
5.5           0.9839    0.0072
6.5           0.9728    0.0183
7.5           0.8889    0.1022
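A sketch of the scan over candidate split points for a continuous attribute. The raw a3 column of table 4.8 is not reproduced here, so the (a3, class) pairs below are an assumption consistent with the split points in the table above:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Assumed (a3 value, class) pairs -- not taken from this document.
data = [(1.0, '+'), (3.0, '-'), (4.0, '+'), (5.0, '-'), (5.0, '-'),
        (6.0, '+'), (7.0, '+'), (7.0, '-'), (8.0, '-')]
parent = entropy([4, 5])  # entropy of the full collection

gains = {}
values = sorted({v for v, _ in data})
# Candidate split points: midpoints between consecutive distinct values.
for split in [(a + b) / 2 for a, b in zip(values, values[1:])]:
    sides = ([c for v, c in data if v <= split],
             [c for v, c in data if v > split])
    h = sum(len(s) / len(data) * entropy([s.count('+'), s.count('-')])
            for s in sides)
    gains[split] = parent - h
    print(f"split {split}: entropy {h:.4f}, gain {parent - h:.4f}")
```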
d) What is the best split (among a1, a2 and a3) according to the information gain?
Answer:
To determine how well a test condition performs, we compare the impurity of the parent node with the weighted impurity of the child nodes; the larger the difference, the better the test condition. Here a1 has the largest information gain (0.2294), so a1 is the best split.
e) What is the best split (between a1 and a2) according to the classification error rate?
Answer:
For a1:
a1        T   F
Class +   3   1
Class -   1   4
Total     4   5

T error = 1 - max[3/4, 1/4] = 0.25
F error = 1 - max[1/5, 4/5] = 0.2
Weighted average error = (4/9)*0.25 + (5/9)*0.2 = 0.2222
For a2:
a2        T   F
Class +   2   2
Class -   3   2
Total     5   4

T error = 1 - max[2/5, 3/5] = 0.4
F error = 1 - max[2/4, 2/4] = 0.5
Weighted average error = (5/9)*0.4 + (4/9)*0.5 = 0.4444
Therefore a1, with the lower weighted error (0.2222 vs 0.4444), gives the best split.
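The weighted classification error of a split can be sketched in one helper (the name `split_error` is my own). Each child node predicts its majority class, so its error contribution is the number of minority-class records it holds:

```python
def split_error(children):
    """Weighted classification error of a split; each child node
    predicts its majority class."""
    n = sum(sum(child) for child in children)
    return sum((sum(child) - max(child)) / n for child in children)

# a1: T = [3+, 1-], F = [1+, 4-];  a2: T = [2+, 3-], F = [2+, 2-].
print(round(split_error([[3, 1], [1, 4]]), 4))  # 0.2222
print(round(split_error([[2, 3], [2, 2]]), 4))  # 0.4444
```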
f) What is the best split (between a1 and a2) according to the Gini index?
Answer:
For a1:
a1        T   F
Class +   3   1
Class -   1   4
Total     4   5

T Gini = 1 - (3/4)^2 - (1/4)^2 = 0.375
F Gini = 1 - (1/5)^2 - (4/5)^2 = 0.32
Weighted average Gini = (4/9)*0.375 + (5/9)*0.32 = 0.3444
For a2:
a2        T   F
Class +   2   2
Class -   3   2
Total     5   4

T Gini = 1 - (2/5)^2 - (3/5)^2 = 0.48
F Gini = 1 - (2/4)^2 - (2/4)^2 = 0.5
Weighted average Gini = (5/9)*0.48 + (4/9)*0.5 = 0.4889
Therefore a1, with the lower weighted Gini index (0.3444 vs 0.4889), gives the best split.
Question 6:
a) Build a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?
Answer:
For X:

         X0    X1
Class 1   60    40
Class 2   60    40
Total    120    80

X0 error = 1 - max[60/120, 60/120] = 0.5
X1 error = 1 - max[40/80, 40/80] = 0.5
Weighted error for X = (120/200)*0.5 + (80/200)*0.5 = 0.5
For Y:

         Y0    Y1
Class 1   40    60
Class 2   60    40
Total    100   100

Y0 error = 1 - max[40/100, 60/100] = 0.4
Y1 error = 1 - max[60/100, 40/100] = 0.4
Weighted error for Y = (100/200)*0.4 + (100/200)*0.4 = 0.4
For Z:

         Z0    Z1
Class 1   30    70
Class 2   70    30
Total    100   100

Z0 error = 1 - max[30/100, 70/100] = 0.3
Z1 error = 1 - max[70/100, 30/100] = 0.3
Weighted error for Z = (100/200)*0.3 + (100/200)*0.3 = 0.3
The greedy algorithm selects the attribute with the lowest weighted error for the first split, in this case Z (error 0.3).
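The first-level comparison can be sketched as follows. The Y counts assume the Y1 node holds 60 Class 1 and 40 Class 2 records, so that Y, like X and Z, partitions all 200 records (100 per class):

```python
def split_error(children):
    """Weighted classification error of a split; each child node
    predicts its majority class."""
    n = sum(sum(child) for child in children)
    return sum((sum(child) - max(child)) / n for child in children)

# First-level candidates, as [Class1, Class2] counts per child node.
candidates = {
    'X': [[60, 60], [40, 40]],
    'Y': [[40, 60], [60, 40]],  # assumed Y1 counts, see lead-in
    'Z': [[30, 70], [70, 30]],
}
errors = {name: split_error(parts) for name, parts in candidates.items()}
print(errors)                       # X: 0.5, Y: 0.4, Z: 0.3
print(min(errors, key=errors.get))  # greedy picks Z
```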
After splitting on Z, the two child nodes have the following class distributions:

         Z0    Z1
Class 1   30    70
Class 2   70    30
Total    100   100

Child Z0 error = 1 - max[30/100, 70/100] = 0.3
Child Z1 error = 1 - max[70/100, 30/100] = 0.3
Weighted error = (100/200)*0.3 + (100/200)*0.3 = 0.3
Splitting each child further on X or Y does not reduce the number of misclassified records, so the overall error rate of the induced two-level tree is 0.3.
b) Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate for the induced tree?
Answer:
For X:

         X0    X1
Class 1   60    40
Class 2   60    40
Total    120    80

Each child of the X split contains an equal number of records from both classes, so each child has an error rate of 0.5 regardless of which class it predicts:
Weighted error rate = (120/200)*0.5 + (80/200)*0.5 = 0.5
c) Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic for splitting attribute selection.