 Assignment 4

Problem 4.8

Question 2:

Consider the training examples shown in Table 4.7 for a binary classification problem.

a)      Compute the Gini index for the overall collection of the training examples.

Class 0 has 10 counts

Class 1 has 10 counts

Total counts = 20

Gini = 1 – (10/20)^2 – (10/20)^2 = 0.5
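This calculation is easy to check with a small script; here is a minimal Python sketch (the helper name `gini` is ours, not from the text):

```python
def gini(counts):
    """Gini index of a node, given its per-class example counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Overall collection: 10 examples each of Class 0 and Class 1.
print(gini([10, 10]))  # → 0.5
```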

b)      Compute the Gini index for the Customer ID attribute.

Splitting the data by Customer ID produces 20 child nodes, one per record. Each node contains a single example, so every node has a Gini index of 0.

The weighted average of the Gini index over the nodes is

Gini = 1/20*0 + 1/20*0 +… = 0

c)      Compute the Gini index for the Gender attribute.

If attribute Gender is selected to split the data, the counts per node are:

          M    F
Class 0   6    4
Class 1   3    7
Total     9   11

·        Gini index for Male: 1 – (6/9)^2 – (3/9)^2 = 0.4444

·        Gini index for Female: 1 – (4/11)^2 – (7/11)^2 = 0.4628

The weighted average of the Gini index = (9/20)*0.4444 + (11/20)*0.4628 = 0.4545
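The weighted average over the children of a split can be verified with a short Python sketch (helper names are ours, not from the text):

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(partitions):
    """Weighted average Gini over the children of a split;
    each partition is the [Class 0, Class 1] count pair of one child."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Gender split: Male node (6, 3), Female node (4, 7).
print(round(split_gini([[6, 3], [4, 7]]), 4))  # → 0.4545
```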

d)      Compute the Gini index for the car type attribute using multiway split.

If attribute Car Type is selected to split the data then:

·        Gini index for Family is: 1 - (1/4 )^2 – (3/4)^2 = 0.375

·        Similarly Gini index for Sport is: 1 – (8/8)^2 – (0/8)^2 = 0

·        Gini index for Luxury is: 1- (1/8)^2 – (7/8)^2 = 0.21875

             Family   Sport   Luxury
Class 0         1       8        1
Class 1         3       0        7
Gini index    0.375     0     0.21875

The weighted average of the Gini index

= (4/20)*0.375 + (8/20)*0 + (8/20)*0.21875 = 0.1625

e)      Compute the Gini index for the Shirt Size attribute using multiway split.

If attribute Shirt Size is selected to split the data then:

·        Gini index for Small is: 1 - (3/5 )^2 – (2/5)^2 = 0.48

·        Similarly Gini index for Medium is: 1 – (3/7)^2 – (4/7)^2 = 0.489

·        Gini index for Large is: 1- (2/4)^2 – (2/4)^2 = 0.5

·        Gini index for Extra Large is: 1- (2/4)^2 – (2/4)^2 = 0.5

             Small   Medium   Large   X-Large
Class 0        3       3        2        2
Class 1        2       4        2        2
Gini index   0.48    0.489     0.5      0.5

The weighted average of the Gini index

= (5/20)*0.48 + (7/20)*0.489 + (4/20)*0.5 + (4/20)*0.5 = 0.491

f)        Which attribute is better, Gender, Car Type, or Shirt Size?

Answer: The smaller the impurity, the purer the class distribution in the child nodes. Car Type has the lowest weighted Gini index (0.1625), so it is the best attribute of the three.
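The comparison in parts (c) through (e) can be reproduced in one pass; a Python sketch over the count tables above (helper names ours):

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(partitions):
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# [Class 0, Class 1] counts per attribute value, from the tables above.
attributes = {
    "Gender":     [[6, 3], [4, 7]],
    "Car Type":   [[1, 3], [8, 0], [1, 7]],
    "Shirt Size": [[3, 2], [3, 4], [2, 2], [2, 2]],
}
for name, parts in attributes.items():
    print(name, round(split_gini(parts), 4))

best = min(attributes, key=lambda a: split_gini(attributes[a]))
print(best)  # → Car Type
```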

g)      Explain why customer ID should not be used as the attribute test condition even though it has the lowest Gini.

Answer: Customer ID is the "primary key" (a unique value) for each record, so a split on it simply memorizes the training data: every leaf holds one example. Its Gini index of 0 says nothing about new customers, and the resulting tree would not generalize, so it should not be used as an attribute test condition.

Question 3

Consider the training examples shown in Table 4.8 for a binary classification problem.

a)      What is the entropy of this collection of training samples with respect to the positive class?

          Counts
Class +      4
Class -      5
Total        9

Entropy = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
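The entropy computation, sketched in Python (the helper name `entropy` is ours):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node, given its per-class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 4 positive and 5 negative examples.
print(round(entropy([4, 5]), 4))  # → 0.9911
```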

b)      What are the information gains of a1 and a2 relative to the training examples?

For a1:

          a1
          T    F
Class +   3    1
Class -   1    4
Total     4    5

T entropy =  -(3/4)log2(3/4) - (1/4)log2(1/4)  = 0.8113

F entropy = -(1/5)log2(1/5) - (4/5)log2(4/5) = 0.7219

Information gain of a1 = 0.9911 – (4/9)*0.8113 – (5/9)*0.7219 = 0.2294

For a2:

          a2
          T    F
Class +   2    2
Class -   3    2
Total     5    4

T entropy =  -(2/5)log2(2/5) - (3/5)log2(3/5)  = 0.97

F entropy = -(2/4)log2(2/4) - (2/4)log2(2/4)  = 1

Information gain of a2 = 0.9911 – (5/9)*0.9710 – (4/9)*1 = 0.0072
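Both gains can be checked with a short script; a Python sketch using the count tables above (helper names ours):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, partitions):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

print(round(info_gain([4, 5], [[3, 1], [1, 4]]), 4))  # a1 → 0.2294
print(round(info_gain([4, 5], [[2, 3], [2, 2]]), 4))  # a2 → 0.0072
```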

c)      For a3, which is a continuous attribute, compute the information gain for every possible split.

Split point   Weighted entropy   Gain
    2.0            0.8484       0.1427
    3.5            0.9885       0.0026
    4.5            0.9183       0.0728
    5.5            0.9839       0.0072
    6.5            0.9728       0.0183
    7.5            0.8889       0.1022

d)      What is the best split (among a1, a2 and a3) according to the information gain?

To determine how well a test condition performs, compare the impurity of the parent node with the weighted impurity of the child nodes: the larger the difference (the gain), the better the test condition. Here a1 has the largest information gain, 0.2294, so it gives the best split.

e)      What is the best split (between a1 and a2) according to the classification error rate?

For a1:

          a1
          T    F
Class +   3    1
Class -   1    4
Total     4    5

T error = 1 – max[3/4, 1/4] = 0.25

F error = 1 – max[1/5, 4/5]   = 0.2

Weighted average error = (4/9)*0.25 + (5/9)*0.2 = 0.2222

For a2:

          a2
          T    F
Class +   2    2
Class -   3    2
Total     5    4

T error = 1 – max[2/5, 3/5] = 0.4

F error = 1 – max[2/4, 2/4]   = 0.5

Weighted average error = 5/9*0.4 + 4/9*0.5 = 0.4444

Therefore the best split is a1, which has the lower error rate (0.2222).
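The error-rate comparison, sketched in Python (helper names ours):

```python
def node_error(counts):
    """Classification error of a node: 1 minus the majority-class fraction."""
    return 1.0 - max(counts) / sum(counts)

def split_error(partitions):
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * node_error(p) for p in partitions)

print(round(split_error([[3, 1], [1, 4]]), 4))  # a1 → 0.2222
print(round(split_error([[2, 3], [2, 2]]), 4))  # a2 → 0.4444
```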

f)        What is the best split (between a1 and a2) according to the Gini index?

For a1:

          a1
          T    F
Class +   3    1
Class -   1    4
Total     4    5

T Gini = 1 – (3/4)^2 – (1/4)^2 = 0.375

F Gini = 1 – (1/5)^2 – (4/5)^2 = 0.32

Weighted average Gini = 4/9*0.375 + 5/9*0.32 = 0.344

For a2:

          a2
          T    F
Class +   2    2
Class -   3    2
Total     5    4

T Gini = 1 – (2/5)^2 – (3/5)^2 = 0.48

F Gini = 1 – (2/4)^2 – (2/4)^2 = 0.5

Weighted average Gini = (5/9)*0.48 + (4/9)*0.5 = 0.4889

Therefore the best split is a1, which has the lower Gini index (0.344).
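The Gini comparison between a1 and a2, sketched in Python (helper names ours):

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(partitions):
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

print(round(split_gini([[3, 1], [1, 4]]), 4))  # a1 → 0.3444
print(round(split_gini([[2, 3], [2, 2]]), 4))  # a2 → 0.4889
```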

Question 6:

a)      Build a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?

For X:

          X0    X1
Class 1   60    40
Class 2   60    40
Total    120    80

X0 error = 1 – max[60/120, 60/120] = 0.5

X1 error = 1 – max[40/80, 40/80]   = 0.5

Weighted error for X = 120/200*0.5 + 80/200*0.5 = 0.5

For Y:

          Y0    Y1
Class 1   40    60
Class 2   60    40
Total    100   100

(The counts for Y must also sum to 100 per class, consistent with the X and Z tables.)

Y0 error = 1 – max[40/100, 60/100] = 0.4

Y1 error = 1 – max[60/100, 40/100] = 0.4

Weighted error for Y = 100/200*0.4 + 100/200*0.4 = 0.4

For Z:

          Z0    Z1
Class 1   30    70
Class 2   70    30
Total    100   100

Z0 error = 1 – max[30/100, 70/100] = 0.3

Z1 error = 1 – max[70/100, 30/100]   = 0.3

Weighted error for Z = 100/200*0.3 + 100/200*0.3 = 0.3

The greedy algorithm selects the attribute with the lowest weighted error for the first split; in this case Z, with an error of 0.3.
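The first-level comparison can be scripted; a Python sketch assuming the per-attribute marginal counts above, with each class totaling 100 over the Y split (Y0: 40/60, Y1: 60/40), consistent with the X and Z tables:

```python
def node_error(counts):
    """Classification error of a node: 1 minus the majority-class fraction."""
    return 1.0 - max(counts) / sum(counts)

def split_error(partitions):
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * node_error(p) for p in partitions)

# [Class 1, Class 2] counts for value 0 and value 1 of each attribute.
splits = {
    "X": [[60, 60], [40, 40]],
    "Y": [[40, 60], [60, 40]],
    "Z": [[30, 70], [70, 30]],
}
for name, parts in splits.items():
    print(name, round(split_error(parts), 2))  # X 0.5, Y 0.4, Z 0.3
```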

At the second level, each child of the Z split is split on the best remaining attribute (X or Y). Neither attribute changes the majority class of any resulting node, so neither child's error can be reduced below its leaf error:

Child Z0 error = 1 – max[30/100, 70/100] = 0.3

Child Z1 error = 1 – max[70/100, 30/100] = 0.3

Overall error rate of the induced tree = 100/200*0.3 + 100/200*0.3 = 0.3

b)      Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate for the induced tree?