
Assignment 4

Problem 4.8

Question 2:

Consider the training examples shown in Table 4.7 for a binary classification problem.

a)      Compute the Gini index for the overall collection of the training examples.

Answer:

Class 0 has 10 counts

Class 1 has 10 counts

Total counts = 20

Gini = 1 – (10/20)^2 – (10/20)^2 = 0.5
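
As a quick check, here is a minimal Python sketch of the node Gini calculation (Python is used only for illustration; the counts are the 10/10 split above):

    def gini(counts):
        """Gini index of a node from its class counts."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    print(gini([10, 10]))  # 0.5 -- the overall collection, 10 records per class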

b)      Compute the Gini index for the Customer ID attribute.

Answer:

Splitting the data by Customer ID produces 20 child nodes, one for each record, and every node is pure, so each has a Gini index of 0.

The weighted average of the Gini index over these nodes is

Gini = 1/20*0 + 1/20*0 +… = 0

c)      Compute the Gini index for the Gender attribute.

Answer:

If attribute Gender is selected to split the data then:

·        Gini index for Male is: 1 – (6/9)^2 – (3/9)^2 = 0.4444

·        Similarly, the Gini index for Female is: 1 – (4/11)^2 – (7/11)^2 = 0.4628

 

            M     F
Class 0     6     4
Class 1     3     7

The weighted average of the Gini index = (9/20)*0.4444 + (11/20)*0.4628 = 0.4545
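
The same idea extends to a split: weight each child node's Gini index by its fraction of the records. A minimal Python sketch follows, using the Male/Female counts from the table above; the same helper also checks the multiway splits in parts (d) and (e):

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def split_gini(children):
        """Weighted average Gini over the child nodes of a split."""
        total = sum(sum(child) for child in children)
        return sum(sum(child) / total * gini(child) for child in children)

    # Gender split: Male = [6, 3] and Female = [4, 7] (counts of Class 0, Class 1)
    print(split_gini([[6, 3], [4, 7]]))            # ~0.4545
    # The Car Type split in part (d) can be checked the same way:
    # split_gini([[1, 3], [8, 0], [1, 7]])  ->  0.1625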

d)      Compute the Gini index for the Car Type attribute using a multiway split.

Answer:

If attribute Car Type is selected to split the data then:

·        Gini index for Family is: 1 - (1/4 )^2 – (3/4)^2 = 0.375

·        Similarly Gini index for Sport is: 1 – (8/8)^2 – (0/8)^2 = 0

·        Gini index for Luxury is: 1- (1/8)^2 – (7/8)^2 = 0.21875

 

              Family   Sport    Luxury
Class 0            1       8         1
Class 1            3       0         7
Gini Index     0.375       0   0.21875

The weighted average of the Gini index

= (4/20)*0.375 + (8/20)*0 + (8/20)*0.21875 = 0.1625

e)      Compute the Gini index for the Shirt Size attribute using a multiway split.

If attribute Shirt Size is selected to split the data then:

·        Gini index for Small is: 1 - (3/5 )^2 – (2/5)^2 = 0.48

·        Similarly Gini index for Medium is: 1 – (3/7)^2 – (4/7)^2 = 0.489

·        Gini index for Large is: 1- (2/4)^2 – (2/4)^2 = 0.5

·        Gini index for Extra Large is: 1- (2/4)^2 – (2/4)^2 = 0.5

 

 

             Small   Medium   Large   X-Large
Class 0          3        3       2         2
Class 1          2        4       2         2
Gini Index    0.48    0.489     0.5       0.5

The weighted average of the Gini index

= (5/20)*0.48 + (7/20)*0.489 + (4/20)*0.5 + (4/20)*0.5 = 0.491

f)        Which attribute is better, Gender, Car Type, or Shirt Size?

Answer: The smaller the degree of impurity, the purer the class distribution in the child nodes. Car Type has the lowest weighted Gini index (0.1625), so it is the best of the three attributes.

g)      Explain why customer ID should not be used as the attribute test condition even though it has the lowest Gini.

Answer: Customer ID is unique for every record (a "primary key"), so splitting on it produces one pure node per training record, which is why its Gini index is 0. However, the attribute carries no predictive information for records outside the training set, so using it as the attribute test condition would amount to overfitting.

 

Question 3

Consider the training examples shown in Table 4.8 for a binary classification problem.

a)      What is the entropy of this collection of training samples with respect to the positive class?

Answer:

 

           Counts
Class +         4
Class -         5
Total           9

Entropy =  -(4/9)log2(4/9) - (5/9)log2(5/9)  = 0.9911
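
A minimal Python sketch of the entropy calculation, using the 4/5 class counts above (Python is used only for illustration):

    import math

    def entropy(counts):
        """Entropy (in bits) of a node from its class counts."""
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    print(entropy([4, 5]))  # ~0.9911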

 

b)      What are the information gains of a1 and a2 relative to the training examples?

Answer:

For a1:

a1          T    F
Class +     3    1
Class -     1    4
Total       4    5

T entropy =  -(3/4)log2(3/4) - (1/4)log2(1/4)  = 0.8113

F entropy = -(1/5)log2(1/5) - (4/5)log2(4/5)  = 0.7219

Information gain of a1 = 0.9911 – (4/9)*0.8113 – (5/9)*0.7219 = 0.2294

For a2:

a2          T    F
Class +     2    2
Class -     3    2
Total       5    4

T entropy =  -(2/5)log2(2/5) - (3/5)log2(3/5)  = 0.9710

F entropy = -(2/4)log2(2/4) - (2/4)log2(2/4)  = 1

Information gain of a2 = 0.9911 – (5/9)*0.9710 – (4/9)*1 = 0.0072
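
A short Python sketch of the information-gain calculation from the contingency counts above; each child count list is [+, -] for the T and F branches:

    import math

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    def info_gain(parent, children):
        """Parent entropy minus the weighted entropy of the child nodes."""
        total = sum(sum(child) for child in children)
        weighted = sum(sum(child) / total * entropy(child) for child in children)
        return entropy(parent) - weighted

    print(info_gain([4, 5], [[3, 1], [1, 4]]))  # a1: ~0.2294
    print(info_gain([4, 5], [[2, 3], [2, 2]]))  # a2: ~0.0072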

c)      For a3, which is a continuous attribute, compute the information gain for every possible split.

Answer:

 

Split Point   Entropy   Gain
2.0           0.8484    0.1427
3.5           0.9885    0.0026
4.5           0.9183    0.0728
5.5           0.9839    0.0072
6.5           0.9728    0.0183
7.5           0.8889    0.1022
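
The table above can be reproduced by scanning the midpoint between each pair of consecutive distinct a3 values. The Python sketch below is generic: the actual a3 values and class labels come from Table 4.8 and are not repeated here, so the call at the bottom is indicated only as a pattern.

    import math

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    def split_gains(values, labels, classes=("+", "-")):
        """Information gain at the midpoint between consecutive distinct values."""
        data = list(zip(values, labels))
        parent = [sum(1 for _, y in data if y == c) for c in classes]
        n = sum(parent)
        gains = {}
        distinct = sorted(set(values))
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2.0
            left = [sum(1 for v, y in data if v <= t and y == c) for c in classes]
            right = [p - q for p, q in zip(parent, left)]
            child = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
            gains[t] = entropy(parent) - child
        return gains

    # Usage, with a3_values and class_labels read from Table 4.8:
    # print(split_gains(a3_values, class_labels))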

 

d)      What is the best split (among a1, a2 and a3) according to the information gain?

Answer:

To determine how well a test condition performs, compare the impurity of the parent node with the weighted impurity of the child nodes; the larger the difference (the information gain), the better the test condition. Here a1 has the largest information gain (0.2294, compared with 0.0072 for a2 and at most 0.1427 for a3), so a1 gives the best split.

e)      What is the best split (between a1 and a2) according to the classification error rate?

Answer:

For a1:

a1          T    F
Class +     3    1
Class -     1    4
Total       4    5

T error = 1 – max[3/4, 1/4] = 0.25

F error = 1 – max[1/5, 4/5]   = 0.2

Weighted average error = (4/9)*0.25 + (5/9)*0.2 = 0.2222

For a2:

a2          T    F
Class +     2    2
Class -     3    2
Total       5    4

T error = 1 – max[2/5, 3/5] = 0.4

F error = 1 – max[2/4, 2/4]   = 0.5

Weighted average error = 5/9*0.4 + 4/9*0.5 = 0.4444

Therefore the best split is a1, which has the lower weighted error rate (0.2222).
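
A short Python sketch of the weighted classification error rate, using the same contingency counts as above:

    def error_rate(counts):
        """Classification error of a node: 1 minus the majority-class fraction."""
        return 1.0 - max(counts) / sum(counts)

    def split_error(children):
        """Weighted average error over the child nodes of a split."""
        total = sum(sum(child) for child in children)
        return sum(sum(child) / total * error_rate(child) for child in children)

    print(split_error([[3, 1], [1, 4]]))  # a1: ~0.2222
    print(split_error([[2, 3], [2, 2]]))  # a2: ~0.4444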

f)        What is the best split (between a1 and a2) according to the Gini index?

Answer:

For a1:

a1          T    F
Class +     3    1
Class -     1    4
Total       4    5

T Gini = 1 – (3/4)^2 – (1/4)^2 = 0.375

F Gini = 1 – (1/5)^2 – (4/5)^2 = 0.32

Weighted average Gini = 4/9*0.375 + 5/9*0.32 = 0.344

For a2:

a2          T    F
Class +     2    2
Class -     3    2
Total       5    4

T Gini = 1 – (2/5)^2 – (3/5)^2 = 0.48

F Gini = 1 – (2/4)^2 – (2/4)^2 = 0.5

Weighted average Gini = 5/9*0.48 + 4/9*0.5 = 0.4889

Therefore the best split is a1, which has the lower weighted Gini index (0.344).

Question 6:

a)      Build a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?

Answer:

For X:

 

            X0     X1
Class 1     60     40
Class 2     60     40
Total      120     80

X0 error = 1 – max[60/120, 60/120] = 0.5

X1 error = 1 – max[40/80, 40/80]   = 0.5

Weighted error for X = 120/200*0.5 + 80/200*0.5 = 0.5

For Y:

 

            Y0     Y1
Class 1     40     60
Class 2     60     40
Total      100    100

Y0 error = 1 – max[40/100, 60/100] = 0.4

Y1 error = 1 – max[60/100, 40/100] = 0.4

Weighted error for Y = 100/200*0.4 + 100/200*0.4 = 0.4

For Z:

 

            Z0     Z1
Class 1     30     70
Class 2     70     30
Total      100    100

Z0 error = 1 – max[30/100, 70/100] = 0.3

Z1 error = 1 – max[70/100, 30/100]   = 0.3

Weighted error for Z = 100/200*0.3 + 100/200*0.3 = 0.3

The greedy algorithm selects the attribute with the lowest weighted error rate for the first split; in this case that is Z (error 0.3).
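
The same error-rate sketch from Question 3(e) can be applied to the three candidate first splits; each child count list is [Class 1, Class 2] taken from the tables above:

    def error_rate(counts):
        return 1.0 - max(counts) / sum(counts)

    def split_error(children):
        total = sum(sum(child) for child in children)
        return sum(sum(child) / total * error_rate(child) for child in children)

    print(split_error([[60, 60], [40, 40]]))  # X: 0.5
    print(split_error([[40, 60], [60, 40]]))  # Y: 0.4
    print(split_error([[30, 70], [70, 30]]))  # Z: 0.3 -- lowest, so Z is chosen first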

After the first split on Z, each child node must in turn be split on the better of the remaining attributes X and Y. The Z0 child holds 30 Class 1 and 70 Class 2 records, and the Z1 child holds 70 Class 1 and 30 Class 2 records, so each child already misclassifies 30 records. Using the joint class counts from the training table, splitting either child on X or on Y still leaves 30 misclassified records in that subtree, so the overall error rate of the induced two-level tree is (30 + 30)/200 = 0.3.

 

b)      Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate for the induced tree?

Answer:

For X:

 

            X0     X1
Class 1     60     40
Class 2     60     40
Total      120     80

Since each child of X contains an equal number of Class 1 and Class 2 records, the error rate after this first split is:

Weighted error rate = 120/200*0.5 + 80/200*0.5 = 0.5

To complete the tree, each child of X must then be split on the better of the remaining attributes Y and Z using the joint class counts from the training table. At this point I am not sure about the logic in the textbook.

c)      Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.