__Problem 2.6:__

__Question 18:__

This exercise compares and contrasts some similarity and distance measures.

a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

X = 0101010001

Y = 0100011000

Answer:

__The Hamming distance__ is the number of bits that differ between two objects that have only binary attributes.

The vectors X = 0101010001 and Y = 0100011000 differ in 3 positions (the 4th, 7th, and 10th bits), so the Hamming distance is 3.

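__The Jaccard similarity__ is measured as:

$J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$

where $f_{11}$ is the number of positions where both vectors are 1, $f_{10}$ is the number where X is 1 and Y is 0, and $f_{01}$ is the number where X is 0 and Y is 1.

Here $f_{11} = 2$, $f_{10} = 2$, and $f_{01} = 1$. Therefore J = 2 / (1 + 2 + 2) = 2/5 = 0.4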
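The two results can be checked with a short script. This is a minimal sketch, assuming the vectors are given as bit strings; the helper names `binary_counts`, `hamming`, and `jaccard` are illustrative and do not come from the exercise.

```python
def binary_counts(x, y):
    """Return (f11, f10, f01, f00) for two equal-length bit strings."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(x, y))
    f11 = pairs.count(("1", "1"))  # both bits are 1
    f10 = pairs.count(("1", "0"))  # x is 1, y is 0
    f01 = pairs.count(("0", "1"))  # x is 0, y is 1
    f00 = pairs.count(("0", "0"))  # both bits are 0
    return f11, f10, f01, f00


def hamming(x, y):
    """Number of positions where the two bit strings differ."""
    _, f10, f01, _ = binary_counts(x, y)
    return f10 + f01


def jaccard(x, y):
    """Jaccard similarity; 0-0 matches are ignored.

    Raises ZeroDivisionError if both vectors are all zeros.
    """
    f11, f10, f01, _ = binary_counts(x, y)
    return f11 / (f01 + f10 + f11)


X = "0101010001"
Y = "0100011000"
print(hamming(X, Y))  # 3
print(jaccard(X, Y))  # 0.4
```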

b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance, while other three measures are similarities, but don’t let this confuse you.)

Answer:

The Simple Matching Coefficient (SMC) is defined as:

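SMC = (number of matching attribute values) / (number of attributes)

The SMC measures the similarity of objects, while the Hamming distance measures their differences. For binary vectors with n attributes the two are directly related: SMC = 1 - (Hamming distance) / n. Both use the same counts and simply work in opposite directions, so the Hamming distance is the approach more similar to the SMC.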

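The cosine measure is defined as:

$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$

This is more similar to the Jaccard measure. Both are similarity measures that ignore 0-0 matches: a position where both vectors are 0 contributes nothing to the dot product or the norms in the cosine, and nothing to any of the counts in the Jaccard coefficient. Unlike Jaccard, the cosine measure can also handle non-binary vectors.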
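The pairing of SMC with Hamming and of cosine with Jaccard can be illustrated numerically. The sketch below assumes the vectors from part a) stored as lists of 0/1 integers, and pads both with ten extra 0-0 attributes (the padding is an invented illustration, not part of the exercise) to show which measures react to 0-0 matches.

```python
import math


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def smc(x, y):
    # Matching attribute values divided by the number of attributes.
    return sum(a == b for a, b in zip(x, y)) / len(x)


def jaccard(x, y):
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    not_both_zero = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return f11 / not_both_zero


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

# SMC is the complement of the normalized Hamming distance.
print(smc(x, y), 1 - hamming(x, y) / len(x))

# Hypothetical padding: ten attributes that are 0 in both vectors.
x_pad = x + [0] * 10
y_pad = y + [0] * 10

print(smc(x, y), smc(x_pad, y_pad))          # SMC changes
print(jaccard(x, y), jaccard(x_pad, y_pad))  # Jaccard does not
print(cosine(x, y), cosine(x_pad, y_pad))    # cosine does not
```

Adding 0-0 attributes raises the SMC (and lowers the normalized Hamming distance), while the Jaccard and cosine values stay the same.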

c) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

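Answer:

I think that the more appropriate measure is the Jaccard coefficient. The question is how many genes the two organisms share, so the comparison should be driven by the genes that are present in both (the 1-1 matches). Two different species will both lack a very large number of genes, and a Hamming-based comparison gives those 0-0 agreements the same weight as shared genes, while the Jaccard coefficient ignores them.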

d) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Answer:

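In this case I would use the Hamming distance. Because two human beings share more than 99.9% of the same genes, the useful information lies in the few genes where they differ, and the Hamming distance counts exactly those differences. The Jaccard coefficient would be very close to 1 for any pair of people and would hardly distinguish them.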

__Question 19:__

For the following vectors, x and y, calculate the indicated
similarity or distance measures.

**a) x = (1,1,1,1), y = (2,2,2,2) cosine, correlation, Euclidean**

__Cosine:__

x.y = 1*2+1*2+1*2+1*2 = 8

||x|| = √(1*1+1*1+1*1+1*1) = 2

||y|| = √(2*2+2*2+2*2+2*2) = 4

cos(x,y) = x.y / (||x|| ||y||) = 8 / (2 * 4) = **1**

__Correlation:__

Both x and y are constant vectors, so each has a standard deviation of 0 and the covariance is also 0. The correlation is therefore 0/0, which is undefined.

__Euclidean:__

| Point | X | Y |
| --- | --- | --- |
| P1 | 1 | 2 |
| P2 | 1 | 2 |
| P3 | 1 | 2 |
| P4 | 1 | 2 |

$\text{dist} = \sqrt{\sum_{k=1}^{4} (x_k - y_k)^2}$, where $x_k$ and $y_k$ are the k-th components of x and y.

dist = √(4 * (-1)^2) = 2.
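The cosine and Euclidean results for part a) can be reproduced with a few lines of code. This is a minimal sketch with illustrative helper names; it also shows why the two measures disagree here: y is x scaled by 2, which the cosine ignores and the Euclidean distance does not.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def cosine(x, y):
    return dot(x, y) / (norm(x) * norm(y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = (1, 1, 1, 1)
y = (2, 2, 2, 2)

print(dot(x, y), norm(x), norm(y))  # 8 2.0 4.0
print(cosine(x, y))                 # 1.0 (same direction)
print(euclidean(x, y))              # 2.0 (different magnitude)
```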

**b) x = (0,1,0,1), y = (1,0,1,0) cosine, correlation, Euclidean, Jaccard**

__Cosine:__

x.y = 0*1+1*0+0*1+1*0 = 0 ==> cosine = 0.

__Correlation:__

Mean(x) = Mean(y) = 1/2

Covariance(x,y) = 1 / (4-1) * 4 * ((-1/2) * (1/2)) = -1/3

Standard_deviation(x) = √(1/3) ≈ 0.58

Standard_deviation(y) = √(1/3) ≈ 0.58

Corr(x,y) = (-1/3) / (√(1/3) * √(1/3)) = (-1/3) / (1/3) = -1
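A correlation must lie between -1 and 1, which is a useful sanity check on hand calculations. The sketch below assumes the sample (n - 1) form of covariance and standard deviation and returns `None` when a standard deviation is zero; the function name `sample_correlation` is illustrative.

```python
import math


def sample_correlation(x, y):
    """Pearson correlation using (n - 1) normalization.

    Returns None when either vector is constant, because the
    correlation is then 0/0 and undefined.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if std_x == 0 or std_y == 0:
        return None
    return cov / (std_x * std_y)


# Part b): perfectly anti-correlated vectors, approximately -1.0
print(sample_correlation((0, 1, 0, 1), (1, 0, 1, 0)))

# Part a): constant vectors, correlation undefined
print(sample_correlation((1, 1, 1, 1), (2, 2, 2, 2)))  # None
```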

__Euclidean:__

dist = √4 = 2.

__Jaccard:__

J = 0 / (2 + 2 + 0) = 0.

**c) x = (0,-1,0,1), y = (1,0,-1,0) cosine, correlation, Euclidean**

__Cosine:__

x.y = 0*1+(-1)*0+0*(-1)+1*0 = 0 ==> cosine = 0.

__Correlation:__

Covariance = 0 ==> correlation = 0

__Euclidean:__

dist = √4 = 2

**d) x = (1,1,0,1,0,1), y = (1,1,1,0,0,1) cosine, correlation, Jaccard**

__Cosine:__

x.y = 1*1+1*1+0*1+1*0+0*0+1*1 = 3

||x|| = 2

||y|| = 2

cos(x,y) = 3 / 4 = 0.75.

__Correlation:__

Mean(x) = Mean(y) = 2/3

Covariance = (1/3) / (6-1) = 1/15 ≈ 0.067

Standard_deviation(x) = √(4/15) ≈ 0.52

Standard_deviation(y) = √(4/15) ≈ 0.52

Corr(x,y) = (1/15) / (4/15) = 0.25.
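For reference, the correlation in part d) can be worked out term by term. The derivation below assumes the sample (n - 1) form of covariance and standard deviation; the factor 1/(n - 1) cancels in the final ratio, so the population form gives the same correlation.

```latex
\begin{aligned}
\bar{x} &= \bar{y} = \tfrac{4}{6} = \tfrac{2}{3} \\
x - \bar{x} &= \left(\tfrac13,\ \tfrac13,\ -\tfrac23,\ \tfrac13,\ -\tfrac23,\ \tfrac13\right) \\
y - \bar{y} &= \left(\tfrac13,\ \tfrac13,\ \tfrac13,\ -\tfrac23,\ -\tfrac23,\ \tfrac13\right) \\
\sum_{k=1}^{6} (x_k - \bar{x})(y_k - \bar{y}) &= \frac{1 + 1 - 2 - 2 + 4 + 1}{9} = \frac{1}{3} \\
\sum_{k=1}^{6} (x_k - \bar{x})^2 &= \sum_{k=1}^{6} (y_k - \bar{y})^2 = \frac{1 + 1 + 4 + 1 + 4 + 1}{9} = \frac{4}{3} \\
\operatorname{corr}(x, y) &= \frac{\tfrac{1}{3} / 5}{\sqrt{\tfrac{4}{3} / 5}\ \sqrt{\tfrac{4}{3} / 5}} = \frac{1/3}{4/3} = \frac{1}{4}
\end{aligned}
```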

__Jaccard:__

J = 3 / (1 + 1 + 3) = 0.6

**e) x = (2,-1,0,2,0,-3), y = (-1,1,-1,0,0,-1) cosine, correlation**

__Cosine:__

x.y = 2*(-1)+(-1)*1+0*(-1)+2*0+0*0+(-3)*(-1) = 0 ==> cosine = 0.

__Correlation:__

Covariance = 0 ==> correlation = 0

__Problem 3.6:__

__Question 1:__

Obtain one of the data sets available at the UCI Machine Learning Repository
and apply as many of the different visualization techniques described in the
chapter as possible. The bibliographic notes and book Web site provide pointers
to visualization software.

I chose the Database for Fitting Contact Lenses Data.

This is a histogram showing the different ages for the fitting of different types of contact lenses.

This is a pie chart showing the same information.