Weighted Jaccard similarity

Introduction

One of the most popular similarity index is Jaccard similarity index (or coefficient) that measures similarity between finite sample set by intersection and union of samples.

\[J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A|+|B|-|A \cap B|}\]

The result is \(0 \leq J(A,B) \leq 1\).

Jaccard distance is simple \(D(A,B) = 1 - J(A,B)\).

There are several implementation of Jaccard similarity/distance calculation in R (clusteval, proxy, prabclus, vegdist, ade4 etc.). And Jaccard similarity can built up with basic function just see this forum. But these works for binary datasets only.

If one need to measure weighted Jaccard coefficient than there is some difficulty to find implementations. Weighted Jaccard definition is simple

\[J(x,y) = \frac{\sum\nolimits_{i}min(x_i,y_i)}{\sum\nolimits_{i}max(x_i,y_i)}\]

Fortunately write some code is simple as well.

Making Jaccard similarity matrix

The dataset is a matrix.

A <- matrix(c(2,5,2,1,0,0,0,0,1,0,0,0,0,1,3,5,6,0,0,1,0,0,0,2,0,0,1,2,7,2,4,6,2,5,1,0,0,1,0,0,0,1,0,0,3,5,4,0,0,1,0,0,1,0,0,2,0,3,5,7,3,1,4,0,1,0,0,0,0,2,0,0,0,1,3,4,6,0,0,1), byrow=T, nrow=8, ncol=10)
colnames(A) <- letters[1:10]
rownames(A) <- LETTERS[1:8]
print(A)
##   a b c d e f g h i j
## A 2 5 2 1 0 0 0 0 1 0
## B 0 0 0 1 3 5 6 0 0 1
## C 0 0 0 2 0 0 1 2 7 2
## D 4 6 2 5 1 0 0 1 0 0
## E 0 1 0 0 3 5 4 0 0 1
## F 0 0 1 0 0 2 0 3 5 7
## G 3 1 4 0 1 0 0 0 0 2
## H 0 0 0 1 3 4 6 0 0 1

Jaccard similarity of rows of matrix can be seen bellow. The code chunk makes similarity matrix for rows.

#weighted jaccard similarity matrix setup
sim.jac <- matrix(0, nrow=nrow(A), ncol=nrow(A))
rownames(sim.jac) <- rownames(A)
colnames(sim.jac) <- rownames(A)

#weighted jaccard
pairs <- t(combn(1:nrow(A), 2))
for (i in 1:nrow(pairs)){
  num <- sum(sapply(1:ncol(A), function(x)(min(A[pairs[i,1],x],A[pairs[i,2],x]))))
  den <- sum(sapply(1:ncol(A), function(x)(max(A[pairs[i,1],x],A[pairs[i,2],x]))))
  sim.jac[pairs[i,1],pairs[i,2]] <- num/den
  sim.jac[pairs[i,2],pairs[i,1]] <- num/den  
}
sim.jac[which(is.na(sim.jac))] <- 0
diag(sim.jac) <- 1

The last two commands are corrections that mean:

  • if \(max(x_i,y_i) = 0\) for every \(i\) than the denominator is 0 which has no means in this case (definition of Jaccard suggest 1 in this case but in real cases I think better to keep 0)
  • every elements similar to oneself

Application of Jaccard similarity matrix

Hierarchical clustering

dist.jac <- as.dist(1-sim.jac)

hc <- hclust(dist.jac, method = "ward.D2")

cut <- as.data.frame(cutree(hc, k=3))
cut$names <- rownames(cut)
names(cut) <- c("cut", "names")
library(ggplot2)
library(ggdendro)
library(dplyr)

hcdata <- dendro_data(hc, type="triangle")

hcdata$labels <- left_join(hcdata$labels, cut, by=c("label"="names"))
ggplot(hcdata$segments) + 
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
  geom_text(data = hcdata$labels, aes(x, y, label = label, colour=factor(cut)), 
            hjust = 1, size = 4) +
  scale_color_manual(values=c("red", "blue", "green"), guide_legend(title="clusters")) +
  labs(x="", y="") + coord_flip() + theme_bw()

Clusters with MDS

Multi-dimensional scaling is an often used dimension reduction process that suitable to visualize samples in two dimension and to recognize clusters.

mds <- as.data.frame(cmdscale(dist.jac))
mds$names <- rownames(mds)
mds$cut <- cutree(hc, k=3)
ggplot(mds, aes(V1, V2, label=names)) + 
  geom_point(aes(colour=factor(cut)), size=2) +
  geom_text(aes(colour=factor(cut)), check_overlap = FALSE, size=2.3, 
            hjust = "center", vjust = "bottom", nudge_x = 0.005, nudge_y = 0.02) + 
  scale_color_manual(values=c("red", "blue", "green"), guide_legend(title="clusters")) +
  xlab("") + ylab("") +  theme_bw()

As figure shows clusters with weighted Jaccard similarity/dissimilarity are well match with clusters that shows MDS.


Be happyR! :)

See also:

Jaccard definition in Wikipedia

No comments:

Post a Comment