Multidimensional scaling (MDS)

Introduction

Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. It refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. An MDS algorithm aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible. Each object is then assigned coordinates in each of the N dimensions. The number of dimensions of an MDS plot N can exceed 2 and is specified a priori. Choosing N=2 optimizes the object locations for a two-dimensional scatterplot. Types and details about MDS on Wikipedia.

Suppose we have a matrix below and we want to see clusters of rows in a 2D scatterplot. MDS scatterplot feasible to clustering and to check clusters as well.

A <- matrix(c(2,5,2,1,0,0,0,0,1,0,0,0,0,1,3,5,6,0,0,1,0,0,0,2,0,0,1,2,7,2,4,6,2,5,1,0,0,1,0,0,0,1,0,0,3,5,4,0,0,1,0,0,1,0,0,2,0,3,5,7,3,1,4,0,1,0,0,0,0,2,0,0,0,1,3,4,6,0,0,1), byrow=T, nrow=8, ncol=10)
colnames(A) <- letters[1:10]
rownames(A) <- LETTERS[1:8]
print(A)
##   a b c d e f g h i j
## A 2 5 2 1 0 0 0 0 1 0
## B 0 0 0 1 3 5 6 0 0 1
## C 0 0 0 2 0 0 1 2 7 2
## D 4 6 2 5 1 0 0 1 0 0
## E 0 1 0 0 3 5 4 0 0 1
## F 0 0 1 0 0 2 0 3 5 7
## G 3 1 4 0 1 0 0 0 0 2
## H 0 0 0 1 3 4 6 0 0 1

Distance matrix

Every types of MDS require a distance matrix as main argument to work with. Fortunately R provides a lot of possibility to create distance matrix and there are a lot of type of distances so lets play a bit with them. In this post jaccard, cosine, euclidean distance matrices will be used

#weighted jaccard similarity matrix setup
sim.jac <- matrix(0, nrow=nrow(A), ncol=nrow(A))
rownames(sim.jac) <- rownames(A)
colnames(sim.jac) <- rownames(A)

#weighted jaccard function
pairs <- t(combn(1:nrow(A), 2))
for (i in 1:nrow(pairs)){
  num <- sum(sapply(1:ncol(A), function(x)(min(A[pairs[i,1],x],A[pairs[i,2],x]))))
  den <- sum(sapply(1:ncol(A), function(x)(max(A[pairs[i,1],x],A[pairs[i,2],x]))))
  sim.jac[pairs[i,1],pairs[i,2]] <- num/den
  sim.jac[pairs[i,2],pairs[i,1]] <- num/den  
}
sim.jac[which(is.na(sim.jac))] <- 0
diag(sim.jac) <- 1

#weighted jaccard distance
dist.jac <- 1-sim.jac
#weighted cosine distance
library(SnowballC); library(lsa)
dist.cos <- cosine(t(A))
diag(dist.cos) <- 0
#weighted euclidean distance
dist.eu <- as.matrix(dist(A, method = "euclidean"))

MDS functions

There are a really good collection of metric MDS functions on r-blogger by Gaston Sanchez.

Lets try some of them.

Classical MDS with cmdscale() in {stat} package

mds.jac <- as.data.frame(cmdscale(dist.jac))
mds.jac$names <- rownames(mds.jac)

mds.cos <- as.data.frame(cmdscale(dist.cos))
mds.cos$names <- rownames(mds.cos)

mds.eu <- as.data.frame(cmdscale(dist.eu))
mds.eu$names <- rownames(mds.eu)
library(ggplot2)
library(gridExtra)

plot1 <- ggplot(mds.jac, aes(V1, V2, label=names)) + 
  geom_point(colour="blue", size=2) +
  geom_text(colour="blue", check_overlap = TRUE, size=2.5, 
            hjust = "center", vjust = "bottom", nudge_x = 0, nudge_y = 0.025) + 
  labs(x="", y="", title="MDS by Jaccard") + theme_bw()

plot2 <- ggplot(mds.cos, aes(V1, V2, label=names)) + 
  geom_point(colour="red", size=2) +
  geom_text(colour="red", check_overlap = TRUE, size=2.5, 
            hjust = "left", vjust = "bottom", nudge_x = 0.02, nudge_y = 0) + 
  labs(x="", y="", title="MDS by cosine") + theme_bw()
  
plot3 <- ggplot(mds.eu, aes(V1, V2, label=names)) + 
  geom_point(colour="green", size=2) +
  geom_text(colour="green", check_overlap = TRUE, size=2.5, 
            hjust = "center", vjust = "bottom", nudge_x = 0, nudge_y = 0.4) + 
  labs(x="", y="", title="MDS by Euclidean") + theme_bw()

grid.arrange(plot1, plot2, plot3, ncol=3)

Not so miraculously the MDS by Euclidean distance similar to the first two eigen vector of PCA of original matrix, just multiplied by (-1).

prcomp(A)$x[,1:2]
##         PC1        PC2
## A -3.872709 -2.7896109
## B  5.863396 -0.9031562
## C -2.397281  4.9975096
## D -4.977864 -5.0032304
## E  4.491605 -0.9683761
## F -1.827234  6.7701307
## G -2.643388 -1.1950046
## H  5.363475 -0.9082621
mds.eu
##          V1         V2 names
## A  3.872709  2.7896109     A
## B -5.863396  0.9031562     B
## C  2.397281 -4.9975096     C
## D  4.977864  5.0032304     D
## E -4.491605  0.9683761     E
## F  1.827234 -6.7701307     F
## G  2.643388  1.1950046     G
## H -5.363475  0.9082621     H

Non metric MDS with isoMDS() in {MASS} package

Nonmetric MDS is performed by the isoMDS() in {MASS} package.

library(MASS)
iso.mds.jac <- isoMDS(dist.jac, k=2)
## initial  value 4.547008 
## iter   5 value 0.723930
## iter  10 value 0.468269
## iter  15 value 0.234417
## iter  20 value 0.073340
## iter  25 value 0.019527
## iter  30 value 0.010233
## iter  30 value 0.008087
## iter  30 value 0.005111
## final  value 0.005111 
## converged
longData <- iso.mds.jac$points
longData <- as.data.frame(longData)
longData$names <- rownames(iso.mds.jac$points)

plot1 <- ggplot(longData, aes(V1, V2, label=names)) + 
  geom_point(colour="red", size=2) +
  geom_text(colour="red", check_overlap = TRUE, size=2.5, 
            hjust = "left", vjust = "bottom", nudge_x = 0.02, nudge_y = 0) + 
  labs(x="", y="", title="MDS by isoMDS()") + theme_bw()

plot2 <- ggplot(mds.jac, aes(V1, V2, label=names)) + 
  geom_point(colour="blue", size=2) +
  geom_text(colour="blue", check_overlap = TRUE, size=2.5, 
            hjust = "center", vjust = "bottom", nudge_x = 0, nudge_y = 0.01) + 
  labs(x="", y="", title="MDS by cmdscale()") + theme_bw()

grid.arrange(plot1, plot2, ncol=2)

It seems that tere are just 3 data points on left figure but there are all. If you see data points you can recognize overlapping points: A and D,G; B and E,H; C and F that are very close with the metric type cmdscale() function as well.

iso.mds.jac$points
##         [,1]        [,2]
## A  0.4313749 -0.32012400
## B -0.5286107 -0.09526603
## C  0.1460758  0.62326014
## D  0.4311805 -0.32029813
## E -0.5284727 -0.09541529
## F  0.1459975  0.62339164
## G  0.4310571 -0.32016603
## H -0.5286023 -0.09538230

Individual difference scaling

With indscal() function in {SensoMineR} package

It is a special case of MDS family which require Napping data type, i.e products are positioned on a tableclothe by panelist. Their coordinates are used as input.

See details on rdocumentation.org

With smacof algorithm (MDS by majorization)

smacofRect() function not require distance matrix it can work with any data frame. Matrix or dataframe of preferences, ratings, dissimilarities. There is a good example on r-blogger.

smacofSym() function is the simplest funtion in the package, and mds() function is the same. In this post I use tis one.

There is theoritical background of smacof on JSS.

Here we compere smacof algorithm with jaccard distance matrix.

library(smacof)
mds.smacof <- smacofSym(dist.jac)
plotdata <- as.data.frame(mds.smacof$conf)
plotdata$names <- rownames(mds.smacof$conf)

plot1 <- ggplot(plotdata, aes(D1, D2, label=names)) + 
  geom_point(colour="red", size=2) +
  geom_text(colour="red", check_overlap = TRUE, size=2.5, 
            hjust = "left", vjust = "bottom", nudge_x = 0.02, nudge_y = 0) + 
  labs(x="", y="", title="MDS by smacofSym()") + theme_bw()

plot2 <- ggplot(mds.jac, aes(V1, V2, label=names)) + 
  geom_point(colour="blue", size=2) +
  geom_text(colour="blue", check_overlap = TRUE, size=2.5, 
            hjust = "center", vjust = "top", nudge_x = 0, nudge_y = -0.01) + 
  labs(x="", y="", title="MDS by cmdscale()") + theme_bw()

grid.arrange(plot1, plot2, ncol=2)


Be happyR! :)

No comments:

Post a Comment