MDS in practice
Introduction
This post follows up on MDS in theory. Here we continue with an example: the Groceries dataset from the {arules} package.
Unfortunately, Blogger cannot properly render the full HTML file produced by RStudio, so some figures are missing from the post and only the R code is shown.
Load Groceries dataset
The Groceries dataset represents products that were bought together by buyers. There are 169 products and 9835 purchase transactions in the dataset. In this chapter we want to cluster the products using Jaccard similarity and hierarchical clustering, and then investigate the clustering results with MDS.
library(arules)
data(Groceries)
Groceries <- as(Groceries, "matrix")*1
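To make the Jaccard dissimilarity used below more concrete, here is a minimal sketch on a made-up basket matrix (not the Groceries data). The `baskets` matrix and the `jaccard_dissim()` helper are hypothetical names introduced only for illustration; the sketch relies on base R's `dist(method = "binary")`, which computes the same quantity.

```r
# Toy transaction matrix: rows are baskets, columns are products (made-up data)
baskets <- matrix(c(1, 1, 0,
                    1, 1, 0,
                    1, 0, 1,
                    0, 0, 1),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("milk", "bread", "beer")))

# Jaccard dissimilarity of two binary vectors:
# 1 - (baskets containing both) / (baskets containing at least one)
jaccard_dissim <- function(a, b) {
  both   <- sum(a == 1 & b == 1)
  either <- sum(a == 1 | b == 1)
  1 - both / either
}

# milk and bread share 2 of the 3 baskets in which either appears
jaccard_dissim(baskets[, "milk"], baskets[, "bread"])

# dist() compares rows, so transpose to compare products instead of baskets
toy_jac <- dist(t(baskets), method = "binary")
toy_jac
```

Products that never appear in the same basket (bread and beer here) get the maximum dissimilarity of 1.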
Hierarchical clustering on products
library(ggdendro)
library(dplyr)
library(ggplot2)
gr.jac <- dissimilarity(t(Groceries), method = "jaccard") # Jaccard dissimilarity between columns (products); works only with binary data
hc <- hclust(gr.jac, method="ward.D2")
cut <- as.data.frame(cutree(hc, k=7))
names(cut) <- "cut"
cut$names <- rownames(cut)
hcdata <- dendro_data(hc, type="triangle")
hcdata$labels <- left_join(hcdata$labels, cut, by=c("label"="names"))
ggplot(hcdata$segments) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend))+
geom_text(data = hcdata$labels, aes(x, y, label = label, colour=factor(cut)),
hjust = 1, size = 1.8) + scale_colour_discrete(name = "clusters") +
labs(x="", y="") + coord_flip() + ylim(-0.5, 2) + xlim(0,170) + theme_bw()
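The `hclust()` and `cutree()` steps can be easier to follow on a tiny example. The sketch below uses made-up one-dimensional data with two obvious groups; `x`, `hc_toy` and `groups` are hypothetical names, not part of the Groceries analysis.

```r
# Made-up 1-D data: three values near 1 and two values near 10
x <- c(a = 1, b = 1.2, c = 0.9, d = 10, e = 10.3)

# Same linkage method as used for the products
hc_toy <- hclust(dist(x), method = "ward.D2")

# Cut the tree into k = 2 groups; the result is a named integer vector
groups <- cutree(hc_toy, k = 2)
groups

# Same reshaping as in the post: one row per item with its cluster id
data.frame(names = names(groups), cut = groups)
```

The cluster ids returned by `cutree()` are just labels; only the grouping itself carries meaning.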
MDS
MDS with cmdscale()
mds.cmdscale <- as.data.frame(cmdscale(as.matrix(gr.jac)))
mds.cmdscale$names <- rownames(mds.cmdscale)
mds.cmdscale$cut <- cut$cut
ggplot(mds.cmdscale, aes(V1, V2, label=names)) +
geom_point(aes(colour=factor(cut)), size=2.3) +
geom_text(aes(colour=factor(cut)), check_overlap = TRUE, size=2.2,
hjust = "center", vjust = "bottom", nudge_x = 0, nudge_y = 0.005) +
scale_colour_discrete(name = "clusters") +
labs(x="", y="", title="MDS by Jaccard and cmdscale()") + theme_bw()
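`cmdscale()` can also report how well a low-dimensional configuration reproduces the input distances. A minimal sketch on made-up points (not the Groceries data), using the `eig = TRUE` argument and the `GOF` element of the returned list; `pts`, `fit` and `recovered` are hypothetical names.

```r
# Made-up data: the four corners of a 3 x 4 rectangle
pts <- matrix(c(0, 0,
                3, 0,
                3, 4,
                0, 4), ncol = 2, byrow = TRUE)
d <- dist(pts)

# eig = TRUE returns a list with the coordinates, eigenvalues and goodness of fit
fit <- cmdscale(d, k = 2, eig = TRUE)

# Goodness of fit: close to 1 here, because the points really are 2-dimensional
fit$GOF

# Distances between the recovered coordinates match the input distances
recovered <- dist(fit$points)
max(abs(as.vector(recovered) - as.vector(d)))
```

For a dissimilarity such as Jaccard, which is not Euclidean in general, the goodness of fit in two dimensions will typically be lower than in this idealised case.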
MDS with smacofSym()
library(smacof)
mds.smacof <- smacofSym(as.matrix(gr.jac))
plotdata <- as.data.frame(mds.smacof$conf)
plotdata$names <- rownames(mds.smacof$conf)
plotdata$cut <- cut$cut
ggplot(plotdata, aes(D1, D2, label=names)) +
geom_point(aes(colour=factor(cut)), size=2.3) +
geom_text(aes(colour=factor(cut)), check_overlap = TRUE, size=2.2,
hjust = "center", vjust = "bottom", nudge_x = 0, nudge_y = 0.015) +
scale_colour_discrete(name = "clusters") +
labs(x="", y="", title="MDS by Jaccard and smacofSym()") + theme_bw()
MDS with isoMDS()
library(MASS)
mds.isomds <- isoMDS(as.matrix(gr.jac), k=2)
## initial value 54.878156
## iter 5 value 39.817833
## iter 10 value 35.378864
## iter 15 value 34.412360
## final value 34.325742
## converged
plotdata <- mds.isomds$points
plotdata <- as.data.frame(plotdata)
plotdata$names <- rownames(mds.isomds$points)
plotdata$cut <- cut$cut
ggplot(plotdata, aes(V1, V2, label=names)) +
geom_point(aes(colour=factor(cut)), size=2.3) +
geom_text(aes(colour=factor(cut)), check_overlap = TRUE, size=2.2,
hjust = "center", vjust = "bottom", nudge_x = 0, nudge_y = 0.05) +
scale_colour_discrete(name = "clusters") +
labs(x="", y="", title="MDS by Jaccard and isoMDS()") + theme_bw()
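The "final value" printed by isoMDS() above is the stress of the final configuration, which MASS reports in percent (lower is better). A small sketch on the built-in `eurodist` distances, rather than the Groceries data, shows where this value is stored in the returned object:

```r
library(MASS)

# eurodist: road distances between 21 European cities, shipped with R
fit <- isoMDS(eurodist, k = 2, trace = FALSE)

# Stress of the final configuration, in percent
fit$stress

# One row of 2-D coordinates per city
dim(fit$points)
```

Reading `mds.isomds$stress` in the same way gives the final value from the trace above without having to copy it from the console.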
Some points on the plot are not clearly separated, but overall this method yields reasonable clusters. Of course, the clustering could be improved further. Some possible points of improvement:
- similarity/dissimilarity method (here was Jaccard)
- clustering method (here hierarchical clustering with the ward.D2 method)
- MDS (here PCA-like classical MDS with the cmdscale() function, then the smacofSym() function, and then the isoMDS() function)
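As one possible way to explore the second point, different linkage methods can be compared by their cophenetic correlation, that is, how well the tree heights reproduce the original dissimilarities. This criterion is an assumption of this sketch, not something used in the analysis above, and the binary `toy` matrix is randomly generated, made-up data.

```r
set.seed(1)

# Made-up binary data: 20 items described by 10 binary features
toy <- matrix(rbinom(200, 1, 0.3), nrow = 20)

# Jaccard dissimilarity between rows (base R calls it "binary")
d <- dist(toy, method = "binary")

# Correlation between the original dissimilarities and the tree-implied ones
methods <- c("ward.D2", "average", "complete")
coph <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  cor(as.vector(d), as.vector(cophenetic(hc)))
})

# Higher values mean the dendrogram preserves the dissimilarities better
round(coph, 3)
```

The same loop could be run on `gr.jac` to check whether another linkage method suits the Groceries products better.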
Be happyR! :)