I know dendrograms are quite popular. However if there are quite large number of observations and classes it hard to follow. However sometime I feel that there should be better way to present the same thing. I got an idea but do not know how to implement it.
Consider the following dendrogram.
> data(mtcars)
> plot(hclust(dist(mtcars)))
Can plot it like a scatter plot. In which the distance between two points is plotted with line, while sperate clusters (assumed threshold) are colored and circle size is determined by value of some variable.
You are describing a fairly typical way of going about cluster analysis:
Use a clustering algorithm (in this case hierarchical clustering)
Decide on the number of clusters
Project the data in a two-dimensional plane using some form or principal component analysis
The code:
hc <- hclust(dist(mtcars))
cluster <- cutree(hc, k=3)
xy <- data.frame(cmdscale(dist(mtcars)), factor(cluster))
names(xy) <- c("x", "y", "cluster")
xy$model <- rownames(xy)
library(ggplot2)
ggplot(xy, aes(x, y)) + geom_point(aes(colour=cluster), size=3)
What happens next is that you get a skilled statistician to help explain what the x and y axes mean. This usually involves projecting the data to the axes and extracting the factor loadings.
The plot:
Related
I'm new to R, I searched but I find outdate info only.
I've done a simple single linkage clustering process.
d<-dist(scale(DATA),method="euclidean",diag=TRUE,upper=TRUE)
hls<-hclust(d,method="complete")
How can I plot a scatterplot which uses a color each cluster?
Exactly like this example
I created some sample data to work with. If your data looks different, please provide some sample data as part of your question.
To create a scatter plot colored by group, first create your groups using the cutree function. You can specify an integer value to indicate how may groups you want to create.
Next use your favorite graphing package (e.g. ggplot) to create the scatter plot.
# Sample data
rData <- data.frame(x=c(1,1,3,4), y=c(1,2,5,4))
print(rData)
# Cluster
d <- dist(scale(rData), method="euclidean", diag=TRUE, upper=TRUE)
hls <- hclust(d, method="complete")
# Create groups
cluster <- cutree(hls, 2)
# Create scatter plot
ggData <- cbind(rData, cluster)
ggData$cluster <- as.factor(ggData$cluster)
print(ggData)
ggplot(ggData, aes(x=x, y=y, color=cluster)) + geom_point(size=5)
I would recommend exploring http://www.cookbook-r.com/Graphs/ to learn more about ggplot.
I have 10 matrices where each one represent the eigenvalues of correlation matrices of wavelet coefficients of a number of time series.
I would like to generate a heatmap of my data to see if there are any overlapping significant events across the different scales.
So far I have hacked together the following image using the graphics package as described here by #Josliber.
I am still getting to grips with R so excuse the nasty code but it works for me for now. I have yet to mess around with the labels and formatting but it's a quick and dirty representation.
#plotting the eigenvalues for each of the wavelet coefficient scales
w1mat <- matrix(w1eigen[1,])
w2mat <- matrix(w2eigen[1,])
w3mat <- matrix(w3eigen[1,])
w4mat <- matrix(w4eigen[1,])
w5mat <- matrix(w5eigen[1,])
w6mat <- matrix(w6eigen[1,])
w7mat <- matrix(w7eigen[1,])
w8mat <- matrix(w8eigen[1,])
w9mat <- matrix(w9eigen[1,])
w10mat <- matrix(w10eigen[1,])
#plots the eigenvalues for each of the scales
par(mfrow=c(10,1))
imageW1 <- image(w1mat)
imageW2 <- image(w2mat)
imageW3 <- image(w3mat)
imageW4 <- image(w4mat)
imageW5 <- image(w5mat)
imageW6 <- image(w6mat)
imageW7 <- image(w7mat)
imageW8 <- image(w8mat)
imageW9 <- image(w9mat)
imageW10 <- image(w10mat)
As you can see I used the image function in the graphics package to create this and I am sure I can achieve the same in ggplot with greater control.
Ultimately I wish to create a plot similar to the one below where there is a common Y axis and they're sitting right on top of one another with no white space.
What I would like to know is whether using the stacked image function in graphics is the best approach or is there a better way to visualise the data in an alternative package?
To create a parallel coordinate plot I wanted to use ggparcoord() function in package GGally. The following codes show a reproducible example.
set.seed(3674)
k <- rep(1:3, each=30)
x <- k + rnorm(mean=10, sd=.2,n=90)
y <- -2*k + rnorm(mean=10, sd=.4,n=90)
z <- 3*k + rnorm(mean=10, sd=.6,n=90)
dat <- data.frame(group=factor(k),x,y,z)
library(GGally)
ggparcoord(dat,columns=1:4,groupColumn = 1)
Notice in the picture that the color for group was continuous even though I have the group variable as a factor. Is there any way I can display the plot with three discrete color instead?
I have looked at some other posts where they discuss various other ways of doing parallel coordinate plots in here. But I really wanted to do this in ggparcoord() function of package GGally. I appreciate your time in thinking about this problem.
Your code was almost correct. I spotted that columns=1:4 was not right in this case. You need to drop the column for groupColumn in columns
ggparcoord(dat,columns=2:4,groupColumn = 1)
I'm trying to plot some data with 2d density contours using ggplot2 in R.
I'm getting one slightly odd result.
First I set up my ggplot object:
p <- ggplot(data, aes(x=Distance,y=Rate, colour = Company))
I then plot this with geom_points and geom_density2d. I want geom_density2d to be weighted based on the organisation's size (OrgSize variable). However when I add OrgSize as a weighting variable nothing changes in the plot:
This:
p+geom_point()+geom_density2d()
Gives an identical plot to this:
p+geom_point()+geom_density2d(aes(weight = OrgSize))
However, if I do the same with a loess line using geom_smooth, the weighting does make a clear difference.
This:
p+geom_point()+geom_smooth()
Gives a different plot to this:
p+geom_point()+geom_smooth(aes(weight=OrgSize))
I was wondering if I'm using density2d inappropriately, should I instead be using contour and supplying OrgSize as the 'height'? If so then why does geom_density2d accept a weighting factor?
Code below:
require(ggplot2)
Company <- c("One","One","One","One","One","Two","Two","Two","Two","Two")
Store <- c(1,2,3,4,5,6,7,8,9,10)
Distance <- c(1.5,1.6,1.8,5.8,4.2,4.3,6.5,4.9,7.4,7.2)
Rate <- c(0.1,0.3,0.2,0.4,0.4,0.5,0.6,0.7,0.8,0.9)
OrgSize <- c(500,1000,200,300,1500,800,50,1000,75,800)
data <- data.frame(Company,Store,Distance,Rate,OrgSize)
p <- ggplot(data, aes(x=Distance,y=Rate))
# Difference is apparent between these two
p+geom_point()+geom_smooth()
p+geom_point()+geom_smooth(aes(weight = OrgSize))
# Difference is not apparent between these two
p+geom_point()+geom_density2d()
p+geom_point()+geom_density2d(aes(weight = OrgSize))
geom_density2d is "accepting" the weight parameter, but then not passing to MASS::kde2d, since that function has no weights. As a consequence, you will need to use a different 2d-density method.
(I realize my answer is not addressing why the help page says that geom_density2d "understands" the weight argument, but when I have tried to calculate weighted 2D-KDEs, I have needed to use other packages besides MASS. Maybe this is a TODO that #hadley put in the help page that then got overlooked?)
I have come across a number of situations where I want to plot more points than I really ought to be -- the main holdup is that when I share my plots with people or embed them in papers, they occupy too much space. It's very straightforward to randomly sample rows in a dataframe.
if I want a truly random sample for a point plot, it's easy to say:
ggplot(x,y,data=myDf[sample(1:nrow(myDf),1000),])
However, I was wondering if there were more effective (ideally canned) ways to specify the number of plot points such that your actual data is accurately reflected in the plot. So here is an example.
Suppose I am plotting something like the CCDF of a heavy tailed distribution, e.g.
ccdf <- function(myList,density=FALSE)
{
# generates the CCDF of a list or vector
freqs = table(myList)
X = rev(as.numeric(names(freqs)))
Y =cumsum(rev(as.list(freqs)));
data.frame(x=X,count=Y)
}
qplot(x,count,data=ccdf(rlnorm(10000,3,2.4)),log='xy')
This will produce a plot where the x & y axis become increasingly dense. Here it would be ideal to have fewer samples plotted for large x or y values.
Does anybody have any tips or suggestions for dealing with similar issues?
Thanks,
-e
I tend to use png files rather than vector based graphics such as pdf or eps for this situation. The files are much smaller, although you lose resolution.
If it's a more conventional scatterplot, then using semi-transparent colours also helps, as well as solving the over-plotting problem. For example,
x <- rnorm(10000); y <- rnorm(10000)
qplot(x, y, colour=I(alpha("blue",1/25)))
Beyond Rob's suggestions, one plot function I like as it does the 'thinning' for you is hexbin; an example is at the R Graph Gallery.
Here is one possible solution for downsampling plot with respect to the x-axis, if it is log transformed. It log transforms the x-axis, rounds that quantity, and picks the median x value in that bin:
downsampled_qplot <- function(x,y,data,rounding=0, ...) {
# assumes we are doing log=xy or log=x
group = factor(round(log(data$x),rounding))
d <- do.call(rbind, by(data, group,
function(X) X[order(X$x)[floor(length(X)/2)],]))
qplot(x,count,data=d, ...)
}
Using the definition of ccdf() from above, we can then compare the original plot of the CCDF of the distribution with the downsampled version:
myccdf=ccdf(rlnorm(10000,3,2.4))
qplot(x,count,data=myccdf,log='xy',main='original')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=1,main='rounding = 1')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=0,main='rounding = 0')
In PDF format, the original plot takes up 640K, and the downsampled versions occupy 20K and 8K, respectively.
I'd either make image files (png or jpeg devices) as Rob already mentioned, or I'd make a 2D histogram. An alternative to the 2D histogram is a smoothed scatterplot, it makes a similar graphic but has a more smooth cutoff from dense to sparse regions of space.
If you've never seen addictedtor before, it's worth a look. It has some very nice graphics generated in R with images and sample code.
Here's the sample code from the addictedtor site:
2-d histogram:
require(gplots)
# example data, bivariate normal, no correlation
x <- rnorm(2000, sd=4)
y <- rnorm(2000, sd=1)
# separate scales for each axis, this looks circular
hist2d(x,y, nbins=50, col = c("white",heat.colors(16)))
rug(x,side=1)
rug(y,side=2)
box()
smoothscatter:
library("geneplotter") ## from BioConductor
require("RColorBrewer") ## from CRAN
x1 <- matrix(rnorm(1e4), ncol=2)
x2 <- matrix(rnorm(1e4, mean=3, sd=1.5), ncol=2)
x <- rbind(x1,x2)
layout(matrix(1:4, ncol=2, byrow=TRUE))
op <- par(mar=rep(2,4))
smoothScatter(x, nrpoints=0)
smoothScatter(x)
smoothScatter(x, nrpoints=Inf,
colramp=colorRampPalette(brewer.pal(9,"YlOrRd")),
bandwidth=40)
colors <- densCols(x)
plot(x, col=colors, pch=20)
par(op)