I've been working on a scatter plot for some data. I used to be able to get the scatter plot function to work, but now I can't, and I don't understand what my error is. My data has 5 variables and a column that assigns each row to a cluster (I used k-means in this particular case).
closedmi uncertin certknow sourknow justknow fit3.cluster
1 3.166667 6.125 2.571429 4.500 3.375 1
2 3.666667 4.250 3.428571 4.000 4.750 2
3 1.833333 5.750 1.428571 3.375 2.125 2
4 3.500000 4.500 1.857143 4.250 3.125 3
I'm trying to plot my data in 3 dimensions using the first three principal components and see the clusters. Here is my code to find the principal components and then attach the cluster column to them in a new data frame.
#Find the 5 principal components of the data matrix
pcdf <- princomp(pre2, cor=TRUE, scores=TRUE)
pre4 <- data.frame(pcdf$scores, cluster=fit3$cluster)
#Making a 3D plot of the Solution
scatter3d(pre4$Comp.1, pre4$Comp.2, pre4$Comp.3, groups=pre4$cluster,
surface=FALSE, grid=FALSE, ellipsoid=TRUE)
I then try to use scatter3d to plot the individuals, using the cluster column as a grouping factor, and I end up with an error. I've been using this source for the code to get the right syntax, but I still end up with the error.
Error in scatter3d.default(pre4$Comp.1, pre4$Comp.2, pre4$Comp.3, groups = pre4$cluster: groups variable must be a factor
but it is. It's in the data frame, I can call the column using pre4$cluster. Is there some formatting or syntax error I can't see? Am I just going mad?
I was able to get this to work just last week and now I'm not able to. I know I can use plot3d to get the visualization, but I like the visualization better using scatter3d and would like to be able to use it.
Try this:
scatter3d(pre4$Comp.1, pre4$Comp.2, pre4$Comp.3, groups=as.factor(pre4$cluster),
surface=FALSE, grid=FALSE, ellipsoid=TRUE)
That will solve the error message regarding factors. Beyond that, just make sure that your leading minor is positive definite.
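A quick way to confirm the diagnosis is to look at the class of the cluster column before plotting. The sketch below uses the built-in USArrests data as a stand-in for pre2 (an assumption, since the original data is not shown) and sticks to base R, so it stops short of the scatter3d call itself.

```r
# Stand-in data: USArrests replaces the unseen pre2 (assumption)
set.seed(42)
pre2 <- USArrests
fit3 <- kmeans(scale(pre2), centers = 3)
pcdf <- princomp(pre2, cor = TRUE, scores = TRUE)
pre4 <- data.frame(pcdf$scores, cluster = fit3$cluster)

class(pre4$cluster)       # kmeans() stores the cluster labels as integers
is.factor(pre4$cluster)   # FALSE, which is what scatter3d complains about

# Convert once in the data frame so every later call sees a factor
pre4$cluster <- as.factor(pre4$cluster)
is.factor(pre4$cluster)
nlevels(pre4$cluster)
```

Converting the column in the data frame, rather than inside each call, means groups=pre4$cluster can stay as it was written in the question.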
I would like to plot in R the following contour plots representing two-dimensional cumulative distribution functions (CDFs).
A CDF in 2 or more dimensions is not unique (Lopes et al., The two-dimensional Kolmogorov-Smirnov test), which is why there are 4 alternative plots (and probably some more).
So far I have no R/Matlab code to show. I don't think it's difficult, but it is most likely very time consuming. There might be something out there I could use.
EDIT
Type 1 & 4 are more or less covered, but any help with 2 & 3 would be really appreciated.
EDIT2
Types 2 & 3 using geom_rect - as simple as it gets!
The sequence of rectangles is ordered with respect to the Euclidean distance of the data. This means that, if we assume this generic ordering, there is possibly only one version of defining a 2D CDF instead of two. That would confirm the statement of Lopes et al. and others that there are only 2^N-1 (here 3) ways to define the CDF.
Any thoughts?
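One way to make the alternatives concrete is to compute an empirical CDF for each quadrant orientation on a grid and pass it to contour(). The following is only a sketch on simulated data: the helper ecdf2d and its arguments are hypothetical names, and the quadrant-based definition is assumed to be the one meant by Lopes et al.

```r
set.seed(1)
x <- rnorm(200)
y <- rnorm(200)

# Hypothetical helper: fraction of points in the quadrant anchored at (a, b).
# x_le / y_le choose which side (<= or >) is taken along each axis.
ecdf2d <- function(gx, gy, x, y, x_le = TRUE, y_le = TRUE) {
  quadrant <- function(a, b) {
    in_x <- if (x_le) x <= a else x > a
    in_y <- if (y_le) y <= b else y > b
    mean(in_x & in_y)
  }
  outer(gx, gy, Vectorize(quadrant))
}

gx <- seq(min(x), max(x), length.out = 40)
gy <- seq(min(y), max(y), length.out = 40)

F1 <- ecdf2d(gx, gy, x, y, TRUE,  TRUE)
F2 <- ecdf2d(gx, gy, x, y, TRUE,  FALSE)
F3 <- ecdf2d(gx, gy, x, y, FALSE, TRUE)
F4 <- ecdf2d(gx, gy, x, y, FALSE, FALSE)

# One contour panel per orientation
op <- par(mfrow = c(2, 2))
contour(gx, gy, F1, main = "P(X <= x, Y <= y)")
contour(gx, gy, F2, main = "P(X <= x, Y > y)")
contour(gx, gy, F3, main = "P(X > x, Y <= y)")
contour(gx, gy, F4, main = "P(X > x, Y > y)")
par(op)
```

At every grid point the four surfaces add up to 1, so only three of them carry independent information, which is in line with the 2^N-1 count mentioned above.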
I have plotted a curve and have looked in both my book and on Stack Overflow, but I cannot seem to find any code to instruct R to tell me the value of y along the curve at x = 70.
curve(
20*1.05^x,
from=0, to=140,
xlab='Time passed since 1890',
ylab='Population of Salmon',
main='Growth of Salmon since 1890'
)
So in short, I would like to know how to command R to give me the number of salmon at 70 years, and at other times.
Edit:
To clarify, I was curious how to command R to show multiple Y values for X at an increase of 5.
salmon <- data.frame(curve(
20*1.05^x,
from=0, to=140,
xlab='Time passed since 1890',
ylab='Population of Salmon',
main='Growth of Salmon since 1890'
))
salmon$y[salmon$x==70]
[1] 608.5285
This salmon data.frame gives you all of the data.
head(salmon)
x y
1 0.0 20.00000
2 1.4 21.41386
3 2.8 22.92768
4 4.2 24.54851
5 5.6 26.28392
6 7.0 28.14201
You can also use inequalities to check the number of salmon in given ranges using the syntax above.
It's also simple to answer the 2nd part of your question using this object:
salmon$z <- salmon$y*5 # I am using * instead of + to make the plot more clear
plot(x=salmon$x,y=salmon$z, xlab='Time passed since 1890', ylab='Population of Salmon',type="l")
lines(salmon$x,salmon$y, col="blue")
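Because the x values returned by curve() are floating-point numbers on an evenly spaced grid, an exact test such as salmon$x==70 can miss the row you want. A slightly more defensive sketch of the same lookups follows; it rebuilds the 101-point grid that curve() uses by default instead of plotting.

```r
# Same grid as curve(..., from = 0, to = 140) with its default n = 101
salmon <- data.frame(x = seq(0, 140, length.out = 101))
salmon$y <- 20 * 1.05^salmon$x

# Row whose x is closest to 70, without relying on exact equality
nearest <- salmon[which.min(abs(salmon$x - 70)), ]
nearest

# Inequalities: rows where the population is between 100 and 200
in_range <- subset(salmon, y > 100 & y < 200)
range(in_range$x)

# Linear interpolation for an x that is not on the grid
approx(salmon$x, salmon$y, xout = 72)$y
```

Note that approx() interpolates linearly between grid points, so for an exponential curve it is only an approximation of the true value.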
curve is plotting the function 20*1.05^x
so just plug any value you want in that function instead of x, e.g.
> 20*1.05^70
[1] 608.5285
>
20*1.05^(seq(from=0, to=70, by=10))
That was all I had to do. I had forgotten, until Ed posted his reply, that I could type a function directly into R.
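For the values at every 5 years asked about in the edit, the same idea can be wrapped in a small function. A sketch, where salmon_pop and salmon_table are hypothetical names:

```r
# Hypothetical helper wrapping the growth formula that was passed to curve()
salmon_pop <- function(years) 20 * 1.05^years

# Evaluate the formula every 5 years over the plotted range
years <- seq(from = 0, to = 140, by = 5)
salmon_table <- data.frame(years = years, population = salmon_pop(years))
head(salmon_table)

# A single value, e.g. the population after 70 years
salmon_pop(70)
```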
I have a dataset containing 1599 observations and 10 attributes on which I need to do kmeans clustering. I have done the kmeans with 6 clusters and I can see the cluster centers, sizes, etc., and which observation lies in which cluster. Now I need to plot these results so that a single plot shows the following: on the x-axis, one of the 10 attributes of my original data; on the y-axis, another attribute; and in the plot, all 1599 observations, colored by the cluster they belong to (6 colors). So I will have 10C2 = 45 plots. Basically, this should tell me that cluster 1 is high/medium/low in terms of a particular attribute while cluster 2 is so and so, for all 6 clusters.
I tried the function plotcluster from the fpc package but, from what I understood, it maps the data into 2D using PCA and then plots the clusters in terms of 2 dimensions which are different from the original attributes. So if I say that cluster 1 is low in dim1, it doesn't really mean much.
Is there a function to do what I want, or should I just append the '$cluster' information from the kmeans output with my original data and try to plot taking 2 columns from my data at a time using the basic function plot()?
I suggest one solution, probably not the simplest one (with a for loop) but it seems to answer what you need:
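df <- mtcars
# Cluster on the original attributes, then store the labels as a factor
df$cluster <- factor(kmeans(df, centers=6)$cluster)
# All pairs of attribute columns (the cluster column itself is excluded)
mycomb <- combn(1:(ncol(df) - 1), 2)
for (xy in seq_len(ncol(mycomb))) {
  plot(x=df[, mycomb[1,xy]],
       y=df[, mycomb[2,xy]],
       col=as.numeric(df$cluster),
       xlab=names(df)[mycomb[1,xy]],
       ylab=names(df)[mycomb[2,xy]])
}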
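If dozens of separate plots are more than you want to page through, base R's pairs() draws every pair of attributes in one figure, and aggregate() gives the per-cluster means for the high/medium/low reading. A minimal sketch, using the four numeric columns of iris and 3 clusters as stand-ins for your data (both are assumptions):

```r
set.seed(123)
dat <- iris[, 1:4]
km <- kmeans(dat, centers = 3)

# One panel per pair of original attributes, colored by cluster
pairs(dat, col = km$cluster, pch = 19)

# Mean of each attribute within each cluster
cluster_means <- aggregate(dat, by = list(cluster = km$cluster), FUN = mean)
cluster_means
```

With 10 attributes the pairs() panels get small, so the loop of individual plots may still be easier to read; the table of means works the same either way.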
Hi, I am using the partitioning around medoids algorithm for clustering, via the pam function in the cluster package. I have 4 attributes in the dataset that I clustered, and they seem to give me around 6 clusters. I want to generate a plot of these clusters across those 4 attributes like this 1: http://www.flickr.com/photos/52099123#N06/7036003411/in/photostream/lightbox/ "Centroid plot"
But the only way I can draw the clustering result is either using a dendrogram or using
plot (data, col = result$clustering) command which seems to generate a plot similar to this
[2] : http://www.flickr.com/photos/52099123#N06/7036003777/in/photostream "pam results".
Although the first image is a centroid plot, I am wondering if there are any tools available in R to do the same with a medoid plot. Note that it also prints the size of each cluster in the plot. It would be great to know if there are any packages/solutions available in R that facilitate this or, if not, what would be a good starting point for achieving plots similar to Image 1.
Thanks
Hi all, I was trying to work out the problem the way joran suggested, but I think I did not understand it correctly and have not done it the way it is supposed to be done. Anyway, this is what I have done so far. The file that I tried to cluster looks like this:
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155
following is the Pam clustering output
GRMZM2G181227 1
GRMZM2G146885 2
GRMZM2G139463 2
GRMZM2G015295 2
GRMZM2G111909 2
GRMZM2G078097 3
GRMZM2G450498 3
GRMZM2G413652 2
GRMZM2G090087 2
AC217811.3_FG003 2
Using the above two files I generated a third file that looks something like this and has the cluster information in the form of a cluster type (K1, K2, etc.):
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip Cluster_type
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722 K1
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648 K2
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344 K2
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755 K2
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883 K2
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945 K3
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794 K3
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949 K2
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155 K2
I certainly don't think that this is the file that joran would have wanted me to create, but I could not think of anything else, so I ran lattice on the above file using the following code.
clusres<- read.table("clusinput.txt",header=TRUE,sep="\t");
jpeg(filename = "clusplot.jpeg", width = 800, height = 1078,
pointsize = 12, quality = 100, bg = "white",res=100);
parallel(~clusres[2:5]|Cluster_type,clusres,horizontal.axis=FALSE);
dev.off();
and I get a picture like this
Since I want one single line as the representative of the whole cluster at four different points, this output is wrong. Moreover, I tried playing with lattice, but I cannot figure out how to make it accept the RPKM values as the X coordinate. It always seems to plot many lines against a maximum or minimum value on the Y axis, and I don't understand what that is.
It would be great if anybody could help me out. Sorry if my question still seems absurd to you.
I do not know of any pre-built functions that generate the plot you indicate, which looks to me like a sort of parallel coordinates plot.
But generating such a plot would be a fairly trivial exercise.
Add a column of cluster labels (K1,K2, etc.) to your original data set, based on your clustering algorithm's output.
Use one of the many, many tools in R for aggregating data (plyr, aggregate, etc.) to calculate the relevant summary statistics by cluster on each of the four variables. (You haven't said what the first graph is actually plotting. Mean and sd? Median and MAD?)
Since you want the plots split into six separate panels, or facets, you will probably want to plot the data using either ggplot or lattice, both of which provide excellent support for creating the same plot, split across a single grouping vector (i.e. the clusters in your case).
But that's about as specific as anyone can get, given that you've provided so little information (i.e. no minimal runnable example, as recommended here).
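The steps above can be sketched with base R alone. The rows are a few of the genes listed in the question; the shortened column names and the choice of the median as the per-cluster summary are assumptions made for illustration.

```r
# A few rows from the question's table, with shortened column names (assumption)
clusres <- data.frame(
  geneID = c("GRMZM2G181227", "GRMZM2G146885", "GRMZM2G139463",
             "GRMZM2G078097", "GRMZM2G450498"),
  base = c(3.412444267, 14.17287135, 6.866752401, 4.433627458, 36.15941083),
  m1cm = c(3.16437442, 11.3577013, 5.373925806, 0.928492841, 9.45235616),
  p4cm = c(1.287909035, 2.778514642, 1.388843962, 0.063329249, 0.700105077),
  tip  = c(0.037320722, 2.226818648, 1.062745344, 0.034255945, 0.194759794),
  Cluster_type = c("K1", "K2", "K2", "K3", "K3")
)

# One summary row per cluster (median is an assumed choice of statistic)
med <- aggregate(cbind(base, m1cm, p4cm, tip) ~ Cluster_type,
                 data = clusres, FUN = median)
sizes <- table(clusres$Cluster_type)

# One line per cluster across the four samples
profiles <- t(as.matrix(med[, -1]))
cols <- seq_len(nrow(med))
matplot(profiles, type = "b", lty = 1, pch = 19, col = cols,
        xaxt = "n", xlab = "Sample", ylab = "Median RPKM")
axis(1, at = 1:4, labels = rownames(profiles))
legend("topright", col = cols, lty = 1, pch = 19,
       legend = paste0(med$Cluster_type, " (n = ",
                       sizes[as.character(med$Cluster_type)], ")"))
```

This draws all clusters in one panel with the cluster size in the legend; splitting into one panel per cluster would be the lattice or ggplot step described above.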
How about using clusplot from package cluster with partitioning around medoids? Here is a simple example (from the example section):
require(cluster)
#generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))
clusplot(pam(x, 2)) # `pam` does your partitioning
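To get at the medoids themselves and the cluster sizes asked about in the question, the object returned by pam() can be inspected directly. A short sketch on the same kind of simulated data (the seed and the name fit are arbitrary):

```r
library(cluster)

set.seed(10)
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))
fit <- pam(x, 2)

fit$medoids               # coordinates of the representative observations
fit$clusinfo[, "size"]    # number of observations in each cluster
table(fit$clustering)     # the same sizes, counted from the assignments

# Scatter plot with the medoids marked on top of the points
plot(x, col = fit$clustering)
points(fit$medoids, pch = 8, cex = 2)
```

The rows of fit$medoids play the role that the centroids play in the first image, so they could feed a profile plot of one line per cluster.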
I have a data frame with three columns and I'd like to make a image/heatmap of the data.
The three columns are pe, vix, and ret with pe and vix being x and y and ret being z.
There are 220 rows in the data frame, so I'd like to bin the data if possible; the ranges are below.
Any suggestions for how to bin the x and y data and also create a matrix for use in an image()?
> range(matr$pe)
[1] 13.32 44.20
> range(matr$vix)
[1] 10.42 59.89
> range(matr$ret)
[1] -0.09274936 0.04693118
> class(matr)
[1] "data.frame"
> head(matr)
pe vix ret
1 20.86 13.16 -0.002931561
2 20.46 12.53 -0.003546889
3 20.52 12.42 0.006339165
4 20.61 13.47 0.009683174
5 20.57 11.26 -0.002666668
6 20.81 11.73 0.002895003
Here's what I ended up doing. I used the interp() function in the akima package to create the appropriately binned matrix object. It seems to do the work of binning and 'matricizing' the data frame. On a side note, in order to make the heatmap WITH a legend, I ended up using the image.plot() method from the fields package. Here's the code:
par(bg = 3)
image.plot(s,xlab="P/E Ratio", ylab="VIX",
main="Contour Map of SPY Returns vs P/E Ratio and Vix")
abline(v=(seq(0,100,5)), col=6, lty="dotted")
abline(h=(seq(0,100,5)), col=6, lty="dotted")
contour(s, add=TRUE)
and resulting product for anyone interested:
Thanks to everyone for their help and suggestions.
You could use e.g. cut like this:
matr$binnedpe<-cut(matr$pe, breaks=10)
matr$binnedvix<-cut(matr$vix, breaks=10)
Next you can use e.g. ddply (from package plyr) to get the means per bin:
library(plyr)
binneddata <- ddply(matr, .(binnedpe, binnedvix), summarise, meanret=mean(ret))
Finally, you use this last data.frame to draw your heat map. I haven't tested any of the above, but it should be close enough to get you going.
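The same binning can also be sketched in base R with cut() and tapply(), which returns a matrix of the kind image() expects. The data below is simulated within the ranges quoted in the question and merely stands in for matr.

```r
# Simulated stand-in for matr (assumption: uniform pe/vix, normal ret)
set.seed(7)
matr <- data.frame(pe  = runif(220, 13.32, 44.20),
                   vix = runif(220, 10.42, 59.89),
                   ret = rnorm(220, 0, 0.02))

# Ten equal-width bins along each axis
matr$binnedpe  <- cut(matr$pe,  breaks = 10)
matr$binnedvix <- cut(matr$vix, breaks = 10)

# Mean return per (pe bin, vix bin); empty cells come back as NA
z <- tapply(matr$ret, list(matr$binnedpe, matr$binnedvix), mean)

# Heat map over bin indices; NA cells are left blank
image(x = seq_len(nrow(z)), y = seq_len(ncol(z)), z = z,
      xlab = "P/E bin", ylab = "VIX bin")
```

With 220 rows spread over 100 cells, many bins hold only one or two observations, so a coarser breaks value may give a more readable map.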
You should take a spin through the raster package. In particular, the function rasterFromXYZ() should do most of what you want. It's pretty easy, either with the base graphics tools or the raster package, to set up a 'heatmap' color range for the raster object.