I am working with the California Housing dataset, which has 20,640 observations and 10 attributes. I am using R to make a biplot, but the figure I obtained is not very readable.
I am using this simple code to produce the output:
biplot(housingpr,scale = 0)
Is there any way to make this biplot more readable?
There is not much you can do if you want to plot 20,640 observations except to make the points smaller. Here is an example with the iris data:
data(iris)
iris.pca <- prcomp(iris[, -5], scale.=TRUE)
biplot(iris.pca, xlabs=rep("*", nrow(iris)), cex=.75)
The xlabs= argument sets the text for each point with the default value being the row number. This replaces the default with an asterisk for each value. If there are still too many points you can replace the asterisk with a period. The cex= argument controls the size of the labels with the default value of 1 being full-size.
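If even small labels overplot, another option is to draw the biplot by hand: plot the scores as tiny translucent points and overlay the loading arrows yourself. The following is only a minimal sketch on iris; the arrow scaling factor `sc` is an arbitrary choice made here for illustration and is not how biplot() scales its arrows.

```r
# Hand-drawn biplot: small translucent points plus loading arrows.
iris.pca <- prcomp(iris[, -5], scale. = TRUE)
scores   <- iris.pca$x[, 1:2]          # observation coordinates on PC1/PC2
loads    <- iris.pca$rotation[, 1:2]   # variable loadings on PC1/PC2

# Assumption: stretch the arrows so the longest one reaches ~80% of the score range.
sc <- 0.8 * max(abs(scores)) / max(abs(loads))

plot(scores, pch = 16, cex = 0.4,
     col = adjustcolor("black", alpha.f = 0.3),
     xlab = "PC1", ylab = "PC2")
arrows(0, 0, sc * loads[, 1], sc * loads[, 2], length = 0.08, col = "red")
text(sc * loads[, 1], sc * loads[, 2], labels = rownames(loads),
     col = "red", pos = 3, cex = 0.8)
```

With 20,640 rows, lowering `alpha.f` further makes dense regions show up as darker patches instead of a solid blob.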
I want to show a cluster-wise boxplot distribution alongside a ComplexHeatmap heatmap. I was able to get a row-wise distribution, but how do I implement the cluster-wise distribution shown in the attached example?
The dummy example creates a subgroup, which it then shows in the distribution. Similarly, I have already defined clusters in my data file; they are given in the first column.
How do I implement this for my data frame using this example code?
I'm not sure how to define the subgroup in the case of my data frame.
Any suggestions or help would be much appreciated.
This is the output I would like to see:
This is the output I have:
The dataset is this one: small_data
And my code:
library(ComplexHeatmap)  # provides Heatmap()
df <- read.csv("small_data.txt", header = TRUE)
heat <- t(scale(t(df[,3:ncol(df)])))
myBreaks <- seq(-1.5, 1.5, length.out=100)
hmap <- Heatmap(heat)
hmap
How do I implement the cluster-specific distribution shown in the first picture? The second figure is what I'm getting now.
I have performed PCA using the prcomp function (which lives in base R's stats package, not in FactoMineR) on quite a substantial dataset of 3000 x 500.
I have tried plotting the principal components that cover up to 100% of the cumulative variance with an fviz_eig plot. However, this is a very large plot because of the dimensions of the dataset. Is there any way in R to split a plot into multiple plots, using a for loop or otherwise?
Here is my plot, which only covers 80% of the variance because it is already so large. Could I split this plot into 2 plots?
Large Dataset Visualisation
I have tried splitting the plot up using a for loop...
for(i in data[1:20]) {
fviz_eig(data, addlabels = TRUE, ylim = c(0, 30))
}
But this doesn't work.
Edited Reproducible example:
This is only a small reproducible example using an already available dataset in R but I used a similar method for my large dataset. It will show you how the plot actually works.
# Already existing data in R.
install.packages("boot")
library(boot)
library(factoextra)  # provides fviz_eig()
data(frets)
frets
dataset_pca <- prcomp(frets)
dataset_pca$x
fviz_eig(dataset_pca, addlabels = TRUE, ylim = c(0, 100))
However, my large dataset has a lot more PCs than this one (possibly 100 or more to cover up to 100% of the cumulative variance), which is why I would like a way to split the single plot into multiple plots for better visualisation.
Update:
I have done what @G5W suggested below...
data <- prcomp(data, scale = TRUE, center = TRUE)
POEV = data$sdev^2 / sum(data$sdev^2)
barplot(POEV, ylim=c(0,0.22))
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
barplot(POEV[1:40], ylim=c(0,0.22), main="PCs 1 - 40")
text(0.7+(0:6)*1.2, POEV[1:40], labels = round(100*POEV[1:40], 1),
pos=3)
and I have now got a graph as follows...
Graph
But I am finding it difficult getting the labels to appear above each bar. Can someone help or suggest something for this please?
I am not 100% sure what you want as your result,
but I am 100% sure that you need to take more control over
what is being plotted, i.e. do more of it yourself.
So let me show an example of doing that. The frets data
that you used has only 4 dimensions so it is hard to illustrate
what to do with more dimensions, so I will instead use the
nuclear data - also available in the boot package. I am going
to start by reproducing the type of graph that you displayed
and then altering it.
library(boot)
data(nuclear)
N_PCA = prcomp(nuclear)
plot(N_PCA)
The basic plot of a prcomp object is similar to the fviz_eig
plot that you displayed but has three main differences. First,
it is showing the actual variances - not the percent of variance
explained. Second, it does not contain the line that connects
the tops of the bars. Third, it does not have the text labels
that tell the heights of the boxes.
Percent of Variance Explained. The return from prcomp contains
the raw information. str(N_PCA) shows that it has the standard
deviations, not the variances - and we want the proportion of total
variation. So we just create that and plot it.
POEV = N_PCA$sdev^2 / sum(N_PCA$sdev^2)
barplot(POEV, ylim=c(0,0.8))
This addresses the first difference from the fviz_eig plot.
Regarding the line, you can easily add that if you feel you need it,
but I recommend against it. What does that line tell you that you
can't already see from the barplot? If you are concerned about too
much clutter obscuring the information, get rid of the line. But
just in case you really want it, you can add the line with
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
However, I will leave it out as I just view it as clutter.
Finally, you can add the text with
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
This is also somewhat redundant, but particularly if you change
scales (as I am about to do), it could be helpful for making comparisons.
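Rather than hard-coding the bar positions as 0.7+(0:10)*1.2, you can capture them from barplot(), which returns the x coordinates of the bar midpoints. A small sketch, using mtcars only because it ships with base R and also happens to have 11 columns:

```r
# barplot() returns the x positions of the bar midpoints.
M_PCA <- prcomp(mtcars, scale. = TRUE)
POEV  <- M_PCA$sdev^2 / sum(M_PCA$sdev^2)   # proportion of variance explained

bp <- barplot(POEV, ylim = c(0, 0.8), names.arg = seq_along(POEV))

# Use the returned midpoints, so labels always line up with the bars.
text(bp, POEV, labels = round(100 * POEV, 1), pos = 3, cex = 0.8)
```

Because `bp` always has one entry per bar, the labels stay aligned however many components you plot, which avoids a mismatch between the number of positions and the number of labels.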
OK, now that we have the substance of your original graph, it is easy
to separate it into several parts. For my data, the first two bars are
big so the rest are hard to see. In fact, PCs 5-11 show up as zero.
Let's separate out the first 4 and then the rest.
barplot(POEV[1:4], ylim=c(0,0.8), main="PC 1-4")
text(0.7+(0:3)*1.2, POEV[1:4], labels = round(100*POEV[1:4], 1),
pos=3)
barplot(POEV[5:11], ylim=c(0,0.0001), main="PC 5-11")
text(0.7+(0:6)*1.2, POEV[5:11], labels = round(100*POEV[5:11], 4),
pos=3, cex=0.8)
Now we can see that even though PC 5 is much smaller than any of 1-4,
it is a good bit bigger than 6-11.
I don't know what you want to show with your data, but if you
can find an appropriate way to group your components, you can
zoom in on whichever PCs you want.
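For a data set with many components, the same idea can be put in a for loop. This is a sketch on made-up random data; the panel size `per_plot` and the 20% headroom on the y axis are arbitrary choices, not anything prescribed by prcomp or barplot.

```r
# Split a scree plot into panels of at most `per_plot` components each.
set.seed(1)
X    <- matrix(rnorm(60 * 25), nrow = 60)   # made-up data: 60 rows, 25 variables
pca  <- prcomp(X, scale. = TRUE)
POEV <- pca$sdev^2 / sum(pca$sdev^2)

per_plot <- 10                               # hypothetical panel size
groups   <- split(seq_along(POEV), ceiling(seq_along(POEV) / per_plot))

for (idx in groups) {
  top <- max(POEV[idx]) * 1.2                # headroom for the labels
  bp  <- barplot(POEV[idx], names.arg = idx, ylim = c(0, top),
                 main = sprintf("PCs %d - %d", min(idx), max(idx)))
  text(bp, POEV[idx], labels = round(100 * POEV[idx], 1), pos = 3, cex = 0.7)
}
```

Each panel gets its own y scale here, so the percentage labels matter when comparing across panels. Calling par(mfrow = c(1, length(groups))) before the loop would place the panels side by side.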
I would like to create an interactive 3D surface plot of depths in a lake, ideally using the plotly or rgl libraries. I have extracted my data from a SpatialLinesDataFrame of contour lines in Gauss-Krueger/EPSG:31468 CRS, i.e. metric units. Now each contour line produces a set of coordinates with the same depth value. The resulting data frame is rather large, but looks something like this:
set.seed(41)
xx <- rnorm(100,4448929,100)
yy <- rnorm(100,5308097,100)
zz <- c(rep(-10,10),rep(-20,10),rep(-30,10),rep(-40,10),rep(-50,10),rep(-60,10),rep(-70,10),rep(-80,10),rep(-90,10),rep(-100,10))
df <- data.frame(xx,yy,zz)
I have tried plotting the data with plotly as in this example and with rgl as in this post. In both cases I get error messages relating to my data not being in a matrix format, i.e. where x- and y-values are represented as row- and column-numbers.
What does work, is using the add_trace command in plotly:
plot_ly(df) %>% add_trace(x = ~xx, y = ~yy, z = ~zz, type = "mesh3d")
However, the resulting graph not only lacks the fancy colour legend of the add_surface command, but more importantly, warps the x- and y-values in relation to the z-values. The z-values are shown much too large, although all have the same metric unit.
I have also tried reshaping the data frame to a matrix as in this post, but it either doesn't work at all, or gives me a matrix consisting almost entirely of NAs. I can only speculate that the number of coordinates that have depth values attached is very small in comparison to all x-y-combinations of coordinates in that range?
Any suggestions will be much appreciated - thanks!
Those are randomly located points, so rgl::persp3d can't handle them directly. However, you can follow the example in ?rgl::persp3d.deldir to triangulate them and then plot. For example,
library(rgl)
dxyz <- deldir::deldir(df$xx, df$yy, z = df$zz, suppressMsgs=TRUE)
persp3d(dxyz, col = "lightblue")
This results in a pretty ugly picture, but with some work (e.g. fixing the axis labels, using real data) you should get something reasonable.
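On the reshaping attempt mentioned in the question: the speculation about NAs looks right. A surface matrix needs a z value for every combination of x and y, but scattered points almost never share an x or a y coordinate. A quick base R check on the example data, using tapply as one naive way to reshape (an assumption, since the question does not show its reshaping code):

```r
# Data from the question: 100 scattered points on 10 depth levels.
set.seed(41)
xx <- rnorm(100, 4448929, 100)
yy <- rnorm(100, 5308097, 100)
zz <- rep(seq(-10, -100, by = -10), each = 10)
df <- data.frame(xx, yy, zz)

# Naive reshape: one row per distinct x, one column per distinct y.
m <- with(df, tapply(zz, list(xx, yy), mean))

dim(m)           # grid size implied by the distinct coordinates
mean(is.na(m))   # share of grid cells with no depth value
```

Each point can fill at most one cell, so with about 100 distinct x values and 100 distinct y values nearly all of the roughly 10,000 cells stay empty. That is why triangulating, or interpolating onto a regular grid, is needed before a surface can be drawn.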
I am trying to do PCA with R.
My data has 10,000 columns and 90 rows.
I used the prcomp function to do the PCA.
Trying to prepare a biplot with the prcomp results, I ran into the problem that the 10,000 plotted vectors cover my datapoints. Is there any option for the biplot to hide the vectors' representation?
OR
I can use plot to get the PCA results. But I am not sure how to label these points according to my datapoints, which are numbered 1 to 90.
Sample <- read.table(file.choose(), header=FALSE, sep="\t")
Sample.scaled <- data.frame(apply(Sample, 2, scale))
Sample.scaled.2 <- data.frame(t(na.omit(t(Sample.scaled))))
pca.Sample <- prcomp(Sample.scaled.2, retx=TRUE)
pdf("Sample_plot.pdf")
plot(pca.Sample$x)
dev.off()
If you do a help(prcomp) or ?prcomp, the help file tells us all the things contained in the prcomp() object returned by the function. We just need to pick which things we want to plot and do it with some function that gives us more control than biplot().
A more general trick for cases when the help file doesn't clarify things is to call str() on the prcomp object (in your case pca.Sample) to see all its parts and find what we want. (str() compactly displays the internal structure of an R object.)
Here is an example with some of R's sample data:
# do a pca of arrests in different states
p<-prcomp(USArrests, scale = TRUE)
str(p) gives me something ugly and too long to include, but I can see that p$x has the states as rownames and their locations on the principal components as columns. Armed with this, we can plot it any way we want, such as with plot() and text() (for labels):
# plot and add labels
plot(p$x[,1],p$x[,2])
text(p$x[,1],p$x[,2],labels=rownames(p$x))
If we are making a scatterplot with many observations, the labels may not be readable. We therefore might want to only label more extreme values, which we can identify with quantile():
#make a new dataframe with the info from p we want to plot
df <- data.frame(PC1=p$x[,1],PC2=p$x[,2],labels=rownames(p$x))
#make sure labels are not factors, so we can easily reassign them
df$labels <- as.character(df$labels)
# use quantile() to identify which ones are within 25-75 percentile on both
# PC and blank their labels out
df[ df$PC1 > quantile(df$PC1)["25%"] &
df$PC1 < quantile(df$PC1)["75%"] &
df$PC2 > quantile(df$PC2)["25%"] &
df$PC2 < quantile(df$PC2)["75%"],]$labels <- ""
# plot
plot(df$PC1,df$PC2)
text(df$PC1,df$PC2,labels=df$labels)
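For the case in the question, where the observations are simply numbered 1 to 90, the row index itself can serve as the label. A sketch on made-up random data (200 columns instead of 10,000, just to keep it quick); no variable vectors are drawn at all:

```r
# Made-up stand-in for the 90-row data set.
set.seed(1)
X   <- matrix(rnorm(90 * 200), nrow = 90)
pca <- prcomp(X, scale. = TRUE)

# Set up an empty plot frame, then draw the row numbers at the score positions.
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = seq_len(nrow(pca$x)), cex = 0.7)
```

Using type = "n" means the numbers replace the plotting symbols rather than sitting on top of them.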
I am a newbie to R and I am trying to do some clustering on a data table where rows represent individual objects and columns represent the features that have been measured for these objects. I've worked through some clustering tutorials and I do get some output. However, the heatmap that I get after clustering does not correspond at all to the heatmap produced from the same data table with another programme. While the heatmap from that programme indicates clear differences in marker expression between the objects, my heatmap doesn't show many differences and I cannot recognize any clustering (i.e., colour) pattern in it. It just looks like a randomly jumbled set of colours that are close to each other (no big contrast). Here is an example of the code I am using; maybe someone has an idea of what I might be doing wrong.
mydata <- read.table("mydata.csv")
datamat <- as.matrix(mydata)
datalog <- log(datamat)
I am using log values for the clustering because I know that the other programme does so, too
library(gplots)
hr <- hclust(as.dist(1-cor(t(datalog), method="pearson")), method="complete")
mycl <- cutree(hr, k=7)
mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(mycl)]
heatmap(datamat, Rowv=as.dendrogram(hr), Colv=NA,
col=colorpanel(40, "black","yellow","green"),
scale="column", RowSideColors=mycol)
Again, I plot the original colours but use the log-clusters because I know that this is what the other programme does.
I tried to play around with the methods, but I don't get anything that looks even remotely like a clustered heatmap. When I take out the scaling, the heatmap becomes extremely dark (and I am actually quite sure that I have to scale or normalize the data by column somehow). I also tried to cluster with k-means, but again, this didn't help. My idea was that the colour scale might not be used completely because of two outliers, but although removing them slightly increased the range of colours plotted on the heatmap, this still did not reveal proper clusters.
Is there anything else I could play around with?
And is it possible to change the colour scale with heatmap so that outliers are found in the last bin that has a range of "everything greater than a particular value"? I tried to do this with heatmap.2 (argument "breaks"), but I didn't quite succeed and also I didn't manage to put the row side colours that I use with the heatmap function.
If you are okay with using heatmap.2 from the gplots package, it lets you pass breaks that assign colors to ranges of values in your heatmap.
For example if you had 3 colors blue, white, and red with the values going from low to high you could do something like this:
my.breaks <- c(seq(-5, -.6, length.out=6),seq(-.5999999, .1, length.out=4),seq(.100009,5, length.out=7))
result <- heatmap.2(mtscaled, Rowv=TRUE, scale='none', dendrogram="row", symm=TRUE, col=bluered(16), breaks=my.breaks)
In this case you have 3 sets of values that correspond to the 3 colors, the values will differ of course depending on what values you have with your data.
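If the goal is for outliers to end up in the last bin, one simple approach is to clamp the scaled values to a fixed range before plotting, so that everything beyond the cut-off takes the end color. This is a sketch with base heatmap() on made-up data; the cut-off `lim` is a hypothetical value you would pick for your own data.

```r
# Made-up data with two extreme outliers.
set.seed(1)
x <- matrix(rnorm(20 * 6), nrow = 20,
            dimnames = list(paste0("obj", 1:20), paste0("marker", 1:6)))
x[1, 1] <- 15
x[2, 3] <- -12

z   <- scale(x)                   # column-wise centring and scaling
lim <- 2                          # hypothetical cut-off
zc  <- pmax(pmin(z, lim), -lim)   # clamp: outliers land in the end bins

# 16 colors need 17 breaks; scale = "none" because z is already scaled.
heatmap(zc, Colv = NA, scale = "none",
        col = colorRampPalette(c("blue", "white", "red"))(16),
        breaks = seq(-lim, lim, length.out = 17))
```

Clamping first matters because the underlying image() call does not draw cells whose values fall outside the range of the breaks.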
One thing you are doing in your program is calling hclust on your data and then calling heatmap on it. However, the heatmap manual page says of its clustering function:
Defaults to hclust.
So I don't think you need to do that. You might want to take a look at some similar questions that I had asked that might help to point you in the right direction:
Heatmap Question 1
Heatmap Question 2
If you post an image of the heatmap you get and an image of the heatmap that the other program is making it will be easier for us to help you out more.