I have performed a PCA using the prcomp function, together with the factoextra package for plotting, on quite a substantial dataset of 3000 x 500.
I have tried plotting the principal components that together cover up to 100% of the cumulative variance proportion with an fviz_eig plot. However, this produces a very large plot because of the dataset's dimensions. Is there any way in R to split a plot into multiple plots, using a for loop or otherwise?
Here is a visual of my plot, which only covers 80% of the variance because the full plot is so large. Could I split this plot into 2 plots?
[Image: Large Dataset Visualisation]
I have tried splitting the plot up using a for loop...
for (i in data[1:20]) {
  fviz_eig(data, addlabels = TRUE, ylim = c(0, 30))
}
But this doesn't work.
Edit: reproducible example
This is only a small reproducible example using a dataset already available in R, but I used a similar method on my large dataset. It shows how the plot actually works.
# Already existing data in R.
install.packages("boot")
library(boot)
library(factoextra)  # provides fviz_eig()
data(frets)
frets
dataset_pca <- prcomp(frets)
dataset_pca$x
fviz_eig(dataset_pca, addlabels = TRUE, ylim = c(0, 100))
However, my large dataset has many more PCs than this one (possibly 100 or more to reach 100% of the cumulative variance proportion), which is why I would like a way to split the single plot into multiple plots for better visualisation.
Update:
I have performed what was suggested by @G5W below...
data <- prcomp(data, scale = TRUE, center = TRUE)
POEV = data$sdev^2 / sum(data$sdev^2)
barplot(POEV, ylim=c(0,0.22))
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
barplot(POEV[1:40], ylim=c(0,0.22), main="PCs 1 - 40")
text(0.7+(0:6)*1.2, POEV[1:40], labels = round(100*POEV[1:40], 1),
pos=3)
and I have now got a graph as follows...
[Image: Graph]
But I am finding it difficult to get the labels to appear above each bar. Can someone help or suggest something for this, please?
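A likely cause (an editorial note, not part of the original answer): the x positions passed to text() are hard-coded for a handful of bars, so they no longer line up once 40 bars are drawn. A sketch of a safer pattern is to let barplot() return the bar midpoints and reuse them:
mids <- barplot(POEV[1:40], ylim = c(0, 0.22), main = "PCs 1 - 40")
text(mids, POEV[1:40], labels = round(100 * POEV[1:40], 1), pos = 3, cex = 0.7)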
I am not 100% sure what you want as your result,
but I am 100% sure that you need to take more control over
what is being plotted, i.e. do more of it yourself.
So let me show an example of doing that. The frets data
that you used has only 4 dimensions so it is hard to illustrate
what to do with more dimensions, so I will instead use the
nuclear data - also available in the boot package. I am going
to start by reproducing the type of graph that you displayed
and then altering it.
library(boot)
data(nuclear)
N_PCA = prcomp(nuclear)
plot(N_PCA)
The basic plot of a prcomp object is similar to the fviz_eig
plot that you displayed but has three main differences. First,
it is showing the actual variances - not the percent of variance
explained. Second, it does not contain the line that connects
the tops of the bars. Third, it does not have the text labels
that tell the heights of the bars.
Percent of Variance Explained. The return from prcomp contains
the raw information. str(N_PCA) shows that it has the standard
deviations, not the variances - and we want the proportion of total
variation. So we just create that and plot it.
POEV = N_PCA$sdev^2 / sum(N_PCA$sdev^2)
barplot(POEV, ylim=c(0,0.8))
This addresses the first difference from the fviz_eig plot.
Regarding the line, you can easily add that if you feel you need it,
but I recommend against it. What does that line tell you that you
can't already see from the barplot? If you are concerned about too
much clutter obscuring the information, get rid of the line. But
just in case you really want it, you can add the line with
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
However, I will leave it out as I just view it as clutter.
Finally, you can add the text with
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
This is also somewhat redundant, but particularly if you change
scales (as I am about to do), it could be helpful for making comparisons.
OK, now that we have the substance of your original graph, it is easy
to separate it into several parts. For my data, the first two bars are
big so the rest are hard to see. In fact, PCs 5-11 show up as zero.
Let's separate out the first 4 and then the rest.
barplot(POEV[1:4], ylim=c(0,0.8), main="PC 1-4")
text(0.7+(0:3)*1.2, POEV[1:4], labels = round(100*POEV[1:4], 1),
pos=3)
barplot(POEV[5:11], ylim=c(0,0.0001), main="PC 5-11")
text(0.7+(0:6)*1.2, POEV[5:11], labels = round(100*POEV[5:11], 4),
pos=3, cex=0.8)
Now we can see that even though PC 5 is much smaller than any of 1-4,
it is a good bit bigger than 6-11.
I don't know what you want to show with your data, but if you
can find an appropriate way to group your components, you can
zoom in on whichever PCs you want.
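For the larger dataset in the question, if the grouping is simply "every 20 PCs", a short loop can produce all of the panels. This is only a sketch under that assumption, reusing the POEV vector computed above; letting barplot() return the bar midpoints keeps each label above its own bar:
chunk_size <- 20
chunks <- split(seq_along(POEV), ceiling(seq_along(POEV) / chunk_size))
for (idx in chunks) {
  mids <- barplot(POEV[idx], ylim = c(0, max(POEV[idx]) * 1.15),
                  main = paste("PCs", min(idx), "-", max(idx)))
  text(mids, POEV[idx], labels = round(100 * POEV[idx], 1), pos = 3, cex = 0.8)
}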
I have run a cluster analysis on some time series data using permuco in R. (The analysis permutes the labels of the control/treatment conditions and uses the F statistic to assess how likely it is that the observed time clusters of significant differences occurred by chance.)
So far so good.
I have produced a number of plots using the inbuilt function plot.clusterlm that comes with this package. However, the data come from different groups, and the F values on the y axis get rescaled in each plot, i.e. the values and ticks are reset depending on how strong the effects are.
This is problematic, because the different plots based on different cluster analyses are not visually comparable.
I would like to rescale the y axis, so that all clusters are visualised along the same F values (0-10 for example).
I haven't been able to do that, and I was wondering if there is a way to pass any additional functions into the plot.clusterlm to do this.
This is the usage of the function, but I don't see a way to rescale the y axis. (Rescaling the x axis is possible by manipulating nbbaselinepts and nbptsperunit, but that's not what I want...)
plot(x, effect = "all", type = "statistic",
multcomp = "clustermass", alternative = "two.sided",
enhanced_stat = FALSE, nbbaselinepts = 0, nbptsperunit = 1, ...)
If you have any ideas on this, please let me know.
Thank you!
Thanks for using permuco! I opened an issue on GitHub to have a solution for implementing these features. You can expect changes in further releases of permuco.
However, the plot() method shows the F statistic, which is not a good measure of effect size. A better measure of effect size is the partial eta-squared, which is implemented in the afex package.
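For illustration only (the data frame, column names and design below are hypothetical, not from this thread), partial eta-squared can be requested from afex roughly like this:
library(afex)
# Hypothetical repeated-measures design with one within-subject factor
fit <- aov_ez(id = "subject", dv = "amplitude", data = my_data,
              within = "condition",
              anova_table = list(es = "pes"))  # "pes" = partial eta-squared
fit$anova_table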
In the base R plotting device, axes are altered like this:
x <- 1:10; y <- x * x
# Simple graph
plot(x, y)
# Enlarge the scale
plot(x, y, xlim=c(1,15), ylim=c(1,150))
# Log scale
plot(x, y, log="y")
This is an example from STHDA where you can find many helpful tutorials.
I would like to build a very simple rectangular surface in R that has a logistic trend: the values at the top would be the highest (1) and at the bottom the lowest (0). I have drafted an image that shows an example of the surface I have in mind, with the help of some rough trend lines, so you have an idea of what is needed. I do not have any data; it is supposed to be a theoretical surface with a logistic trend that I am later going to modify.
Any help with how to start/approach it, or helpful packages in R would be highly appreciated!
Consider this as a hint.
library("graphics")
plot(0:1, type = "n",xaxt="n", ann=FALSE)
abline(h = c(seq(0,1,.1))
or
abline(h = c(0,.1,.2,.3,.6,.7,.8,.9))
abline(h = c(0.4,.5), col="red")
The only thing you have to do is put the variable with, as you call it, the "logistic trend" in place of 0:1 (see the sketch below).
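A minimal sketch of that substitution (my own assumption of what the logistic trend should look like, using stats::plogis for the logistic curve):
y <- plogis(seq(-6, 6, length.out = 21))  # logistic values between 0 and 1
plot(0:1, type = "n", xaxt = "n", ann = FALSE)
abline(h = y)                             # lines bunch up near 0 and 1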
A second hint
df = as.matrix(c(0.131313131,0.111111111,0.090909091,
0.080808081,0.060606061,0.050505051,
0.060606061,0.080808081,0.090909091,
0.111111111,0.131313131))
barplot(prop.table(df, 2))
This results in the barplot shown in the original answer (image not reproduced here).
I have a cluster plot in R, and I want to apply the "elbow criterion" to choose the number of clusters from a WSS (within-cluster sum of squares) plot. I drew a WSS plot for my clustering, but it looks really strange and I cannot tell where the elbow is or how many clusters to keep. Could anyone help me?
Here is my data:
Friendly<-c(0.533,0.854,0.9585,0.925,0.9125,0.9815,0.9645,0.981,0.9935,0.9585,0.996,0.956,0.9415)
Polite<-c(0,0.45,0.977,0.9915,0.929,0.981,0.9895,0.9875,1,0.96,0.996,0.873,0.9125)
Praising<-c(0,0,0.437,0.9585,0.9415,0.9605,0.998,0.998,0.8915,1,1,1,0.977)
Joking<-c(0,0,0,0.617,0.942,0.9665,0.9935,0.992,0.935,0.987,0.975,0.9915,0.9665)
Sincere<-c(0,0,0,0,0.617,0.8335,0.985,0.9895,0.977,0.9205,1,0.9585,0.8895)
Serious<-c(0,0,0,0,1,0.642,0.975,0.9605,0.9645,0.9895,0.8125,0.9605,0.925)
Hostile<-c(0,0,0,0,0,0,0.629,0.656,0.948,0.9705,0.9645,0.998,0.9685)
Rude<-c(0,0,0,0,0,0,0,0.687,0.979,0.954,0.954,0.996,0.956)
Irony<-c(0,0,0,0,0,0,0,0,0.354,0.9815,0.996,1,0.971)
Insincere<-c(0,0,0,0,0,0,0,0,1,0.396,0.996,0.9915,0.9415)
Commanding<-c(0,0,0,0,0,0,0,0,0,1,0.462,0.9605,0.9165)
Suggesting<-c(0,0,0,0,0,0,0,0,0,0,0,0.867,0.775)
Neutral<-c(0,0,0,0,0,0,0,0,0,0,0,0,0.283)
data <- data.frame(Friendly, Polite, Praising, Joking, Sincere, Serious, Hostile, Rude, Irony, Insincere, Commanding, Suggesting, Neutral)
And here is my clustering code; the method is given by Gavin in the last line of his answer to: How to draw the plot of within-cluster sum-of-squares for a cluster?
##cluster analysis
dist<-as.dist(data)
hc<-hclust(dist, method="average")
plot(hc, main="", sub='Method="Average"', ann=T, axes=T, hang=0.2)
##draw a wss plot
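## The wrap() helper is not defined in this post; what follows is a sketch of
## what it needs to do, based on the description above (an assumption, not
## Gavin's exact code): cut the tree into i clusters and sum the
## within-cluster sums of squares.
wss <- function(d) sum(scale(d, scale = FALSE)^2)
wrap <- function(i, h, x) {
  cl <- cutree(h, k = i)
  sum(sapply(split(x, cl), wss))
}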
res <- sapply(seq.int(1, 13), wrap, h = hc, x = data)
plot(seq_along(res), res, type="b", pch=19)
But it looks like this. Can anyone explain why this happened and how I should apply the "elbow criterion"?
Why do you expect that WSS will decline smoothly with increasing numbers of clusters? It need not, as you found out. Only with well-behaved data have I seen nicely behaved scree plots.
There is a big drop in the WSS with 7 clusters which might suggest you want to stop there. However, you should also look at the dendrogram when you evaluate this.
I have two related problems.
Problem 1: I'm currently using the code below to generate a histogram overlayed with a density plot:
hist(x,prob=T,col="gray")
axis(side=1, at=seq(0,100, 20), labels=seq(0,100,20))
lines(density(x))
I've pasted the data (i.e. x above) here.
I have two issues with the code as it stands:
1. The last tick and label (100) of the x-axis does not appear on the histogram/plot. How can I put these on?
2. I'd like the y-axis to show count or frequency rather than density, but I'd like to retain the density curve as an overlay on the histogram. How can I do this?
Problem 2: using a similar solution to problem 1, I now want to overlay three density plots (not histograms), again with frequency on the y-axis instead of density. The three data sets are at:
http://pastebin.com/z5X7yTLS
http://pastebin.com/Qg8mHg6D
http://pastebin.com/aqfC42fL
Here are your first 2 questions:
myhist <- hist(x,prob=FALSE,col="gray",xlim=c(0,100))
dens <- density(x)
axis(side=1, at=seq(0,100, 20), labels=seq(0,100,20))
lines(dens$x,dens$y*(1/sum(myhist$density))*length(x))
The histogram has a bin width of 5, which is also equal to 1/sum(myhist$density). The density(x)$x values, by contrast, move in small steps, around 0.2 in your case (512 even steps). sum(density(x)$y) is some strange number, definitely not 1, but that is because it goes in those small steps; when divided by the x interval it is approximately 1: sum(density(x)$y)/(1/diff(density(x)$x)[1]). You don't need to apply that correction here because the density curve is already matched up with its own x values. So scale 1) for the bin width of hist() and 2) for the frequency of x, length(x), as DWin says. The last axis tick became visible after setting the xlim argument.
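As a quick sanity check of those two scaling factors (assuming the same x and the myhist and dens objects created above):
1 / sum(myhist$density)        # the bin width, 5 here
sum(dens$y) * diff(dens$x)[1]  # approximately 1, so dens$y is a proper density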
To do your problem 2, set up a plot with the correct dimensions (xlim and ylim), with type = "n", then draw 3 lines for the densities, scaled using something similar to the density line above (see the sketch below). Think, however, about whether you want those semi-continuous lines to reflect the heights of imaginary bars with bin width 5... You see how that might make the density lines exaggerate the counts at any particular point?
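A minimal sketch of that setup (assuming the three pastebin data sets have been read into x1, x2 and x3, the same 0-100 x range as problem 1, and the hypothetical bin width of 5 for the count scaling):
d1 <- density(x1); d2 <- density(x2); d3 <- density(x3)
binwidth <- 5
ymax <- max(d1$y * binwidth * length(x1),
            d2$y * binwidth * length(x2),
            d3$y * binwidth * length(x3))
plot(0, 0, type = "n", xlim = c(0, 100), ylim = c(0, ymax),
     xlab = "x", ylab = "Count")
lines(d1$x, d1$y * binwidth * length(x1))
lines(d2$x, d2$y * binwidth * length(x2), col = "red")
lines(d3$x, d3$y * binwidth * length(x3), col = "blue")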
Although this is an aged thread, in case anyone catches this: whether it is a 'good idea' to forego translating the y-axis density to a count scale really depends on what the user is attempting to do.
There are perfectly good reasons for using frequency as the y value. One idea in particular that comes to mind is that using counts on the y scale can give an analyst a good idea of where to begin the 'data hunt' for stratifying heterogeneous data, if a mixed distribution model cannot soundly or intuitively be applied.
In practice, overlaying a density estimate on the observed histogram can be very useful in data quality checks. For example, if I were looking at the above graphic as a single source of data, with the assumption that it describes "1 thing" and I wish to model it as "1 thing", I have an issue: the data are heterogeneous and may require some level of stratification. The density overlay then becomes a simple visual tool for detecting that heterogeneity (apart from using log transformations to smooth between-interval variation), and for locating the mixed distributions when stratifying the data.