I have performed PCA using the prcomp function, together with the FactoMineR package, on quite a substantial dataset of 3000 x 500.
I have tried plotting the principal components that cover up to 100% of the cumulative proportion of variance with an fviz_eig plot. However, the plot is very large because the dataset has so many dimensions. Is there any way in R to split a plot into multiple plots, using a for loop or otherwise?
Here is my plot, which only covers 80% of the variance because it is so large. Could I split this plot into 2 plots?
Large Dataset Visualisation
I have tried splitting the plot up using a for loop...
for(i in data[1:20]) {
fviz_eig(data, addlabels = TRUE, ylim = c(0, 30))
}
But this doesn't work.
Edited Reproducible example:
This is only a small reproducible example using an already available dataset in R but I used a similar method for my large dataset. It will show you how the plot actually works.
# Already existing data in R.
install.packages("boot")
library(boot)
library(factoextra)  # provides fviz_eig
data(frets)
frets
dataset_pca <- prcomp(frets)
dataset_pca$x
fviz_eig(dataset_pca, addlabels = TRUE, ylim = c(0, 100))
However, my large dataset has a lot more PCs than this one (possibly 100 or more to cover up to 100% of the cumulative proportion of variance), which is why I would like a way to split the single plot into multiple plots for better visualisation.
Update:
I have done what @G5W suggested below...
data <- prcomp(data, scale = TRUE, center = TRUE)
POEV = data$sdev^2 / sum(data$sdev^2)
barplot(POEV, ylim=c(0,0.22))
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
barplot(POEV[1:40], ylim=c(0,0.22), main="PCs 1 - 40")
text(0.7+(0:6)*1.2, POEV[1:40], labels = round(100*POEV[1:40], 1),
pos=3)
and I have now got a graph as follows...
Graph
But I am finding it difficult getting the labels to appear above each bar. Can someone help or suggest something for this please?
I am not 100% sure what you want as your result,
but I am 100% sure that you need to take more control over
what is being plotted, i.e. do more of it yourself.
So let me show an example of doing that. The frets data
that you used has only 4 dimensions so it is hard to illustrate
what to do with more dimensions, so I will instead use the
nuclear data - also available in the boot package. I am going
to start by reproducing the type of graph that you displayed
and then altering it.
library(boot)
data(nuclear)
N_PCA = prcomp(nuclear)
plot(N_PCA)
The basic plot of a prcomp object is similar to the fviz_eig
plot that you displayed but has three main differences. First,
it is showing the actual variances - not the percent of variance
explained. Second, it does not contain the line that connects
the tops of the bars. Third, it does not have the text labels
that tell the heights of the boxes.
Percent of Variance Explained. The return from prcomp contains
the raw information. str(N_PCA) shows that it has the standard
deviations, not the variances - and we want the proportion of total
variation. So we just create that and plot it.
POEV = N_PCA$sdev^2 / sum(N_PCA$sdev^2)
barplot(POEV, ylim=c(0,0.8))
This addresses the first difference from the fviz_eig plot.
Regarding the line, you can easily add that if you feel you need it,
but I recommend against it. What does that line tell you that you
can't already see from the barplot? If you are concerned about too
much clutter obscuring the information, get rid of the line. But
just in case you really want it, you can add the line with
lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
However, I will leave it out as I just view it as clutter.
Finally, you can add the text with
text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)
This is also somewhat redundant, but particularly if you change
scales (as I am about to do), it could be helpful for making comparisons.
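Instead of hard-coding the positions 0.7+(0:10)*1.2, one option is to use the bar midpoints that barplot returns, so the label positions always match the number of bars being drawn. A minimal sketch, assuming the same nuclear data from the boot package (the variable name mids is just for illustration):

```r
library(boot)   # provides the nuclear data set
data(nuclear)

N_PCA <- prcomp(nuclear)
POEV  <- N_PCA$sdev^2 / sum(N_PCA$sdev^2)

# barplot() invisibly returns the x positions of the bar midpoints
mids <- barplot(POEV, ylim = c(0, 0.8))

# Place one label above each bar, using those midpoints
text(mids, POEV, labels = round(100 * POEV, 1), pos = 3)
```

With the default bar width and spacing, mids is the same as 0.7+(0:10)*1.2, but it keeps working if you change the number of bars, their width, or the spacing.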
OK, now that we have the substance of your original graph, it is easy
to separate it into several parts. For my data, the first two bars are
big so the rest are hard to see. In fact, PC's 5-11 show up as zero.
Let's separate out the first 4 and then the rest.
barplot(POEV[1:4], ylim=c(0,0.8), main="PC 1-4")
text(0.7+(0:3)*1.2, POEV[1:4], labels = round(100*POEV[1:4], 1),
pos=3)
barplot(POEV[5:11], ylim=c(0,0.0001), main="PC 5-11")
text(0.7+(0:6)*1.2, POEV[5:11], labels = round(100*POEV[5:11], 4),
pos=3, cex=0.8)
Now we can see that even though PC 5 is much smaller than any of 1-4,
it is a good bit bigger than 6-11.
I don't know what you want to show with your data, but if you
can find an appropriate way to group your components, you can
zoom in on whichever PCs you want.
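If there are many components (say 100 or more), the same idea can be put in a loop that draws one labelled barplot per block of components. This is only a sketch under assumptions: plot_poev_chunks is a hypothetical helper, the chunk size of 20 is arbitrary, and the data are simulated as a stand-in for a real dataset.

```r
# Hypothetical helper: draw one labelled barplot per block of `chunk` components.
# Returns (invisibly) the component indices that went into each plot.
plot_poev_chunks <- function(poev, chunk = 20) {
  starts <- seq(1, length(poev), by = chunk)
  for (s in starts) {
    idx  <- s:min(s + chunk - 1, length(poev))
    vals <- poev[idx]
    # A separate y-scale per chunk keeps the small components visible
    mids <- barplot(vals, names.arg = idx, ylim = c(0, 1.2 * max(vals)),
                    main = paste("PC", min(idx), "-", max(idx)))
    text(mids, vals, labels = round(100 * vals, 1), pos = 3, cex = 0.8)
  }
  invisible(split(seq_along(poev), ceiling(seq_along(poev) / chunk)))
}

set.seed(1)
X    <- matrix(rnorm(200 * 50), nrow = 200)   # simulated stand-in data
pca  <- prcomp(X, scale. = TRUE)
poev <- pca$sdev^2 / sum(pca$sdev^2)

groups <- plot_poev_chunks(poev, chunk = 20)  # plots PCs 1-20, 21-40, 41-50
```

Because each chunk gets its own y-scale, bar heights are not comparable across plots; the percentage labels are what make the comparison possible.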
Plotting several series in the same plot is possible, and so is showing several subplots in one display. But I want several plots, which can be completely different things (not necessarily series or graphs of a map), to be displayed in exactly one frame. How can I do that? In Maple you assign a name to each plot, like
P1:=...:, P2:= ...:, and then call plots:-display(P1,P2,...); and it works. But I want to do this in Julia. Let's say I have the following plots as an example:
using Plots
pyplot()
x=[1,2,2,1,1]
y=[1,1,2,2,1]
plot(x,y)
p1=plot(x,y,fill=(0, :orange))
x2=[2,3,3,2,2]
y2=[2,2,3,3,2]
p2=plot(x2,y2,fill=(0, :yellow))
Now, how do I get both p1 and p2 in one plot? I don't want a shortcut or trick that produces the output of this specific example with one plot call; my question is general. For example, p2 can be a curve or something else, or I may have a for loop which generates a plot in each step, and I want to put all those shapes in one plot display at the end of the for loop.
Code for a simple example of trying to use plot!() for adding to a plot with arbitrary order.
using Plots
pyplot()
x=[1,2,2,1,1]
y=[1,1,2,2,1]
p1=plot(x,y,fill=(0, :orange))
x2=[2,3,3,2,2]
y2=[2,2,3,3,2]
p2=plot!(x2,y2,fill=(0, :orange))
p3=plot(x,y)
display(p2)
p5=plot!([1,2,2,1,1],[2,2,3,3,2],fill=(0, :green))
By running the above code I see the following plots respectively.
But what I expected to see is a plot with the green rectangle added inside the plot with the two orange rectangles.
The way to plot several series within the same set of axes is with the plot! function. Note the exclamation mark! It's part of the function name. While plot creates a new plot each time it is invoked, plot! will add the series to the current plot. Example:
plot(x, y)
plot!(x, z)
And if you are creating several plots at once, you can name them and refer to them in plot!:
p1 = plot(x, y)
plot!(p1, x, z)
If you want to combine whole plots in a single display, what you get is, technically, subplots.
The syntax is
plot(p1, p2)
Sorry, I don't know how to plot a whole plot (as opposed to a series) over another plot. As for the order of the plots, you can create as many plots as you want without displaying them and then display them whenever you want, e.g.:
using Plots
pyplot()
# Here we create independent plots, without displaying them:
x=[1,2,2,1,1]
y=[1,1,2,2,1]
p1=plot(x,y,fill=(0, :orange));
x2=[2,3,3,2,2]
y2=[2,2,3,3,2]
p2=plot(x2,y2,fill=(0, :orange));
p3=plot(x,y);
p5=plot([1,2,2,1,1],[2,2,3,3,2],fill=(0, :green));
# Here we display the plots (in the order we want):
println("P2:")
display(p2)
println("P3:")
display(p3)
println("P5:")
display(p5)
println("P1:")
display(p1)
I am trying to plot a time series graph, but am having issues getting it to be a line graph while showing the decades at the bottom.
My data set has the decades (as factors) next to performance (integer)
If I write
plot(StockPerformance$Decade, StockPerformance$Performance)
I will get a graph that has horizontal lines in it
PLOT PICTURE
adding,
type ="o"
like this:
plot(StockPerformance$Decade, StockPerformance$Performance, type ="o")
doesn't change it....
In R, when you read a data frame using read.table (or a variant thereof) or make one using data.frame, it tries to figure out what you have and treat it appropriately. Specifically, inputs with character vectors (like "1830s") get converted to factors (this is the default in R versions before 4.0.0).
Factors are a way to efficiently store character strings - which was a lot more important when R was first created than now. The important thing for you is that when the x variable is a factor, plot draws a boxplot for each level instead of a scatter plot. That's why you are seeing lines - they are boxplots with only one point each.
To get around this, you need to convert them to numbers for the purpose of plotting. Then, you need to "fix" the axes afterwards. Here's how:
plot(Performance ~ as.numeric(Decade),
data = StockPerformance,
xlab = "Decade", # otherwise we have "as.numeric(Decade)"
xaxt = 'n', # removes default axis ticks and labels
pch = 1 # default open circle. Change the number to get other options. 16 and 20 are both closed circles (20 is small, 16 is big)
)
with(StockPerformance, # This just makes it so I don't have to type StockPerformance twice below.
axis(1, at = 1:nlevels(Decade),
labels = levels(Decade) # tick labels are the decade names
))
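To get the connected line graph the question asks for, the same conversion can be combined with type = "o". A minimal, self-contained sketch follows; the data frame here is a made-up stand-in for StockPerformance, and it assumes the rows are already in decade order (factor levels sort alphabetically, which happens to be chronological for labels like "1830s").

```r
# Hypothetical stand-in for the StockPerformance data frame
StockPerformance <- data.frame(
  Decade      = factor(c("1830s", "1840s", "1850s", "1860s")),
  Performance = c(3L, -1L, 5L, 2L)
)

# as.numeric() on a factor gives the level codes 1, 2, 3, ...
x <- as.numeric(StockPerformance$Decade)

# type = "o" draws points joined by lines; xaxt = "n" suppresses the default axis
plot(x, StockPerformance$Performance, type = "o",
     xlab = "Decade", ylab = "Performance", xaxt = "n")

# Put the decade names back at the positions of the level codes
axis(1, at = seq_along(levels(StockPerformance$Decade)),
     labels = levels(StockPerformance$Decade))
```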
For my thesis I want to create a histogram of standardized earnings. This histogram should ideally have the following properties:
It should be possible to adjust the intervals of the data
(bins).
Since I have my data in a spreadsheet, is it possible to consider
more than one column?
It should also be possible to set the range of the data that is
included in the histogram, for example from -50 million to 200 million. (But
I could do this in my input.)
Sadly, I was not able to do this on my own.
I downloaded the data from Orbis as a spreadsheet (xlsx). Afterwards I cleaned my data of symbols that R can't read, saved everything as a tab-separated .txt and imported it into RStudio:
setwd("/path")
getwd()
df<- read.table("importFile", header = TRUE)
View(df)
This worked nicely.
Then I tried creating the histogram:
library(ggplot2)
myplot=ggplot(df, aes(JuStandartisiert2007))
myplot+ stat_count(width = 1000)
Then I received the following warning:
position_stack requires non-overlapping x intervals
My histogram looks horrible:
This perplexes me: I tried making a histogram of the airquality dataset and it worked without problems.
Also note that I had to use stat_count for my histogram; in a YouTube video I saw, they did it the following way:
myplot+ geom_histogram(binwidth = 10)
My questions are now:
What is wrong with my data that gives me overlapping x values? To my naked eye my data looks the same as R's airquality dataset.
How can I separate my x values?
Can I set max and min values for the data that enters my histogram?
Can I consider more than one column in my dataset?
Here is my Dataset as TAB separated txt file.
https://www.dropbox.com/sh/jbscj6cftpcqaxh/AADglvv_xnG2wWN-o2SIrTwpa?dl=0
I would rather begin with base plotting such as:
hist(df$JuStandartisiert2007,breaks=1000,xlim=c(-2,2))
Note the xlim argument, which sets the limits of the x-axis.
To plot two columns against each other, try:
plot(df$JuStandartisiert2007,df$BilanzsummeAktiva2007,xlim = c(-5,5),ylim=c(-1,1000))
Once again, note the x and y limits, set by xlim and ylim.
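The three wishes from the question (adjustable bins, a restricted data range, and more than one column) can all be handled with base hist. The following is only a sketch: the data are simulated, JuStandartisiert2008 is a hypothetical second column, and the range and bin width are arbitrary choices.

```r
set.seed(42)
# Simulated stand-in for the imported data; the 2008 column is hypothetical
df <- data.frame(
  JuStandartisiert2007 = rnorm(500),
  JuStandartisiert2008 = rnorm(500, mean = 0.5)
)

lo <- -2
hi <- 2
# Keep only the values inside the range that should enter the histogram
in_range <- function(v) v[!is.na(v) & v >= lo & v <= hi]

# Explicit break points control the bin width (0.25 here)
breaks <- seq(lo, hi, by = 0.25)

h1 <- hist(in_range(df$JuStandartisiert2007), breaks = breaks,
           col = rgb(1, 0.5, 0, 0.5), main = "Standardized earnings",
           xlab = "Value")
# add = TRUE overlays the second column on the same axes
h2 <- hist(in_range(df$JuStandartisiert2008), breaks = breaks,
           col = rgb(0, 0, 1, 0.5), add = TRUE)
```

Unlike xlim, which only zooms the axis, filtering the data first means values outside the range are not counted at all; which of the two is appropriate depends on what the histogram is meant to show.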
I have data in a csv file which I have imported to R. The data is in a test file available at http://www.cyclismo.org/tutorial/R/_static/trees91.csv
I have imported this using:
tree<-read.csv(file="trees91.csv",header=TRUE,sep=",");
I can then extract two rows as follows
m<-tree[1,4:28]
n<-tree[2,4:28]
I would then like to plot these two sets of data as a scatter graph. I am using the command:
plot(m,n)
However, this doesn't give me a scatter graph. Instead I get a plot with 25x25 small squares each with a small circle in. The ones on the diagonal contain a number in them. The left hand y-axis and top x-axis have the same labels (0.25,0.20, 0.25, 25, 25, 0.10, 0.5, 0.4, 0.08,0.15,0.10,0.10) whilst the other two axes have the labels (0.6,0.08,1.5,0.6, 12,0.15,0.1,0.8,0.08,0.04,0.20,0.08,0.08). I have tried this with both a header row and without a header row in the csv file (setting header =FALSE in the input command) and get the same problem.
Using the same approach but extracting two columns, I am able to plot a scatter graph, so I have no idea why R won't plot a scatter graph from rows in a csv file. This seems like a fairly basic thing to want to do.
Are you after this:
plot(unlist(m),unlist(n))
tree is a data frame, and so are m and n, since they are subsets of it. The default for data frames is to plot each column against each column, so you get 25x25 plots, as you saw. unlist converts the one-row data frame to a vector, so you get the plotting behaviour that you were expecting.
See:
?plot.default for what you want.
?plot.data.frame for what you're getting.
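The difference is easy to see on a tiny example. This is a sketch with a made-up three-column data frame standing in for tree[, 4:28]:

```r
# Hypothetical stand-in for the numeric columns of the tree data
tree <- data.frame(a = c(1, 2), b = c(3, 5), c = c(4, 9))

m <- tree[1, ]   # still a data frame with one row
n <- tree[2, ]

is.data.frame(m)  # TRUE, so plot(m, n) would dispatch to plot.data.frame

# unlist() flattens each one-row data frame into a named numeric vector
x <- unlist(m)
y <- unlist(n)

plot(x, y)        # an ordinary scatter plot with one point per column
```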