I run PCA analyses and usually plot the results with ggplot2, but I recently discovered ggbiplot, which can show arrows for the variables.
ggbiplot seems to work OK, though it has some problems (such as the impossibility of changing the point size, hence the whole layer thing I do in the MWE).
The problem I am facing now is that, while ggplot2 adjusts the plot width to the plotting area, ggbiplot does not. With my data, the ggbiplot plot is horribly narrow and leaves horribly wide vertical margins, even though it spans the same x-axis interval as the ggplot2 plot (it is, in fact, the same plot).
I am using the iris data here, so I had to make the png width extra large to make the problem evident. Please check the MWE below:
library(ggplot2)
library(ggbiplot)
data(iris)
head(iris)
pca.obj <- prcomp(iris[,1:4],center=TRUE,scale.=TRUE)
pca.df <- data.frame(Species=iris$Species, as.data.frame(pca.obj$x))
rownames(pca.df) <- NULL
png(filename="test1.png", height=500, width=1000)
print(#or ggsave()
ggplot(pca.df, aes(x=PC1, y=PC2)) +
geom_point(aes(color=Species), cex=3)
)
dev.off()
P <- ggbiplot(pca.obj,
obs.scale = 1,
var.scale=1,
ellipse=T,
circle=F,
varname.size=3,
groups=iris$Species, #no need for coloring, I'm making the points invisible
alpha=0) #invisible points, I add them below
P$layers <- c(geom_point(aes(color=iris$Species), cex=3), P$layers) #add geom_point in a layer underneath (only way I have to change the size of the points in ggbiplot)
png(filename="test2.png", height=500, width=1000)
print(#or ggsave()
P
)
dev.off()
This code produces the following two images.
ggplot2 output (desired plot width):
ggbiplot output (plot too narrow for plotting area):
See how ggplot2 adjusts the plot width to the plot area, while ggbiplot does not. With my data, the ggbiplot plot is extremely narrow and leaves large vertical margins.
My question here is: How to make ggbiplot behave as ggplot2? How can I adjust the plot width to my desired plotting area (png size) with ggbiplot? Thanks!
Add coord_equal() to your plot with the ratio argument set to some value smaller than 1 (ggbiplot() uses a ratio of 1 by default). From the function description: "Ratios higher than one make units on the y axis longer than units on the x-axis, and vice versa."
P + coord_equal(ratio = 0.5)
NOTE: as @Brian noted in the comments, "changing the aspect ratio would bias the interpretation of the length of the principal component vectors, which is why it's set to 1."
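To see why a fixed ratio of 1 can give such a narrow panel in a wide device, it may help to work out the panel shape from the score ranges. This is only a rough sketch in base R: `panel_aspect` is a hypothetical helper (not part of ggplot2 or ggbiplot), and it ignores axis expansion, the legend and the margins.

```r
# Scores from the same PCA as in the question
pca.obj <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
pc1_span <- diff(range(pca.obj$x[, "PC1"]))
pc2_span <- diff(range(pca.obj$x[, "PC2"]))

# Hypothetical helper: approximate height/width of the panel when one
# y unit is drawn `ratio` times as long as one x unit
panel_aspect <- function(ratio) ratio * pc2_span / pc1_span

aspect_one  <- panel_aspect(1)    # ratio of 1, as used by ggbiplot
aspect_half <- panel_aspect(0.5)  # the ratio suggested above
c(ratio_1 = aspect_one, ratio_0.5 = aspect_half)
```

Comparing these values with the height/width of the png (500/1000 in the question) gives a feel for how much of the device the panel can fill at a given ratio.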
I have data that is mostly centered in a small range (1-10), but a significant number of points (say, 10%) are in (10-1000). I would like to plot a histogram for this data that focuses on (1-10) but also shows the (10-1000) data. Something like a log scale for the histogram.
Yes, I know this means not all bins are of equal size.
A simple hist(x) gives
while hist(x,breaks=c(0,1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,3,4,5,7.5,10,15,20,50,100,200,500,1000,10000)) gives
none of which is what I want.
update
Following the answers here, I now produce something that is almost exactly what I want (I went with a continuous plot instead of a bar histogram):
breaks <- c(0,1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2,4,8)
ggplot(t,aes(x)) + geom_histogram(colour="darkblue", size=1, fill="blue") + scale_x_log10('true size/predicted size', breaks = breaks, labels = breaks)
The only problem is that I'd like the scale to match the actual bars plotted. There are two options for doing that: one is to simply use the actual margins of the plotted bars (how?) and get "ugly" x-axis labels like 1.1754, 1.2985, etc. The other, which I prefer, is to control the actual bin margins used so that they match the breaks.
Log scale histograms are easier with ggplot than with base graphics. Try something like
library(ggplot2)
dfr <- data.frame(x = rlnorm(100, sdlog = 3))
ggplot(dfr, aes(x)) + geom_histogram() + scale_x_log10()
If you are desperate for base graphics, you need to plot a log-scale histogram without axes, then manually add the axes afterwards.
h <- hist(log10(dfr$x), axes = FALSE)
Axis(side = 2)
Axis(at = h$breaks, labels = 10^h$breaks, side = 1)
For completeness, the lattice solution would be
library(lattice)
histogram(~x, dfr, scales = list(x = list(log = TRUE)))
AN EXPLANATION OF WHY LOG VALUES ARE NEEDED IN THE BASE CASE:
If you plot the data with no log-transformation, then most of the data are clumped into bars at the left.
hist(dfr$x)
The hist function ignores the log argument (because it interferes with the calculation of breaks), so this doesn't work.
hist(dfr$x, log = "y")
Neither does this.
par(xlog = TRUE)
hist(dfr$x)
That means that we need to log transform the data before we draw the plot.
hist(log10(dfr$x))
Unfortunately, this messes up the axes, which brings us to the workaround above.
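If the bin edges should coincide with the axis labels, one option in base graphics is to pass the log-transformed edges as `breaks`. A minimal sketch with made-up data; the values in `edges` are purely illustrative and have to cover the full range of the data.

```r
# Made-up data: mostly in 1-10, with a tail up to 1000
x <- c(seq(1, 10, length.out = 900), seq(10, 1000, length.out = 100))

# Illustrative bin edges on the original scale (must span range(x))
edges <- c(1, 1.5, 2, 3, 5, 10, 100, 1000)

# Bin on the log10 scale so that each bar ends exactly at an edge
h <- hist(log10(x), breaks = log10(edges), plot = FALSE)

# The bins are unequal on the log scale, so densities are drawn by default;
# the x axis is labelled in the original units at the bin edges
plot(h, axes = FALSE, xlab = "x", main = "Histogram of x")
axis(side = 2)
axis(side = 1, at = log10(edges), labels = edges)
```

Because the ticks are placed at `log10(edges)`, every tick sits on a bar boundary and the labels stay "pretty".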
Using ggplot2 seems like the easiest option. If you want more control over your axes and your breaks, you can do something like the following:
EDIT : new code provided
x <- c(rexp(1000,0.5)+0.5,rexp(100,0.5)*100)
breaks<- c(0,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000,10000)
major <- c(0.1,1,10,100,1000,10000)
H <- hist(log10(x),plot=F)
plot(H$mids,H$counts,type="n",
xaxt="n",
xlab="X",ylab="Counts",
main="Histogram of X",
bg="lightgrey"
)
abline(v=log10(breaks),col="lightgrey",lty=2)
abline(v=log10(major),col="lightgrey")
abline(h=pretty(H$counts),col="lightgrey")
plot(H,add=T,freq=T,col="blue")
#Position of ticks
at <- log10(breaks)
#Creation X axis
axis(1,at=at,labels=10^at)
This is as close as I can get to the ggplot2. Putting the background grey is not that straightforward, but doable if you define a rectangle with the size of your plot screen and put the background as grey.
Check all the functions I used, and also ?par. It will allow you to build your own graphs. Hope this helps.
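The grey background mentioned above could be drawn roughly as follows. This is a sketch under assumptions, not the code from the answer: it uses made-up data, takes the plot region from `par("usr")` and fills it with `rect()` before the bars are added.

```r
set.seed(42)  # made-up data, similar in spirit to the example above
x <- c(rexp(1000, 0.5) + 0.5, rexp(100, 0.5) * 100)
H <- hist(log10(x), plot = FALSE)

# Set up an empty plot region with suitable limits
plot(H$mids, H$counts, type = "n", xaxt = "n",
     xlim = range(H$breaks), ylim = c(0, max(H$counts)),
     xlab = "X", ylab = "Counts", main = "Histogram of X")

# par("usr") holds the plot region as c(x1, x2, y1, y2)
usr <- par("usr")
rect(usr[1], usr[3], usr[2], usr[4], col = "lightgrey", border = NA)

# Draw the bars on top of the grey rectangle, then a log-scale axis
plot(H, add = TRUE, freq = TRUE, col = "blue")
ticks <- pretty(H$breaks)
axis(1, at = ticks, labels = 10^ticks)
box()
```

Grid lines drawn with abline() would go between the rect() call and the bars so that they appear on top of the grey area.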
A dynamic graph would also help with this plot. Use the manipulate package from RStudio to draw a histogram over a dynamically selected range:
library(manipulate)
data_dist <- table(data)
manipulate(barplot(data_dist[x:y]), x = slider(1,length(data_dist)), y = slider(10, length(data_dist)))
Then you will be able to use sliders to see the particular distribution in a dynamically selected range like this:
I created a simple Dotplot() using this data:
d <- data.frame(emot=rep(c("happy","angry"),each=2),
exp=rep(c("exp","non-exp"),2), accuracy=c(0.477,0.587,0.659,0.736),
Lo=c(0.4508,0.564,0.641,0.719), Hi=c(0.504,0.611,0.677,0.753))
and the code below:
library(Hmisc)
Dotplot(emot ~ Cbind(accuracy, Lo, Hi), groups=exp, data=d,
pch=c(1,16), aspect = "xy", par.settings = list(dot.line=list(col=0)))
What I want to do is to DECREASE the distance between y-axis ticks and decrease the distance between plot elements as well - so that happy/angry horizontal error lines will get closer to each other. I know I could probably achieve that by playing with scales=list(...) parameters (not sure how yet), but I would have to define labels again, etc. Is there a quicker way to do it? It seems like such a simple thing to solve, but I'm stuck.
Even though Hmisc::Dotplot is built on lattice, just adding a ylim argument seems to do the trick. You can work out the default scale because emot is treated as a factor whose two levels sit at the underlying values 1 and 2:
Dotplot(emot ~ Cbind(accuracy, Lo, Hi), groups=exp, data=d, ylim=c(0,3),
pch=c(1,16), aspect = "xy", par.settings = list(dot.line=list(col=0)))
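The choice of ylim=c(0,3) can be checked against the factor codes. A small base R sketch, assuming emot is converted to a factor with the default alphabetical level order:

```r
emot <- factor(rep(c("happy", "angry"), each = 2))
levels(emot)                 # levels are sorted alphabetically

codes <- as.integer(emot)    # underlying integer codes of the levels
range(codes)                 # positions of the two rows on the y axis

# Pad the limits by one unit on each side of the codes
ylim <- range(codes) + c(-1, 1)
ylim
```

Widening the limits further than this leaves more empty space above and below, which pushes the two rows closer together on the page.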
I want to draw a barplot of some data with x-axis labels, but I keep running into the same problem: the axis scaling is completely off, so my labels end up wrongly positioned below the bars.
The most simple example I can think of:
x = c(1:81)
barplot(x)
axis(side=1,at=c(0,20,40,60,80),labels=c(20,40,60,80,100))
As you can see, the x-axis does not stretch along the whole plot but stops somewhere in between. The problem seems quite simple, but somehow I am not able to fix it and I could not find any solution so far :(
Any help is greatly appreciated.
The problem is that barplot is really designed for plotting categorical, not numeric data, and as such it pretty much does its own thing in terms of setting up the horizontal axis scale. The main way to get around this is to recover the actual x-positions of the bar midpoints by saving the results of barplot to a variable, but as you can see below I haven't come up with an elegant way of doing what you want in base graphics. Maybe someone else can do better.
x = c(1:81)
b <- barplot(x)
## axis(side=1,at=c(0,20,40,60,80),labels=c(20,40,60,80,100))
head(b)
You can see here that the actual midpoint locations are 0.7, 1.9, 3.1, ... -- not 1, 2, 3 ...
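Those midpoints follow from barplot's defaults for a plain vector (width = 1, space = 0.2). A short sketch that checks this; `bar_pos` is a hypothetical helper for illustration only:

```r
x <- 1:81
b <- barplot(x, plot = FALSE)   # midpoints only, nothing is drawn

# With the defaults (width = 1, space = 0.2) bar i is centred at
# 0.2 + 0.5 + (i - 1) * 1.2
expected <- 0.7 + 1.2 * (seq_along(x) - 1)
all.equal(as.numeric(b), expected)

# Hypothetical helper: bar-axis coordinate of the bar for data value v
bar_pos <- function(v) 0.7 + 1.2 * (v - 1)
bar_pos(c(20, 40, 60, 80))
```

The helper only holds for the default width and space; with other settings it is safer to index the saved midpoints directly.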
This is pretty quick, if you don't want to extend the axis from 0 to 100:
b <- barplot(x)
axis(side=1,at=b[c(20,40,60,80)],labels=seq(20,80,by=20))
This is my best shot at doing it in base graphics:
b <- barplot(x,xlim=c(0,120))
bdiff <- diff(b)[1]
axis(side=1,at=c(b[1]-bdiff,b[c(20,40,60,80)],b[81]+19*bdiff),
labels=seq(0,100,by=20))
You can try this, but the bars aren't as pretty:
plot(x,type="h",lwd=4,col="gray",xlim=c(0,100))
Or in ggplot:
library(ggplot2)
d <- data.frame(x=1:81)
ggplot(d,aes(x=x,y=x))+geom_bar(stat="identity",fill="lightblue",
colour="gray")+xlim(c(0,100))
Most statistical graphics nerds will tell you that graphing quantitative (x,y) data is better done with points or lines rather than bars (non-data-ink, Tufte, blah blah blah :-) )
Not sure exactly what you want, but if it is to have the labels running from one end to the other, evenly placed (but not necessarily accurately), then:
x = c(1:81)
bp <- barplot(x)
axis(side=1,at=bp[1+c(0,20,40,60,80)],labels=c(20,40,60,80,100))
The puzzle for me was why you wanted to label "20" at 0. But this is one way to do it.
I ran into the same annoying property of barplots: the x coordinates go wild. Another way to show the problem is to add more lines to the plot.
x = c(1:81)
barplot(x)
axis(side=1,at=c(0,20,40,60,80),labels=c(20,40,60,80,100))
lines(c(81,81), c(0, 100)) # this should cross the last bar, but it does not
The best I came up with was to define a new barplot function that also takes a parameter "at" for the plotting positions of the bars.
barplot_xscaled <- function(bar_heights, at = NULL, width = 0.5, col = 'grey'){
# 'width' is the half-width of each bar; 'at' defaults to 1, 2, 3, ...
if ( is.null(at) ){
at <- seq_along(bar_heights)
}
plot(bar_heights, type="n", xlab="", ylab="",
ylim=c(0, max(bar_heights)), xlim=range(at), bty = 'n')
for ( i in 1:length(bar_heights)){
rect(at[i] - width, 0, at[i] + width, bar_heights[i], col = col)
}
}
barplot_xscaled(x)
lines(c(81, 81), c(0, 100))
The lines command crosses the last bar - the x scale works just as naively expected, but you could also now define whatever positions of the bars you would like (you could play more with the function a bit to have the same properties as other R plotting functions).