ggplot2: Overlay density plots R

I want to overlay a few density plots in R. I know there are a few ways to do that, but they don't work for me for one reason or another (the 'sm' library doesn't install, and I'm too new to R to understand most of the code). I also tried plot and par, but I would like to use qplot since it has more configuration options.
I have data saved in this form
library(ggplot2)
x <- read.csv("clipboard", sep="\t", header=FALSE)
x
V1 V2 V3
1 34 23 24
2 32 12 32
and I would like to create 3 overlaid plots with the values from V1, V2 and V3, either using tones of grey for the fill or using dotted lines or something similar, with a legend. Can you guys help me?
Thank you!

Generally, for ggplot and multiple variables you need to convert from wide to long format. I think it can be done without, but that is the way the package is meant to work.
Here is the solution. I generated some data (3 normal distributions centered around different points). I also did some histograms and boxplots in case you want those. The alpha parameter controls the degree of transparency of the fill; if you use color instead of fill you get only outlines.
x <- data.frame(v1=rnorm(100),v2=rnorm(100,1,1),v3=rnorm(100,0,2))
library(ggplot2);library(reshape2)
data<- melt(x)
ggplot(data,aes(x=value, fill=variable)) + geom_density(alpha=0.25)
ggplot(data,aes(x=value, fill=variable)) + geom_histogram(alpha=0.25)
ggplot(data,aes(x=variable, y=value, fill=variable)) + geom_boxplot()

For the sake of completeness, the most basic way to overlay plots based on a factor is:
ggplot(data, aes(x=value)) + geom_density(aes(group=factor))
But as @user1617979 mentioned, aes(color=factor) and aes(fill=factor) are probably more useful in practice.
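The question asked for grey fills or dotted lines with a legend, so here is a minimal sketch of that, assuming the data has been melted to long format as in the answer above. The simulated columns, the grey range and the legend title "Column" are illustrative choices, not something from the original posts.

```r
library(ggplot2)
library(reshape2)

# Simulated stand-in for the asker's V1/V2/V3 columns (assumption)
set.seed(42)
wide <- data.frame(V1 = rnorm(100), V2 = rnorm(100, 1, 1), V3 = rnorm(100, 0, 2))

# Wide -> long: one row per observation, 'variable' holds the column name
long <- melt(wide, measure.vars = c("V1", "V2", "V3"))

# Map both fill and linetype to the grouping column; giving both scales
# the same title lets ggplot2 merge them into a single legend
p <- ggplot(long, aes(x = value, fill = variable, linetype = variable)) +
  geom_density(alpha = 0.4, colour = "black") +
  scale_fill_grey(start = 0.2, end = 0.8) +
  labs(fill = "Column", linetype = "Column")
print(p)
```

Dropping the fill mapping and keeping only linetype gives plain outlines that differ by dash pattern.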

Some people have asked if you can do this when the distributions are of different lengths. The answer is yes, just use a list instead of a data frame.
library(ggplot2)
library(reshape2)
x <- list(v1=rnorm(100),v2=rnorm(50,1,1),v3=rnorm(75,0,2))
data<- melt(x)
ggplot(data,aes(x=value, fill=L1)) + geom_density(alpha=0.25)
ggplot(data,aes(x=value, fill=L1)) + geom_histogram(alpha=0.25)
ggplot(data,aes(x=L1, y=value, fill=L1)) + geom_boxplot()

Related

ggplot: How to increase space between axis labels for categorical data?

I love ggplot, but find it hard to customize some elements such as X axis labels and grid lines. The title of the question says it all, but here's a reproducible example to go with it:
Reproducible example
library(ggplot2)
library(data.table) # provides data.table() and its melt() method
# Make a dataset
set.seed(123)
x1 <- c('2015_46','2015_47','2015_48','2015_49'
,'2015_50','2015_51','2015_52','2016_01',
'2016_02','2016_03')
y1 <- runif(10,0.0,1.0)
y2 <- runif(10,0.5,2.0)
# Make the dataset ggplot friendly
df_wide <- data.table(x1, y1, y2)
df_long <- melt(df_wide, id = 'x1')
# Plot it
p <- ggplot(df_long, aes(x=x1,
y=value,
group=variable,
colour=variable )) + geom_line(size=1)
plot(p)
# Now, plot the same thing with the same lines and numbers,
# but with increased space between x-axis labels
# and / or space between x-axis grid lines.
Plot1
The plot looks like this, and doesn't look too bad in its current form:
Plot2
The problem occurs when the dataset gets bigger, and the labels on the x-axis start overlapping each other like this:
What I've tried so far:
I've made several attempts using scale_x_discrete as suggested here, but I've had no luck so far. What really bugs me is that I saw some tutorial about these things a while back, but despite two days of intense googling I just can't find it. I'm going to update this section when I try new things.
I'm looking forward to your suggestions!
As mentioned above, assuming that x1 represents a year_day, ggplot provides sensible defaults for date scales.
First make x1 into a valid date format, then plot as you already did:
df_long$x1 <- as.Date(as.character(df_long$x1), format="%Y_%j") # Date class, so ggplot2 picks a date scale
ggplot(df_long, aes(x=x1, y=value, group=variable, colour=variable)) +
geom_line(size=1)
The plot looks a little odd because of the disconnected time series, but scale_x_date() provides an easy way to customize the axis:
http://docs.ggplot2.org/current/scale_date.html
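As a rough sketch of that customization, assuming (as the answer does) that x1 encodes year_day and has been converted to Date: the break interval, label format and rotation below are arbitrary illustrative values, and the small data frame is a stand-in for df_long.

```r
library(ggplot2)

# Stand-in for the question's data (assumption: x1 is year_dayOfYear)
set.seed(123)
x1 <- c("2015_46", "2015_47", "2015_48", "2015_49", "2015_50",
        "2015_51", "2015_52", "2016_01", "2016_02", "2016_03")
df <- data.frame(x1 = as.Date(x1, format = "%Y_%j"), value = runif(10))

# date_breaks controls how far apart the labels are,
# date_labels controls how each label is printed
p <- ggplot(df, aes(x = x1, y = value)) +
  geom_line() +
  scale_x_date(date_breaks = "3 months", date_labels = "%b %Y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```

Widening date_breaks is what adds space between labels as the dataset grows; rotating the text is a fallback when the labels still collide.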

5 dimensional plot in r

I am trying to plot a 5 dimensional plot in R. I am currently using the rgl package to plot my data in 4 dimensions, using 3 variables as the x, y, z coordinates and another variable as the color. I am wondering if I can add a fifth variable using this package, for example as the size or the shape of the points in the space. Here's an example of my data, and my current code:
set.seed(1)
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
colnames(df) <- c("var1","var2","var3","var4","var5")
require(rgl)
plot3d(df$var1, df$var2, df$var3, col=as.numeric(df$var4), size=0.5, type='s',xlab="var1",ylab="var2",zlab="var3")
I hope it is possible to do the 5th dimension.
Many thanks,
Here is a ggplot2 option. I usually shy away from 3D plots as they are hard to interpret properly. I also almost never put in 5 continuous variables in the same plot as I have here...
ggplot(df, aes(x=var1, y=var2, fill=var3, color=var4, size=var5^2)) +
geom_point(shape=21) +
scale_color_gradient(low="red", high="green") +
scale_size_continuous(range=c(1,12))
While this is a bit messy, you can actually reasonably read all 5 dimensions for most points.
A better approach to multi-dimensional plotting opens up if some of your variables are categorical. If all your variables are continuous, you can turn some of them to categorical with cut and then use facet_wrap or facet_grid to plot those.
For example, here I break up var3 and var4 into quintiles and use facet_grid on them. Note that I also keep the color aesthetics as well to highlight that most of the time turning a continuous variable to categorical in high dimensional plots is good enough to get the key points across (here you'll notice that the fill and border colors are pretty uniform within any given grid cell):
df$var4.cat <- cut(df$var4, quantile(df$var4, (0:5)/5), include.lowest=T)
df$var3.cat <- cut(df$var3, quantile(df$var3, (0:5)/5), include.lowest=T)
ggplot(df, aes(x=var1, y=var2, fill=var3, color=var4, size=var5^2)) +
geom_point(shape=21) +
scale_color_gradient(low="red", high="green") +
scale_size_continuous(range=c(1,12)) +
facet_grid(var3.cat ~ var4.cat)
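To see what the cut/quantile step does on its own, here is a small sketch on one simulated variable. The vector v is a stand-in for var3 or var4, generated the same way as in the question.

```r
# Simulated stand-in for one continuous column (assumption)
set.seed(1)
v <- sample(1:200, 1000, replace = TRUE)

# Six break points at the 0%, 20%, ..., 100% quantiles define five bins
breaks <- quantile(v, probs = (0:5) / 5)

# include.lowest = TRUE keeps the minimum value inside the first bin
# instead of turning it into NA
v_cat <- cut(v, breaks = breaks, include.lowest = TRUE)

# Each bin holds roughly a fifth of the observations
print(table(v_cat))
```

The resulting factor is what facet_grid uses to split the panels, so the number of quantile groups directly sets the number of rows or columns in the grid.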

Plot multiple histograms in one using ggplot2 in R

I am fairly new to R and ggplot2 and am having some trouble plotting multiple variables in the same histogram plot.
My data is already grouped and just needs to be plotted. The data is by week and I need to plot the number for each category (A, B, C and D).
Date A B C D
01-01-2011 11 0 11 1
08-01-2011 12 0 3 3
15-01-2011 9 0 2 6
I want the Dates as the x axis and the counts plotted as different colors according to a generic y axis.
I am able to plot just one of the categories at a time, but am not able to find an example like mine.
This is what I use to plot one category. I am pretty sure I need to use position="dodge" to plot multiple as I don't want it to be stacked.
ggplot(df, aes(x=Date, y=A)) + geom_histogram(stat="identity") +
labs(title = "Number in Category A") +
ylab("Number") +
xlab("Date") +
theme(axis.text.x = element_text(angle = 90))
Also, this gives me a histogram with spaces in between the bars. Is there any way to remove this? I tried spaces=0 as you would do when plotting bar graphs, but it didn't seem to work.
I read some previous questions similar to mine, but the data was in a different format and I couldn't adapt it to fit my data.
This is some of the help I looked at:
Creating a histogram with multiple data series using multhist in R
http://www.cookbook-r.com/Graphs/Plotting_distributions_%28ggplot2%29/
I'm also not quite sure what the bin width is. I think it is how the data should be spaced or grouped, which doesn't apply to my question since it is already grouped. Please advise me if I am wrong about this.
Any help would be appreciated.
Thanks in advance!
You're not really plotting histograms, you're just plotting a bar chart that looks kind of like a histogram. I personally think this is a good case for faceting:
library(ggplot2)
library(reshape2) # for melt()
melt_df <- melt(df)
head(melt_df) # so you can see it
ggplot(melt_df, aes(Date,value,fill=Date)) +
geom_bar(stat="identity") + # the counts are already computed, so plot them as-is
facet_wrap(~ variable)
However, I think in general, that changes over time are much better represented by a line chart:
ggplot(melt_df,aes(Date,value,group=variable,color=variable)) + geom_line()
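For the side-by-side bars the question asked about, a sketch along these lines may help. It rebuilds the three sample rows from the question as a data frame; geom_col() is used here as the ggplot2 shorthand for bars whose heights are given directly, which is a substitution for the geom_histogram(stat="identity") call in the question.

```r
library(ggplot2)
library(reshape2)

# The sample rows from the question, typed in by hand
df <- data.frame(Date = c("01-01-2011", "08-01-2011", "15-01-2011"),
                 A = c(11, 12, 9), B = c(0, 0, 0),
                 C = c(11, 3, 2), D = c(1, 3, 6))

# Long format: one row per Date/category pair
melt_df <- melt(df, id.vars = "Date")

# position = "dodge" places the categories next to each other
# instead of stacking them
p <- ggplot(melt_df, aes(x = Date, y = value, fill = variable)) +
  geom_col(position = "dodge") +
  labs(title = "Number per category", x = "Date", y = "Number", fill = "Category") +
  theme(axis.text.x = element_text(angle = 90))
print(p)
```

Bin width does not apply here because nothing is being binned: each bar is one precomputed count.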

R: Plot multiple box plots using columns from data frame

I would like to plot an INDIVIDUAL box plot for each unrelated column in a data frame. I thought I was on the right track with boxplot.matrix from the sfsmisc package, but it seems to do the same as boxplot(as.matrix(plotdata)), which is to plot everything in a shared boxplot with a shared scale on the axis. I want (say) 5 individual plots.
I could do this by hand like:
par(mfrow=c(2,2))
boxplot(data$var1)
boxplot(data$var2)
boxplot(data$var3)
boxplot(data$var4)
But there must be a way to use the data frame columns?
EDIT: I used iterations, see my answer.
You could use the reshape package to simplify things
data <- data.frame(v1=rnorm(100),v2=rnorm(100),v3=rnorm(100), v4=rnorm(100))
library(reshape)
meltData <- melt(data)
boxplot(data=meltData, value~variable)
or even then use ggplot2 package to make things nicer
library(ggplot2)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales="free")
From ?boxplot we see that we have the option to pass multiple vectors of data as elements of a list, and we will get multiple boxplots, one for each vector in our list.
So all we need to do is convert the columns of our matrix to a list:
m <- matrix(1:25,5,5)
boxplot(x = as.list(as.data.frame(m)))
If you really want separate panels each with a single boxplot (although, frankly, I don't see why you would want to do that), I would instead turn to ggplot and faceting:
library(reshape2) # for melt()
m1 <- melt(as.data.frame(m))
library(ggplot2)
ggplot(m1,aes(x = variable,y = value)) + facet_wrap(~variable) + geom_boxplot()
I used iteration to do this. I think perhaps I wasn't clear in the original question. Thanks for the responses nonetheless.
par(mfrow=c(2,5))
for (i in 1:length(plotdata)) {
boxplot(plotdata[,i], main=names(plotdata[i]), type="l")
}
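A slightly more defensive variant of the same loop, as a sketch: it iterates over column names, sizes the panel layout from the number of columns, and restores the graphics settings afterwards. The data frame plotdata is simulated here and is only a stand-in for the asker's data.

```r
# Simulated stand-in for the asker's data frame (assumption)
set.seed(1)
plotdata <- data.frame(a = rnorm(50), b = runif(50), c = rexp(50),
                       d = rnorm(50, 10, 3), e = rpois(50, 4))

# One panel per column, all in a single row; keep the old settings
old_par <- par(mfrow = c(1, ncol(plotdata)))

# boxplot() returns its summary statistics invisibly; keep them by name
res <- list()
for (nm in names(plotdata)) {
  res[[nm]] <- boxplot(plotdata[[nm]], main = nm)
}

# Restore the previous layout so later plots are unaffected
par(old_par)
```

Each panel gets its own y-axis scale, which is the point of drawing the columns separately.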

plotting two vectors of data on a GGPLOT2 scatter plot using R

I've been experimenting with both ggplot2 and lattice to graph panels of data. I'm having a little trouble wrapping my mind around the ggplot2 model. In particular, how do I plot a scatter plot with two sets of data on each panel:
In lattice I could do this:
xyplot(Predicted_value + Actual_value ~ x_value | State_CD, data=dd)
and that would give me a panel for each State_CD with both columns plotted.
I can do one column with ggplot2:
pg <- ggplot(dd, aes(x_value, Predicted_value)) + geom_point(shape = 2) +
facet_wrap(~ State_CD) + theme(aspect.ratio = 1)
print(pg)
What I can't grok is how to add Actual_value to the ggplot above.
EDIT Hadley pointed out that this really would be easier with a reproducible example. Here's code that seems to work. Is there a better or more concise way to do this with ggplot? Why is the syntax for adding another set of points to ggplot so different from adding the first set of data?
library(lattice)
library(ggplot2)
#make some example data
dd<-data.frame(matrix(rnorm(216),72,3),c(rep("A",24),rep("B",24),rep("C",24)))
colnames(dd) <- c("Predicted_value", "Actual_value", "x_value", "State_CD")
#plot with lattice
xyplot(Predicted_value + Actual_value ~ x_value | State_CD, data=dd)
#plot with ggplot
pg <- ggplot(dd, aes(x_value, Predicted_value)) + geom_point(shape = 2) + facet_wrap(~ State_CD) + theme(aspect.ratio = 1)
print(pg)
pg + geom_point(data=dd,aes(x_value, Actual_value,group=State_CD), colour="green")
The lattice output looks like this:
(source: cerebralmastication.com)
and ggplot looks like this:
(source: cerebralmastication.com)
Just following up on what Ian suggested: for ggplot2 you really want all the y-axis stuff in one column with another column as a factor indicating how you want to decorate it. It is easy to do this with melt. To wit:
qplot(x_value, value,
data = melt(dd, measure.vars=c("Predicted_value", "Actual_value")),
colour=variable) + facet_wrap(~State_CD)
Here's what it looks like for me:
(source: princeton.edu)
To get an idea of what melt is actually doing, here's the head:
> head(melt(dd, measure.vars=c("Predicted_value", "Actual_value")))
x_value State_CD variable value
1 1.2898779 A Predicted_value 1.0913712
2 0.1077710 A Predicted_value -2.2337188
3 -0.9430190 A Predicted_value 1.1409515
4 0.3698614 A Predicted_value -1.8260033
5 -0.3949606 A Predicted_value -0.3102753
6 -0.1275037 A Predicted_value -1.2945864
You see, it "melts" Predicted_value and Actual_value into one column called value and adds another column called variable letting you know what column it originally came from.
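The same reshaping can also be sketched with tidyr's pivot_longer(), which plays the role of melt() here. The example data is rebuilt the same way as in the question, with 72 simulated rows; the names_to and values_to values are chosen to mirror melt's default column names.

```r
library(ggplot2)
library(tidyr)

# Example data in the shape used in the question (simulated)
set.seed(1)
dd <- data.frame(matrix(rnorm(216), 72, 3),
                 c(rep("A", 24), rep("B", 24), rep("C", 24)))
colnames(dd) <- c("Predicted_value", "Actual_value", "x_value", "State_CD")

# Stack the two y columns into one 'value' column, with 'variable'
# recording which original column each row came from
long <- pivot_longer(dd,
                     cols = c("Predicted_value", "Actual_value"),
                     names_to = "variable",
                     values_to = "value")

# One layer is now enough: colour distinguishes predicted from actual
p <- ggplot(long, aes(x = x_value, y = value, colour = variable)) +
  geom_point() +
  facet_wrap(~ State_CD)
print(p)
```

With more than two measured columns, only the cols argument changes; the plotting code stays the same.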
Update: several years on now, I almost always use Jonathan's method (via the tidyr package) with ggplot2. My answer below works in a pinch, but gets tedious fast when you have 3+ variables.
I'm sure Hadley will have a better answer, but - the syntax is different because the ggplot(dd,aes()) syntax is (I think) primarily intended for plotting just one variable. For two, I would use:
ggplot() +
geom_point(data=dd, aes(x_value, Actual_value, group=State_CD), colour="green") +
geom_point(data=dd, aes(x_value, Predicted_value, group=State_CD), shape = 2) +
facet_wrap(~ State_CD) +
theme(aspect.ratio = 1)
Pulling the first set of points out of the ggplot() gives it the same syntax as the second. I find this easier to deal with because the syntax is the same and it emphasizes the "Grammar of Graphics" that is at the core of ggplot2.
You might just want to change the form of your data a little bit, so that you have one y-axis variable, with an additional factor variable indicating whether it is a predicted or actual value.
Is this something like what you are trying to do?
dd<-data.frame(type=rep(c("Predicted_value","Actual_value"),20),y_value=rnorm(40),
x_value=rnorm(40),State_CD=rnorm(40)>0)
qplot(x_value,y_value,data=dd,colour=type,facets=.~State_CD)
Well, after posting the question I ran across this R Help thread that may have helped me. It looks like I can do this:
pg + geom_line(data=dd,aes(x_value, Actual_value,group=State_CD), colour="green")
Is that a good way of doing things? It seems odd to me because adding the second item has a totally different syntax than the first.
