How to set y=rows and x=columns in ggplot2? - r

Before I get to my question, I should point out that I am new to R, and this question might be simplicity itself for an experienced user.
I want to use ggplot2 to take full advantage of all the functionalities therein. However, I have encountered a problem that I have not been able to solve.
If I have a data frame as follows:
df = as.data.frame(cbind(rnorm(100,35:65),rnorm(100,25:35),rnorm(100,15:20),rnorm(100,5:10),rnorm(100,0:5)))
header = c("A","B","C","D","E")
names(df) = make.names(header)
Plotting the data with the rows as Y and the columns as X can readily be done in base R, e.g. like this:
par(mfrow=c(2,1))
stripchart(df, vertical = TRUE, method = 'jitter')
boxplot(df)
The picture shows the stripchart & boxplot of the data
However, the same cannot readily be done in ggplot2, as x and y inputs are required. All the examples I have found plot one column against another column, or process the data into the column format. Yet, I want to set y as the rows in my df and x as the columns. How can this be accomplished?

You'll need to reshape your data in order to get those graphs. I think this is what you're looking for:
> library(ggplot2)
> library(reshape2)
> df = as.data.frame(cbind(rnorm(100,35:65),rnorm(100,25:35),rnorm(100,15:20),rnorm(100,5:10),rnorm(100,0:5)))
> header = c("A","B","C","D","E")
> names(df) = make.names(header)
> df = melt(df)
No id variables; using all as measure variables
> head(df)
variable value
1 A 36.75505
2 A 35.68714
3 A 36.44952
4 A 38.77236
5 A 39.79136
6 A 39.39672
> ggplot(df, aes(x = variable, y = value))
> ggplot(df, aes(x = variable, y = value)) + geom_boxplot()
> ggplot(df, aes(x = variable, y = value)) + geom_point(shape = 0, size = 20)
Here is the box plot:
Here is the strip chart:
You can change the settings in aes() options. See here for more info.
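As a rough sketch of the same reshaping idea without reshape2, base R's stack() also produces a long data frame, and geom_jitter() gives a jittered strip chart closer to stripchart(method = 'jitter'). The data frame `wide`, the seed, and the jitter width below are illustrative assumptions, not part of the original answer.

```r
library(ggplot2)

set.seed(1)  # only to make the illustrative data reproducible
wide <- data.frame(A = rnorm(100, 50), B = rnorm(100, 30), C = rnorm(100, 10))

# stack() is base R: it returns a long data frame with columns 'values' and 'ind'
long <- stack(wide)

# geom_jitter() spreads the points horizontally within each column group
p <- ggplot(long, aes(x = ind, y = values)) +
  geom_jitter(width = 0.1, height = 0)
print(p)
```

Replacing geom_jitter() with geom_boxplot() on the same `long` data frame should give the box plot version.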

Related

Plotting two overlapping density curves using ggplot

I have a dataframe in R consisting of 104 columns, appearing as so:
id vcr1 vcr2 vcr3 sim_vcr1 sim_vcr2 sim_vcr3 sim_vcr4 sim_vcr5 sim_vcr6 sim_vcr7
1 2913 -4.782992840 1.7631999 0.003768704 1.376937 -2.096857 6.903021 7.018855 6.135139 3.188382 6.905323
2 1260 0.003768704 3.1577108 -0.758378208 1.376937 -2.096857 6.903021 7.018855 6.135139 3.188382 6.905323
3 2912 -4.782992840 1.7631999 0.003768704 1.376937 -2.096857 6.903021 7.018855 6.135139 3.188382 6.905323
4 2914 -1.311132669 0.8220594 2.372950077 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
5 2915 -1.311132669 0.8220594 2.372950077 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
6 1261 2.372950077 -0.7022792 -4.951318264 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
The "sim_vcr*" variables go all the way through sim_vcr100
I need two overlapping density curves contained within one plot, looking something like this (except here you see 5 instead of 2):
I need one of the density curves to consist of all values contained in columns vcr1, vcr2, and vcr3, and I need another density curve containing all values in all of the sim_vcr* columns (so 100 columns, sim_vcr1-sim_vcr100)
Because the two curves overlap, they need to be transparent, like in the attached image. I know that there is a pretty straightforward way to do this using the ggplot command, but I am having trouble with the syntax, as well as getting my data frame oriented correctly so that each histogram pulls from the proper columns.
Any help is much appreciated.
With df being the data you mentioned in your post, you can try this:
Separate the data into two data frames with the following code, then plot:
library(tidyverse)
library(gdata)
#Index
i1 <- which(startsWith(names(df),pattern = 'vcr'))
i2 <- which(startsWith(names(df),pattern = 'sim'))
#Isolate
df1 <- df[,c(1,i1)]
df2 <- df[,c(1,i2)]
#Melt
M1 <- pivot_longer(df1,cols = names(df1)[-1])
M2 <- pivot_longer(df2,cols = names(df2)[-1])
#Plot 1
ggplot(M1) + geom_density(aes(x=value,fill=name), alpha=.5)
#Plot 2
ggplot(M2) + geom_density(aes(x=value,fill=name), alpha=.5)
Update
Use the following code for a single plot:
#Unique plot
#Melt
M <- pivot_longer(df,cols = names(df)[-1])
#Mutate
M$var <- ifelse(startsWith(M$name,'vcr'),'vcr','sim_vcr')
#Plot 3
ggplot(M) + geom_density(aes(x=value,fill=var), alpha=.5)
Using the tidyr package, you can first convert your data to long format with the function pivot_longer (the %<>% assignment pipe comes from magrittr):
df %<>% pivot_longer(cols = c(starts_with('vcr'), starts_with('sim_vcr')),
names_to = c('type'),
values_to = c('values'))
After that, you can use the filter function (from dplyr; str_detect is from stringr) to create separate plots for each value type.
For vcr columns:
df %>%
filter(str_detect(type, '^vcr')) %>%
ggplot(.) +
geom_density(aes(x = values, fill = type), alpha = 0.5)
The above produces the following plot:
for sim_vcr columns:
df %>%
filter(str_detect(type, '^sim_vcr')) %>%
ggplot(.) +
geom_density(aes(x = values, fill = type), alpha = 0.5)
The above code produces the following plot:
Another simple way to subset and prepare your data for ggplot is with gather() from tidyr, which you can read more about. Here's how I do it, with df being the data frame you provided.
# Load tidyr to use gather()
library(tidyr)
#Split off the first three value columns (vcr1 to vcr3) on their own and gather them
df_vcr <- gather(data = df[,2:4])
#Gather the other columns in the dataframe
df_sim<- gather(data = df[,-c(1:4)])
#Plot the first
ggplot() +
geom_density(data = df_vcr,
mapping = aes(value, group = key, color = key, fill = key),
alpha = 0.5)
#Plot the second
ggplot() +
geom_density(data = df_sim,
mapping = aes(value, group = key, color = key, fill = key),
alpha = 0.5)
However I am a little unclear on what you mean by "all values in all of the sim_vcr* columns". Perhaps you want all of those values in one density curve? To do this, simply do not give ggplot any grouping info in the second case.
ggplot() + geom_density(data = df_sim,
mapping = aes(value),
fill = "grey50",
alpha = 0.5)
Notice that here I can still specify the 'fill' for the curve outside of the aes() function, and it will apply to all curves instead of giving each group specified in 'key' a different color.

Reminding R that integer is a factor when producing Boxplot

Good afternoon,
This is my 1st question here and every attempt is being made to be thorough.
I am working with a large data set (casualtiesdf) in R and I am trying to produce a boxplot, using ggplot2, of the variable Age_of_Casualty by the Casualty_Severity variable. The problem is that R thinks that the Casualty_Severity variable is an integer. Casualty_Severity is coded in the data as the numbers 1, 2, 3.
Below you can see that I've tried to rename each integer to the named category to which it corresponds and then converted the variable into a factor.
casualtiesdf$Casualty_Severity[casualtiesdf$Casualty_Severity == 1] <- "Fatal"
casualtiesdf$Casualty_Severity[casualtiesdf$Casualty_Severity == 2] <- "Serious"
casualtiesdf$Casualty_Severity[casualtiesdf$Casualty_Severity == 3] <- "Slight"
casualtiesdf$Casualty_Severity <- as.factor(casualtiesdf$Casualty_Severity)
When I try doing the Boxplot, however...
> ggplot(data = casualtiesdf, aes(x = Age_of_Casualty,
+ y = casualtiesdf$Casualty_Severity)) +
+ geom_boxplot()
I get: "Warning message:position_dodge requires non-overlapping x intervals"
I typed this message into Google and Stack Overflow seems to advise putting the categorical variable on the x axis (yes, I'm still very confused with my x's and y's...) so I tried:
ggplot(data = casualtiesdf, aes(x = Casualtiesdf$Casualty_Severity,
y = Age_of_Casualty +
geom_boxplot()
and get error message "Error: object 'Age_of_Casualty' not found"
I then went for thinking that maybe I have to put the as.factor in the plot code:
ggplot(data = casualtiesdf, aes(x = casualtiesdf$Casualty_Severity
as.factor(casualtiesdf$Casualty_Severity))) y = Age_of_Casualty) +
geom_boxplot()
and get error message "unexpected symbol in: geom_boxplot() ggplot"
Any help with this is greatly appreciated!
Is Age_of_Casualty also part of the data frame? If not, you might consider a merge or a separate assignment to create an Age_of_Casualty column in the df as well.
I created a dummy dataframe, with two variables
casualtiesdf <- data.frame(Casualty_Severity=c(1,2,1,1,2,3,1,3),
Age_of_Casualty = c(31,32,32,33,33,33,35,35))
I then created another variable to store Casualty_Severity as a factor
casualtiesdf$Casualty_Severity_factor <- factor(x = casualtiesdf$Casualty_Severity,
levels = c(1,2,3),
labels = c("Fatal","Serious","Slight"))
With that, I can then do the box plot, with the casualty_severity as X-axis
library("ggplot2")
ggplot(data = casualtiesdf,
aes(x= Casualty_Severity_factor, y = Age_of_Casualty)) +
geom_boxplot()
This should give you some plot like this
So it's expected that in your third example R reports a syntax error: unexpected symbol in: geom_boxplot() means "I have no idea what to do with that ...))) y = business".
In your first example, Age_of_Casualty is mistakenly assigned to X. This is really the variable whose distribution you want to analyze, so it should be the Y variable.
So you're right, you need to establish Casualty_Severity as a Factor and make sure to ascribe the two variables to X and Y correctly. Something like this:
# Creating dummy data
AC.rand <- sample(15:90, 500, replace = TRUE)
CS.rand <- sample(1:3, 500, replace = TRUE)
# Combine them into a dataframe, define the "Severity" variable as a Factor
casualtiesdf <- data.frame(Casualty_Severity = factor(CS.rand), Age_of_Casualty = AC.rand)
# Define the Levels for the "Severity" variable - not necessary
levels(casualtiesdf$Casualty_Severity)=c("Fatal", "Serious", "Slight")
g <- ggplot(data = casualtiesdf, aes(x = Casualty_Severity, y = Age_of_Casualty))
g <- g + geom_boxplot()
When I mocked up 500 rows of data I get something like:
I'm an SO noob, too, so let's learn together! :)
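A minimal sketch of a related shortcut, assuming the column is still stored as an integer: the conversion can also be done inside aes() with factor(), so the data frame itself stays unchanged. The data frame name `casualties_demo` and its values are made up for illustration.

```r
library(ggplot2)

# Illustrative data: severity stored as integer codes 1, 2, 3
casualties_demo <- data.frame(
  Casualty_Severity = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 3L),
  Age_of_Casualty   = c(31, 32, 32, 33, 33, 33, 35, 35)
)

# factor() inside aes() makes ggplot2 treat the codes as discrete groups;
# the categorical variable goes on x, the numeric variable on y
p <- ggplot(casualties_demo,
            aes(x = factor(Casualty_Severity), y = Age_of_Casualty)) +
  geom_boxplot() +
  labs(x = "Casualty severity")
print(p)
```

Note that columns are referenced by bare name inside aes(); prefixing them with casualtiesdf$ is not needed when the data argument is supplied.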

geom_line : How to connect only a few points

I have this dataframe and this plot :
df <- data.frame(Groupe = rep(c("A","B"),4),
Period = gl(4,2,8,c("t0","t1","t2","t3","t4")),
rate = c(0.83,0.96,0.75,0.93,0.67,0.82,0.65,0.73))
ggplot(data = df, mapping = aes(y = rate, x = Period ,group = Groupe, colour=Groupe, shape=Groupe)) +
geom_line(size=1.2) +
geom_point(size=5)
How could I organize my data so that the points between t1 and t2 are not connected with a line? I'd like t0 and t1 to be connected (blue or red according to the group), t2 and t3 connected in the same way, but no lines between t1 and t2. I tried several things by looking at similar questions, but they always mess up my grouping colors :/
Creating a new grouping variable manually is mostly not the best way. So, a slightly different approach which requires less hardcoding:
# create new grouping variable
df$grp <- c(1,2)[df$Period %in% c("t2","t3","t4") + 1L]
# create the plot and use the interaction between 'Groupe' and 'grp' as group
ggplot(df, aes(x = Period, y = rate,
group = interaction(Groupe,grp),
colour = Groupe,
shape = Groupe)) +
geom_line(size=1.2) +
geom_point(size=5)
This gives the same plot as in the other answer:
The best way to handle a problem like this in ggplot is often to create an additional column in your data frame that indicates the grouping you want to work with in your data. For example, here I've added an extra column gp to your data frame:
df$gp <- c(1,2,1,2,3,4,3,4)
ggplot(data = df, aes(y = rate, x = Period, group = gp, colour=Groupe, shape=Groupe)) +
geom_line(size=1.2) +
geom_point(size=5)
The result is, I believe, what you are looking for:
If you make Period a numerical column rather than a character vector or factor, you can more easily generate a column like gp automatically rather than manually specifying it (perhaps using ifelse or cases to create it) - this would be useful if you wanted to do the same thing many times or with a large data frame.
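One way this could look, as a sketch only: derive a numeric period index from the labels and build the grouping column with ifelse(). The column names `period_num`, `segment`, and the break point after t1 are assumptions chosen for this example.

```r
library(ggplot2)

df <- data.frame(Groupe = rep(c("A", "B"), 4),
                 Period = rep(c("t0", "t1", "t2", "t3"), each = 2),
                 rate = c(0.83, 0.96, 0.75, 0.93, 0.67, 0.82, 0.65, 0.73))

# Numeric period index: "t0" -> 0, "t1" -> 1, ...
df$period_num <- as.integer(sub("^t", "", df$Period))

# Assumed break point: periods up to 1 form the first line segment
df$segment <- ifelse(df$period_num <= 1, "first", "second")

# One group per (Groupe, segment) pair, so no line crosses from t1 to t2
df$gp <- paste(df$Groupe, df$segment, sep = ".")

p <- ggplot(df, aes(x = Period, y = rate, group = gp,
                    colour = Groupe, shape = Groupe)) +
  geom_line() +
  geom_point(size = 3)
print(p)
```

Because the colour and shape are still mapped to Groupe, the legend and colors stay the same while only the line connectivity changes.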

dataframe2delta: how to plot a delta function directly from the dataframe using ggplot2

I looked for an answer throughout the former threads, but with no luck.
I was wondering if it could be possible, given a data frame having a structure similar to this one
df <- data.frame(x = rep(1:100, times = 2 ),
y = c(rnorm(100), rnorm(100, 10)),
group = rep(c("a", "b"), each = 100))
to plot directly the difference, between the observations of the two groups, instead of plotting the two samples using different colours, which is what I'm able to do so far using ggplot2. Of course I know I could do that using the base plotting system by simply using
plot(df[df$group == "a",]$y - df[df$group == "b",]$y)
but doing so I waste all the cool features of ggplot2.
Thanks in advance!
EB
You could try something like this:
library(reshape2)
library(ggplot2)
df <- dcast(df, x~group, value.var='y')
df$dif = df$a-df$b
ggplot(df, aes(x, dif)) + geom_line()
Or if you use data.table here is how to do it:
library(data.table)
dt=data.table(df)
dt <- dcast(dt, x~group, value.var='y')
dt[,dif:=a-b]
ggplot(dt, aes(x, dif)) + geom_line()
How does this look?
Another possibility using dplyr is the following:
library(dplyr)
ggplot(df %>% group_by(x) %>% summarise(delta = diff(y)),
aes(x = x, y = delta)) + geom_line()
In this case you can avoid the dcast by using the function diff, which relies on the order of the groups within each x: diff(y) returns the second value minus the first (here b - a, the opposite sign of the dcast version). Otherwise you need to sort the factors or apply a dcast to your data frame. I am quite sure that you can do something very similar using data.table.
It's not completely solved, but it looks close to what I meant:
qplot( x = x,
y = diff,
data = dcast( data = df,
value.var = "y",
formula = x ~ "diff",
fun.aggregate = function( x ) x[1] - x[2] ) )
It's quite tricky and strongly depends on what you have in your group variable, but works.
An alternative was to mutate the output of dcast, but in my case the group column was filled in with TRUE and FALSE values. Thus, using mutate to obtain diff=TRUE-FALSE returned a column of 1s, which is not very useful.
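For comparison, here is a minimal sketch of the same reshape-then-subtract idea using tidyr's pivot_wider() instead of dcast. It assumes the group labels are "a" and "b" as in the question; the names `wide` and `delta` are illustrative.

```r
library(tidyr)
library(ggplot2)

set.seed(42)  # only to make the illustrative data reproducible
df <- data.frame(x = rep(1:100, times = 2),
                 y = c(rnorm(100), rnorm(100, 10)),
                 group = rep(c("a", "b"), each = 100))

# One row per x, with one column per group ("a" and "b")
wide <- pivot_wider(df, names_from = group, values_from = y)

# Explicit subtraction, so the sign does not depend on row order
wide$delta <- wide$a - wide$b

p <- ggplot(wide, aes(x = x, y = delta)) + geom_line()
print(p)
```

Naming the two columns explicitly in the subtraction avoids relying on the order of the rows within each x.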

Dynamically Set X limits on time plot

I am wondering how to dynamically set the x axis limits of a time series plot containing two time series with different dates. I have developed the following code to provide a reproducible example of my problem.
#Dummy Data
Data1 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1997","9/13/1998"), Area_2D = c(20,11,5,25,50))
Data2 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
Data3 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1996","9/13/1998"), Area_2D = c(20,25,28,30,35))
Data4 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
#Convert date column as date
Data1$Date <- as.Date(Data1$Date,"%m/%d/%Y")
Data2$Date <- as.Date(Data2$Date,"%m/%d/%Y")
Data3$Date <- as.Date(Data3$Date,"%m/%d/%Y")
Data4$Date <- as.Date(Data4$Date,"%m/%d/%Y")
#PLOT THE DATA
max_y1 <- max(Data1$Area_2D)
# Define colors to be used for the two series
plot_colors <- c("blue","red")
plot(Data1$Date,Data1$Area_2D, col=plot_colors[1],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
par(new=T)
plot(Data2$Date,Data2$Area_2D, col=plot_colors[2],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
The main problem I see with the code above is there are two different x axis on the plot, one for Data1 and another for Data2. I want to have a single x axis spanning the date range determined by the dates in Data1 and Data2.
My question is:
How do I dynamically create an x axis for both series? (i.e. select the minimum and maximum date from the data frames 'Data1' and 'Data2')
The solution is to combine the data into one data.frame, and base the x-axis on that. This approach works very well with the ggplot2 plotting package. First we merge the data and add an ID column, which specifies to which dataset it belongs. I use letters here:
Data1$ID = 'A'
Data2$ID = 'B'
merged_data = rbind(Data1, Data2)
And then create the plot using ggplot2, where the color denotes which dataset it belongs to (can easily be changed to different colors):
library(ggplot2)
ggplot(merged_data, aes(x = Date, y = Area_2D, color = ID)) +
geom_point() + geom_line()
Note that you get one uniform x-axis here. In this case this is fine, but if the timeseries do not overlap, this might be problematic. In that case we can use multiple sub-plots, known as facets in ggplot2:
ggplot(merged_data, aes(x = Date, y = Area_2D)) +
geom_point() + geom_line() + facet_wrap(~ ID, scales = 'free_x')
Now each facet has its own x-axis, i.e. one for each sub-dataset. Which approach is most appropriate depends on the specific situation.
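If you would rather stay with base graphics, a sketch along these lines should also work: compute the shared limits with range() over both date columns, draw the first series with plot(), and add the second with lines() so that only one set of axes is drawn. The names `x_range` and `y_max` are chosen for this example.

```r
Data1 <- data.frame(
  Date = as.Date(c("4/24/1995", "6/23/1995", "2/12/1996", "4/14/1997", "9/13/1998"),
                 "%m/%d/%Y"),
  Area_2D = c(20, 11, 5, 25, 50))
Data2 <- data.frame(
  Date = as.Date(c("6/23/1995", "4/14/1996", "11/3/1997", "11/6/1997", "4/15/1998"),
                 "%m/%d/%Y"),
  Area_2D = c(13, 15, 18, 25, 19))

# Shared limits computed from both data frames
x_range <- range(c(Data1$Date, Data2$Date))
y_max <- max(Data1$Area_2D, Data2$Area_2D)

# The first series sets up the axes; the second is added to the same plot
plot(Data1$Date, Data1$Area_2D, col = "blue", pch = 16, type = "o",
     xlim = x_range, ylim = c(0, y_max), xlab = "Date", ylab = "Area")
lines(Data2$Date, Data2$Area_2D, col = "red", pch = 16, type = "o")
```

Using lines() instead of par(new=T) followed by a second plot() call avoids drawing a second, differently scaled x axis on top of the first.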
