How do I create a grouped boxplot in R? - r

I have a data frame containing 5 probes which are my variables in a dataframe, cg02823866, cg13474877, cg14305799, cg15837913 and cg19724470. I want to create a boxplot that will group cg02823866 and cg14305799 into a group called 'GeneBody' and then cg13474877, cg14305799 and cg19724470 into a group called 'Promoter'. I then want to colour code the boxplots to represent the probe names. I can't figure out how to group those variables into groups to plot the graph.
I created an ungrouped boxplot of the five probes and it looked like this.
I want there to be the titles 'Promoter' and 'GeneBody' on the x axis. Above the 'GeneBody' title there are the 2 boxplots for the cg02823866 and cg14305799 probes. Then a 'Promoter' label with the boxplots for cg13474877, cg14305799 and cg19724470. I then want each boxplots colour coded to represent each different probe.
My data frame that I imported into RStudio looks like this: https://i.stack.imgur.com/r4gEC.png

Assuming you have some data with variable names Beta (your y axis), Probe (your current x axis), and group (either "GeneBody" or "Promoter"), you can do something like the following:
library(ggplot2)
ggplot(data, aes(x = group, y = Beta, fill = Probe)) +
geom_boxplot()
If you provide a reproducible set of data, I can probably do better.

Adding to Ben's answer the traditional iris-data.frame example,which you can easily load by data(iris):
ggplot(iris) +
aes(x = "", y = Sepal.Length, group = Species) +
geom_boxplot(shape = "circle", fill = "#112446") +
theme_minimal()
So you just need a column which indicates the group dependency.
It gets of course more difficult with uncleand data, where you might need to transpond the data first etc. But those are follow up questions i guess.
Also if you want to make your life easier, use esquisse R-Studio add-on
Boxplot

Related

visualize relationship between categorical variable and frequency of other variable in one graph?

how in R, should I have a histogram with a categorical variable in x-axis and
the frequency of a continuous variable on the y axis?
is this correct?
There are a couple of ways one could interpret "one graph" in the title of the question. That said, using the ggplot2 package, there are at least a couple of ways to render histograms with by groups on a single page of results.
First, we'll create data frame that contains a normally distributed random variable with a mean of 100 and a standard deviation of 20. We also include a group variable that has one of four values, A, B, C, or D.
set.seed(950141237) # for reproducibility of results
df <- data.frame(group = rep(c("A","B","C","D"),200),
y_value = rnorm(800,mean=100,sd = 20))
The resulting data frame has 800 rows of randomly generated values from a normal distribution, assigned into 4 groups of 200 observations.
Next, we will render this in ggplot2::ggplot() as a histogram, where the color of the bars is based on the value of group.
ggplot(data = df,aes(x = y_value, fill = group)) + geom_histogram()
...and the resulting chart looks like this:
In this style of histogram the values from each group are stacked atop each other(i.e. the frequency of group A is added to B, etc. before rendering the chart), which might not be what the original poster intended.
We can verify the "stacking" behavior by removing the fill = group argument from aes().
# verify the stacking behavior
ggplot(data = df,aes(x = y_value)) + geom_histogram()
...and the output, which looks just like the first chart, but drawn in a single color.
Another way to render the data is to use group with facet_wrap(), where each distribution appears in a different facet on one chart.
ggplot(data = df,aes(x = y_value)) + geom_histogram() + facet_wrap(~group)
The resulting chart looks like this:
The facet approach makes it easier to see differences in frequency of y values between the groups.

Using ggplot in R to display dataframe with two different colours depending on value of data. Issue with Data jumping

I am using the below code to plot a data frame on the same plot:
ggplot(df) + geom_line(aes(x = date, y = values, colour = X > 5))
The plot is working and looks great all except for the fact that when the values are bigger than 5, because I am using geom_line, it then starts connecting points that are above the threshold. like below. I do not want the lines connecting the blue data.
How do I stop this from happening?
Here's an example using the economics dataset included in ggplot2. You see the same thing if we highlight the line based on values above 8000:
ggplot(economics, aes(date, unemploy)) +
geom_line(aes(color=unemploy > 8000))
When a mapping is defined in your dataset, by default ggplot2 also groups your data based on this. This makes total sense if you're trying to plot a line where you have data in long form and want to draw separate lines for each different value in a column. In cases like this, you want ggplot2 to change the color of the line based on the data, but you want to tell ggplot2 not to group based on color. This is why you will need to override the group= aesthetic.
To override the group= aesthetic change that happens when you map your line geom, you can just say group=1 or really group= any constant value. This effectively sets every observation mapped to the same group, and the line will connect all your points, but be colored differently:
ggplot(economics, aes(date, unemploy)) +
geom_line(aes(color=unemploy > 8000, group=1))

Reordering data based on a column in [r] to order x-value items from lowest to highest y-values in ggplot

I have a dataframe that I want to reorder to make a ggplot so I can easily see which items have the highest and lowest values in them. In my case, I've grouped the data into two groups, and it'd be nice to have a visual representation of which group tends to score higher. Based on this question I came up with:
library(ggplot2)
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- line that doesn't seem to be working
ggplot(cor.data.sorted,aes(x=pic,y=r.val,size=df.val,color=exp)) + geom_point()
which produces this:
I've tried quite a few variants to reorder the data, and I feel like this should be pretty simple to achieve. To clarify, if I had succesfully reorganised the data then the y-values would go up as the plot moves along the x-value. So maybe i'm focussing on the wrong part of the code to achieve this in a ggplot figure?
You could do something like this?
library(tidyverse);
cor.data %>%
mutate(pic = factor(pic, levels = as.character(pic)[order(r.val)])) %>%
ggplot(aes(x = pic, y = r.val, size = df.val, color = exp)) + geom_point()
This obviously still needs some polishing to deal with the x axis label clutter etc.
Rather than try to order the data before creating the plot, I can reorder the data at the time of writing the plot:
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- This line controls order points drawn created to make (slightly) more readible plot
gplot(cor.data.sorted,aes(x=reorder(pic,r.val),y=r.val,size=df.val,color=exp)) + geom_point()
to create

R: ggplot - YX relationship with 2 groups - how to get only one (whole sample) slope? [duplicate]

I have multiple sources of data over three decades.
The data is discontiguous and overlaps in multiple places. I would like to plot the points for each data source in a different color but then add a single trendline that uses all of the data sources.
The included code has some sample data and two plot examples. The first call to ggplot, plots a single trendline for all of the data. the second ggplot call, plots each source distinctly in different colors with its own trendline.
library(ggplot2)
the.data <- read.table( header=TRUE, sep=",",
text="source,year,value
S1,1976,56.98
S1,1977,55.26
S1,1978,68.83
S1,1979,59.70
S1,1980,57.58
S1,1981,61.54
S1,1982,48.65
S1,1983,53.45
S1,1984,45.95
S1,1985,51.95
S1,1986,51.85
S1,1987,54.55
S1,1988,51.61
S1,1989,52.24
S1,1990,49.28
S1,1991,57.33
S1,1992,51.28
S1,1993,55.07
S1,1994,50.88
S2,1993,54.90
S2,1994,51.20
S2,1995,52.10
S2,1996,51.40
S3,2002,57.95
S3,2003,47.95
S3,2004,48.15
S3,2005,37.80
S3,2006,56.96
S3,2007,48.91
S3,2008,44.00
S3,2009,45.35
S3,2010,49.40
S3,2011,51.19")
ggplot( the.data, aes( the.data$year, the.data$value ) ) + geom_point() + geom_smooth()
#ggplot( the.data, aes( the.data$year, the.data$value, color=the.data$source ) ) + geom_point() + geom_smooth()
The second call displays the colored data points and I would like to add a single contiguous trendline representing all of the years.
Like this:
ggplot(the.data, aes( x = year, y = value ) ) +
geom_point(aes(colour = source)) +
geom_smooth(aes(group = 1))
A few notes:
Don't map aesthetics to an isolated vector like the.data$year. (Until you really know what you're doing, and know when to break that rule.) Just use the column names.
Map the aesthetics that you want in separate layers in their respective geom calls. In this case, I want the points colored differently, but for the smooth line, I want the data grouped all together (group = 1).

ggplot boxplots with scatterplot overlay (same variables)

I'm an undergrad researcher and I've been teaching myself R over the past few months. I just started trying ggplot, and have run into some trouble. I've made a series of boxplots looking at the depth of fish at different acoustic receiver stations. I'd like to add a scatterplot that shows the depths of the receiver stations. This is what I have so far:
data <- read.csv(".....MPS.csv", header=TRUE)
df <- data.frame(f1=factor(data$Tagging.location), #$
f2=factor(data$Station),data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), data$depth)
df$f1f2 <- interaction(df$f1, df$f2) #$
plot1 <- ggplot(aes(y = data$Detection.depth, x = f2, fill = f1), data = df) + #$
geom_boxplot() + stat_summary(fun.data = give.n, geom = "text",
position = position_dodge(height = 0, width = 0.75), size = 3)
plot1+xlab("MPS Station") + ylab("Depth(m)") +
theme(legend.title=element_blank()) + scale_y_reverse() +
coord_cartesian(ylim=c(150, -10))
plot2 <- ggplot(aes(y=data$depth, x=f2), data=df2) + geom_point()
plot2+scale_y_reverse() + coord_cartesian(ylim=c(150,-10)) +
xlab("MPS Station") + ylab("Depth (m)")
Unfortunately, since I'm a new user in this forum, I'm not allowed to upload images of these two plots. My x-axis is "Stations" (which has 12 options) and my y-axis is "Depth" (0-150 m). The boxplots are colour-coded by tagging site (which has 2 options). The depths are coming from two different columns in my spreadsheet, and they cannot be combined into one.
My goal is to to combine those two plots, by adding "plot2" (Station depth scatterplot) to "plot1" boxplots (Detection depths). They are both looking at the same variables (depth and station), and must be the same y-axis scale.
I think I could figure out a messy workaround if I were using the R base program, but I would like to learn ggplot properly, if possible. Any help is greatly appreciated!
Update: I was confused by the language used in the original post, and wrote a slightly more complicated answer than necessary. Here is the cleaned up version.
Step 1: Setting up. Here, we make sure the depth values in both data frames have the same variable name (for readability).
df <- data.frame(f1=factor(data$Tagging.location), f2=factor(data$Station), depth=data$Detection.depth)
df2 <- data.frame(f2=factor(data$Station), depth=data$depth)
Step 2: Now you can plot this with the 'ggplot' function and split the data by using the `col=f1`` argument. We'll plot the detection data separately, since that requires a boxplot, and then we'll plot the depths of the stations with colored points (assuming each station only has one depth). We specify the two different plots by referencing the data from within the 'geom' functions, instead of specifying the data inside the main 'ggplot' function. It should look something like this:
ggplot()+geom_boxplot(data=df, aes(x=f2, y=depth, col=f1)) + geom_point(data=df2, aes(x=f2, y=depth), colour="blue") + scale_y_reverse()
In this plot example, we use boxplots to represent the detection data and color those boxplots by the site label. The stations, however, we plot separately using a specific color of points, so we will be able to see them clearly in relation to the boxplots.
You should be able to adjust the plot from here to suit your needs.
I've created some dummy data and loaded into the chart to show you what it would look like. Keep in mind that this is purely random data and doesn't really make sense.

Resources