Strings in ggplot x-axis - r

I'm trying to create a graph in R like this:
I have three columns (online, offline and routes). However, when I add the following code:
library(ggplot2)
ggplot(coefroute, aes(routes,offline)) + geom_line()
I get the following message:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
sample of coefroute:
routes online offline
(Intercept) 210.4372 257.215
route10 7.543 30.0182
route100 18.3794 1.5313
route11 38.6537 78.8655
route12 66.501 94.8838
route13 -22.2391 -25.8448
route14 24.3652 177.7728
route15 48.5464 51.126 ...
routes: char, online and offline: num
Can anybody help me with putting strings in x-axis in R?
Thank you!

In the absence of sample data, here's some toy data that has the same structure as yours:
coefroute <- data.frame(routes = c("A","B","C","D","E"),
online = c(21,26,30,15,20),
offline = c(15,20,7,12,15))
To replicate your example graph in ggplot2 you would want your data in a long format, so that you can group on offline/online. See more here: Plotting multiple lines from a data frame with ggplot2 and http://ggplot2.tidyverse.org/reference/aes.html.
You can rearrange your data into a long format very easily with lots of different functions or packages, but a standard approach is to use gather from tidyr and group your series for online and offline into something called, say, status or whatever you want.
library(tidyr)
coefroute <- gather(coefroute, key = status, value = coef, online:offline)
Then you can plot this easily in ggplot:
library(ggplot2)
ggplot(coefroute, aes(x = routes, y = coef, group = status, colour = status)) +
geom_line() + scale_x_discrete()
That should create something like your example graph. You may want to modify the colours, captions, etc. There's lots of documentation about these things that's easy enough to find. I've added scale_x_discrete() here so that ggplot knows to treat your x variable as a discrete one.
Secondly, my suspicion is that a line plot may be less effective than other geoms at communicating what you're trying to show here. I would perhaps use geom_bar(stat = "identity", position = "dodge") in place of geom_line(). That would draw a pair of side-by-side vertical bars for each route, one for the offline coefficient and one for the online coefficient.
ggplot(coefroute, aes(x = routes, y = coef, group = status, fill = status)) +
geom_bar(stat = "identity", position = "dodge") + scale_x_discrete()
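As a side note on the reshaping step above: in tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). Here is a minimal sketch of the same reshape on the toy data, assuming a recent tidyr is installed. The name coef_long is only illustrative.

```r
library(tidyr)

# Toy data with the same structure as in the answer above
coefroute <- data.frame(routes = c("A", "B", "C", "D", "E"),
                        online = c(21, 26, 30, 15, 20),
                        offline = c(15, 20, 7, 12, 15))

# One row per route/status pair; same columns as the gather() result,
# although the row order differs
coef_long <- pivot_longer(coefroute,
                          cols = c(online, offline),
                          names_to = "status",
                          values_to = "coef")
```

coef_long can then be passed to the same ggplot() calls in place of the gathered data.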

There are two approaches:
Plotting the data in wide format (quick & dirty, not recommended)
Plotting the data after reshaping it from wide to long format (as shown by dshkol, but using a different approach)
Plotting the data in wide format
# using dshkol's toy data
coefroute <- data.frame(routes = c("A","B","C","D","E"),
online = c(21,26,30,15,20),
offline = c(15,20,7,12,15))
library(ggplot2)
# plotting data in wide format (not recommended)
ggplot(coefroute, aes(x = routes, group = 1L)) +
geom_line(aes(y = online), colour = "blue") +
geom_line(aes(y = offline), colour = "orange")
This approach has several drawbacks. Each variable needs its own call to geom_line() and there is no legend.
Plotting reshaped data
For reshaping, melt() is used. It is available from the reshape2 package (the predecessor of tidyr) or, in a faster implementation, from the data.table package.
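# data.table::melt() is meant for data.tables, so convert the data.frame first
ggplot(data.table::melt(data.table::as.data.table(coefroute), id.vars = "routes"),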
aes(x = routes, y = value, group = variable, colour = variable)) +
geom_line()
Note that in both cases the group aesthetic has to be specified because the x-axis is discrete. This tells ggplot to treat the data points as belonging to one series despite the discrete x values.
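For the single series in the original question, the same idea can be sketched as follows. This uses the toy data again, and the object name p_offline is only illustrative. Setting group = 1 puts every row into one group, which is what the geom_path message is asking for.

```r
library(ggplot2)

coefroute <- data.frame(routes = c("A", "B", "C", "D", "E"),
                        online = c(21, 26, 30, 15, 20),
                        offline = c(15, 20, 7, 12, 15))

# group = 1 places all rows in a single series, so geom_line() can
# connect the points across the discrete x values
p_offline <- ggplot(coefroute, aes(x = routes, y = offline, group = 1)) +
  geom_line()
```

Printing p_offline draws one line through the five routes, with the route names as x-axis labels.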

Related

Changing datastructure to create correct bar graph in ggplot

I would like to make a graph in R that I managed to make in Excel. It is a bar graph with species on the x-axis and the log number of observations on the y-axis. My current data structure in R is not suitable (I think) to make this graph, but I do not know how to change this (in a smart way).
I have (amongst others) the columns 'camera_site' (site1, site2, ...), 'species' (agouti, paca, ...) and 'count' (1, 2, ...), with about 50,000 observations.
I tried making a dataframe with a column 'species' (with 18 species) and a column with 'log(total observation)' for each species (see dataframe), but then I can only make a point graph.
This is how I would like the graph to look:
[Image: desired graph made in Excel]
Your data seems to be in the correct format from what I can tell from your screenshot.
The minimum amount of code you would need to get a plot like that would be the following, assuming your data.frame is called df:
ggplot(df, aes(VRM_species, log_obs_count_vrm)) +
geom_col()
Many people intuitively try geom_bar(), but geom_col() is equivalent to geom_bar(stat = "identity"), which you would use if you've pre-computed observations and don't need ggplot to do the counting for you.
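If you start from the raw observations rather than the pre-computed table, the pre-computation step could look roughly like this. It is only a sketch: the data frame raw and its values are made up, the column names follow the description in the question, and log10 is an assumption about which logarithm is wanted.

```r
library(ggplot2)

# Hypothetical raw observations, one row per camera record
raw <- data.frame(camera_site = c("site1", "site1", "site2", "site2", "site2"),
                  species = c("agouti", "paca", "agouti", "agouti", "paca"),
                  count = c(1, 2, 3, 1, 1))

# Sum the counts per species, then take the log of each total
totals <- aggregate(count ~ species, data = raw, FUN = sum)
totals$log_obs <- log10(totals$count)

# geom_col() draws the pre-computed values as bar heights
p_species <- ggplot(totals, aes(species, log_obs)) +
  geom_col()
```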
But you could probably decorate the plot a bit better with some additions:
ggplot(df, aes(VRM_species, log_obs_count_vrm)) +
geom_col() +
scale_x_discrete(name = "Species") +
scale_y_continuous(name = expression("Log"[10]*" Observations"),
expand = c(0,0,0.1,0)) +
theme(axis.text.x = element_text(angle = 90))
Of course, you could customize the theme any way you like.
Greetings

How to draw a violin plot with the color showing the expression of gene value?

I am trying to plot the gene expression of "gene A" among several groups.
I use ggplot2 to draw, but I fail
p <- ggplot(MAPK_plot, aes(x = group, y = gene_A)) + geom_violin(trim = FALSE , aes( colour = gene_A)) + theme_classic()
And I want to get the figure like this from https://www.researchgate.net/publication/313728883_Neuropilin-1_Is_Expressed_on_Lymphoid_Tissue_Residing_LTi-like_Group_3_Innate_Lymphoid_Cells_and_Associated_with_Ectopic_Lymphoid_Aggregates
You would have to provide data to get a more specific answer, tailored to your problem. But I do not want you to get demotivated by the down-votes you have received so far and, based on your link, maybe this example can give you some food for thought.
Nice job on figuring out that you have to use geom_violin. Further, you will need some form of faceting / multi-panels. Finally, to do the full annotation like in the given link, you need to make use of the grid package functionality (which I do not use here).
I am not familiar with gene-expression data sets, so I use an IMDB movie rating data set for this example (stored in the package ggplot2movies).
library(ggplot2)
library(ggplot2movies)
library(data.table)
mv <- copy(movies)
setDT(mv)
# make some variables for our plotting example
mv[, year_10 := cut_width(year, 10)]
mv[, rating_10yr_avg := mean(rating), by = year_10]
mv[, length_3gr := cut_number(length, 3)]
ggplot(mv,
aes(x = year_10,
y = rating)) +
geom_violin(aes(fill = rating_10yr_avg),
scale = "width") +
facet_grid(rows = vars(length_3gr))
Please do not take this answer as encouragement to skip posting data relevant to your problem.

compare boxplots with a single value

I want to compare the distribution of several variables (here X1 and X2) with a single value (here bm). The issue is that there are too many of these variables (about a dozen) to use a single boxplot.
Additionally, the levels are too different to use one plot. I need to use facets to make things more organised:
However with this plot my benchmark category (bm), which is a single value in X1 and X2, does not appear in X1 and seems to have several values in X2. I want it to be only this green line, which it is in the first plot. Any ideas why it changes? Is there any good workaround? I tried the options of facet_wrap/facet_grid, but nothing there delivered the right result.
I also tried combining a bar plot with bm and three empty categories with the boxplot. But firstly it looked terrible and secondly it got similarly messed up in the faceting. Basically any workaround would help.
Below the code to create the minimal example displayed here:
# Creating some sample data & loading libraries
library(ggplot2)
library(RColorBrewer)
set.seed(10111)
x=matrix(rnorm(40),20,2)
y=rep(c(-1,1),c(10,10))
x[y==1,]=x[y==1,]+1
x[,2]=x[,2]+20
df=data.frame(x,y)
# creating a benchmark point
benchmark=data.frame(y=rep("bm",2),key=c("X1","X2"),value=c(-0.216936,20.526312))
# melting the data frame, rbinding it with the benchmark
test_dat=rbind(tidyr::gather(df,key,value,-y),benchmark)
# Creating a plot
p_box <- ggplot(data = test_dat, aes(x=key, y=value,color=as.factor(test_dat$y))) +
geom_boxplot() + scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1"))
# The first line delivers the first plot, the second line the second plot
p_box
p_box + facet_wrap(~key,scales = "free",drop = FALSE) + theme(legend.position = "bottom")
The problem lies only in the use of test_dat$y inside the color aes. Never use $ in aes; ggplot will mess up.
Anyway, I think your plot would improve if you used a geom_hline for the benchmark, instead of hacking in a single-value boxplot:
library(ggplot2)
library(RColorBrewer)
ggplot(tidyr::gather(df,key,value,-y)) +
geom_boxplot(aes(x=key, y=value, color=as.factor(y))) +
geom_hline(data = benchmark, aes(yintercept = value), color = '#4DAF4A', size = 1) +
scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1")) +
facet_wrap(~key,scales = "free",drop = FALSE) +
theme(legend.position = "bottom")

Colouring specific label in ggplot depending on the value of the id variable on a long data (irrespectively of the row number)

Let's say that I have a long data set and I would like to colour a specific label on the x-axis. In the case of the example below I would like to colour the label for Valiant.
# Packs
require(ggplot2)
require(reshape2)
# Data and trans
data(mtcars)
mtcars$model <- rownames(mtcars)
mtcars <- melt(mtcars, id.vars = "model")
# Some chart
ggplot(data = subset(x = mtcars, subset = mtcars$variable == "cyl"),
aes(x = model, y = value)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90,
colour =
ifelse(mtcars$model == "Valiant",
"red","black")))
The code produces the chart below that is erroneous as the wrong label is coloured.
The reason is fairly simple as what is created by ifelse does not match the order on the axis. I can fix the code by forcing ggplot to colour a specific row. The code below colours the right label as in the particular data.frame used for the chart the row with the Valiant value is 31.
# Fixed chart
ggplot(data = subset(x = mtcars, subset = mtcars$variable == "cyl"),
aes(x = model, y = value)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90,
colour =
ifelse(as.numeric(rownames(mtcars)) == 31,
"red","black")))
Clearly this solution is extremely impractical. On the actual data I have a vast number of observations with multiple columns (geo, gender, indicator, value, etc.). That data is subsequently filtered via subset and different options are passed to the aes settings. Trying to figure out the row that should be coloured is a nightmare. I'm looking for a solution that would enable me to:
Relatively effortlessly indicate the specific observation to be coloured without using row numbers
Ideally, use the id with some string as a way of indicating the text I want to highlight
Encapsulate the solution in the ggplot2 code. I don't want to create separate data subsets only to derive a colouring vector, as I will be doing this a number of times and it would unnecessarily multiply objects.
In practice, I want a solution that works like this: irrespective of what is on the chart, when you find this string on the x-axis, make it red.
The reason the first one mismatches is that mtcars$model is much longer than the subset you are plotting, so the colour vector ifelse(mtcars$model == "Valiant","red","black") is of length 352 but the subset you are plotting is only of length 32. The same problem exists with your second example, though in this case the extra elements of colour (which are all "black" anyway) are dropped so you don't notice.
Unfortunately it looks like theme(...) doesn't get evaluated with the data column-names available to it (i.e. can't just do colour=ifelse(model == "Valiant", "red", "black") directly in the theme(...) call)
One alternative is to make model a factor and filter on levels(..) == "Valiant". If you have a long dataframe your id variable is most likely a factor anyway (or it would make sense for it to be one).
mtcars$model = factor(mtcars$model)
ggplot(data=subset(mtcars, variable == 'cyl'), aes(x=model, y=value)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=90,
colour=ifelse(levels(mtcars$model) == 'Valiant', 'red', 'black')))
(your problem stems from feeding subset() into ggplot as your data, and then not being able to refer back to that particular subset in the theme call. I don't know if there is a tricksy way to do this).
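One way to keep the colouring tied to the label text rather than to row numbers is a small helper that builds the colour vector from the factor levels of whatever is being plotted. This is only a sketch: label_colours() is a hypothetical helper, not part of ggplot2, cars_long is a stand-in for the melted data built from a fresh copy of mtcars, and passing a vector of colours to element_text() is not officially supported (recent ggplot2 versions warn about it). It also still needs the plotted subset as an object, so it does not fully meet the "no extra objects" wish.

```r
library(ggplot2)

# Long-format stand-in for the melted mtcars data (base R only)
cars_long <- data.frame(model = rownames(mtcars),
                        variable = "cyl",
                        value = mtcars$cyl)

# Hypothetical helper: one colour per axis label, in factor-level order,
# which is the order ggplot uses for a discrete axis
label_colours <- function(x, highlight, on = "red", off = "black") {
  ifelse(levels(factor(x)) == highlight, on, off)
}

plot_data <- subset(cars_long, variable == "cyl")

p_cars <- ggplot(plot_data, aes(x = model, y = value)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90,
                                   colour = label_colours(plot_data$model,
                                                          "Valiant")))
```

Because the colours are derived from the same data that is plotted, the vector always has one entry per axis label, whatever subset is used.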

want to layer aes in ggplot2

I would like to plot another series of data on top of a current graph. The additional data only contains information for 3 (out of 6) spp, which are used in the facet_wrap call.
The other series of data is currently a column (in the same data file).
Current graph:
ped.num <- ggplot(data, aes(ped.length, seeds.inflorstem))
ped.num + geom_point(size=2) + theme_bw() + facet_wrap(~spp, scales = "free_y")
Additional layer would be:
aes(ped.length, seeds.filled)
I feel I should be able to plot them using the same y-axis, because they have just slightly smaller values. How do I go about adding this layer?
@ialm's solution should work fine, but I recommend calling the aes function separately in each geom_* because it makes the code easier to read.
ped.num <- ggplot(data) +
geom_point(aes(x=ped.length, y=seeds.inflorstem), size=2) +
theme_bw() +
facet_wrap(~spp, scales="free_y") +
geom_point(aes(x=ped.length, y=seeds.filled))
(You'll always get better answers if you include example data, but I'll take a shot in the dark)
Since you want to plot two variables that are on the same data.frame, it's probably easiest to reshape the data before feeding it into ggplot:
library(reshape2)
# Melting data gives you exactly one observation per row - ggplot likes that
dat.melt <- melt(dat,
id.var = c("spp", "ped.length"),
measure.var = c("seeds.inflorstem", "seeds.filled")
)
# Plotting is slightly different - instead of explicitly naming each variable,
# you'll refer to "variable" and "value"
ggplot(dat.melt, aes(x = ped.length, y = value, color = variable)) +
geom_point(size=2) +
theme_bw() +
facet_wrap(~spp, scales = "free_y")
The seeds.filled values should plot only on the facets for the corresponding species.
I prefer this to Drew's (totally valid) approach of explicitly mapping different layers because you only need a single geom_point() whether you have two variables or twenty and it's easy to map a variety of aesthetics to variable.
