Gene Expression Profile Plot in R - r

I'm trying to make a gene expression profile plot in R. My input data is a data frame where column 1 has gene names and columns 2:18 are multiple cancer types. Here is a small set of data.
What I want is a graph that has samples on the x-axis and, on the y-axis, an expression line for each gene.
Something that looks like this.
But instead of timepoints on the x-axis it should have cancer types (columns).
So far I've tried ggplot() and geneprofiler() but I failed over and over.
Any help will be greatly appreciated.

Data Format
The current format of the data is referred to as wide format, but ggplot requires long format data. The tidyr package (part of the tidyverse) has functions for converting between wide and long formats. In this case, you want the function tidyr::pivot_longer. For example, if you have the data in a data.frame (or tibble) called df_gene_expr, the pivot would go something like
library(tidyverse)
df_gene_expr %>%
pivot_longer(cols=2:18, names_to="cancer_type", values_to="gene_expr") %>%
filter(ID == "ABCA8") %>%
ggplot(aes(x=cancer_type, y=gene_expr)) +
geom_point()
where we single out the one gene "ABCA8". Change the geom_point() to whatever geometry you actually want (perhaps geom_bar(stat='identity')).
Discrete Trendline
I'm not sure that geom_smooth is entirely appropriate - it is designed with continuous-continuous data in mind. Instead, I'd recommend stat_summary.
There's a slight trick to this because of the discrete cancer_type on the x-axis. Namely, the cancer_type variable should be a factor, but we will use the underlying integer codes for the x-values in stat_summary. Otherwise, it would complain that using geom='line' doesn't make sense.
Something along the lines:
ggplot(df_long, aes(x=cancer_type, y=gene_expr)) +
geom_hline(yintercept=0, linetype=4, color="red") +
geom_line(aes(group=ID), size=0.5, alpha=0.3, color="black") +
stat_summary(aes(x=as.numeric(cancer_type)), fun=mean, geom='line',
size=2, color='orange')
Output from Fake Data
Technically, this same trick (aes(x=as.numeric(cancer_type))) could be applied equally well to geom_smooth, but I think it still makes more sense to use stat_summary, which lets one explicitly pick the statistic to be computed. For example, median instead of mean might be more appropriate in this context for the summary function.
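For reference, here is a minimal self-contained sketch of the whole pipeline described above, run on made-up data. The gene IDs, the five cancer type labels, and the random values are all assumptions for illustration; they are not the asker's data. It also assumes ggplot2 3.4 or later for the linewidth argument (older versions use size instead).

```r
library(tidyr)
library(ggplot2)

# Fake wide-format data (illustrative only): one row per gene,
# one column per cancer type.
set.seed(42)
cancer_types <- c("BLCA", "BRCA", "COAD", "LUAD", "PRAD")
df_gene_expr <- data.frame(ID = paste0("gene", 1:20))
for (ct in cancer_types) df_gene_expr[[ct]] <- rnorm(20)

# Reshape to long format: one row per gene/cancer-type pair.
df_long <- pivot_longer(df_gene_expr, cols = -ID,
                        names_to = "cancer_type", values_to = "gene_expr")

# cancer_type must be a factor so that as.numeric() yields its integer codes;
# the levels also fix the order along the x-axis.
df_long$cancer_type <- factor(df_long$cancer_type, levels = cancer_types)

# One faint line per gene, plus the mean across genes in orange.
p <- ggplot(df_long, aes(x = cancer_type, y = gene_expr)) +
  geom_line(aes(group = ID), alpha = 0.3) +
  stat_summary(aes(x = as.numeric(cancer_type)), fun = mean, geom = "line",
               linewidth = 2, color = "orange")
print(p)
```

Swapping fun = mean for fun = median gives the median profile instead.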

Related

Apply ggplot2 across columns

I am working with a dataframe with many columns and would like to produce certain plots of the data using ggplot2, namely, boxplots, histograms, density plots. I would like to do this by writing a single function that applies across all attributes (columns), producing one boxplot (or histogram etc) and then storing that as a given element of a list into which all the boxplots will be chained, so I could later index it by number (or by column name) in order to return the plot for a given attribute.
The issue I have is that, if I try to apply across columns with something like apply(df,2,boxPlot), I have to define boxPlot as a function that takes just a vector x. And when I do so, the attribute/column name and index are no longer retained. So e.g. in the code for producing a boxplot, like
bp <- ggplot(df, aes(x=Group, y=Attr, fill=Group)) +
geom_boxplot() +
labs(title="Plot of length per dose", x="Group", y =paste(Attr)) +
theme_classic()
the function has no idea how to extract the info necessary for Attr from just vector x (as this is just the column data and doesn't carry the column name or index).
(Note the x-axis is a factor variable called 'Group', which has 6 levels A,B,C,D,E,F, within X.)
Can anyone help with a good way of automating this procedure? (Ideally it should work for all types of ggplots; the problem here seems to simply be how to refer to the attribute name, within the ggplot function, in a way that can be applied / automatically replicated across the columns.) A for-loop would be acceptable, I guess, but if there's a more efficient/better way to do it in R then I'd prefer that!
Edit: something like what would be achieved by the top answer to this question: apply box plots to multiple variables. Except that in that answer, with his code you would still need a for-loop to change the indices on y=y[2] in the ggplot code and get all the boxplots. He's also used expand.grid to include different x possibilities (I have only one, the Group factor), but it would be easy to simplify down if the looping problem could be handled.
I'd also prefer just base R if possible--dplyr if absolutely necessary.
Here's an example of iterating over all columns of a data frame to produce a list of plots, while retaining the column name in the ggplot axis label
library(tidyverse)
plots <-
imap(select(mtcars, -cyl), ~ {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
})
plots$mpg
You can also do this without purrr and dplyr
to_plot <- setdiff(names(mtcars), 'cyl')
plots <-
Map(function(.x, .y) {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
}, mtcars[to_plot], to_plot)
plots$mpg
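Another option, sketched here as an assumption rather than as part of the answers above, is to loop over column names and look each column up inside aes() with ggplot2's .data pronoun. The helper name box_plot and its arguments are hypothetical; mtcars with cyl as the grouping variable stands in for the asker's data frame and Group factor.

```r
library(ggplot2)

# Hypothetical helper: boxplot of one column (attr) against a grouping column.
# Both columns are passed as strings and resolved with the .data pronoun.
box_plot <- function(df, attr, group) {
  ggplot(df, aes(x = factor(.data[[group]]), y = .data[[attr]],
                 fill = factor(.data[[group]]))) +
    geom_boxplot() +
    labs(title = paste("Plot of", attr, "per", group),
         x = group, y = attr, fill = group) +
    theme_classic()
}

# Build one plot per column (all columns except the grouping one)
# and name the list elements so plots can be fetched by column name.
to_plot <- setdiff(names(mtcars), "cyl")
plots <- lapply(to_plot, box_plot, df = mtcars, group = "cyl")
names(plots) <- to_plot

plots$mpg      # by name
plots[[2]]     # or by position
```

Because the helper receives the column name rather than the column's values, the name is available for titles and axis labels, and the same pattern should carry over to histograms or density plots by changing the geom.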

How to plot parallel coordinates with multiple categorical variables in R

I am having difficulty plotting a parallel coordinates plot using ggparcoord from the GGally package. As there are two categorical variables, what I want to show in the visualisation is like the image below. I've found that in ggparcoord, groupColumn accepts only a single variable to group (colour) by. I can use showPoints to mark the values on the axes, but I also need to vary the shape of these markers according to the categorical variables. Is there another package that can help me realise my idea?
Any response will be appreciated! Thanks!
It's not that difficult to roll your own parallel coordinates plot in ggplot2, which will give you the flexibility to customize the aesthetics. Below is an illustration using the built-in diamonds data frame.
To get parallel coordinates, you need to add an ID column so you can identify each row of the data frame, which we'll use as a group aesthetic in ggplot. You also need to scale the numeric values so that they'll all be on the same vertical scale when we plot them. Then you need to take all the columns that you want on the x-axis and reshape them to "long" format. We do all that on the fly below with the tidyverse/dplyr pipe operator.
Even after limiting the number of category combinations, the lines are probably too intertwined for this plot to be easily interpretable, so consider this merely a "proof of concept". Hopefully, you can create something more useful with your data. I've used colour (for the lines) and fill (for the points) aesthetics below. You can use shape or linetype instead, depending on your needs.
library(tidyverse)
theme_set(theme_classic())
# Get 20 random rows from the diamonds data frame after limiting
# to two levels each of cut and color
set.seed(2)
ds = diamonds %>%
filter(color %in% c("D","J"), cut %in% c("Good", "Premium")) %>%
sample_n(20)
ggplot(ds %>%
mutate(ID = 1:n()) %>% # Add ID for each row
mutate_if(is.numeric, scale) %>% # Scale numeric columns
gather(key, value, c(1,5:10)), # Reshape to "long" format
aes(key, value, group=ID, colour=color, fill=cut)) +
geom_line() +
geom_point(size=2, shape=21, colour="grey50") +
scale_fill_manual(values=c("black","white"))
I haven't used ggparcoord before, but the only option that seemed straightforward (at least on my first try with the function) was to paste together two columns of data. Below is an example. Even with just four category combinations, the plot is confusing, but maybe it will be interpretable if there are strong patterns in your data:
library(GGally)
ds$group = with(ds, paste(cut, color, sep="-"))
ggparcoord(ds, columns=c(1, 5:10), groupColumn=11) +
theme(panel.grid.major.x=element_line(colour="grey70"))

graphing multiple data series in R ggplot

I am trying to plot (on the same graph) two sets of data versus date from two different data frames. Both data frames have the same exact dates for each of the two measurements. I would like to plot these two sets of data on the same graph, with different colors. However, I can't get them on the same graph at all. R is already reading the date as date. I tried this:
qplot( date , NO3, data=qual.arn)
+ qplot( qual.arn$date , qual.arn$DIS.O2, "O2(aq)" , add=T)
and received this error.
Error in add_ggplot(e1, e2, e2name) :
argument "e2" is missing, with no default
I tried using the ggplot function instead of qplot, but I couldn't even plot one graph this way.
ggplot(date=qual.no3.s, aes(date,NO3))
Error: ggplot2 doesn't know how to deal with data of class uneval
PLEASE HELP. Thank you!
Since you didn't provide any data (please do so in future), here's a made-up dataset to demonstrate a solution. There are (at least) two ways to do this: the right way and the wrong way. Both yield equivalent results in this very simple case.
# set up minimum reproducible example
set.seed(1) # for reproducible example
dates <- seq(as.Date("2015-01-01"),as.Date("2015-06-01"), by=1)
df1 <- data.frame(date=dates, NO3=rpois(length(dates),25))
df2 <- data.frame(date=dates, DIS.O2=rnorm(length(dates),50,10))
ggplot is designed to use data in "long" format. This means that all the y-values (the concentrations) are in a single column, and there is a separate column which identifies the corresponding category ("NO3" or "DIS.O2" in your case). So first we merge the two data sets based on date, then use melt(...) to convert from "wide" (categories in separate columns) to "long" format. Then we let ggplot worry about legends, colors, etc.
library(ggplot2)
library(reshape2) # for melt(...)
# The right way: combine the data-sets, then plot
df.mrg <- merge(df1,df2, by="date", all=TRUE)
gg.df <- melt(df.mrg, id="date", variable.name="Component", value.name="Concentration")
ggplot(gg.df, aes(x=date, y=Concentration, color=Component)) +
geom_point() + labs(x=NULL)
The "wrong" way to do this is by making separate calls to geom_point(...) for each layer. In your particular case this might be simpler, but in the long run it's better to use the other method.
# The wrong way: plot two sets of points
ggplot() +
geom_point(data=df1, aes(x=date, y=NO3, color="NO3")) +
geom_point(data=df2, aes(x=date, y=DIS.O2, color="DIS.O2")) +
scale_color_manual(name="Component",values=c("red", "blue")) +
labs(x=NULL, y="Concentration")

geom_line only connects points on horizontal lines instead all points

I've written something in R using ggplot2 and don't know why it behaves as it does.
If I plot my data using geom_point and geom_line, it is supposed to draw lines through those points. But instead of connecting all the points, it only connects those that are on a horizontal line. I don't know how to handle this.
This is a simple version of the code:
date<-c("2014-07-01","2014-07-02","2014-07-03",
"2014-07-04","2014-07-05","2014-07-06",
"2014-07-07")
mbR<- c(160,163,169,169,169,169,169)
mbL<- c(166,166,166,166,NA, NA, NA)
mb<-data.frame(mbR,mbL)
mb<-data.frame(t(as.Date(date)),mb)
colnames(mb)<-c("Datum","R","L")
mb$Datum<-date
plot1<-ggplot(mb,aes(x=mb$Datum,y=mb$R))+
geom_point(data=mb,aes(x=mb$Datum,y=mb$R,color="R",size=2),
group=mb$R,position="dodge")+
geom_line(data=mb,aes(y=mb$R,color="R",group=mb$R))+
geom_point(aes(y=mb$L,color="L",size=2),position="dodge")
plot1
I used group, otherwise I wouldn't have been able to draw any lines, still it doesn't do what I intended.
I hope you guys can help me out a little. :) It may be a minor fault.
First, melt your data to long format and then plot it. The column called variable in the melted data is the category (R or L). The column called value stores the data values for each instance of R and L. We group and color the data by variable in the call to ggplot, which gives us separate lines/points for R and L.
Also, you only need to provide the data frame and column mappings in the initial call to ggplot. They will carry through to geom_point and geom_line. Furthermore, when you provide the column names, you don't need to (and shouldn't) include the name of the data frame, because you've already specified the data frame in the data argument to ggplot.
library(reshape2)
mb.l = melt(mb, id.var="Datum")
ggplot(data=mb.l, aes(x=Datum, y=value, group=variable, color=variable)) +
geom_point(size=2) +
geom_line()
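As to why the original plot only drew horizontal segments: group=mb$R puts every distinct value of R into its own group, so geom_line can only join points that share the same y-value. The sketch below rebuilds the example end to end with one group per series. It differs from the code above in two ways that are my own choices, not part of the answer: the long data frame is built with base R instead of melt, and Datum is stored as a Date so the x-axis is continuous.

```r
library(ggplot2)

# The asker's data, with Datum as a real Date column.
mb <- data.frame(
  Datum = as.Date("2014-07-01") + 0:6,
  R = c(160, 163, 169, 169, 169, 169, 169),
  L = c(166, 166, 166, 166, NA, NA, NA)
)

# Long format built by hand: one row per date and series.
mb_long <- data.frame(
  Datum    = rep(mb$Datum, times = 2),
  variable = rep(c("R", "L"), each = nrow(mb)),
  value    = c(mb$R, mb$L)
)

# Grouping by series (not by value) connects all points within R and within L.
# na.rm = TRUE silences the warnings about the three missing L values.
p <- ggplot(mb_long, aes(x = Datum, y = value,
                         group = variable, color = variable)) +
  geom_point(size = 2, na.rm = TRUE) +
  geom_line(na.rm = TRUE)
print(p)
```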

normalizing ggplot2 densities with facet_wrap in R

I am making a series of density plots with geom_density from a dataframe, and showing it by condition using facet_wrap, as in:
ggplot(iris) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)
When I do this, the y-axis scale seems to not represent percent of each Species in a panel, but rather the percent of all the total datapoints across all species.
My question is: How can I make it so the ..count.. variable in geom_density refers to the count of items in each Species set of each panel, so that the panel for virginica has a y-axis corresponding to "Fraction of virginica data points"?
Also, is there a way to get ggplot2 to output the values it uses for ..count.. and sum(..count..) so that I can verify what numbers it is using?
edit: I misunderstood geom_density it looks like even for a single Species, ..count../sum(..count..) is not a percentage:
ggplot(iris[iris$Species == 'virginica',]) + geom_density(aes(x=Sepal.Width, colour=Species, y=..count../sum(..count..))) + facet_wrap(~Species)
so my revised question: how can I get the density plot to be the fraction of data in each bin? Do I have to use stat_density for this or geom_histogram? I just want the y-axis to be percentage / fraction of data points
Unfortunately, what you are asking ggplot2 to do is define separate y's for each facet, which it syntactically cannot do AFAIK.
So, in response to your mentioning in the comment thread that you "just want a histogram fundamentally", I would suggest instead using geom_histogram or, if you're partial to lines instead of bars, geom_freqpoly:
ggplot(iris, aes(Sepal.Width, ..count..)) +
geom_histogram(aes(colour=Species, fill=Species), binwidth=.2) +
geom_freqpoly(colour="black", binwidth=.2) +
facet_wrap(~Species)
Note: geom_freqpoly works just as well in place of geom_histogram in my above example. I just added both in one plot for sake of efficiency.
Hope this helps.
EDIT: Alright, I managed to work out a quick-and-dirty way of getting what you want. It requires that you install and load plyr. Apologies in advance; this is likely not the most efficient way to do this in terms of RAM usage, but it works.
First, let's get iris out in the open (I use RStudio so I'm used to seeing all my objects in a window):
d <- iris
Now, we can use ddply to count the number of individuals belonging to each unique measurement of what will become your x-axis (here I used Sepal.Length instead of Sepal.Width, to give myself a bit more range, simply for seeing a bigger difference between groups when plotted).
new <- ddply(d, c("Species", "Sepal.Length"), summarize, count=length(Sepal.Length))
Note that ddply automatically sorts the output data.frame according to the quoted variables.
Then we can divvy up the data.frame into each of its unique conditions--in the case of iris, each of the three species (I'm sure there's a much smoother way to go about this, and if you're working with really large amounts of data it's not advisable to keep creating subsets of the same data.frame because you could max out your RAM)...
set <- new[which(new$Species%in%"setosa"),]
ver <- new[which(new$Species%in%"versicolor"),]
vgn <- new[which(new$Species%in%"virginica"),]
... and use ddply again to calculate proportions of individuals falling under each measurement, but separately for each species.
prop <- rbind(ddply(set, c("Species"), summarize, prop=set$count/sum(set$count)),
ddply(ver, c("Species"), summarize, prop=ver$count/sum(ver$count)),
ddply(vgn, c("Species"), summarize, prop=vgn$count/sum(vgn$count)))
Then we just put everything we need into one dataset and remove all the junk from our workspace.
new$prop <- prop$prop
rm(list=ls()[which(!ls()%in%c("new", "d"))])
And we can make our figure with facet-specific proportions on the y. Note that I'm now using geom_line since ddply has automatically ordered your data.frame.
ggplot(new, aes(Sepal.Length, prop)) +
geom_line(aes(colour=Species)) +
facet_wrap(~Species)
# let's check our work. each should equal 50
sum(new$count[which(new$Species%in%"setosa")])
sum(new$count[which(new$Species%in%"versicolor")])
sum(new$count[which(new$Species%in%"virginica")])
#... and each of these should equal 1
sum(new$prop[which(new$Species%in%"setosa")])
sum(new$prop[which(new$Species%in%"versicolor")])
sum(new$prop[which(new$Species%in%"virginica")])
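The same per-species proportions can probably be computed more compactly without plyr and without splitting the data frame by hand. The following is a sketch of that idea using base R's table() and ave(); the object name counts is my own, and this is an alternative to the steps above rather than a reproduction of them.

```r
library(ggplot2)

# Count observations for each Species / Sepal.Length combination.
counts <- as.data.frame(table(Species = iris$Species,
                              Sepal.Length = iris$Sepal.Length),
                        responseName = "count")

# Drop combinations that never occur, and turn Sepal.Length
# (a factor after table()) back into a number.
counts <- counts[counts$count > 0, ]
counts$Sepal.Length <- as.numeric(as.character(counts$Sepal.Length))

# Within each species, divide each count by that species' total.
counts$prop <- ave(as.numeric(counts$count), counts$Species,
                   FUN = function(x) x / sum(x))

# Facet-specific proportions on the y-axis.
p <- ggplot(counts, aes(Sepal.Length, prop, colour = Species)) +
  geom_line() +
  facet_wrap(~Species)
print(p)

# Each species' proportions should sum to 1.
tapply(counts$prop, counts$Species, sum)
```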
Maybe using table() and barplot() you might be able to get what you need. I'm still not sure if this is what you are after...
barplot(table(iris[iris$Species == 'virginica',1]))
With ggplot2
tb <- table(iris[iris$Species == 'virginica',1])
tb <- as.data.frame(tb)
ggplot(tb, aes(x=Var1, y=Freq)) + geom_bar(stat="identity")
Passing the argument scales='free_y' to facet_wrap() should do the trick.
