Creating a boxplot showing the spread of gene-expression within different samples - r

So I have a dataframe containing the 10 most upregulated genes in cancer-samples compared to control-samples and the 10 most downregulated genes.
It looks like this:
I want to create a neat boxplot to compare the spread of each gene between patient-samples and control-samples (there are 4 samples of each type).
The problem is that I don't get all the boxes alongside each other in one row of the same graph; instead I get this:
I would also like the genes' boxes to be sorted by the "log2FC" value rather than in alphabetical order. Does anyone know how to fix this?
This is the code I used:
#Load the packages used below: tidyr for gather(), ggplot2 for plotting
library(tidyr)
library(ggplot2)
#I'm using the dataframe called "Dataframe"
#Make a column for the gene names
Dataframe$genenames <- rownames(Dataframe)
#Get data into a long format
long_Dataframe <- gather(Dataframe, key="samples",
                         value="values", -c(log2FC, genenames))
#Creating a new column called "group", stating if each row belongs to patient/control
long_Dataframe$group <- rep(c("Control", "Patient"), each=40)
#Order rows by log2FC - from lowest to highest
long_Dataframe <- long_Dataframe[order(long_Dataframe$log2FC), , drop=FALSE]
#Use long data for boxplot of top 20 up/downregulated genes
Boxplot_top20 <- ggplot(long_Dataframe, aes(x=genenames, y=values, fill=group)) +
  geom_boxplot() +
  scale_fill_manual(values=c("green", "red")) +
  theme_light() +
  facet_wrap(~genenames, scales="free")

You may use geom_boxplot(position=position_dodge()) instead of facet_wrap() to place your boxplots by pair within group.
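A minimal sketch of that suggestion, using made-up data since the original dataframe is not shown: the gene names (geneA to geneD), the group layout, and all values are assumptions. It also turns genenames into a factor whose levels follow log2FC, which is one way to get the x-axis sorted by log2FC instead of alphabetically.

```r
# Hypothetical stand-in for the long-format data: 4 genes x 8 samples
set.seed(1)
long_df <- data.frame(
  genenames = rep(c("geneA", "geneB", "geneC", "geneD"), each = 8),
  log2FC    = rep(c(2.1, -3.0, 0.8, -1.2), each = 8),
  group     = rep(rep(c("Control", "Patient"), each = 4), times = 4),
  values    = rnorm(32, mean = 10),
  stringsAsFactors = FALSE
)

# Order the factor levels by log2FC so the x-axis is not alphabetical
gene_order <- unique(long_df$genenames[order(long_df$log2FC)])
long_df$genenames <- factor(long_df$genenames, levels = gene_order)

# One panel, boxes dodged by group, no facet_wrap()
# (the plot is drawn only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long_df, aes(x = genenames, y = values, fill = group)) +
    geom_boxplot(position = position_dodge()) +
    scale_fill_manual(values = c("green", "red")) +
    theme_light()
  print(p)
}
```

Sorting the rows with order() alone does not change the axis, because ggplot2 orders a discrete axis by factor levels (or alphabetically for character columns), not by row position.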

Related

Summarising data before plotting with geom_tile() renders different results

I have noticed that when plotting with ggplot2's geom_tile(), summarising the data before plotting renders a completely different result than when it is not pre-summarised. I don't understand why.
For a dataframe with three columns, year (character), state (character) and profit (numeric), consider the following examples:
# Plot straight away
data %>%
ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit))
# Summarise before plotting
data %>% group_by(year, state) %>% summarize(profit_mean = mean(profit)) %>%
ungroup() %>%
ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit_mean))
These two examples render two different tile plots - the values are quite different. I thought that these two methods of plotting would be analogous and that ggplot2 would take a mean automatically - is that not so?
I tried reproducing this error on a smaller subset of data, but it didn't appear. What could be going on here?
OP, this was a very interesting question.
First, let's get this out of the way. It is clear that plotting the summary of your data plots just that: the summary. You are summarizing via mean, so what is plotted equals the mean of the values for each tile.
The actual question here is: If you have a dataset containing more than one value per tile, what is the result of plotting the "non-summarized" dataset?
User @akrun is correct: the default stat used for geom_tile is stat="identity", but it might not be clear exactly what that means. The documentation says it "leaves the data unchanged", but it is not obvious what that implies when several rows fall on the same tile.
Illustrative Example Dataset
For purposes of demonstration, I'll create an illustrative dataset, which will answer the question very clearly. I'm creating two individual datasets df1 and df2, which each contain 4 "tiles" of data. The difference between these is that the values themselves for the tiles are different. I've included text labels on each tile for more clarity.
library(ggplot2)
library(cowplot)
df1 <- data.frame(
x=rep(paste("Test",1:2), 2),
y=rep(c("A", "B"), each=2),
value=c(5,15,20,25)
)
df2 <- data.frame(
x=rep(paste("Test",1:2), 2),
y=rep(c("A", "B"), each=2),
value=c(10,5,25,15)
)
tile1 <- ggplot(df1, aes(x,y, fill=value, label=value)) +
geom_tile() + geom_text() + labs(title="df1")
tile2 <- ggplot(df2, aes(x,y, fill=value, label=value)) +
geom_tile() + geom_text() + labs(title="df2")
plot_grid(tile1, tile2)
Plotting the Combined Data Frame
Each of the data frames df1 and df2 contains only one value per tile, so in order to see what changes when we have more than one value per tile, we need to combine them into one data frame in which each tile has 2 values. We are going to combine them in two ways: first df1 then df2, and the other way round, df2 first, then df1.
df12 <- rbind(df1, df2)
df21 <- rbind(df2, df1)
Now, if we plot each of those as before and compare, the reason for the discrepancy the OP posted should be quite obvious. I'm including the value for each tile for each originating dataset to make things super-clear.
tile12 <- ggplot(df12, aes(x,y, fill=value, label=value)) +
geom_tile() + labs(title="df1, then df2") +
geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
tile21 <- ggplot(df21, aes(x,y, fill=value, label=value)) +
geom_tile() + labs(title="df2, then df1") +
geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
plot_grid(tile12, tile21)
Note that the legend colorbar value does not change, so it's not doing an addition. Plus, since we know it's stat="identity", we know this should not be the case. When we use the dataset that contains first observations from df1, then observations from df2, the value plotted is the one from df2. When we use the dataset that contains observations first from df2, then from df1, the value plotted is the one from df1.
Given this piece of information, it becomes clear that the value shown in geom_tile() when using stat="identity" (the default) corresponds to the last observation for that particular tile in the data frame.
So, that's the reason why your plot looks odd, OP. You can either summarize beforehand as you have done, or use stat_summary(geom="tile"... to do the transformation in one go within ggplot.
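As a small check of that conclusion, here is a base R sketch that rebuilds df12 from the example and compares the per-tile mean (what the pre-summarised plot shows) with the last row per tile (what the un-summarised plot ends up showing, according to the observation above). The names tile_mean, tile_last, and comparison are mine and are not part of ggplot2.

```r
df1 <- data.frame(x = rep(paste("Test", 1:2), 2),
                  y = rep(c("A", "B"), each = 2),
                  value = c(5, 15, 20, 25))
df2 <- data.frame(x = rep(paste("Test", 1:2), 2),
                  y = rep(c("A", "B"), each = 2),
                  value = c(10, 5, 25, 15))
df12 <- rbind(df1, df2)

# Mean per tile: what summarising before plotting produces
tile_mean <- aggregate(value ~ x + y, data = df12, FUN = mean)

# Last row per tile: the value that ends up on top when tiles overlap
tile_last <- aggregate(value ~ x + y, data = df12,
                       FUN = function(v) v[length(v)])

# Side-by-side comparison, one row per tile
comparison <- merge(tile_mean, tile_last, by = c("x", "y"),
                    suffixes = c("_mean", "_last"))
print(comparison)
```

For every tile the two columns differ, which is why the summarised and un-summarised plots do not match when a tile holds more than one value.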

Barplot from Sums of Columns in Data Frame

I have a data frame with 190 observations (rows). There are five columns, in which every entry either has the value 0 or 1.
How do I get a barplot that has the name of the five columns on the x-Axis and the number of 1's (i.e. the sum) of every column as height of the bars - preferably with ggplot?
I'm sorry I don't have any sample data, I couldn't figure out how to produce a smaller dataframe that fits the descriptions.
### Load ggplot & create sample data
library(ggplot2)
n=190  # number of observations, as stated in the question
sampledata=data.frame(a=rbinom(n,1,0.7),b=rbinom(n,1,0.6),
c=rbinom(n,1,0.5),d=rbinom(n,1,0.2),e=rbinom(n,1,0.3))
### Sum the data using apply & use this for beautiful barplot
sumdata=data.frame(value=apply(sampledata,2,sum))
sumdata$key=rownames(sumdata)
ggplot(data=sumdata, aes(x=key, y=value, fill=key)) +
geom_bar(colour="black", stat="identity")
Just take the column sums and make a barplot.
barplot(colSums(iris[,1:4]))
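The two answers can be combined. The sketch below assumes simulated 0/1 data with column names a to e (not the asker's real data), uses colSums() for the totals, and then hands them to ggplot2. geom_col() is ggplot2's bar geom that uses the identity stat, so it behaves like geom_bar(stat="identity").

```r
set.seed(42)
n <- 190  # number of rows, as in the question
sampledata <- data.frame(a = rbinom(n, 1, 0.7), b = rbinom(n, 1, 0.6),
                         c = rbinom(n, 1, 0.5), d = rbinom(n, 1, 0.2),
                         e = rbinom(n, 1, 0.3))

# Named vector of column sums -> two-column data frame for ggplot
sums <- colSums(sampledata)
sumdata <- data.frame(key = names(sums), value = as.numeric(sums),
                      stringsAsFactors = FALSE)
print(sumdata)

# Draw the bars only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(sumdata, aes(x = key, y = value, fill = key)) +
          geom_col(colour = "black"))
}
```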

geom_density(aes(y=..count..)) plot for multiple groups show a wrong x-axis count

My data frame (df) consists of 5 columns with 2,000 numerical values for each one.
Using reshape I reformatted my data frame to two columns: 1st containing the values (df$Values) (a total of 10,000) and a 2nd containing the name of the column (df$Labels) from where the value in col 1 is coming from.
I will use the 2nd column as a group factor.
I generated two vectors, mycolor and myshapes, for setting the colour and the line type of the curves.
With ggplot I tried to generate a density plot with one curve for each of the five groups.
The problem is that the x-axis shows the counts, whose maximum is 10,000. This value does not make any sense, because the maximum possible count for each curve must be 2,000. Does anyone know what is going on? Which code do I need to use to correct the x-axis?
ggplot2, geom_density() plot:
Here is the code:
ggplot(df, aes(x=Values, colour=Labels, linetype=Labels))+
geom_density(aes(y=..count..))+
theme_classic()+
scale_colour_manual(values = mycolor)+
scale_linetype_manual(values = myshapes)+
ggtitle("Title")+
scale_x_continuous(limits = c(0.5,1.5))
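One likely explanation, offered as a sketch and not as a confirmed diagnosis: in ggplot2 the count computed for a density curve is the density multiplied by the number of points in the group. It is a rescaled density, not a tally of observations. When the values sit in a narrow range (here roughly 0.5 to 1.5), the density itself can be well above 1, so density times 2,000 can be well above 2,000. The example below reproduces this with base R's density() on simulated values; the normal distribution and its parameters are assumptions, not the asker's data.

```r
set.seed(1)
vals <- rnorm(2000, mean = 1, sd = 0.05)  # hypothetical narrow-range data

d <- density(vals)                 # kernel density estimate
count_curve <- d$y * length(vals)  # density rescaled by the group size

max(d$y)          # peak density, above 1 for such a narrow spread
max(count_curve)  # peak of the rescaled curve, above 2000

# The area under the rescaled curve is still about the group size
area <- sum(count_curve) * diff(d$x[1:2])
area
```

So a peak near 10,000 on the count scale does not mean 10,000 observations; only the area under each curve corresponds to the 2,000 values in the group. In recent ggplot2 versions the same quantity is written as after_stat(count) in place of ..count...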

How to barplot select rows of data from a dataframe in R?

This is my first time submitting a question, so apologies in advance if my formatting is not optimal.
I have a dataframe with roughly 6,000 rows of data in 2 columns, and I want to be able to pull out individual rows (and multiple rows together) to barplot.
I read my file in as a dataframe, here is a very small subset:
gene log2
1 SMa0002 0.457418
2 SMa0005 1.116950
3 SMa0007 0.686749
4 SMa0009 0.169450
5 SMa0011 0.393365
6 SMa0013 0.601940
So what I would want to be able to do is have a barplot where the x axis is a number of genes (SMaXXX, SMaXXX, SMaXXX, etc.), and the y-axis is the log2 column. It only has (+) values displayed, but there are (-) values as well. I have no real preference about whether I use barplot or geom_bar in ggplot2, or another plotter.
I know how to just plot the dataframe;
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity")
I've tried playing around with using 'match' but I haven't been able to figure out how to make that work. Ideally the code is versatile so I can just punch in different SMaXXXX codes to generate many different plots.
Thanks for reading!
It seems that you just need a way to subset your data.frame when plotting, right?
Let's assume you've got a vector subset.genes of the genes you need to plot:
df=data.frame(gene=c("SMa0002","SMa0005","SMa0006","SMa0007","SMa0011","SMa0013"),
"log2"=runif(6), stringsAsFactors=F)
subset.genes=sample(unique(df$gene), 4, replace=F)
A couple of ways:
1°) Inside ggplot2
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity") +
scale_x_discrete(limits=subset.genes)
2°) Before plotting, by subsetting the data frame:
df2 <- subset(df, gene %in% subset.genes)
ggplot(df2, aes(x = gene, y = log2)) + geom_bar(stat = "identity")
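Since the goal is to punch in different gene codes and get a plot each time, the subsetting can be wrapped in small functions. This is a sketch under stated assumptions: select_genes and plot_genes are hypothetical helper names, and the data frame is the small subset shown in the question.

```r
df <- data.frame(
  gene = c("SMa0002", "SMa0005", "SMa0007", "SMa0009", "SMa0011", "SMa0013"),
  log2 = c(0.457418, 1.116950, 0.686749, 0.169450, 0.393365, 0.601940),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the requested genes and remember the
# requested order through the factor levels
select_genes <- function(data, genes) {
  out <- data[data$gene %in% genes, , drop = FALSE]
  out$gene <- factor(out$gene, levels = intersect(genes, out$gene))
  out
}

# Hypothetical helper: bar plot of the selected genes (needs ggplot2)
plot_genes <- function(data, genes) {
  ggplot2::ggplot(select_genes(data, genes),
                  ggplot2::aes(x = gene, y = log2)) +
    ggplot2::geom_bar(stat = "identity")
}

picked <- select_genes(df, c("SMa0011", "SMa0002"))
print(picked)
# plot_genes(df, c("SMa0011", "SMa0002"))  # bars appear in the requested order
```

Gene codes that are not present in the data are silently dropped by the %in% filter, so a misspelled code gives a plot with fewer bars instead of an error.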

Overlay multiple lines from data frame with index column onto existing plot

I have a dataframe with 3 columns, (Id, Lat, Long), you can construct a small section of this with the following data:
df <- data.frame(
Id=c(1,1,2,2,2,2,2,2,3,3,3,3,3,3),
Lat=c(58.12550, 58.17426, 58.46461, 58.45812, 58.45207, 58.44512, 58.43358, 58.42727, 57.77700, 57.76034, 57.73614, 57.72411, 57.70498, 57.68453),
Long=c(-5.098068, -5.314452, -4.914108, -4.899922, -4.887067, -4.873312, -4.852384, -4.840817, -5.666568, -5.648711, -5.617588, -5.594681, -5.557740, -5.509405))
The Id column is an index column. So all the rows with the same Id number have the coordinates for a single line. In my data frame this Id number varies from 1 through to 7696. So I have 7696 lines to plot.
Each Id number relates to an individual separate line of Lat and Long coordinates. What I want to do is overlay onto an existing plot all of these 7696 individual lines.
With the example data above this contains the Lat & Long coordinates for lines 1, 2, 3.
What is the best way to overlay all these lines onto an existing plot? I was thinking maybe some kind of loop.
Using ggplot2:
#dummy data
df <- data.frame(
Id=c(1,1,2,2,2,2,2,2,3,3,3,3,3,3),
Lat=c(58.12550, 58.17426, 58.46461, 58.45812, 58.45207, 58.44512, 58.43358, 58.42727, 57.77700, 57.76034, 57.73614, 57.72411, 57.70498, 57.68453),
Long=c(-5.098068, -5.314452, -4.914108, -4.899922, -4.887067, -4.873312, -4.852384, -4.840817, -5.666568, -5.648711, -5.617588, -5.594681, -5.557740, -5.509405))
library(ggplot2)
#plot
ggplot(data=df,aes(Lat,Long,colour=as.factor(Id))) +
geom_line()
Using base R:
#plot blank
with(df,plot(Lat,Long,type="n"))
#plot lines
for(i in unique(df$Id))
with(df[ df$Id==i,],lines(Lat,Long,col=i))
To be honest, I think that any approach you take is going to result in a very cluttered plot, since you have so many Ids (unless their lines do not overlap much). Either way, I would probably use ggplot2 for this.
##
if( !("ggplot2" %in% installed.packages()[,1]) ){
install.packages("ggplot2",dependencies=TRUE)
}
library(ggplot2)
##
## df is the data frame defined in the question
D <- data.frame(
  Id=df$Id,
  Lat=df$Lat,
  Long=df$Long
)
##
ggplot(data=D,aes(x=Lat,y=Long,group=Id,color=Id))+
geom_point()+ ## you might want to omit geom_point() in your plot
geom_line()
##
The reason I used group=Id, color=Id in aes(), rather than passing Id as a factor and just using color=Id, is that with a factor you would end up with a legend containing 7000+ factor levels (the majority of which would not be visible in the plot area).
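One caveat for the ggplot2 answers: geom_line() connects points in order of the x value, not in row order. If the rows of each Id are already in the order the line should follow, the base R sketch below keeps that order: it splits the data by Id and draws each piece with lines() on top of whatever plot is already open. The blank plot() call only stands in for the existing plot, and putting Long on x and Lat on y is my assumption about the map orientation.

```r
df <- data.frame(
  Id = c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  Lat = c(58.12550, 58.17426, 58.46461, 58.45812, 58.45207, 58.44512,
          58.43358, 58.42727, 57.77700, 57.76034, 57.73614, 57.72411,
          57.70498, 57.68453),
  Long = c(-5.098068, -5.314452, -4.914108, -4.899922, -4.887067, -4.873312,
           -4.852384, -4.840817, -5.666568, -5.648711, -5.617588, -5.594681,
           -5.557740, -5.509405)
)

# Stand-in for the existing plot: an empty frame covering the data range
plot(df$Long, df$Lat, type = "n", xlab = "Long", ylab = "Lat")

# One data frame per Id; rows keep their original order within each piece
pieces <- split(df, df$Id)

# Overlay each line; lines() adds to the current plot without redrawing it
for (piece in pieces) {
  lines(piece$Long, piece$Lat)
}
```

With ggplot2, the counterpart that also follows row order would be geom_path(aes(group=Id)) in place of geom_line().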
