ggplot geom_boxplot for gene expression data - r

I am trying to get boxplots for 4 different genes with the expression data for each gene across multiple patients.
I've tried multiple ways and just keep hitting errors. I can do it using the base boxplot() function, but can't figure it out in ggplot and I can't see anywhere to help - spent hours reading other answers and questions yesterday! Mostly all other data seems to be as 2 columns so can specify x = column a and y = column b. However, I want to plot all 4 columns of my entire df and I couldn't find any help with that. I can do one at a time in ggplot but not all 4 together.
The data I have, BCON_sig_genes, is 4 genes each with values between 3-6 for 152 samples. The df is 152 obs of 4 variables, where the 4 columns are headed each of the gene names and all the cells are values as shown below.
CD3E LAT ZAP70 LCK
1002 4.214679 5.652482 4.788204 5.393783
1022 4.424925 5.776641 4.864269 5.593587
8035 4.327270 5.725364 4.509920 4.961659
8037 4.415715 5.494048 4.435241 5.081846
9004 4.290078 5.265329 4.799106 5.275424
9005 4.233490 5.338098 4.666506 5.069394
The following code gets me one gene at a time, by substituting in the name of the gene.
BCON_sig_genes %>% ggplot(aes(y = CD3E, x = "CD3E"))+ geom_boxplot()
[Figure: ggplot boxplot, one gene only]
I have tried gene <- colnames(BCON_sig_genes) and then inputting x = gene but it doesn't work and comes up with the following error message:
Error: Aesthetics must be either length 1 or the same as the data (152): x
I think I need to sort out what y is. I tried leaving blank so it would take all the data and sort for each column but no luck.
I tried using a gather() function and making key and value but I couldn't quite figure it out without getting errors... but this felt like I was on the right track!
With the base function all I have to do is boxplot(BCON_sig_genes) and it just plots all 4 genes on a graph with the correct values.
[Figure: base function boxplot, all genes]
I think I need to wrangle the data better for ggplot so I can tell it that y is just all the expression values for each column but I'm not sure how.
Any help would be much appreciated!!
Thanks, Vicky

For ggplot to work, you need the data in long format, which means the gene names go in one column and their expression values in a second column. You had the right idea with gather(), but gather() has been superseded by pivot_longer().
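library(tidyverse)
# Reshape from wide (one column per gene) to long (one row per sample-gene pair), then plot
BCON_sig_genes %>%
  pivot_longer(cols = CD3E:LCK,
               names_to = "gene",
               values_to = "expression") %>%
  ggplot(aes(x = gene,
             y = expression)) +
  geom_boxplot()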
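To make the reshaping step concrete, here is a minimal sketch that rebuilds only the six rows printed in the question and shows what the long table looks like before it reaches ggplot. The `sample` column is my own addition (an assumption, not part of the original data frame) so that the row names survive the pivot.

```r
library(tidyr)
library(ggplot2)

# The six rows shown in the question, typed in by hand; row names are the sample IDs.
BCON_sig_genes <- data.frame(
  CD3E  = c(4.214679, 4.424925, 4.327270, 4.415715, 4.290078, 4.233490),
  LAT   = c(5.652482, 5.776641, 5.725364, 5.494048, 5.265329, 5.338098),
  ZAP70 = c(4.788204, 4.864269, 4.509920, 4.435241, 4.799106, 4.666506),
  LCK   = c(5.393783, 5.593587, 4.961659, 5.081846, 5.275424, 5.069394),
  row.names = c("1002", "1022", "8035", "8037", "9004", "9005")
)

# Assumption: keep the sample ID as a real column, because pivot_longer()
# returns a tibble and tibbles do not carry row names.
BCON_sig_genes$sample <- rownames(BCON_sig_genes)

# One row per sample-gene pair: 6 samples x 4 genes = 24 rows.
long <- pivot_longer(BCON_sig_genes,
                     cols = CD3E:LCK,
                     names_to = "gene",
                     values_to = "expression")

# Every gene now shares a single y column, so one aes() covers all four boxes.
p <- ggplot(long, aes(x = gene, y = expression)) +
  geom_boxplot()
```

Printing `long` should show the columns sample, gene and expression; printing `p` draws the four boxplots side by side.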

Related

How would you create categorical "bins" for a boxplot over time in R?

Been working on this and haven't been able to find a decent answer.
Basically, I've got a dataset of NBA Player height vs draft year, and I am trying to create a boxplot to show how player height has changed overtime (this is for a hw assignment, so a boxplot is necessary). My dataset (nba_data) looks like the table below, but I have 10k rows ranging from players drafted in the 60s all the way to the 2000s.
player_name draft_year height_in
player_a    1998       76
player_b    1972       81
player_c    2012       79
So far the closest I've gotten is
ggplot(data = nba_data, aes(x = draft_year,
y = height_in,
group = cut(x = draft_year, breaks = 5))) +
geom_boxplot()
And this is the result I get. As far as I understand, breaks being set to 5 should separate my years into 5 year buckets, right?
I created the same graph in Excel to get an idea of what it should look like:
I also attempted to create categories with cut, but was unable to apply it to my boxgraph. I mostly code in Python, but have to learn R for a class at school - any help is greatly appreciated.
Thanks!
Edit: Another question I guess would be how the "Undrafted" players would fit into this, since R seems to want to coerce the draft_year column as numerical to fit into a box plot.
From the ?cut help page, the breaks argument is:
breaks
either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You gave it a single number, so that's interpreted as the number of intervals.
Instead, you should give it a vector of exact breakpoints, something like breaks = seq(1960, 2020, by = 5).
I'm surprised you think your axis is being numericized--it's definitely a continuous axis, but I've never heard of ggplot doing that to a string or factor input--check your data frame to make sure the "Undrafted" rows are really there, they might have gotten dropped or converted to NA at some point. But that's a good thing for cut, because cut will only work on numerics. I'd suggest cutting the column as numeric to create a bin column, and then replace NAs in the bin column with "Undrafted".
If you don't mind using a package, you can get the effect you want with:
library(santoku)
ggplot(..., aes(..., group = chop_width(draft_year, 5)))
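The binning advice above can be sketched in base R plus ggplot2 as follows. The four rows are a hypothetical stand-in for nba_data (player_d and its values are made up for illustration), and I am assuming draft_year is stored as text because of the "Undrafted" entries.

```r
library(ggplot2)

# Hypothetical stand-in for nba_data; draft_year is character because of "Undrafted".
nba_data <- data.frame(
  player_name = c("player_a", "player_b", "player_c", "player_d"),
  draft_year  = c("1998", "1972", "2012", "Undrafted"),
  height_in   = c(76, 81, 79, 78),
  stringsAsFactors = FALSE
)

# Convert to numeric; "Undrafted" becomes NA (the coercion warning is expected).
year_num <- suppressWarnings(as.numeric(nba_data$draft_year))

# Explicit breakpoints give 5-year bins; dig.lab = 4 keeps the labels as full years.
bins <- cut(year_num, breaks = seq(1960, 2020, by = 5), right = FALSE, dig.lab = 4)

# Replace the NA bin with an explicit "Undrafted" level placed last on the axis.
bin_labels <- as.character(bins)
bin_labels[is.na(bin_labels)] <- "Undrafted"
nba_data$draft_bin <- factor(bin_labels, levels = c(levels(bins), "Undrafted"))

# A discrete x gives one box per bin, so no group aesthetic is needed.
p <- ggplot(nba_data, aes(x = draft_bin, y = height_in)) +
  geom_boxplot()
```

With right = FALSE each bin includes its lower year and excludes its upper one, so 1995 to 1999 land together. If you would rather have 1996 to 2000 in one bin, drop that argument.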

How to prepare my data for spaghetti plots [duplicate]

I would like to create a spaghetti plot similar to this one here.
Unfortunately my data looks like this.
I have 11 columns that have NA's, so I remove them with
neuron1 <- drop_na(neuron)
Then I have a datatable with 13 columns and 169 rows. My goal is to display the expression of each gene across these 169 rows. Basically I would only need the "area" on the x-axis and on the y-axis the 11 genes. I am able to plot the data, but only when selecting the genes specifically e.g with this code:
ggplot(neuron1, aes(area)) +
geom_line(aes(y=MAP2, group=1)) +
geom_line(aes(y=REEP1, group=1, color="red"))
It would be okay to repeat this 11 times but I have some datasets with more genes so it would really be nice to be able to group them properly and then run a short code.
Thank you very much in advance!
Too long for a comment. Lacking your data, this code may be a bit of a shot in the dark.
Try something like this:
library(tidyverse)
# Put every gene column into key/value pairs, keeping area and region as identifiers
yourdata %>%
  pivot_longer(cols = c(-area, -region), names_to = "key", values_to = "value") %>%
  ggplot(aes(area, value)) +
  geom_line(aes(group = key))
I'm not sure how this will work with area on the x-axis, because it's a categorical variable (so I'm not sure geom_line is the right choice for this visualisation).
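Since the real data was never posted, here is a small runnable sketch of the same idea on a made-up table. The column names area, region, MAP2 and REEP1 come from the question, but the values and the three area labels are invented for illustration only.

```r
library(tidyr)
library(ggplot2)

# Made-up stand-in for neuron1: two identifier columns plus one column per gene.
neuron1 <- data.frame(
  area   = c("A1", "A2", "A3"),
  region = c("r1", "r1", "r2"),
  MAP2   = c(1.2, 1.5, 1.1),
  REEP1  = c(0.4, 0.9, 0.7)
)

# Everything except area and region is treated as a gene column.
long <- pivot_longer(neuron1,
                     cols = c(-area, -region),
                     names_to = "gene",
                     values_to = "expression")

# group = gene draws one line per gene, even though area is categorical.
p <- ggplot(long, aes(x = area, y = expression, group = gene, colour = gene)) +
  geom_line()
```

Because the gene columns are selected by exclusion, the same code should work unchanged for datasets with more genes, as long as area and region are the only non-gene columns.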

How to Plot line graph in R with the following Data

I want a line graph of around 145 data observations using R, the format of data is as below
Date Total Confirmed Total Deceased
3-Mar 6 0
4-Mar 28 0
5-Mar 30 5
.
.
.
141 more obs like this
I'm new to ggplot2 in R, so I don't know how to get the graph. I tried plotting it, but the dates on the x-axis overlapped and were not readable. I want a line graph of the Total Confirmed and Total Deceased columns together, with dates on the x-axis. Please also tell me how to colour the lines, as I want a colourful graph. Thank you so much.
Similar questions like this give a lot of errors, so I would like an answer for my specific requirements.
There are a lot of resources to help you create what you are looking to do - and even quite a few questions already answered here. However, I understand it's tough starting out, so here's a quick example to get you started.
Sample Data:
df <- data.frame(
dates=c('2020-01-01','2020-02-01','2020-03-03','2020-03-14','2020-04-01'),
var1=c(13,15,18,29,40),
var2=c(5,8,11,13,18)
)
If you are plotting by date on your x axis, you need to ensure that df$dates is formatted as a "Date" class (or one of the other date-like classes). You can do that via:
df$dates <- as.Date(df$dates, format='%Y-%m-%d')
The format= argument of as.Date() should follow the conventions indicated in strptime(). Just type ?strptime in your console and you can see in the help for that function how the various terms are defined for format=.
The next step is very important: recognize that the data is in "wide" format, not "long" format. ggplot2 and the related packages work best with data in what is known as Tidy Data format, which is also convenient for most analysis. In your data, the measure itself is a number of people, and the type of the measure is either cases or deaths. So "number of people" is spread over two columns, and the information on "type of measure" is stuck in the column names when it should be a variable in the dataset. Your goal should be to gather() those two columns together and create two new columns: (1) one to indicate whether the number is "cases" or "deaths", and (2) the number itself. In the example I've shown you can do this via:
library(dplyr)
library(tidyr)
library(ggplot2)
df <- df %>% gather(key='var_name', value='number', -dates)
The result is that the data frame has columns for:
dates: unchanged
var_name: contains either var1 or var2 as a character class
number: the actual number
Finally, for the plot, the code is quite simple. You apply dates to the x aesthetic, number to y, and use var_name to differentiate color for the line geom:
ggplot(df, aes(x=dates, y=number)) +
geom_line(aes(color=var_name))
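The overlapping date labels from the question can usually be handled on the x scale. Below is a rough sketch using the three rows shown in the question. The year 2020 and the column names total_confirmed and total_deceased are my assumptions, and pivot_longer() is used here as the current equivalent of gather().

```r
library(tidyr)
library(ggplot2)

# The three rows from the question; the year 2020 is assumed, it is not in the data.
df <- data.frame(
  dates           = as.Date(c("2020-03-03", "2020-03-04", "2020-03-05")),
  total_confirmed = c(6, 28, 30),
  total_deceased  = c(0, 0, 5)
)

# Same reshaping as gather(): one row per date and measure.
long <- pivot_longer(df, cols = -dates, names_to = "var_name", values_to = "number")

p <- ggplot(long, aes(x = dates, y = number, colour = var_name)) +
  geom_line() +
  # Control how many date labels are drawn and how they are written.
  # For about 145 daily rows, something like date_breaks = "2 weeks" is less crowded.
  scale_x_date(date_breaks = "1 day", date_labels = "%d-%b") +
  # Rotate the labels so neighbouring dates do not overlap.
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Mapping colour to var_name already gives each line its own colour; scale_colour_manual() lets you pick specific colours if the defaults are not to your taste.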

How do I plot grouped data sorted by the number of entries in each group, in R?

I have data that look like this sample here: http://pastebin.com/5MPCFGWK
I need to plot each id as a timeline, thus I do something like this.
ggplot(df, aes(x=relative_timestamp, y=id, color=action))
which kind of works, except that it's not the most helpful chart. I figured I'd try to sort the groups by how many events they have, but I can't figure out how. I tried my hand at dplyr but I got confused with the docs, and barely managed to group the dataframe by id. Ideas?
EDIT I added a sample CSV. My goal is to plot those timelines sorted by how many entries they have, so in this case 0 is the one with the least amount, and 1 is the one with the biggest amount. Extra good would be to plot them (separate plot, not the same as above) sorted by the time the last CLOSE action occurs (there should be exactly one in each group anyway).
You will need to convert id from numeric to a factor, and then order those factors by whatever metric you are interested in. Here, I used dplyr to create a data.frame called forSort that holds the id's and a set of things you might want to sort on:
forSort <-
testDF %>%
group_by(id) %>%
summarise(n = n()
, max = max(relative_timestamp))
forSort
# id n max
# 1 0 12 244753
# 2 1 85 447680
# 3 2 22 156005
By number of actions:
ggplot(testDF %>%
mutate(id = factor(id, levels = forSort$id[order(forSort$n)]))
, aes(x=relative_timestamp
, y= id
, color=action)) +
geom_point()
By time of last action:
ggplot(testDF %>%
mutate(id = factor(id, levels = forSort$id[order(forSort$max)]) )
, aes(x=relative_timestamp
, y= id
, color=action)) +
geom_point()
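If you would rather skip the separate summary table, base R's reorder() can set the factor levels in one step. This is only a sketch on a tiny made-up data frame (the real pastebin data is not reproduced here), and the column names id_by_n and id_by_last are my own.

```r
library(ggplot2)

# Tiny made-up stand-in: id 0 has 2 events, id 1 has 4, id 2 has 3.
testDF <- data.frame(
  id = c(0, 0, 1, 1, 1, 1, 2, 2, 2),
  relative_timestamp = c(10, 20, 5, 15, 25, 35, 8, 18, 50),
  action = c("OPEN", "CLOSE", "OPEN", "EDIT", "EDIT", "CLOSE", "OPEN", "EDIT", "CLOSE")
)

# Order the levels by the number of rows per id (fewest first).
testDF$id_by_n <- reorder(factor(testDF$id), testDF$id, FUN = length)

# Order the levels by the latest timestamp per id (earliest finisher first).
testDF$id_by_last <- reorder(factor(testDF$id), testDF$relative_timestamp, FUN = max)

# Plot with either ordering on the y axis.
p <- ggplot(testDF, aes(x = relative_timestamp, y = id_by_n, color = action)) +
  geom_point()
```

reorder() computes FUN per level and sorts the levels by the result in ascending order, which is the same ordering the forSort approach produces with order().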

ggplot holes in stacked area chart

Here is a link to my data.
I use the following code:
library(dplyr)
library(ggplot2)
# read in data
data = read.csv("ggplot_data.csv")
# order by group then year
data = arrange(data, group, year)
# generate ggplot stacked area chart
plot = ggplot(data, aes(x=year,y=value, fill=group)) +
geom_area()
plot
That produces the following chart:
As you can see, there are odd holes in three different parts of this chart.
I previously had this issue and asked about it, and the answer provided then was that I needed to sort my data by group and then year. At the time, that answer fixed my holes. However, it doesn't seem to eliminate all the holes this time. Any help?
The reason for the gaps is that some time series start later than others. When the first non-zero value appears, the new area starts with a discontinuous jump. The area just above it, however, is connected to the next point by linear interpolation. This results in the gap.
For example, look at the left-most gap. The olive region starts just after the gap with a vertical jump in 1982. The green area, however, increases linearly from the value in 1981 (where the olive area is zero) to the value in 1982 (where the olive area suddenly contributes).
What you could do is, for instance, add a value of zero at the beginning of each time series that starts after 1975. I use dplyr functionality to create a data frame of these additional first years:
first_years <- group_by(data, group, group_id) %>%
summarise(year = min(year) - 1) %>%
filter(year > 1974) %>%
mutate(value = 0, value_pct = 0)
first_years
## Source: local data frame [3 x 5]
## Groups: group [3]
##
## group group_id year value value_pct
## (fctr) (int) (dbl) (dbl) (dbl)
## 1 c 10006 1981 0 0
## 2 e 10022 2010 0 0
## 3 i 24060 2002 0 0
As you can see, these three new values fit exactly the three gaps in your plot. Now you can combine this new data frame with your data and sort the result:
data_complete <- bind_rows(data, first_years) %>%
arrange(year, group)
And the plot then has no gaps:
ggplot(data_complete, aes(x=year,y=value, fill=group)) +
geom_area()
@Stibu's answer is probably best, but for those of us who are not very R-savvy and don't know how to go through a dataset with R to find missing rows and fill them with zeros, I solved this issue with a bit of a different approach.
For my case, I created a dummy dataset with zeroes for all years and all groups, then appended it to my original dataset. This way I added rows for years where before there was simply no rows of data. After aggregating by year and group, my aggregated dataset then contained rows with zero, as opposed to no rows existing at all. This removed all those weird gaps for me.
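For anyone who wants the "fill every missing year with zero" idea without building the dummy dataset by hand, tidyr's complete() can do it in one call. This is a sketch on a made-up three-year example, not the asker's ggplot_data.csv. Note that it pads all missing group/year combinations, which is slightly different from adding a single zero just before each series starts.

```r
library(tidyr)
library(ggplot2)

# Made-up example: group "c" only starts in 1982, so it has no rows for 1980 and 1981.
data <- data.frame(
  group = c("a", "a", "a", "c"),
  year  = c(1980, 1981, 1982, 1982),
  value = c(5, 6, 7, 3)
)

# Add a row for every group/year combination that is missing, with value = 0.
data_complete <- complete(data, group, year, fill = list(value = 0))

# Every group now has a value at every year, so the stacked areas line up.
p <- ggplot(data_complete, aes(x = year, y = value, fill = group)) +
  geom_area()
```

Whether a zero is a sensible stand-in for "no data" depends on what the values mean, so check that before padding a real dataset this way.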
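Another option is to add position = "identity" to geom_area(), e.g. with your code above:
ggplot(data, aes(x=year,y=value, fill=group)) +
geom_area(position = "identity")
Note that this draws each area from a baseline of zero, so the areas overlap instead of being stacked.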
I found it simpler to save my table into csv and use Python's matplotlib function stackplot, which does not seem to have issues with negative numbers.
