How to prepare my data for spaghetti plots [duplicate] - r

This question already has answers here:
Plot multiple lines in one graph [duplicate]
(3 answers)
Closed 2 years ago.
I would like to create a spaghetti plot similar to this one here.
Unfortunately my data looks like this.
I have 11 columns that have NA's, so I remove them with
neuron1 <- drop_na(neuron)
Then I have a data table with 13 columns and 169 rows. My goal is to display the expression of each gene across these 169 rows. Basically I would only need the "area" on the x-axis and the 11 genes on the y-axis. I am able to plot the data, but only when selecting the genes specifically, e.g. with this code:
ggplot(neuron1, aes(area)) +
geom_line(aes(y=MAP2, group=1)) +
geom_line(aes(y=REEP1, group=1, color="red"))
It would be okay to repeat this 11 times but I have some datasets with more genes so it would really be nice to be able to group them properly and then run a short code.
Thank you very much in advance!

Too long for a comment. Lacking your data, this code may be a bit of a shot in the dark.
Try something like this:
library(tidyverse)
yourdata %>%
  pivot_longer(cols = c(-area, -region), names_to = "key", values_to = "value") %>%
  ggplot(aes(area, value)) +
  geom_line(aes(group = key))
Not sure how this will work with your area as an x, because it's a categorical variable (therefore not sure if geom_line is the right choice for visualisation)
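As a rough, self-contained sketch of the reshaping step above: the data frame below is invented for illustration (the gene names MAP2 and REEP1 come from the question, but the values, the area labels and the existence of a region column are assumptions). It also colours each gene, which the answer's code does not do.

```r
library(tidyr)
library(ggplot2)

# Made-up stand-in for the question's data: one row per area, one column per gene.
neuron1 <- data.frame(
  area   = c("A1", "A2", "A3"),
  region = c("cortex", "cortex", "striatum"),
  MAP2   = c(5.1, 4.8, 6.0),
  REEP1  = c(2.3, 2.9, 2.5)
)

# Wide -> long: every gene column becomes rows of (key, value).
neuron_long <- pivot_longer(
  neuron1,
  cols = c(-area, -region),
  names_to = "key",
  values_to = "value"
)

# One line per gene; group = key tells geom_line which points belong together,
# which matters because area is categorical.
p <- ggplot(neuron_long, aes(area, value, group = key, colour = key)) +
  geom_line()
```

Calling print(p) draws the plot. With more genes nothing changes in the code, because every extra gene column simply adds more rows to the long table.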

Related

How would you create categorical "bins" for a boxplot over time in R?

Been working on this and haven't been able to find a decent answer.
Basically, I've got a dataset of NBA player height vs draft year, and I am trying to create a boxplot to show how player height has changed over time (this is for a hw assignment, so a boxplot is necessary). My dataset (nba_data) looks like the table below, but I have 10k rows ranging from players drafted in the 60s all the way to the 2000s.
player_name  draft_year  height_in
player_a     1998        76
player_b     1972        81
player_c     2012        79
So far the closest I've gotten is
ggplot(data = nba_data, aes(x = draft_year,
y = height_in,
group = cut(x = draft_year, breaks = 5))) +
geom_boxplot()
And this is the result I get. As far as I understand, breaks being set to 5 should separate my years into 5 year buckets, right?
I created the same graph in Excel to get an idea of what it should look like:
I also attempted to create categories with cut, but was unable to apply it to my boxgraph. I mostly code in Python, but have to learn R for a class at school - any help is greatly appreciated.
Thanks!
Edit: Another question I guess would be how the "Undrafted" players would fit into this, since R seems to want to coerce the draft_year column as numerical to fit into a box plot.
From the ?cut help page, the breaks argument is:
breaks
either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You gave it a single number, so that's interpreted as the number of intervals.
Instead, you should give it a vector of exact breakpoints, something like breaks = seq(1960, 2020, by = 5).
I'm surprised you think your axis is being numericized--it's definitely a continuous axis, but I've never heard of ggplot doing that to a string or factor input--check your data frame to make sure the "Undrafted" rows are really there, they might have gotten dropped or converted to NA at some point. But that's a good thing for cut, because cut will only work on numerics. I'd suggest cutting the column as numeric to create a bin column, and then replace NAs in the bin column with "Undrafted".
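A minimal sketch of that suggestion, assuming draft_year is stored as text because of the "Undrafted" entries. The first three rows are from the question; player_d and its height are made up to show the undrafted case.

```r
# Hypothetical sample: draft_year is character because of "Undrafted".
nba_data <- data.frame(
  player_name = c("player_a", "player_b", "player_c", "player_d"),
  draft_year  = c("1998", "1972", "2012", "Undrafted"),
  height_in   = c(76, 81, 79, 78)
)

# "Undrafted" cannot be parsed as a number, so it becomes NA here.
year_num <- suppressWarnings(as.numeric(nba_data$draft_year))

# Explicit 5-year breakpoints; dig.lab = 4 keeps four-digit years in the labels.
bins <- cut(year_num, breaks = seq(1960, 2020, by = 5), dig.lab = 4)

# Turn the factor into text and relabel the NA bin.
nba_data$draft_bin <- as.character(bins)
nba_data$draft_bin[is.na(nba_data$draft_bin)] <- "Undrafted"
```

The new draft_bin column can then be mapped to x (instead of group) in aes(), so each bucket, including "Undrafted", gets its own box.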
If you don't mind using a package, you can get the effect you want with:
library(santoku)
ggplot(..., aes(..., group = chop_width(draft_year, 5)))

How to Plot line graph in R with the following Data

I want a line graph of around 145 data observations using R, the format of data is as below
Date    Total Confirmed  Total Deceased
3-Mar   6                0
4-Mar   28               0
5-Mar   30               5
...
(141 more obs like this)
I'm new to ggplot2 in R so I don't know how to get the graph. I tried plotting it, but the dates on the x-axis overlapped and were not visible. I want a line graph of the Total Confirmed column and the Total Deceased column together, with dates on the x-axis. Please help, and please also tell me how to colour the lines, as I want a colourful graph. Thank you so much.
Similar questions give a lot of errors when I try their answers, so I would like an answer for my specific requirements.
There are a lot of resources to help you create what you are looking to do - and even quite a few questions already answered here. However, I understand it's tough starting out, so here's a quick example to get you started.
Sample Data:
df <- data.frame(
dates=c('2020-01-01','2020-02-01','2020-03-03','2020-03-14','2020-04-01'),
var1=c(13,15,18,29,40),
var2=c(5,8,11,13,18)
)
If you are plotting by date on your x axis, you need to ensure that df$dates is formatted as a "Date" class (or one of the other date-like classes). You can do that via:
df$dates <- as.Date(df$dates, format='%Y-%m-%d')
The format= argument of as.Date() should follow the conventions indicated in strptime(). Just type ?strptime in your console and you can see in the help for that function how the various terms are defined for format=.
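For dates written like the ones in the question ("3-Mar"), a sketch of the conversion could look as follows. The year 2020 is an assumption, since the question's data has no year, and %b depends on the locale, so this assumes English month abbreviations.

```r
# Dates as they appear in the question: day and abbreviated month, no year.
raw_dates <- c("3-Mar", "4-Mar", "5-Mar")

# Assumption: the observations are from 2020; append the year so it is explicit.
# %d = day of month, %b = abbreviated month name (locale-dependent), %Y = 4-digit year.
dates <- as.Date(paste0(raw_dates, "-2020"), format = "%d-%b-%Y")
```

Appending the year avoids relying on strptime() filling in the current year for an incomplete date.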
The next step is very important: recognize that the data is in "wide" format, not "long" format. You will usually want your data in what is known as Tidy Data format, which is convenient for most analyses and is what ggplot2 and the related packages expect. In your data, the measure itself is a number of people, and the type of the measure is either cases or deaths. So "number of people" is spread over two columns, and the information on "type of measure" is stuck as a name for each column when it should be a variable in the dataset. Your goal should be to gather() those two columns together and create two new columns: (1) one to indicate if the number is "cases" or "deaths", and (2) the number itself. In the example I've shown you can do this via:
library(dplyr)
library(tidyr)
library(ggplot2)
df <- df %>% gather(key='var_name', value='number', -dates)
The result is that the data frame has columns for:
dates: unchanged
var_name: contains either var1 or var2 as a character class
number: the actual number
Finally, for the plot, the code is quite simple. You apply dates to the x aesthetic, number to y, and use var_name to differentiate color for the line geom:
ggplot(df, aes(x=dates, y=number)) +
geom_line(aes(color=var_name))
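To address the overlapping date labels and the line colours from the question, here is a hedged sketch on the three rows shown there. The column names, the year 2020 and the colour choices are assumptions, and it uses pivot_longer(), the newer tidyr counterpart of gather(), rather than the call shown above.

```r
library(tidyr)
library(ggplot2)

# Small stand-in for the question's table (its first three rows); year assumed.
covid <- data.frame(
  date      = as.Date(c("2020-03-03", "2020-03-04", "2020-03-05")),
  confirmed = c(6, 28, 30),
  deceased  = c(0, 0, 5)
)

# Wide -> long: one row per date and measure.
covid_long <- pivot_longer(covid, cols = -date,
                           names_to = "var_name", values_to = "number")

p <- ggplot(covid_long, aes(x = date, y = number, colour = var_name)) +
  geom_line() +
  # Control how many date labels are drawn and how they are formatted.
  scale_x_date(date_breaks = "1 day", date_labels = "%d %b") +
  # Pick the line colours by hand.
  scale_colour_manual(values = c(confirmed = "steelblue", deceased = "firebrick"))
```

With around 145 daily observations, a wider spacing such as date_breaks = "2 weeks" would probably be needed to keep the labels from overlapping.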

ggplot geom_boxplot for gene expression data

I am trying to get boxplots for 4 different genes with the expression data for each gene across multiple patients.
I've tried multiple ways and just keep hitting errors. I can do it using the base boxplot() function, but can't figure it out in ggplot and I can't see anywhere to help - spent hours reading other answers and questions yesterday! Mostly all other data seems to be as 2 columns so can specify x = column a and y = column b. However, I want to plot all 4 columns of my entire df and I couldn't find any help with that. I can do one at a time in ggplot but not all 4 together.
The data I have, BCON_sig_genes, is 4 genes each with values between 3-6 for 152 samples. The df is 152 obs of 4 variables, where the 4 columns are headed each of the gene names and all the cells are values as shown below.
         CD3E      LAT    ZAP70      LCK
1002 4.214679 5.652482 4.788204 5.393783
1022 4.424925 5.776641 4.864269 5.593587
8035 4.327270 5.725364 4.509920 4.961659
8037 4.415715 5.494048 4.435241 5.081846
9004 4.290078 5.265329 4.799106 5.275424
9005 4.233490 5.338098 4.666506 5.069394
The following code gets me one gene at a time, by substituting in the name of the gene.
BCON_sig_genes %>% ggplot(aes(y = CD3E, x = "CD3E"))+ geom_boxplot()
ggplot boxplot 1 gene only
I have tried gene <- colnames(BCON_sig_genes) and then inputting x = gene but it doesn't work and comes up with the following error message:
Error: Aesthetics must be either length 1 or the same as the data (152): x
I think I need to sort out what y is. I tried leaving blank so it would take all the data and sort for each column but no luck.
I tried using a gather() function and making key and value but I couldn't quite figure it out without getting errors... but this felt like I was on the right track!
With the base function all I have to do is boxplot(BCON_sig_genes) and it just plots all 4 genes on a graph with the correct values. (Image: base function boxplot, all genes.)
I think I need to wrangle the data better for ggplot so I can tell it that y is just all the expression values for each column but I'm not sure how.
Any help would be much appreciated!!
Thanks, Vicky
For ggplot to work, you need to get the data into a long format, which basically means you get the gene names in column 1 and their expression in column 2. You had the right idea with gather, but gather is being replaced with pivot_longer.
library(tidyverse)
# BCON_sig_genes is the data frame from the question
BCON_sig_genes %>%
  pivot_longer(cols = CD3E:LCK,
               names_to = "gene",
               values_to = "expression") %>%
  ggplot(aes(x = gene,
             y = expression)) +
  geom_boxplot()
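To see what the reshaping does before plotting, here is a small sketch using the first three sample rows copied from the question (the name long is just illustrative):

```r
library(tidyr)

# First three sample rows from the question; row names are the sample IDs.
BCON_sig_genes <- data.frame(
  CD3E  = c(4.214679, 4.424925, 4.327270),
  LAT   = c(5.652482, 5.776641, 5.725364),
  ZAP70 = c(4.788204, 4.864269, 4.509920),
  LCK   = c(5.393783, 5.593587, 4.961659),
  row.names = c("1002", "1022", "8035")
)

# Wide -> long: 3 samples x 4 genes become 12 rows with two columns.
long <- pivot_longer(BCON_sig_genes, cols = CD3E:LCK,
                     names_to = "gene", values_to = "expression")
dim(long)
```

Each gene name now appears once per sample in the gene column, which is why x = gene produces one box per gene.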

R, ggplot bar, all bars same width? [duplicate]

This question already has answers here:
Don't drop zero count: dodged barplot
(6 answers)
Closed 6 years ago.
I'm trying to find a ggplot specific work around so that I can generate bar plots in which all the bars are the same width. I know that this is because I am "missing values" and because the bar width fills in side-to-side over a blank. BUT I'm working with very large data sets and using reshape to make the data wide and then inserting place holder values to eliminate blanks is not something I want to do.
Test data:
df<-data.frame(tax=c("type1","type1","type1","type1","type2","type2"),Gene=c("a","b","c","c","a","b"),logFC=c(-2,-4,2,1,3,-1))
ggplot code, which gives me an extra wide bar for "c"
bar<-ggplot(df, aes(x=Gene, order=Gene,y=logFC,fill=tax))+ geom_bar(stat="identity",position="dodge")
Any suggestions that don't require me to change any values in the input df?
**This question is not a duplicate. I am looking for an ALTERNATIVE solution to what has been given before. Previous solutions DO NOT WORK. I cannot simply dcast (with fill=0) and re-melt my data frame (trust me, I've been trying this for weeks).
I am looking for a ggplot specific answer.
I think "c" will remain a wide bar because it has type1 twice and no type2 at all.
If you use facet_wrap, all the bars will have the same width:
ggplot(df, aes(x=Gene, y=logFC, color = tax))+
geom_bar(stat = "identity", position="dodge", width=.5) +
facet_wrap(~tax)
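The explanation above can be checked directly on the test data from the question by counting rows per Gene/tax combination. This is only a diagnostic sketch in base R, not part of the plotting code.

```r
df <- data.frame(
  tax   = c("type1", "type1", "type1", "type1", "type2", "type2"),
  Gene  = c("a", "b", "c", "c", "a", "b"),
  logFC = c(-2, -4, 2, 1, 3, -1)
)

# Rows per Gene/tax combination: a zero marks a combination with no bar to dodge against.
counts <- table(df$Gene, df$tax)
counts
```

Gene "c" has two type1 rows and no type2 row, so in a dodged plot nothing sits next to its bar and it fills the whole slot.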

ggplot2 stacked bar graph using rows as datapoints [duplicate]

This question already has an answer here:
Grouping & Visualizing cumulative features in R
(1 answer)
Closed 6 years ago.
I have a set of data that I would like to plot like this:
Now this is plotted using LibreOffice Calc in Ubuntu. I have tried to do this in R using the following code:
ggplot(DATA, aes(x="Samples", y="Count", fill=factor(Sample1)))+geom_bar(stat="identity")
This does not give me a stacked bar graph for each sample, but rather one single graph. I have had a similar question, that used a different dataframe, that was answered here. However, in this problem I don't have just one sample, but information for at least three. In LibreOffice Calc or Excel I can choose the stacked bar graph option and then choose to use rows as the data series. How can I achieve a similar graph in ggplot2?
Here is the dataframe/object for which I am trying to produce the graph:
Aminoacid Sequence,Sample1,Sample2,Sample3
Sequence 1,16,10,33
Sequence 2,2,2,7
Sequence 3,1,1,6
Sequence 4,4,1,1
Sequence 5,1,2,4
Sequence 6,4,3,14
Sequence 7,2,2,2
Sequence 8,8,5,12
Sequence 9,1,3,17
Sequence 10,7,1,4
Sequence 11,1,1,1
Sequence 12,1,1,2
Sequence 13,1,1,1
Sequence 14,1,2,2
Sequence 15,5,4,7
Sequence 16,3,1,8
Sequence 17,7,5,20
Sequence 18,3,3,21
Sequence 19,2,1,5
Sequence 20,1,1,1
Sequence 21,2,2,5
Sequence 22,1,1,3
Sequence 23,4,2,9
Sequence 24,2,1,1
Sequence 25,4,4,3
Sequence 26,4,1,3
I copied the content of a .csv file, is that reproducible enough? It worked for me to just use read.csv(.file) in R.
Edit:
Thank you for redirecting me to another post with a very similar problem, I did not find that before. That post brought me a lot closer to the solution. I had to change the code just a little to fit my problem, but here is the solution:
library(reshape2)
library(ggplot2)
df <- read.csv("example.csv")
df2 <- melt(df, id.vars = "Aminoacid.Sequence")
ggplot(df2, aes(x=variable, y=value, fill=Aminoacid.Sequence))+geom_bar(stat="identity")
Using variable on the x-axis makes one bar for each sample (Sample1-Sample3 in the example). Using y=value puts the value in each cell for that sample on the y-axis. And most importantly, using fill=Aminoacid.Sequence stacks the values for each sequence on top of each other, giving me the same graph as seen in the screenshot above!
Thank you for your help!
Try something along the following lines:
library(reshape2)
library(ggplot2)
df <- melt(DATA) # you probably need to adjust the id.vars here...
ggplot(df, aes(x=variable, y=value)) + geom_bar(stat="identity")
Note that you need to adjust the ggplot and the melt code somewhat, but since you haven't provided sample data, no one can provide the actual code necessary. The above provides the basic approach on how to deal with these multiple columns representing your samples, though. melt will "stack" the columns on top of each other, and create a column with the old variable name. This you can then use as x for ggplot.
Note that if you have other data in the data frame as well, melt will also stack these. For that reason you will need to adjust the commands to fit your data.
Edit: using your data:
library(reshape2)
library(ggplot2)
### reading your data:
# df <- read.table(file="clipboard", header=T, sep=",")
df2 <- melt(df)
head(df2)
  Aminoacid.Sequence variable value
1         Sequence 1   preDLI    16
2         Sequence 2   preDLI     2
3         Sequence 3   preDLI     1
4         Sequence 4   preDLI     4
5         Sequence 5   preDLI     1
6         Sequence 6   preDLI     4
This can be used as in:
ggplot(df2, aes(x=variable, y=value, fill=Aminoacid.Sequence)) + geom_bar(stat="identity")
I am sure you want to change some details about the graph, such as the colors etc., but this should answer your initial question.
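If the data frame contains further non-numeric columns, the identifier column can be named explicitly instead of relying on melt's defaults. A minimal sketch, using the first three rows of the CSV from the question read from an inline string (csv_text is just an illustrative name):

```r
library(reshape2)

# First three rows of the question's CSV, read from an inline string.
csv_text <- "Aminoacid Sequence,Sample1,Sample2,Sample3
Sequence 1,16,10,33
Sequence 2,2,2,7
Sequence 3,1,1,6"
df <- read.csv(text = csv_text)  # the header becomes Aminoacid.Sequence

# Name the identifier column explicitly so only the sample columns are stacked.
df2 <- melt(df, id.vars = "Aminoacid.Sequence")
```

The result has one row per sequence and sample, with the old column names in variable and the counts in value, which is the shape the ggplot call above expects.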
