I have data currently structured like so:
set.seed(100)
require(ggplot2)
require(reshape2)
d<-data.frame("ID" = 1:30,
"Treatment1" = sample(0:1,30,replace = T, prob = c(0.5,0.5)),
"Score1" = rnorm(30)^2,
"Treatment2" = sample(0:1,30,replace = T,prob = c(0.3,0.7)),
"Score2" = rnorm(30)^2,
"Treatment3" = sample(0:1,30,replace = T,prob = c(0.2,0.8)),
"Score3" = rnorm(30)^2)
The data contain unique IDs, 3 different treatments (coded 1 if the ID received the given treatment and 0 if not), and the score each ID has after each treatment period. I'm trying to create a boxplot that illustrates the score distribution associated with each treatment period across the IDs in the data set, but I'm either not melting the data properly, not coding the plot properly, or both.
d.melt<-melt(d,id.vars = c("ID","Treatment1","Treatment2","Treatment3"),measure.vars = c("Score1","Score2","Score3"))
I can produce a boxplot that shows the scores separated by whether the IDs received one of the three treatments with this code:
ggplot(d.melt)+
geom_boxplot(aes(x = variable,y = value,fill = factor(Treatment1)))
But this will only plot the difference in all the scores for the IDs that got treatment 1 and not the difference in scores for all of the 3 levels...
Any help getting my head around this problem would be great. Thank you in advance
The complication is that the data has pairs of columns (Treatment1, Score1, etc.) representing each treatment/score and we need to keep track of both whether a given subject received a given Treatment and their Score for each treatment. I've used one of the map functions from the purrr package (which is part of the tidyverse suite of packages) for this.
The code steps through each of the three pairs of treatments/scores, adds a column called Treatment indicating the treatment number and returns the stacked (long format) data frame.
library(tidyverse)
dr = map2_df(seq(2,ncol(d),2), seq(3,ncol(d),2),
function(t,s) {
data.frame(ID = d[,"ID"],
Treatment = gsub(".*([0-9]$)", "\\1", names(d)[t]),
Treat_Flag = d[,t],
Score = d[,s])
})
Now we plot the data using Treatment on the x-axis to mark the treatment number and color by Treat_Flag to provide separate box plots based on whether a given subject received a given treatment.
ggplot(dr, aes(Treatment, Score, colour=factor(Treat_Flag))) +
geom_boxplot() +
theme_classic() +
labs(colour="Treatment Indicator")
Here's another way to reshape the data. The code below uses functions from tidyr rather than from reshape2 (tidyr is the successor to reshape2). In the code below, gather(d, key, value, -ID) is essentially equivalent to melt(d, id.var="ID"). You can stop the chain of functions at any step to look at the intermediate outputs. This approach is probably more in keeping with the tidyverse paradigm for data reshaping, but I find it a bit less intuitive than the map approach above.
dr = gather(d, key, value, -ID) %>%
separate(key, into=c("key", "value2"), sep="(?=[0-9])") %>%
spread(key, value) %>%
rename(Treatment=value2, Treat_Flag=Treatment)
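For what it's worth, newer versions of tidyr (1.0.0 and later) mark gather() and spread() as superseded in favour of pivot_longer() and pivot_wider(). Below is a minimal sketch of the same reshape in one step. It assumes every column other than ID follows the name-then-digit pattern; Period is a column name I made up to hold the treatment number, and the data are regenerated (with default sampling probabilities) only so the block runs on its own.

```r
library(tidyr)

# Regenerate data shaped like the question's (probabilities simplified)
set.seed(100)
d <- data.frame(ID = 1:30,
                Treatment1 = sample(0:1, 30, replace = TRUE),
                Score1 = rnorm(30)^2,
                Treatment2 = sample(0:1, 30, replace = TRUE),
                Score2 = rnorm(30)^2,
                Treatment3 = sample(0:1, 30, replace = TRUE),
                Score3 = rnorm(30)^2)

# ".value" tells pivot_longer to use the first regex group ("Treatment" or
# "Score") as the output column name; the trailing digit goes into Period
# (Period is a hypothetical name chosen for this sketch)
long <- pivot_longer(d,
                     cols = -ID,
                     names_to = c(".value", "Period"),
                     names_pattern = "([A-Za-z]+)([0-9]+)")

head(long)
```

The result has one row per ID and period, so it should be usable with the same boxplot call as above after mapping Period to x and factor(Treatment) to colour.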
I can see many posts on this topic, but none addresses this question. Apologies if I missed a relevant answer. I have a large protein expression dataset, with samples like so as the columns:
rep1_0hr, rep1_16hr, rep1_24hr, rep1_48hr, rep1_72hr .....
and 2000+ proteins in the rows. In other words each sample is a different developmental timepoint.
If it is of any interest, the original dataset is 'mulvey2015' from the pRolocdata package in R, which I converted to a SummarizedExperiment object in RStudio.
I first ran k-means clustering on the data (an assay() of a SummarizedExperiment dataset) to get 12 clusters:
k_mul <- kmeans(scale(assay(mul)), centers = 12, nstart = 10)
Then:
summary(k_mul)
produced the expected output.
I would like the visualisation to look like this, with samples on the x-axis and expression on the y-axis. The plots look like they have been generated using facet_wrap() in ggplot:
For ggplot the data need to be provided as a dataframe with a column for the cluster identity of an individual protein. Also the data need to be in long format. I tried pivoting (pivot_longer) the original dataset, but of course there are a very large number of data points. Moreover, the image I posted shows that for any one plot, the number of coloured lines is smaller than the total number of proteins, suggesting that there might have been dimension reduction on the dataset first, but I am unsure. Up till now I have been running the kmeans algorithm without dimension reduction. Can I get guidance please for how to produce this plot?
Here is my attempt at reverse engineering the plot:
library(pRolocdata)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)
mulvey2015 %>%
Biobase::assayData() %>%
magrittr::extract2("exprs") %>%
data.frame(check.names = FALSE) %>%
tibble::rownames_to_column("prot_id") %>%
mutate(.,
cl = kmeans(select(., -prot_id),
centers = 12,
nstart = 10) %>%
magrittr::extract2("cluster") %>%
as.factor()) %>%
pivot_longer(cols = !c(prot_id, cl),
names_to = "Timepoint",
values_to = "Expression") %>%
ggplot(aes(x = Timepoint, y = Expression, color = cl)) +
geom_line(aes(group = prot_id)) +
facet_wrap(~ cl, ncol = 4)
As for your questions, pivot_longer is usually quite performant unless it fails to find unique combinations in the keys or runs into problems with data type conversion. The plot can be improved by:
tweaking the alpha parameter of geom_line (e.g. alpha = 0.5), to give an idea of the density of lines
finding a good abbreviation and order for Timepoint
changing axis.text.x orientation
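The three tweaks above could look roughly like this. This is only a sketch on made-up toy data (the toy data frame, its 20 proteins and 2 clusters are all hypothetical stand-ins for the long-format expression table), not the mulvey2015 data itself.

```r
library(ggplot2)

# Hypothetical toy data standing in for the long-format expression table
set.seed(1)
timepoints <- c("0hr", "16hr", "24hr", "48hr", "72hr")
toy <- expand.grid(prot_id = paste0("P", 1:20), Timepoint = timepoints)
toy$cl <- factor(rep(rep(1:2, each = 10), times = 5))
toy$Expression <- rnorm(nrow(toy))

# Fix the x-axis order explicitly rather than relying on alphabetical sorting
toy$Timepoint <- factor(toy$Timepoint, levels = timepoints)

p <- ggplot(toy, aes(x = Timepoint, y = Expression, colour = cl)) +
  geom_line(aes(group = prot_id), alpha = 0.5) +        # semi-transparent lines
  facet_wrap(~ cl) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # rotated labels

print(p)
```

With real sample names such as rep1_0hr, the levels vector would need to list them in developmental order.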
Here is my own, very similar solution to the above.
dfsa_mul <- data.frame(scale(assay(mul)))
dfsa_mul2 <- rownames_to_column(dfsa_mul, "protID")
Add the kmeans $cluster column to the dfsa_mul2 data frame (ksa_mul is the fitted kmeans object). Only change clus to a factor after executing pivot_longer.
dfsa_mul2$clus <- ksa_mul$cluster
dfsa_mul2 %>%
pivot_longer(cols = -c("protID", "clus"),
names_to = "samples",
values_to = "expression") %>%
ggplot(aes(x = samples, y = expression, colour = factor(clus))) +
geom_line(aes(group = protID)) +
facet_wrap(~ factor(clus))
This generates a series of plots identical to the graphs posted by @sbarbit.
I have several datasets, and my end goal is to graph them with each line representing the yearly variation for a given item. The data were originally monthly; I have now joined and combined them into a table that contains only the yearly means for each item I want to graph (a column for the year, plus the yearly variation for 4 different elements).
I have one factor, the year, and 4 different variables that record yearly variation, so I would like to graph them in the same plot. My idea was to join the 4 columns into one by factor (collapsing to one observation per row, with the year alongside it), but I have not managed to do that. My thinking is that this would give my y-axis a structure. I would appreciate some advice, and would like to know whether my approach to the problem is sound. I am trying ggplot2, but it does not seem to work without a defined (or predefined range for the) y-axis. Thanks
I would suggest the following approach. Reshape your data from wide to long, as in the example below, so that all variables can be drawn in one plot. As no data were provided, this solution is sketched using dummy data. You can also swap the lines for another geom, such as points:
library(tidyverse)
set.seed(123)
#Data
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
#Plot
df %>% pivot_longer(-year) %>%
ggplot(aes(x=factor(year),y=value,group=name,color=name))+
geom_line()+
theme_bw()
We could use melt from reshape2 without loading multiple other packages
library(reshape2)
library(ggplot2)
ggplot(melt(df, id.var = 'year'), aes(x = factor(year), y = value,
group = variable, color = variable)) +
geom_line()
Or with matplot from base R
matplot(as.matrix(df[-1]), type = 'l', xaxt = 'n')
axis(1, at = seq_len(nrow(df)), labels = df$year) # redraw the suppressed x-axis with the years
data
set.seed(123)
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
I have time series data where measurements of 7 variables (Var1:Var7) were taken on 15 individuals (denoted by a unique ID). These individuals were sampled from 3 different Locations. Note that the number of observations is different for each individual. I believe the individuals within each Location will be more similar to each other than individuals in other Locations, both in value and trend. For each Variable within each Location, I want to plot the average time series (to get an idea of what the group looks like as a whole) up to the point where Time is the same for each individual (so the length of the x-axis will only be as long as the shortest individual).
How can I do this and add error bars for each Time point to see how much variation exists between individuals?
Here is some sample data:
set.seed(123)
ID = factor(letters[seq(15)])
Time = c(1000,1200,1234,980,1300,1020,1180,1908,1303,
1045,1373,1111,1097,1167,1423)
df <- data.frame(ID = rep(ID, Time), Time = sequence(Time))
df$Location = rep(c("NY","WA","MA"), c(5714,7829,4798))
df[paste0('Var', c(1:7))] <- rnorm(sum(Time))
The values of all your variables are the same, so I did the following to make it more random:
for(i in 1:7) df[paste0('Var', i)] <- rnorm(sum(Time))
Then the following code gives a time-series plot of each of the 7 variables, averaged across individuals within each of the three locations.
library(tidyverse)
df %>%
pivot_longer(cols = Var1:Var7, names_to="Variable") %>%
group_by(Location, Variable, Time) %>%
summarise(mval=mean(value)) %>%
ggplot(aes(y=mval, x=Time, color=Variable)) +
geom_line() +
facet_grid(~Location) # , scales="free" # ?
I'm not sure if this is what you had in mind though.
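The code above does not truncate the series or draw error bars, so here is a rough sketch of one way to add both. It makes several assumptions that are mine, not the question's: the data are a smaller made-up set (6 individuals, 2 variables), "shortest individual" is taken per Location, and the error bars are the mean plus or minus one standard error across individuals.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Smaller hypothetical data with the same layout as the question's
set.seed(123)
ID <- factor(letters[1:6])
n_obs <- c(50, 60, 40, 55, 45, 70)            # series length per individual
df <- data.frame(ID = rep(ID, n_obs), Time = sequence(n_obs))
df$Location <- rep(c("NY", "WA", "MA"), c(110, 95, 115))
for (i in 1:2) df[[paste0("Var", i)]] <- rnorm(nrow(df))

summ <- df %>%
  pivot_longer(cols = starts_with("Var"), names_to = "Variable") %>%
  group_by(Location, ID) %>%
  mutate(n_id = max(Time)) %>%                # length of each individual's series
  group_by(Location) %>%
  filter(Time <= min(n_id)) %>%               # keep only Times every individual has
  group_by(Location, Variable, Time) %>%
  summarise(mval = mean(value),
            se = sd(value) / sqrt(n()),       # standard error across individuals
            .groups = "drop")

p <- ggplot(summ, aes(x = Time, y = mval, colour = Variable)) +
  geom_errorbar(aes(ymin = mval - se, ymax = mval + se), alpha = 0.4) +
  geom_line() +
  facet_grid(Variable ~ Location, scales = "free_x")

print(p)
```

With thousands of time points the error bars get crowded; swapping geom_errorbar for geom_ribbon with the same ymin and ymax may be easier to read.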
In the time series data created below, individuals (denoted by a unique ID) were sampled from 2 populations (NC and SC). All individuals have the same number of observations. I want to average the data at each "time point" across all individuals that belong to the same "State" (the average line), and I want to plot the average lines from each state against each other. I want it to look something like this:
library(tidyverse)
set.seed(123)
ID <- rep(1:10, each = 500)
Time = rep(c(1:500),10)
Location = rep(c("NC","SC"), each = 2500)
Var <- rnorm(5000)
data <- data.frame(
ID = factor(ID),
Time = Time,
State = Location,
Variable = Var
)
I would recommend getting familiar with the various dplyr functions, specifically group_by and summarise. You may want to read through Introduction to dplyr or go through this series of blog posts.
In short, we group the data by the Time and State variables and then summarise each group with an average (i.e., mean(Variable)). To plot the data, we put Time on the x-axis, the newly created avg_var on the y-axis, and use State to represent color. These are assigned as the chart's aesthetics (i.e., aes(...)). Finally, we add the line geom with geom_line() to render the lines on the visualization.
data %>%
group_by(Time, State) %>%
summarise(avg_var = mean(Variable)) %>%
ggplot(aes(x = Time, y = avg_var, color = State)) +
geom_line()
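As an aside, the group-then-average step can also be left to ggplot2 itself. A minimal sketch, assuming ggplot2 3.3.0 or later (where stat_summary takes a fun argument) and rebuilding the question's data so the block runs on its own:

```r
library(ggplot2)

# Same data as in the question
set.seed(123)
data <- data.frame(
  ID = factor(rep(1:10, each = 500)),
  Time = rep(1:500, 10),
  State = rep(c("NC", "SC"), each = 2500),
  Variable = rnorm(5000)
)

# stat_summary computes mean(Variable) at each Time within each colour group,
# so no separate group_by/summarise step is needed
p <- ggplot(data, aes(x = Time, y = Variable, colour = State)) +
  stat_summary(fun = mean, geom = "line")

print(p)
```

The dplyr version is still the better choice if you need the averaged values themselves for anything besides the plot.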
Here is what I am after:
Let's use the ToothGrowth dataset that comes with R as a simple example. In this dataset there are 3 columns: len (tooth length), supp (supplement), and dose. Both dose and supplement are explanatory variables for length. It's easy enough to, say, plot length against dose and colour by supplement. For instance, using qplot you would just do this:
qplot(x = ToothGrowth$dose , y = ToothGrowth$len, color = ToothGrowth$supp)
The next thing I'd want to do is see the trend of the average growth for each supplement as dose increased. I.e., construct a very similar plot, except I want the y variable to be the average of the values based on the dose and supplement.
I'm not sure how to do that in place with a call to qplot. It occurred to me that perhaps the thing to do was to compute a new column or something, but I'm also not sure how to use something like mutate to build a new column based on multiple explanatory variables.
I think this may be what you are looking for, but you may need to clarify. Here is how you can generate the averages using dplyr:
library(dplyr)
library(ggplot2)
Avg_ToothGrowth <- ToothGrowth %>%
group_by(supp, dose) %>%
summarise(avg_len = mean(len)) %>%
ungroup()
qplot(dose, avg_len, data = Avg_ToothGrowth, color = supp)
This should get you close but you may have to go through a dplyr tutorial to better understand the use of group_by and summarise. I used the ungroup to strip off the remaining groupings as they are not needed (there may be a better way to do this).
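One caveat: as far as I know, qplot() is deprecated in recent ggplot2 releases (from 3.4.0), so here is a sketch of the same idea written with ggplot(). It also assumes dplyr 1.0.0 or later, where summarise() has a .groups argument that can stand in for the separate ungroup() step.

```r
library(dplyr)
library(ggplot2)

# Average tooth length for every supplement/dose combination
avg_tooth <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(avg_len = mean(len), .groups = "drop")  # returns an ungrouped tibble

# One point and one connecting line per supplement
p <- ggplot(avg_tooth, aes(x = dose, y = avg_len, colour = supp)) +
  geom_point() +
  geom_line()

print(p)
```

The connecting line is my addition to make the trend across doses easier to see; drop geom_line() to match the qplot call exactly.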
EDIT:
You can also plot the original data with a trend line for each group
# With confidence interval
qplot(dose, len, data = ToothGrowth, color = supp, geom = c('smooth', 'point'), method = 'lm')
# Without confidence interval
qplot(dose, len, data = ToothGrowth, color = supp, geom = c('smooth', 'point'), method = 'lm', se=FALSE)
I personally prefer to use dplyr as steveb did, but in case you are not familiar with the package, a solution without it might be easier to understand. The function aggregate() can help you:
tg <- aggregate(len ~ dose + supp, mean, data = ToothGrowth)
The first argument is a formula that tells the function that it should aggregate the value of the column len for all rows that have the same values for dose and supp. The second argument gives the function to use for the aggregation, which is mean. So, what is actually done is the following:
Rows of the data frame are grouped together by dose and supp. All rows within a group have thus the same values for dose and supp.
Then, for each group, the function mean() is applied to the column len.
This is exactly what is happening in the dplyr solution, but there, the two steps are more clearly spelled out.
The resulting data frame can then be plotted:
qplot(dose, len, colour = supp, data = tg)
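To make the two steps described above concrete, here they are spelled out by hand in base R. This is purely illustrative (aggregate() may well work differently internally); it just shows that grouping and then averaging reproduces the same numbers.

```r
# Step 1: group the len values by every dose/supp combination
groups <- split(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp))

# Step 2: apply mean() to each group
means <- sapply(groups, mean)

# Same result as the one-line aggregate() call
tg <- aggregate(len ~ dose + supp, mean, data = ToothGrowth)
print(means)
print(tg)
```

The names of means combine dose and supp (for example 0.5.OJ), which is how split() labels the groups.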