How to graph mean of each date for different groups? - r

I have a data set with 3 columns: date, weight, and location. I want to make a graph with time on the x-axis and weight on the y-axis with a different line for each location, where each point on the line is the mean weight of all samples from that location at that date. The only ways I've been able to come up with to do this would take way too long and require more lines of code than seems reasonable just to make a graph. For instance I tried subsetting like this:
A <- df$Location == "A"
Aug10_19 <- df$Date == "2019/07/10"
ind <- Aug10_19 & A
mean(df$Weight[ind])
But then I would have to do this manually for every individual combination of date and location and then force all the means into a new data frame. What is the shorter way to accomplish this?

You can use ggplot2 to quickly create summary plots.
library(dplyr)
library(ggplot2)
df <- transmute(
iris,
Location = Species,
Date = as.Date(as.character(
cut(Sepal.Length, breaks = 3,
labels = c("2019-07-10", "2019-07-12", "2019-07-15")))),
Weight = Sepal.Width)
ggplot(data = df,
mapping = aes(x = Date, y = Weight, colour = Location)) +
stat_summary(fun = "mean", geom = "line") +
theme_bw()
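If you also want the means themselves in a data frame (the "new data frame" mentioned in the question), you can summarise before plotting. This is a minimal sketch, assuming the question's column names Date, Weight, and Location and dplyr 1.0.0 or later (for the `.groups` argument). The toy data frame `samples` and the names `means` and `MeanWeight` are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the question's three columns
samples <- data.frame(
  Date     = as.Date(c("2019-07-10", "2019-07-10", "2019-07-12", "2019-07-12",
                       "2019-07-10", "2019-07-10", "2019-07-12", "2019-07-12")),
  Location = rep(c("A", "B"), each = 4),
  Weight   = c(1, 3, 2, 4, 5, 7, 6, 8)
)

# One row per Location/Date combination, holding the mean weight
means <- samples %>%
  group_by(Location, Date) %>%
  summarise(MeanWeight = mean(Weight), .groups = "drop")

# Plot the precomputed means, one line per location
p <- ggplot(means, aes(x = Date, y = MeanWeight, colour = Location)) +
  geom_line() +
  geom_point() +
  theme_bw()
print(p)
```

The plot should match what stat_summary draws, but here `means` stays available for tables or further analysis.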

Related

Time series data using ggplot: how use different color for each time point and also connect with lines data belonging to each subject?

I have data from several cells which I tested in several conditions: a few times before and also a few times after treatment. In ggplot, I use color to indicate different times of testing.
Additionally, I would like to connect with lines all data points which belong to the same cell. Is that possible?...
Here is my example data (https://www.dropbox.com/s/eqvgm4yu6epijgm/df.csv?dl=0) and a simplified code for the plot:
df$condition = as.factor(df$condition)
df$cell = as.factor(df$cell)
df$condition <- factor(df$condition, levels = c("before1", "before2", "after1", "after2", "after3"))
windows(width=8,height=5)
ggplot(df, aes(x=condition, y=test_variable, color=condition)) +
labs(title="", x = "Condition", y = "test_variable", color="Condition") +
geom_point(aes(color=condition),size=2,shape=17, position = position_jitter(w = 0.1, h = 0))
I think your code is heading in the wrong direction: you should instead group the points by the Cell column, so that each cell's points can be connected. Then, if I understand correctly, you want to see how the variable evolves for each cell before and after treatment, so you can order the x variable using scale_x_discrete.
Altogether, you can do something like this:
library(ggplot2)
ggplot(df, aes(x = condition, y = variable, group = Cell)) +
geom_point(aes(color = condition))+
geom_line(aes(color = condition))+
scale_x_discrete(limits = c("before1","before2","after1","after2","after3"))
Does this look like what you are expecting?
Data
df = data.frame(Cell = c(rep("13a",5),rep("1b",5)),
condition = rep(c("before1","before2","after1","after2","after3"),2),
variable = c(58,55,36,29,53,57,53,54,52,52))

R ggplot automatic recalculation with geom tile when subsetting

I am attempting to create heat maps with a large data set that has several factors. I'd like to get a birds eye view first, by plotting the heat map of all values and all factors. THEN, I'd like to subset the heat map plot by a variety of factors - but have ggplot2::geom_tile re-calculate the heat map so it plots the relative abundance based on whatever factors I've subsampled.
library(reshape2)
library(ggplot2)
library(dplyr)
#Test data
df <- data.frame(
Measurement = c(1:30),
CA = rep(rnorm(30, mean=20, sd=5)),
TX = rep(rnorm(30, mean=18, sd=5)),
NY = rep(rnorm(30, mean=34, sd=2))
)
df.melt <- melt(df,id = c("Measurement"))
Basic heat map plot code. My actual data includes several factors/columns from which I want to pull data for various comparisons.
#Basic plot
ggplot(data = df.melt,
aes(x = variable, y = Measurement, colors = value, fill = value)) +
geom_tile(color = "black") +
scale_fill_gradientn(colors = c("lightyellow", "darkred"))
I want the output colors to correspond to relative abundance by measurement, so I can look at relative changes across CA, TX, and NY. This would be my "Base plot".
df.melt.reabun <- df.melt %>% group_by(Measurement) %>%
mutate(RelAbun = value/sum(value))
df.melt.reabun <- as.data.frame(df.melt.reabun)
#New plot with relative abundance
ggplot(data = df.melt.reabun,
aes(x = variable, y = Measurement, colors = RelAbun, fill = RelAbun)) +
geom_tile(color = "black") +
scale_fill_gradientn(colors = c("lightyellow", "darkred"))
I also want to be able to re-plot any subset I choose and have the relative abundance recalculated automatically within geom_tile.
#Assign plot object
heat <- ggplot(data = df.melt.reabun,
aes(x = variable, y = Measurement, colors = RelAbun, fill = RelAbun)) +
geom_tile(color = "black")+
scale_fill_gradientn(colors = c("lightyellow", "darkred"))
#Select variable to subset data
alt <- c("CA", "TX")
#Subset ggplot object
heat %+% subset(df.melt.reabun, variable %in% alt)
But this output is incorrect, because it is only showing relative abundance from the calculation that included CA, TX, and NY.
I want the relative abundance to re-calculate every time I subset the df to plot at this step: heat %+% subset()
I have a feeling I can smoothly combine group_by and geom_tile to do this automatically.. but I can't quite figure it out. Any help would be appreciated. I have MANY MANY combinations of heat maps I want to look at and I do NOT want to re-calculate the relative abundance "manually" each time.
It's generally advisable to do your data wrangling before passing the data frame to ggplot. In this case, something like the following could work:
subsetFun <- function(df, var.filter){
return(df %>%
filter(variable %in% var.filter) %>%
group_by(Measurement) %>%
mutate(RelAbun = value / sum(value)) %>%
ungroup())
}
heat %+% subsetFun(df.melt.reabun, alt)
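To check that the helper really recomputes the proportions, the sketch below applies it to a small long-format data frame and confirms that RelAbun sums to 1 within each Measurement after subsetting. The data frame `long` and its values are invented for illustration; `subsetFun` repeats the function from the answer so the example runs on its own.

```r
library(dplyr)

# Same helper as in the answer: filter first, then recompute relative abundance
subsetFun <- function(df, var.filter) {
  df %>%
    filter(variable %in% var.filter) %>%
    group_by(Measurement) %>%
    mutate(RelAbun = value / sum(value)) %>%
    ungroup()
}

# Hypothetical long-format data: 2 measurements x 3 states
long <- data.frame(
  Measurement = rep(1:2, times = 3),
  variable    = rep(c("CA", "TX", "NY"), each = 2),
  value       = c(10, 20, 30, 20, 60, 60)
)

# Keep only CA and TX; NY no longer contributes to the denominator
sub <- subsetFun(long, c("CA", "TX"))

# Within each Measurement the recomputed shares should add up to 1
check <- sub %>%
  group_by(Measurement) %>%
  summarise(total = sum(RelAbun))
print(check)
```

Because the filter runs before the grouped mutate, each new subset gets its own denominator, which is what the %+% call on its own does not do.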

Stack different variables on one graph using stat_summary (ggplot)

Context: I want to compare graphically the evolution of workload and trust over time during an experiment. Time is represented by 2 blocks.
Issue: I'm trying to plot different variables with different units on the same graph to compare the evolution. I only found that it works with geom_line, but it doesn't for stat_summary.
Data: x is "Block" (2 blocks) representing time. Variables used for y are "Workload" and "Trust" (both from 1 to 5, obtained by asking the subject).
To give some data:
data = data.frame("Subject" = c(1,1,2,2,3,3), "Block" = c(1,2,1,2,1,2), "Workload" = c(1,5,2,4,3,3), "Trust" = c(4,1,3,2,2,1))
I tried this, it works:
ggplot(data, aes(Block)) + geom_line(aes(y = Trust)) + geom_line(aes(y = Workload))
However, it does not produce a convincing result: since I have multiple points per block, it links them for each value so that I obtain only vertical lines. And that's perfectly normal considering what geom_line is supposed to do.
So I can still compute the mean for each block and each variable, however I was wondering if it is possible to obtain a direct result with stat_summary, using something like:
ggplot(data, aes(Block)) + stat_summary(fun.y = mean, geom = line, aes(y = Trust)) + stat_summary(fun.y = mean, geom = line, aes(y = Workload))
Thank you to anyone dedicating even a little of their time to answering this.
Have a nice day!
Pyxel
I'd recommend summarizing your data before plotting. Consider the following:
library(tidyverse)
df <- tibble("Subject" = c(1,1,2,2,3,3),
"Block" = c(1,2,1,2,1,2),
"Workload" = c(1,5,2,4,3,3),
"Trust" = c(4,1,3,2,2,1))
grouped <-
df %>%
group_by(Block) %>%
summarise(trust = mean(Trust),
workload = mean(Workload))
ggplot(grouped, aes(x = Block)) +
geom_line(aes(y = trust)) +
geom_line(aes(y = workload))
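The stat_summary() call from the question can also be made to work directly. The sketch below assumes ggplot2 3.3.0 or later, where the argument is `fun` rather than `fun.y`, and passes the geom as a quoted string. The colour mapping and axis labels are an arbitrary choice to tell the two lines apart, not part of the original question.

```r
library(ggplot2)

# The example data from the question
data <- data.frame(Subject  = c(1, 1, 2, 2, 3, 3),
                   Block    = c(1, 2, 1, 2, 1, 2),
                   Workload = c(1, 5, 2, 4, 3, 3),
                   Trust    = c(4, 1, 3, 2, 2, 1))

# Each stat_summary() layer computes the mean of its y at every Block value
p <- ggplot(data, aes(x = Block)) +
  stat_summary(aes(y = Trust, colour = "Trust"),
               fun = mean, geom = "line") +
  stat_summary(aes(y = Workload, colour = "Workload"),
               fun = mean, geom = "line") +
  labs(y = "Mean rating", colour = "Variable")
print(p)
```

This avoids the intermediate data frame, at the cost of one layer per variable. With many variables, reshaping to long format first is likely to be tidier.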

How would one create scatterplots based on characteristics in multiple columns of a data frame?

age <- rnorm(100, 0:100)
freq <- rnorm(100, 0:1)
char1<-stringi::stri_rand_strings(100, length = 1, pattern = "[abc]")
char2<-stringi::stri_rand_strings(100, length = 1, pattern = "[def]")
char3<-stringi::stri_rand_strings(100, length = 1, pattern = "[ghi]")
dftest <- data.frame(age, freq, char1, char2, char3)
dflist <- list(dftest, dftest, dftest, dftest, dftest)
This creates a sample data frame that demonstrates the problem I am having.
I would like to create scatterplots for age vs freq for each of the data frames in this list, but I want a different color for the points based on the value in columns "char#". I also need a separate trend line for values in each of these separate characteristics.
I also want to be able to do this based on combinations of different characteristics from different char columns. An example of this is 3*3=9 separate colors for each of the combinations, each with a different trend line.
How would this be done?
I hope this was reproducible and clear enough. I have only posted a few times, so I am still getting used to the format.
Thanks!
Let's start by creating a data frame that will allow us to show points with different colors:
df2 <- data.frame(age=rnorm(200,0:100),
freq=rnorm(200,0:1),id=rep(1:2,each=100))
Then we can plot like so:
plot(df2$age, df2$freq, col = df2$id, pch = 16)
We set col (color) equal to id (this would represent each data frame). pch is the point type (solid dots).
You can try dplyr for data preparing and ggplot for plotting. All functions are loaded via the tidyverse package:
library(tidyverse)
# age vs freq plus trendline for char1
as_tibble(dftest) %>%
ggplot(aes(age, freq, color=char1)) +
geom_point() +
geom_smooth(method = "lm")
# age vs freq plus trendline for combinations of char columns
as_tibble(dftest) %>%
unite(combi, char1, char2, char3, sep="-") %>%
ggplot(aes(age, freq, color=combi)) +
geom_point() +
geom_smooth(method = "lm")
# (no plot shown for the above: too many combinations make the plot too busy)
# one panel per data frame in the list
dflist %>%
bind_rows( .id = "df_source") %>%
ggplot(aes(age, freq, color=char1)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE) +
facet_wrap(~df_source)

plot factor frequency by group (a yield plot)

I have a data frame containing the test_outcome (PASS/FAIL) for each test_type performed on each test_subject. For example:
test_subject, test_type, test_outcome
person_a, height, PASS
person_b, height, PASS
person_c, height, FAIL
person_d, height, PASS
person_a, weight, FAIL
person_b, weight, FAIL
person_c, weight, PASS
person_d, weight, PASS
I would like to prepare a yield plot by test_type and test_subject.
Y-axis = yield i.e. num pass/(num pass + num fail)
X-axis = test_subject
fill: = A line for each test_type.
I would prefer to use ggplot2, can you please recommend the best approach here? e.g. how to reshape the data before plotting?
Here is a quick dplyr answer; you will want to tidy up the graph with your desired colours etc.
library(dplyr)
library(ggplot2)
dat <- dat %>% group_by(test_subject, test_type) %>%
summarise(passrate = sum(test_outcome=="PASS") / n())
ggplot(dat, aes(x = test_subject, y = passrate, fill = test_type)) +
geom_bar(stat = "identity", position = "dodge")
Edit: a line graph was requested. Normally, categorical groups shouldn't be connected by a line graph - as there is no reason to order them in a particular way.
ggplot(dat, aes(x = test_subject, y = passrate, col = test_type)) +
geom_line(aes(group = test_type)) +
geom_point()
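The answer assumes a data frame named `dat` already exists. A minimal sketch of building it from the rows in the question is below, along with an overall yield per test_type. The object name `yield` is made up for illustration. Note that with a single row per subject and test type, as in this sample, the per-subject pass rate can only be 0 or 1.

```r
library(dplyr)

# The example rows from the question
dat <- data.frame(
  test_subject = rep(c("person_a", "person_b", "person_c", "person_d"), times = 2),
  test_type    = rep(c("height", "weight"), each = 4),
  test_outcome = c("PASS", "PASS", "FAIL", "PASS",
                   "FAIL", "FAIL", "PASS", "PASS")
)

# Yield per test type across all subjects: passes / (passes + fails)
yield <- dat %>%
  group_by(test_type) %>%
  summarise(passrate = mean(test_outcome == "PASS"))
print(yield)
```

Taking the mean of the logical comparison is equivalent to sum(test_outcome == "PASS") / n() in the answer.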
