How to make a boxplot with 3D array with ggplot? - r

I have a technical question.
Here are my observed data:
observed <- structure(c(4.06530084555243e-05, 4.34037362577724e-05, 5.25472735118296e-05,
5.75250282219017e-05, 5.33322813829422e-05, 4.31323519093776e-05,
2.93059438168564e-05, 3.2907253754896e-05, 3.93244409813805e-05,
4.44607200813546e-05, 4.28121839343577e-05, 4.41339340180233e-05,
2.45819615043229e-05, 2.77652788697063e-05, 3.471280169582e-05,
4.0759303004447e-05, 4.1444945573338e-05, 3.91053759171617e-05
), .Dim = c(6L, 3L))
After a simulation, I have this dataset:
simul <- structure(c(4.19400641566714e-05, 4.34037362577724e-05, 5.21778240776188e-05,
5.72766282640455e-05, 5.33322813829422e-05, 4.4984474595369e-05,
3.04758260711529e-05, 3.35466566427138e-05, 4.07527347018512e-05,
4.51672959887775e-05, 4.42496416020706e-05, 4.41339340180233e-05,
2.38725672336555e-05, 2.78960210968267e-05, 3.42390390339277e-05,
4.0759303004447e-05, 4.1444945573338e-05, 4.16181419135288e-05,
4.06530084555243e-05, 4.52163381730998e-05, 5.37744538705153e-05,
5.75250282219017e-05, 5.44384786782902e-05, 4.27640158845638e-05,
2.93059438168564e-05, 3.16988003284864e-05, 3.88757470111112e-05,
4.16839537839391e-05, 4.1923490779897e-05, 4.43697930071784e-05,
2.53312977844189e-05, 2.82780740113101e-05, 3.49483644305925e-05,
4.23308636691264e-05, 4.36574393087853e-05, 3.91053759171617e-05,
3.97856427517231e-05, 4.25485977213641e-05, 5.21380124071012e-05,
5.62879076217168e-05, 5.18161751345512e-05, 4.22404154190924e-05,
2.84842421189343e-05, 3.2907253754896e-05, 3.93244409813805e-05,
4.28921326811218e-05, 4.2391125283836e-05, 4.28233487269764e-05,
2.45819615043229e-05, 2.67311845213199e-05, 3.3715109777394e-05,
4.00991849427121e-05, 4.07259705233212e-05, 3.62825448554739e-05,
3.95854341194398e-05, 4.23930151174446e-05, 5.25472735118296e-05,
5.76202168197769e-05, 5.23957149070388e-05, 4.31323519093776e-05,
2.90350657890489e-05, 3.22693947104228e-05, 3.90988677457566e-05,
4.44607200813546e-05, 4.28121839343577e-05, 4.28542288317551e-05,
2.56149959419174e-05, 2.77652788697063e-05, 3.49302533009518e-05,
4.13777396322285e-05, 4.12908495437265e-05, 3.92084109551252e-05,
4.14887591359563e-05, 4.39273564362111e-05, 5.31197050290816e-05,
5.77484133948985e-05, 5.36319646972061e-05, 4.62472643466539e-05,
3.06756490605887e-05, 3.49917045844483e-05, 4.15936967740209e-05,
4.66221720234964e-05, 4.48785430220286e-05, 4.44766996381653e-05,
2.36916432633518e-05, 2.69248181080789e-05, 3.471280169582e-05,
3.94762090257435e-05, 4.17765202936009e-05, 3.8021359310749e-05
), .Dim = c(6L, 3L, 5L))
This is a 3D array. The columns correspond to the study areas, and the rows to the months followed. The third dimension holds the simulated values.
My question: is it possible, with ggplot, to draw a multipanel graph (grid), one panel per study area, of boxplots of the simulations (values of the 3rd dimension) with months on the x axis (= 6 boxplots per panel)? I would also like to draw the observed values as a line through the boxplots of each panel. Thank you!

I hope I understood it right: for each study area, make a boxplot for each month, summarizing the values obtained from all 5 simulations.
First I gave dimension names to the array:
attributes(simul)$dimnames <- list(
  month = month.abb[1:6],
  study = letters[1:3],
  simval = 1:5
)
After that I converted the named array to a tbl_cube, and then to a tibble, so I can plot the data with the usual tidyverse routine:
library(tidyverse)
library(magrittr) # for the tee pipe %T>%
library(cubelyr) # as.tbl_cube() moved here from dplyr in dplyr 1.0.0
as.tbl_cube(simul) %>%
  as_tibble() %>%
  rename('value' = simul) %>%
  mutate(
    study = factor(paste('Study', study)),
    month = factor(month, levels = month.abb[1:6])
  ) %T>%
  print %>%
  ggplot(aes(x = month, y = value)) +
  geom_boxplot(outlier.colour = 'red') +
  facet_wrap(~ study, nrow = 1, scales = 'free_y') +
  ggthemes::theme_few()
# # A tibble: 90 x 4
# month study simval value
# <fct> <fct> <int> <dbl>
# 1 Jan Study a 1 0.0000419
# 2 Feb Study a 1 0.0000434
# 3 Mar Study a 1 0.0000522
# 4 Apr Study a 1 0.0000573
# 5 May Study a 1 0.0000533
# 6 Jun Study a 1 0.0000450
# 7 Jan Study b 1 0.0000305
# 8 Feb Study b 1 0.0000335
# 9 Mar Study b 1 0.0000408
# 10 Apr Study b 1 0.0000452
# # ... with 80 more rows
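The question also asks for the observed values drawn as a line through the boxplots of each panel. A minimal sketch of one way to do that is below. It assumes made-up stand-in arrays with the same shapes as `observed` and `simul`, and it reshapes them with base R's `as.data.frame(as.table(...))` rather than `as.tbl_cube()`; the names `sim_long` and `obs_long` are illustrative only.

```r
library(ggplot2)

# Stand-in data with the same shapes as in the question (values are made up)
set.seed(1)
observed <- matrix(runif(18), nrow = 6, ncol = 3)
simul <- array(rep(observed, 5) + rnorm(90, sd = 0.05), dim = c(6, 3, 5))

dn <- list(month = month.abb[1:6], study = letters[1:3])
dimnames(observed) <- dn
dimnames(simul) <- c(dn, list(simval = 1:5))

# One row per array cell; factor levels follow the dimnames order
sim_long <- as.data.frame(as.table(simul), responseName = "value")
obs_long <- as.data.frame(as.table(observed), responseName = "value")

p <- ggplot(sim_long, aes(x = month, y = value)) +
  geom_boxplot() +
  # observed values: one line per study area, drawn over the boxplots
  geom_line(data = obs_long, aes(group = study), colour = "blue") +
  geom_point(data = obs_long, colour = "blue") +
  facet_wrap(~ study, nrow = 1, scales = "free_y")
```

Printing `p` draws the figure. Because both data frames share the `month` and `study` columns, the observed line lands in the matching panel.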

Related

Generating repeated measures dataset

I'm looking to generate a dataset in R for a repeated measures model and I'm not sure where to start.
The outcome of interest is continuous between 0-100. This is for a two arm trial (say groups "a" and "b"), with 309 participants in each arm. Each participant is assessed at baseline, then fortnightly for one year (27 total assessments). There will be loss to followup and withdrawals over the year (~30% after one year), and participants may miss individual assessments at random.
For now, I am assuming the standard deviation is the same at each timepoint, and for both arms (11). The mean will change over time. I'm working on the assumption each participant's score is correlated with their baseline measurement.
How can I generate this dataset? I'm intending to compare repeated measures regression methods.
I think the following fulfils your requirements. It works by taking the cumulative sum of samples from a normal distribution over 27 visits and converting these onto a logistic scale between 0 and 100 (so that the maximum / minimum scores are never breached). It uses replicate to do this for 309 participants. It then simulates 30% dropouts by choosing random participants and a random visit, from which point on their measurements are all NA. It also adds some randomly missing visits for the rest of the participants. The result is pivoted into long format to allow for easier analysis.
library(tidyverse)
set.seed(1)
# Generate correlated scores for 309 people over 27 visits
df <- setNames(cbind(data.frame(ID = 1:309, t(replicate(309, {
  x <- cumsum(rnorm(27, 0.05, 0.1))
  round(100 * exp(x)/(1 + exp(x)))
})))), c('ID', paste0('Visit_', 1:27)))
# Model dropouts at a 30% rate; note that `dropout` is TRUE for participants who stay
dropout <- sample(c(TRUE, FALSE), 309, TRUE, prob = c(0.7, 0.3))
# Mark one random visit as NA for each participant who drops out
df[cbind(which(!dropout), sample(2:28, sum(!dropout), TRUE))] <- NA
# Set every visit from the dropout point onward to NA
df <- as.data.frame(t(apply(df, 1, function(x) ifelse(is.na(cumsum(x)), NA,x))))
# Add random missing visits
df[cbind(sample(309, 100, TRUE), sample(2:28, 100, TRUE))] <- NA
df <- pivot_longer(df, -ID, names_to = 'Week', values_to = 'Score') %>%
  mutate(Week = 2 * (as.numeric(gsub('\\D+', '', Week)) - 1))
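The dropout step in the code above relies on `cumsum()` propagating `NA`: once a missing value appears, every later cumulative sum is also `NA`. A small illustration on a made-up score vector (the names `x` and `after_dropout` are only for this sketch):

```r
# A participant whose third visit was marked as the dropout point
x <- c(50, 52, NA, 55, 57)

# cumsum() is NA from the first NA onward, so this blanks all later visits
after_dropout <- ifelse(is.na(cumsum(x)), NA, x)
after_dropout
# 50 52 NA NA NA
```

This is why the randomly missing visits are added afterwards: doing it before this step would turn every isolated missing visit into a dropout.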
Our data frame now looks like this:
head(df)
#> # A tibble: 6 x 3
#> ID Week Score
#> <dbl> <dbl> <dbl>
#> 1 1 0 50
#> 2 1 2 51
#> 3 1 4 51
#> 4 1 6 56
#> 5 1 8 58
#> 6 1 10 57
And we can see the scores drift upward over time (since we set a small positive mean in rnorm when creating the scores):
lm(Score ~ Week, data = df)
#>
#> Call:
#> lm(formula = Score ~ Week, data = df)
#>
#> Coefficients:
#> (Intercept) Week
#> 52.2392 0.5102
We can plot and see the overall shape of the scores and their spread:
ggplot(df, aes(Week, Score, group = ID)) + geom_line(alpha = 0.1)
Created on 2023-01-31 with reprex v2.0.2

X limits with continuous character values in R ggplot

I am creating a bar graph with continuous x-labels of 'Fiscal Years', such as "2009/10", "2010/11", etc. I have a column in my dataset with a specific Fiscal Year that I would like the x-labels to begin at (see example image below). Then, I would like the x-labels to be every continuous Fiscal Year until the present. The last x-label should be "2018/19". When I try to set the limits with scale_x_continuous, I receive an error of Error: Discrete value supplied to continuous scale. However, if I use 'scale_x_discrete', I get a graph with only two bars: my chosen "Start" date and the "End" of 2018/19.
Start<-Project_x$Start[c(1)]
End<-"2018/2019"
ggplot(Project_x, (aes(x=`FY`, y=Amount)), na.rm=TRUE)+
geom_bar(stat="identity", position="stack")+
scale_x_continuous(limits = c(Start,End))
` Error: Discrete value supplied to continuous scale `
Thank you.
My data is:
df <- data.frame(Project = c(5, 6, 5, 5, 9, 5),
FY = c("2010/11","2017/18","2012/13","2011/12","2003/04","2000/01"),
Start=c("2010/11", "2011/12", "2010/11", "2010/11", "2001/02", "2010/11"),
Amount = c(500,502,788,100,78,NA))
To use the code in the answer below, I need to base my Start_Year off of my Start column rather than the FY column, and the graph should just be for Project #5.
as.tibble(df) %>%
mutate(Start_Year = as.numeric(sub("/\\d{2}","",Start)))
xlabel_start<-subset(df$Start_Year, Project == 5)
xlabel_end<-2018
filter(between(Start_Year,xlabel_start,xlabel_end)) %>%
ggplot(aes(x = FY, y = Amount))+
geom_col()
When running this, my xlabel_start is NULL.
In ggplot2, continuous scales are reserved for numerical values. Here, your fiscal years are in character (or factor) format, so they are treated as discrete values and sorted alphabetically by ggplot2.
One possible solution to get your expected plot is to create a new variable containing the starting year of the fiscal year and filter for values between 2010 and 2018.
But first, we are going to isolate the project and the starting year of interest by creating a new data frame:
library(dplyr)
xlabel_end <- 2018 # end year, as set in the question
xlabel_start <- as_tibble(df) %>%
  mutate(Start_Year = as.numeric(sub("/\\d{2}","",Start))) %>%
  distinct(Project, Start_Year) %>%
  filter(Project == 5)
# A tibble: 1 x 2
Project Start_Year
<dbl> <dbl>
1 5 2010
Now, using almost the same pipeline, we can isolate the values of interest by doing:
library(tidyverse)
as_tibble(df) %>%
  mutate(Year = as.numeric(sub("/\\d{2}","",FY))) %>%
  filter(Project == 5 & between(Year,xlabel_start$Start_Year,xlabel_end))
# A tibble: 3 x 5
Project FY Start Amount Year
<dbl> <fct> <fct> <dbl> <dbl>
1 5 2010/11 2010/11 500 2010
2 5 2012/13 2010/11 788 2012
3 5 2011/12 2010/11 100 2011
And once you have done this, you can simply add the ggplot plotting part at the end of this pipe sequence:
library(tidyverse)
as_tibble(df) %>%
  mutate(Year = as.numeric(sub("/\\d{2}","",FY))) %>%
  filter(Project == 5 & between(Year,xlabel_start$Start_Year,xlabel_end)) %>%
  ggplot(aes(x = FY, y = Amount))+
  geom_col()
Does this answer your question?
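The question also asks for every consecutive fiscal year to appear on the axis, up to "2018/19", even where there is no data. One way to sketch that is to generate the full set of labels and pass it to `scale_x_discrete(limits = ...)`. The helper `fy_labels()` below is hypothetical (not part of ggplot2), and the start and end years are hard-coded as assumptions taken from the question.

```r
library(ggplot2)

# Hypothetical helper: "YYYY/YY" labels for every fiscal year in a range
fy_labels <- function(start, end) {
  years <- as.integer(start):as.integer(end)
  sprintf("%d/%02d", years, (years + 1L) %% 100L)
}

df <- data.frame(
  Project = c(5, 6, 5, 5, 9, 5),
  FY = c("2010/11", "2017/18", "2012/13", "2011/12", "2003/04", "2000/01"),
  Amount = c(500, 502, 788, 100, 78, NA)
)

all_fy <- fy_labels(2010, 2018)

# Keep only Project 5 rows whose fiscal year falls inside the axis range
proj5 <- subset(df, Project == 5 & FY %in% all_fy)

p <- ggplot(proj5, aes(x = FY, y = Amount)) +
  geom_col() +
  scale_x_discrete(limits = all_fy) # years without data stay on the axis
```

Setting `limits` on the discrete scale fixes both the order and the full set of categories, so the empty years show up as gaps rather than being dropped.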

adding rows to a tibble based on mostly replicating existing rows

I have data that only shows a variable if it is not 0. However, I would like to have gaps representing these 0s in the graph.
(I will be working from a large dataframe, but have created an example data based on how I will be manipulating it for this purpose.)
library(tidyverse)
library(ggplot2)
A <- tibble(
name = c("CTX_M", "CblA_1"),
rpkm = c(350, 4),
sample = "A"
)
B <- tibble(
name = c("CTX_M", "OXA_1", "ampC"),
rpkm = c(324, 357, 99),
sample = "B"
)
plot <- bind_rows(A, B)
ggplot()+ geom_col(data = plot, aes(x = sample, y = rpkm, fill = name),
position = "dodge")
Samples A and B both have CTX_M, but the other three "names" are only present in either sample A or sample B. When I run the code, the output graph shows two bars for sample A and three bars for sample B.
Is there a way for me to add CblA_1 to sample B with rpkm=0, and OXA_1 and ampC to sample A with rpkm=0, while maintaining sample separation, so that the tibble would look like this (order not important):
and the graph would therefore look like this:
You can use complete() from tidyr.
plot <- plot %>% complete(name, sample, fill = list(rpkm = 0))
# A tibble: 8 x 3
name sample rpkm
<chr> <chr> <dbl>
1 ampC A 0
2 ampC B 99
3 CblA_1 A 4
4 CblA_1 B 0
5 CTX_M A 350
6 CTX_M B 324
7 OXA_1 A 0
8 OXA_1 B 357
ggplot()+ geom_col(data = plot, aes(x = sample, y = rpkm, fill = name),
position = "dodge")

Plot a multivariate histogram in R

I would like to plot 6 different variables with their corresponding calculated statistical data. The following dataframe may serve as an example
X aggr_a aggr_b count
<chr> <dbl> <dbl> <dbl>
1 A 470676 594423 58615
2 B 549142 657291 67912
3 C 256204 311723 26606
4 D 248256 276593 40201
5 E 1581770 1717788 250553
6 F 1932096 2436769 385556
I would like to plot each row as category with its statistics as histogram bins. The desired output is
Can I use ggplot2 for this kind of graph?
All the available resources seem to cover the univariate case only.
library(tidyverse)
df = read.table(text = "
X aggr_a aggr_b count
A 470676 594423 58615
B 549142 657291 67912
C 256204 311723 26606
D 248256 276593 40201
E 1581770 1717788 250553
F 1932096 2436769 385556
", header=TRUE)
df %>%
  gather(type,value,-X) %>% # reshape dataset to long format
  ggplot(aes(X,value,fill=type))+
  geom_bar(position = "dodge", stat = "identity")
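`gather()` still works but is superseded in tidyr. A sketch of the same reshape with `pivot_longer()` is below, assuming tidyr 1.0.0 or later; `geom_col()` is used as shorthand for `geom_bar(stat = "identity")`. The name `long` is illustrative.

```r
library(tidyr)
library(ggplot2)

df <- read.table(text = "
X aggr_a aggr_b count
A 470676 594423 58615
B 549142 657291 67912
C 256204 311723 26606
D 248256 276593 40201
E 1581770 1717788 250553
F 1932096 2436769 385556
", header = TRUE)

# One row per (category, statistic) pair
long <- pivot_longer(df, cols = -X, names_to = "type", values_to = "value")

# Grouped bars: one cluster per category, one bar per statistic
p <- ggplot(long, aes(X, value, fill = type)) +
  geom_col(position = "dodge")
```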

R conditional lookup and sum

I have data on college course completions, with estimated numbers of students from each cohort completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students outputting from each College and Course in any year.
The output of students in a given year will be the sum of the previous 7 cohorts outputting after 1, 2, 3, ... 7 years.
For example, the number of students outputting in 2014 from COLLEGE 1, COURSE A is equal to the sum of:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years
So there are two dataframes: a lookup table that contains all the output estimates, and a smaller summary table that I'm trying to modify. I want to update dummy.summary$output with, for each row, the total output based on the above calculation.
The following code will replicate my data pretty well
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
The following code does not work, but shows the approach I've been attempting.
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
I guess it doesn't work because I was hoping to use 'x' in the anonymous function to index particular values of the dummy.summary dataframe. But that clearly isn't happening and is only returning zero for each row, presumably because the starting value of 'x' is zero each time. I don't know if it is possible to access the index position of each value that sapply loops over, and use that to index my summary dataframe.
Is this approach fixable or do I need a completely different approach?
Even if it is fixable, is there a more elegant/faster way to achieve what I'm trying to do?
Thanks in anticipation.
I've replaced your output.year with output.year2, which, instead of a value from 1 to 7, holds the calendar year in which the output occurs (cohort + output.year).
The output information you want corresponds to output.year2, but the intake information you want corresponds to the cohort. So I calculate them separately and then join the two tables. This automatically creates empty output info for 1998 (NA, which I convert to 0).
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
library(dplyr)
# create result table for output info
dt_output =
  dummy.lookup %>%
  mutate(output.year2 = output.year+cohort) %>% # calendar year in which the output occurs
  group_by(output.year2, college, course) %>% # for each output year, college, course
  summarise(SumOutput = sum(output)) %>% # calculate sum of output
  ungroup() %>%
  arrange(college,course,output.year2) %>% # for visualisation purposes
  rename(cohort = output.year2) # rename column so it can be joined with intake
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
  full_join(dt_output, by=c("cohort","college","course")) %>%
  mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
  arrange(college,course,cohort) %>% # for visualisation purposes
  as_tibble() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...
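On the narrower question of whether the original `sapply` approach is fixable: the loop needs to run over row positions rather than over the values of the `output` column, and `append()` returns a new vector instead of modifying `vec` in place, so `vec` stayed empty. A minimal base R sketch on a tiny made-up lookup table (the names `lookup` and `summary_tbl` are illustrative, and college and course are left out for brevity):

```r
# Tiny stand-in lookup: three cohorts, each with output after 1 and 2 years
lookup <- data.frame(
  cohort      = rep(2010:2012, each = 2),
  output.year = rep(1:2, times = 3),
  output      = c(10, 5, 20, 6, 30, 7)
)
summary_tbl <- data.frame(year = 2011:2013)

# Iterate over row indices so each row of the summary table can be looked up
summary_tbl$output <- sapply(seq_len(nrow(summary_tbl)), function(i) {
  yr <- summary_tbl$year[i]
  # A cohort contributes to year `yr` when cohort + output.year equals it
  rows <- lookup$cohort + lookup$output.year == yr
  sum(lookup$output[rows])
})

summary_tbl$output
# 10 25 36
```

With the full data, the row filter would also match on college and course. The grouped dplyr version above is likely to be faster on large tables, since it avoids scanning the lookup table once per summary row.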
