Combine multiple plots, generated using the "by" R function, in one figure - r

I have a data frame containing multiple numeric columns and one column with different factors. I'd like to produce a unique image containing the plots of the numeric columns, by factor. I tried the following:
varA <- runif(40)
varB <- runif(40)
varC <- runif(40)
mainVar <- c(rep('cat', 10), rep('dog', 10), rep('mouse', 10), rep('frog', 10))
plotData <- data.frame(varA, varB, varC, mainVar)
pdf('asd.pdf')
par(mfrow=c(2,2))
by(plotData, plotData$mainVar, function(x){
par(mfrow=c(1,3))
boxplot(x$varA)
boxplot(x$varB)
boxplot(x$varC)
})
dev.off()
It produces a single PDF, but with one page per factor level.
Instead, I'd like to get something like this (without the red lines):

First, both techniques shown here prefer data in a "tall" format. Several tools will reshape it for you; I'll use tidyr:
# library(tidyr)
plotDataTall <- tidyr::gather(plotData, k, v, -mainVar)
head(plotDataTall)
# mainVar k v
# 1 cat varA 0.4023846
# 2 cat varA 0.3406813
# 3 cat varA 0.7990530
# 4 cat varA 0.3706167
# 5 cat varA 0.5986029
# 6 cat varA 0.1626782
Other tools include the reshape2 package and the stats function reshape, each of which is less intuitive than the last for first-time users.
ggplot2
library(ggplot2)
ggplot(plotDataTall, aes(x = k, y = v)) +
geom_boxplot() +
facet_wrap(~ mainVar, nrow=2) +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
Base R
Because you cannot nest uses of par(mfrow=...) (a later call replaces the earlier one rather than nesting inside it), you can stick with the over-arching 2x2 layout and handle the per-variable boxplots within a single boxplot call. This can be done with the wide data:
par(mfrow=c(2,2), mar=c(4,3,3,0.1))
ign <- by(plotData, plotData$mainVar, function(x) {
boxplot(x$varA, x$varB, x$varC, main=x$mainVar[1])
})
or the tall format:
par(mfrow=c(2,2), mar=c(4,3,3,0.1))
ign <- by(plotDataTall, plotDataTall$mainVar, function(x) {
boxplot(v~k, data=x, main=x$mainVar[1])
})
(I took the liberty of adjusting the margins, primarily for a shrunken combined plot here. Not required for production.)
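If separate panels per variable are really wanted, as in the question's mock-up, one possible sketch is to set up a single 2x6 grid once and fill it in order, so that every three consecutive cells belong to one factor level. This is an illustration rather than part of the answer above: the seed, the output file name and the page size are assumptions, and the data is regenerated so that the block runs on its own.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
plotData <- data.frame(
  varA = runif(40), varB = runif(40), varC = runif(40),
  mainVar = rep(c("cat", "dog", "mouse", "frog"), each = 10)
)

# Hypothetical output location; any writable path works
out_file <- file.path(tempdir(), "combined_boxplots.pdf")
pdf(out_file, width = 12, height = 6)

# Set the grid once: 2 rows x 6 columns = 12 cells = 4 levels x 3 variables
par(mfrow = c(2, 6), mar = c(2, 2, 3, 0.5))
for (lvl in unique(plotData$mainVar)) {
  x <- plotData[plotData$mainVar == lvl, ]
  for (v in c("varA", "varB", "varC")) {
    # Each call fills the next free cell, row by row
    boxplot(x[[v]], main = paste(lvl, v))
  }
}
dev.off()
```

Because par(mfrow=...) is called only once, all twelve boxplots land on the same page.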

As r2evans already points out, I doubt this is possible with the base plot function. Using ggplot2 (part of the tidyverse) you can get a one-page plot using:
library(tidyverse)
plotData %>%
gather(var, y, -mainVar) %>%
ggplot(aes(x = var, y = y)) + geom_boxplot() + facet_wrap(~mainVar)
Note that this also uses the pipe operator (from magrittr, re-exported by dplyr) and gather (tidyr), both part of the tidyverse.

Related

Problem with the order of the x-axis in my plot

I have a problem with the order of the x-axis in my plot.
Download the data from the website of the Central Bank of Perú (GDP by quarter).
library(jsonlite)
library(rstudioapi)
library(ggplot2)
library(data.table)
PBI <- "PN02635BQ"
URL3 <- paste0("https://estadisticas.bcrp.gob.pe/estadisticas/series/api/",
PBI,"/json/2018-1-1/2021-7-31")
Use jsonlite to download the data:
l_json <- jsonlite::fromJSON(URL3)
dt_PBI <- data.table(l_json$periods)
sapply(dt_PBI,class)
dt_PBI[,values := round(as.numeric(values),4)]
colnames(dt_PBI)<- c("Quarter", "Millions")
As you can see, the x-axis of the plot is not in chronological order.
ggplot(dt_PBI, aes(x=Quarter, y=Millions)) +
geom_point()
The x-axis is out of order because your character column is converted to a factor whose levels are sorted alphanumerically, so in your case "T1.19" sorts before "T2.18". To fix this, use the forcats package and reorder the factor by year. I split the year out into its own column ("year") to make clear what is being done.
library(jsonlite)
library(rstudioapi)
library(ggplot2)
library(data.table)
library(forcats)
PBI <- "PN02635BQ"
URL3 <- paste0("https://estadisticas.bcrp.gob.pe/estadisticas/series/api/",
PBI,"/json/2018-1-1/2021-7-31")
l_json <- jsonlite::fromJSON(URL3)
dt_PBI <- data.table(l_json$periods)
sapply(dt_PBI,class)
dt_PBI[,values := round(as.numeric(values),4)]
colnames(dt_PBI) <- c("Quarter", "Millions")
dt_PBI$year <- substr(dt_PBI$Quarter, 4, 5)
dt_PBI$Quarter <- fct_reorder(dt_PBI$Quarter, dt_PBI$year, min)
ggplot(dt_PBI, aes(x=Quarter, y=Millions)) +
geom_point()
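A possible alternative without forcats, assuming the API returns the periods already in chronological order, is to fix the factor levels to their order of appearance. The small data frame below is a made-up stand-in for the downloaded table (the values are not real GDP figures); only the "T1.18"-style labels follow the format mentioned above.

```r
library(ggplot2)

# Hypothetical stand-in for dt_PBI, rows already in chronological order
dt_PBI <- data.frame(
  Quarter  = c("T1.18", "T2.18", "T3.18", "T4.18", "T1.19", "T2.19"),
  Millions = c(100, 105, 103, 110, 104, 108)  # made-up values
)

# Use the order of appearance as the level order instead of alphabetical order
dt_PBI$Quarter <- factor(dt_PBI$Quarter, levels = unique(dt_PBI$Quarter))

# ggplot2 lays out a discrete axis in factor-level order
p <- ggplot(dt_PBI, aes(x = Quarter, y = Millions)) +
  geom_point()
```

If the rows were not already sorted, they would have to be ordered by year and quarter before calling unique().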

Using Geom_Segment in R

I am trying to create a plot in R using geom_segment. I am stuck on an error that says I need to supply yend, but I am supplying it already. This is my code:
library(ggplot2)
library(data.table)
library(magrittr)
dataset$From<-Sys.Date()
format(dataset$From, format="%Y-%m-%dT%H:%M:%OS")
dataset$To<-Sys.Date()
format(dataset$To, format="%Y-%m-%dT%H:%M:%OS")
ggplot(dataset, aes(x=datetime_start, y=dataset$Audit_Title,
color=dataset$Employee_Name)) +
geom_segment(aes(x=dataset$From,xend=dataset$To,y=dataset$Audit_Title,yend=dataset$Audit_Title),size=20)+
scale_colour_discrete(guide=guide_legend(override.aes=list(size=15))) +
ggtitle("Audit by Employee Timeline") + xlab("") + ylab("") + theme_bw()
SAMPLE DATA:
Here is the sample data
This is how I changed the code below to take in the data from Excel I inputted into Power BI:
library(ggplot2)
library(dplyr)
# transform into date
dataset <- dataset %>%
mutate_at(vars(dataset$From, dataset$To),
.funs = function(tt) readr::parse_date(as.character(tt),
format = "%m/%d/%Y"))
ggplot(dataset)+
geom_segment(aes(x=dataset$From, xend=dataset$To,
y=dataset$Employee_Name, yend=dataset$Employee_Name))
First of all, ideally you would share your data as the output of dput(dataset). If you can't share real data, you should make a minimal reproducible example and share that.
Here's your data
library(ggplot2)
library(dplyr)
df <-
read.table(
text =
"01/03/2020 03/16/2020 Supply_Chain John_Smith
05/08/2020 08/20/2020 Business_Unit Karen_Scott")
names(df) <- c("From", "To", "Audit_Title", "Employee_Name")
# transform into date
df <- df %>%
mutate_at(vars(From, To),
.funs = function(tt) readr::parse_date(as.character(tt),
format = "%m/%d/%Y"))
Now do the actual plot by mapping From to x and To to xend, and using the employee for both y and yend.
ggplot(df)+
geom_segment(aes(x=From, xend=To,
y=Employee_Name, yend=Employee_Name))
Which produces
If you want fancy colors, labels and so on, check the ggplot2 documentation.
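As a rough sketch of how the colour, title and theme from the question's first attempt could be layered on top, the block below rebuilds the two sample rows with base as.Date (instead of readr) so that it runs on its own. The segment width of 8 is an arbitrary choice, and the linewidth argument assumes ggplot2 3.4.0 or later; older versions use size instead.

```r
library(ggplot2)

# The two sample rows from above, parsed with base R
df <- data.frame(
  From          = as.Date(c("01/03/2020", "05/08/2020"), format = "%m/%d/%Y"),
  To            = as.Date(c("03/16/2020", "08/20/2020"), format = "%m/%d/%Y"),
  Audit_Title   = c("Supply_Chain", "Business_Unit"),
  Employee_Name = c("John_Smith", "Karen_Scott")
)

# Bare column names inside aes(); no dataset$ prefix is needed
p <- ggplot(df) +
  geom_segment(aes(x = From, xend = To,
                   y = Audit_Title, yend = Audit_Title,
                   colour = Employee_Name),
               linewidth = 8) +  # assumes ggplot2 >= 3.4.0
  ggtitle("Audit by Employee Timeline") +
  xlab("") + ylab("") +
  theme_bw()
```

Printing p draws one thick horizontal bar per audit, coloured by employee.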

ggplot facet grid within a factor

Consider data that looks like this
fitem<-rep(rep(1:16,each=3),2)
fsubs<-factor(rep(rep(paste('sub',1:3,sep=''),16),2))
ftime<-factor(as.character(rep(c('a','b'),each=48)))
fcounts<-as.numeric(round(runif(96,1,10)))
fdf<-data.frame(fsubs,fitem,fcounts,ftime)
head(fdf)
fsubs fitem fcounts ftime
1 sub1 1 8 a
2 sub2 1 10 a
3 sub3 1 4 a
4 sub1 2 4 a
5 sub2 2 1 a
6 sub3 2 6 a
I would like to plot a facet grid that shows the counts for the two time points ('a', 'b') against each other, one panel per subject. I can't seem to figure out how to plot this in ggplot.
Here is my ugly attempt to do it:
fdf_counts<-data.frame()
for (i in unique(fdf$fsubs)){
fdf_counts<-append(fdf_counts,cbind(fdf%>%filter(fsubs==i,ftime=='a')%>%dplyr::select(fcounts),
fdf%>%filter(fsubs==i,ftime=='b')%>%dplyr::select(fcounts)))
fdf_counts<-data.frame(fdf_counts)
}
s1<-ggplot(fdf_counts,aes(x=fcounts,y=fcounts.1))+geom_point()+geom_smooth(method='lm')+labs(x='a',y='b',title='sub1')
s2<-ggplot(fdf_counts,aes(x=fcounts.2,y=fcounts.3))+geom_point()+geom_smooth(method='lm')+labs(x='a',y='b',title='sub2')
s3<-ggplot(fdf_counts,aes(x=fcounts.4,y=fcounts.5))+geom_point()+geom_smooth(method='lm')+labs(x='a',y='b',title='sub3')
plot_grid(s1,s2,s3)#from 'cowplot' package
How can I do this using the original fdf data.frame, especially as the number of subs increases?
Or, for example, what if I wanted one scatter plot across all of the subs, with the fcounts plotted against each other, ftime 'a' on the x-axis and ftime 'b' on the y-axis?
Consider merging the data frame with itself on fsubs and fitem (fitem being the sequential item number within each fsubs and ftime grouping). This approach lets you keep your long, tidy data format, which is the ideal format for ggplot, since you can then use facet_grid on fsubs without iteration.
mdf <- merge(subset(fdf, ftime=="a"),
subset(fdf, ftime=="b"),
by=c("fsubs", "fitem"),
suffixes=c("", "_"))
ggplot(mdf, aes(x=fcounts, y=fcounts_)) +
geom_point() +
geom_smooth(method='lm') +
labs(x='a', y='b') +
facet_grid(~fsubs)
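For the second part of the question (one scatter plot across all subs), a minimal sketch is to reuse the same merged data and drop the facet, colouring the points by subject instead. The seed is an arbitrary assumption added for reproducibility, and the colour mapping is an illustrative choice, not something the question asked for.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility
fdf <- data.frame(
  fsubs   = factor(rep(rep(paste0("sub", 1:3), 16), 2)),
  fitem   = rep(rep(1:16, each = 3), 2),
  fcounts = round(runif(96, 1, 10)),
  ftime   = factor(rep(c("a", "b"), each = 48))
)

# Self-merge: one row per (fsubs, fitem) with the 'a' and 'b' counts side by side
mdf <- merge(subset(fdf, ftime == "a"),
             subset(fdf, ftime == "b"),
             by = c("fsubs", "fitem"),
             suffixes = c("", "_"))

# Single panel: all subjects together, one overall regression line
p_all <- ggplot(mdf, aes(x = fcounts, y = fcounts_)) +
  geom_point(aes(colour = fsubs)) +
  geom_smooth(method = "lm") +
  labs(x = "a", y = "b")
```

Moving colour = fsubs into the top-level aes() would instead fit one line per subject.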
This should get you close:
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
fitem<-rep(rep(1:16,each=3),2)
fsubs<-factor(rep(rep(paste('sub',1:3,sep=''),16),2))
ftime<-factor(as.character(rep(c('a','b'),each=48)))
fcounts<-as.numeric(round(runif(96,1,10)))
fdf<-tibble(fsubs,fitem,fcounts,ftime)
fdf <- fdf %>%
group_by(ftime) %>%
mutate(row_id = row_number()) %>%
pivot_wider(values_from = fcounts,
names_from = ftime)
ggplot(data = fdf, aes(x = a, y = b)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(fsubs ~ ., ncol = 1)
The tidyr function pivot_wider allows us to create the shape of the data we need without explicit loops: create new columns a and b with values from fcounts. We do need to create a unique row id to make this work.
By the way, when I run your code the plots look different from what you posted in the question.
With this output:
I tried to create a visualization that shows all four variables at once and ended up with a bar chart (geom_bar):
```{r}
library(dplyr)    # for %>%, filter and select
library(ggplot2)
fitem<-rep(rep(1:16,each=3),2)
fsubs<-factor(rep(rep(paste('sub',1:3,sep=''),16),2))
ftime<-factor(as.character(rep(c('a','b'),each=48)))
fcounts<-as.numeric(round(runif(96,1,10)))
fdf<-data.frame(fsubs,fitem,fcounts,ftime)
fdf_counts<-data.frame()
for (i in unique(fdf$fsubs)){
fdf_counts<-append(fdf_counts,cbind(fdf%>%filter(fsubs==i,ftime=='a')%>%dplyr::select(fcounts),
fdf%>%filter(fsubs==i,ftime=='b')%>%dplyr::select(fcounts)))
fdf_counts<-data.frame(fdf_counts)
}
# use bare column names inside aes(); fdf$... bypasses the per-facet subsetting
ggplot(data = fdf, mapping = aes(x = fsubs, y = fcounts, fill = fitem)) + geom_bar(stat = "identity", position = "dodge") + facet_grid(cols = vars(ftime))
```

refining a simple code in R, it all works I want to polish it

here is my data
https://filebin.net/i2wpmeb19dacs3nr
it's very simple calculation to do but my code is messy.
#plot
library(ggplot2)
library(Hmisc)
library(svglite)
# Basic dot plot
# to use the summary stats, rows with NA have to be deleted (or otherwise ignored)
leaf_count_jas_comlete<-leaf_count_jas[complete.cases(leaf_count_jas), ]
#all factors to character
leaf_count_jas_comlete <- data.frame(lapply(leaf_count_jas_comlete, as.character), stringsAsFactors=FALSE)
#change class for numbers
leaf_count_jas_comlete$leaf.no.<- as.numeric(leaf_count_jas_comlete$leaf.no.)
leaf_count_jas_comlete$height..cm.<- as.numeric(leaf_count_jas_comlete$height..cm.)
svg("rplot_height..cm..svg")
ggplot(leaf_count_jas_comlete, aes(x=genotype, y=height..cm.)) +
geom_dotplot(binaxis='y', stackdir='center', binwidth = 1, dotsize = 0.3)+
stat_summary(fun.y=mean, geom="point", shape=18,
size=3, color="blue") +
stat_summary(fun.data=mean_sdl, fun.args = list(mult=1),
geom="pointrange", color="blue")
dev.off()
#calculate mean and sd per group and export to a new table
library(plyr)
##add NA count
dt <- leaf_count_jas_comlete
jas_summary<-data.frame()
jas_summary_h<-ddply(dt,~genotype,summarise,mean=mean(height..cm.),sd=sd(height..cm.))
jas_summary_l<-ddply(dt,~genotype,summarise,mean=mean(leaf.no.),sd=sd(leaf.no.))
jas_summary_h_l<-merge(jas_summary_h,jas_summary_l, by="genotype", all=TRUE)
library(plyr)
n<-count(leaf_count_jas, "genotype")
dead<-aggregate(leaf.no. ~ genotype, leaf_count_jas, function(x) {sum(is.na(x))}, na.action = NULL)
jas_summary_h_l_dead<-merge(jas_summary_h_l,dead, by="genotype", all=TRUE)
jas_summary_h_l_dead_n<-merge(jas_summary_h_l_dead,n, by="genotype", all=TRUE)
I would like to start coding in a simpler way. For example, how do I make an empty df and gradually fill it with the summary calculations, so that it has the columns "genotype", "mean_leaf_no", "leaf_no_sd", "mean_height", "height_sd", "no_plants" and "dead_plants", instead of merging new dfs? Also, as it stands I need to name the columns by hand, but I would like this piece of code to be reusable, so the names should be constructed from the original df's column names (genotype, leaf_no, height). For example, if someone measures "flowers_no", that name should carry through to the summary table.
Help, please, help.
You should consider using the tidyverse if you'd like simpler code.
For example, the first part would be:
library(tidyverse); library(magrittr); library(janitor)
new_set = leaf_count_jas_comlete %>%
# clean names
clean_names() %>%
# remove rows with NA
drop_na() %>%
# convert factors to character
mutate_if(is.factor, as.character) %>%
#convert to numeric
mutate(leaf_no=as.numeric(leaf_no), height_cm=as.numeric(height_cm))
new_set %>%
group_by(genotype) %>%
summarize(mean_height=mean(height_cm), sd_height=sd(height_cm))
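One way the rest of the summary table could be built in a single step, with column names derived from the measured columns, is dplyr's across() with a .names pattern. This is a sketch under assumptions: the tiny data frame and the vector measured are hypothetical stand-ins for the real file (with names already cleaned), and NA in leaf_no is taken to mark a dead plant, as in the aggregate() call in the question.

```r
library(dplyr)

# Hypothetical stand-in for the real data; NA marks a dead plant
leaf_count_jas <- data.frame(
  genotype  = c("wt", "wt", "wt", "mut", "mut", "mut"),
  leaf_no   = c(5, 7, NA, 4, 6, 8),
  height_cm = c(10, 12, NA, 9, 11, 13)
)

# Columns to summarise; add e.g. "flowers_no" here and the names follow along
measured <- c("leaf_no", "height_cm")

jas_summary <- leaf_count_jas %>%
  group_by(genotype) %>%
  summarise(
    no_plants   = n(),                  # all rows, including dead plants
    dead_plants = sum(is.na(leaf_no)),  # rows with a missing leaf count
    across(all_of(measured),
           list(mean = ~ mean(.x, na.rm = TRUE),
                sd   = ~ sd(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}")     # e.g. leaf_no_mean, height_cm_sd
  )
```

Because na.rm = TRUE is used inside the summaries, the rows with NA do not have to be dropped first, so the plant counts and the statistics come from the same table and no merge is needed.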

Substituting dates with number of days in time series

I have following data on student scores on several pretests before their true exam.
a<-(c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b<-c(300,230,400,NA,NA,NA,"2013-04-30")
c<-c(NA,260,410,420,NA,NA,"2013-05-30")
d<-c(300,230,400,NA,370,390,"2013-08-30")
df<-as.data.frame(rbind(b,c,d))
colnames(df)<-a
rownames(df)<-(c("student 1","student 2","student 3"))
The actual datasheet is much larger. Since the dates vary so much, and the timing between the pretests and to the exam are relatively similar, I would rather convert the true dates into the number of days before the exam, so that they are the new column names, not dates. I understand that this will merge some of the columns which is OK. How would I be able to do that?
This is another good use case for reshape2, because you want to go to long form for plotting. For example:
# you are going to need the student id as a field
df$student_id <- row.names(df)
library('reshape2')
df2 <- melt(df, id.vars = c('student_id','actual_exam_date'),
variable.name = 'pretest_date',
value.name = 'pretest_score')
# drop empty observations
df2 <- df2[!is.na(df2$pretest_score),]
# these need to be dates
df2$actual_exam_date <- as.Date(df2$actual_exam_date)
df2$pretest_date <- as.Date(df2$pretest_date)
# date difference
df2$days_before_exam <- as.integer(df2$actual_exam_date - df2$pretest_date)
# scores need to be numeric
df2$pretest_score <- as.numeric(df2$pretest_score)
# now you can make some plots
library('ggplot2')
ggplot(df2, aes(x = days_before_exam, y = pretest_score, col=student_id) ) +
geom_line(lwd=1) + scale_x_reverse() +
geom_vline(xintercept = 0, linetype = 'dashed', lwd = 1) +
ggtitle('Pretest Performance') + xlab('Days Before Exam') + ylab('Pretest Score')
Here is one way to approach this one. I am sure there are many others. I commented the code to explain what is going on at each step:
# Load the libraries you need
library(tidyr)
library(dplyr)
library(ggplot2)
# Construct data frame you provided
a <- (c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b <- c(300,230,400,NA,NA,NA,"2013-04-30")
c <- c(NA,260,410,420,NA,NA,"2013-05-30")
d <- c(300,230,400,NA,370,390,"2013-08-30")
df <- as.data.frame(rbind(b,c,d))
colnames(df) <- a
# Add student IDs as a column instead of row names and move them to first position
df$StudentID <- row.names(df)
row.names(df) <- NULL
df <- select(df, StudentID, everything())
# Gather date columns as 'categories' with score as the new column value
newdf <- df %>% gather(Date, Score, -actual_exam_date, -StudentID) %>% arrange(StudentID)
# Convert dates coded as factor variables into actual dates so we can do days to exam computation
newdf$actual_exam_date <- as.Date(as.character(newdf$actual_exam_date))
newdf$Date <- as.Date(as.character(newdf$Date))
# Create a new column of days before exam per student ID (group) and filter
# out dates with missing scores for each student
newdf <- newdf %>% group_by(StudentID) %>% mutate(daysBeforeExam = as.integer(difftime(actual_exam_date, Date, units = 'days'))) %>% filter(!is.na(Score))
# Plot the trends using ggplot
ggplot(newdf, aes(x = daysBeforeExam, y = Score, col = StudentID, group = StudentID)) + geom_line(size = 1) + geom_point(size = 2)
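Both answers stop at the long format. If the wide table from the question is still wanted, with days before the exam as column names, one possible sketch is to pivot the long data back with tidyr. The long data frame below is typed in by hand from the question's values, and the "days_" prefix is an arbitrary naming assumption.

```r
library(tidyr)

# Long data typed in from the question (missing pretests already dropped)
long <- data.frame(
  student_id = rep(c("student 1", "student 2", "student 3"), times = c(3, 3, 5)),
  exam_date  = as.Date(rep(c("2013-04-30", "2013-05-30", "2013-08-30"),
                           times = c(3, 3, 5))),
  pretest_date = as.Date(c("2013-02-25", "2013-03-13", "2013-04-24",
                           "2013-03-13", "2013-04-24", "2013-05-12",
                           "2013-02-25", "2013-03-13", "2013-04-24",
                           "2013-07-12", "2013-08-11")),
  pretest_score = c(300, 230, 400, 260, 410, 420, 300, 230, 400, 370, 390)
)

# Whole days between each pretest and that student's exam
long$days_before_exam <- as.integer(long$exam_date - long$pretest_date)

# One row per student, one column per distinct day count
wide <- pivot_wider(long,
                    id_cols      = student_id,
                    names_from   = days_before_exam,
                    values_from  = pretest_score,
                    names_prefix = "days_",
                    names_sort   = TRUE)
```

Students who share a day count end up in the same column; every other cell is NA, so with widely varying dates the wide table gets sparse quickly.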

Resources