Plotting Several Grouped Bar Plots in Loop [R] - r

my challenge is to plot several bar plots at once, a plot for each of variables of different subsets. My goal is to compare regional differences for each variable. I would like to print all the resulting plots to a html file via R Markdown.
My main difficulty in making automatic grouped bar charts is that you need to tabulate the groups using table(data$Var[i], data$Region)but I don't know how to do this automatically. I would highly appreciate a hint on this.
Here is a an example of what one of my subset looks like:
# To Create this example of data:
b <- rep(matrix(c(1,2,3,2,1,3,1,1,1,1)), times=10)
data <- matrix(b, ncol=10)
colnames(data) <- paste("Var", 1:10, sep = "")
data <- as.data.frame(data)
reg_name <- c("North", "South")
Region <- rep(reg_name, 5)
data <- cbind(data,Region)
Using beside = TRUE, I was able to create one grouped bar plot (grouped by Region for Var1 from data):
tb <- table(data$Var1,data$Region)
barplot(tb, main="Var1", xlab="Values", legend=rownames(tb), beside=TRUE,
col=c("green", "darkblue", "red"))
I would like to loop this process to generate for example 10 plots for Var1 to Var10:
for(i in 1:10){
tb <- table(data[i], data$Region)
barplot(tb, main = i, xlab = "Values", legend = rownames(tb), beside = TRUE,
col=c("green", "darkblue", "red"))
}
R prefer the apply family of functions, therefore I tried to create a function to be applied:
fct <- function(i) {
tb <- table(data[i], data$Region)
barplot(tb, main=i, xlab="Values", legend = rownames(tb), beside = TRUE,
col=c("green", "darkblue", "red"))
}
sapply(data, fct)
I have tried other ways, but I was never successful. Maybe lattice or ggplot2 would offer easier way to do this. I am just starting in R, I will gladly accept any tips and suggestions. Thank you!
(I run on Windows, with the most recent Rv3.1.2 "Pumpking Helmet")

Given that you say "My goal is to compare regional differences for each variable", I'm not sure you've chosen the optimal plotting strategy. But yes, it is possible to do what you are asking.
Here's the default plot you get with your code above, for reference:
If you want a list with 10 plots for each variable, you can do the following (with ggplot)
many_plots <-
# for each column name in dat (except the last one)...
lapply(names(dat)[-ncol(dat)], function(x) {
this_dat <- dat[, c(x, 'Region')]
names(this_dat)[1] <- 'Var'
ggplot(this_dat, aes(x=Var, fill=factor(Var))) +
geom_bar(binwidth=1) + facet_grid(~Region) +
theme_classic()
})
Sample output, for many_plots[[1]]:
If you wanted all the plots in one image, you can do this (using reshape and data.table)
library(data.table)
library(reshape2)
dat2 <-
data.table(melt(dat, id.var='Region'))[, .N, by=list(value, variable, Region)]
ggplot(dat2, aes(y=N, x=value, fill=factor(value))) +
geom_bar(stat='identity') + facet_grid(variable~Region) +
theme_classic()
...but that's not a great plot.

Related

Create barplot from a data.frame, with comparing columns [duplicate]

New to R and trying to figure out the barplot.
I am trying to create a barplot in R that displays data from 2 columns that are grouped by a third column.
DataFrame Name: SprintTotalHours
Columns with data:
OriginalEstimate,TimeSpent,Sprint
178,471.5,16.6.1
210,226,16.6.2
240,195,16.6.3
I want a barplot that shows the OriginalEstimate next to the TimeSpent for each sprint.
I tried this but I am not getting what I want:
colours = c("red","blue")
barplot(as.matrix(SprintTotalHours),main='Hours By Sprint',ylab='Hours', xlab='Sprint' ,beside = TRUE, col=colours)
abline(h=200)
I would like to use base graphics but if it can't be done then I am not opposed to installing a package if necessary.
Using base R :
DF <- read.csv(text=
"OriginalEstimate,TimeSpent,Sprint
178,471.5,16.6.1
210,226,16.6.2
240,195,16.6.3")
# prepare the matrix for barplot
# note that we exclude the 3rd column and we transpose the data
mx <- t(as.matrix(DF[-3]))
colnames(mx) <- DF$Sprint
colours = c("red","blue")
# note the use of ylim to give 30% space for the legend
barplot(mx,main='Hours By Sprint',ylab='Hours', xlab='Sprint',beside = TRUE,
col=colours, ylim=c(0,max(mx)*1.3))
# to add a box around the plot
box()
# add a legend
legend('topright',fill=colours,legend=c('OriginalEstimate','TimeSpent'))
cols <- c('red','blue');
ylim <- c(0,max(SprintTotalHours[c('OriginalEstimate','TimeSpent')])*1.8);
par(lwd=6);
barplot(
t(SprintTotalHours[c('OriginalEstimate','TimeSpent')]),
beside=T,
ylim=ylim,
border=cols,
col='white',
names.arg=SprintTotalHours$Sprint,
xlab='Sprint',
ylab='Hours',
legend.text=c('Estimated','TimeSpent'),
args.legend=list(text.col=cols,col=cols,border=cols,bty='n')
);
box();
Data
SprintTotalHours <- data.frame(OriginalEstimate=c(178L,210L,240L),TimeSpent=c(471.5,226,
195),Sprint=c('16.6.1','16.6.2','16.6.3'),stringsAsFactors=F);
You need to melt to long form so you can group. While you can do this in base R, not many people do, though there are a variety of package options (here tidyr). Again, ggplot2 gives you better results with less work, and is the way most people will end up plotting:
library(tidyr)
library(ggplot2)
ggplot(data = SprintTotalHours %>% gather(Variable, Hours, -Sprint),
aes(x = Sprint, y = Hours, fill = Variable)) +
geom_bar(stat = 'identity', position = 'dodge')
Use base R if you prefer, but this approach (more or less) is the conventional approach at this point.

drawing multiple plots, 2 per page using ggplot

I have a list of dataframes and I would like to print them all in a .RMarkdown document with 2 per page. However, I have not been able to find a source for doing this. Is it possible to do this via a for loop?
What I would like to achieve is something with the following idea:
listOfDataframes <- list(df1, df2, df3, ..., dfn)
for(i in 1:){
plot <- ggplot(listOfDataframes[i], aes(x = aData, y = bData)) + geom_point(color = "steelblue", shape = 19)
#if two plots have been ploted break to a new page.
}
Is this possible to achieve with ggplot in rmarkdown? I need to print out a PDF document.
If you just need to output plots with two per page, then I would use gridExtra as was suggested above. You could do something like this if you were to put your ggplot objects into a list.
library(ggplot2)
library(shinipsum) # Just used to create random ggplot objects.
library(purrr)
library(gridExtra)
# Create some random ggplot objects.
ggplot_objects <- list(random_ggplot("line"), random_ggplot("line"))
# Create a list of names for the plots.
ggplot_objects_names <- c("This is Graph 1", "This is Graph 2")
# Use map2 to pass the ggplot objects and the list of names to the the plot titles, so that you can change them.
ggplot_objects_new <-
purrr::map2(
.x = ggplot_objects,
.y = ggplot_objects_names,
.f = function(x, y) {
x + ggtitle(y)
}
)
# Arrange each ggplot object to be 2 per page. Use marrangeGrob so that you can save two ggplot objects per page.
ggplot_arranged <-
gridExtra::marrangeGrob(ggplot_objects_new, nrow = 2, ncol = 1)
# Save as one pdf. Use scale here in order for the multi-plots to fit on each page.
ggsave("ggplot_arranged.pdf",
ggplot_arranged, scale = 1.5)
If you have a list of dataframes that you are wanting to create ggplots for, then you can use purrr::map to do that. You could do something like this:
purrr::map(df_list, function(x) {
ggplot(data = x, aes(x = aData, y = bData)) +
geom_point(color = "steelblue", shape = 19)
})

Code to find the hull of a list of ROC curves (upper and lower limits of the set of curves)

I have made code that computes the two lines I am asking for in the question, as shown in the image below (desired lines are in red).
EDIT : This is the expected graph using my snippet to generate the ROC curves (atleast I'm pretty sure this is right) :
The problem is that said code is very very ugly (too long to even post here) and the process I came up with seems extremely tedious to me. Yet I can't seem to come up with anything better.
Here is a quick snippet to produce an input list of ROC curves
library(MASS)
library(dplyr)
simple_roc <- function(labels, scores){
labels <- labels[order(scores, decreasing=TRUE)]
return(rbind(c(0,0,0),data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels)))
}
diab_data=rbind(data.frame(Pima.tr),data.frame(Pima.te))
roc_curves_list_logisitic=list()
for (k in 1:100) {
#Set a fixed seed for reproducibility
set.seed(k)
# sampled_rows <- createDataPartition(diab_data$type, p = .7, list = FALSE)
sampled_rows <- sample(1:nrow(diab_data), size=floor(0.7*nrow(diab_data)))
diab_data_train=diab_data[sampled_rows,]
diab_data_test=diab_data[-sampled_rows,]
diab_data_train[,1:7]=scale(diab_data_train[,1:7])
diab_data_test[,1:7]=scale(diab_data_test[,1:7])
diab_data_train[,"type"]=as.numeric(as.character(recode_factor(diab_data_train[,"type"],`Yes` = "1", `No` = "0")))
diab_data_test[,"type"]=as.numeric(as.character(recode_factor(diab_data_test[,"type"],`Yes` = "1", `No` = "0")))
logistic_model_simple=glm(data=diab_data_train,as.formula(paste(colnames(diab_data_train)[8], "~",
paste(colnames(diab_data_train)[-8], collapse = "+"),
sep = "")),family=binomial(link = "logit"))
roc_curves_list_logisitic[[k]]=simple_roc(diab_data_test[,"type"],
ifelse(predict(logistic_model_simple,diab_data_test,type='response')>0.5,1,0))
}
I am now asking for help, in case anyone has a "beautiful" solution to produce the two red lines in this graph (in ggplot2) using the list of ROC curves I provided as input.
Preferably I would like to end up with two dataframes lower_bound_roc_curves and upper_bound_roc_curves containing the necessary values to plot the two lines seperately if I need them.
Thanks in advance,
EDIT 2 :#denis Here are some parts I think your code gets wrong :
I have a solution with data.table and zoo. The first step is to have a common FPR between all your curves. It is to be able to plot the maximum and the minimum of all curve. To do so:
library(data.table)
library(zoo)
FPRlist <- unique(rbindlist(lapply(roc_curves_list_logisitic,function(ROC){
rccurve <- as.data.table(ROC)
rccurve[,.(FPR = FPR)]
})))
I create a table FPRlist containing all the FPR existing in all your curves. I will after merge each curve with this table containing all FPR, and use na.locf to complete the missing values.
I use rbindlist to make one table, with an ID for each ROC curve
results <- rbindlist(lapply(seq(roc_curves_list_logisitic),function(idx){
rccurve <- as.data.table(roc_curves_list_logisitic[[idx]])
rccurve <- merge(FPRlist,rccurve,all = T)
rccurve[,TPR := na.locf(TPR,na.rm = F)] # I complete the values
rccurve[,ID := idx] # I create an ID
rccurve
}))
I then calculate the max and min across all ID (all ROC curve) for each FPR step
resultmax <- results[,.(TPR = max(TPR)),by = FPR]
resultmin <- results[,.(TPR = min(TPR)),by = FPR]
And plot it the same way you plot it
ggplot()+
geom_line(data = results,aes(FPR,TPR,color = as.factor(ID)))+
theme_light() %+replace% theme(legend.position = "none")+
geom_line(data = resultmax,aes(FPR,TPR),color = "red",size = 1)+
geom_line(data = resultmin,aes(FPR,TPR),color = "red",size = 1)
I let the dplyr translation to dplyr users, because I am not used to.
Edit
I modified my plot to make a comparison with the plot of just all raw ROC curves without any merge nor na.locf. One can see that the red lines I propose do follow the max and the min of all curves. The second plot is obtained as follow:
results2 <- rbindlist(lapply(seq(roc_curves_list_logisitic),function(idx){
rccurve <- as.data.table(roc_curves_list_logisitic[[idx]])
rccurve[,ID := idx] # I create an ID
rccurve
}))
p2 <- ggplot()+
geom_line(data = results2,aes(FPR,TPR,color = as.factor(ID)))+
theme_light() %+replace% theme(legend.position = "none")
It just plots all the ROC curves contained in the list provided in the OS question. The two column plot is obtained with multiplot function (see here)

Barplot with multiple columns in R

New to R and trying to figure out the barplot.
I am trying to create a barplot in R that displays data from 2 columns that are grouped by a third column.
DataFrame Name: SprintTotalHours
Columns with data:
OriginalEstimate,TimeSpent,Sprint
178,471.5,16.6.1
210,226,16.6.2
240,195,16.6.3
I want a barplot that shows the OriginalEstimate next to the TimeSpent for each sprint.
I tried this but I am not getting what I want:
colours = c("red","blue")
barplot(as.matrix(SprintTotalHours),main='Hours By Sprint',ylab='Hours', xlab='Sprint' ,beside = TRUE, col=colours)
abline(h=200)
I would like to use base graphics but if it can't be done then I am not opposed to installing a package if necessary.
Using base R :
DF <- read.csv(text=
"OriginalEstimate,TimeSpent,Sprint
178,471.5,16.6.1
210,226,16.6.2
240,195,16.6.3")
# prepare the matrix for barplot
# note that we exclude the 3rd column and we transpose the data
mx <- t(as.matrix(DF[-3]))
colnames(mx) <- DF$Sprint
colours = c("red","blue")
# note the use of ylim to give 30% space for the legend
barplot(mx,main='Hours By Sprint',ylab='Hours', xlab='Sprint',beside = TRUE,
col=colours, ylim=c(0,max(mx)*1.3))
# to add a box around the plot
box()
# add a legend
legend('topright',fill=colours,legend=c('OriginalEstimate','TimeSpent'))
cols <- c('red','blue');
ylim <- c(0,max(SprintTotalHours[c('OriginalEstimate','TimeSpent')])*1.8);
par(lwd=6);
barplot(
t(SprintTotalHours[c('OriginalEstimate','TimeSpent')]),
beside=T,
ylim=ylim,
border=cols,
col='white',
names.arg=SprintTotalHours$Sprint,
xlab='Sprint',
ylab='Hours',
legend.text=c('Estimated','TimeSpent'),
args.legend=list(text.col=cols,col=cols,border=cols,bty='n')
);
box();
Data
SprintTotalHours <- data.frame(OriginalEstimate=c(178L,210L,240L),TimeSpent=c(471.5,226,
195),Sprint=c('16.6.1','16.6.2','16.6.3'),stringsAsFactors=F);
You need to melt to long form so you can group. While you can do this in base R, not many people do, though there are a variety of package options (here tidyr). Again, ggplot2 gives you better results with less work, and is the way most people will end up plotting:
library(tidyr)
library(ggplot2)
ggplot(data = SprintTotalHours %>% gather(Variable, Hours, -Sprint),
aes(x = Sprint, y = Hours, fill = Variable)) +
geom_bar(stat = 'identity', position = 'dodge')
Use base R if you prefer, but this approach (more or less) is the conventional approach at this point.

How to plot deviation from mean

In R I have created a simple matrix of one column yielding a list of numbers with a set mean and a given standard deviation.
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
r <- rnorm2(100,4,1)
I now would like to plot how these numbers differ from the mean. I can do this in Excel as shown below:
But I would like to use ggplot2 to create a graph in R. in the Excel graph I have cheated by using a line graph but if I could do this as columns it would be better. I have tried using a scatter plot but I cant work out how to turn this into deviations from the mean.
Perhaps you want:
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
set.seed(101)
r <- rnorm2(100,4,1)
x <- seq_along(r) ## sets up a vector from 1 to length(r)
par(las=1,bty="l") ## cosmetic preferences
plot(x, r, col = "green", pch=16) ## draws the points
## if you don't want points at all, use
## plot(x, r, type="n")
## to set up the axes without drawing anything inside them
segments(x0=x, y0=4, x1=x, y1=r, col="green") ## connects them to the mean line
abline(h=4)
If you were plotting around 0 you could do this automatically with type="h":
plot(x,r-4,type="h", col="green")
To do this in ggplot2:
library("ggplot2")
theme_set(theme_bw()) ## my cosmetic preferences
ggplot(data.frame(x,r))+
geom_segment(aes(x=x,xend=x,y=mean(r),yend=r),colour="green")+
geom_hline(yintercept=mean(r))
Ben's answer using ggplot2 works great, but if you don't want to manually adjust the line width, you could do this:
# Half of Ben's data
rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
set.seed(101)
r <- rnorm2(50,4,1)
x <- seq_along(r) ## sets up a vector from 1 to length(r)
# New variable for the difference between each value and the mean
value <- r - mean(r)
ggplot(data.frame(x, value)) +
# geom_bar anchors each bar at zero (which is the mean minus the mean)
geom_bar(aes(x, value), stat = "identity"
, position = "dodge", fill = "green") +
# but you can change the y-axis labels with a function, to add the mean back on
scale_y_continuous(labels = function(x) {x + mean(r)})
in base R it's quite simple, just do
plot(r, col = "green", type = "l")
abline(4, 0)
You also tagged ggplot2, so in that case it will be a bit more complicated, because ggplot requires creating a data frame and then melting it.
library(ggplot2)
library(reshape2)
df <- melt(data.frame(x = 1:100, mean = 4, r = r), 1)
ggplot(df, aes(x, value, color = variable)) +
geom_line()

Resources