Create barplot from a data.frame, with comparing columns [duplicate] - r

New to R and trying to figure out the barplot.
I am trying to create a barplot in R that displays data from 2 columns that are grouped by a third column.
DataFrame Name: SprintTotalHours
Columns with data:
OriginalEstimate,TimeSpent,Sprint
178,471.5,16.6.1
210,226,16.6.2
240,195,16.6.3
I want a barplot that shows the OriginalEstimate next to the TimeSpent for each sprint.
I tried this but I am not getting what I want:
colours = c("red","blue")
barplot(as.matrix(SprintTotalHours),main='Hours By Sprint',ylab='Hours', xlab='Sprint' ,beside = TRUE, col=colours)
abline(h=200)
I would like to use base graphics but if it can't be done then I am not opposed to installing a package if necessary.

Using base R:
DF <- read.csv(text=
"OriginalEstimate,TimeSpent,Sprint
178,471.5,16.6.1
210,226,16.6.2
240,195,16.6.3")
# prepare the matrix for barplot
# note that we exclude the 3rd column and we transpose the data
mx <- t(as.matrix(DF[-3]))
colnames(mx) <- DF$Sprint
colours = c("red","blue")
# note the use of ylim to give 30% space for the legend
barplot(mx,main='Hours By Sprint',ylab='Hours', xlab='Sprint',beside = TRUE,
col=colours, ylim=c(0,max(mx)*1.3))
# to add a box around the plot
box()
# add a legend
legend('topright',fill=colours,legend=c('OriginalEstimate','TimeSpent'))

cols <- c('red','blue');
ylim <- c(0,max(SprintTotalHours[c('OriginalEstimate','TimeSpent')])*1.8);
par(lwd=6);
barplot(
t(SprintTotalHours[c('OriginalEstimate','TimeSpent')]),
beside=T,
ylim=ylim,
border=cols,
col='white',
names.arg=SprintTotalHours$Sprint,
xlab='Sprint',
ylab='Hours',
legend.text=c('Estimated','TimeSpent'),
args.legend=list(text.col=cols,col=cols,border=cols,bty='n')
);
box();
Data
SprintTotalHours <- data.frame(OriginalEstimate=c(178L,210L,240L),TimeSpent=c(471.5,226,
195),Sprint=c('16.6.1','16.6.2','16.6.3'),stringsAsFactors=F);

You need to reshape ("melt") the data to long form so that you can group by variable. You can do this in base R, but few people do; there are several package options (here tidyr). ggplot2 then gives you better results with less work, and it is how most people end up plotting:
library(tidyr)
library(ggplot2)
ggplot(data = SprintTotalHours %>% gather(Variable, Hours, -Sprint),
aes(x = Sprint, y = Hours, fill = Variable)) +
geom_bar(stat = 'identity', position = 'dodge')
Use base R if you prefer, but this is (more or less) the conventional approach at this point.
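For comparison, the reshaping step can also be done without any packages. The following is a minimal sketch, assuming the SprintTotalHours data frame from the question; the name `long` and the column names `Hours` and `Variable` are chosen here for illustration only. It uses base `stack()` to build the long form and `xtabs()` to turn it back into the matrix that `barplot()` expects.

```r
# Data from the question
SprintTotalHours <- data.frame(
  OriginalEstimate = c(178, 210, 240),
  TimeSpent        = c(471.5, 226, 195),
  Sprint           = c("16.6.1", "16.6.2", "16.6.3"),
  stringsAsFactors = FALSE
)

# stack() puts the two measure columns into 'values' and their names into 'ind';
# the Sprint labels are repeated once per stacked column
long <- cbind(
  Sprint = rep(SprintTotalHours$Sprint, times = 2),
  stack(SprintTotalHours[c("OriginalEstimate", "TimeSpent")])
)
names(long) <- c("Sprint", "Hours", "Variable")

# Cross-tabulate back to a Variable x Sprint table and plot grouped bars
hours_by_sprint <- xtabs(Hours ~ Variable + Sprint, data = long)
barplot(hours_by_sprint, beside = TRUE, col = c("red", "blue"),
        xlab = "Sprint", ylab = "Hours", legend.text = TRUE)
```

The long data frame has one row per sprint and variable, which is the same shape that `gather()` produces for ggplot2.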

Related

Plot multicolor vertical lines by using ggplot to show average time taken for each type as facet. Each type will have different vertical lines

I want to plot a chart in R that shows vertical lines for each type in a facet.
df is the data frame with the time in minutes that person X takes to get from A to B, and so on.
I have tried the code below but was not able to get the result.
df<-data.frame(type =c("X","Y","Z"), "A_to_B"= c(20,56,57), "B_to_C"= c(10,35,50), "C_to_D"= c(53,20,58))
ggplot(df, aes(x = 1,y = df$type)) + geom_line() + facet_grid(type~.)
I have attached an image from Excel showing the desired output, but I only need vertical lines where the segments join, not the entire horizontal bar.
I would not use facets in your case, because there are only 3 variables.
So, to get a similar plot in R using ggplot2, you first need to reshape the data frame into long (tidy) format using gather() from tidyr, which is part of the tidyverse.
To my knowledge, there is no geom that does what you want in standard ggplot2, so some fiddling is necessary.
However, it's possible to produce the plot using geom_segment() and cumsum():
library(tidyverse)
# First reshape and calculate cumulative sums by type.
# This works because the route column names begin with A, B, C
# and are thus ordered correctly.
df <- df %>%
gather(-type, key = "route", value = "time") %>%
group_by(type) %>%
mutate(cummulative_time = cumsum(time))
segment_length <- 0.2
df %>%
mutate(route = fct_rev(route)) %>%
ggplot(aes(color = route)) +
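# factor() keeps this working when type is a character column (the data.frame default since R 4.0)
geom_segment(aes(x = as.numeric(factor(type)) + segment_length, xend = as.numeric(factor(type)) - segment_length, y = cummulative_time, yend = cummulative_time)) +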
scale_x_discrete(limits=c("1","2","3"), labels=c("Z", "Y","X"))+
coord_flip() +
ylim(0,max(df$cummulative_time)) +
labs(x = "type")
EDIT
This solution works because it assigns the labels X, Y, Z manually in scale_x_discrete. Be careful to assign the correct labels!

R plot two series of means with 95% confidence intervals

I am trying to plot the following data
factor <- as.factor(c(1,2,3))
V1_mean <- c(100,200,300)
V2_mean <- c(350,150,60)
V1_stderr <- c(5,9,3)
V2_stderr <- c(12,9,10)
plot <- data.frame(factor,V1_mean,V2_mean,V1_stderr,V2_stderr)
I want to create a plot with factor on the x-axis, value on the y-axis and separate lines for V1 and V2 (so the points are the values of V1_mean on one line and V2_mean on the other). I would also like to add error bars for these means based on V1_stderr and V2_stderr.
Many thanks
I'm not sure regarding your desired output, but here's a possible solution.
First of all, I wouldn't call your data plot, as that is the name of a commonly used built-in R function.
Second, to plot two lines in ggplot you'll usually have to tidy your data using a function such as melt (from the reshape2 package) or gather (from the tidyr package).
Here's a possible approach:
library(ggplot2)
library(reshape2)
dat <- data.frame(factor, V1_mean, V2_mean, V1_stderr, V2_stderr)
mdat <- cbind(melt(dat[1:3], "factor"), melt(dat[c(1, 4:5)], "factor"))
names(mdat) <- make.names(names(mdat), unique = TRUE)
ggplot(mdat, aes(factor, value, color = variable)) +
geom_line(aes(group = variable)) + # You can also add `geom_point(aes(group = variable)) + ` if you want to see the actual points
geom_errorbar(aes(ymin = value - value.1, ymax = value + value.1))
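Note that the error bars above span one standard error on each side, whereas the question title asks for 95% confidence intervals. A minimal sketch of the interval bounds follows, assuming a normal approximation (so the multiplier is `qnorm(0.975)`, about 1.96); the names `grp`, `z` and `ci` are illustrative and not from the original post.

```r
grp       <- factor(c(1, 2, 3))
V1_mean   <- c(100, 200, 300)
V2_mean   <- c(350, 150, 60)
V1_stderr <- c(5, 9, 3)
V2_stderr <- c(12, 9, 10)

# Two-sided 95% multiplier under a normal approximation (assumption)
z <- qnorm(0.975)

# Long format: one row per factor level and series
ci <- data.frame(
  factor   = rep(grp, times = 2),
  variable = rep(c("V1", "V2"), each = 3),
  mean     = c(V1_mean, V2_mean),
  stderr   = c(V1_stderr, V2_stderr)
)
ci$lower <- ci$mean - z * ci$stderr
ci$upper <- ci$mean + z * ci$stderr
```

With these columns, the error bar layer would map `ymin = lower` and `ymax = upper` instead of adding and subtracting the raw standard error. For small samples a t-based multiplier may be more appropriate.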

Plotting Several Grouped Bar Plots in Loop [R]

My challenge is to plot several bar plots at once, one for each variable in several different subsets. My goal is to compare regional differences for each variable. I would like to print all the resulting plots to an HTML file via R Markdown.
My main difficulty in making automatic grouped bar charts is that you need to tabulate the groups using table(data$Var[i], data$Region), but I don't know how to do this automatically. I would highly appreciate a hint on this.
Here is an example of what one of my subsets looks like:
# To Create this example of data:
b <- rep(matrix(c(1,2,3,2,1,3,1,1,1,1)), times=10)
data <- matrix(b, ncol=10)
colnames(data) <- paste("Var", 1:10, sep = "")
data <- as.data.frame(data)
reg_name <- c("North", "South")
Region <- rep(reg_name, 5)
data <- cbind(data,Region)
Using beside = TRUE, I was able to create one grouped bar plot (grouped by Region for Var1 from data):
tb <- table(data$Var1,data$Region)
barplot(tb, main="Var1", xlab="Values", legend=rownames(tb), beside=TRUE,
col=c("green", "darkblue", "red"))
I would like to loop this process to generate for example 10 plots for Var1 to Var10:
for(i in 1:10){
tb <- table(data[i], data$Region)
barplot(tb, main = i, xlab = "Values", legend = rownames(tb), beside = TRUE,
col=c("green", "darkblue", "red"))
}
R prefers the apply family of functions, so I tried to create a function to apply:
fct <- function(i) {
tb <- table(data[i], data$Region)
barplot(tb, main=i, xlab="Values", legend = rownames(tb), beside = TRUE,
col=c("green", "darkblue", "red"))
}
sapply(data, fct)
I have tried other ways, but I was never successful. Maybe lattice or ggplot2 would offer easier way to do this. I am just starting in R, I will gladly accept any tips and suggestions. Thank you!
(I run on Windows, with the most recent R v3.1.2 "Pumpkin Helmet")
Given that you say "My goal is to compare regional differences for each variable", I'm not sure you've chosen the optimal plotting strategy. But yes, it is possible to do what you are asking.
Here's the default plot you get with your code above, for reference:
If you want a list with 10 plots, one for each variable, you can do the following (with ggplot2; dat here is the data frame called data in the question):
library(ggplot2)
dat <- data
many_plots <-
# for each column name in dat (except the last one)...
lapply(names(dat)[-ncol(dat)], function(x) {
this_dat <- dat[, c(x, 'Region')]
names(this_dat)[1] <- 'Var'
# binwidth dropped: geom_bar() counts rows per value and does not bin
ggplot(this_dat, aes(x=Var, fill=factor(Var))) +
geom_bar() + facet_grid(~Region) +
theme_classic()
})
Sample output, for many_plots[[1]]:
If you want all the plots in one image, you can do this (using reshape2 and data.table):
library(data.table)
library(reshape2)
dat2 <-
data.table(melt(dat, id.var='Region'))[, .N, by=list(value, variable, Region)]
ggplot(dat2, aes(y=N, x=value, fill=factor(value))) +
geom_bar(stat='identity') + facet_grid(variable~Region) +
theme_classic()
...but that's not a great plot.
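For the base-graphics loop attempted in the question, the tabulation step can be automated by extracting each column as a vector with `data[[v]]` (rather than `data[i]`, which returns a one-column data frame). The following is a minimal sketch using the question's example data; `vars` and `tables` are names introduced here for illustration.

```r
# Example data, built as in the question
b <- rep(c(1, 2, 3, 2, 1, 3, 1, 1, 1, 1), times = 10)
data <- as.data.frame(matrix(b, ncol = 10))
colnames(data) <- paste0("Var", 1:10)
data$Region <- rep(c("North", "South"), 5)

# Every column except the grouping column
vars <- setdiff(names(data), "Region")

# One contingency table (value x Region) per variable
tables <- lapply(vars, function(v) table(data[[v]], data$Region))
names(tables) <- vars

# One grouped bar plot per variable, titled with the variable name
for (v in vars) {
  barplot(tables[[v]], main = v, xlab = "Region", beside = TRUE,
          col = c("green", "darkblue", "red"),
          legend.text = rownames(tables[[v]]))
}
```

Looping over column names rather than positions also gives each plot a readable title.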

How to add titles from a list to a series of histograms?

Is it possible to attach titles via an apply-family function to a series of histograms where the titles are picked from a list?
The code below creates three histograms. I want to give each a different name from a list (named "list"), in order. How do I do this?
data <- read.csv("outcome-of-care-measures.csv")
outcome <-data[,c(11,17,23)]
out <- apply(outcome, 2,as.data.frame)
par(mfrow=c(3,1))
apply(outcome,2, hist, xlim=range(out,na.rm=TRUE), xlab="30 day death rate")
I'd use facet_wrap from ggplot2 for this. ggplot2 supports this kind of plot painlessly:
library(ggplot2)
theme_set(theme_bw())
df = data.frame(values = rnorm(3000), ID = rep(LETTERS[1:3], each = 1000))
ggplot(df, aes(x = values)) + geom_histogram() + facet_wrap(~ ID)
To change the text in the box above each facet, simply replace the text in the ID variable:
id_to_text_translator = c(A = 'Facet text for A',
B = 'Facet text for B',
C = 'Facet text for C')
df$ID = id_to_text_translator[df$ID]
I would recommend taking a close look at what happens in these two lines. Using vectorized subsetting to perform this kind of replacement has compact syntax and high performance. Replacing it with a for or apply loop combined with a set of if statements would make the code much longer and slower.
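Another option is to change the levels of ID directly, provided it is a factor (since R 4.0, data.frame() creates character columns by default, so convert it with factor() first):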
levels(df$ID) = id_to_text_translator[levels(df$ID)]
This is a lot faster, especially on datasets with a lot of rows. In addition, it keeps ID a factor, while the previous solution makes ID a character vector.
The resulting plot:
ggplot(df, aes(x = values)) + geom_histogram() + facet_wrap(~ ID)
As it is not only the columns but also another argument that changes, you can use mapply to stay within the apply family.
# outcome holds the three columns selected in the question
args <- list(xlim=range(outcome, na.rm=TRUE), xlab="30 day death rate")
titles <- list("Title 1", "Title 2", "Title 3")
par(mfrow=c(3,1))
mapply(hist, outcome, main=titles, MoreArgs=args)
If you wrap the invisible function around the last line you can avoid the console output.
Note: I think using loops here is far more straightforward.
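A loop version might look like the sketch below. The CSV from the question is not available here, so `outcome` is a simulated stand-in with three numeric columns (an assumption, as are the column names `a`, `b`, `c`); `titles` plays the role of the list of names.

```r
set.seed(1)
# Simulated stand-in for the three outcome columns (not the real data)
outcome <- data.frame(
  a = rnorm(100, mean = 15, sd = 2),
  b = rnorm(100, mean = 12, sd = 2),
  c = rnorm(100, mean = 11, sd = 2)
)
titles <- c("Title 1", "Title 2", "Title 3")

# Shared x-axis range so the three histograms are comparable
xlim <- range(outcome, na.rm = TRUE)

old_par <- par(mfrow = c(3, 1))  # three stacked panels
for (i in seq_along(outcome)) {
  hist(outcome[[i]], main = titles[i], xlim = xlim,
       xlab = "30 day death rate")
}
par(old_par)  # restore the previous layout
```

Indexing the columns and the titles with the same `i` keeps them in step, and the loop prints nothing to the console.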
