How to better create stacked bar graphs with multiple variables from ggplot2? - r

I often have to make stacked barplots to compare variables, and because I do all my stats in R, I prefer to do all my graphics in R with ggplot2. I would like to learn how to do two things:
First, I would like to be able to add proper percentage tick marks for each variable rather than tick marks by count. Counts would be confusing, which is why I take out the axis labels completely.
Second, there must be a simpler way to reorganize my data to make this happen. It seems like the sort of thing I should be able to do natively in ggplot2 with plyR, but the documentation for plyR is not very clear (and I have read both the ggplot2 book and the online plyR documentation).
The R code I use to produce my best graph so far is the following:
library(epicalc)
### recode the variables to factors ###
recode(c(int_newcoun, int_newneigh, int_neweur, int_newusa, int_neweco, int_newit, int_newen, int_newsp, int_newhr, int_newlit, int_newent, int_newrel, int_newhth, int_bapo, int_wopo, int_eupo, int_educ), c(1,2,3,4,5,6,7,8,9, NA),
c('Very Interested','Somewhat Interested','Not Very Interested','Not At All interested',NA,NA,NA,NA,NA,NA))
### Combine recoded variables to a common vector
Interest1<-c(int_newcoun, int_newneigh, int_neweur, int_newusa, int_neweco, int_newit, int_newen, int_newsp, int_newhr, int_newlit, int_newent, int_newrel, int_newhth, int_bapo, int_wopo, int_eupo, int_educ)
### Create a second vector to label the first vector by original variable ###
a1<-rep("News about Bangladesh", length(int_newcoun))
a2<-rep("Neighboring Countries", length(int_newneigh))
[...]
a17<-rep("Education", length(int_educ))
Interest2<-c(a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, a16, a17)
### Create a Weighting vector of the proper length ###
Interest.weight<-rep(weight, 17)
### Make and save a new data frame from the three vectors ###
Interest.df<-cbind(Interest1, Interest2, Interest.weight)
Interest.df<-as.data.frame(Interest.df)
write.csv(Interest.df, 'C:\\Documents and Settings\\[name]\\Desktop\\Sweave\\InterestBangladesh.csv')
### Sort the factor levels to display properly ###
Interest.df$Interest1<-relevel(Interest$Interest1, ref='Not Very Interested')
Interest.df$Interest1<-relevel(Interest$Interest1, ref='Somewhat Interested')
Interest.df$Interest1<-relevel(Interest$Interest1, ref='Very Interested')
Interest.df$Interest2<-relevel(Interest$Interest2, ref='News about Bangladesh')
Interest.df$Interest2<-relevel(Interest$Interest2, ref='Education')
[...]
Interest.df$Interest2<-relevel(Interest$Interest2, ref='European Politics')
detach(Interest)
attach(Interest)
### Finally create the graph in ggplot2 ###
library(ggplot2)
p<-ggplot(Interest, aes(Interest2, ..count..))
p<-p+geom_bar((aes(weight=Interest.weight, fill=Interest1)))
p<-p+coord_flip()
p<-p+scale_y_continuous("", breaks=NA)
p<-p+scale_fill_manual(value = rev(brewer.pal(5, "Purples")))
p
update_labels(p, list(fill='', x='', y=''))
I'd very much appreciate any tips, tricks or hints.

Your second problem can be solved with melt and cast from the reshape package
After you've converted the columns of your data.frame (called your.df here) to factors, you can use something like:
install.packages("reshape")
library(reshape)
x <- melt(your.df, c()) ## Assume you have some kind of data.frame of all factors
x <- na.omit(x) ## Be careful, sometimes removing NA can mess with your frequency calculations
x <- cast(x, variable + value ~., length)
colnames(x) <- c("variable","value","freq")
## Presto!
ggplot(x, aes(variable, freq, fill = value)) + geom_bar(position = "fill") + coord_flip() + scale_y_continuous("", formatter="percent")
As an aside, I like to use grep to pull in columns from a messy import. For example:
x <- your.df[, grep("^int_", names(your.df))] ## pulls all columns starting with "int_"
And factoring is easier when you don't have to type c(' ', ...) a million times.
## x is the data.frame of "int_" columns selected above; i is the column index
for (i in seq_len(ncol(x))) {
x[,i] <- factor(x[,i], labels = strsplit('
Very Interested
Somewhat Interested
Not Very Interested
Not At All interested
NA
NA
NA
NA
NA
NA
', '\n')[[1]][-1])
}
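As far as I know, the formatter argument used above is no longer available in current ggplot2 releases. Here is a minimal sketch of the same idea (reshape to long format, count, then draw 100% stacked bars) using base R for the reshaping. The survey data frame and its column names are made up for illustration, and labels = scales::percent is assumed to be the current way to get a percent axis.

```r
library(ggplot2)

# Hypothetical survey data: two questions answered on the same scale
lv <- c("Very Interested", "Somewhat Interested", "Not Very Interested")
survey <- data.frame(
  int_news = factor(c("Very Interested", "Somewhat Interested",
                      "Very Interested", "Not Very Interested"), levels = lv),
  int_educ = factor(c("Somewhat Interested", "Somewhat Interested",
                      "Not Very Interested", "Very Interested"), levels = lv)
)

# Long format: one row per (question, answer) pair
long <- data.frame(
  variable = rep(names(survey), each = nrow(survey)),
  value    = factor(unlist(lapply(survey, as.character), use.names = FALSE),
                    levels = lv)
)

# Frequency of each answer within each question
freq <- as.data.frame(table(variable = long$variable, value = long$value))

# position = "fill" rescales every bar to 100%
p <- ggplot(freq, aes(x = variable, y = Freq, fill = value)) +
  geom_col(position = "fill") +
  coord_flip() +
  scale_y_continuous(NULL, labels = scales::percent) +
  labs(x = NULL, fill = NULL)
p
```

For weighted survey data, something like xtabs(weight ~ variable + value) could replace the table() call, assuming a weight column has been carried along in the long data frame.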

You don't need prop.tables or count etc to do the 100% stacked bars. You just need +geom_bar(position="fill")

About percentages instead of ..count.. , try:
ggplot(mtcars, aes(factor(cyl), prop.table(..count..) * 100)) + geom_bar()
but since it's not a good idea to shove a function into aes(), you can write a custom function to create percentages out of ..count.. , round them to n decimals etc.
You labeled this post with plyr, but I don't see any plyr in action here, and I bet that one ddply() can do the job. Online plyr documentation should suffice.

If I am understanding you correctly, to fix the axis labeling problem make the following change:
# p<-ggplot(Interest, aes(Interest2, ..count..))
p<-ggplot(Interest, aes(Interest2, ..density..))
As for the second one, I think you would be better off working with the reshape package. You can use it to aggregate data into groups very easily.
In reference to aL3xa's comment below...
library(ggplot2)
r<-rnorm(1000)
d<-as.data.frame(cbind(r,1:1000))
ggplot(d,aes(r,..density..))+geom_bar()
Returns...
(image: http://www.drewconway.com/zia/wp-content/uploads/2010/04/density.png)
The bins are now densities...

Your first question: Would this help?
geom_bar(aes(y=..count../sum(..count..)))
Your second question; could you use reorder to sort the bars? Something like
aes(reorder(Interest, Value, mean), Value)
(just back from a seven hour drive - am tired - but I guess it should work)
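For reference, a small sketch of the same proportion trick in the newer ggplot2 notation, where after_stat(count) replaces ..count.. (this assumes ggplot2 3.3.0 or later). It uses the built-in mtcars data rather than the survey data from the question.

```r
library(ggplot2)

# Each bar shows the share of all rows falling into that cylinder class
p <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous("Share of cars", labels = scales::percent) +
  xlab("Cylinders")
p
```

Because the division happens after the counts are computed, the bar heights sum to 1 across the whole plot, and the axis is labelled in percent.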

Related

Unexpected behaviour when re-ordering facets (for ggplot2)

I'd like some help understanding an error so that it can't happen again.
I was producing some (gg) plots and wanted to change the order of facets for aesthetic reasons. The way I did this had unexpected consequences and almost slipped through the net when I was checking the results - it could have caused serious problems with the article I'm working on!
I wanted to re-order the facets based on a numerical vector that I could define up-front
E.g. facet_order=c(1,2,4,3). This was so the graph syntax could be copied / pasted for repeat graphs more easily and I wouldn't have to dig around too much in the code each time.
# some example data:
df <- data.frame(x=c(1,2,3,4), y=c(1,2,3,4), facet_var=factor(c('A','B','C','D')))
# First plot (facet order defined by default):
ggplot(df, aes(x,y))+geom_point()+facet_wrap(~facet_var, nrow = 1)+labs(title='Original data')
In the second plot, facets 'C' and 'D' are swapped as intended:
# reorder facets (normal method)
# Set the facet variable as a factor, to define the order
df$facet_var2 <- factor(df$facet_var, levels=c('A','B','D','C'))
# Second plot:
ggplot(df, aes(x,y))+geom_point()+facet_wrap(~facet_var2, nrow = 1)+labs(title='Re-ordered facets', subtitle='working as expected')
However, this is the mistake I made:
# different syntax to reorder the facets
df$facet_var3 <- df$facet_var # duplicate the faceting variable
# I thought I was just re-ordering the levels here
levels(df$facet_var3) <- levels(df$facet_var3)[c(1,2,4,3)]
# Third plot:
ggplot(df, aes(x,y))+geom_point()+facet_wrap(~facet_var3, nrow = 1)+labs(title='Re-ordered facets (method 2)',subtitle='Unexpected behaviour')
In the third graph, it looks like the data doesn't move, but the facet labels do, which is obviously wrong.
Digging a bit deeper, it appears that my syntax changed not only the order of the factor, but actually the underlying data in the factor variable. Is this behaviour expected?
Here's the crux of it:
facet_order <- c(1,2,4,3)
levels(df$facet_var) <- levels(df$facet_var)[facet_order] # bad
df$facet_var <- factor(df$facet_var, levels=levels(df$facet_var)[facet_order]) # good
Obviously I now know the solution but I'm still unclear what I actually did wrong here. Any pointers?
Quick and dirty: reorder the levels on the fly with fct_relevel from {forcats} (part of the tidyverse):
library(forcats)
ggplot(df, aes(x,y)) +
geom_point() +
facet_wrap(~ fct_relevel(facet_var, 'B','A','D','C'),
nrow = 1)
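To see what the "bad" assignment actually does, here is a small base R sketch. A factor stores integer codes plus a vector of level labels. Assigning to levels() replaces the labels while leaving the codes alone, so every observation is effectively renamed. Calling factor(..., levels = ) instead keeps each observation's label and only changes the order of the levels. The variable names below are made up for the illustration.

```r
f <- factor(c("A", "B", "C", "D"))

# "Bad": overwrite the labels; the integer codes 1,2,3,4 stay put,
# so the third observation is now called "D" and the fourth "C"
renamed <- f
levels(renamed) <- levels(renamed)[c(1, 2, 4, 3)]

# "Good": rebuild the factor with a new level order; the data keep their labels
reordered <- factor(f, levels = levels(f)[c(1, 2, 4, 3)])

as.character(renamed)    # "A" "B" "D" "C"  (data changed)
as.character(reordered)  # "A" "B" "C" "D"  (data intact)
levels(reordered)        # "A" "B" "D" "C"  (only the order changed)
```

This is why the points stayed where they were in the third plot while the facet labels moved: the labels were swapped underneath the data.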

Apply ggplot2 across columns

I am working with a dataframe with many columns and would like to produce certain plots of the data using ggplot2, namely, boxplots, histograms, density plots. I would like to do this by writing a single function that applies across all attributes (columns), producing one boxplot (or histogram etc) and then storing that as a given element of a list into which all the boxplots will be chained, so I could later index it by number (or by column name) in order to return the plot for a given attribute.
The issue I have is that, if I try to apply across columns with something like apply(df,2,boxPlot), I have to define boxPlot as a function that takes just a vector x. And when I do so, the attribute/column name and index are no longer retained. So e.g. in the code for producing a boxplot, like
bp <- ggplot(df, aes(x=Group, y=Attr, fill=Group)) +
geom_boxplot() +
labs(title="Plot of length per dose", x="Group", y =paste(Attr)) +
theme_classic()
the function has no idea how to extract the info necessary for Attr from just vector x (as this is just the column data and doesn't carry the column name or index).
(Note the x-axis is a factor variable called 'Group', which has 6 levels A,B,C,D,E,F, within X.)
Can anyone help with a good way of automating this procedure? (Ideally it should work for all types of ggplots; the problem here seems to simply be how to refer to the attribute name, within the ggplot function, in a way that can be applied / automatically replicated across the columns.) A for-loop would be acceptable, I guess, but if there's a more efficient/better way to do it in R then I'd prefer that!
Edit: something like what would be achieved by the top answer to this question: apply box plots to multiple variables. Except that in that answer, with his code you would still need a for-loop to change the indices on y=y[2] in the ggplot code and get all the boxplots. He's also expanded-grid to include different x possibilities (I have only one, the Group factor), but it would be easy to simplify down if the looping problem could be handled.
I'd also prefer just base R if possible--dplyr if absolutely necessary.
Here's an example of iterating over all columns of a data frame to produce a list of plots, while retaining the column name in the ggplot axis label:
library(tidyverse)
plots <-
imap(select(mtcars, -cyl), ~ {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
})
plots$mpg
You can also do this without purrr and dplyr
to_plot <- setdiff(names(mtcars), 'cyl')
plots <-
Map(function(.x, .y) {
ggplot(mtcars, aes(x = cyl, y = .x)) +
geom_point() +
ylab(.y)
}, mtcars[to_plot], to_plot)
plots$mpg
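Another way to keep the column name available is to iterate over the names rather than the columns, and look each column up inside aes() with the .data pronoun. The sketch below applies that to the boxplot case from the question. The data frame, its column names (len, width) and the helper make_boxplot are invented for illustration, and .data[[...]] assumes ggplot2 3.0.0 or later.

```r
library(ggplot2)

# Hypothetical data: a 6-level Group factor plus two numeric attributes
set.seed(1)
df <- data.frame(
  Group = factor(rep(LETTERS[1:6], each = 5)),
  len   = rnorm(30),
  width = rnorm(30)
)

# Build one boxplot for the column called `col_name`
make_boxplot <- function(col_name, data, group = "Group") {
  ggplot(data, aes(x = .data[[group]], y = .data[[col_name]],
                   fill = .data[[group]])) +
    geom_boxplot() +
    labs(title = paste("Boxplot of", col_name),
         x = group, y = col_name, fill = group) +
    theme_classic()
}

# Named list of plots, one per attribute column
attrs <- setdiff(names(df), "Group")
plots <- lapply(setNames(attrs, attrs), make_boxplot, data = df)

plots$len       # index by name ...
plots[[2]]      # ... or by position
```

The same helper should work for histograms or density plots by swapping the geom, since only the column name is passed around.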

Remove unused facet combinations in 2-way facet_grid

I have two factors and two continuous variables, and I use this to create a two-way facet plot using ggplot2. However, not all of my factor combinations have data, so I end up with dummy facets. Here's some dummy code to produce an equivalent output:
library(ggplot2)
dummy<-data.frame(x=rnorm(60),y=rnorm(60),
col=rep(c("A","B","C","B","C","C"),each=10),
row=rep(c("a","a","a","b","b","c"),each=10))
ggplot(data=dummy,aes(x=x,y=y))+
geom_point()+
facet_grid(row~col)
This produces this figure
Is there any way to remove the facets that don't plot any data? And, ideally, move the x and y axis labels up or right to the remaining plots? As shown in this GIMPed version
I've searched here and elsewhere and unless my search terms just aren't good enough, I can't find the same problem anywhere. Similar issues are often with unused factor levels, but here no factor level is unused, just factor level combinations. So facet_grid(drop=TRUE) or ggplot(data=droplevels(dummy)) doesn't help here. Combining the factors into a single factor and dropping unused levels of the new factor can only produce a 1-dimensional facet grid, which isn't what I want.
Note: my actual data has a third factor level which I represent by different point colours. Thus a single-plot solution allowing me to retain a legend would be ideal.
It's not too difficult to rearrange the graphical objects (grobs) manually to achieve what you're after.
Load the necessary libraries.
library(grid);
library(gtable);
Turn your ggplot2 plot into a grob.
gg <- ggplot(data = dummy, aes(x = x,y = y)) +
geom_point() +
facet_grid(row ~ col);
grob <- ggplotGrob(gg);
Working out which facets to remove, and which axes to move where depends on the grid-structure of your grob. gtable_show_layout(grob) gives a visual representation of your grid structure, where numbers like (7, 4) denote a panel in row 7 and column 4.
Remove the empty facets.
# Remove facets
idx <- which(grob$layout$name %in% c("panel-2-1", "panel-3-1", "panel-3-2"));
for (i in idx) grob$grobs[[i]] <- nullGrob();
Move the x axes up.
# Move x axes up
# axis-b-1 needs to move up 4 rows
# axis-b-2 needs to move up 2 rows
idx <- which(grob$layout$name %in% c("axis-b-1", "axis-b-2"));
grob$layout[idx, c("t", "b")] <- grob$layout[idx, c("t", "b")] - c(4, 2);
Move the y axes to the right.
# Move y axes right
# axis-l-2 needs to move 2 columns to the right
# axis-l-3 needs to move 4 columns to the right
idx <- which(grob$layout$name %in% c("axis-l-2", "axis-l-3"));
grob$layout[idx, c("l", "r")] <- grob$layout[idx, c("l", "r")] + c(2, 4);
Plot.
# Plot
grid.newpage();
grid.draw(grob);
Extending this to more facets is straightforward.
Maurits Evers's solution worked great, but is quite cumbersome to modify.
An alternative solution is to use facet_manual from {ggh4x}.
This is not equivalent though as it uses facet_wrap, but allows appropriate placement of the facets.
# devtools::install_github("teunbrand/ggh4x")
library(ggplot2)
dummy<-data.frame(x=rnorm(60),y=rnorm(60),
col=rep(c("A","B","C","B","C","C"),each=10),
row=rep(c("a","a","a","b","b","c"),each=10))
design <- "
ABC
#DE
##F
"
ggplot(data=dummy,aes(x=x,y=y))+
geom_point()+
ggh4x::facet_manual(vars(row,col), design = design, labeller = label_both)
Created on 2022-02-25 by the reprex package (v2.0.0)
One possible solution, of course, would be to create a plot for each factor combination separately and then combine them using grid.arrange() from gridExtra. This would probably lose my legend and would be an all-around pain. I would love to hear if anyone has any better suggestions.
This particular case looks like a job for ggpairs (link to a SO example). I haven't used it myself, but for paired plots this seems like the best tool for the job.
In a more general case, where you're not looking for pairs, you could try creating a column with a composite (pasted) factor and facet_grid or facet_wrap by that variable (example)
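A minimal sketch of the composite-factor idea, reusing the dummy data from the question. interaction() with drop = TRUE builds one factor from the two and discards the combinations that never occur. The column name combo is made up here. As the question notes, this gives a wrapped one-dimensional layout rather than a true two-way grid.

```r
library(ggplot2)

set.seed(42)
dummy <- data.frame(x = rnorm(60), y = rnorm(60),
                    col = rep(c("A", "B", "C", "B", "C", "C"), each = 10),
                    row = rep(c("a", "a", "a", "b", "b", "c"), each = 10))

# One factor per observed row/col combination; unused combinations are dropped
dummy$combo <- interaction(dummy$row, dummy$col, sep = " / ", drop = TRUE)

p <- ggplot(dummy, aes(x, y)) +
  geom_point() +
  facet_wrap(~ combo)
p
```

Because this is still a single ggplot, a colour aesthetic for a third factor would keep its legend.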

R: Plot multiple box plots using columns from data frame

I would like to plot an INDIVIDUAL box plot for each unrelated column in a data frame. I thought I was on the right track with boxplot.matrix from the sfsmisc package, but it seems to do the same as boxplot(as.matrix(plotdata)), which is to plot everything in a shared boxplot with a shared scale on the axis. I want (say) 5 individual plots.
I could do this by hand like:
par(mfrow=c(2,2))
boxplot(data$var1)
boxplot(data$var2)
boxplot(data$var3)
boxplot(data$var4)
But there must be a way to use the data frame columns?
EDIT: I used iterations, see my answer.
You could use the reshape package to simplify things
data <- data.frame(v1=rnorm(100),v2=rnorm(100),v3=rnorm(100), v4=rnorm(100))
library(reshape)
meltData <- melt(data)
boxplot(data=meltData, value~variable)
or even then use ggplot2 package to make things nicer
library(ggplot2)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales="free")
From ?boxplot we see that we have the option to pass multiple vectors of data as elements of a list, and we will get multiple boxplots, one for each vector in our list.
So all we need to do is convert the columns of our matrix to a list:
m <- matrix(1:25,5,5)
boxplot(x = as.list(as.data.frame(m)))
If you really want separate panels each with a single boxplot (although, frankly, I don't see why you would want to do that), I would instead turn to ggplot and faceting:
library(reshape) # provides melt()
m1 <- melt(as.data.frame(m))
library(ggplot2)
ggplot(m1,aes(x = variable,y = value)) + facet_wrap(~variable) + geom_boxplot()
I used iteration to do this. I think perhaps I wasn't clear in the original question. Thanks for the responses nonetheless.
par(mfrow=c(2,5))
for (i in 1:length(plotdata)) {
boxplot(plotdata[,i], main=names(plotdata[i]), type="l")
}
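A slightly more general sketch of the same loop, in base R only. It iterates over column names so the title comes for free, and lets n2mfrow() from grDevices pick a panel layout for however many columns there are. The plotdata data frame here is made-up example data.

```r
# Hypothetical data frame with four unrelated numeric columns
set.seed(1)
plotdata <- data.frame(v1 = rnorm(100), v2 = runif(100),
                       v3 = rexp(100),  v4 = rnorm(100, mean = 50))

# Choose a rows-by-columns layout that fits all columns, remembering old settings
layout_dims <- n2mfrow(ncol(plotdata))
op <- par(mfrow = layout_dims)

# One boxplot per column, each with its own scale and title
for (nm in names(plotdata)) {
  boxplot(plotdata[[nm]], main = nm)
}

par(op)  # restore the previous graphics settings
```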

Simple analog for plotting a line from a table object in ggplot2

I have been unable to find a simple analog for plotting a line graph from a table object in ggplot2. Given the elegance and utility of the package, I feel I must be missing something quite obvious. As an illustration consider a data frame with yearly observations:
dat<-data.frame(year=sample(c("2001":"2010"),1000, replace=T))
And a quick time series plot in base R:
plot(table(dat$year), type="l")
Switching to qplot, returns the error "attempt to apply a non-function":
qplot(table(dat$year), geom="line")
ggplot2 requires a data frame. Fair enough. But this returns the same error.
qplot(year, data=dat, geom="line")
After some searching and fiddling, I abandoned qplot, and came up with the following approach which involves specifying a line geometry, binning the counts, and dropping final values to avoid plotting zeros.
ggplot(dat, aes(year) ) + geom_line(stat = "bin", binwidth=1, drop=TRUE)
It seems like rather a long walk around the block. And it is still not entirely satisfactory, since the bins don't align precisely with the mid-year values on the x-axis. Where have I gone wrong?
Maybe still more complicated than you want, but:
qplot(Var1,Freq,data=as.data.frame(table(dat$year)),geom="line",group=1)
(the group=1 is necessary because the Year variable (Var1) is returned as a factor ...)
If you didn't need it as a one-liner you could use ytab <- as.data.frame(table(dat$year)) first to extract the table and convert it to a data frame ...
Following Brian Diggs's answer, if you're willing to construct a bit more fortify machinery you can condense this a bit more:
A utility function that converts a factor to numeric if possible:
conv2num <- function(x) {
xn <- suppressWarnings(as.numeric(as.character(x)))
if (!all(is.na(xn))) xn else x
}
And a fortify method that turns the table into a data frame and then tries to make the columns numeric:
fortify.table <- function(x,...) {
z <- as.data.frame(x)
facs <- sapply(z,is.factor)
z[facs] <- lapply(z[facs],conv2num)
z
}
Now this works almost as you would like it to:
qplot(Var1,Freq,data=table(dat$year),geom="line")
(It would be nice/easier if there were a table option to preserve the numeric nature of cross-classifying factors ...)
Expanding on Ben's answer, the "standard" approach would be to create the data frame from the table, at which point you can convert the years back into numbers.
ytab <- as.data.frame(table(dat$year))
ytab$Var1 <- as.numeric(as.character(ytab$Var1))
Then either of the following will work:
ggplot(ytab, aes(Var1, Freq)) + geom_line()
qplot(Var1, Freq, data=ytab, geom="line")
The other approach is to create a fortify function which will transform the table into a data frame, and use that.
fortify.table <- as.data.frame.table
Then you can pass the table directly instead of a data frame. But Var1 is now still a factor and so you need group=1 to connect the line across years.
ggplot(table(dat$year), aes(Var1, Freq)) + geom_line(aes(group=1))
qplot(Var1, Freq, data=table(dat$year), geom="line", group=1)
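For comparison, a short sketch that skips the table step entirely by letting ggplot2 do the counting with stat = "count". This assumes year is stored as a number (here generated with 2001:2010 rather than the character form used in the question) and a reasonably recent ggplot2.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(year = sample(2001:2010, 1000, replace = TRUE))

# stat = "count" tallies rows per distinct year; the line connects those tallies
p <- ggplot(dat, aes(year)) +
  geom_line(stat = "count") +
  scale_x_continuous(breaks = 2001:2010)
p
```

Since each distinct year is counted exactly rather than binned, the points sit on the year values, which avoids the bin-alignment issue mentioned in the question.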
