How to draw barplot with both stacked and sided bars in R?

I have three variables, support, party, and gender. Support has three levels, 1 for yes, 2 for no, and 3 for neutral. Party and gender are dummy variables. Now let's fake them:
support = sample(1:3, size=100, replace=T)
party = as.numeric(rbinom(100, 100, 0.4) > 42)
gender = as.numeric(rbinom(100, 100, 0.4) > 39)
I want to see the percentage of support conditioned on party and gender. So far, I can do percentage conditioned on only one variable, say party.
counts = table(support,party)
percent = t(t(counts)/colSums(counts))
barplot(percent)
How can I split party by gender and place gender side-by-side while maintaining party support stacked in the current way? (If you don't understand what I am saying, read on ...)
To be clear, I want the first column bar 0 (party = 0) to be split into two adjacent column bars, one for each gender. Then I want the second column bar 1 (party = 1) to be split into two adjacent column bars, one for each gender. Each specific column bar should stay stacked the way the bars are now.
I am not even sure this can be accomplished.
BY THE WAY, is there a way to control the width of the bars? They are way too wide for my taste.

How about something like this? We can call barplot twice to place two sets of bars on the same plot. First, I named some of the data so I could keep track of it:
#sample data
set.seed(15)
support = sample(1:3, size=100, replace=T)
party = factor(as.numeric(rbinom(100, 100, 0.4) > 42), levels=0:1, labels=c("D","R"))
gender = factor(as.numeric(rbinom(100, 100, 0.4) > 39), levels=0:1, labels=c("M","F"))
Now we summarize the data separately for each party
tt<-table(support, gender, party)
p1<-tt[,,1]
p1<-p1/sum(p1)
p2<-tt[,,2]
p2<-p2/sum(p2)
And now we combine the barplots
xx<-barplot(p1, width=.3, space=c(.25,.6), xaxt="n",
xlim=c(0,2.4), ylim=c(0, max(colSums(p1), colSums(p2))))
axis(1,xx, levels(gender), line=0, tick=F)
yy<-barplot(p2, width=.3, space=c(5, .6), xaxt="n", add=T)
axis(1,yy, levels(gender), line=0, tick=F)
axis(1, c(mean(xx), mean(yy)), levels(party), line=1, tick=F)
This produces a plot in which the M and F bars sit side by side within each party, with support stacked inside each bar.
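If each bar should instead show the split of support within one party and gender combination (so that every bar sums to 1), one option is to normalize each column separately. This is a minimal sketch that assumes the same simulated variables as above; the name `pp` is only illustrative:

```r
# Sample data as in the answer above
set.seed(15)
support <- sample(1:3, size = 100, replace = TRUE)
party   <- factor(as.numeric(rbinom(100, 100, 0.4) > 42), levels = 0:1, labels = c("D", "R"))
gender  <- factor(as.numeric(rbinom(100, 100, 0.4) > 39), levels = 0:1, labels = c("M", "F"))

tt <- table(support, gender, party)
# Proportions within each gender x party column, so every bar sums to 1
pp <- prop.table(tt, margin = c(2, 3))

# Same two-call layout as above, with a fixed 0-1 y-axis
xx <- barplot(pp[, , 1], width = 0.3, space = c(0.25, 0.6), xaxt = "n",
              xlim = c(0, 2.4), ylim = c(0, 1))
axis(1, xx, levels(gender), line = 0, tick = FALSE)
yy <- barplot(pp[, , 2], width = 0.3, space = c(5, 0.6), xaxt = "n", add = TRUE)
axis(1, yy, levels(gender), line = 0, tick = FALSE)
axis(1, c(mean(xx), mean(yy)), levels(party), line = 1, tick = FALSE)
```

On the bar-width question: in `barplot()`, `width` only has a visible effect when `xlim` is also given, which is why both are set here. Lowering `width` (or widening `xlim`) makes the bars narrower.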

I'd do this with ggplot2, not by putting the bars side by side but by using sub-plots (facets, in ggplot2 jargon):
df = data.frame(support, party, gender)
library(ggplot2)
ggplot(df, aes(x = factor(party), fill = factor(support))) +
geom_bar() + facet_wrap(~ gender)
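Note that `geom_bar()` on its own stacks raw counts. To show shares within each bar, `position = "fill"` rescales every bar to 1. The following is a sketch under the same simulated data, not the answer's own code; it facets on `party` and puts `gender` on the x-axis to mimic the side-by-side layout asked for:

```r
library(ggplot2)

set.seed(15)
df <- data.frame(
  support = sample(1:3, size = 100, replace = TRUE),
  party   = as.numeric(rbinom(100, 100, 0.4) > 42),
  gender  = as.numeric(rbinom(100, 100, 0.4) > 39)
)

# One panel per party, one bar per gender, support stacked as proportions
p <- ggplot(df, aes(x = factor(gender), fill = factor(support))) +
  geom_bar(position = "fill", width = 0.5) +
  facet_wrap(~ party, labeller = label_both) +
  labs(x = "gender", y = "proportion", fill = "support")
print(p)
```

The `width` argument of `geom_bar()` also controls how wide the bars are drawn.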

R, ggplot, How do I keep related points together when using jitter?

One of the variables in my data frame is a factor denoting whether an amount was gained or spent. Every event has a "gain" value; there may or may not be a corresponding "spend" amount. Here is an image with the observations overplotted:
Adding some random jitter helps visually, however, the "spend" amounts are divorced from their corresponding gain events:
I'd like to see the blue circles "bullseyed" in their gain circles (where the "id" are equal), and jittered as a pair. Here are some sample data (three days) and code:
library(ggplot2)
ccode<-c(Gain="darkseagreen",Spend="darkblue")
ef<-data.frame(
date=as.Date(c("2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03")),
site=c("Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace","Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace"),
id=c("C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99","C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99"),
gainspend=c("Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend"),
amount=c(6,14,34,31,3,10,6,14,2,16,16,14,1,1,15,11,8,7,2,10,15,4,3,NA,NA,4,5,NA,NA,NA,NA,NA,NA,2,NA,1,NA,3,NA,NA,2,NA,NA,2,NA,3))
# 3 days, points centered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
# 3 days, jittered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5,position=position_jitter(w=0,h=0.2)) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
My main idea is the old "add jitter manually" approach. I'm wondering if a nicer approach could be something like plotting little pie charts as points a la package scatterpie.
In this case you could add a random number for the amount of jitter to each ID so points within groups will be moved the same amount. This takes doing work outside of ggplot2.
First, draw the "jitter" to add for each ID. Since a categorical axis is 1 unit wide, I choose numbers between -.3 and .3. I use dplyr for this work and set the seed so you will get the same results.
library(dplyr)
set.seed(16)
ef2 = ef %>%
group_by(id) %>%
mutate(jitter = runif(1, min = -.3, max = .3)) %>%
ungroup()
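The same one-offset-per-id idea can be written in base R, together with a quick check that rows sharing an `id` really receive the same offset. This is a sketch on a small made-up data frame; the names `toy` and `offsets` are illustrative and not part of the answer:

```r
# Toy data: each id appears once as Gain and once as Spend
toy <- data.frame(
  id        = rep(c("C123", "T101", "P96"), times = 2),
  gainspend = rep(c("Gain", "Spend"), each = 3),
  stringsAsFactors = FALSE
)

set.seed(16)
ids <- unique(toy$id)
# One random offset per id, looked up by name for every row
offsets <- setNames(runif(length(ids), min = -0.3, max = 0.3), ids)
toy$jitter <- unname(offsets[toy$id])

# Each id should map to exactly one distinct jitter value
n_values <- tapply(toy$jitter, toy$id, function(v) length(unique(v)))
print(n_values)
```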
Then the plot. I use a geom_blank() layer so that the categorical site axis is drawn before I add the jitter. I convert site to be numeric from a factor and add the jitter on; this only works for factors so luckily categorical axes in ggplot2 are based on factors.
Now paired ID's move together.
ggplot(ef2, aes(x = date, y = site)) +
geom_blank() +
geom_point(aes(size = amount, color = gainspend,
y = as.numeric(factor(site)) + jitter),
alpha=0.5) +
scale_color_manual(values = ccode) +
scale_size_continuous(range = c(1, 15), breaks = c(5, 10, 20))
#> Warning: Removed 15 rows containing missing values (geom_point).
Created on 2021-09-23 by the reprex package (v2.0.0)
You can add some jitter by id outside the ggplot() call.
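# one jitter value per unique id; runif(nrow(ef), ...) would recycle the ids and duplicate rows in the merge
jj <- data.frame(id = unique(ef$id), jtr = runif(length(unique(ef$id)), -0.3, 0.3))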
ef <- merge(ef, jj, by = 'id')
ef$sitej <- as.numeric(factor(ef$site)) + ef$jtr
But you need to make site integer/numeric to do this. So when it comes to making the plot, you need to manually add axis labels with scale_y_continuous(). (Update: the geom_blank() trick from aosmith above is a better solution!)
ggplot(ef,aes(date,sitej)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20)) +
scale_y_continuous(breaks = 1:3, labels= sort(unique(ef$site)))
This seems to work. Any "Gain" circle that still appears without a partner has an NA "Spend" amount, so no "Spend" point is drawn for it.
Perhaps someone else has a better approach!

ggplot2 density of circular data

I have a data set where x represents day of year (say birthdays) and I want to create a density graph of this.
Further, since I have some grouping information (say boys or girls), I want to use the capabilities of ggplot2 to make a density plot.
Easy enough at first:
require(ggplot2); require(dplyr)
bdays <- data.frame(gender = sample(c('M', 'F'), 100, replace = T), bday = sample(1:365, 100, replace = T))
bdays %>% ggplot(aes(x = bday)) + geom_density(aes(color = factor(gender)))
However, this gives a poor estimate because of edge effects.
I want to apply the fact that I can use circular coordinates so that 365 + 1 = 1 -- one day after December 31st is January 1st.
I know that the circular package provides this functionality, but I haven't had any success implementing it using a stat_function() call.
It's particularly useful for me to use ggplot2 because I want to be able to use facets, aes calls, etc.
Also, for clarification, I would like something that looks like geom_density -- I am not looking for a polar plot like the one shown at: Circular density plot using ggplot2.
To remove the edge effects you could stack three copies of the data, create the density estimate, and then show the density only for the middle copy of data. That will guarantee "wrap around" continuity of the density function from one edge to the other.
Below is an example comparing your original plot with the new version. I've used the adjust parameter to set the same bandwidth between the two plots. Note also that in the circularized version, you'll need to renormalize the densities if you want them to add to 1:
set.seed(105)
bdays <- data.frame(gender = sample(c('M', 'F'), 100, replace = T), bday = sample(1:365, 100, replace = T))
# Stack three copies of the data, with adjusted values of bday
bdays = bind_rows(bdays, bdays, bdays)
bdays$bday = bdays$bday + rep(c(0,365,365*2),each=100)
# Function to adjust bandwidth of density plot
# Source: http://stackoverflow.com/a/24986121/496488
bw = function(b,x) b/bw.nrd0(x)
# New "circularized" version of plot
bdays %>% ggplot(aes(x = bday)) +
geom_density(aes(color = factor(gender)), adjust=bw(10, bdays$bday[1:100])) +
coord_cartesian(xlim=c(365, 365+365+1), expand=0) +
scale_x_continuous(breaks=seq(366+89, 366+365, 90), labels=seq(366+89, 366+365, 90)-365) +
scale_y_continuous(limits=c(0,0.0016)) +
ggtitle("Circularized")
# Original plot
ggplot(bdays[1:100,], aes(x = bday)) +
geom_density(aes(color = factor(gender)), adjust=bw(30, bdays$bday[1:100])) +
scale_x_continuous(breaks=seq(90,360,90), expand=c(0,0)) +
ggtitle("Not Circularized")
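The renormalization mentioned above can be illustrated outside ggplot2 with base R's `density()`. This is a sketch rather than the answer's code: it assumes a fixed bandwidth of 10 days and places the extra copies one period below and one period above the data. Restricted to a single period, the stacked estimate integrates to about 1/3, so multiplying by 3 restores a proper density:

```r
set.seed(105)
bday <- sample(1:365, 100, replace = TRUE)

# Stack three copies: one period below, the original, one period above
stacked <- c(bday - 365, bday, bday + 365)

# Evaluate the kernel density only over the middle period
d <- density(stacked, bw = 10, from = 0.5, to = 365.5, n = 512)

# Each copy carries 1/3 of the total mass, so rescale by 3
circ <- d$y * 3

# Trapezoid-rule check that the rescaled curve integrates to roughly 1
area <- sum(diff(d$x) * (head(circ, -1) + tail(circ, -1)) / 2)
print(area)

plot(d$x, circ, type = "l", xlab = "bday", ylab = "density",
     main = "Circularized (base R sketch)")
```

Because the evaluation window is exactly one period long, the curve takes approximately the same value at both ends, which is the wrap-around continuity the stacking is meant to provide.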

Proper display of confidence interval in R using ggplot

I'm trying to make a plot that will represent 2 measurements(prr and ebgm) for different adverse reactions of different drugs grouped by age category like so:
library(ggplot2)
strata <- factor(c("Neonates", "Infants", "Children", "Adolescents", "Pediatrics"), levels=c("Neonates", "Infants", "Children", "Adolescents", "Pediatrics"), order=T)
Data <- data.frame(
strata = sample(strata, 200, replace=T),
drug=sample(c("ibuprofen", "clarithromycin", "fluticasone"), 200, replace=T), # 20 drugs
reaction=sample(c("Liver Injury", "Sepsis", "Acute renal failure", "Anaphylaxis"), 200, replace=T),
measurement=sample(c("prr", "EBGM"), 200, replace=T),
value_measurement=sample(runif(16), 200, replace=T),
lower_CI=sample(runif(6), 200, replace=T),
upper_CI=sample(runif(5), 200, replace=T)
)
g <- ggplot(Data, aes(x=strata, y=value_measurement, fill=measurement, group=measurement))+
geom_histogram(stat="identity", position="dodge")+
facet_wrap(~reaction)+
geom_errorbar(aes(x=strata, ymax=upper_CI, ymin=lower_CI), position="dodge", stat="identity")
ggsave(file="meh.png", plot=g)
The upper and lower CI are the confidence interval limits of the measurement. Since I have a confidence interval for each measurement, I want each bar in the histogram to show its own confidence interval, but that is not what I get.
Any ideas how to place those nasty conf intervals properly? Thank you!
Later edit: in the original data for a given drug I have many rows each containing an adverse reaction, the age category and each of these categories has 2 measurements: prr or EBGM and the corresponding confidence interval. This is not reflected in the data simulation.
The problem is that each of your bars is really multiple bars plotted over each other, because you have more than one row of data for each combination of reaction, strata, and measurement. (You're getting multiple error bars for the same reason.)
You can see this in the code below, where I've changed geom_histogram to geom_bar and added alpha=0.3 and colour="grey40" to show the multiple overlapping bars. I've also commented out the error bars.
ggplot(Data, aes(x=strata, y=value_measurement, fill=measurement, group=measurement)) +
geom_bar(stat="identity", position="dodge", alpha=0.3, colour="grey40") +
facet_wrap(~reaction) #+
# geom_errorbar(aes(x=strata, ymax=upper_CI, ymin=lower_CI),
# position="dodge", stat="identity")
You can fix this by adding another column to your data that adds a grouping category by which you can separate these bars. For example, in the code below we add a new column called count that just assigns numbers 1 through n for each row of data within each combination of reaction and strata. We sort by measurement so that each measurement type will be kept together in the count sequence.
library(dplyr)
Data = Data %>% group_by(reaction, strata) %>%
arrange(measurement) %>%
mutate(count = 1:n())
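For readers not using dplyr, a rough base-R equivalent of the `count` column can be built with `ave()`. This is a sketch on a small made-up data frame (`d` is hypothetical); it sorts by `measurement` first and then numbers the rows within each combination of `reaction` and `strata`:

```r
d <- data.frame(
  reaction    = c("Sepsis", "Sepsis", "Sepsis", "Anaphylaxis", "Anaphylaxis"),
  strata      = c("Infants", "Infants", "Infants", "Children", "Children"),
  measurement = c("prr", "EBGM", "prr", "prr", "EBGM"),
  stringsAsFactors = FALSE
)

# Sort so rows of the same measurement sit together, as in the dplyr version
d <- d[order(d$measurement), ]

# Number the rows 1..n within each reaction x strata group
d$count <- ave(seq_len(nrow(d)), d$reaction, d$strata, FUN = seq_along)
print(d)
```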
Now plot the data:
ggplot(Data, aes(x=strata, y=value_measurement,
fill=measurement, group=count)) +
geom_bar(stat="identity", position=position_dodge(0.7), width=0.6) +
facet_wrap(~reaction, ncol=1) +
geom_errorbar(aes(x=strata, ymax=upper_CI, ymin=lower_CI, group=count),
position=position_dodge(0.7), stat="identity", width=0.3)
Now you can see the separate bars, along with their error bars (which are weird, but only because they're fake data).
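The "Later edit" in the question says the real data has one row per drug, reaction, age category and measurement. In that situation no helper column should be needed: dodging the bars and the error bars by the same width lines them up. Here is a minimal sketch on made-up data for a single drug (all values below are invented for illustration):

```r
library(ggplot2)

set.seed(1)
# One row per strata x reaction x measurement (hypothetical values)
d <- expand.grid(
  strata      = c("Neonates", "Infants", "Children"),
  reaction    = c("Sepsis", "Anaphylaxis"),
  measurement = c("prr", "EBGM"),
  stringsAsFactors = FALSE
)
d$strata   <- factor(d$strata, levels = c("Neonates", "Infants", "Children"))
d$value    <- runif(nrow(d), 1, 3)
d$lower_CI <- d$value - 0.4   # interval brackets the estimate
d$upper_CI <- d$value + 0.4

dodge <- position_dodge(width = 0.9)  # shared by bars and error bars

p <- ggplot(d, aes(x = strata, y = value, fill = measurement)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = lower_CI, ymax = upper_CI),
                position = dodge, width = 0.3) +
  facet_wrap(~ reaction)
print(p)
```

Sharing one `position_dodge()` object between the two layers is a design choice that keeps the offsets identical, so each error bar stays centered on its bar.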

Multiple variables on same barplot in R with ifelse col statement

I currently have a matrix of data with size 365x8 (365 days, 8 variables). Each day value represents a true or false value, and for each day, I want to plot a red box if true, and a green box if false.
My goal is to display all 8 columns of the matrix in a barplot. Below is some current code I have.
data # matrix with size 365x8
# Plot good vs. bad data
first <- barplot(data, xlab = "# of Days", ylab = "Type of Data",
col=ifelse(data==TRUE, "red", "green"), beside=FALSE, horiz=TRUE,
xlim=c(0,365))
print(first)
I want the ifelse statement for 'col' to go to each value of each column in 'data' and determine which color each value for each day should be. My problem is that I can only get my barplot to plot the first variable in data (data[,1]), and it repeats this for all eight. Do I need to call barplot multiple times (tried to, didn't work), or call something else entirely? I would post a plot of my current progress, but I don't have enough reputation yet.
Let me know if I haven't been clear enough, tried my best to describe what I'm doing.
Have you considered using ggplot2 and creating a "heatmap" type plot (with geom_tile)? I made an example where dat is your data matrix, and all the values were either 0 or 1.
ggplot(melt(dat), aes(x=Var2, y=Var1, fill=factor(value))) +
geom_tile() +
scale_fill_manual(values=c('green', 'red'))
EDIT: You will also need to use melt from the reshape2 package. Here is the data I randomly used, and part of the plot generated:
dat <- replicate(8, sample(c(0, 0, 0, 1), 365, replace=T))
ggplot(melt(dat[1:50, ]), aes(x=Var2, y=Var1, fill=factor(value))) +
geom_tile() +
scale_fill_manual(values=c('green', 'red'))
You can also do this with heatmap() in base R. It only works with numeric values, though, so you'll need to convert your TRUE/FALSE values to numbers first. Also, to keep the rows and columns in the same order in the plot as in the matrix, you need to turn off reordering in the call to heatmap(). So something like this:
data.num <- ifelse(data == TRUE, 1, 0)
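# col takes a vector of colors (0 -> gray85, 1 -> gray15); scale = "none" keeps the 0/1 values unscaled
heatmap(data.num, Rowv = NA, Colv = NA, scale = "none", col = c("gray85", "gray15"))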
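Another base-R option is `image()`, which colors one cell per matrix entry. This sidesteps the behavior seen in the question, where `barplot()` recycles `col` over the stacked segments of every bar, so the colors of the first column repeat. The sketch below uses a randomly generated stand-in for the 365 x 8 matrix:

```r
set.seed(42)
# Hypothetical stand-in for the 365 x 8 logical matrix
data <- matrix(sample(c(TRUE, FALSE), 365 * 8, replace = TRUE, prob = c(0.25, 0.75)),
               nrow = 365, ncol = 8)

# image() needs numbers: FALSE -> 0, TRUE -> 1
z <- data * 1

# Days on the x-axis, variables on the y-axis; 0 = green, 1 = red
image(x = 1:365, y = 1:8, z = z, col = c("green", "red"),
      breaks = c(-0.5, 0.5, 1.5),
      xlab = "# of Days", ylab = "Type of Data")
```

Giving explicit `breaks` pins 0 to the first color and 1 to the second, even if a column happens to contain only one of the two values.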

ggplot: percentage counts line graph for factor groups on a scale

Say I want to plot percentages of "yes" answers to a question, across different age groups in ggplot. These age groups are obviously factors, but I want them to be shown in a scale-like fashion, so want to use a line graph.
Here's some data:
mydata <- data.frame(
age_group = c("young", "middle", "old"),
question = sample(c("yes", "no"), 99, replace = TRUE))
# explicit levels; levels() of a character column is NULL in R >= 4.0
mydata$age_group = factor(mydata$age_group, levels = c("young", "middle", "old"))
mydata$question = factor(mydata$question, levels = c("yes", "no"))
So far, I have been using this code to generate a stacked barplot:
ggplot(mydata, aes(age_group, fill = question)) + geom_bar(position = "fill")
How could I change this into a line graph, with just the frequency counts of the "yes" answers? Mark's answer below suggests a workaround that produces the right output, but I was hoping there was a way to do this automatically in one line of code, rather than creating this summary table first.
If I understood correctly, this does what you want:
ggplot(mydata) +
stat_count(aes(x=age_group, color=question, group=question), geom="line") # stat_count() replaces stat_bin() for discrete x in ggplot2 >= 2.0.0
Note this doesn't look exactly the same as yours in terms of yes/no because you didn't set a seed for the random numbers.
If you just want the percentages of "yes" for each category, I suggest changing your data to the following:
  question age_group value   percent
1      yes     young    14 0.4242424
3      yes    middle    17 0.5151515
5      yes       old    20 0.6060606
Using this code to summarize the data:
library(reshape)
mydata.summary = melt(xtabs(~question+age_group,data=mydata))
mydata.summary2 = mydata.summary[mydata.summary$question=="yes",]
mydata.summary2$percent <- mydata.summary2$value/melt(xtabs(~age_group,data=mydata))$value
ggplot(mydata.summary2, aes(age_group,percent, group = question, colour=question)) + geom_line()
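If the goal is to skip the summary table, one possibility is to let ggplot2 compute the share of "yes" per group with `stat_summary()`. This is a sketch that assumes ggplot2 3.3.0 or later (for the `fun` argument) and a fixed seed; the helper column `is_yes` is introduced here purely for illustration:

```r
library(ggplot2)

set.seed(1)
mydata <- data.frame(
  age_group = rep(c("young", "middle", "old"), length.out = 99),
  question  = sample(c("yes", "no"), 99, replace = TRUE)
)
mydata$age_group <- factor(mydata$age_group, levels = c("young", "middle", "old"))

# Share of "yes" answers per age group, computed without ggplot2
pct_yes <- tapply(mydata$question == "yes", mydata$age_group, mean)
print(pct_yes)

# The same quantity computed by ggplot2: the mean of a 0/1 indicator
mydata$is_yes <- as.numeric(mydata$question == "yes")
p <- ggplot(mydata, aes(age_group, is_yes, group = 1)) +
  stat_summary(fun = mean, geom = "line") +
  labs(y = "proportion answering yes")
print(p)
```

The mean of a 0/1 indicator equals the proportion of ones, which is why averaging `is_yes` within each age group gives the percentage directly.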
