How to make a bar plot using ggplot that uses multiple columns for the x-axis? - r

I am trying to use multiple column names as the x-axis categories in a bar plot, so that each column name acts as the "factor" and the data it contains supplies the values for that category.
I have tried iterations of this:
ggplot(aes( x = names, y = count)) + geom_bar()
I also tried concatenating the columns I want to show with aes(c(col1, col2)),
but then the length of the aesthetic does not match the data and the plot fails.
library(dplyr)
library(ggplot2)
head(dat)
Sample Week Response_1 Response_2 Response_3 Response_4 Vaccine_Type
1 1 1 300 0 2000 100 1
2 2 1 305 0 320 15 1
3 3 1 310 0 400 35 1
4 4 1 400 1 410 35 1
5 5 1 405 0 180 35 2
6 6 1 410 2 800 75 2
dat %>%
group_by(Week) %>%
ggplot(aes(c(Response_1, Response_2, Response_3, Response_4))) +
geom_boxplot() +
facet_grid(.~Week)
dat %>%
group_by(Week) %>%
ggplot(aes(Response_1, Response_2, Response_3, Response_4)) +
geom_boxplot() +
facet_grid(.~Week)
> Error: Aesthetics must be either length 1 or the same as the data (24): x
Both of these failed (kind of expected based on aes length error code), but hopefully you know the direction I was aiming for and can help out.
Goal is to have 4 separate groups, each with their own boxplot (1 for every response). And also have them faceted by week.

The simple base R code below got me most of what I want. Unfortunately, I don't think it is as easy to add the points and other features to the plot as it is with ggplot.
boxplot(dat[,3:6], use.cols = TRUE)
I could also pretty easily filter by the different weeks and use mfrow for faceting. It is not as informative as ggplot, but it gets the job done. If anyone has other workarounds, I'd be interested in seeing them.
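For a ggplot version, one common approach is to reshape the data to long format first, so that the response column names become values of a single variable that can be mapped to x. The following is only a minimal sketch: it rebuilds the six rows shown by head(dat), uses tidyr::pivot_longer (not part of the original attempts), and the column names Response and Value are hypothetical choices.

```r
library(tidyr)
library(ggplot2)

# Rebuild the six rows shown by head(dat) so the sketch is self-contained
dat <- data.frame(
  Sample       = 1:6,
  Week         = 1,
  Response_1   = c(300, 305, 310, 400, 405, 410),
  Response_2   = c(0, 0, 0, 1, 0, 2),
  Response_3   = c(2000, 320, 400, 410, 180, 800),
  Response_4   = c(100, 15, 35, 35, 35, 75),
  Vaccine_Type = c(1, 1, 1, 1, 2, 2)
)

# Long format: one row per (sample, response column) pair
dat_long <- pivot_longer(
  dat,
  cols      = starts_with("Response_"),
  names_to  = "Response",  # hypothetical name for the former column names
  values_to = "Value"      # hypothetical name for the measurements
)

# One boxplot per response, raw points on top, one panel per week
p <- ggplot(dat_long, aes(x = Response, y = Value)) +
  geom_boxplot(outlier.shape = NA) +  # outliers hidden because all points are drawn below
  geom_jitter(width = 0.1) +
  facet_grid(. ~ Week)
print(p)
```

Because the responses are on very different scales in this sample, facet_grid(. ~ Week, scales = "free_y") or a log scale may be worth trying.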

Related

boxplot with filtered values

I am new to coding and want to create boxplots based on my data.
For that, I want to filter a boxplot by specific values:
My data structure is called "Auswertungen" and is structured like this:
Participant Donation Treatment Manipulation
1 0 1 passed
2 0.4 2 passed
3 0.2 2 failed
4 0 3 failed
5 0.3 3 passed
Now I want to plot the Donations by Treatment using a boxplot. I want two graphs: one with all data points and one without those who failed the manipulation.
I found something like
boxplot(Donation ~ Treatment)
with(subset(Auswertungen, Manipulation == "passed"), boxplot(Donation ~ Treatment))
but the second call shows me exactly the same boxplots as before, so I guess the subset is not working?
Got it, sorry.
boxplot(Donation ~ Treatment)
boxplot(Donation[Manipulation == "passed"] ~ Treatment[Manipulation == "passed"])
If your data is roughly structured like this:
set.seed(222)
Donation <- abs(rnorm(20))
Treatment <- sample(1:3, 20, replace = TRUE)
Manipulation <- sample(c("passed", "failed"), 20, replace = TRUE)
df <- data.frame(Donation, Treatment, Manipulation)
df
Donation Treatment Manipulation
1 1.487757090 3 passed
2 0.001891901 2 failed
3 1.381020790 1 failed
4 0.380213631 3 passed
5 0.184136230 1 failed
6 0.246895883 3 passed
7 1.215560910 3 failed
8 1.561405098 1 failed
9 0.427310197 2 passed
10 1.201023506 3 passed
11 1.052458495 2 passed
12 1.305063566 2 failed
13 0.692607634 3 failed
14 0.602648854 3 failed
15 0.197753074 2 failed
16 1.185874517 2 passed
17 2.005512989 3 passed
18 0.007509885 2 passed
19 0.519490356 2 failed
20 0.746295471 2 failed
And you want to have two boxplots, you can first define a two-panel layout:
par(mfrow = c(1,2))
And then fill your two boxplots into it, the first one unfiltered:
boxplot(df$Donation ~ df$Treatment)
and the second filtered on the condition that Manipulation=="passed":
boxplot(df$Donation[df$Manipulation=="passed"] ~ df$Treatment[df$Manipulation=="passed"])
The result would be something like this:
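The same filtering can also be written with the data and subset arguments of the formula method of boxplot(), which avoids repeating the condition on both sides of the formula. This is a small sketch on made-up data: the first five rows follow the sample in the question and the last three are invented for illustration.

```r
# First five rows follow the question's sample; the last three are made up
df <- data.frame(
  Donation     = c(0, 0.4, 0.2, 0, 0.3, 0.1, 0.5, 0.25),
  Treatment    = c(1, 2, 2, 3, 3, 1, 2, 3),
  Manipulation = c("passed", "passed", "failed", "failed",
                   "passed", "failed", "passed", "passed")
)

par(mfrow = c(1, 2))  # two panels side by side

# All participants
boxplot(Donation ~ Treatment, data = df, main = "All participants")

# Only participants who passed the manipulation check
boxplot(Donation ~ Treatment, data = df,
        subset = Manipulation == "passed", main = "Passed only")
```

Passing data = df also means the variables are looked up in the data frame rather than in the global environment, which helps if stray copies of Donation or Treatment exist in the workspace.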

Factor Level issues after filling data frame using match

I am using two large data files, each having >2m records. The sample data frames are
x <- data.frame("ItemID" = c(1,2,1,1,3,4,2,3,4,1), "SessionID" = c(111,112,111,112,113,114,114,115,115,115), "Avg" = c(1.0,0.45,0.5,0.5,0.46,0.34,0.5,0.6,0.10,0.15),"Category" =c(0,0,0,0,0,0,0,0,0,0))
y <- data.frame("ItemID" = c(1,2,3,4,3,4,5,7),"Category" = c("1","0","S","120","S","120","512","621"))
I successfully filled x$Category using the following command
x$Category <- y$Category[match(x$ItemID,y$ItemID)]
but
x$Category
gave me
[1] 1 0 1 1 S 120 0 S 120 1
Levels: 0 1 120 512 621 S
In x there are only four distinct categories, but Levels shows six. Similarly, the frequency table shows 512 and 621 with a frequency of 0. I am using the same data for classification, where it reports six classes instead of four, which negatively affects the F-measure, recall, etc.
table(x$Category)
0 1 120 512 621 S
2 4 2 0 0 2
while I want
table(x$Category)
0 1 120 S
2 4 2 2
I tried merge and the approaches from a number of other questions, but they give me an error message. I found in "Practical limits of R data frame" that this is a limitation of R.
I would omit the Category column from your x data.frame, since it seems to only be serving as a placeholder until values from the y data.frame are filled in. Then, you can use left_join from dplyr with ItemID as the key variable, followed by droplevels() as suggested by TingITangIBob.
This gets you close, but my table does not exactly match yours:
library(dplyr) # provides %>%
joined <- dplyr::select(x, -Category) %>%
  dplyr::left_join(y, by = "ItemID") %>%
  droplevels()
table(joined$Category)
0 1 120 S
2 4 4 4
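The difference comes from the repeated ItemIDs in y: ItemIDs 3 and 4 each appear twice there, so the join duplicates the matching rows of x and the counts for 120 and S double.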
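If the match() approach from the question is kept, wrapping the result in droplevels() should be enough to discard the unused levels. Below is a minimal sketch using the sample data from the question. Note that stringsAsFactors = TRUE is set explicitly here, since data.frame() no longer converts strings to factors by default from R 4.0 on.

```r
# Sample data from the question (Category placeholder column omitted from x)
x <- data.frame(ItemID = c(1, 2, 1, 1, 3, 4, 2, 3, 4, 1))
y <- data.frame(
  ItemID   = c(1, 2, 3, 4, 3, 4, 5, 7),
  Category = c("1", "0", "S", "120", "S", "120", "512", "621"),
  stringsAsFactors = TRUE  # make Category a factor, as in the question's output
)

# match() returns the first matching row of y for each ItemID in x,
# so duplicated ItemIDs in y do not add rows; droplevels() removes
# the levels (512, 621) that no longer occur
x$Category <- droplevels(y$Category[match(x$ItemID, y$ItemID)])

levels(x$Category)
table(x$Category)
```

On this sample the table has the four levels 0, 1, 120 and S with counts 2, 4, 2 and 2, which is the output asked for.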

How to plot relative frequencies in R or Stata

I have this dataset :
> head(xc)
wheeze3 SmokingGroup_Kai TG2000 TG2012 PA_Score asthma3 tres3 age3 bmi bmi3
1 0 1 2 2 2 0 0 47 20.861 21.88708
2 0 5 2 3 3 0 0 57 20.449 23.05175
3 0 1 2 3 2 0 0 45 25.728 26.06168
4 0 2 1 1 3 0 0 48 22.039 23.50780
5 1 4 2 2 1 0 1 61 25.391 25.63692
6 0 4 2 2 2 0 0 54 21.633 23.66144
education3 group_change
1 2 0
2 2 3
3 3 3
4 3 0
5 1 0
6 2 0
Here
asthma3 is a variable that takes values 0,1 ;
group_change takes values 0,1,2,3,4,5,6 ;
age3 represents the age.
I would like to plot the percentage of people with asthma3==1 as a function of the variable age3.
I would like 6 lines on the same plot obtained dividing the samples by group_change.
I think that this should be possible using ggplot2.
Here's a ggplot2 approach:
library(ggplot2)
library(dplyr)
# Create fake data
set.seed(10)
xc=data.frame(age3=sample(40:50, 500, replace=TRUE),
asthma3=sample(0:1,500, replace=TRUE),
group_change=sample(0:6, 500, replace=TRUE))
# Summarize asthma percent by group_change and age3 (using dplyr)
xc1 = xc %>%
  group_by(group_change, age3) %>%
  summarize(asthma.pct=mean(asthma3)*100)
# Plot using ggplot2
ggplot(xc1, aes(x=age3, y=asthma.pct, colour=as.factor(group_change))) +
geom_line() +
geom_point() +
scale_x_continuous(breaks=40:50) +
xlab("Age") + ylab("Asthma Percent") +
scale_colour_discrete(name="Group Change")
Here's another ggplot2 approach that works directly with the original data frame and calculates the percentages on the fly. I've also formatted the y-axis in percent format.
library(scales) # Need this for "percent_format()"
ggplot(xc, aes(x=age3, y=asthma3, colour=as.factor(group_change))) +
stat_summary(fun=mean, geom='line') +
stat_summary(fun=mean, geom='point') +
scale_x_continuous(breaks=40:50) +
scale_y_continuous(labels=percent_format()) +
xlab("Age") + ylab("Asthma Percent") +
scale_colour_discrete(name="Group Change")
Here is one way using Stata. The example data has three groups.
The proportions are computed taking the mean of asthma3 which you identify as a binary variable.
clear all
set more off
*----- example data -----
set obs 500
set seed 135
gen age3 = floor((50-40+1)*runiform() + 40)
gen asthma3 = round(runiform())
egen group_change = seq(), to(3)
*----- pretty list -----
order age3 group_change
sort age3 group_change asthma3
list, sepby(age3)
*----- compute proportions -----
collapse (mean) asthma3, by(age3 group_change)
list
*----- syntax for graph and graph -----
levelsof(group_change), local(gc)
local i = 1
foreach g of local gc {
local call `call' || connected asthma3 age3 if group_change == `g', sort
local leg `leg' label(`i++' "Group`g'") // syntax for legend
}
twoway `call' legend(`leg') /// graph
title("Proportion with asthma by group")
This coincides with one of my first questions in Statalist. In Nick's words, you "build up the syntax" using a local macro and then feed that to twoway.
@NickCox, in a comment, suggests an alternative:
<snip>
*----- compute proportions -----
collapse (mean) asthma3, by(age3 group_change)
list
*----- graph -----
separate asthma3, by(group_change) veryshortlabel
twoway connected asthma31-asthma33 age3, sort ///
title("Proportion with asthma by group")
<snip>
This second alternative creates new variables from the original asthma3 which I abbreviate in the call to twoway connected as asthma31-asthma33.
Both alternatives produce a legend identifying groups. Labels I leave to you (see help graph).

drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
my data (a sample, original has 20 observations per imputation and 13 vars per group, all values range from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
1 1 25 25 24 24
1 2 4 0 2 6
1 3 11 5 3 2
1 4 12 3 3 3
2 1 20 15 15 15
2 2 4 1 2 3
2 3 0 0 0 6
2 4 20 0 0 0
.imp signifies the imputation round, .id the identifier for each observation.
I want to draw all the FTE_* variables in a single plot (and the OMZ_* variables in another), but wonder what to do with all the imputations: can I just include all values? The imputed data now has 500 observations. With, for instance, an ANOVA I'd need to average the ANOVA results by 5 to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, max. and min.?
Such as:
data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.
I assume you want to keep the identifiers .imp and .id after melting, so use this instead:
data_melt <- melt(df,c(".imp",".id"))
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE",data_melt$variable),"FTE","OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()+facet_wrap(~type,scales="free_x")
This would look like this.
EDIT: fixed the data mess-up
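For reference, the same steps can be sketched end to end on the sample rows from the question. This version swaps in tidyr::pivot_longer for reshape2's melt (a substitution, not what the answer above used) and keeps .imp and .id as identifier columns.

```r
library(tidyr)
library(ggplot2)

# The eight sample rows shown in the question
df <- data.frame(
  .imp   = rep(1:2, each = 4),
  .id    = rep(1:4, times = 2),
  FTE_RM = c(25, 4, 11, 12, 20, 4, 0, 20),
  FTE_PD = c(25, 0, 5, 3, 15, 1, 0, 0),
  OMZ_RM = c(24, 2, 3, 3, 15, 2, 0, 0),
  OMZ_PD = c(24, 6, 2, 3, 15, 3, 6, 0)
)

# Long format, keeping the imputation round and observation id
data_long <- pivot_longer(
  df,
  cols      = -c(.imp, .id),
  names_to  = "variable",
  values_to = "value"
)

# Mark each variable as FTE or OMZ so the plot can be split by type
data_long$type <- ifelse(grepl("^FTE_", data_long$variable), "FTE", "OMZ")

p <- ggplot(data_long, aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(~type, scales = "free_x")
print(p)
```

Since .imp is kept in the long data, it can be used to filter to a single imputation or as an additional faceting variable when comparing imputations.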

R stacked percentage bar plot with percentage of binary factor and labels (with ggplot)

I want to produce a graphic that looks something like this:
My original data set looks something like this:
> bb[sample(nrow(bb), 20), ]
IMG QUANT FIX
25663 1 1 0
7936 2 2 0
23586 3 2 0
23017 2 2 1
31363 1 3 1
7886 2 2 0
23819 3 3 1
29838 2 2 1
8169 2 3 1
9870 2 3 0
31440 2 1 0
35564 3 1 0
24066 1 2 0
12020 3 2 0
6742 3 2 0
6189 2 3 0
26692 2 3 0
1387 3 2 0
31839 2 3 1
28637 3 2 0
So the idea is that the bars display where FIX = 1 per factor QUANT and per
factor IMG.
I've aggregated my data set into percentages using plyr
library(plyr)
bb.perc <- ddply(bb,.(QUANT,IMG),summarise,FIX.PROP = sum(FIX) / length(FIX))
It does almost the right thing:
QUANT IMG FIX.PROP
1 1 1 0.52439024
2 1 2 0.19085366
3 1 3 0.13658537
4 2 1 0.20414201
5 2 2 0.53964497
6 2 3 0.09585799
7 3 1 0.29000000
8 3 2 0.13000000
9 3 3 0.40705882
But now if I make a graph, it doesn't account for the FIX==0 cases, i.e. all bars have the same height, namely 100%, which isn't what I want. Note how the individual QUANT subframes don't add up to 100%:
> sum(bb.perc[1:3,]$FIX.PROP)
[1] 0.8518293
> sum(bb.perc[4:6,]$FIX.PROP)
[1] 0.839645
> sum(bb.perc[7:9,]$FIX.PROP)
[1] 0.8270588
The best I could do with R is to display counts:
# Take only the positive samples
bb.pos <- bb[bb$FIX == 1,]
# Plot the counts
ggplot(bb,aes(factor(QUANT),fill=factor(IMG))) + geom_bar() +
scale_y_continuous(labels=percent)
And results in:
This is also not what I want:
- The percentage scale is way off. I need a way to pass the 100% point to the percent function, but I have no idea how.
- It lacks the labels.
There are a great deal of similar questions on SO already, but I seem to lack
the sufficient amount of intelligence (or understanding of R) to extrapolate
from them to a solution to my particular problem.
Thanks for any pointers!
EDIT: Sven Hohenstein provided an answer already, but here's how I ended up doing it myself as well:
ggplot(bb.perc, aes(x = factor(QUANT), y = FIX.PROP,
                    label = paste(round(FIX.PROP * 100), "%"),
                    fill = factor(IMG))) +
  geom_bar(stat = "identity") +
  geom_text(position = "stack", aes(ymax = 1), vjust = 5) +
  scale_y_continuous(labels = percent)
Using the bb.perc that I defined further up using plyr. This one has the
advantage that the percentages are computed locally per column, and not
globally.
Thanks everyone for the help. The following two questions and their respective
answers helped me greatly in getting it right:
Stacked Bar Graph Labels with ggplot2
Adding labels to ggplot bar chart
What I did wrong initially, was pass the position = "fill" parameter to
geom_bar(), which for some reason made all the bars have the same height!
This is a way to generate the plot:
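library(scales) # for percent
ggplot(bb[bb$FIX == 1, ], aes(x = factor(QUANT), fill = factor(IMG),
                              y = after_stat(count / sum(count)))) +
  geom_bar() +
  # stat_count() is used because current ggplot2 versions reject stat_bin() on a discrete x
  stat_count(geom = "text",
             aes(label = after_stat(paste(round(count / sum(count) * 100), "%"))),
             vjust = 5) +
  scale_y_continuous(labels = percent)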
Change the value of the vjust parameter to adjust the vertical position of the labels.
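To see why the FIX.PROP values from the question do not add up to 100% within a QUANT level, it may help to compare two different quantities: the proportion of FIX == 1 inside each QUANT x IMG cell (which is what sum(FIX) / length(FIX) computes), and the share of each IMG among the FIX == 1 rows of a QUANT level. The sketch below uses made-up, deterministic data, not the original bb; the helper column obs is hypothetical.

```r
# Made-up data: 10 observations per QUANT x IMG cell, deterministic FIX pattern
bb <- expand.grid(IMG = 1:3, QUANT = 1:3, obs = 1:10)
bb$FIX <- as.integer((bb$IMG + bb$QUANT + bb$obs) %% 4 == 0)

# Proportion of FIX == 1 within each QUANT x IMG cell
# (same quantity as sum(FIX) / length(FIX) per cell)
cell_prop <- aggregate(FIX ~ QUANT + IMG, data = bb, FUN = mean)

# These per-cell proportions have no reason to sum to 1 within a QUANT level
by_quant <- tapply(cell_prop$FIX, cell_prop$QUANT, sum)
print(by_quant)

# Share of each IMG among the FIX == 1 rows, normalised within each QUANT
pos   <- bb[bb$FIX == 1, ]
share <- prop.table(table(QUANT = pos$QUANT, IMG = pos$IMG), margin = 1)
print(rowSums(share))  # each row sums to 1
```

Which of the two is the right one to plot depends on what the bars are meant to show. The first gives bars of varying total height, while the second always stacks to 100%.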

Resources