Grouping 2 categorical variables with geom_boxplot

Grouping 2 categorical variables with geom_boxplot - r

I have tried some examples I found here but I always get an error or a different graph from what I need (e.g. lines instead of the boxplot, or only 2 boxes instead of 4).
I want to plot the following
Condition Time mean sem
A I 0.5578552 0.05294356
A II 0.6957565 0.09149457
P I 0.7078374 0.08142464
P II 0.7762761 0.10945771 ```
I need "Condition" in the x axis and I need to group "Time".
The idea is to get a similar visual representation to this:
enter image description here
My attempt was:
ggplot(data = means.sem, aes(x = Condition, y = mean, fill=Time, ymin = mean-sem, ymax = mean + sem))
+ geom_boxplot() +
stat_boxplot(geom ='errorbar', width = 0.5)+
scale_y_continuous(expand = c(0, 0), limits = c(0, 0.85))+ scale_fill_manual(values=c("black", "grey"))+
labs(y= "Mean", x="")+ theme_classic()```
Thank you!

What do you want your y-axis to be? On the assumption it is, for example, the sem variable, I use the following code:
boxplot <- ggplot(data=dataset, aes(x=condition, y=sem, fill=time)) + geom_boxplot(position="dodge2")
Obviously you can alter the colours, etc as you need to.
EDIT: changed the position to dodge2 as this creates a pleasing small gap between each boxplot within a group.

Related

R code of scatter plot for three variables

Hi I am trying to code for a scatter plot for three variables in R:
Race= [0,1]
YOI= [90,92,94]
ASB_mean = [1.56, 1.59, 1.74]
Antisocial <- read.csv(file = 'Antisocial.csv')
Table_1 <- ddply(Antisocial, "YOI", summarise, ASB_mean = mean(ASB))
Table_1
Race <- unique(Antisocial$Race)
Race
ggplot(data = Table_1, aes(x = YOI, y = ASB_mean, group_by(Race))) +
geom_point(colour = "Black", size = 2) + geom_line(data = Table_1, aes(YOI,
ASB_mean), colour = "orange", size = 1)
Image of plot: https://drive.google.com/file/d/1E-ePt9DZJaEr49m8fguHVS0thlVIodu9/view?usp=sharing
Data file: https://drive.google.com/file/d/1UeVTJ1M_eKQDNtvyUHRB77VDpSF1ASli/view?usp=sharing
Can someone help me understand where I am making mistake? I want to plot mean ASB vs YOI grouped by Race. Thanks.

I am not sure what is your desidered output. Maybe, if I well understood your question I Think that you want somthing like this.
g_Antisocial <- Antisocial %>%
group_by(Race) %>%
summarise(ASB = mean(ASB),
YOI = mean(YOI))
Antisocial %>%
ggplot(aes(x = YOI, y = ASB, color = as_factor(Race), shape = as_factor(Race))) +
geom_point(alpha = .4) +
geom_point(data = g_Antisocial, size = 4) +
theme_bw() +
guides(color = guide_legend("Race"), shape = guide_legend("Race"))
and this is the output:

#Maninder: there are a few things you need to look at.
First of all: The grammar of graphics of ggplot() works with layers. You can add layers with different data (frames) for the different geoms you want to plot.
The reason why your code is not working is that you mix the layer call and or do not really specify (and even mix) what is the scatter and line visualisation you want.
(I) Use ggplot() + geom_point() for a scatter plot
The ultimate first layer is: ggplot(). Think of this as your drawing canvas.
You then speak about adding a scatter plot layer, but you actually do not do it.
For example:
# plotting antisocal data set
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race)))
will plot your Antiscoial data set using the scatter, i.e. geom_point() layer.
Note that I put Race as a factor to have a categorical colour scheme otherwise you might end up with a continous palette.
(II) line plot
In analogy to above, you would get for the line plot the following:
# plotting Table_1
ggplot() +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean))
I save showing the plot of the line.
(III) combining different layers
# putting both together
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race))) +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean)) +
## this is to set the legend title and have a nice(r) name in your colour legend
labs(colour = "Race")
This yields:
That should explain how ggplot-layering works. Keep an eye on the datasets and geoms that you want to use. Before working with inheritance in aes, I recommend to keep the data= and aes() call in the geom_xxxx. This avoids confustion.
You may want to explore with geom_jitter() instead of geom_point() to get a bit of a better presentation of your dataset. The "few" points plotted are the result of many datapoints in the same position (and overplotted).
Moving away from plotting to your question "I want to plot mean ASB vs YOI grouped by Race."
I know too little about your research to fully comprehend what you mean with that.
I take it that the mean ASB you calculated over the whole population is your reference (aka your Table_1), and you would like to see how the Race groups feature vs this population mean.
One option is to group your race data points and show them as boxplots for each YOI.
This might be what you want. The boxplot gives you the median and quartiles, and you can compare this per group against the calculated ASB mean.
For presentation purposes, I highlighted the line by increasing its size and linetype. You can play around with the colours, etc. to give you the aesthetics you aim for.
Please note, that for the grouped boxplot, you also have to treat your integer variable YOI, I coerced into a categorical factor. Boxplot works with fill for the body (colour sets only the outer line). In this setup, you also need to supply a group value to geom_line() (I just assigned it to 1, but that is arbitrary - in other contexts you can assign another variable here).
ggplot() +
geom_boxplot(data = Antisocial, aes(x = as.factor(YOI), y = ASB, fill = as.factor(Race))) +
geom_line(data = Table_1, aes(x = as.factor(YOI), y = ASB_mean, group = 1)
, size = 2, linetype = "dashed") +
labs(x = "YOI", fill = "Race")
Hope this gets you going!

Percentages in faceted histogram with scale_y_continuous()

I am trying to use scale_y_continuous() with a faceted histogram and running into an issue. I am hoping to get each count to be a percentage instead. My code is:
ggplot(d, aes(x = likely_att)) +
geom_histogram(binwidth = 0.5, color = "black") +
facet_wrap(~married, scales = "free_y") +
theme_classic() +
scale_y_continuous(labels = percent_format())
It looks like the distributions themselves are accurate, but the scaling is off: the percentages are "200 000%", "5 000%", etc. and that seems wrong, but I'm not quite sure why it's happening.
There are many more "yes" than "no" or "separated" married values in my dataset, which is why I use scales = "free_y" and why I'm hoping to just have percentages shown and only need one axis value shown.
I can't share this exact data for privacy reasons, but the likely_att variable is just a 1-5 numeric var, and married is a character var with 3 values: yes, no, separated.
In case it's helpful, I basically want it to look just like this image, but with percentages instead of counts, so I can just have one single y axis on the far left with 0 - 100 %

The problem is that using the percentage_format() function changes the way the labels are printed, but it doesn't actually rescale the numbers. To do that, you could use the density constructed variable and multiply it by the bin-width, then use the percent formatting.
ggplot(d, aes(x = likely_att)) +
stat_bin(aes(y=..density..*.5, group = married),
binwidth = 0.5, color = "black") +
facet_wrap(~married, scales = "free_y") +
theme_classic() +
scale_y_continuous(labels = percent_format())

How to set automatic label position based on box height

In a previous question, I asked about moving the label position of a barplot outside of the bar if the bar was too small. I was provided this following example:
library(ggplot2)
options(scipen=2)
dataset <- data.frame(Riserva_Riv_Fine_Periodo = 1:10 * 10^6 + 1,
Anno = 1:10)
ggplot(data = dataset,
aes(x = Anno,
y = Riserva_Riv_Fine_Periodo)) +
geom_bar(stat = "identity",
width=0.8,
position="dodge") +
geom_text(aes( y = Riserva_Riv_Fine_Periodo,
label = round(Riserva_Riv_Fine_Periodo, 0),
angle=90,
hjust= ifelse(Riserva_Riv_Fine_Periodo < 3000000, -0.1, 1.2)),
col="red",
size=4,
position = position_dodge(0.9))
And I obtain this graph:
The problem with the example is that the value at which the label is moved must be hard-coded into the plot, and an ifelse statement is used to reposition the label. Is there a way to automatically extract the value to cut?

A slightly better option might be to base the test and the positioning of the labels on the height of the bar relative to the height of the highest bar. That way, the cutoff value and label-shift are scaled to the actual vertical range of the plot. For example:
ydiff = max(dataset$Riserva_Riv_Fine_Periodo)
ggplot(dataset, aes(x = Anno, y = Riserva_Riv_Fine_Periodo)) +
geom_bar(stat = "identity", width=0.8) +
geom_text(aes(label = round(Riserva_Riv_Fine_Periodo, 0), angle=90,
y = ifelse(Riserva_Riv_Fine_Periodo < 0.3*ydiff,
Riserva_Riv_Fine_Periodo + 0.1*ydiff,
Riserva_Riv_Fine_Periodo - 0.1*ydiff)),
col="red", size=4)
You would still need to tweak the fractional cutoff in the test condition (I've used 0.3 in this case), depending on the physical size at which you render the plot. But you could package the code into a function to make the any manual adjustments a bit easier.
It's probably possible to automate this by determining the actual sizes of the various grobs that make up the plot and setting the condition and the positioning based on those sizes, but I'm not sure how to do that.
Just as an editorial comment, a plot with labels inside some bars and above others risks confusing the visual mapping of magnitudes to bar heights. I think it would be better to find a way to shrink, abbreviate, recode, or otherwise tweak the labels so that they contain the information you want to convey while being able to have all the labels inside the bars. Maybe something like this:
library(scales)
ggplot(dataset, aes(x = Anno, y = Riserva_Riv_Fine_Periodo/1000)) +
geom_col(width=0.8, fill="grey30") +
geom_text(aes(label = format(Riserva_Riv_Fine_Periodo/1000, big.mark=",", digits=0),
y = 0.5*Riserva_Riv_Fine_Periodo/1000),
col="white", size=3) +
scale_y_continuous(label=dollar, expand=c(0,1e2)) +
theme_classic() +
labs(y="Riserva (thousands)")
Or maybe go with a line plot instead of bars:
ggplot(dataset, aes(Anno, Riserva_Riv_Fine_Periodo/1e3)) +
geom_line(linetype="11", size=0.3, colour="grey50") +
geom_text(aes(label=format(Riserva_Riv_Fine_Periodo/1e3, big.mark=",", digits=0)),
size=3) +
theme_classic() +
scale_y_continuous(label=dollar, expand=c(0,1e2)) +
expand_limits(y=0) +
labs(y="Riserva (thousands)")

Overlay raw data onto geom_bar

I have a data-frame arranged as follows:
condition,treatment,value
A , one , 2
A , one , 1
A , two , 4
A , two , 2
...
D , two , 3
I have used ggplot2 to make a grouped bar plot that looks like this:
The bars are grouped by "condition" and the colours indicate "treatment." The bar heights are the mean of the values for each condition/treatment pair. I achieved this by creating a new data frame containing the mean and standard error (for the error bars) for all the points that will make up each group.
What I would like to do is superimpose the raw jittered data to produce a bar-chart version of this box plot: http://docs.ggplot2.org/0.9.3.1/geom_boxplot-6.png [I realise that a box plot would probably be better, but my hands are tied because the client is pathologically attached to bar charts]
I have tried adding a geom_point object to my plot and feeding it the raw data (rather than the aggregated means which were used to make the bars). This sort of works, but it plots the raw values at the wrong x axis locations. They appear at the points at which the red and grey bars join, rather than at the centres of the appropriate bar. So my plot looks like this:
I can not figure out how to shift the points by a fixed amount and then jitter them in order to get them centered over the correct bar. Anyone know? Is there, perhaps, a better way of achieving what I'm trying to do?
What follows is a minimal example that shows the problem I have:
#Make some fake data
ex=data.frame(cond=rep(c('a','b','c','d'),each=8),
treat=rep(rep(c('one','two'),4),each=4),
value=rnorm(32) + rep(c(3,1,4,2),each=4) )
#Calculate the mean and SD of each condition/treatment pair
agg=aggregate(value~cond*treat, data=ex, FUN="mean") #mean
agg$sd=aggregate(value~cond*treat, data=ex, FUN="sd")$value #add the SD
dodge <- position_dodge(width=0.9)
limits <- aes(ymax=value+sd, ymin=value-sd) #Set up the error bars
p <- ggplot(agg, aes(fill=treat, y=value, x=cond))
#Plot, attempting to overlay the raw data
print(
p + geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25) +
geom_point(data= ex[ex$treat=='one',], colour="green", size=3) +
geom_point(data= ex[ex$treat=='two',], colour="pink", size=3)
)

I found it is unnecessary to create separate dataframes. The plot can be created by providing ggplot with the raw data.
ex <- data.frame(cond=rep(c('a','b','c','d'),each=8),
treat=rep(rep(c('one','two'),4),each=4),
value=rnorm(32) + rep(c(3,1,4,2),each=4) )
p <- ggplot(ex, aes(cond,value,fill = treat))
p + geom_bar(position = 'dodge', stat = 'summary', fun.y = 'mean') +
geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
geom_point(aes(x = cond), shape = 21, position = position_dodge(width = 1))

You need just one call to geom_point() where you use data frame ex and set x values to cond, y values to value and color=treat (inside aes()). Then add position=dodge to ensure that points are dodgeg. With scale_color_manual() and argument values= you can set colors you need.
p+geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25)+
geom_point(data=ex,aes(cond,value,color=treat),position=dodge)+
scale_color_manual(values=c("green","pink"))
UPDATE - jittering of points
You can't directly use positions dodge and jitter together. But there are some workarounds. If you save whole plot as object then with ggplot_build() you can see x positions for bars - in this case they are 0.775, 1.225, 1.775... Those positions correspond to combinations of factors cond and treat. As in data frame ex there are 4 values for each combination, then add new column that contains those x positions repeated 4 times.
ex$xcord<-rep(c(0.775,1.225,1.775,2.225,2.775,3.225,3.775,4.225),each=4)
Now in geom_point() use this new column as x values and set position to jitter.
p+geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25)+
geom_point(data=ex,aes(xcord,value,color=treat),position=position_jitter(width =.15))+
scale_color_manual(values=c("green","pink"))

As illustrated by holmrenser above, referencing a single dataframe and updating the stat instruction to "summary" in the geom_bar function is more efficient than creating additional dataframes and retaining the stat instruction as "identity" in the code.
To both jitter and dodge the data points with the bar charts per the OP's original question, this can also be accomplished by updating the position instruction in the code with position_jitterdodge. This positioning scheme allows widths for jitter and dodge terms to be customized independently, as follows:
p <- ggplot(ex, aes(cond,value,fill = treat))
p + geom_bar(position = 'dodge', stat = 'summary', fun.y = 'mean') +
geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
geom_point(aes(x = cond), shape = 21, position =
position_jitterdodge(jitter.width = 0.5, jitter.height=0.4,
dodge.width=0.9))

What is the simplest method to fill the area under a geom_freqpoly line?

The x-axis is time broken up into time intervals. There is an interval column in the data frame that specifies the time for each row. The column is a factor, where each interval is a different factor level.
Plotting a histogram or line using geom_histogram and geom_freqpoly works great, but I'd like to have a line, like that provided by geom_freqpoly, with the area filled.
Currently I'm using geom_freqpoly like this:
ggplot(quake.data, aes(interval, fill=tweet.type)) + geom_freqpoly(aes(group = tweet.type, colour = tweet.type)) + opts(axis.text.x=theme_text(angle=-60, hjust=0, size = 6))
I would prefer to have a filled area, such as provided by geom_density, but without smoothing the line:
The geom_area has been suggested, is there any way to use a ggplot2-generated statistic, such as ..count.., for the geom_area's y-values? Or, does the count aggregation need to occur prior to using ggplot2?
As stated in the answer, geom_area(..., stat = "bin") is the solution:
ggplot(quake.data, aes(interval)) + geom_area(aes(y = ..count.., fill = tweet.type, group = tweet.type), stat = "bin") + opts(axis.text.x=theme_text(angle=-60, hjust=0, size = 6))
produces:

Perhaps you want:
geom_area(aes(y = ..count..), stat = "bin")

geom_ribbon can be used to produce a filled area between two lines without needing to explicitly construct a polygon. There is good documentation here.

ggplot(quake.data, aes(interval, fill=tweet.type, group = 1)) + geom_density()
But I don't think this is a meaningful graphic.

I'm not entirely sure what you're aiming for. Do you want a line or bars. You should check out geom_bar for filled bars. Something like:
p <- ggplot(data, aes(x = time, y = count))
p + geom_bar(stat = "identity")
If you want a line filled in underneath then you should look at geom_area which I haven't personally used but it appears the construct will be almost the same.
p <- ggplot(data, aes(x = time, y = count))
p + geom_area()
Hope that helps. Give some more info and we can probably be more helpful.
Actually i would throw on an index, just the row of the data and use that as x, and then use
p <- ggplot(data, aes(x = index, y = count))
p + geom_bar(stat = "identity") + scale_x_continuous("Intervals",
breaks = index, labels = intervals)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Grouping 2 categorical variables with geom_boxplot - r

Related

R code of scatter plot for three variables

Percentages in faceted histogram with scale_y_continuous()

How to set automatic label position based on box height

Overlay raw data onto geom_bar

What is the simplest method to fill the area under a geom_freqpoly line?

Categories

Resources