plot count of values by factor in ggplot - r

New to R, stuck googling this (probably easy) thing for too long.
I want to plot the proportion of males that fathered offspring, according to whether they have a nest or not. (I don't want the information of how many offspring they fathered). This is my dataset called "males"
fishID nest off
fish1 1 25
fish2 0 0
fish3 0 5
fish4 1 15
fish5 1 0
fish6 0 2
fish7 0 0
fish8 1 4
I've used the following code to change the values of offspring to 0 and 1 (though this feels clumsy already)...
#converts the values in offspring to 0 and 1s
vars=c("off")
males[males$off != "0", vars]="1"
males
...and I can plot proportions using...
ggplot(males,aes(x = males$nest,fill = males$off)) +
geom_bar(position = "fill")
...but I would like to colour them so that 0 (no nest) is one colour and 1 (nest) is another colour, then the proportion of males that didn't father offspring is a paler version of each colour. The above produces colours according to "offspring", irrespective of "nest".
Tips welcome.
(Mac OS X, R 3.0.3 GUI 1.63 Snow Leopard build (6660))

Is this what you're looking for?
library(ggplot2)
males$nest <- as.factor(males$nest)
males$off <- as.factor(males$off)
ggplot(males, aes(x = nest, fill = off)) +
geom_bar(width = 0.25) +
scale_fill_manual(values = c('green', 'darkgreen'))

Done it! Thank you. It was the fill by interaction I was missing.
require(ggplot2)
males$off <- factor(as.numeric(males$off != 0))
males$nest <- as.factor(males$nest)
ggplot(males, aes(x = nest, fill = interaction(nest, off))) +
geom_bar(width = 0.25) +
scale_fill_manual(values = c('deepskyblue3', 'tomato3', 'deepskyblue', 'tomato'))
(Eventually needed the same number of lines of code as days googling...)
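For reference, here is a minimal sketch that combines the interaction fill with position = "fill" from the question, so that bar heights are proportions rather than counts. It rebuilds the sample data shown in the question. The fathered column, the prop_fathered helper, and the colour choices are assumptions made for illustration and do not come from the original posts.

```r
library(ggplot2)

# Sample data copied from the question
males <- data.frame(
  fishID = paste0("fish", 1:8),
  nest   = c(1, 0, 0, 1, 1, 0, 0, 1),
  off    = c(25, 0, 5, 15, 0, 2, 0, 4)
)

# Recode offspring counts to "fathered any offspring" (1) or "none" (0)
males$fathered <- factor(as.integer(males$off > 0))
males$nest <- factor(males$nest)

# Numeric check: proportion of males that fathered offspring, per nest group
prop_fathered <- tapply(males$fathered == "1", males$nest, mean)
print(prop_fathered)

# interaction() levels are named "<nest>.<fathered>"; a named colour vector
# makes the pale (no offspring) / dark (offspring) pairing explicit
p <- ggplot(males, aes(x = nest, fill = interaction(nest, fathered))) +
  geom_bar(position = "fill", width = 0.25) +
  scale_fill_manual(
    name = "nest.fathered",
    values = c("0.0" = "lightblue",   "0.1" = "deepskyblue3",
               "1.0" = "lightsalmon", "1.1" = "tomato3")
  ) +
  ylab("Proportion of males")
print(p)
```

Naming the colours by level avoids having to remember the order in which interaction() arranges its levels.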

Related

boxplot in R is showing a vertical straight line

I have a data frame with multiple columns. I want to create two boxplots, one for each of the professions "secretary" and "driver", but the result is not satisfying, as the picture of the boxplot shows. This is my data and code:
profession ve.count.descrition euse.count.description Qualitative.result
secretary 0 1 -0.5
secretary 0 2 1
driver 1 1 -1
driver 0 2 0.3
data %>%
mutate(Qualitative.result = factor(Qualitative.result)) %>%
ggplot(aes(x = Profession , fill = Qualitative.result)) +
geom_boxplot()
You should not convert Qualitative.result to a factor. Maybe you want something like this:
library(tidyverse)
data %>%
ggplot(aes(x = Profession, y = Qualitative.result, fill = Profession)) +
geom_boxplot()
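A box plot needs a numeric variable on one axis to summarise, which is why the value column should stay numeric and be mapped to y. The sketch below rebuilds the four sample rows from the question to make this reproducible. It assumes the column is spelled Profession, as in the plotting code (the printed table shows it in lower case), and the med helper is only there for illustration.

```r
library(ggplot2)

# The four sample rows shown in the question; only the two columns used here.
# Assumption: the column is spelled "Profession", as in the plotting code.
data <- data.frame(
  Profession = c("secretary", "secretary", "driver", "driver"),
  Qualitative.result = c(-0.5, 1, -1, 0.3)
)

# The y variable must be numeric for geom_boxplot() to compute quartiles
stopifnot(is.numeric(data$Qualitative.result))

# Medians per profession: the centre line each box should show
med <- tapply(data$Qualitative.result, data$Profession, median)
print(med)

p <- ggplot(data, aes(x = Profession, y = Qualitative.result, fill = Profession)) +
  geom_boxplot()
print(p)
```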

Error in ggplot2 when using both fill and group parameters in geom_bar

There seems to be a problem with R's ggplot2 library when I include both the fill and group parameters in a bar plot (geom_bar()). I've already tried looking for answers for several hours but couldn't find one that would help. This is actually my first post here.
To give a little background, I have a dataframe named smokement (short for smoke and mental health), a categorical variable named smoke100 (smoked in the past 100 days?) with "Yes" and "No", and another categorical variable named misnervs (frequency of feelings of nervousness) with 5 possible values: "All", "Most", "Some", "A little", and "None."
When I run this code, I get this result:
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, fill = smoke100)) +
facet_wrap(~misnervs, nrow = 1)
However, the result I want is for each group of bars to display its respective proportions. From reading a bit of the "R for Data Science" book, I found out that I need to include y = ..prop.. and group = 1 in aes() to achieve it:
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., group = 1)) +
facet_wrap(~misnervs, nrow = 1)
Finally, I try to use the fill = smoke100 parameter in aes() to display this categorical variable in color, just like I did on the first code. But when I add this fill parameter, it doesn't work! The code runs, but it shows exactly the same output as the second code, as if the fill parameter this time was somehow ignored!
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., group = 1, fill = smoke100)) +
facet_wrap(~misnervs, nrow = 1)
Does anyone have an idea of why this happens, and how to solve it? My end goal is to display each value of smoke100 (the "Yes" and "No" bars) with colors and a legend at the right, just like on the first graph, while having each grouping level of "misnervs" display their respective proportions of smoke100 ("Yes", "No") levels, just like on the second graph.
EDIT:
> dim(smokement)
[1] 35471 6
> str(smokement)
'data.frame': 35471 obs. of 6 variables:
$ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 1 1 1 1 ...
$ misnervs: Factor w/ 5 levels "All","Most","Some",..: 3 4 5 4 1 5 3 3 5 5 ...
$ mishopls: Factor w/ 5 levels "All","Most","Some",..: 3 5 5 5 5 5 5 5 5 5 ...
$ misrstls: Factor w/ 5 levels "All","Most","Some",..: 3 5 5 3 1 5 3 5 1 5 ...
$ misdeprd: Factor w/ 5 levels "All","Most","Some",..: 5 5 5 5 4 5 5 5 5 5 ...
$ miswtles: Factor w/ 5 levels "All","Most","Some",..: 5 5 5 5 5 5 5 5 5 5 ...
> head(smokement)
smoke100 misnervs mishopls misrstls misdeprd miswtles
1 Yes Some Some Some None None
2 No A little None None None None
3 Yes None None None None None
4 No A little None Some None None
5 Yes All None All A little None
6 Yes None None None None None
As for the output without group = 1:
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., fill = smoke100)) +
facet_wrap(~misnervs, nrow = 1)
Besides the solution offered here, the GGally package includes a stat_prop, which introduces a new by aesthetic to specify how the proportions should be calculated:
library(GGally)
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., fill = smoke100, by = misnervs), stat = "prop") +
facet_wrap(~misnervs, nrow = 1)
And just for reference, the same could be achieved without GGally by setting fill=factor(..x..):
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., fill = factor(..x..), group = 1)) +
facet_wrap(~misnervs, nrow = 1)
DATA
misnervs <- c("All", "Most", "Some", "A little", "None")
set.seed(123)
smokement <-
data.frame(
smoke100 = sample(c("Yes", "No"), 100, replace = TRUE),
misnervs = factor(sample(misnervs, 100, replace = TRUE), levels = misnervs)
)
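As far as I understand it, fill = smoke100 is ignored with group = 1 because the stat computes one result per group, and an aesthetic that varies within a single group cannot be carried through. Mapping fill to the computed x position works around that. Below is a hedged sketch of the same idea using after_stat(), which newer ggplot2 versions (3.3.0 and later) offer in place of the ..prop.. notation. The scale_fill_discrete() call that relabels the legend is my own addition, and it assumes smoke100 is a factor with levels in the order "Yes", "No".

```r
library(ggplot2)

# Same simulated data as in the DATA block, with smoke100 stored as a factor
misnervs <- c("All", "Most", "Some", "A little", "None")
set.seed(123)
smokement <- data.frame(
  smoke100 = factor(sample(c("Yes", "No"), 100, replace = TRUE),
                    levels = c("Yes", "No")),
  misnervs = factor(sample(misnervs, 100, replace = TRUE), levels = misnervs)
)

p <- ggplot(data = smokement) +
  geom_bar(aes(x = smoke100,
               y = after_stat(prop),           # proportion within each facet
               fill = after_stat(factor(x)),   # colour by computed bar position
               group = 1)) +
  # The computed positions are 1 and 2, so relabel them with the factor levels
  scale_fill_discrete(name = "smoke100", labels = levels(smokement$smoke100)) +
  facet_wrap(~misnervs, nrow = 1)
print(p)
```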
I wasn't able to get what you wanted by tweaking your call to geom_bar*, but I think this gives you what you are looking for. As you didn't provide your input dataset (for understandable reasons), I've used the diamonds tibble in my code. The changes you need to make should be obvious.
*: I'm sure it can be done: I just wasn't able to work it out.
The idea behind my solution is to pre-compute the proportions you want to plot before the call to ggplot.
group_modify takes a grouped tibble and applies the specified function to each group in turn, before returning the modified (grouped) tibble.
diamonds %>%
group_by(cut) %>%
group_modify(
function(.x, .y)
.x %>%
group_by(color) %>%
summarise(Prop=n()/nrow(.))
) %>%
ggplot() +
geom_col(aes(x=color, y=Prop, fill=color)) +
facet_wrap(~cut)
Note the switch from geom_bar to geom_col: geom_bar uses row counts, geom_col uses values in the data.
As a rough-and-ready QC, here's the equivalent of your code that produces the "all grey" plot:
diamonds %>%
ggplot() +
geom_bar(aes(x=color, y=..prop.., fill=color, group=1)) +
facet_wrap(~cut)
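The same pre-computation can also be written with count() followed by a grouped mutate(), which some may find easier to read than group_modify. This is only an alternative sketch of the idea above, again on the diamonds data. The props name is my own.

```r
library(dplyr)
library(ggplot2)

props <- diamonds %>%
  count(cut, color) %>%          # one row per cut/color pair, count in column n
  group_by(cut) %>%
  mutate(Prop = n / sum(n)) %>%  # share of each color within its cut
  ungroup()

p <- ggplot(props) +
  geom_col(aes(x = color, y = Prop, fill = color)) +
  facet_wrap(~cut)
print(p)
```

Within each facet the bars should sum to 1, which is a quick way to check that the grouping is the intended one.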

no. of geom_point matches the value

I have an existing ggplot with geom_col and some observations from a dataframe. The dataframe looks something like :
over runs wickets
1 12 0
2 8 0
3 9 2
4 3 1
5 6 0
The geom_col represents the runs data column and now I want to represent the wickets column using geom_point in a way that the number of points represents the wickets.
I want my graph to look something like this :
As far as I know, we'll need to transform your data to have one row per point. This method requires dplyr version 1.0.0 or later, which allows summarize to expand the number of rows.
You can adjust the spacing of the wickets by multiplying seq(wickets), though with your sample data a spacing of 1 unit looks pretty good to me.
library(dplyr)
library(ggplot2)
wicket_data = dd %>%
filter(wickets > 0) %>%
group_by(over) %>%
summarize(wicket_y = runs + seq(wickets))
ggplot(dd, aes(x = over)) +
geom_col(aes(y = runs), fill = "#A6C6FF") +
geom_point(data = wicket_data, aes(y = wicket_y), color = "firebrick4") +
theme_bw()
Using this sample data:
dd = read.table(text = "over runs wickets
1 12 0
2 8 0
3 9 2
4 3 1
5 6 0", header = T)
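Since dplyr 1.1.0, returning several rows per group from summarize is deprecated in favour of reframe(), so here is a sketch of the same "one row per wicket" expansion in base R that does not depend on the dplyr version. The wk and idx helper names are my own. The result should match wicket_data from the answer above.

```r
library(ggplot2)

dd <- read.table(text = "over runs wickets
1 12 0
2 8 0
3 9 2
4 3 1
5 6 0", header = TRUE)

# Keep only overs with at least one wicket
wk <- dd[dd$wickets > 0, ]

# Repeat each remaining row once per wicket
idx <- rep(seq_len(nrow(wk)), times = wk$wickets)

# Stack the points 1, 2, ... units above the top of each bar
wicket_data <- data.frame(
  over     = wk$over[idx],
  wicket_y = wk$runs[idx] + sequence(wk$wickets)
)
print(wicket_data)

p <- ggplot(dd, aes(x = over)) +
  geom_col(aes(y = runs), fill = "#A6C6FF") +
  geom_point(data = wicket_data, aes(y = wicket_y), color = "firebrick4") +
  theme_bw()
print(p)
```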

R + ggplot. Draw children data in same plot as parent data

Using the Titanic dataset, I am drawing histograms of age and sex by passenger class.
str(titanic) gives the following data
> 'data.frame': 714 obs. of 4 variables:
$ Survived: int 0 1 1 1 0 0 0 1 1 1 ...
$ Pclass : int 3 1 3 1 3 1 3 3 2 3 ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 54 2 27 14 4 ...
First, I made a plot of proportion of male/female against the travel classes.
It has been done by
ggplot(data = titanic, aes(x = factor(Age), fill = factor(Sex))) +
geom_bar(position = "dodge", aes(y = (..count..)/sum(..count..))) +
facet_grid(. ~ Pclass) + scale_x_discrete(breaks=c(20,40,60)) +
ylab("Frequency") + xlab("Age") +
scale_fill_discrete(name = "Sex")
Now I want to use the same graph but add more information: the proportion of survivors in each category.
For example, what proportion of women aged 20-30 who traveled first class survived?
I would like to see it in the same bars, i.e. with each column split into two parts (survived/not survived).
Can I do it with ggplot? And if yes, how?
Using the built-in Titanic data set, I can show you roughly what @Axeman suggested in the comments. Note that it only has two categories for age (Child/Adult), so you would need to decide how to bin for your data.
ggplot(as.data.frame(Titanic)
, aes(y = Freq
, x = Age
, fill = Survived)) +
geom_col() +
facet_grid(Sex ~ Class)
Importantly, I am not sure that you are gaining anything by showing the frequencies in the way you currently are, as they do not appear to be showing anything meaningfully different than the counts would. If, instead, you wanted to show the proportion within each group that survived, you may be better off calculating those percentages first, then passing them to ggplot. Here is an example of that using dplyr. Again, your age bins can be whatever you want, but note that the narrower the bins, the more noisy the data will be.
library(dplyr)
as.data.frame(Titanic) %>%
group_by(Class, Sex, Age) %>%
mutate(Proportion = Freq/ sum(Freq)) %>%
ggplot(aes(y = Proportion
, x = Age
, fill = Survived)) +
geom_col() +
facet_grid(Sex ~ Class)
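For passenger-level data like that in the question, one way to do the binning is cut() on Age, followed by geom_bar(position = "fill") so that each bar is split into survived and not survived as proportions. The sketch below is only an illustration. The ten rows are made up in the shape of the str() output (they are not the real data), and the break points and the AgeBin name are arbitrary assumptions.

```r
library(ggplot2)

# Hypothetical rows shaped like the str() output in the question (not real data)
titanic <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 0, 1, 1, 1),
  Pclass   = c(3, 1, 3, 1, 3, 1, 3, 3, 2, 3),
  Sex      = c("male", "female", "female", "female", "male",
               "male", "male", "female", "female", "female"),
  Age      = c(22, 38, 26, 35, 35, 54, 2, 27, 14, 4)
)

# Bin ages; the break points are an arbitrary choice for illustration
titanic$AgeBin <- cut(titanic$Age, breaks = c(0, 18, 40, 80))
print(table(titanic$AgeBin))

# position = "fill" rescales every bar to 1, so the split shows proportions
p <- ggplot(titanic, aes(x = AgeBin, fill = factor(Survived))) +
  geom_bar(position = "fill") +
  facet_grid(Sex ~ Pclass) +
  ylab("Proportion") + xlab("Age") +
  scale_fill_discrete(name = "Survived")
print(p)
```

As noted above, narrower bins leave fewer passengers per bar, so the proportions become noisier.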

Legends not showing up properly in heatmap with ggplot2

I am trying to make a heatmap of normalized read abundance values with geom_tile in ggplot2 based on the example code here. My current code produces a heatmap for the desired ranges, but for some reason only 4 out of the 7 ranges are shown in heatmap and I cannot figure out what is the issue. When I followed the example in the original link it worked fine, so I must have changed something incorrectly in my code. Can anyone please help me to identify the error in my code that is causing this?
I want to have the following color scheme:
-Inf < value <= 0 -> white
0 < value <= 1 -> yellow
1 < value <= 10 -> orange
10 < value <= 100 -> darkorange2
100 < value <= 1000 -> red
1000 < value <= 10000 -> red3
10000 < value <= 32000 -> red4
Here is my code:
#re-order the labels in the order of appearance in the data frame
df$label <- factor(df$X1, as.character(df$X1))
# make the cuts
df$value1 <- cut(df$value, breaks = c(-Inf, 0, 1, 10, 100, 1000, 10000, 32000), right = TRUE)
ggplot(data = df, aes(x = label, y = X2)) + geom_tile(aes(fill=value1), colour= "black") + scale_fill_manual(breaks=c("(-Inf,0]", "(0,1]", "(1,10]", "(10,100]", "(100,1000]", "(1000,10000]", "(10000,32000]"),values =c("white","yellow","orange","darkorange2","red","red3","red4"))
here is a preview of my data (actual data has 228 rows featuring reads per million values for 38 IDs in 6 different experiments):
head(df)
X1 X2 value label value1
1 merged_read_17785-997_aka_156_aka_21 RPM.MT1 91.783028 merged_read_17785-997_aka_156_aka_21 (10,100]
2 merged_read_133362-79_aka_156_aka_21 RPM.MT1 6.403467 merged_read_133362-79_aka_156_aka_21 (1,10]
3 merged_read_147828-69_aka_156_aka_20 RPM.MT1 4.268978 merged_read_147828-69_aka_156_aka_20 (1,10]
4 merged_read_162443-60_aka_156_aka_21 RPM.MT1 0.000000 merged_read_162443-60_aka_156_aka_21 (-Inf,0]
5 merged_read_262156-32_aka_156_aka_21 RPM.MT1 5.691971 merged_read_262156-32_aka_156_aka_21 (1,10]
6 merged_read_22905-759_aka_159_aka_21 RPM.MT1 140.164780 merged_read_22905-759_aka_159_aka_21 (100,1e+03]
And here is the plot that I get from the above data:
I think I figured this out: if I take out the breaks argument from scale_fill_manual, all the legend entries are shown:
ggplot(data = df, aes(x = label, y = X2)) + geom_tile(aes(fill=value1), colour= "black") + scale_fill_manual(values =c("white","yellow","orange","darkorange2","red","red3","red4"))
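A likely explanation for why only 4 of the 7 ranges appeared: the data preview shows the level "(100,1e+03]", so cut() wrote the larger break points in scientific notation, and those labels no longer match strings such as "(100,1000]" passed to breaks in scale_fill_manual. The sketch below, in base R only, shows the default labels next to the labels produced with the dig.lab argument of cut(). The x values are made up purely to exercise each interval.

```r
breaks <- c(-Inf, 0, 1, 10, 100, 1000, 10000, 32000)

# Illustrative values only, one per interval
x <- c(0, 0.5, 4.27, 91.78, 140.16, 5000, 20000)

# Default: break points are formatted with 3 significant digits
default_labels <- levels(cut(x, breaks = breaks, right = TRUE))
print(default_labels)

# dig.lab = 5 keeps 1000, 10000 and 32000 in plain notation
plain_labels <- levels(cut(x, breaks = breaks, right = TRUE, dig.lab = 5))
print(plain_labels)
```

With dig.lab set this way, the level names should line up with the strings in the original breaks argument, so it could be kept if the legend order needs to be controlled explicitly.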
