R: how to filter within aes() - r

As an R-beginner, there's one hurdle that I just can't find the answer to. I have a table where I can see the amount of responses to a question according to gender.
Response
Gender
n
1
1
84
1
2
79
2
1
42
2
2
74
3
1
84
3
2
79
etc.
I want to plot these in a column chart: on the y I want the n (or its proportions), and on the x I want to have two seperate bars: one for gender 1, and one for gender 2. It should look like the following example that I was given:
The example that I want to emulate
However, when I try to filter the columns according to gender inside aes(), it returns an error! Could anyone tell me why my approach is not working? And is there another practical way to filter the columns of the table that I have?
ggplot(table) +
geom_col(aes(x = select(filter(table, gender == 1), Q),
y = select(filter(table, gender == 1), n),
fill = select(filter(table, gender == 2), n), position = "dodge")

Maybe something like this:
library(RColorBrewer)
library(ggplot2)
df %>%
ggplot(aes(x=factor(Response), y=n, fill=factor(Gender)))+
geom_col(position=position_dodge())+
scale_fill_brewer(palette = "Set1")
theme_light()

Your answer does not work, because you are assigning the x and y variables as if it was two different datasets (one for x and one for y). In line with the solution from TarJae, you need to think of it as the axis in a diagram - so you need for your x axis to assign the categorical variables you are comparing, and you want for the y axis to assign the numerical variables which determines the height of the bars. Finally, you want to compare them by colors, so each group will have a different color - that is where you include your grouping variable (here, I use fill).
library(dplyr) ## For piping
library(ggplot2) ## For plotting
df %>%
ggplot(aes(x = Response, y = n, fill = as.character(Gender))) +
geom_bar(stat = "Identity", position = "Dodge")
I am adding "Identity" because the default in geom_bar is to count the occurences in you data (i.e., if you data was not aggregated). I am adding "Dodge" to avoid the bars to be stacked. I will recommend you, to look at this resource for more information: https://r4ds.had.co.nz/index.html

Related

How to change bar position without affecting the results?

I'm new to R and im trying to change the position of a bar on a bar chart but my results have changed too. Here is the chart : Chart of age
when I use the code :
positions <- c("Moins de 18 ans","18 a 22 ans", "23 a 27 ans", "33 a 37 ans","38 ans et plus")
p + theme_classic() + scale_x_discrete(limits = positions)
This is the results I have:
Chart of age 2
and the message :
Warning messages:
1: Removed 86 rows containing non-finite values (stat_count).
2: Removed 86 rows containing non-finite values (stat_count).
I don't know what to do with this. Someone help me please!
Since you haven't provided the data, I can show how to rearrange bars using dummy data. To sort bar graph, essentially you need to sort data on variables you are using as x-axis in plot.
vec = c(rep("a", 30), rep("b", 20), rep("c", 10))
df = as.data.frame(table(vec)) # Create dummy data frame
Dataframe df looks like this -
vec Freq
1 a 30
2 b 20
3 c 10
The plot will be -
df %>%
ggplot(aes(x = vec, y = Freq)) +
geom_bar(stat = "identity") # default plot
Now, I want bars in order b,a,c. All I need to do is sort my dataframe in the same order -
df$vec = factor(df$vec, levels = c("b", "a", "c")) # assign levels in order you want to see the bar-plot
df = df[order(df$vec),] # sort dataframe on your x-variable
df %>%
ggplot(aes(x = vec, y = Freq)) +
geom_bar(stat = "identity") # barplot will be sorted on levels of factor now
The output of above code is -
I haven't done rest of the formatting, but from your graphs, you are good with that. By following these steps, your data shouldn't change when reordering the bars. If you can share your data, I can better understand if it solves your problem.

ggplot geom_bar with stat = "sum"

I want to plot a bar chart summing a variable along two dimensions, one will be spread along x, and the other will be spread vertically (stacked).
I would expect the two following instructions to do the same, but they don't and only the 2nd one gives the desired output (where I aggregate the data myself).
I'd like to understand what's going on in the first case, and if there's a way to use ggplot2 's built-in aggregation features to get the right output.
library(ggplot2)
library(dplyr)
p1 <- ggplot(diamonds,aes(cut,price,fill=color)) +
geom_bar(stat="sum",na.rm=TRUE)
yielding this plot:
p2 <- ggplot(diamonds %>%
group_by(cut,color) %>%
summarize_at("price",sum,na.rm=T),
aes(cut,price,fill=color)) +
geom_bar(stat="identity",na.rm=TRUE)
yielding this picture:
Here's where the top of our bars should be, p1 doesn't give these values:
diamonds %>% group_by(cut) %>% summarize_at("price",sum,na.rm=TRUE)
# # A tibble: 5 x 2
# cut price
# <ord> <int>
# 1 Fair 7017600
# 2 Good 19275009
# 3 Very Good 48107623
# 4 Premium 63221498
# 5 Ideal 74513487
You might be misunderstanding the stat option for geom_bar. In this case, since you want the values for each factor to be summed up within each bar, and the bars to be colored based off how much of that total sum is in each color, you can simplify the call to geom_col which uses the values as heights for the bar; and therefore "sums" all the values within each category. For example, the following will give the desired output:
p1 <- ggplot(diamonds,aes(cut,price,fill=color)) +
geom_col(na.rm=TRUE)
Alternatively, if you want to use geom_bar with a stat call, then you want to use the "identity" stat:
p1 <- ggplot(diamonds,aes(cut,price,fill=color)) +
geom_bar(stat = "identity", na.rm=TRUE)
For more information, consider this thread: https://stackoverflow.com/a/27965637/6722506

ggplot geom_bar: meaning of aes(group = 1)

I am learning geom_bar on section 3.7 of r4ds.had.co.nz. I run a code like this:
library(ggplot2)
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
Then I have this plot:
The point is, if I exclude the "group = 1" part:
library(ggplot2)
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop..))
The plot will be wrong,
But if I replace group = 1 by group = 2 or group = "x", the plot still looks correct. So I don't quite understand the meaning of group = 1 here and how to use it.
group="whatever" is a "dummy" grouping to override the default behavior, which (here) is to group by cut and in general is to group by the x variable. The default for geom_bar is to group by the x variable in order to separately count the number of rows in each level of the x variable. For example, here, the default would be for geom_bar to return the number of rows with cut equal to "Fair", "Good", etc.
However, if we want proportions, then we need to consider all levels of cut together. In the second plot, the data are first grouped by cut, so each level of cut is considered separately. The proportion of Fair in Fair is 100%, as is the proportion of Good in Good, etc. group=1 (or group="x", etc.) prevents this, so that the proportions of each level of cut will be relative to all levels of cut.
Group will help the plot to look at the specific rows that contain the specific cut and the proportion is found with respect to the whole database as in proportion of an ideal cut in the whole dataset.
If group is not used, the proportion is calculated with respect to the data that contains that field and is ultimately going to be 100% in any case. For instance, The proportion of an ideal cut in the ideal cut specific data will be 1.

How to barplot select rows of data from a dataframe in R?

This is my first time submitting a question, so apologies in advance if my formatting is not optimal.
I have a dataframe with roughly 6,000 rows of data in 2 columns, and I want to be able to pull out individual rows (and multiple rows together) to barplot.
I read my file in as a dataframe, here is a very small subset:
gene log2
1 SMa0002 0.457418
2 SMa0005 1.116950
3 SMa0007 0.686749
4 SMa0009 0.169450
5 SMa0011 0.393365
6 SMa0013 0.601940
So what I would want to be able to do is have a barplot where the x axis is a number of genes (SMaXXX, SMaXXX, SMaXXX, etc.), and the y-axis is the log2 column. It only has (+) values displayed, but there are (-) values as well. I have no real preference about whether I use barplot or geom_bar in ggplot2, or another plotter.
I know how to just plot the dataframe;
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity")
I've tried playing around with using 'match' but I haven't been able to figure out how to make that work. Ideally the code is versatile so I can just punch in different SMaXXXX codes to generate many different plots.
Thanks for reading!
It seems that you just need a way to subset your data.frame when plotting, right?
Let's assume you've got a vector subset.genes of the genes you need to plot:
df=data.frame(gene=c("SMa0002","SMa0005","SMa0006","SMa0007","SMa0011","SMa0013"),
"log2"=runif(6), stringsAsFactors=F)
subset.genes=sample(unique(df$gene), 4, replace=F)
A couples of ways:
1°) Inside ggplot2
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity") +
scale_x_discrete(limits=subset.genes)
2°) before:
df2 <- subset(df, gene %in% subset.genes)
ggplot(df2, aes(x = gene, y = log2)) + geom_bar(stat = "identity")

R Setting Y Axis to Count Distinct in ggplot2

I have a data frame that contains 4 variables: an ID number (chr), a degree type (factor w/ 2 levels of Grad and Undergrad), a degree year (chr with year), and Employment Record Type (factor w/ 6 levels).
I would like to display this data as a count of the unique ID numbers by year as a stacked area plot of the 6 Employment Record Types. So, count of # of ID numbers on the y-axis, degree year on the x-axis, the value of x being number of IDs for that year, and the fill will handle the Record Type. I am using ggplot2 in RStudio.
I used the following code, but the y axis does not count distinct IDs:
ggplot(AlumJobStatusCopy, aes(x=Degree.Year, y=Entity.ID,
fill=Employment.Data.Type)) + geom_freqpoly() +
scale_fill_brewer(palette="Blues",
breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))
I also tried setting y = Entity.ID to y = ..count.. and that did not work either. I have searched for solutions as it seems to be a problem with how I am writing the aes code.
I also tried the following code based on examples of similar plots:
ggplot(AlumJobStatusCopy, aes(interval)) +
geom_area(aes(x=Degree.Year, y = Entity.ID,
fill = Employment.Data.Type)) +
scale_fill_brewer(palette="Blues",
breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))
This does not even seem to work. I've read the documentation and am at my wit's end.
EDIT:
After figuring out the answer to the problem, I realized that I was not actually using the correct values for my Year variable. A count tells me nothing as I am trying to display the rise in a lack of records and the decline in current records.
My Dataset:
Year, int, 1960-2015
Current Record, num: % of total records that are current
No Record, num: % of total records that are not current
Ergo each Year value has two corresponding percent values. I am now using 2 lines instead of an area plot since the Y axis has distinct values instead of a count function, but I would still like the area under the curves filled. I tried using Melt to convert the data from wide to long, but was still unable to fill both lines. Filling is just for aesthetic purposes as I would like to use a gradient for each with 1 fill being slightly lighter than the other.
Here is my current code:
ggplot(Alum, aes(Year)) +
geom_line(aes(y = Percent.Records, colour = "Percent.Records")) +
geom_line(aes(y = Percent.No.Records, colour = "Percent.No.Records")) +
scale_y_continuous(labels = percent) + ylab('Percent of Total Records') +
ggtitle("Active, Living Alumni Employment Record") +
scale_x_continuous(breaks=seq(1960, 2014, by=5))
I cannot post an image yet.
I think you're missing a step where you summarize the data to get the quantities to plot on the y-axis. Here's an example with some toy data similar to how you describe yours:
# Make toy data with three levels of employment type
set.seed(1)
df <- data.frame(Entity.ID = rep(LETTERS[1:10], 3), Degree.Year = rep(seq(1990, 1992), each=10),
Degree.Type = sample(c("grad", "undergrad"), 30, replace=TRUE),
Employment.Data.Type = sample(as.character(1:3), 30, replace=TRUE))
# Here's the part you're missing, where you summarize for plotting
library(dplyr)
dfsum <- df %>%
group_by(Degree.Year, Employment.Data.Type) %>%
tally()
# Now plot that, using the sums as your y values
library(ggplot2)
ggplot(dfsum, aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
geom_bar(stat="identity") + labs(fill="Employment")
The result could use some fine-tuning, but I think it's what you mean. Here, the bars are equal height because each year in the toy data include an equal numbers of IDs; if the count of IDs varied, so would the total bar height.
If you don't want to add objects to your workspace, just do the summing in the call to ggplot():
ggplot(tally(group_by(df, Degree.Year, Employment.Data.Type)),
aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
geom_bar(stat="identity") + labs(fill="Employment")

Resources