R Setting Y Axis to Count Distinct in ggplot2 - r

I have a data frame that contains 4 variables: an ID number (chr), a degree type (factor w/ 2 levels of Grad and Undergrad), a degree year (chr with year), and Employment Record Type (factor w/ 6 levels).
I would like to display this data as a count of the unique ID numbers by year as a stacked area plot of the 6 Employment Record Types. So, count of # of ID numbers on the y-axis, degree year on the x-axis, the value of x being number of IDs for that year, and the fill will handle the Record Type. I am using ggplot2 in RStudio.
I used the following code, but the y axis does not count distinct IDs:
ggplot(AlumJobStatusCopy, aes(x=Degree.Year, y=Entity.ID,
fill=Employment.Data.Type)) + geom_freqpoly() +
scale_fill_brewer(palette="Blues",
breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))
I also tried changing y = Entity.ID to y = ..count.., but that did not work either. I have searched for solutions, and the problem seems to be in how I am writing the aes() code.
I also tried the following code based on examples of similar plots:
ggplot(AlumJobStatusCopy, aes(interval)) +
geom_area(aes(x=Degree.Year, y = Entity.ID,
fill = Employment.Data.Type)) +
scale_fill_brewer(palette="Blues",
breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))
This does not even seem to work. I've read the documentation and am at my wit's end.
EDIT:
After figuring out the answer to the problem, I realized that I was not actually using the correct values for my Year variable. A count tells me nothing as I am trying to display the rise in a lack of records and the decline in current records.
My Dataset:
Year, int, 1960-2015
Current Record, num: % of total records that are current
No Record, num: % of total records that are not current
Ergo each Year value has two corresponding percent values. I am now using 2 lines instead of an area plot since the Y axis has distinct values instead of a count function, but I would still like the area under the curves filled. I tried using Melt to convert the data from wide to long, but was still unable to fill both lines. Filling is just for aesthetic purposes as I would like to use a gradient for each with 1 fill being slightly lighter than the other.
Here is my current code:
ggplot(Alum, aes(Year)) +
geom_line(aes(y = Percent.Records, colour = "Percent.Records")) +
geom_line(aes(y = Percent.No.Records, colour = "Percent.No.Records")) +
scale_y_continuous(labels = percent) + ylab('Percent of Total Records') +
ggtitle("Active, Living Alumni Employment Record") +
scale_x_continuous(breaks=seq(1960, 2014, by=5))
I cannot post an image yet.

I think you're missing a step where you summarize the data to get the quantities to plot on the y-axis. Here's an example with some toy data similar to how you describe yours:
# Make toy data with three levels of employment type
set.seed(1)
df <- data.frame(Entity.ID = rep(LETTERS[1:10], 3), Degree.Year = rep(seq(1990, 1992), each=10),
Degree.Type = sample(c("grad", "undergrad"), 30, replace=TRUE),
Employment.Data.Type = sample(as.character(1:3), 30, replace=TRUE))
# Here's the part you're missing, where you summarize for plotting
library(dplyr)
dfsum <- df %>%
group_by(Degree.Year, Employment.Data.Type) %>%
tally()
# Now plot that, using the sums as your y values
library(ggplot2)
ggplot(dfsum, aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
geom_bar(stat="identity") + labs(fill="Employment")
The result could use some fine-tuning, but I think it's what you mean. Here, the bars are of equal height because each year in the toy data includes an equal number of IDs; if the count of IDs varied, so would the total bar height.
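Note that tally() counts rows, so an ID that appears more than once within a year and employment type is counted each time. If you need a count of distinct IDs, as in the question title, one possible variation is dplyr's n_distinct(). This is only a sketch on made-up data: the toy data frame and the names toy, toy_sum, n_ids and p_area are my own, not from the question.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: ID "A" appears twice in 1990 under the same type
toy <- data.frame(
  Entity.ID            = c("A", "A", "B", "C", "A", "B", "C", "C"),
  Degree.Year          = c(1990, 1990, 1990, 1990, 1991, 1991, 1991, 1991),
  Employment.Data.Type = c("1", "1", "1", "2", "1", "2", "2", "2")
)

# Count distinct IDs (not rows) per year and employment type
toy_sum <- toy %>%
  group_by(Degree.Year, Employment.Data.Type) %>%
  summarise(n_ids = n_distinct(Entity.ID), .groups = "drop")

# Stacked area plot of the distinct counts
p_area <- ggplot(toy_sum,
                 aes(x = Degree.Year, y = n_ids, fill = Employment.Data.Type)) +
  geom_area() +
  scale_x_continuous(breaks = unique(toy_sum$Degree.Year)) +
  labs(y = "Distinct IDs", fill = "Employment")
```

Print p_area to draw the plot. A stacked area needs a numeric (or date) x axis, so if Degree.Year is stored as character it would have to be converted first, for example with as.numeric().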
If you don't want to add objects to your workspace, just do the summing in the call to ggplot():
ggplot(tally(group_by(df, Degree.Year, Employment.Data.Type)),
aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
geom_bar(stat="identity") + labs(fill="Employment")
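For the follow-up in the question's edit (two percentage series per year, with the area under each line filled), one option is to put the data in long format and draw both series with geom_area(position = "identity") so they overlap instead of stacking. This is a rough sketch with invented numbers; alum_toy, alum_long, series and value are names I made up, and the fills are flat semi-transparent colours, not a gradient.

```r
library(ggplot2)

# Made-up data in the same wide shape as described in the edit
alum_toy <- data.frame(
  Year               = c(1960, 1970, 1980, 1990),
  Percent.Records    = c(0.2, 0.4, 0.6, 0.8),
  Percent.No.Records = c(0.8, 0.6, 0.4, 0.2)
)

# Wide to long: one row per year and series
alum_long <- rbind(
  data.frame(Year = alum_toy$Year, series = "Percent.Records",
             value = alum_toy$Percent.Records),
  data.frame(Year = alum_toy$Year, series = "Percent.No.Records",
             value = alum_toy$Percent.No.Records)
)

# position = "identity" overlays the two areas; alpha keeps both visible
p_fill <- ggplot(alum_long,
                 aes(x = Year, y = value, fill = series, colour = series)) +
  geom_area(position = "identity", alpha = 0.4) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percent of Total Records")
```

Picking a lighter and a darker shade for the two series could then be done with scale_fill_manual().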

Related

R: how to filter within aes()

As an R-beginner, there's one hurdle that I just can't find the answer to. I have a table where I can see the amount of responses to a question according to gender.
Response  Gender  n
1         1       84
1         2       79
2         1       42
2         2       74
3         1       84
3         2       79
etc.
I want to plot these in a column chart: on the y I want the n (or its proportions), and on the x I want to have two seperate bars: one for gender 1, and one for gender 2. It should look like the following example that I was given:
The example that I want to emulate
However, when I try to filter the columns according to gender inside aes(), it returns an error! Could anyone tell me why my approach is not working? And is there another practical way to filter the columns of the table that I have?
ggplot(table) +
geom_col(aes(x = select(filter(table, gender == 1), Q),
y = select(filter(table, gender == 1), n),
fill = select(filter(table, gender == 2), n), position = "dodge")
Maybe something like this:
library(RColorBrewer)
library(dplyr) # provides the %>% pipe
library(ggplot2)
df %>%
ggplot(aes(x=factor(Response), y=n, fill=factor(Gender)))+
geom_col(position=position_dodge())+
scale_fill_brewer(palette = "Set1")+
theme_light()
Your approach does not work because you are assigning the x and y variables as if they came from two different datasets (one for x and one for y). In line with the solution from TarJae, think of the aesthetics as the axes of a diagram: the x axis gets the categorical variable you are comparing, and the y axis gets the numerical variable that determines the height of the bars. Finally, you want to compare the groups by color, so that each group has a different color; that is where your grouping variable goes (here, I use fill).
library(dplyr) ## For piping
library(ggplot2) ## For plotting
df %>%
ggplot(aes(x = Response, y = n, fill = as.character(Gender))) +
geom_bar(stat = "Identity", position = "Dodge")
I am adding "Identity" because the default in geom_bar is to count the occurrences in your data (which is what you would want if your data were not already aggregated). I am adding "Dodge" so that the bars are placed side by side instead of stacked. For more information, I recommend this resource: https://r4ds.had.co.nz/index.html
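The question also mentions plotting proportions instead of raw n. A minimal sketch of that, assuming the proportion is meant within each gender and using only the six rows visible in the question's table (the names resp, resp_prop, prop and p_prop are mine):

```r
library(dplyr)
library(ggplot2)

# The rows shown in the question (the real table has more)
resp <- data.frame(
  Response = c(1, 1, 2, 2, 3, 3),
  Gender   = c(1, 2, 1, 2, 1, 2),
  n        = c(84, 79, 42, 74, 84, 79)
)

# Share of each response within its gender
resp_prop <- resp %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Dodged bars: one bar per gender for each response
p_prop <- ggplot(resp_prop,
                 aes(x = factor(Response), y = prop, fill = factor(Gender))) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Response", y = "Share of responses", fill = "Gender")
```

No filtering inside aes() is needed: the whole table goes to ggplot(), and fill splits the bars by gender.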

Assigning specific colors to columns in geom_col()

I produced a geom_col() that has 13 separate columns on it. I would like to assign a specific color to the columns: for example, I have "Teams" on the x axis, and "AVG Attendance" of the teams on the y axis.
Of the 13 teams, I would like my specific team's column to be colored red so it stands out, four of the other columns (teams newly added to league) to be in green, and the other 9 existing teams to be in blue.
I can only get ALL of the columns to be one color if I use - geom_col(fill = "blue").
How do I differentiate the columns to have a separate color grouped by the way I described above? I have spent many days googling this and I can't find a way to do it.
Notes in the code...
# need dplyr and ggplot
library(tidyverse)
# make our random numbers the same
set.seed(123)
# fake minimum reproducible example
game_df <- tibble(team = LETTERS[1:13],
attendance = sample(5000:10000, 13),
# this is hand coded but you should find a way to do this automatically using existing data
# or use mutate() to create a new column with a calculation
color_group = c('old','old','new', 'new', 'old', 'highlight','old','new','old','new','old','old','old') )
# plot
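game_df %>%
ggplot (., aes(x = team, y = attendance, fill = color_group)) +
geom_col() +
# the columns are colored via fill, so use scale_fill_manual();
# naming the values ties each color to its group regardless of level order
scale_fill_manual( values = c(highlight = 'red', new = 'green', old = 'blue'))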
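The comment in the code above suggests building color_group from existing data instead of typing it by hand. One way that might look, using case_when() from dplyr; the team letters in my_team and new_teams and the attendance values are placeholders I invented for the sketch:

```r
library(dplyr)
library(ggplot2)

my_team   <- "F"                   # hypothetical: the team to highlight
new_teams <- c("C", "D", "H", "J") # hypothetical: teams new to the league

# Made-up attendance figures for 13 teams
teams <- data.frame(team = LETTERS[1:13],
                    attendance = seq(5000, 11000, by = 500))

# Derive the color group from the team name
teams <- teams %>%
  mutate(color_group = case_when(
    team == my_team      ~ "highlight",
    team %in% new_teams  ~ "new",
    TRUE                 ~ "old"
  ))

# Named values map each group to its color explicitly
p_teams <- ggplot(teams, aes(x = team, y = attendance, fill = color_group)) +
  geom_col() +
  scale_fill_manual(values = c(highlight = "red", new = "green", old = "blue"))
```

If the league data already has a column such as a joining year, the two vectors at the top could be replaced by a condition on that column.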

How can I plot 3 repeat observations per sample on a scatter in R?

I have a dataframe with the following columns; Sample, Read_length, Length, Rep, Year, Sex. Each unique sample has 6 Length values (2 Read_length conditions x 3 Reps). I would like to plot Length vs Year in such a way that each group of 3 repeats is visually linked on the plot, so I can see the variation. I am using colour and point shape to distinguish between the 2 read-lengths and between Male & Female.
ggplot(data1, aes(x = Year, y = Length, shape = Sex, colour = Read_length)) + geom_point(size = 3) + scale_shape_manual(values = c(1, 4))
Is there a way to group first by read_length, and then by sample name, to generate the groups of three (and how to then plot that)?
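Take your input data and use group_by() from dplyr. This lets dplyr and many other tidyverse functions process each sample separately. ggplot2 itself does not use dplyr groups, though, so in the plot the grouping variables have to be mapped to the group aesthetic.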
data %>% group_by(Sample)
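To link each set of three repeats visually, one possibility is the group aesthetic with a line layer, grouping by the combination of Sample and Read_length. A small sketch on invented data follows (reps_toy and its values are made up; only the column names come from the question):

```r
library(ggplot2)

# Made-up data: 2 samples x 2 read lengths x 3 reps = 12 rows
reps_toy <- expand.grid(
  Rep         = 1:3,
  Read_length = c("short", "long"),
  Sample      = c("S1", "S2"),
  stringsAsFactors = FALSE
)
reps_toy$Year   <- ifelse(reps_toy$Sample == "S1", 2001, 2005)
reps_toy$Sex    <- ifelse(reps_toy$Sample == "S1", "F", "M")
reps_toy$Length <- 100 + 10 * reps_toy$Rep +
  ifelse(reps_toy$Read_length == "long", 50, 0)

# One line per Sample x Read_length combination joins its three repeats
p_reps <- ggplot(reps_toy, aes(x = Year, y = Length, colour = Read_length)) +
  geom_line(aes(group = interaction(Sample, Read_length))) +
  geom_point(aes(shape = Sex), size = 3) +
  scale_shape_manual(values = c(1, 4))
```

Because the three repeats of a sample share the same Year, each group shows up as a short vertical segment spanning the spread of its Length values.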

R, ggplot, How do I keep related points together when using jitter?

One of the variables in my data frame is a factor denoting whether an amount was gained or spent. Every event has a "gain" value; there may or may not be a corresponding "spend" amount. Here is an image with the observations overplotted:
Adding some random jitter helps visually, however, the "spend" amounts are divorced from their corresponding gain events:
I'd like to see the blue circles "bullseyed" in their gain circles (where the "id" are equal), and jittered as a pair. Here are some sample data (three days) and code:
library(ggplot2)
ccode<-c(Gain="darkseagreen",Spend="darkblue")
ef<-data.frame(
date=as.Date(c("2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-01","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-02","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03","2021-03-03")),
site=c("Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace","Castle","Temple","Temple","Temple","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Temple","Temple","Palace","Palace","Castle","Castle","Castle","Castle","Castle","Temple","Temple","Palace"),
id=c("C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99","C123","T101","T93","T94","T95","T96","P102","P96","C126","C127","C128","T100","T98","P100","P98","C129","C130","C131","C132","C133","T104","T99","P99"),
gainspend=c("Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Gain","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend","Spend"),
amount=c(6,14,34,31,3,10,6,14,2,16,16,14,1,1,15,11,8,7,2,10,15,4,3,NA,NA,4,5,NA,NA,NA,NA,NA,NA,2,NA,1,NA,3,NA,NA,2,NA,NA,2,NA,3))
#▼ 3 day, points centered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
#▼ 3 day, jittered
ggplot(ef,aes(date,site)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5,position=position_jitter(w=0,h=0.2)) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20))
My main idea is the old "add jitter manually" approach. I'm wondering if a nicer approach could be something like plotting little pie charts as points a la package scatterpie.
In this case you could add a random number for the amount of jitter to each ID so points within groups will be moved the same amount. This takes doing work outside of ggplot2.
First, draw the "jitter" to add for each ID. Since a categorical axis is 1 unit wide, I choose numbers between -.3 and .3. I use dplyr for this work and set the seed so you will get the same results.
library(dplyr)
set.seed(16)
ef2 = ef %>%
group_by(id) %>%
mutate(jitter = runif(1, min = -.3, max = .3)) %>%
ungroup()
Then the plot. I use a geom_blank() layer so that the categorical site axis is drawn before I add the jitter. I convert site to be numeric from a factor and add the jitter on; this only works for factors so luckily categorical axes in ggplot2 are based on factors.
Now paired ID's move together.
ggplot(ef2, aes(x = date, y = site)) +
geom_blank() +
geom_point(aes(size = amount, color = gainspend,
y = as.numeric(factor(site)) + jitter),
alpha=0.5) +
scale_color_manual(values = ccode) +
scale_size_continuous(range = c(1, 15), breaks = c(5, 10, 20))
#> Warning: Removed 15 rows containing missing values (geom_point).
Created on 2021-09-23 by the reprex package (v2.0.0)
You can add some jitter by id outside the ggplot() call.
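ids <- unique(ef$id)
# draw exactly one jitter value per unique id, so both rows of a pair get the same offset
jj <- data.frame(id = ids, jtr = runif(length(ids), -0.3, 0.3))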
ef <- merge(ef, jj, by = 'id')
ef$sitej <- as.numeric(factor(ef$site)) + ef$jtr
But you need to make site integer/numeric to do this. So when it comes to making the plot, you need to manually add axis labels with scale_y_continuous(). (Update: the geom_blank() trick from aosmith above is a better solution!)
ggplot(ef,aes(date,sitej)) +
geom_point(aes(size=amount,color=gainspend),alpha=0.5) +
scale_color_manual(values=ccode) +
scale_size_continuous(range=c(1,15),breaks=c(5,10,20)) +
scale_y_continuous(breaks = 1:3, labels= sort(unique(ef$site)))
This seems to work, but there are still a few gain/spend circles without a partner--perhaps there is a problem with the id variable.
Perhaps someone else has a better approach!

Barplot of groups based on counts

I'm trying to make a barplot.
The data are in data frames. Each data frame has several columns, including one named ID and another named count.
First, I want to group the counts. The barplot should show count=0, count=1, count=2, and count>=3.
Some example data:
data1 <- data.frame(ID="ID_1", count=(rep(seq(0,10,by=1),each=4)))
data2 <- data.frame(ID="ID_2", count=(rep(seq(0,10,by=1),each=4)))
data3 <- data.frame(ID="ID_3", count=(rep(seq(0,10,by=1),each=4)))
Obviously, the barplots for these data frames will look the same here.
I tried to do this in ggplot (the result is not nice at all):
ggplot(data1)+
geom_bar(aes(x = ID, fill = count),position = "fill")+
geom_bar(data=data2,aes(x = ID, fill = count),position = "fill")+
geom_bar(data=data3,aes(x = ID, fill = count),position = "fill")
I got something like this:
What I'm trying to do is show different groups within each bar: the proportion of counts equal to 0, 1, and 2, and the proportion of counts greater than or equal to 3.
I expect something like this:
But of course, in my example the bars will all look the same.
Also, I'd welcome suggestions for changing the y axis from 1.00 to 100%.
One more thing: my real data frames are not all the same length, but that shouldn't matter because I'm plotting the percentage of each count group.
You need to put all the data in 1 dataframe, in long format. Then cast your counts to factors, and it works.
ggplot(bind_rows(data1, data2, data3)) +
geom_bar(aes(x = ID, fill = as.factor(count)), position = "fill") +
scale_y_continuous(labels=scales::percent) # To get the Y axis in percentage
So I tried the following to create my barplot:
data1$var="first"
data2$var="second"
data3$var="third"
data4$var="fourth"
data5$var="fifth"
full_data=rbind(data1,data2,data3,data4,data5)
ggplot(full_data) +
geom_bar(aes(x = var, fill = as.factor(count)), position = "fill")+
scale_y_continuous(labels=scales::percent)
So I got something like this:
Does anyone have a solution for grouping the counts into count=0, count=1, count=2, and count>=3?
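One way to get those four groups is to bin the counts with base R's cut() before plotting. This is a sketch on two of the example data frames from the question; the names all_counts, count_group and p_groups are mine.

```r
library(ggplot2)

# Two of the example data frames from the question
data1 <- data.frame(ID = "ID_1", count = rep(0:10, each = 4))
data2 <- data.frame(ID = "ID_2", count = rep(0:10, each = 4))
all_counts <- rbind(data1, data2)

# Bin the counts: (-Inf,0] -> "0", (0,1] -> "1", (1,2] -> "2", (2,Inf] -> ">=3"
all_counts$count_group <- cut(
  all_counts$count,
  breaks = c(-Inf, 0, 1, 2, Inf),
  labels = c("0", "1", "2", ">=3")
)

# Each bar shows the share of rows falling in each count group
p_groups <- ggplot(all_counts, aes(x = ID, fill = count_group)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Share of rows", fill = "count")
```

Since position = "fill" rescales every bar to 100%, data frames of different lengths can be combined this way without any extra handling.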
