How to plot top 5 most frequent variables by region in R - r

I want to plot the most commonly occurring FINAL_CALL_TYPE in my dataset by BOROUGH in NYC. I have a dataset with over 3 million obs. I broke this down into a sample of 2000, and have reduced it further to just the incident type and the borough it occurred in.
Essentially, I want to create a plot that visualizes the 5 most common call types in each borough, with the count of each call type in each borough.
Below is a brief look at my data with just Call Type and Borough
> head(df)
FINAL_CALL_TYPE BOROUGH
1804978 INJURY BRONX
1613888 INJMAJ BROOKLYN
294874 INJURY BROOKLYN
1028374 DRUG BROOKLYN
1974030 INJURY MANHATTAN
795815 CVAC BRONX
This shows how many unique values there are
> str(df)
'data.frame': 2000 obs. of 2 variables:
$ FINAL_CALL_TYPE: Factor w/ 139 levels "ABDPFC","ABDPFT",..: 50 48 50 34 50 25 17 138 28 28 ...
$ BOROUGH : Factor w/ 5 levels "BRONX","BROOKLYN",..: 1 2 2 2 3 1 4 2 4 4 ...
This is the code that I have tried
> ggplot(df, aes(x=BOROUGH, y=FINAL_CALL_TYPE)) +
+ geom_bar(stat = 'identity') +
+ facet_grid(~BOROUGH)
and below is the result
I have tried a few suggestions across this community, but I have not found any that show how to do this with 2 columns.
It would be much appreciated if someone knows a solution for this.
Thanks!

If I understand correctly, you can use the tidyverse to do something like:
library(tidyverse)
# count calls per borough and call type, then keep the 5 largest counts per borough
df <- df %>%
  group_by(BOROUGH, FINAL_CALL_TYPE) %>%
  summarise(count = n()) %>%
  top_n(n = 5, wt = count)
then plot
ggplot(df, aes(x = FINAL_CALL_TYPE, y = count)) +
  geom_col() +
  facet_wrap(~BOROUGH, scales = "free")

creating the barplot
The first part of your problem is to create the barplot. With geom_bar you only need to supply the x variable, as the y-axis is the count of observations of that variable. You can then use the facet option to separate that count into different panels for another grouping variable.
library(ggplot2)
ggplot(data = diamonds, aes(x = color)) +
geom_bar() +
facet_grid(.~cut)
filtering to top 5 observations
The second part of your problem, limiting the data to only the top five in each group, is slightly more complex. An easy way to do this is to first tally the data, which creates a column n that holds the count of observations. Setting sort = TRUE orders the rows by descending count, so we can then filter the data to the first five rows in each group. tally, like summarize, automatically removes the last group.
In the ggplot call I now use geom_col instead of geom_bar and I explicitly specify that the y-variable is n (n is created by tally).
geom_bar plots the count of observations per x-variable, geom_col plots a y-variable value for each value of the x-variable.
scales = "free_x" removes values from the x-axis that are present in one cut panel but not another.
library(tidyverse)
df <- diamonds %>%
group_by(cut, color) %>%
tally(sort = TRUE) %>%
filter(row_number() <= 5)
ggplot(data = df, aes(x = color, y = n)) +
geom_col() +
facet_grid(.~cut, scales = "free_x")
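Applied to the column names from the question, the same idea might look like the sketch below. The data frame `calls` is simulated purely for illustration (the real data is not available here), and `slice_max()` with `with_ties = FALSE` is swapped in for `filter(row_number() <= 5)`; this assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the real data (hypothetical values).
set.seed(1)
calls <- data.frame(
  BOROUGH = sample(c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"),
                   2000, replace = TRUE),
  FINAL_CALL_TYPE = sample(paste0("TYPE", 1:20), 2000, replace = TRUE)
)

top5 <- calls %>%
  count(BOROUGH, FINAL_CALL_TYPE, name = "count") %>%  # one row per borough/type pair
  group_by(BOROUGH) %>%
  slice_max(count, n = 5, with_ties = FALSE) %>%       # keep exactly 5 rows per borough
  ungroup()

# One panel per borough; free x-axis so each panel shows only its own call types.
p <- ggplot(top5, aes(x = FINAL_CALL_TYPE, y = count)) +
  geom_col() +
  facet_wrap(~BOROUGH, scales = "free_x")
```

Printing `p` draws the plot. With `with_ties = FALSE` ties at the fifth position are broken arbitrarily, so drop that argument if tied call types should all be shown.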

Related

R: how to filter within aes()

As an R-beginner, there's one hurdle that I just can't find the answer to. I have a table where I can see the amount of responses to a question according to gender.
Response Gender n
1 1 84
1 2 79
2 1 42
2 2 74
3 1 84
3 2 79
etc.
I want to plot these in a column chart: on the y axis I want the n (or its proportions), and on the x axis I want to have two separate bars: one for gender 1, and one for gender 2. It should look like the following example that I was given:
The example that I want to emulate
However, when I try to filter the columns according to gender inside aes(), it returns an error! Could anyone tell me why my approach is not working? And is there another practical way to filter the columns of the table that I have?
ggplot(table) +
geom_col(aes(x = select(filter(table, gender == 1), Q),
y = select(filter(table, gender == 1), n),
fill = select(filter(table, gender == 2), n), position = "dodge")
Maybe something like this:
library(dplyr) # for %>%
library(RColorBrewer)
library(ggplot2)
df %>%
  ggplot(aes(x=factor(Response), y=n, fill=factor(Gender)))+
  geom_col(position=position_dodge())+
  scale_fill_brewer(palette = "Set1")+
  theme_light()
Your approach does not work because you are assigning the x and y variables as if they came from two different datasets (one for x and one for y). In line with the solution from TarJae, think of it in terms of the axes of the chart: assign to the x axis the categorical variable you are comparing, and to the y axis the numerical variable that determines the height of the bars. Finally, you want to compare them by colour, so each group gets a different colour; that is where you include your grouping variable (here, I use fill).
library(dplyr) ## For piping
library(ggplot2) ## For plotting
df %>%
ggplot(aes(x = Response, y = n, fill = as.character(Gender))) +
geom_bar(stat = "Identity", position = "Dodge")
I am adding "Identity" because the default in geom_bar is to count the occurrences in your data (i.e., as if your data were not aggregated). I am adding "Dodge" so that the bars are not stacked. I recommend looking at this resource for more information: https://r4ds.had.co.nz/index.html
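The question also mentions plotting proportions instead of raw counts. A minimal sketch of that variant follows, assuming the proportion wanted is each response's share within a gender. Only the six rows shown in the question are used (the rest of the table is elided there), and the column name `prop` is introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# The six rows shown in the question.
df <- data.frame(
  Response = c(1, 1, 2, 2, 3, 3),
  Gender   = c(1, 2, 1, 2, 1, 2),
  n        = c(84, 79, 42, 74, 84, 79)
)

df_prop <- df %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%  # share of each response within a gender
  ungroup()

# Dodged bars: one bar per gender for each response.
p <- ggplot(df_prop, aes(x = factor(Response), y = prop, fill = factor(Gender))) +
  geom_col(position = "dodge") +
  labs(x = "Response", y = "Proportion", fill = "Gender")
```

All filtering and reshaping happens before the `ggplot()` call; `aes()` then only names columns of the prepared data frame.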

How can I plot 3 repeat observations per sample on a scatter in R?

I have a dataframe with the following columns; Sample, Read_length, Length, Rep, Year, Sex. Each unique sample has 6 Length values (2 Read_length conditions x 3 Reps). I would like to plot Length vs Year in such a way that each group of 3 repeats is visually linked on the plot, so I can see the variation. I am using colour and point shape to distinguish between the 2 read-lengths and between Male & Female.
ggplot(data1, aes(x = Year, y = Length, shape = Sex, colour = Read_length)) + geom_point(size = 3) + scale_shape_manual(values = c(1, 4))
Is there a way to group first by read_length, and then by sample name, to generate the groups of three (and how to then plot that)?
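Take your input data and use group_by() from dplyr. This lets dplyr verbs such as summarise() and mutate() process each sample separately. Note that ggplot2 does not use dplyr grouping; to link points within a plot, map the grouping variable to the group aesthetic instead.
data1 %>% group_by(Sample, Read_length)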
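One way to link each set of three repeats visually is to draw a line through them, using the group aesthetic to define which points belong together. The sketch below uses made-up data with the column names from the question; the values, the sample names and the read-length labels are all hypothetical.

```r
library(ggplot2)

# Hypothetical data: 2 samples x 2 read lengths x 3 reps.
data1 <- expand.grid(
  Rep = 1:3,
  Read_length = c("short", "long"),
  Sample = c("S1", "S2"),
  stringsAsFactors = FALSE
)
data1$Year <- ifelse(data1$Sample == "S1", 2001, 2005)
data1$Sex  <- ifelse(data1$Sample == "S1", "Male", "Female")
set.seed(42)
data1$Length <- round(rnorm(nrow(data1), mean = 100, sd = 5), 1)

p <- ggplot(data1, aes(x = Year, y = Length, shape = Sex, colour = Read_length)) +
  # one line per Sample/Read_length combination, i.e. per set of 3 reps
  geom_line(aes(group = interaction(Sample, Read_length))) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(1, 4))
```

Because the three repeats of a sample share the same Year, each line is a vertical segment spanning their range. If the two read lengths overlap too much at the same Year, a position adjustment such as `position_dodge()` on both layers may help separate them.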

r - How to change the order of a factor within another factor - ggplot2

I have an issue with ggplot2. I need to plot a bar chart of an independent variable divided in two factors (x and fill). However within the two levels of the x factor, my independent variable has to be ordered differently.
I tried using the factor function, but it orders the factor itself and not the levels differently.
# here is the str of the df
$ Indep_var: int 90 70 30 50
$ Factor_1 : Factor w/ 2 levels "One","Two": 2 2 1 1
$ Factor_2 : Factor w/ 2 levels "Area1","Area2": 1 2 1 2
$ SE : num 3 4 3.5 5
# here is the code of the plot
p2 <- ggplot(df, aes(x=Factor_1, y=Indep_var, fill=Factor_2)) +
  geom_col(colour="black",width=0.5, position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=Indep_var-SE, ymax=Indep_var+SE), width=0.2, position=position_dodge(0.5))
p2 + scale_fill_grey(start=0.8, end=0.4) + theme_classic() + coord_cartesian(ylim=c(0,100))
I need Indep_var to be sorted as Area1-Area2 (which are levels of Factor_2) in the first level of Factor_1 (i.e. "One") and as Area2-Area1 in the second level of Factor_1 (i.e. "Two").
Can someone specify the code that I need to add?
I hope it's clear enough. Thanks for your time.
From what I understand (correct me if I am wrong), within each group of Factor_1, you want to plot the bars placing the one with smaller value of Indep_var in the left and the one with larger value of Indep_var in the right. And then map the fill color to Factor_2.
I am not sure this is the best way to do it, but I created a new column with this ordering. Then I used the group aesthetic to group them in the desired order on the x axis.
Here is my code:
library(dplyr)
library(ggplot2)
df <- data.frame(Indep_var=c(90, 70, 30, 50),
Factor_1=c('Two', 'Two', 'One', 'One'),
Factor_2=c('Area_1', 'Area_2', 'Area_1', 'Area_2'),
SE=c(3, 4, 3.5, 5))
df2 <- df %>%
# group by Factor_1
group_by(Factor_1) %>%
# within each group, sort rows according to value of Indep_var
arrange(Indep_var) %>%
# create new column with rank (or row number) within each group
mutate(order_in_group=row_number()) %>%
# transform rank values into a factor
mutate(order_in_group=as.factor(order_in_group)) %>%
# remove grouping
ungroup()
# in ggplot, group bars according to column `order_in_group` (I just added `group=order_in_group`)
ggplot(df2, aes(x=Factor_1, y=Indep_var, fill=Factor_2, group=order_in_group)) +
geom_col(colour="black",width=0.5, position=position_dodge(0.5)) +
geom_errorbar(aes(ymin=Indep_var-SE, ymax=Indep_var+SE), width=0.2, position=position_dodge(0.5))
And here is my plot:

grouped barplot: order x-axis & keep constant bar width, in case of missing levels

Here is my script (example inspired from here and using the reorder option from here):
library(ggplot2)
Animals <- read.table(
header=TRUE, text='Category Reason Species
1 Decline Genuine 24
2 Improved Genuine 16
3 Improved Misclassified 85
4 Decline Misclassified 41
5 Decline Taxonomic 2
6 Improved Taxonomic 7
7 Decline Unclear 10
8 Improved Unclear 25
9 Improved Bla 10
10 Decline Hello 30')
fig <- ggplot(Animals, aes(x=reorder(Animals$Reason, -Animals$Species), y=Species, fill = Category)) +
geom_bar(stat="identity", position = "dodge")
This gives the following output plot:
What I would like is to order my barplot only by the 'Decline' condition, so that the 'Improved'-only bars are not inserted in the middle. Here is what I would like to get (after some svg editing):
So now the whole 'Decline' condition is sorted and the 'Improved' condition comes after. Ideally, the bars would also all have the same width, even if a condition is not represented for a value (e.g. "Bla" has no "Decline" value).
Any idea on how I could do that without having to play with SVG editors? Many thanks!
First let's fill your data.frame with missing combinations like this.
library(dplyr)
Animals2 <- expand.grid(Category=unique(Animals$Category), Reason=unique(Animals$Reason)) %>% data.frame %>% left_join(Animals)
Then you can create an ordering variable for the x-scale:
myorder <- Animals2 %>% filter(Category=="Decline") %>% arrange(desc(Species)) %>% .$Reason %>% as.character
And then plot:
ggplot(Animals2, aes(x=Reason, y=Species, fill = Category)) +
geom_bar(stat="identity", position = "dodge") + scale_x_discrete(limits=myorder)
Define a new data frame with all combinations of "Category" and "Reason", merge it with the "Species" values from the data frame "Animals", and set the x-axis order with scale_x_discrete:
Animals3 <- expand.grid(Category=unique(Animals$Category),Reason=unique(Animals$Reason))
Animals3 <- merge(Animals3,Animals,by=c("Category","Reason"),all.x=TRUE)
Animals3[is.na(Animals3)] <- 0
Animals3 <- Animals3[order(Animals3$Category,-Animals3$Species),]
ggplot(Animals3, aes(x=Reason, y=Species, fill = Category)) +
  geom_bar(stat="identity", position = "dodge") +
  scale_x_discrete(limits=as.character(Animals3[Animals3$Category=="Decline","Reason"]))
To achieve something like that I would adjust the data frame when working with ggplot. Add the missing categories with a value of zero.
Animals <- rbind(Animals,
data.frame(Category = c("Improved", "Decline"),
Reason = c("Hello", "Bla"),
Species = c(0,0)
)
)
Along the same lines as the answer from user Alex, a less manual way of adding the categories might be
d <- with(Animals, expand.grid(unique(Category), unique(Reason)))
names(d) <- names(Animals)[1:2]
Animals <- merge(d, Animals, all.x=TRUE)
Animals$Species[is.na(Animals$Species)] <- 0
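The fill-then-order steps from these answers can also be sketched with `tidyr::complete()`, which is a swapped-in alternative to `expand.grid()` plus `merge()` and not something the answers above use. The names `Animals_full` and `reason_order` are introduced here for illustration, and the row numbers from the original table are dropped.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

Animals <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Category Reason Species
Decline Genuine 24
Improved Genuine 16
Improved Misclassified 85
Decline Misclassified 41
Decline Taxonomic 2
Improved Taxonomic 7
Decline Unclear 10
Improved Unclear 25
Improved Bla 10
Decline Hello 30")

# Add the missing Category/Reason combinations with Species = 0.
Animals_full <- Animals %>%
  complete(Category, Reason, fill = list(Species = 0))

# Order the x-axis by the 'Decline' values, largest first.
reason_order <- Animals_full %>%
  filter(Category == "Decline") %>%
  arrange(desc(Species)) %>%
  pull(Reason)

p <- ggplot(Animals_full, aes(x = Reason, y = Species, fill = Category)) +
  geom_col(position = "dodge") +
  scale_x_discrete(limits = reason_order)
```

Because every Reason now has a row for both categories, the dodged bars keep the same width, and reasons without a 'Decline' value end up last on the x-axis.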

Creating a line chart in r for the average value of groups

I'm trying to create simple line charts with R that connect data points showing the average of groups of respondents (it would also be nice to label them or distinguish them in different colours etc.)
My data is in long format and sorted as shown below (I also have it in wide format if that's of any value):
ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
...
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
...
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA
...
Basically, every respondent was measured a total of n times and the occasions (week) were the same for everyone. Some respondents were missing during one or more occasions. Let's say for motivation. Variables like gender, class and ID don't change, motivation does.
I tried to get a line chart using ggplot2
## define base for the graphs and store in object 'p'
plot <- ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender))
plot + geom_line()
As grouping variable, I want to use class or gender for example.
However, my approach does not lead to lines that connect the averages per group.
I also get vertical lines for each measurement occasion. What does this mean? The only way I could imagine fixing this is to create a new variable average.motivation, compute the average for every group per occasion, and then assign this average to all members of the group. However, this would mean that I would have to do this for every single grouping variable when I want to display group lines based on another factor.
Also, how does the plot handle missing data? (If one member of a group has a missing value, I still want the group average for this occasion to be calculated from the remaining members rather than omitting the whole occasion for that group.)
Edit:
Thank you, the solution with dplyr works great for all my categorical variables.
Now, I'm trying to figure out how I can distinguish between subgroups by colouring their lines based on a second/third factor.
For example, I plot 20 lines for the groups of "class2", but rather than having all of them in 20 different colors, I would like them to use the same colour, if they belong to the same type of class ("class_type", e.g. A, B or C =20 lines, three groups of colours).
I've added the second factor to "mean_data2". That works well. Next, I've tried to change the colour argument in ggplot, (also tried as in geom_line), but that way, I don't have 20 lines anymore.
mean_data2 <- group_by(DataRlong, class2, class_type, occ)%>%
summarise(procras = mean(procras, na.rm = TRUE))
library(ggplot2)
ggplot(na.omit(mean_data2), aes(x = occ, y = procras, colour=class2)) +
  geom_point() + geom_line(aes(colour=class_type))
You can also use the dplyr package to aggregate the data:
library(dplyr)
mean_data <- group_by(data, gender, week) %>%
summarise(motivation = mean(motivation, na.rm = TRUE))
You can use na.omit() to get rid of the NA values as follows:
library(ggplot2)
ggplot(na.omit(mean_data), aes(x = week, y = motivation, colour = gender)) +
geom_point() + geom_line()
There is no need here to explicitly use the group aesthetic because ggplot will automatically group the lines by the categorical variables in your plot. And the only categorical variable you have is gender. (See this answer for more information).
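The automatic grouping stops being enough for the case raised in the question's edit, where the lines should be split by one variable (class2) but coloured by another (class_type). A minimal sketch of that case follows, using made-up per-class means; the values and the class labels are hypothetical, and only the column names come from the question.

```r
library(ggplot2)

# Hypothetical per-class means: 6 classes (class2), each belonging to one of
# 3 class types, measured at 3 occasions.
mean_data2 <- expand.grid(occ = c(0, 6, 10), class2 = paste0("c", 1:6),
                          stringsAsFactors = FALSE)
mean_data2$class_type <- rep(c("A", "B", "C"), each = 6)
set.seed(7)
mean_data2$procras <- round(runif(nrow(mean_data2), 1, 5), 2)

p <- ggplot(mean_data2, aes(x = occ, y = procras,
                            group = class2,          # one line per class
                            colour = class_type)) +  # one colour per class type
  geom_point() +
  geom_line()
```

Here `group` decides how many lines are drawn and `colour` only decides how they are painted, so the plot keeps one line per class while using three colours.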
Another possibility is using stat_summary, so you can do it with ggplot alone. (Since ggplot2 3.3.0 the argument is called fun; older versions used fun.y.)
ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender)) +
  stat_summary(geom = "line", fun = mean)
You almost certainly have to make sure those grouping variables are factors.
I'm not quite sure what you want, but here's a shot...
library("ggplot2")
df <- read.table(textConnection("ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA"), header=TRUE, stringsAsFactors=FALSE)
df2 <- aggregate(df$motivation, by=list(df$gender, df$week),
function(x)mean(x, na.rm=TRUE))
names(df2) <- c("gender", "week", "avg")
df2$gender <- factor(df2$gender)
ggplot(data = df2[!is.na(df2$avg), ],
aes(x = week, y = avg, group=gender, color=gender)) +
geom_point()+geom_line()
