R - ggplot2 - limit bar chart output for categorical data

I am trying to create a bar chart in ggplot2 that limits the output on the x-axis to the top 10% most frequent categories of a categorical variable.
My dataframe is a dataset that contains statistics on personal loans. I am examining the relationship between two categories, Loan Status and Occupation.
First, I want to limit Loan Status to loans that have been "charged off." Next, I want to plot how many loans have been charged off across various occupations using a bar chart. There are 67 unique values for Occupation - I want to limit the plot to only the most frequent occupations (by integer or percentage, i.e. "7" or "10%" works).
In the code below, I am using the forcats function fct_infreq to order the bar chart by frequency in descending order. However, I cannot find a function to limit the number of x-axis categories. I have experimented with quantile, scale_x_discrete, etc. but those don't seem to work for categorical data.
Thanks for your help!
df %>% filter(LoanStatus %in% c("Chargedoff")) %>%
  ggplot() +
  geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
  scale_x_discrete(limits = c(quantile(df$Occupation, 0.9), quantile(df$Occupation, 1)))
Resulting error:
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
UPDATE:
Using Yifu's answer below, I was able to get the desired output like this:
pd_occupation <- pd %>%
  dplyr::filter(LoanStatus == "Chargedoff") %>%
  group_by(Occupation) %>%
  mutate(group_num = n())
table(pd_occupation$group_num) # to view the distribution
ggplot(subset(pd_occupation, group_num >= 361)) +
  geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
  ggtitle('Loan Charge-Offs by Occupation')

You can do it in dplyr instead:
# only use cars whose carb value appears at least 7 times to create a plot
mtcars %>%
  group_by(carb) %>%
  mutate(group_num = n()) %>%
  # you can substitute the number with the 10% percentile or whatever you want
  dplyr::filter(group_num >= 7) #%>%
  #ggplot()
  #create your plot
The idea is to filter the observations first and pass the result to ggplot, rather than filtering the data inside ggplot.
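If you would rather keep a fixed number of categories than pick a count threshold by eye, a minimal sketch along the same lines could look like this. It assumes dplyr 1.0 or later for slice_max; the name top_n_levels and the choice of 3 are made up for illustration, and mtcars stands in for the loan data.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Hypothetical choice: keep only the 3 most frequent carb values
top_n_levels <- 3

# Count each category, then keep the most frequent ones
top_carb <- mtcars %>%
  count(carb, name = "group_num") %>%
  slice_max(group_num, n = top_n_levels, with_ties = FALSE)

# Keep only rows whose carb is among the top categories,
# then order the factor by frequency for plotting
plot_data <- mtcars %>%
  semi_join(top_carb, by = "carb") %>%
  mutate(carb = fct_infreq(factor(carb)))

p <- ggplot(plot_data, aes(carb)) +
  geom_bar()
```

For a percentage instead of a fixed number, one option is to set top_n_levels to something like ceiling(0.1 * n_distinct(mtcars$carb)) before filtering.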

Related

How to create a stacked area chart in R from a csv with non-numerical data

I am trying to create a stacked area chart in R using data from this csv: https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/raw-responses.csv
(The above file is raw content, for better readability of the data look here: https://github.com/fivethirtyeight/data/blob/master/masculinity-survey/masculinity-survey.csv)
I am trying to create a percentage-based stacked area chart that is similar to this example: https://r-charts.com/en/evolution/percentage-stacked-area_files/figure-html/percentage-areaplot.png
The problem is that since I am working with non-numerical data only, it is a bit hard for me to get a proper graph.
My goal is to have the graph display the different age groups on the x-axis (column "age3" in the raw content) and the fill to be the ethnicities (column "racethn4" in the raw content), while the y-axis is simply the percentage of the total number of answers in the survey (which of course goes up to 100).
I tried to do it the following way, but I'm not sure what the y value should be:
df <- read_csv("Path to csv")
ggplot(df, aes(x = df$age3, y = ???, fill = df$racethn4)) + geom_stream()
Any ideas on how to represent the plot as described?
I'm not too well versed in ggplot as I use other graphing packages but I gave this a shot. I don't believe you can use geom_area when x is a categorical variable. At least I did not have any luck trying that. So I used geom_col instead.
Here are two approaches for transforming the data, one using data.table and one using dplyr. Feel free to pick whichever is more natural for you.
You need to sum up the number of observations per group combo first and then get the percent total for the y values.
library(data.table)
library(ggplot2)
library(dplyr)
dat = fread("temp.csv") # from data.table::fread
# data.table way
dat_sub = dat[, .(age3 = as.factor(age3), racethn4 = as.factor(racethn4))][,.N, by = .(age3,racethn4)]
dat_sub[, tot := sum(N), by = age3][, perc := N/tot*100][order(age3)]
# dplyr way
dat_sub = dat %>%
  select(age3, racethn4) %>%
  group_by(age3, racethn4) %>%
  summarise(n = n()) %>%
  group_by(age3) %>%
  mutate(tot = sum(n),
         perc = n / tot * 100)
# using a stacked bar chart instead of stacked area
ggplot(dat_sub, aes(x = age3, y = perc, fill = racethn4)) +
  geom_col()
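To try the percentage step without downloading the csv, here is a small self-contained sketch. The toy data frame is made up and only borrows the column names age3 and racethn4; the values are not from the survey.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the two survey columns (illustrative values only)
toy <- data.frame(
  age3     = c("18 - 34", "18 - 34", "18 - 34", "35 - 64", "35 - 64", "65 and up"),
  racethn4 = c("White", "Black", "White", "Hispanic", "White", "White")
)

# Count each age/ethnicity combination, then convert to a percentage within each age group
toy_pct <- toy %>%
  count(age3, racethn4) %>%
  group_by(age3) %>%
  mutate(perc = n / sum(n) * 100) %>%
  ungroup()

p <- ggplot(toy_pct, aes(x = age3, y = perc, fill = racethn4)) +
  geom_col()
```

Within every age group the perc values add up to 100, so each stacked bar has the same height.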

How can I run this function multiple times and plot each result?

I have the following function and ggplot code. Essentially it runs through my database each time randomly removing one line per plot and calculating the frequency of each Category until there are only 5 lines per plot left. I then plot the calculated frequencies using ggplot and at the moment I am using a subset to only plot 4 IDs.
What I want to do is run the full function 5 different times and graph the results for each run on the same plot. The results should slightly differ since the function randomly removes lines. So the ggplot would have 5 lines per Category as opposed to the one line per category at the moment.
Initial Dataset (there are multiple plots, and 50 rows per plot):
Dataset after For Loop:
Code being used:
for (i in 0:45) {
  if (i > 0) {
    dat <- dat %>%
      group_by(Plot) %>%
      sample_n(n() - 1) %>%
      ungroup()
  }
  j <- dat %>%
    group_by(Category) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n),
           total = 50 - i)
  if (i == 0) {
    tot_j = j
  } else {
    tot_j = bind_rows(tot_j, j)
  }
}
ggplot(subset(tot_j, Category %in% c("C", "G", "Z", "S"))) +
  geom_line(aes(total, freq, colour = Category)) +
  xlim(50, 5)
Graph currently produced
I appreciate any help or advice! Still learning how to use r to its fullest extent!
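One possible way to approach this, offered as a rough sketch rather than a tested answer for the real data: wrap the loop in a function, call it five times, and tag each result with a run number so that ggplot can draw one line per Category and run. The data frame dat0, the function thin_once, and the column run are made-up names; dat0 is random stand-in data with 3 plots of 50 rows each.

```r
library(dplyr)
library(ggplot2)

# Made-up data: 3 plots with 50 rows each (stand-in for the real dataset)
set.seed(1)
dat0 <- data.frame(
  Plot     = rep(1:3, each = 50),
  Category = sample(c("C", "G", "Z", "S"), 150, replace = TRUE)
)

# One full thinning run: drop one random row per plot at every step
# and record the category frequencies after each step
thin_once <- function(dat, steps = 45) {
  out <- vector("list", steps + 1)
  for (i in 0:steps) {
    if (i > 0) {
      dat <- dat %>%
        group_by(Plot) %>%
        slice(sample.int(n(), n() - 1)) %>%  # keep all but one random row
        ungroup()
    }
    out[[i + 1]] <- dat %>%
      count(Category) %>%
      mutate(freq = n / sum(n), total = 50 - i)
  }
  bind_rows(out)
}

# Repeat the whole procedure 5 times, always starting from the full data
runs <- bind_rows(lapply(1:5, function(r) mutate(thin_once(dat0), run = r)))

# One line per Category and run, coloured by Category
p <- ggplot(runs, aes(total, freq, colour = Category,
                      group = interaction(Category, run))) +
  geom_line() +
  xlim(50, 5)
```

The key point is that each run starts again from the untouched data; in the original loop dat is overwritten, so a second pass would start from the already thinned rows.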

How to use for loop across various years and get multiple plots together?

https://www.kaggle.com/nowke9/ipldata ---- contains the data set.
I am fairly new to R programming. This is an exploratory study of the IPL data set (link above). After merging both files on "id" and "match_id", I am trying to plot the relationship between matches won by teams across different cities.
However, since 12 seasons have been played, the output I am getting does not support any clear conclusions. To plot the relationship for each year, it seems a for loop is needed. Right now, the output for all 12 years is displayed in a single graph.
How can I fix this and plot a separate graph for each year with a proper color scheme?
library(tidyverse)
matches_tbl <- read_csv("data/matches_updated.csv")
deliveries_tbl <- read_csv("data/deliveries_updated.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
  left_join(matches_tbl, by = c("match_id" = "id"))
combined_matches_deliveries_tbl %>%
  group_by(city, winner)%>%
  filter(season == 2008:2019, !result == "no result")%>%
  count(match_id)%>%
  ungroup()%>%
  ggplot(aes(x = winner))+
  geom_bar(aes(fill = city),alpha = 0.5, color = "black", position = "stack")+
  coord_flip()+
  theme_bw()
The output is as follows:-
There were 50 or more warnings (use warnings() to see the first 50)
(Figure: Winner of teams across cities for the years between 2008 and 2019)
The required output is 12 separate graphs produced by a single piece of code, with a proper color scheme.
Many thanks in advance.
Here is an example using mtcars to split by a variable into separate plots. What I created is a scatter plot of vs and mpg by splitting the dataset by cyl. First create an empty list. Then I use lapply to loop through the values of cyl (4,6,8) and then filter the data by that value. After that I plot the scatter plot for the subset and save it to the empty list. Each segment of the list will represent a plot and you can pull them out as you see fit.
library(dplyr)
library(ggplot2)
gglist <- list()
gglist <- lapply(c(4,6,8), function(x){
  ggplot(filter(mtcars, cyl == x))+
    geom_point(aes(x=vs,y=mpg))
})
Is this what you want?
combined_matches_deliveries_tbl %>%
  group_by(city, winner,season)%>%
  filter(season %in% 2008:2019, !result == "no result")%>%
  count(match_id)%>%
  ggplot(aes(x = winner))+
  geom_bar(aes(fill = city),alpha = 0.5, color = "black", position = "stack")+
  coord_flip()+ facet_wrap(season~.)+
  theme_bw()
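Note the switch from == to %in% in the filter. The warnings in the question most likely come from season == 2008:2019, which compares position by position and recycles the shorter vector instead of testing membership. A tiny base R sketch with a made-up seasons vector (not the IPL data) shows the difference:

```r
# Made-up season values, for illustration only
seasons <- c(2008, 2009, 2010, 2008)

# `==` recycles 2008:2009 along seasons and compares position by position
eq_result <- seasons == 2008:2009

# `%in%` asks, for each season, whether it appears anywhere in 2008:2009
in_result <- seasons %in% 2008:2009

eq_result  # TRUE TRUE FALSE FALSE
in_result  # TRUE TRUE FALSE TRUE
```

The last element is 2008 and should be kept, but == drops it because it happens to be compared against 2009.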

ordering y axis of geom_tile plot by cumulative sum of days

I have a dataset with 4 columns: client, date, sales and scale.
I am trying to figure out how to order the y-axis (client) on a geom_tile plot based not on the default decreasing order of the levels but on the cumulative sum of sales of each client over all days.
The code below is an example. The clients are ordered 5, 4, 3, 2, 1, but I need them ordered by their total sales over all days.
data.frame(client=seq(1:5),date=Sys.Date()-0:05,sales=rnorm(30,300,100)) %>%
  mutate_if(is.numeric,round,0) %>%
  mutate(escale=cut(sales,breaks=c(0,100,200,300,1000),labels=c("0-100","100-200","200-300","+300"))) %>%
  ggplot(.,aes(x=date,y=client,fill=escale)) +
  geom_tile(colour="white",size=0.25)
Appreciate any help
Maybe a quick fix like this?
df <- data.frame(client=seq(1:5),date=Sys.Date()-0:05,sales=rnorm(30,300,100)) %>%
  mutate_if(is.numeric,round,0) %>%
  mutate(escale=cut(sales,breaks=c(0,100,200,300,1000),
                    labels=c("0-100","100-200","200-300","+300")))
# we get the order according to total sales
o <- names(sort(tapply(df$sales,df$client,sum)))
ggplot(df,aes(x=date,y=client,fill=escale)) +
  geom_tile(colour="white",size=0.25)+
  # just manually set the y-axis here
  scale_y_discrete(limits=o)
Personally, I prefer to set the factor levels in the data.frame and then plot:
# we get the order according to total sales
o <- names(sort(tapply(df$sales,df$client,sum)))
df %>% mutate(client = factor(client,levels=o)) %>%
  ggplot(.,aes(x=date,y=client,fill=escale)) +
  geom_tile(colour="white",size=0.25)
Both will give this:
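If forcats is available, fct_reorder can do the same reordering in one step. This is only a sketch of an alternative, not what the answer above used, and the small data frame toy_sales is made up so that the resulting order is easy to check by hand.

```r
library(forcats)

# Made-up sales: client 1 totals 150, client 2 totals 30, client 3 totals 305
toy_sales <- data.frame(
  client = rep(1:3, each = 2),
  sales  = c(100, 50, 10, 20, 300, 5)
)

# Reorder the client factor by the sum of sales per client (ascending)
toy_sales$client <- fct_reorder(factor(toy_sales$client), toy_sales$sales, .fun = sum)

levels(toy_sales$client)  # "2" "1" "3"
```

Because client is now a factor with levels in that order, ggplot will place the clients on the y-axis accordingly without any extra scale call.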

Multiple barplots of different mean of years in one plot

I currently have two dataframes. I wish to draw multiple bar plots from both of them in one plot using ggplot. I want to get the average of the 'NEE' variable over different year ranges (1850-1950, 1951-2012, 2013-2100) from both dataframes and plot them side by side, just like in this green barplot visualization (https://ars.els-cdn.com/content/image/1-s2.0-S0048969716303424-fx1_lrg.jpg).
The header of the two dataframes is as follows (this is only a portion). The header is the same for both dataframes for the years 1850-1859:
How can I plot bars for, let's say, the years 1850-1852, 1854-1856 and 1857-1859 from both dataframes in one plot? I know the bar plots will be the same in this case as both dataframes are similar, but I would like to get an idea so that I can edit the code for my desired years.
(Note that I have 39125 obs with 9 variables)
This is what I have done so far (following a solution posted by a member of this website). I produced the geom_col plots for data1 and data2 successfully. But how can I merge them and plot the bars for 1850-1852, 1854-1856 and 1857-1859 side by side from both dataframes? (graph of data1, graph of data2)
data1 %>%
  # case_when lets us define yr_group based on Year:
  mutate(yr_group = case_when(Year <= 1950 ~ "1850-1950",
                              Year <= 2012 ~ "1951-2012",
                              Year <= 2100 ~ "2013-2100",
                              TRUE ~ "Other range")) %>%
  # For each location and year group, get the mean of all the columns:
  group_by(Lon, Lat, yr_group) %>%
  summarise_all(mean) %>%
  # Plot the mean NEE for each yr_group
  ggplot(aes(yr_group, NEE)) +
  geom_col(position = "dodge") +
  theme_classic() +
  xlab("Year") +
  ylab(ln) +
  labs(subtitle = "CCSM4 RCP2.6") +
  geom_hline(yintercept = 0, color = "black", size = 1)
My preferred approach is usually to do the data summarization first and then send the output to ggplot. In this case, you might use dplyr from the tidyverse meta-package to add a variable relating to which time epoch a given year belongs to, and then collect the stats for that whole epoch.
For instance, just using your example data, we might group those years arbitrarily and find the averages for 1850-51, 1852-53, and 1854-55, and then display those next to each other:
library(tidyverse)
df %>%
  # case_when lets us define yr_group based on Year:
  mutate(yr_group = case_when(Year <= 1851 ~ "1850-51",
                              Year <= 1853 ~ "1852-53",
                              Year <= 1855 ~ "1854-55",
                              TRUE ~ "Other range")) %>%
  # For each location and year group, get the mean of all the columns:
  group_by(Lon, Lat, yr_group) %>%
  summarise_all(mean) %>%
  # Plot the mean Total for each yr_group
  ggplot(aes(yr_group, Total)) + geom_col()
If you have multiple locations, you might use ggplot facets to display those separately, or use dodge within geom_col (equivalent to geom_bar(stat = "identity"), btw) to show the different locations next to each other.
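To get both dataframes into one plot, one option is to stack them with an identifier column and dodge on that column. The sketch below is an assumption about how that could look: data1 and data2 are tiny made-up stand-ins with only Year and NEE, the column name source is invented, and the year groups are arbitrary.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-ins for data1 and data2 (illustrative values only)
data1 <- data.frame(Year = 1850:1859, NEE = seq(1, 10))
data2 <- data.frame(Year = 1850:1859, NEE = seq(2, 20, by = 2))

# Stack both dataframes; `source` records which one each row came from
combined <- bind_rows(list(data1 = data1, data2 = data2), .id = "source") %>%
  mutate(yr_group = case_when(Year <= 1852 ~ "1850-1852",
                              Year <= 1856 ~ "1853-1856",
                              TRUE ~ "1857-1859")) %>%
  # Mean NEE per dataframe and year group
  group_by(source, yr_group) %>%
  summarise(NEE = mean(NEE), .groups = "drop")

# Bars for data1 and data2 sit side by side within each year group
p <- ggplot(combined, aes(yr_group, NEE, fill = source)) +
  geom_col(position = "dodge")
```

With real data that has several locations, Lon and Lat would need to be averaged over or added to the grouping, depending on what the bars should represent.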
