I have the following function and ggplot code. Essentially it runs through my database each time randomly removing one line per plot and calculating the frequency of each Category until there are only 5 lines per plot left. I then plot the calculated frequencies using ggplot and at the moment I am using a subset to only plot 4 IDs.
What I want to do is run the full function 5 different times and graph the results for each run on the same plot. The results should slightly differ since the function randomly removes lines. So the ggplot would have 5 lines per Category as opposed to the one line per category at the moment.
Initial dataset (there are multiple plots, and 50 rows per plot):
Dataset after For Loop:
Code being used:
for (i in 0:45) {
  if (i > 0) {
    dat <- dat %>%
      group_by(Plot) %>%
      sample_n(n() - 1) %>%
      ungroup()
  }
  j <- dat %>%
    group_by(Category) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n),
           total = 50 - i)
  if (i == 0) {
    tot_j <- j
  } else {
    tot_j <- bind_rows(tot_j, j)
  }
}
ggplot(subset(tot_j, Category %in% c("C", "G", "Z", "S"))) +
  geom_line(aes(total, freq, colour = Category)) +
  xlim(50, 5)
Graph currently produced
I appreciate any help or advice! Still learning how to use r to its fullest extent!
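One possible way to get five lines per Category, sketched under assumptions: wrap the loop in a function, call it five times on the original data, and label each result with a run number so ggplot can draw one line per run and Category. The data frame below is a made-up stand-in (4 plots of 50 rows with random categories), and `run_once`, `all_runs` and `run` are hypothetical names, not part of the original code.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # only so the toy example is reproducible

# Made-up stand-in for the real data: 4 plots, 50 rows each
dat <- data.frame(
  Plot     = rep(1:4, each = 50),
  Category = sample(c("C", "G", "Z", "S"), 200, replace = TRUE)
)

# One full pass: remove one random row per plot until 5 rows per plot remain,
# recording the category frequencies at every step
run_once <- function(d) {
  steps <- vector("list", 46)
  for (i in 0:45) {
    if (i > 0) {
      d <- d %>%
        group_by(Plot) %>%
        sample_n(n() - 1) %>%
        ungroup()
    }
    steps[[i + 1]] <- d %>%
      count(Category) %>%
      mutate(freq = n / sum(n), total = 50 - i)
  }
  bind_rows(steps)
}

# Five independent passes over the same starting data, labelled by run
all_runs <- bind_rows(lapply(1:5, function(r) run_once(dat)), .id = "run")

# One line per run and Category, coloured by Category
ggplot(all_runs) +
  geom_line(aes(total, freq, colour = Category,
                group = interaction(run, Category))) +
  xlim(50, 5)
```

Because each call starts again from the full `dat`, the five runs differ only through the random row removal. The `group` aesthetic is what keeps the runs from being joined into a single zigzag line per Category.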
I am trying to create a stacked area chart in R using data from this csv: https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/raw-responses.csv
(The above file is raw content, for better readability of the data look here: https://github.com/fivethirtyeight/data/blob/master/masculinity-survey/masculinity-survey.csv)
I am trying to create a percentage-based stacked area chart that is similar to this example: https://r-charts.com/en/evolution/percentage-stacked-area_files/figure-html/percentage-areaplot.png
The problem is that since I am working with non-numerical data only, it is a bit hard for me to get a proper graph.
My goal is to have the graph display the different age groups on the x-axis (column "age3" in the raw content) and the fill to be the ethnicities (column "racethn4" in the raw content), while the y-axis is simply the percentage of the total answers in the survey (which of course goes up to 100).
I tried to do it the following way, but I'm not sure what the y value should be:
df <- read_csv("Path to csv")
ggplot(df, aes(x = df$age3, y = ???, fill = df$racethn4)) + geom_stream()
Any ideas on how to represent the plot as described?
I'm not too well versed in ggplot as I use other graphing packages, but I gave this a shot. I don't believe you can use geom_area when x is a categorical variable; at least I did not have any luck trying that. So I used geom_col instead.
Here are two approaches for transforming the data, one using data.table and one using dplyr. Feel free to pick whichever is more natural for you.
You need to count the number of observations per group combination first and then divide by the total for each age group to get the percentage for the y values.
library(data.table)
library(ggplot2)
library(dplyr)
dat = fread("temp.csv") # from data.table::fread
# data.table way
dat_sub = dat[, .(age3 = as.factor(age3), racethn4 = as.factor(racethn4))][,.N, by = .(age3,racethn4)]
dat_sub[, tot := sum(N), by = age3][, perc := N/tot*100][order(age3)]
# dplyr way
dat_sub = dat %>%
  select(age3, racethn4) %>%
  group_by(age3, racethn4) %>%
  summarise(n = n()) %>%
  group_by(age3) %>%
  mutate(tot = sum(n),
         perc = n / tot * 100)
# using a stacked bar chart instead of stacked area
ggplot(dat_sub, aes(x = age3, y = perc, fill = racethn4)) +
  geom_col()
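As a possible shortcut, ggplot2 can do the counting and the rescaling itself: `geom_bar()` counts rows, and `position = "fill"` stretches each bar to a total of 1. The sketch below uses a small made-up data frame whose values are invented for illustration; only the column names `age3` and `racethn4` come from the question.

```r
library(ggplot2)

# Made-up survey-like rows; the values are illustrative only
toy <- data.frame(
  age3     = c("18 - 34", "18 - 34", "18 - 34", "35 - 64", "35 - 64", "65 and up"),
  racethn4 = c("White", "Black", "White", "Hispanic", "White", "White")
)

# geom_bar() counts the rows per group; position = "fill" rescales each bar to 1
p <- ggplot(toy, aes(x = age3, fill = racethn4)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Share of responses")
p
```

This gives the same picture as computing `perc` by hand, so the explicit dplyr or data.table step is mainly useful when the percentages themselves are needed as numbers.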
I'm looking to make a plot of the frequencies of the ten most common dog breeds in an excel csv data set with hundreds of dog breeds in it. Is there any way to do this?
getwd()
library(datasets)
library(ggplot2)
pacman::p_load(pacman, rio,ggplot2)
dogSample <- import("C:/Users/casey/OneDrive/Desktop/Rtest/samples.csv")
head(dogSample)
summary(dogSample)
breedbars<-table(dogSample$Breed)
genders<-table(dogSample$Gender)
plot(breedbars,
     xlab = "Breeds",
     ylab = "frequency",
     main = "Numbers of breeds")
Here's a generic approach given a common dataset, diamonds, included with ggplot2, part of the tidyverse meta-package.
Here, I take the diamonds dataset, count how many times each cut appears (stored as n), keep only the five most common, reorder the levels of cut by n, and finally plot the result as a horizontal bar plot with n on the x-axis and cut on the y-axis.
The only step you'd need to add is to load the csv, for which there are many online tutorials, like here: https://r4ds.had.co.nz/data-import.html
library(tidyverse)
diamonds %>%
  count(cut, sort = TRUE) %>%
  slice(1:5) %>%
  mutate(cut = fct_reorder(cut, n)) %>%
  ggplot(aes(x = n, y = cut)) +
  geom_col()
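Applied to the dog data, the same steps might look like the sketch below. The data frame is a made-up stand-in for the imported CSV; it assumes the breed names live in a column called `Breed`, as in the question, and `top_breeds` is a hypothetical name.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Made-up stand-in for the imported CSV; only the Breed column matters here
dogSample <- data.frame(
  Breed = c(rep("Labrador", 5), rep("Poodle", 3), rep("Beagle", 2), "Pug")
)

top_breeds <- dogSample %>%
  count(Breed, sort = TRUE) %>%          # one row per breed, most frequent first
  slice_head(n = 10) %>%                 # keep at most the ten most common
  mutate(Breed = fct_reorder(Breed, n))  # order the bars by frequency

ggplot(top_breeds, aes(x = n, y = Breed)) +
  geom_col() +
  labs(x = "Frequency", y = "Breed")
```

With the real file, only the `data.frame(...)` call would be replaced by the import step.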
The data set is available at https://www.kaggle.com/nowke9/ipldata.
I am fairly new to R programming. This is an exploratory study of the IPL data set (linked above). After merging both files on "id" and "match_id", I am trying to plot the relationship between matches won by teams across different cities.
However, since 12 seasons have been played, the output I am getting does not support any clear conclusions. To plot the relationship for each year, I assume a for loop is required. Right now, the output for all 12 years is displayed in a single graph.
How can I rectify this and plot a separate graph for each year with a proper color scheme?
library(tidyverse)
matches_tbl <- read_csv("data/matches_updated.csv")
deliveries_tbl <- read_csv("data/deliveries_updated.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
  left_join(matches_tbl, by = c("match_id" = "id"))
combined_matches_deliveries_tbl %>%
  group_by(city, winner) %>%
  filter(season == 2008:2019, !result == "no result") %>%
  count(match_id) %>%
  ungroup() %>%
  ggplot(aes(x = winner)) +
  geom_bar(aes(fill = city), alpha = 0.5, color = "black", position = "stack") +
  coord_flip() +
  theme_bw()
The output is as follows:
There were 50 or more warnings (use warnings() to see the first 50)
[Image: Winner of teams across cities for the years between 2008 and 2019]
The required output is 12 separate graphs, produced by a single piece of code, with a proper color scheme.
Many thanks in advance.
Here is an example using mtcars that splits the data by a variable into separate plots: a scatter plot of mpg against vs for each value of cyl. I use lapply to loop through the values of cyl (4, 6, 8), filter the data to that value, and plot the subset. lapply returns the plots as a list, so there is no need to create an empty list first. Each element of the list is one plot and you can pull them out as you see fit.
library(dplyr)
library(ggplot2)
# one scatter plot per value of cyl, collected in a list
gglist <- lapply(c(4, 6, 8), function(x) {
  ggplot(filter(mtcars, cyl == x)) +
    geom_point(aes(x = vs, y = mpg))
})
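To pull individual plots back out, it can help to name the list elements after the value they were built from. The following is a small sketch of that idea; `plots_by_cyl` and `cyl_values` are hypothetical names, and the commented-out `ggsave()` loop uses made-up file names.

```r
library(dplyr)
library(ggplot2)

cyl_values <- sort(unique(mtcars$cyl))   # 4, 6, 8

plots_by_cyl <- lapply(cyl_values, function(x) {
  ggplot(filter(mtcars, cyl == x)) +
    geom_point(aes(x = vs, y = mpg)) +
    ggtitle(paste("cyl =", x))
})
names(plots_by_cyl) <- cyl_values        # names become "4", "6", "8"

plots_by_cyl[["6"]]                      # show the plot for cyl == 6

# Hypothetical file names: save every plot to its own PNG
# for (nm in names(plots_by_cyl)) {
#   ggsave(paste0("cyl_", nm, ".png"), plots_by_cyl[[nm]])
# }
```

The same pattern should carry over to the IPL data by looping over the distinct values of season instead of cyl.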
Is this what you want?
combined_matches_deliveries_tbl %>%
  group_by(city, winner, season) %>%
  filter(season %in% 2008:2019, result != "no result") %>%
  count(match_id) %>%
  ggplot(aes(x = winner)) +
  geom_bar(aes(fill = city), alpha = 0.5, color = "black", position = "stack") +
  coord_flip() +
  facet_wrap(~ season) +
  theme_bw()
I am trying to create a bar chart in ggplot2 that limits the x-axis to the top 10% most frequent categories.
My dataframe contains statistics on personal loans. I am examining the relationship between two categorical variables, Loan Status and Occupation.
First, I want to limit Loan Status to loans that have been "charged off." Next, I want to plot how many loans have been charged off across various occupations using a bar chart. There are 67 unique values for Occupation, and I want to limit the plot to only the most frequent occupations (by count or percentage, e.g. "7" or "10%" works).
In the code below, I am using the forcats function fct_infreq to order the bar chart by frequency in descending order. However, I cannot find a function to limit the number of x-axis categories. I have experimented with quantile, scale_x_discrete, etc., but those don't seem to work for categorical data.
Thanks for your help!
df %>%
  filter(LoanStatus %in% c("Chargedoff")) %>%
  ggplot() +
  geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
  scale_x_discrete(limits = c(quantile(df$Occupation, 0.9), quantile(df$Occupation, 1)))
Resulting error:
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
UPDATE:
Using Yifu's answer below, I was able to get the desired output like this:
pd_occupation <- pd %>%
  dplyr::filter(LoanStatus == "Chargedoff") %>%
  group_by(Occupation) %>%
  mutate(group_num = n())
table(pd_occupation$group_num)  # to view the distribution
ggplot(subset(pd_occupation, group_num >= 361)) +
  geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
  ggtitle('Loan Charge-Offs by Occupation')
You can do it in dplyr instead:
# only use cars whose carb value appears at least 7 times to create a plot
mtcars %>%
  group_by(carb) %>%
  mutate(group_num = n()) %>%
  # you can substitute the number with a percentile-based cutoff or whatever you want
  dplyr::filter(group_num >= 7) # %>%
  # ggplot() + ... create your plot
The idea is to filter the observations first and pass them to ggplot, rather than filtering the data inside ggplot.
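A percentile-based cutoff could be sketched like this, again on mtcars. It takes the 90th percentile of the per-group counts as the threshold; `carb_counts`, `cutoff` and `top_carbs` are hypothetical names, and the choice of 0.9 is only an illustration of the "top 10%" idea.

```r
library(dplyr)

# Number of rows per carb value
carb_counts <- count(mtcars, carb)

# 90th percentile of those group sizes, used as the cutoff
cutoff <- unname(quantile(carb_counts$n, 0.9))

# Keep only rows whose group is at least as large as the cutoff
top_carbs <- mtcars %>%
  add_count(carb, name = "group_num") %>%
  filter(group_num >= cutoff)

# top_carbs can then be passed on to ggplot()
```

For the loan data, the same pattern would use Occupation in place of carb after filtering on LoanStatus.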
RStudio (ggplot) question: I need to prepare a plot with age on the x-axis, with each subject represented by one dot per session (baseline and follow-up) and a line drawn between the two dots (a spaghetti plot), preferably sorted by age at baseline. Can anyone help me?
I want to plot the lines horizontally along the x-axis (from age at timepoint 1, AgeTp1, to AgeTp2), and the y-axis can represent an index based on a list of individuals sorted by AgeTp1 (so just a pile of horizontal lines, really).
IMAGE OF DATASET
Here is a simple example that you can modify to suit your purposes...
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 age = c(20, 25, 22, 27, 21, 28))
library(dplyr)
library(ggplot2)
# sort by first age for each ID
df <- df %>%
  group_by(ID) %>%
  mutate(index = min(age)) %>%
  ungroup() %>%
  mutate(index = rank(index))
ggplot(df, aes(x = age, y = index, colour = ID, group = ID)) +
  geom_point(size = 4) +
  geom_line(size = 1)
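If the data are in wide format instead, with one row per subject and the two ages in separate columns, `geom_segment()` may be a more direct fit. This is a sketch under that assumption: the column names `AgeTp1` and `AgeTp2` follow the question, while the values and the `wide` data frame are made up.

```r
library(dplyr)
library(ggplot2)

# Made-up wide data: one row per subject, two age columns
wide <- data.frame(
  ID     = c("A", "B", "C"),
  AgeTp1 = c(20, 22, 21),
  AgeTp2 = c(25, 27, 28)
)

# Sort by age at baseline and number the subjects in that order
wide <- wide %>%
  arrange(AgeTp1) %>%
  mutate(index = row_number())   # 1 = youngest at baseline

# One horizontal segment per subject, with a dot at each session
ggplot(wide) +
  geom_segment(aes(x = AgeTp1, xend = AgeTp2, y = index, yend = index)) +
  geom_point(aes(x = AgeTp1, y = index)) +
  geom_point(aes(x = AgeTp2, y = index)) +
  labs(x = "Age", y = "Subject (sorted by age at baseline)")
```

Using `row_number()` after `arrange()` gives consecutive integer positions on the y-axis, so the lines are evenly spaced.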