I have a dataset that I want to summarize through time. It covers ten dates, with flower counts on three plants (Tomato, Pepper, Squash). I would like to create a stacked bar plot in ggplot that shows the running total of flowers, colored by plant. The y axis should be the cumulative sum of flowers and the x axis should be time. When I use cumsum the output does not make sense to me. Any help would be great! Thanks.
df.sum<- df.sub%>% group_by(Date) %>% mutate(cumsum_covered = cumsum(Tomato))
ggplot (df.sum, aes (x=Date, y=cumsum_covered)) + geom_bar(stat="identity")
You are grouping by Date, so each group contains a single value and cumsum just returns that value. Instead, we want the cumsum of each fruit, ordered by date:
library(tidyverse)
df.sum <- df.sub %>%
  # This gives us Date, fruit, amount
  gather(fruit, amount, Tomato, Pepper, Squash) %>%
  # We group by the fruit to get only the cumsums for the correct fruit and order by date
  group_by(fruit) %>%
  arrange(Date) %>%
  mutate(cumsum_covered = cumsum(amount))
ggplot(df.sum, aes(Date, cumsum_covered, fill=fruit)) +
  geom_col(position="stack")
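Since the original dataset was not posted, here is a minimal self-contained sketch of the same idea on made-up counts. The values in df.sub are hypothetical, and pivot_longer is used as the newer tidyr equivalent of gather; treat it as an illustration rather than the exact answer code.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical flower counts; the real dataset was not included in the post.
df.sub <- data.frame(
  Date   = as.Date("2020-06-01") + 0:3,
  Tomato = c(1, 2, 0, 3),
  Pepper = c(0, 1, 1, 2),
  Squash = c(2, 0, 1, 1)
)

df.sum <- df.sub %>%
  # Long format: one row per Date and fruit
  pivot_longer(c(Tomato, Pepper, Squash),
               names_to = "fruit", values_to = "amount") %>%
  # Order by date, then accumulate within each fruit
  arrange(Date) %>%
  group_by(fruit) %>%
  mutate(cumsum_covered = cumsum(amount)) %>%
  ungroup()

# Stacked bars: the total bar height is the running total over all plants
p <- ggplot(df.sum, aes(Date, cumsum_covered, fill = fruit)) +
  geom_col(position = "stack")
```

Printing p draws the chart. Grouping by fruit rather than Date is what makes cumsum accumulate across dates.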
I have a dataset with 4 columns: client, date, sales and scale.
I am trying to figure out how to order the y axis (client) on a geom_tile plot based not on the default order of the levels but on the cumulative sum of sales for each client across all days.
The code below is an example. The clients are ordered 5, 4, 3, 2, 1, but I need them ordered by their total sales over all days.
data.frame(client=seq(1:5),date=Sys.Date()-0:05,sales=rnorm(30,300,100)) %>% mutate_if(is.numeric,round,0) %>% mutate(escale=cut(sales,breaks=c(0,100,200,300,1000),labels=c("0-100","100-200","200-300","+300"))) %>% ggplot(.,aes(x=date,y=client,fill=escale)) + geom_tile(colour="white",size=0.25)
Appreciate any help
Maybe a quick fix like this?
library(dplyr)
library(ggplot2)
df <- data.frame(client=seq(1:5),date=Sys.Date()-0:5,sales=rnorm(30,300,100)) %>%
  mutate_if(is.numeric,round,0) %>%
  mutate(escale=cut(sales,breaks=c(0,100,200,300,1000),
                    labels=c("0-100","100-200","200-300","+300")))
# we get the order according to total sales
o <- names(sort(tapply(df$sales,df$client,sum)))
# client is numeric, so convert it to a factor before using a discrete y scale
ggplot(df,aes(x=date,y=factor(client),fill=escale)) +
  geom_tile(colour="white",size=0.25)+
  # just manually set the y-axis here
  scale_y_discrete(name="client",limits=o)
Personally, I prefer to set the factor levels in the data.frame and then plot:
# we get the order according to total sales
o <- names(sort(tapply(df$sales,df$client,sum)))
df %>% mutate(client = factor(client,levels=o)) %>%
ggplot(.,aes(x=date,y=client,fill=escale)) +
geom_tile(colour="white",size=0.25)
Both give the same plot, with the clients ordered on the y axis by total sales.
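As a possible alternative, the same ordering can be sketched with forcats, assuming that package is available. fct_reorder with .fun = sum sets the factor levels by each client's total sales in one step. The data here are simulated like the example above, and fill = sales is used only to keep the sketch short.

```r
library(dplyr)
library(forcats)
library(ggplot2)

set.seed(1)  # simulated data, for illustration only
df <- data.frame(client = rep(1:5, each = 6),
                 date   = rep(Sys.Date() - 0:5, times = 5),
                 sales  = round(rnorm(30, 300, 100)))

# Levels of client are ordered by total sales (ascending)
df <- df %>%
  mutate(client = fct_reorder(factor(client), sales, .fun = sum))

p <- ggplot(df, aes(x = date, y = client, fill = sales)) +
  geom_tile(colour = "white")
```

Adding .desc = TRUE to fct_reorder would reverse the order.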
I currently have two dataframes and want to draw bar plots from both of them in one plot using ggplot. I want the average of the 'NEE' variable over different year ranges (1850-1950, 1951-2012, 2013-2100) from both dataframes, plotted side by side as in this green barplot visualization (https://ars.els-cdn.com/content/image/1-s2.0-S0048969716303424-fx1_lrg.jpg).
The header of the two dataframes is as follows (this is only a portion). The header is the same for both dataframes for the years 1850-1859:
How can I plot bars for, let's say, the years 1850-1852, 1854-1856 and 1857-1859 from both dataframes in one plot? I know the bars will be the same in this case because both data frames are similar, but I would like to get the idea so that I can edit the code for my desired years.
(Note that I have 39125 obs with 9 variables)
This is what I have done so far (following a solution posted by a member of this website). I produced the geom_col plots for data1 and data2 separately. But how can I merge them and plot the bars for 1850-1852, 1854-1856 and 1857-1859 from both dataframes side by side?
data1 %>%
# case_when lets us define yr_group based on Year:
mutate(yr_group = case_when(Year <= 1950 ~ "1850-1950",
Year <= 2012 ~ "1951-2012",
Year <= 2100 ~ "2013-2100",
TRUE ~ "Other range")) %>%
# For each location and year group, get the mean of all the columns:
group_by(Lon, Lat, yr_group) %>%
summarise_all(mean) %>%
# Plot the mean Total for each yr_group
ggplot(aes(yr_group, NEE)) + geom_col(position = "dodge") + theme_classic() +
xlab("Year") + ylab(ln) + labs(subtitle = "CCSM4 RCP2.6") +
geom_hline(yintercept=0, color = "black", size=1)
My preferred approach is usually to do the data summarization first and then send the output to ggplot. In this case, you might use dplyr from the tidyverse meta-package to add a variable relating to which time epoch a given year belongs to, and then collect the stats for that whole epoch.
For instance, just using your example data, we might group those years arbitrarily and find the averages for 1850-51, 1852-53, and 1854-55, and then display those next to each other:
library(tidyverse)
df %>%
# case_when lets us define yr_group based on Year:
mutate(yr_group = case_when(Year <= 1851 ~ "1850-51",
Year <= 1853 ~ "1852-53",
Year <= 1855 ~ "1854-55",
TRUE ~ "Other range")) %>%
# For each location and year group, get the mean of all the columns:
group_by(Lon, Lat, yr_group) %>%
summarise_all(mean) %>%
# Plot the mean Total for each yr_group
ggplot(aes(yr_group, Total)) + geom_col()
If you have multiple locations, you might use ggplot facets to display those separately, or use dodge within geom_col (equivalent to geom_bar(stat = "identity"), btw) to show the different locations next to each other.
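To put both data frames in one plot, one option is to stack them with an identifier column and dodge on that column. The sketch below assumes two data frames named data1 and data2 with Year and NEE columns. The values are simulated stand-ins, and the year ranges are the ones mentioned in the question.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-ins for the two data frames (values are made up)
set.seed(42)
data1 <- data.frame(Year = 1850:1859, NEE = rnorm(10))
data2 <- data.frame(Year = 1850:1859, NEE = rnorm(10))

combined <- bind_rows(data1 = data1, data2 = data2, .id = "source") %>%
  # Years outside the listed ranges (here 1853) become NA and are dropped
  mutate(yr_group = case_when(Year %in% 1850:1852 ~ "1850-1852",
                              Year %in% 1854:1856 ~ "1854-1856",
                              Year %in% 1857:1859 ~ "1857-1859")) %>%
  filter(!is.na(yr_group)) %>%
  group_by(source, yr_group) %>%
  summarise(NEE = mean(NEE), .groups = "drop")

# One bar per data frame within each year group, side by side
p <- ggplot(combined, aes(yr_group, NEE, fill = source)) +
  geom_col(position = "dodge")
```

The .id argument of bind_rows records which data frame each row came from, which is what fill = source uses to split the bars.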
I am trying to create a bar chart in ggplot2 that limits output on the x-axis to the top-10% most frequent categorical variables.
My dataframe is a dataset that contains statistics on personal loans. I am examining the relationship between two categories, Loan Status and Occupation.
First, I want to limit Loan Status to loans that have been "charged off." Next, I want to plot how many loans have been charged off across various occupations using a bar chart. There are 67 unique values for Occupation - I want to limit the plot to only the most frequent occupations (by integer or percentage, i.e. "7" or "10%" works).
In the code below, I am using the forcats function fct_infreq to order the bar chart by frequency in descending order. However, I cannot find a function to limit the number of x-axis categories. I have experimented with quantile, scale_x_discrete, etc. but those don't seem to work for categorical data.
Thanks for your help!
df %>% filter(LoanStatus %in% c("Chargedoff")) %>%
ggplot() +
geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
scale_x_discrete(limits = c(quantile(df$Occupation, 0.9), quantile(df$Occupation, 1)))
Resulting error:
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
UPDATE:
Using Yifu's answer below, I was able to get the desired output like this:
pd_occupation <- pd %>%
dplyr::filter(LoanStatus == "Chargedoff") %>%
group_by(Occupation) %>%
mutate(group_num = n())
table(pd_occupation$group_num)#to view the distribution
ggplot(subset(pd_occupation, group_num >= 361)) +
geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
ggtitle('Loan Charge-Offs by Occupation')
You can do it in dplyr instead:
# only use cars whose carb value appears at least 7 times to create a plot
mtcars %>%
  group_by(carb) %>%
  mutate(group_num = n()) %>%
  # you can substitute the number with a 10% percentile cutoff or whatever you want
  dplyr::filter(group_num >= 7) #%>%
  #ggplot()
  #create your plot
The idea is to filter the observations first and pass the result to ggplot, rather than filtering the data inside ggplot.
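A percentile cutoff can be computed from the per-category counts instead of hard-coding a number. This is only a sketch on mtcars, with carb standing in for Occupation. The 0.9 quantile is taken over the category counts, which is one possible reading of "top 10%".

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Number of rows per category
counts <- mtcars %>% count(carb, name = "group_num")

# Keep categories whose count reaches the 90th percentile of all counts
cutoff <- quantile(counts$group_num, 0.9)

top <- mtcars %>%
  add_count(carb, name = "group_num") %>%
  filter(group_num >= cutoff)

# Bars ordered by frequency, restricted to the most frequent categories
p <- ggplot(top, aes(fct_infreq(factor(carb)))) +
  geom_bar() +
  xlab("carb")
```

For a fixed number of categories rather than a percentile, slice_max(counts, group_num, n = 7) followed by a semi_join would be a comparable approach.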
RStudio (ggplot) question: I need to prepare a plot with age on the x-axis, with each subject represented by one dot per session (baseline and follow-up) and a line drawn between them (spaghetti plot), preferably sorted by age at baseline. Can anyone help me?
I want to plot the lines horizontally along the x-axis (from Age at Timepoint 1 to AgeTp2), and the y-axis can represent some index based on a sorted list of individuals based on AgeTp1 (so just a pile of horizontal lines, really)
Here is a simple example that you can modify to suit your purposes...
df <- data.frame(ID=c("A","A","B","B","C","C"),
age=c(20,25,22,27,21,28))
library(dplyr)
library(ggplot2)
# sort by first age for each ID
df <- df %>% group_by(ID) %>%
mutate(index=min(age)) %>%
ungroup() %>%
# dense_rank gives consecutive integers (1, 2, 3, ...) even though each ID has two rows
mutate(index=dense_rank(index))
ggplot(df,aes(x=age,y=index,colour=ID,group=ID))+
geom_point(size=4)+
geom_line(size=1)
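If the data are in wide format, with one row per subject, geom_segment can draw each horizontal line directly. The column names AgeTp1 and AgeTp2 follow the wording of the question, but the data frame itself is hypothetical, since the dataset image is not available.

```r
library(dplyr)
library(ggplot2)

# Hypothetical wide data: one row per subject
ages <- data.frame(ID     = c("A", "B", "C"),
                   AgeTp1 = c(20, 22, 21),
                   AgeTp2 = c(25, 27, 28))

# Sort by age at baseline and use the row position as the y index
ages <- ages %>%
  arrange(AgeTp1) %>%
  mutate(index = row_number())

p <- ggplot(ages) +
  geom_segment(aes(x = AgeTp1, xend = AgeTp2, y = index, yend = index)) +
  geom_point(aes(x = AgeTp1, y = index)) +
  geom_point(aes(x = AgeTp2, y = index)) +
  labs(x = "Age", y = "Subject (sorted by age at baseline)")
```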
I'm graphing a line plot in ggplot of numbers of migrants to a city over x years, based on country of origin. Each country is graphed as its own line, plotted on a graph against other countries, over a period of five years.
I want to order the legend by country from largest to smallest total sum of migrants over the x years, regardless of the total number of countries, instead of alphabetically as it is now.
I've tried using forcats commands such as fct_relevel, but haven't been able to find anything other than doing it manually, which can be time consuming for multiple graphs.
My data frame has variables year, country, and number_migrants, and each observation is a year-country pair.
library(tidyverse)
g <- ggplot(migrants, aes(x=year, y=number_migrants, col=country)) +
geom_line()
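You need fct_reorder. By default it orders by the median of the second argument, so pass .fun = sum to order by each country's total:
library(dplyr)
library(forcats)
library(ggplot2)
migrants %>%
  mutate(
    country = fct_reorder(country, number_migrants, .fun = sum, .desc = TRUE)
  ) %>%
  # the piped data frame is already the first argument, so do not pass migrants again
  ggplot(aes(x=year, y=number_migrants, col=country)) +
  geom_line()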
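A small check on toy data shows why the summary function matters when ordering by totals. The migrants data frame below is invented for illustration. fct_reorder uses the median unless .fun is given, and in this example the median and the sum produce different orders.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy data: totals are A = 120, B = 90, C = 150
migrants <- data.frame(
  year            = rep(2015:2017, times = 3),
  country         = rep(c("A", "B", "C"), each = 3),
  number_migrants = c(10, 10, 100,  30, 30, 30,  50, 50, 50)
)

by_sum <- migrants %>%
  mutate(country = fct_reorder(country, number_migrants,
                               .fun = sum, .desc = TRUE))

# Legend order follows the factor levels: largest total first
levels(by_sum$country)

p <- ggplot(by_sum, aes(x = year, y = number_migrants, col = country)) +
  geom_line()
```

With the default median, A would come last because its median (10) is the smallest, even though its total is larger than B's.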