X limits with continuous character values in R ggplot - r

I am creating a bar graph with continuous x-labels of 'Fiscal Years', such as "2009/10", "2010/11", etc. I have a column in my dataset with a specific Fiscal Year that I would like the x-labels to begin at (see example image below). Then, I would like the x-labels to be every continuous Fiscal Year until the present. The last x-label should be "2018/19". When I try to set the limits with scale_x_continuous, I receive an error of Error: Discrete value supplied to continuous scale. However, if I use 'scale_x_discrete', I get a graph with only two bars: my chosen "Start" date and the "End" of 2018/19.
Start<-Project_x$Start[c(1)]
End<-"2018/2019"
ggplot(Project_x, (aes(x=`FY`, y=Amount)), na.rm=TRUE)+
geom_bar(stat="identity", position="stack")+
scale_x_continuous(limits = c(Start,End))
` Error: Discrete value supplied to continuous scale `
Thank you.
My data is:
df <- data.frame(Project = c(5, 6, 5, 5, 9, 5),
FY = c("2010/11","2017/18","2012/13","2011/12","2003/04","2000/01"),
Start=c("2010/11", "2011/12", "2010/11", "2010/11", "2001/02", "2010/11"),
Amount = c(500,502,788,100,78,NA))
To use the code in the answer below, I need to base my Start_Year off of my Start column rather than the FY column, and the graph should just be for Project #5.
as.tibble(df) %>%
mutate(Start_Year = as.numeric(sub("/\\d{2}","",Start)))
xlabel_start<-subset(df$Start_Year, Project == 5)
xlabel_end<-2018
filter(between(Start_Year,xlabel_start,xlabel_end)) %>%
ggplot(aes(x = FY, y = Amount))+
geom_col()
When running this, my xlabel_start is NULL.

In ggplot2, continuous scales are reserved for numeric values. Your fiscal years are stored as character (or factor) values, so ggplot2 treats them as discrete and sorts them alphabetically.
One possible solution to get your expected plot is to create a new variable containing the starting year of the fiscal year and filter for values between 2010 and 2018.
But first, we are going to isolate the project and the starting year of interest by creating a new dataframe:
library(dplyr)
xlabel_start <- as_tibble(df) %>%
mutate(Start_Year = as.numeric(sub("/\\d{2}","",Start))) %>%
distinct(Project, Start_Year) %>%
filter(Project == 5)
# A tibble: 1 x 2
Project Start_Year
<dbl> <dbl>
1 5 2010
Now, using almost the same pipeline, we can isolate values of interest by
doing:
library(tidyverse)
xlabel_end <- 2018 # end year, as set in the question
as_tibble(df) %>%
mutate(Year = as.numeric(sub("/\\d{2}","",FY))) %>%
filter(Project == 5 & between(Year,xlabel_start$Start_Year,xlabel_end))
# A tibble: 3 x 5
Project FY Start Amount Year
<dbl> <fct> <fct> <dbl> <dbl>
1 5 2010/11 2010/11 500 2010
2 5 2012/13 2010/11 788 2012
3 5 2011/12 2010/11 100 2011
And once you have done this, you can simply add the ggplot plotting part at the end of this pipe sequence:
library(tidyverse)
as_tibble(df) %>%
mutate(Year = as.numeric(sub("/\\d{2}","",FY))) %>%
filter(Project == 5 & between(Year,xlabel_start$Start_Year,xlabel_end)) %>%
ggplot(aes(x = FY, y = Amount))+
geom_col()
Does this answer your question?
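If the x-axis should also show fiscal years that have no data (every year from the start year through 2018/19), one option is to generate the full set of labels and pass it to scale_x_discrete(limits = ...). The following is a minimal sketch, assuming the "YYYY/YY" label format from the question. fy_labels is a hypothetical helper written for this illustration, not a ggplot2 function, and the three rows are the Project 5 values from the filtered output above.

```r
library(ggplot2)

# Hypothetical helper: build "YYYY/YY" fiscal-year labels for a range of start years
fy_labels <- function(start_year, end_year) {
  years <- as.integer(seq(start_year, end_year))
  sprintf("%d/%02d", years, (years + 1L) %% 100L)
}

all_fy <- fy_labels(2010, 2018)  # "2010/11" through "2018/19"

# Project 5 rows that fall inside the range
df5 <- data.frame(
  FY = c("2010/11", "2012/13", "2011/12"),
  Amount = c(500, 788, 100)
)

# Discrete limits set both the values and their order on the axis,
# so fiscal years without rows still get an (empty) position
p <- ggplot(df5, aes(x = FY, y = Amount)) +
  geom_col() +
  scale_x_discrete(limits = all_fy)
```

Passing only c(Start, End) as discrete limits, as in the question, keeps just those two values, which would explain the two-bar plot.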

Related

Plot the change in mean of columns in r and change scale

I have a dataset; its first few rows are reproduced under "Data" below.
I would like to plot the change of the means of these columns in a line graph. I know I can find the individual mean of a column using mean(df$column), but I don't know how to graph these without a separate time variable, which I do not have. Additionally, the column names include dates, ranging from 2017-2050, and I would like to scale the x-axis so that each column mean appears at its date appropriately spaced from the others by time. For example, I would want the scale to start at 2017, have several closely spaced entries through 2020, and then be spaced out accordingly with each following column until 2050. I know I can change the scale in general with the xlim() function, but I don't know how to space the future ones out accordingly with the variable names. Any help would be appreciated!
Data:
dataset <- structure(list(tons_2017 = c(64.533, 3049.580, 1.609),
tons_2018 = c(65.613, 3100.588, 1.636),
tons_2019 = c(68.331, 3229.061, 1.704),
tons_2020 = c(68.816, 3251.973, 1.716),
tons_2022 = c(73.408, 3493.93, 1.755),
tons_2023 = c(75.368, 3567.198, 1.743),
tons_2025 = c(88.289, 4052.954, 1.756),
tons_2030 = c(106.873, 4749.285, 1.896),
tons_2035 = c(126.056, 5361.734, 1.954),
tons_2040 = c(152.926, 6272.844, 2.149),
tons_2045 = c(186.799, 7393.864, 2.428),
tons_2050 = c(219.586, 8429.251, 2.650)),
row.names = c(NA, 3L),
class = "data.frame")
EDITED: based on comments
I think what you need to do is reshape the data from "wide" to "long" form, convert the column names into numeric values, then group by those values to calculate the means.
Something like this:
library(tidyverse)
dataset %>%
select(starts_with("tons_")) %>%
pivot_longer(everything()) %>%
mutate(name = as.numeric(gsub("tons_", "", name))) %>%
group_by(name) %>%
summarise(meanVal = mean(value)) %>%
ggplot(aes(name, meanVal)) +
geom_line()
After the summarise step, the data looks like this:
# A tibble: 12 × 2
name meanVal
<dbl> <dbl>
1 2017 1039.
2 2018 1056.
3 2019 1100.
4 2020 1108.
5 2022 1190.
6 2023 1215.
7 2025 1381.
8 2030 1619.
9 2035 1830.
10 2040 2143.
11 2045 2528.
12 2050 2884.
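The resulting chart is a line plot of meanVal against year, with the points spaced along the x-axis according to their actual dates.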
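The same idea can be cross-checked in base R without reshaping. This is only a sketch on three of the columns from the question: it strips the "tons_" prefix to get numeric years and uses colMeans() for the per-column means.

```r
# Three of the columns from the question, enough to show the idea
dataset <- data.frame(
  tons_2017 = c(64.533, 3049.580, 1.609),
  tons_2020 = c(68.816, 3251.973, 1.716),
  tons_2050 = c(219.586, 8429.251, 2.650)
)

# Turn the column names into numeric years
years <- as.numeric(sub("tons_", "", names(dataset)))

# One mean per column
means <- colMeans(dataset)

# A numeric x places each mean at its actual year,
# so unequal gaps in time appear as unequal gaps on the axis
plot(years, means, type = "b", xlab = "Year", ylab = "Mean tons")
```

Because the x values are numbers rather than labels, no extra spacing logic is needed.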

Assign variables in groups based on fractions and several conditions

I've tried for several days on something I think should be rather simple, with no luck. Hope someone can help me!
I have a data frame called "test" with the following variables: "Firm", "Year", "Firm_size" and "Expenditures".
I want to assign firms to size groups by year and then display the mean, median, std.dev and N of expenditures for these groups in a table (e.g. stargazer). So the first size group (top 10% largest firms) should show the mean, median ++ of expenditures for the 10% largest firms each year.
The size groups should be,
The 10% largest firms
The firms that are between 10-25% largest
The firms that are between 25-50% largest
The firms that are between 50-75% largest
The firms that are between 75-90% largest
The 10% smallest firms
This is what I have tried:
test<-arrange(test, -Firm_size)
test$Variable = 0
test[1:min(5715, nrow(test)),]$Variable <- "Expenditures, 0% size <10%"
test[5715:min(14288, nrow(test)),]$Variable <- "Expenditures, 10% size <25%"
test[14288:min(28577, nrow(test)),]$Variable <- "Expenditures, 25% size <50%"
--> And so on
library(dplyr)
testtest = test%>%
group_by(Variable)%>%
dplyr::summarise(
Mean=mean(Expenditures),
Median=median(Expenditures),
Std.dev=sd(Expenditures),
N=n()
)
stargazer(testtest, type = "text", title = "Expenditures firms", digits = 1, summary = FALSE)
As shown above, I don't know how to group by fractions or percentages. I have therefore tried to assign firms to groups by row position after sorting Firm_size in descending order. The problem is that this ignores the year, which I need to take into account, and repeating it for each year (20 in total) is a lot of work.
My intention was to make a new variable which gives each size group a name. E.g. the top 10% largest firms each year should get a variable with the name "Expenditures, 0% size <10%".
I then build a new data frame "testtest" in which I calculate the different measures, before using stargazer to present it. This part works.
!!EDIT!!
Hi again,
Now I get the error "List object cannot be coerced to type double" when running the code on a new dataset (but it is the same variables as before).
The mutate-step I'm referring to is the "mutate(gs = cut ++" after "rowwise()" in the solution you provided.
You can create the quantiles as a nested variable (size_groups), and then use cut() to create the group sizes (gs). Then group by Year and gs to summarize the indicators you want.
test %>%
group_by(Year) %>%
mutate(size_groups = list(quantile(Firm_size, probs=c(.1,.25,.5,.75,.9)))) %>%
rowwise() %>%
mutate(gs = cut(
Firm_size,c(-Inf, size_groups, Inf),
labels = c("Lowest 10%","10%-25%","25%-50%","50%-75%","75%-90%","Highest 10%"))) %>%
group_by(Year, gs) %>%
summarize(across(Expenditures,.fns = list(mean,median,sd,length)), .groups="drop") %>%
rename_all(~c("Year", "Group_Size", "Mean_Exp", "Med_Exp", "SD_Exp","N_Firms"))
Output:
# A tibble: 126 x 6
Year Group_Size Mean_Exp Med_Exp SD_Exp N_Firms
<int> <fct> <dbl> <dbl> <dbl> <int>
1 2000 Lowest 10% 20885. 21363. 3710. 3
2 2000 10%-25% 68127. 69497. 19045. 4
3 2000 25%-50% 42035. 35371. 30335. 6
4 2000 50%-75% 36089. 29802. 17724. 6
5 2000 75%-90% 53319. 54914. 19865. 4
6 2000 Highest 10% 57756. 49941. 34162. 3
7 2001 Lowest 10% 55945. 47359. 28283. 3
8 2001 10%-25% 61825. 70067. 21777. 4
9 2001 25%-50% 65088. 76340. 29960. 6
10 2001 50%-75% 57444. 53495. 32458. 6
# ... with 116 more rows
If you wanted to have an additional column with the yearly mean, you can remove the .groups="drop" from the summarize(across()) line, and then add this final line to the pipeline:
mutate(YrMean = sum(Mean_Exp*N_Firms/sum(N_Firms)))
Note that this is correctly weighted by the number of Firms in each Group_size, and thus returns the equivalent of doing this with the original data
test %>% group_by(Year) %>% summarize(mean(Expenditures))
Input Data:
set.seed(123)
test = data.frame(
Firm = replicate(2000, sample(letters,1)),
Year = sample(2000:2020, 2000, replace=T),
Firm_size= ceiling(runif(2000,2000,5000)),
Expenditures = runif(2000, 10000,100000)
) %>% group_by(Firm,Year) %>% slice_head(n=1)
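A variant that avoids the list column and rowwise() is to compute the quantile breaks directly inside a grouped mutate(), since quantile() is then evaluated once per Year. This is a sketch on made-up data (the firms data frame below is an assumption for illustration). Whether it sidesteps the coercion error mentioned in the edit has not been checked against the asker's data, and cut() will still fail if two quantiles coincide.

```r
library(dplyr)

# Made-up data: 100 firms in each of two years
set.seed(1)
firms <- data.frame(
  Year = rep(2000:2001, each = 100),
  Firm_size = runif(200, 2000, 5000),
  Expenditures = runif(200, 10000, 100000)
)

size_labels <- c("Lowest 10%", "10%-25%", "25%-50%",
                 "50%-75%", "75%-90%", "Highest 10%")

grouped <- firms %>%
  group_by(Year) %>%
  # quantile() sees only the current year's Firm_size values
  mutate(gs = cut(
    Firm_size,
    breaks = c(-Inf, quantile(Firm_size, probs = c(.1, .25, .5, .75, .9)), Inf),
    labels = size_labels
  )) %>%
  group_by(Year, gs) %>%
  summarise(
    Mean_Exp = mean(Expenditures),
    Med_Exp = median(Expenditures),
    SD_Exp = sd(Expenditures),
    N_Firms = n(),
    .groups = "drop"
  )
```

The result has one row per Year and size group, in the same shape as the output above.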

geom_vline doesn't work after the scale_x_discrete in R

I am a newbie here, sorry if the question is not written well :p
1. The aim is to plot the mean NDVI value over a time period (8 dates were chosen from 2019-05 to 2019-10) for my study site (named RB1), and to draw vertical lines marking the dates of grass cutting events.
2. I have calculated the NDVI value for these 8 chosen dates and saved them in a CSV file.
(PS: "cutting" means the grassland on the study site was cut on that date, so the corresponding dates should be shown as vertical lines using geom_vline.)
infor <- read_csv("plotting information.csv")
infor
# A tibble: 142 x 3
date NDVI cutting
<date> <dbl> <lgl>
1 2019-05-12 NA NA
2 2019-05-13 NA NA
3 2019-05-14 NA NA
4 2019-05-15 NA NA
5 2019-05-16 NA NA
6 2019-05-17 0.787 TRUE
# ... with 132 more rows
3. The problem is with the ggplot call. First, I want the x-axis to span the whole time period (2019-05 to 2019-10) without labelling every date in between, because that would put far too many dates on the axis. So I use scale_x_discrete(breaks=, labels=) to label only the dates that have NDVI values.
Second, I also want to mark the dates on which the grass was cut with geom_vline.
BUT it seems that scale_x_discrete requires my date to be a factor, while geom_vline requires the date to stay numeric.
These two requirements seem to contradict each other.
y1 <- ggplot(infor, aes(factor(date), NDVI, group = 1)) +
geom_point() +
geom_line(data=infor[!is.na(infor$NDVI),]) +
scale_x_discrete(breaks = c("2019-05-17", "2019-06-18", "2019-06-26", "2019-06-28","2019-07-23","2019-07-28", "2019-08-27","2019-08-30", "2019-09-21"),
labels = c("0517","0618","0626","0628","0723","0728", "0827","0830","0921"))
y2 <- ggplot(infor, aes(date, NDVI, group = 1)) +
geom_point() +
geom_line(data=infor[!is.na(infor$NDVI),])
When I add geom_vline to y1, the vertical lines do not show on my plot:
y1 + geom_vline
When I add it to y2, the vertical lines are shown, but the dates on the x-axis look wrong (they are not labelled as in y1, because scale_x_ is not applied here):
y2 + geom_vline
y1 +
geom_vline(data=filter(infor,cutting == "TRUE"), aes(xintercept = as.numeric(date)), color = "red", linetype ="dashed")
Would be appreciated if you can help!
thanks in advance! :D
I agree with the comment about leaving dates as dates. In this case, you can specify the x-intercept of geom_vline as a date.
Given basic data:
library(tidyverse)
df <- tribble(
~Date, ~Volume, ~Cut,
'1-1-2010', 123456, 'FALSE',
'5-1-2010', 789012, 'TRUE',
'9-1-2010', 5858585, 'TRUE',
'12-31-2010', 2543425, 'FALSE'
)
I set the date and then pull the subset for Cut=='TRUE' into a new object:
df <- mutate(df, Date = lubridate::mdy(Date))
d2 <- filter(df, Cut == 'TRUE') %>% pull(Date)
And finally use the object to specify intercepts:
df %>%
ggplot(aes(x = Date, y = Volume)) +
geom_vline(xintercept = d2) +
geom_line()
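Applied to the question's layout, keeping date as a Date also lets scale_x_date() label only the observation dates in the "0517" style. This is a sketch on a small made-up stand-in for infor: only the first NDVI value (0.787 on 2019-05-17) comes from the question, and the second observation is invented for illustration.

```r
library(ggplot2)

# Stand-in for the question's data: daily rows, mostly NA
infor <- data.frame(
  date = as.Date("2019-05-12") + 0:9,
  NDVI = c(NA, NA, NA, NA, NA, 0.787, NA, NA, 0.74, NA),  # 0.74 is invented
  cutting = c(NA, NA, NA, NA, NA, TRUE, NA, NA, FALSE, NA)
)

obs <- infor[!is.na(infor$NDVI), ]            # rows that have an NDVI value
cut_dates <- infor$date[which(infor$cutting)] # dates of cutting events (NA-safe)

p <- ggplot(obs, aes(date, NDVI)) +
  geom_point() +
  geom_line() +
  geom_vline(xintercept = cut_dates, colour = "red", linetype = "dashed") +
  # label only the observation dates, formatted as month and day
  scale_x_date(breaks = obs$date, date_labels = "%m%d",
               limits = range(infor$date))
```

The axis still spans the whole period through limits, while breaks controls which dates are labelled.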

ggplot: Y-axis as count of specific value

I am trying to create a bar chart of how many training courses my employees have completed. To do this, I have a data frame called iud, where each row is a distinct course they have begun taking:
name percent
<chr> <dbl>
1 Nardo 41.7
2 Nardo 0
3 Nardo 4.59
4 Nardo 100
...
I am trying to use ggplot to create a bar chart where the y-axis is a count of the number of instances where percent is equal to 100. (So for the data above, Nardo's bar would be at 1.) I am currently using this:
cpu <- ggplot(iud, aes(name)) +
geom_bar(data=subset(iud,percent=="100"), stat = "count") +
scale_y_continuous(breaks = seq(0,15,1))
The chart looks correct, but it does not include bars where the count is 0 (employees who have begun, but not completed, a course are not included on the chart).
Is there a better way to do this so that all employees are charted, including ones whose y-axis value would be 0?
I think it is easiest to pre-process the data, to count the number of 100% first.
library(tidyverse)
df %>% group_by(name) %>%
summarise(n = sum(percent == 100)) %>%
ggplot(aes(x = name, y = n)) +
geom_col()
#data
library(readr)
df <- read_delim("name percent
Nardo 41.7
Nardo 0
Nardo 4.59
Nardo 100
Ardi 45", delim = " ")
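Another option that stays within geom_bar() is the weight aesthetic: the count stat sums the weights for each x value, so a weight of 1 for completed courses and 0 otherwise keeps every name on the axis. This is a sketch using the sample rows above, with the data frame named iud as in the question.

```r
library(ggplot2)

iud <- data.frame(
  name = c("Nardo", "Nardo", "Nardo", "Nardo", "Ardi"),
  percent = c(41.7, 0, 4.59, 100, 45)
)

# weight is 1 for a completed course and 0 otherwise;
# geom_bar() sums the weights per name instead of counting rows
p <- ggplot(iud, aes(x = name, weight = as.numeric(percent == 100))) +
  geom_bar() +
  ylab("Completed courses")
```

No rows are dropped before plotting, so names with zero completions keep their place on the x-axis.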

ggplot: Generate facet grid plot with multiple series

I have following data frame:
Quarter x y p q
1 2001 8.714392 8.714621 3.3648435 3.3140090
2 2002 8.671171 8.671064 0.9282508 0.9034387
3 2003 8.688478 8.697413 6.2295996 8.4379698
4 2004 8.685339 8.686349 3.7520135 3.5278024
My goal is to generate a facet plot where x and y column in one plot in the facet and p,q together in another plot instead of 4 facets.
If I do following:
x.df.melt <- melt(x.df[,c('Quarter','x','y','p','q')],id.vars=1)
ggplot(x.df.melt, aes(Quarter, value, col=variable, group=1)) + geom_line()+
facet_grid(variable~., scale='free_y') +
scale_color_discrete(breaks=c('x','y','p','q'))
I get all four series in 4 different facets, but how do I combine x and y in one facet and p and q in another? Preferably with no legends.
One idea would be to create a new grouping variable:
x.df.melt$var <- ifelse(x.df.melt$variable == "x" | x.df.melt$variable == "y", "A", "B")
You can use it for facetting while using variable for grouping:
ggplot(x.df.melt, aes(Quarter, value, col=variable, group=variable)) + geom_line()+
facet_grid(var~., scales='free_y') +
scale_color_discrete(breaks=c('x','y','p','q'), guide = "none")
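For a version that runs without reshape2, the same grouping-variable idea can be sketched with tidyr::pivot_longer(). The data are the four rows from the question; the column name panel and its labels are choices made for this illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

x.df <- data.frame(
  Quarter = 2001:2004,
  x = c(8.714392, 8.671171, 8.688478, 8.685339),
  y = c(8.714621, 8.671064, 8.697413, 8.686349),
  p = c(3.3648435, 0.9282508, 6.2295996, 3.7520135),
  q = c(3.3140090, 0.9034387, 8.4379698, 3.5278024)
)

long <- x.df %>%
  pivot_longer(-Quarter, names_to = "variable", values_to = "value") %>%
  # panel decides which facet a series lands in; the factor levels fix the facet order
  mutate(panel = factor(if_else(variable %in% c("x", "y"), "x, y", "p, q"),
                        levels = c("x, y", "p, q")))

plt <- ggplot(long, aes(Quarter, value, colour = variable)) +
  geom_line() +
  facet_grid(panel ~ ., scales = "free_y") +
  theme(legend.position = "none")
```

Mapping colour to variable gives one line per series, while panel alone controls the faceting.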
I think beetroot's answer above is more elegant but I was working on the same problem and arrived at the same place a different way. I think it is interesting because I used a "double melt" (yum!) to line up the x,y/p,q pairs. Also, it demonstrates tidyr::gather instead of melt.
library(tidyr)
library(dplyr)
library(ggplot2)
x.df<- data.frame(Year=2001:2004,
x=runif(4,8,9),y=runif(4,8,9),
p=runif(4,3,9),q=runif(4,3,9))
x.df.melt<-gather(x.df,"item","item_val",-Year,-p,-q) %>%
group_by(item,Year) %>%
gather("comparison","comp_val",-Year,-item,-item_val) %>%
filter((item=="x" & comparison=="p")|(item=="y" & comparison=="q"))
> x.df.melt
# A tibble: 8 x 5
# Groups: item, Year [8]
Year item item_val comparison comp_val
<int> <chr> <dbl> <chr> <dbl>
1 2001 x 8.400538 p 5.540549
2 2002 x 8.169680 p 5.750010
3 2003 x 8.065042 p 8.821890
4 2004 x 8.311194 p 7.714197
5 2001 y 8.449290 q 5.471225
6 2002 y 8.266304 q 7.014389
7 2003 y 8.146879 q 7.298253
8 2004 y 8.960238 q 5.342702
See below for the plotting statement.
One weakness of this approach (and of beetroot's use of ifelse) is that the filter statement quickly becomes unwieldy if you have a lot of pairs to compare. In my use case I was comparing mutual fund performance to a number of benchmark indices, and each fund has a different benchmark. I solved this with a table of metadata that pairs the fund tickers with their respective benchmarks, then used left_join/right_join. In this case:
#create meta data
pair_data<-data.frame(item=c("x","y"),comparison=c("p","q"))
#create comparison name for each item name
x.df.melt2<-x.df %>% gather("item","item_val",-Year) %>%
left_join(pair_data)
#join comparison data alongside item data
x.df.melt2<-x.df.melt2 %>%
select(Year,item,item_val) %>%
rename(comparison=item,comp_val=item_val) %>%
right_join(x.df.melt2,by=c("Year","comparison")) %>%
na.omit() %>%
group_by(item,Year)
ggplot(x.df.melt2,aes(Year,item_val,color="item"))+geom_line()+
geom_line(aes(y=comp_val,color="comp"))+
guides(col = guide_legend(title = NULL))+
ylab("Value")+
facet_grid(~item)
Since there is no need for a new grouping variable, we preserve the names of the reference items as labels for the facet plot.
