Plot the change in mean of columns in r and change scale - r

I have a dataset with the first few rows shown below:
dataset
I would like to plot the change of the means of these columns in a line graph. I know I can find the individual mean of a column using mean(df$column), but I don't know how to graph these without a separate time variable, which I do not have. Additionally, the column names include dates, ranging from 2017-2050, and I would like to scale the x-axis so that each column mean appears at its date appropriately spaced from the others by time. For example, I would want the scale to start at 2017, have several closely spaced entries through 2020, and then be spaced out accordingly with each following column until 2050. I know I can change the scale in general with the xlim() function, but I don't know how to space the future ones out accordingly with the variable names. Any help would be appreciated!
Data:
dataset <- structure(list(tons_2017 = c(64.533, 3049.580, 1.609),
tons_2018 = c(65.613, 3100.588, 1.636),
tons_2019 = c(68.331, 3229.061, 1.704),
tons_2020 = c(68.816, 3251.973, 1.716),
tons_2022 = c(73.408, 3493.93, 1.755),
tons_2023 = c(75.368, 3567.198, 1.743),
tons_2025 = c(88.289, 4052.954, 1.756),
tons_2030 = c(106.873, 4749.285, 1.896),
tons_2035 = c(126.056, 5361.734, 1.954),
tons_2040 = c(152.926, 6272.844, 2.149),
tons_2045 = c(186.799, 7393.864, 2.428),
tons_2050 = c(219.586, 8429.251, 2.650)),
row.names = c(NA, 3L),
class = "data.frame")

EDITED: based on comments
I think what you need to do is reshape the data from "wide" to "long" form, convert the column names into numeric values, then group by those values to calculate the means.
Something like this:
library(tidyverse)
dataset %>%
select(starts_with("tons_")) %>%
pivot_longer(everything()) %>%
mutate(name = as.numeric(gsub("tons_", "", name))) %>%
group_by(name) %>%
summarise(meanVal = mean(value)) %>%
ggplot(aes(name, meanVal)) +
geom_line()
After the summarise step, the data looks like this:
# A tibble: 12 × 2
name meanVal
<dbl> <dbl>
1 2017 1039.
2 2018 1056.
3 2019 1100.
4 2020 1108.
5 2022 1190.
6 2023 1215.
7 2025 1381.
8 2030 1619.
9 2035 1830.
10 2040 2143.
11 2045 2528.
12 2050 2884.
And the chart looks like this:
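The same line can be drawn without the tidyverse. A minimal base R sketch, assuming the column names all follow the tons_YYYY pattern from the question; only three of the twelve columns are reproduced here to keep it short:

```r
# Subset of the question's data (three of the twelve columns), for illustration
dataset <- data.frame(
  tons_2017 = c(64.533, 3049.580, 1.609),
  tons_2020 = c(68.816, 3251.973, 1.716),
  tons_2050 = c(219.586, 8429.251, 2.650)
)

# One mean per column, named after the column
means <- colMeans(dataset)

# Strip the "tons_" prefix so the years become numbers
years <- as.numeric(sub("tons_", "", names(means)))

# A numeric x-axis spaces the points by their actual distance in time
plot(years, means, type = "b", xlab = "Year", ylab = "Mean tons")
```

Because years is numeric rather than character, the gap between 2020 and 2050 is drawn ten times wider than the gap between 2017 and 2020, which is the spacing the question asks for.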

Related

Creating a boxplot from two dataframes

I have two separate data frames, each representing a feature (activity and sleep) and the number of days on which that feature was recorded for each id number. The number of days should go on the y-axis and the feature itself on the x-axis.
I managed to draw the boxplots separately, and the outliers show clearly, especially for one of the sets. However, when I place the two boxplots next to each other, the outliers do not show up clearly. Also, how do I get the names of the two features (activity and sleep) on my x-axis?
The dataframe for the "sleep" feature:
head(idday)
A tibble: 6 x 2
id days
<dbl> <int>
1 1503960366 25
2 1644430081 4
3 1844505072 3
4 1927972279 5
5 2026352035 28
6 2320127002 1
The dataframe for the "activity" feature:
head(iddaya)
A tibble: 6 x 2
id days
<dbl> <int>
1 1503960366 31
2 1624580081 31
3 1644430081 30
4 1844505072 31
5 1927972279 31
6 2022484408 31
My attempt for sleep:
ggplot(idday, aes(y = days), boxwex = 0.05) +
stat_boxplot(geom = "errorbar",
width = 0.2) +
geom_boxplot(alpha=0.9, outlier.color="red")
and for activity:
ggplot(iddaya, aes(y = days), boxwex = 0.05) +
stat_boxplot(geom = "errorbar",
width = 0.2) +
geom_boxplot(alpha=0.9, outlier.color="red")
I then combined them:
boxplot(summary(idday$days), summary(iddaya$days))
In this final image the outliers do not show clearly, and I want to name my x-axis and y-axis.
There are several ways to achieve your task. One way could be:
If your dataframes are called df_sleep and df_activity, we can combine them in a named list, add a new column feature, and then plot:
df_sleep
df_activity
library(tidyverse)
bind_rows(list(sleep = df_sleep, activity = df_activity), .id = 'feature') %>%
ggplot(aes(x = feature, y=days, fill=feature))+
geom_boxplot()
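The base R call from the question can also be made to work. boxplot(summary(idday$days), summary(iddaya$days)) draws boxes of the six summary numbers rather than of the data, which is probably why the outliers disappear. A minimal sketch, assuming the first six rows from the question's head() output stand in for the full data:

```r
# First six rows from the question's head() output, for illustration only
idday  <- data.frame(days = c(25, 4, 3, 5, 28, 1))       # sleep
iddaya <- data.frame(days = c(31, 31, 30, 31, 31, 31))   # activity

# A named list gives one box per element and uses the names as x-axis labels
bp <- boxplot(
  list(sleep = idday$days, activity = iddaya$days),
  xlab = "Feature", ylab = "Days recorded",
  outcol = "red"   # colour of the outlier points
)

# The returned list holds the statistics behind each box
bp$names
bp$stats[3, ]   # medians
```

Passing the raw days vectors, not their summaries, lets boxplot() compute the whiskers and outliers itself.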
If you want to compare these two boxplots with each other, I recommend using the same range for your y-axis. To achieve this, you first have to combine both data frames. You can do this with inner_join() from the dplyr package. Note that inner_join() keeps only the ids that appear in both data frames.
data_combined <- inner_join(idday, iddaya,
by = "id",
suffix = c("_sleep", "_activity"))
Then you need to transform your data frame into long-format with pivot_longer() from the tidyr package:
data_combined_long <- data_combined %>%
pivot_longer(days_sleep:days_activity,
names_to = "features",
names_prefix = "days_",
values_to = "days")
After that you can again use ggplot() to create your boxplot. But now you have to define that you want your x-axis to represent your features:
ggplot(data_combined_long, aes(y = days, x = features), boxwex = 0.05)+
stat_boxplot(geom = "errorbar",
width = 0.5) +
geom_boxplot(alpha=0.9, outlier.color="red")
Your plot should then look like this:

Assign variables in groups based on fractions and several conditions

I've tried for several days on something I think should be rather simple, with no luck. Hope someone can help me!
I have a data frame called "test" with the following variables: "Firm", "Year", "Firm_size" and "Expenditures".
I want to assign firms to size groups by year and then display the mean, median, std.dev and N of expenditures for these groups in a table (e.g. stargazer). So the first size group (top 10% largest firms) should show the mean, median ++ of expenditures for the 10% largest firms each year.
The size groups should be,
The 10% largest firms
The firms that are between 10-25% largest
The firms that are between 25-50% largest
The firms that are between 50-75% largest
The firms that are between 75-90% largest
The 10% smallest firms
This is what I have tried:
test<-arrange(test, -Firm_size)
test$Variable = 0
test[1:min(5715, nrow(test)),]$Variable <- "Expenditures, 0% size <10%"
test[5715:min(14288, nrow(test)),]$Variable <- "Expenditures, 10% size <25%"
test[14288:min(28577, nrow(test)),]$Variable <- "Expenditures, 25% size <50%"
--> And so on
library(dplyr)
testtest = test%>%
group_by(Variable)%>%
dplyr::summarise(
Mean=mean(Expenditures),
Median=median(Expenditures),
Std.dev=sd(Expenditures),
N=n()
)
stargazer(testtest, type = "text", title = "Expenditures firms", digits = 1, summary = FALSE)
As shown above, I don't know how to use fractions or group by percentage. I have therefore tried to assign firms to groups based on their row positions after arranging Firm_size in descending order. The problem with doing so is that I don't take year into consideration, which I need to, and it is a lot of work to do this for each year (20 in total).
My intention was to make a new variable which gives each size group a name. E.g. the top 10% largest firms each year should get a variable with the name "Expenditures, 0% size <10%".
I then make a new dataframe "testtest" where I calculate the different measures, before using stargazer to present it. This works.
!!EDIT!!
Hi again,
Now I get the error "List object cannot be coerced to type double" when running the code on a new dataset (but it is the same variables as before).
The mutate-step I'm referring to is the "mutate(gs = cut ++" after "rowwise()" in the solution you provided.
You can create the quantiles as a nested variable (size_groups), and then use cut() to create the group sizes (gs). Then group by Year and gs to summarize the indicators you want.
test %>%
group_by(Year) %>%
mutate(size_groups = list(quantile(Firm_size, probs=c(.1,.25,.5,.75,.9)))) %>%
rowwise() %>%
mutate(gs = cut(
Firm_size,c(-Inf, size_groups, Inf),
labels = c("Lowest 10%","10%-25%","25%-50%","50%-75%","75%-90%","Highest 10%"))) %>%
group_by(Year, gs) %>%
summarize(across(Expenditures,.fns = list(mean,median,sd,length)), .groups="drop") %>%
rename_all(~c("Year", "Group_Size", "Mean_Exp", "Med_Exp", "SD_Exp","N_Firms"))
Output:
# A tibble: 126 x 6
Year Group_Size Mean_Exp Med_Exp SD_Exp N_Firms
<int> <fct> <dbl> <dbl> <dbl> <int>
1 2000 Lowest 10% 20885. 21363. 3710. 3
2 2000 10%-25% 68127. 69497. 19045. 4
3 2000 25%-50% 42035. 35371. 30335. 6
4 2000 50%-75% 36089. 29802. 17724. 6
5 2000 75%-90% 53319. 54914. 19865. 4
6 2000 Highest 10% 57756. 49941. 34162. 3
7 2001 Lowest 10% 55945. 47359. 28283. 3
8 2001 10%-25% 61825. 70067. 21777. 4
9 2001 25%-50% 65088. 76340. 29960. 6
10 2001 50%-75% 57444. 53495. 32458. 6
# ... with 116 more rows
If you wanted to have an additional column with the yearly mean, you can remove the .groups="drop" from the summarize(across()) line, and then add this final line to the pipeline:
mutate(YrMean = sum(Mean_Exp*N_Firms/sum(N_Firms)))
Note that this is correctly weighted by the number of firms in each Group_Size, and thus returns the equivalent of doing this with the original data:
test %>% group_by(Year) %>% summarize(mean(Expenditures))
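The weighting claim can be checked on a toy example. This is a minimal base R sketch with made-up numbers (x and g are hypothetical, not part of the question's data): the group means weighted by group size reproduce the overall mean.

```r
# Hypothetical expenditures and group labels for a single year
x <- c(2, 4, 6, 10, 20)
g <- c("a", "a", "b", "b", "b")

group_mean <- tapply(x, g, mean)     # mean per group
group_n    <- tapply(x, g, length)   # number of observations per group

# Same formula as the YrMean line: weight each group mean by its share of rows
weighted <- sum(group_mean * group_n / sum(group_n))

all.equal(weighted, mean(x))   # TRUE
```

An unweighted mean of the group means would differ whenever the groups have different sizes, which is the case here because the size groups cover 10%, 15% and 25% of the firms.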
Input Data:
set.seed(123)
test = data.frame(
Firm = replicate(2000, sample(letters,1)),
Year = sample(2000:2020, 2000, replace=T),
Firm_size= ceiling(runif(2000,2000,5000)),
Expenditures = runif(2000, 10000,100000)
) %>% group_by(Firm,Year) %>% slice_head(n=1)
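To see what the quantile() and cut() step does in isolation, here is a minimal base R sketch for a single year. The vector firm_size is hypothetical (the integers 1 to 100), chosen so the group counts are easy to verify by hand:

```r
firm_size <- 1:100   # hypothetical firm sizes for one year

# Inner cut points at the 10th, 25th, 50th, 75th and 90th percentiles
cuts <- quantile(firm_size, probs = c(.1, .25, .5, .75, .9))

# -Inf and Inf close the outer bins so that every firm lands in a group
gs <- cut(
  firm_size,
  breaks = c(-Inf, cuts, Inf),
  labels = c("Lowest 10%", "10%-25%", "25%-50%", "50%-75%", "75%-90%", "Highest 10%")
)

table(gs)   # 10, 15, 25, 25, 15 and 10 firms
```

One caveat: if many firms share the same size, two quantiles can coincide and cut() stops with a "'breaks' are not unique" error, so real data may need the ties handled first.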

X limits with continuous character values in R ggplot

I am creating a bar graph with continuous x-labels of 'Fiscal Years', such as "2009/10", "2010/11", etc. I have a column in my dataset with a specific Fiscal Year that I would like the x-labels to begin at (see example image below). Then, I would like the x-labels to be every continuous Fiscal Year until the present. The last x-label should be "2018/19". When I try to set the limits with scale_x_continuous, I receive an error of Error: Discrete value supplied to continuous scale. However, if I use 'scale_x_discrete', I get a graph with only two bars: my chosen "Start" date and the "End" of 2018/19.
Start<-Project_x$Start[c(1)]
End<-"2018/2019"
ggplot(Project_x, (aes(x=`FY`, y=Amount)), na.rm=TRUE)+
geom_bar(stat="identity", position="stack")+
scale_x_continuous(limits = c(Start,End))
Error: Discrete value supplied to continuous scale
Thank you.
My data is:
df <- data.frame(Project = c(5, 6, 5, 5, 9, 5),
FY = c("2010/11","2017/18","2012/13","2011/12","2003/04","2000/01"),
Start=c("2010/11", "2011/12", "2010/11", "2010/11", "2001/02", "2010/11"),
Amount = c(500,502,788,100,78,NA))
To use the code in the answer below, I need to base my Start_Year off of my Start column rather than the FY column, and the graph should just be for Project #5.
as.tibble(df) %>%
mutate(Start_Year = as.numeric(sub("/\\d{2}","",Start)))
xlabel_start<-subset(df$Start_Year, Project == 5)
xlabel_end<-2018
filter(between(Start_Year,xlabel_start,xlabel_end)) %>%
ggplot(aes(x = FY, y = Amount))+
geom_col()
When running this, my xlabel_start is NULL.
In ggplot2, continuous scales are reserved for numeric values. Here, your fiscal years are stored as character (or factor) values, so they are treated as discrete and sorted alphabetically by ggplot2.
One possible solution to get your expected plot is to create a new variable containing the starting year of the fiscal year and filter for values between 2010 and 2018.
But first, we are going to isolate the project and the starting year of interest by creating a new dataframe:
library(dplyr)
# as_tibble() replaces the deprecated as.tibble()
xlabel_start <- as_tibble(df) %>%
mutate(Start_Year = as.numeric(sub("/\\d{2}","",Start))) %>%
distinct(Project, Start_Year) %>%
filter(Project == 5)
# A tibble: 1 x 2
Project Start_Year
<dbl> <dbl>
1 5 2010
Now, using almost the same pipeline, we can isolate values of interest by
doing:
library(tidyverse)
xlabel_end <- 2018 # last starting year to keep, as defined in the question
as_tibble(df) %>%
mutate(Year = as.numeric(sub("/\\d{2}","",FY))) %>%
filter(Project == 5 & between(Year,xlabel_start$Start_Year,xlabel_end))
# A tibble: 3 x 5
Project FY Start Amount Year
<dbl> <fct> <fct> <dbl> <dbl>
1 5 2010/11 2010/11 500 2010
2 5 2012/13 2010/11 788 2012
3 5 2011/12 2010/11 100 2011
And once you have done this, you can simply add the ggplot plotting part at the end of this pipe sequence:
library(tidyverse)
as_tibble(df) %>%
mutate(Year = as.numeric(sub("/\\d{2}","",FY))) %>%
filter(Project == 5 & between(Year,xlabel_start$Start_Year,xlabel_end)) %>% # pipe the filtered rows into ggplot
ggplot(aes(x = FY, y = Amount))+
geom_col()
Does it answer your question ?
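The question also asks for every consecutive fiscal year to appear on the axis, including years without data. One way to approach that is to build the full set of labels and use them as factor levels. A minimal base R sketch; the names start_year, end_year and fy_levels are illustrative and not part of the answer above:

```r
start_year <- 2010L   # starting year for project 5, from the Start column
end_year   <- 2018L   # so that the last label is "2018/19"

# Build "2010/11", "2011/12", ..., "2018/19"
yrs <- start_year:end_year
fy_levels <- sprintf("%d/%02d", yrs, (yrs + 1L) %% 100L)

# Fiscal years that actually have data for project 5
fy <- c("2010/11", "2012/13", "2011/12")

# A factor with the full level set remembers the years that have no rows
fy_factor <- factor(fy, levels = fy_levels)
levels(fy_factor)
```

Mapping such a factor to x and adding scale_x_discrete(drop = FALSE) should keep the empty fiscal years on the axis, although that combination is untested against the real Project_x data.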

Carrying out a simple dataframe subset with dplyr

Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 3
d154 denmark 4
a091 argentina 1
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do the exactly same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a234 argentina 3
2 d154 denmark 4
3 a091 argentina 1
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hi,
you can use mutate to copy the row names of your data frame into an index column, filter to keep the rows whose index is in the vector "sorted", and then reorder the rows to follow "sorted":
df2 <- df %>% mutate(index=row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2[,"index"], sorted)), ] # the trailing comma selects rows, not columns
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
Seems to be shortest implementation of what df[sorted,] will do.
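To make the match() idiom concrete, here is a minimal base R sketch on the question's data, with the row names stored in an ordinary index column. match(sorted, df$index) returns, for each id in sorted, the row position where it is found:

```r
df <- data.frame(
  index     = c("a091", "b231", "a234", "d154"),
  locations = c("argentina", "brazil", "argentina", "denmark"),
  score     = 1:4
)
sorted <- c("a234", "d154", "a091")

# Row positions in the order given by `sorted`
pos <- match(sorted, df$index)   # 3 4 1

# Indexing by those positions subsets and reorders in one step
res <- df[pos, ]
res$score   # 3 4 1
```

If an id in sorted is missing from the index column, match() returns NA for it and the result contains a row of NAs, so it may be worth checking anyNA(pos) first.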
Functions in the tidyverse (dplyr, tibble, etc.) are built around the concept (as far as I know) that rows only contain attributes (columns) and have no row names, labels, or indexes. So in order to sort rows in a custom order, you have to introduce a new column containing the rank of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = 1:length(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info, by = "custom_id") %>% # explicit key avoids the join message
arrange(rank) %>%
select(-rank)

Binding multiple column pairs into 1 column efficiently

I've configured the data how I needed but it took 15 lines of code. I was sure it could be done in 1 or 2, and I'm hoping someone a lot better at this can teach me how. Here it is...
I have a table with 11 variables: a Date, 4 pairs of spread and price observations, and the year and quarter corresponding to the Date column. The 4 pairs of data each correspond to a different TBA mortgage coupon (3%, 3.5%, 4%, 4.5%).
I need the 8 columns to be in 2 columns named ZSpread and Price, and then each pair tagged with the coupon Type.
Here's the code I used. Thanks!
mbs3 <- mbstrimlast[,c("Date",ZSpread="FN3sprd",Price="FN3px")]
names(mbs3) <- c("Date","Zspread","Price")
mbs3.5 <- mbstrimlast[,c("Date",ZSpread="FN3.5sprd",Price="FN3.5px")]
names(mbs3.5) <- c("Date","Zspread","Price")
mbs4 <- mbstrimlast[,c("Date",ZSpread="FN4sprd",Price="FN4px")]
names(mbs4) <- c("Date","Zspread","Price")
mbs4.5 <- mbstrimlast[,c("Date",ZSpread="FN4.5sprd",Price="FN4.5px")]
names(mbs4.5) <- c("Date","Zspread","Price")
mbs3$Type = c("FN3")
mbs3.5$Type = c("FN3.5")
mbs4$Type = c("FN4")
mbs4.5$Type = c("FN4.5")
mbslast = bind_rows(mbs3, mbs3.5, mbs4, mbs4.5)
mbslast <- mbslast %>% mutate(Yeartag = year(mbslast$Date))
mbslast <- mbslast %>% mutate(Qtag = quarters(mbslast$Date, abbreviate = T))
We can use the tidyverse package to make the code for this task a bit cleaner. First, we use gather to reshape from wide to long. Then we create type and key columns using grepl and gsub. Finally, we use spread to get the data back into a tidier format.
library(tidyverse)
mbstrimlast %>%
gather(variable, value, -Date, -Yeartag, -Qrts) %>% # wide to long
# column creation
mutate(type = ifelse(grepl(pattern = 'sprd', x = variable), 'Spread', 'Price'),
key = gsub(pattern = 'sprd|px', replacement = '', x = variable)) %>%
select(-variable) %>% # remove variable column
spread(type, value) # tidier
Date Yeartag Qrts key Price Spread
1 2018-06-17 23:00:00 2018 Q2 FN3 96.35938 52.8
2 2018-06-17 23:00:00 2018 Q2 FN3.5 99.10938 67.7
3 2018-06-17 23:00:00 2018 Q2 FN4 101.64844 81.9
4 2018-06-17 23:00:00 2018 Q2 FN4.5 103.89062 87.2
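For comparison, the stacking in the question can also be written as a short loop in base R. This is a minimal sketch under simplifying assumptions: wide is a hypothetical one-row, two-coupon slice of mbstrimlast built from the values in the output above, and Date is stored as a plain Date, not a date-time.

```r
# Hypothetical slice of mbstrimlast: one date, two coupons
wide <- data.frame(
  Date = as.Date("2018-06-17"),
  FN3sprd = 52.8,   FN3px = 96.35938,
  FN3.5sprd = 67.7, FN3.5px = 99.10938
)

types <- c("FN3", "FN3.5")

# Build one small data frame per coupon, then stack them
long <- do.call(rbind, lapply(types, function(type) {
  data.frame(
    Date    = wide$Date,
    ZSpread = wide[[paste0(type, "sprd")]],
    Price   = wide[[paste0(type, "px")]],
    Type    = type
  )
}))
long
```

The column names are assembled from the coupon name plus the sprd or px suffix, so extending this to all four coupons only means adding "FN4" and "FN4.5" to types.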
