Assign variables in groups based on fractions and several conditions - r

I've spent several days on something I think should be rather simple, with no luck. I hope someone can help me!
I have a data frame called "test" with the following variables: "Firm", "Year", "Firm_size" and "Expenditures".
I want to assign firms to size groups by year and then display the mean, median, std. dev. and N of expenditures for these groups in a table (e.g. with stargazer). So the first size group (the top 10% largest firms) should show the mean, median, etc. of expenditures for the 10% largest firms in each year.
The size groups should be:
The 10% largest firms
The firms that are between 10-25% largest
The firms that are between 25-50% largest
The firms that are between 50-75% largest
The firms that are between 75-90% largest
The 10% smallest firms
This is what I have tried:
test <- arrange(test, -Firm_size)
test$Variable <- 0
test[1:min(5715, nrow(test)), ]$Variable <- "Expenditures, 0% size <10%"
test[5715:min(14288, nrow(test)), ]$Variable <- "Expenditures, 10% size <25%"
test[14288:min(28577, nrow(test)), ]$Variable <- "Expenditures, 25% size <50%"
--> And so on
library(dplyr)
testtest <- test %>%
  group_by(Variable) %>%
  dplyr::summarise(
    Mean = mean(Expenditures),
    Median = median(Expenditures),
    Std.dev = sd(Expenditures),
    N = n()
  )
stargazer(testtest, type = "text", title = "Expenditures firms", digits = 1, summary = FALSE)
As shown above, I don't know how to form the groups by fractions/percentages. I have therefore tried to assign firms to groups based on their row numbers after arranging Firm_size in descending order. The problem with doing so is that I don't take year into consideration, which I need to, and it is a lot of work to repeat this for each year (20 in total).
My intention was to make a new variable that gives each size group a name; e.g. the top 10% largest firms each year should get the value "Expenditures, 0% size <10%".
Then I make a new data frame, "testtest", where I calculate the different measures before using stargazer to present them. That part works.
!!EDIT!!
Hi again,
Now I get the error "List object cannot be coerced to type double" when running the code on a new dataset (it has the same variables as before).
The mutate step I'm referring to is the "mutate(gs = cut(...))" after "rowwise()" in the solution you provided.

You can create the quantiles as a nested variable (size_groups), and then use cut() to create the group sizes (gs). Then group by Year and gs to summarize the indicators you want.
test %>%
  group_by(Year) %>%
  mutate(size_groups = list(quantile(Firm_size, probs = c(.1, .25, .5, .75, .9)))) %>%
  rowwise() %>%
  mutate(gs = cut(
    Firm_size, c(-Inf, size_groups, Inf),
    labels = c("Lowest 10%", "10%-25%", "25%-50%", "50%-75%", "75%-90%", "Highest 10%"))) %>%
  group_by(Year, gs) %>%
  summarize(across(Expenditures, .fns = list(mean, median, sd, length)), .groups = "drop") %>%
  rename_all(~c("Year", "Group_Size", "Mean_Exp", "Med_Exp", "SD_Exp", "N_Firms"))
Output:
# A tibble: 126 x 6
Year Group_Size Mean_Exp Med_Exp SD_Exp N_Firms
<int> <fct> <dbl> <dbl> <dbl> <int>
1 2000 Lowest 10% 20885. 21363. 3710. 3
2 2000 10%-25% 68127. 69497. 19045. 4
3 2000 25%-50% 42035. 35371. 30335. 6
4 2000 50%-75% 36089. 29802. 17724. 6
5 2000 75%-90% 53319. 54914. 19865. 4
6 2000 Highest 10% 57756. 49941. 34162. 3
7 2001 Lowest 10% 55945. 47359. 28283. 3
8 2001 10%-25% 61825. 70067. 21777. 4
9 2001 25%-50% 65088. 76340. 29960. 6
10 2001 50%-75% 57444. 53495. 32458. 6
# ... with 116 more rows
If you wanted to have an additional column with the yearly mean, you can remove the .groups="drop" from the summarize(across()) line, and then add this final line to the pipeline:
mutate(YrMean = sum(Mean_Exp*N_Firms/sum(N_Firms)))
Note that this is correctly weighted by the number of firms in each Group_Size, and thus returns the equivalent of computing the yearly mean on the original data:
test %>% group_by(Year) %>% summarize(mean(Expenditures))
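Putting the two changes together, here is a sketch of the full modified pipeline (not re-run here, so treat it as illustrative):
test %>%
  group_by(Year) %>%
  mutate(size_groups = list(quantile(Firm_size, probs = c(.1, .25, .5, .75, .9)))) %>%
  rowwise() %>%
  mutate(gs = cut(
    Firm_size, c(-Inf, size_groups, Inf),
    labels = c("Lowest 10%", "10%-25%", "25%-50%", "50%-75%", "75%-90%", "Highest 10%"))) %>%
  group_by(Year, gs) %>%
  # .groups = "drop" removed, so the result stays grouped by Year
  summarize(across(Expenditures, .fns = list(mean, median, sd, length))) %>%
  rename_all(~c("Year", "Group_Size", "Mean_Exp", "Med_Exp", "SD_Exp", "N_Firms")) %>%
  # weighted yearly mean across the size groups
  mutate(YrMean = sum(Mean_Exp * N_Firms / sum(N_Firms)))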
Input Data:
set.seed(123)
test <- data.frame(
  Firm = replicate(2000, sample(letters, 1)),
  Year = sample(2000:2020, 2000, replace = T),
  Firm_size = ceiling(runif(2000, 2000, 5000)),
  Expenditures = runif(2000, 10000, 100000)
) %>% group_by(Firm, Year) %>% slice_head(n = 1)
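Addendum on the error mentioned in the question's edit ("List object cannot be coerced to type double"): one possible cause, guessing without seeing the new dataset, is that the nested size_groups column reaches cut() still wrapped as a list under some dplyr versions. Wrapping the breaks in unlist() is harmless either way, so a drop-in replacement for the rowwise()/mutate() step would be:
  rowwise() %>%
  mutate(gs = cut(
    Firm_size, c(-Inf, unlist(size_groups), Inf),  # unlist() guards against a list-wrapped quantile vector
    labels = c("Lowest 10%", "10%-25%", "25%-50%", "50%-75%", "75%-90%", "Highest 10%"))) %>%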

Related

Generating repeated measures dataset

I'm looking to generate a dataset in R for a repeated measures model and I'm not sure where to start.
The outcome of interest is continuous between 0-100. This is for a two-arm trial (say groups "a" and "b"), with 309 participants in each arm. Each participant is assessed at baseline, then fortnightly for one year (27 assessments in total). There will be loss to follow-up and withdrawals over the year (~30% after one year), and participants may miss individual assessments at random.
For now, I am assuming the standard deviation is the same at each timepoint and for both arms (11). The mean will change over time. I'm working on the assumption that each participant's score is correlated with their baseline measurement.
How can I generate this dataset? I'm intending to compare repeated measures regression methods.
I think the following fulfils your requirements. It works by taking the cumulative sum of samples from a normal distribution over the 27 visits and converting these onto a logistic scale between 0 and 100 (so that the maximum / minimum scores are never breached). It uses replicate to do this for 309 participants. It then simulates 30% dropout by choosing random participants and a random week, after which all of their measurements are NA. It also adds some random missing weeks for the remaining participants. The result is pivoted into long format to allow for easier analysis.
library(tidyverse)
set.seed(1)
# Generate correlated scores for 309 people over 27 visits
df <- setNames(cbind(data.frame(ID = 1:309, t(replicate(309, {
  x <- cumsum(rnorm(27, 0.05, 0.1))
  round(100 * exp(x) / (1 + exp(x)))
})))), c('ID', paste0('Visit_', 1:27)))
# Model dropouts at 30% rate
dropout <- sample(c(TRUE, FALSE), 309, TRUE, prob = c(0.7, 0.3))
df[cbind(which(!dropout), sample(2:28, sum(!dropout), TRUE))] <- NA
# Once a participant drops out, set all their later visits to NA as well
# (cumsum() propagates the NA forward along the row)
df <- as.data.frame(t(apply(df, 1, function(x) ifelse(is.na(cumsum(x)), NA, x))))
# Add random missing visits
df[cbind(sample(309, 100, TRUE), sample(2:28, 100, TRUE))] <- NA
# Pivot to long format and convert the visit number to a (fortnightly) week number
df <- pivot_longer(df, -ID, names_to = 'Week', values_to = 'Score') %>%
  mutate(Week = 2 * (as.numeric(gsub('\\D+', '', Week)) - 1))
Our data frame now looks like this:
head(df)
#> # A tibble: 6 x 3
#> ID Week Score
#> <dbl> <dbl> <dbl>
#> 1 1 0 50
#> 2 1 2 51
#> 3 1 4 51
#> 4 1 6 56
#> 5 1 8 58
#> 6 1 10 57
And we can see the scores drift upward over time (since we set a small positive mean on our rnorm() call when creating the scores).
lm(Score ~ Week, data = df)
#>
#> Call:
#> lm(formula = Score ~ Week, data = df)
#>
#> Coefficients:
#> (Intercept) Week
#> 52.2392 0.5102
We can plot and see the overall shape of the scores and their spread:
ggplot(df, aes(Week, Score, group = ID)) + geom_line(alpha = 0.1)
Created on 2023-01-31 with reprex v2.0.2
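Since the question mentions comparing repeated-measures regression methods, here is a minimal sketch (not part of the answer above) of fitting one such model to the simulated data, assuming the lme4 package is installed; the random intercept per participant captures the within-person correlation, and rows with NA scores are dropped by default.
library(lme4)
# Random-intercept model: Score drifts with Week, participants vary in their baseline level
fit <- lmer(Score ~ Week + (1 | ID), data = df)
summary(fit)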

Rounded percentages that add up to 100% in group_by statement

I'm having a hard time making rounded percentages that add up to 100% within groups.
Consider the following example:
# Loading main library used
library(dplyr)
# Creating the basic data frame
df = data.frame(group = c('A','A','A','A','B','B','B','B'),
                categories = c('Cat1','Cat2','Cat3','Cat4','Cat1','Cat2','Cat3','Cat4'),
                values = c(2200,4700,3000,2000,2900,4400,2200,1000))
print(df)
# group categories values
# 1 A Cat1 2200
# 2 A Cat2 4700
# 3 A Cat3 3000
# 4 A Cat4 2000
# 5 B Cat1 2900
# 6 B Cat2 4400
# 7 B Cat3 2200
# 8 B Cat4 1000
df_with_shares = df %>%
  # Calculating group totals and adding them back to the main df
  left_join(df %>%
              group_by(group) %>%
              summarize(group_total = sum(values)),
            by = 'group') %>%
  # Calculating each category's share within the groups
  mutate(group_share = values / group_total,
         group_share_rounded = round(group_share, 2))
# Summing the rounded shares within groups
rounded_totals = df_with_shares %>%
  group_by(group) %>%
  summarize(total_share = sum(group_share_rounded))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <dbl>
# 1 A 0.99
# 2 B 1.01
# Note how the totals do not add up to 100% as expected
I am aware of a few generic solutions to the "rounding percentages to add up to 100%" problem, as explained in this SO post. I was even able to make a little R implementation of one of those approaches, as seen here. This is what it would look like if I just applied that R approach to this problem:
df_with_rounded_shares = df %>%
  mutate(
    percs = values / sum(values),
    percs_cumsum = cumsum(percs),
    percs_cumsum_round = round(percs_cumsum, 2),
    percs_cumsum_round_offset = replace_na(lag(percs_cumsum_round, 1), 0),  # replace_na() comes from tidyr
    percs_rounded_final = percs_cumsum_round - percs_cumsum_round_offset)
However, the method I devised in the thread above does not work as I would like here. It just calculates the shares of the values column across the whole dataset. In other words, it does not take into consideration the grouping variable representing the multiple groups in the data, each of which needs its rounded values to add up to 100% independently of every other group.
What can I do to generate a column of rounded percentages that add up to 100% by group?
PS: While writing this question I actually found something that worked, so I'll answer my own question below. I know it's super simple, but I think it's still worth having a direct answer here on SO addressing this issue.
The method devised in your implementation (from here) just needs a few small tweaks to make it work.
First, include a group_by statement before calculating the new columns. Also, you need to use a summarize statement instead of the mutate statement you have now.
In essence, this is what it'll look like:
# Modified version of your implementation of the rounding procedure.
# The new procedure below accommodates grouping variables.
df_with_rounded_shares_by_group = df %>%
  group_by(group) %>%
  summarize(
    group_share = values / sum(values),
    group_share_cumsum = cumsum(group_share),
    group_share_cumsum_round = round(group_share_cumsum, 2),
    group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round, 1), 0),  # replace_na() comes from tidyr
    group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
  # Removing unnecessary temporary columns
  select(-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset)
# Verifying if the results add up to 100% within each group
rounded_totals = df_with_rounded_shares_by_group %>%
  group_by(group) %>%
  summarize(total_share = sum(group_share_rounded_final))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <dbl>
# 1 A 1
# 2 B 1
# Yep, they all add up to 100% as expected!
Btw, apologies for the ridiculously long column names. I just made them enormous to make it clear what each step was really doing.

Plot the change in mean of columns in r and change scale

I have a dataset with the first few rows shown below:
I would like to plot the change of the means of these columns in a line graph. I know I can find the individual mean of a column using mean(df$column), but I don't know how to graph these without a separate time variable, which I do not have. Additionally, the column names include dates, ranging from 2017-2050, and I would like to scale the x-axis so that each column mean appears at its date appropriately spaced from the others by time. For example, I would want the scale to start at 2017, have several closely spaced entries through 2020, and then be spaced out accordingly with each following column until 2050. I know I can change the scale in general with the xlim() function, but I don't know how to space the future ones out accordingly with the variable names. Any help would be appreciated!
Data:
dataset <- structure(list(tons_2017 = c(64.533, 3049.580, 1.609),
                          tons_2018 = c(65.613, 3100.588, 1.636),
                          tons_2019 = c(68.331, 3229.061, 1.704),
                          tons_2020 = c(68.816, 3251.973, 1.716),
                          tons_2022 = c(73.408, 3493.93, 1.755),
                          tons_2023 = c(75.368, 3567.198, 1.743),
                          tons_2025 = c(88.289, 4052.954, 1.756),
                          tons_2030 = c(106.873, 4749.285, 1.896),
                          tons_2035 = c(126.056, 5361.734, 1.954),
                          tons_2040 = c(152.926, 6272.844, 2.149),
                          tons_2045 = c(186.799, 7393.864, 2.428),
                          tons_2050 = c(219.586, 8429.251, 2.650)),
                     row.names = c(NA, 3L),
                     class = "data.frame")
EDITED: based on comments
I think what you need to do is reshape the data from "wide" to "long" form, convert the column names into numeric values, then group by those values to calculate the means.
Something like this:
library(tidyverse)
dataset %>%
  select(starts_with("tons_")) %>%
  pivot_longer(everything()) %>%
  mutate(name = as.numeric(gsub("tons_", "", name))) %>%
  group_by(name) %>%
  summarise(meanVal = mean(value)) %>%
  ggplot(aes(name, meanVal)) +
  geom_line()
After the summarise step, the data looks like this:
# A tibble: 12 × 2
name meanVal
<dbl> <dbl>
1 2017 1039.
2 2018 1056.
3 2019 1100.
4 2020 1108.
5 2022 1190.
6 2023 1215.
7 2025 1381.
8 2030 1619.
9 2035 1830.
10 2040 2143.
11 2045 2528.
12 2050 2884.
The resulting chart is a single line rising from a mean of roughly 1,040 in 2017 to roughly 2,880 in 2050.
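One note on the x-axis spacing asked about in the question: because name is numeric after the gsub()/as.numeric() step, ggplot already spaces the years proportionally (2017-2020 close together, the later years further apart). If you also want a tick mark at every year present in the data, you could compute the breaks from the column names first; yrs below is just an illustrative name:
library(tidyverse)
# Years present in the column names, used as explicit tick positions
yrs <- as.numeric(gsub("tons_", "", grep("^tons_", names(dataset), value = TRUE)))
dataset %>%
  select(starts_with("tons_")) %>%
  pivot_longer(everything()) %>%
  mutate(name = as.numeric(gsub("tons_", "", name))) %>%
  group_by(name) %>%
  summarise(meanVal = mean(value)) %>%
  ggplot(aes(name, meanVal)) +
  geom_line() +
  scale_x_continuous(breaks = yrs)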

Summarizing dataset in r to calculate unique values

In the following data frame I want to calculate the number of unique household_id-individual_id combinations, the average weight, and the total duration, after summarizing at the country, state and date level.
Household_id 100 has two unique individuals (1 and 2) and household_id 101 has three unique individuals (1, 2, 3). So the total unique count is 5 after summarizing.
I want the average weight calculated over these 5 unique individuals, i.e. (100 + 50 + 200 + 200 + 200)/5 = 150.
What I did is:
data %>% group_by(country, state, date) %>%
  summarise(total_unique = n_distinct(household_id, individual_id),
            Tot_Duration = sum(duration))
But I am not able to calculate the Average_weights.
Any help is highly appreciated.
Sample Dataset
library(dplyr)
data <- data.frame(country = c("US","US","US","US","US","US","IND","IND"),
                   state = c("TX","TX","TX","TX","TX","TX","AP","AP"),
                   date = c(20220601,20220601,20220601,20220601,20220601,20220601,20220601,20220601),
                   household_id = c(100,100,100,101,101,101,102,102),
                   individual_id = c(1,2,1,1,2,3,1,1),
                   weights = c(100,50,100,200,200,200,100,100),
                   duration = c(10,20,30,40,50,60,70,80))
EDIT
Apologies for not posting the right dataset; I realized this later.
Two updates to the dataset:
Different individuals may have the same weight, as in household_id 101
A duration column has been added
With solution 1 above, distinct will not work, and with solution 2, unique will not work. Please suggest an alternative.
I have updated the sample dataset
Building on your code, you could add an extra statement in summarise:
library(tidyverse)
data %>%
  group_by(country, state, date) %>%
  summarise(total_unique = n_distinct(household_id, individual_id),
            Average_weights = sum(unique(weights), na.rm = T) / total_unique)
Output
country state date total_unique Average_weights
<chr> <chr> <dbl> <int> <dbl>
1 IND AP 20220601 1 100
2 US TX 20220601 5 210
You may try
library(dplyr)
data %>%
  group_by(country, state, date) %>%
  distinct() %>%
  summarize(total_unique = n(),
            average_Weights = sum(weights) / total_unique)
country state date total_unique average_Weights
<chr> <chr> <dbl> <int> <dbl>
1 IND AP 20220601 1 100
2 US TX 20220601 5 210
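For the edited dataset (a duration column added, and individuals within a household sharing the same weight), the question notes that neither unique() on weights nor distinct() on whole rows counts correctly. A sketch of one way to handle it, not taken from either answer above: keep one weight per household/individual pair when averaging, and sum duration over all rows.
library(dplyr)
data %>%
  group_by(country, state, date) %>%
  summarise(
    total_unique    = n_distinct(household_id, individual_id),
    # one weight per household/individual pair before averaging
    Average_weights = sum(weights[!duplicated(data.frame(household_id, individual_id))]) / total_unique,
    Tot_Duration    = sum(duration),
    .groups = "drop"
  )
# With the updated sample data this gives Average_weights = 150 for US/TX,
# matching the (100 + 50 + 200 + 200 + 200) / 5 calculation in the question.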

R - ggplot showing distribution of binary flag variable (0/1) over time as normalized bar chart (%)

I have a data set looking something like this:
Date        Remaining  Volume  ID
1990-01-01  0          1000    1
1990-01-01  1          2000    2
1990-01-01  1          5000    3
1990-02-01  0          200     4
1990-03-01  1          4000    5
1990-03-01  0          3000    6
I filter the data according to a series of conditional statements and assign the binary flag variable to the data.table. A value of 0 means that the particular row entry doesn't meet the defined requirements and will subsequently be excluded; 1-flagged rows remain in the data.table. The key is ID and is unique for each row.
I would like to show two relationships.
(1) A stacked normalized/percentage bar chart over the monthly time series to show the percentage of entries remaining/being excluded in the data.set for each month,
f.ex. Jan 1990 --> 2/3 values remaining --> 66.6% vs. 33.3% of entries remain vs. are excluded
(2) A stacked normalized/percentage bar chart showing the normalized percentage of volume remaining/ being excluded by the filtering operation for each month,
f.ex. Jan 1990 --> 2k + 5k out of 8k remaining --> 87.5% vs. 12.5% of volume remains vs. is excluded
I have tried various things so far, e.g. computing the number of occurrences of each flag value per month and the sum of the corresponding "bucket" (0/1) volume, but all my attempts have failed.
# dt_1 is the original data.table
id.vec <- dt_1[ , id]
dt_2 <- dt_1
# dt_1 is filtered subsequently
id_remaining.vec <- dt_1[ , id]
dt_2 <- dt_2[id.vec %in% id_remaining.vec, REMAIN := 1]
dt_2 <- dt_2[id.vec %notin% id_remaining.vec, REMAIN := 0]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN := .N]
dt_2 <- dt_2[REMAIN == 1 , N_REMAIN_MON := .N]
# Tried the code below to no avail
ggplot(data = dt_2, aes(x = Date, y = REMAIN, color = REMAIN, fill = REMAIN)) +
  geom_bar(position = "fill", stat = "identity")
Usually, I find the ggplot grammar very intuitive, but I guess I am overlooking something here, or maybe the data set is not in the right format.
Any pointer or idea highly appreciated!
Here's how I'd do it with dplyr:
library(dplyr)
dt_2 %>%
  mutate(Remaining = as.character(Remaining)) %>% # just to make the charts use scale_fill_discrete by default
  group_by(Date, Remaining) %>%
  summarize(entries = n(),
            volume = sum(Volume)) %>%
  mutate(share_entries = entries / sum(entries),
         share_volume = volume / sum(volume)) %>%
  ungroup() -> dt_2_summary
> dt_2_summary
# A tibble: 5 x 6
Date Remaining entries volume share_entries share_volume
<chr> <chr> <int> <int> <dbl> <dbl>
1 1990-01-01 0 1 1000 0.333 0.125
2 1990-01-01 1 2 7000 0.667 0.875
3 1990-02-01 0 1 200 1 1
4 1990-03-01 0 1 3000 0.5 0.429
5 1990-03-01 1 1 4000 0.5 0.571
Then to chart:
dt_2_summary %>%
ggplot(aes(Date, share_entries, fill = Remaining)) +
geom_col()
dt_2_summary %>%
ggplot(aes(Date, share_volume, fill = Remaining)) +
geom_col()
Just as an appendix to Jon's great solution:
I had a large project with >25 libraries loaded, and while the proposed code seemingly worked, it only did so for share_entries and not for share_volume. The output of dt_2_summary was odd: the share_entries column was apparently computed relative to the total number of entries rather than within each group, and the share_volume column only showed NAs.
After hours of troubleshooting, I identified the culprit to be the package plyr, which masks some dplyr functions. Thus, I had to specify which package's version of the functions I wanted to use.
The code below did the trick for me.
library(plyr) # the culprit
library(dplyr)
dt_2 %>%
  dplyr::mutate(Remaining = as.character(Remaining)) %>%
  group_by(Date, Remaining) %>%
  dplyr::summarize(entries = n(),
                   volume = sum(Volume)) %>%
  dplyr::mutate(share_entries = entries / sum(entries),
                share_volume = volume / sum(volume)) %>%
  ungroup() -> dt_2_summary
Thanks again, Jon, for your wonderful solution!
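A side note, not part of the original exchange: instead of prefixing every call with dplyr::, the conflicted package (if installed) lets you declare up front which package should win each name clash.
library(conflicted)
library(plyr)
library(dplyr)
# Prefer the dplyr versions of the clashing verbs for the rest of the session
conflict_prefer("mutate", "dplyr")
conflict_prefer("summarise", "dplyr")
conflict_prefer("summarize", "dplyr")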
