Weighted survey functions by group - r

I have a weighted survey dataset that involves age groups, incomes and expenditure. I want to find the average of spending within age groups and within income deciles.
So for example
DF:
Age Income Spending1 Spending2 Weight
45-49 1000 50 35 100
30-39 2000 40 60 150
40-44 3434 30 55 120
Currently I have coded this:
DF$hhdecile<-weighted_ntile(DF$Income, weights=DF$Weight, 5)
Result1<- DF %>% group_by(Age,hhdecile) %>% dplyr::summarise(mean.exp = weighted.mean(x = Spending1, w = Weight))
Result2<- DF %>% group_by(Age,hhdecile) %>% dplyr::summarise(mean.exp = weighted.mean(x = Spending2, w = Weight))
df.list <- list(Result1=Result1,
Result2=Result2)
names(df.list$Result1)[names(df.list$Result1)=="mean.exp"] <- "Result1"
ResultJoined <- df.list %>% reduce(full_join, by=c('Age','hhdecile'))
That finds the quintile of people compared to the population of all ages, and I'm interested in their quintile compared to their age group.
Is there a way to use group_by or similar to perform the weighted percentile function on each age group individually?
(there are actually 15 categories of spending)
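One way to get quintiles relative to each age group is to run the weighted binning separately on each group's rows. The sketch below is base R only and rests on stated assumptions: wtd_ntile() is a hypothetical helper written for this illustration (it is not the weighted_ntile() used above and may place boundary cases differently), the column name agequintile is invented, and the data are made up.

```r
# Made-up data in the same layout as DF (values are illustrative only)
DF <- data.frame(
  Age       = rep(c("30-39", "40-44"), each = 10),
  Income    = c(seq(1000, 10000, by = 1000), seq(500, 5000, by = 500)),
  Spending1 = rep(c(50, 40, 30, 20, 10), times = 4),
  Spending2 = rep(c(35, 60, 55, 45, 25), times = 4),
  Weight    = rep(c(100, 150), times = 10)
)

# Hypothetical helper: bin x into n groups holding roughly equal total weight
wtd_ntile <- function(x, w, n) {
  ord <- order(x)
  cum_share <- cumsum(w[ord]) / sum(w)      # cumulative share of total weight
  bins <- ceiling(cum_share * n - 1e-9)     # small tolerance for rounding error
  out <- integer(length(x))
  out[ord] <- as.integer(pmin(n, pmax(1, bins)))
  out
}

# Compute the quintile within each age group rather than over everyone
DF$agequintile <- NA_integer_
for (idx in split(seq_len(nrow(DF)), DF$Age)) {
  DF$agequintile[idx] <- wtd_ntile(DF$Income[idx], DF$Weight[idx], 5)
}

# Weighted mean of every spending column by age group and quintile
spend_cols <- grep("^Spending", names(DF), value = TRUE)
cells <- split(DF, list(DF$Age, DF$agequintile), drop = TRUE)
Result <- do.call(rbind, lapply(cells, function(g) {
  data.frame(Age = g$Age[1], agequintile = g$agequintile[1],
             t(sapply(g[spend_cols], weighted.mean, w = g$Weight)))
}))
```

Because spend_cols is built from the column names, the same code should cover all 15 spending categories without one summarise call per column. With dplyr, the same idea would presumably be a group_by(Age) followed by mutate() calling weighted_ntile() on Income and Weight.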

Related

Generating repeated measures dataset

I'm looking to generate a dataset in R for a repeated measures model and I'm not sure where to start.
The outcome of interest is continuous between 0-100. This is for a two arm trial (say groups "a" and "b"), with 309 participants in each arm. Each participant is assessed at baseline, then fortnightly for one year (27 total assessments). There will be loss to followup and withdrawals over the year (~30% after one year), and participants may miss individual assessments at random.
For now, I am assuming the standard deviation is the same at each timepoint, and for both arms (11). The mean will change over time. I'm working on the assumption each participant's score is correlated with their baseline measurement.
How can I generate this dataset? I'm intending to compare repeated measures regression methods.
I think the following fulfils your requirements. It works by taking the cumulative sum of samples from a normal distribution over 27 weeks and converting these into a logistic scale between 0 and 100 (so that the maximum / minimum scores are never breached). It uses replicate to do this for 309 participants. It then simulates 30% drop outs by choosing random participants and a random week, following which their measurements are all NA. It also adds in some random missing weeks for the rest of the participants. The result is pivoted into long format to allow for easier analysis.
library(tidyverse)
set.seed(1)
# Generate correlated scores for 309 people over 27 visits
df <- setNames(cbind(data.frame(ID = 1:309, t(replicate(309, {
x <- cumsum(rnorm(27, 0.05, 0.1))
round(100 * exp(x)/(1 + exp(x)))
})))), c('ID', paste0('Visit_', 1:27)))
# Model dropouts at 30% rate
dropout <- sample(c(TRUE, FALSE), 309, TRUE, prob = c(0.7, 0.3))
df[cbind(which(!dropout), sample(2:28, sum(!dropout), TRUE))] <- NA
df <- as.data.frame(t(apply(df, 1, function(x) ifelse(is.na(cumsum(x)), NA,x))))
# Add random missing visits
df[cbind(sample(309, 100, TRUE), sample(2:28, 100, TRUE))] <- NA
df <- pivot_longer(df, -ID, names_to = 'Week', values_to = 'Score') %>%
mutate(Week = 2 * (as.numeric(gsub('\\D+', '', Week)) - 1))
Our data frame now looks like this:
head(df)
#> # A tibble: 6 x 3
#> ID Week Score
#> <dbl> <dbl> <dbl>
#> 1 1 0 50
#> 2 1 2 51
#> 3 1 4 51
#> 4 1 6 56
#> 5 1 8 58
#> 6 1 10 57
And we can see the scores drift upward over time (since we set a small positive mu on our rnorm when creating the scores):
lm(Score ~ Week, data = df)
#>
#> Call:
#> lm(formula = Score ~ Week, data = df)
#>
#> Coefficients:
#> (Intercept) Week
#> 52.2392 0.5102
We can plot and see the overall shape of the scores and their spread:
ggplot(df, aes(Week, Score, group = ID)) + geom_line(alpha = 0.1)
Created on 2023-01-31 with reprex v2.0.2
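The dropout step in the code above relies on cumsum() returning NA for every position from the first NA onward. A minimal sketch on one made-up participant (the scores are illustrative, not taken from the simulation):

```r
# Scores for one participant, with the dropout visit marked as NA
scores <- c(50, 51, NA, 56, 58)

# cumsum() is NA from the first NA onward, so every later visit is blanked
after_dropout <- ifelse(is.na(cumsum(scores)), NA, scores)
after_dropout
#> [1] 50 51 NA NA NA
```

This is also why the random missing visits are added only after the dropout step: an isolated NA introduced earlier would be treated as a dropout and wipe out the rest of that participant's visits.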

Assign variables in groups based on fractions and several conditions

I've tried for several days on something I think should be rather simple, with no luck. Hope someone can help me!
I have a data frame called "test" with the following variables: "Firm", "Year", "Firm_size" and "Expenditures".
I want to assign firms to size groups by year and then display the mean, median, std.dev and N of expenditures for these groups in a table (e.g. stargazer). So the first size group (top 10% largest firms) should show the mean, median ++ of expenditures for the 10% largest firms each year.
The size groups should be,
The 10% largest firms
The firms that are between 10-25% largest
The firms that are between 25-50% largest
The firms that are between 50-75% largest
The firms that are between 75-90% largest
The 10% smallest firms
This is what I have tried:
test<-arrange(test, -Firm_size)
test$Variable = 0
test[1:min(5715, nrow(test)),]$Variable <- "Expenditures, 0% size <10%"
test[5715:min(14288, nrow(test)),]$Variable <- "Expenditures, 10% size <25%"
test[14288:min(28577, nrow(test)),]$Variable <- "Expenditures, 25% size <50%"
--> And so on
library(dplyr)
testtest = test%>%
group_by(Variable)%>%
dplyr::summarise(
Mean=mean(Expenditures),
Median=median(Expenditures),
Std.dev=sd(Expenditures),
N=n()
)
stargazer(testtest, type = "text", title = "Expenditures firms", digits = 1, summary = FALSE)
As shown above, I don't know how to use fractions or group by percentage. I have therefore tried to assign firms to groups based on their row positions after arranging Firm_size in descending order. The problem with doing so is that I don't take year into consideration, which I need to, and it is a lot of work to do this for each year (20 in total).
My intention was to make a new variable which gives each size group a name. E.g. top 10% largest firms each year should get a variable with the name "Expenditures, 0% size <10%"
Further I make a new dataframe "testtest" where I calculate the different measures, before using the stargazer to present it. This works.
!!EDIT!!
Hi again,
Now I get the error "List object cannot be coerced to type double" when running the code on a new dataset (but it is the same variables as before).
The mutate-step I'm referring to is the "mutate(gs = cut ++" after "rowwise()" in the solution you provided.
You can create the quantiles as a nested variable (size_groups), and then use cut() to create the group sizes (gs). Then group by Year and gs to summarize the indicators you want.
test %>%
group_by(Year) %>%
mutate(size_groups = list(quantile(Firm_size, probs=c(.1,.25,.5,.75,.9)))) %>%
rowwise() %>%
mutate(gs = cut(
Firm_size,c(-Inf, size_groups, Inf),
labels = c("Lowest 10%","10%-25%","25%-50%","50%-75%","75%-90%","Highest 10%"))) %>%
group_by(Year, gs) %>%
summarize(across(Expenditures,.fns = list(mean,median,sd,length)), .groups="drop") %>%
rename_all(~c("Year", "Group_Size", "Mean_Exp", "Med_Exp", "SD_Exp","N_Firms"))
Output:
# A tibble: 126 x 6
Year Group_Size Mean_Exp Med_Exp SD_Exp N_Firms
<int> <fct> <dbl> <dbl> <dbl> <int>
1 2000 Lowest 10% 20885. 21363. 3710. 3
2 2000 10%-25% 68127. 69497. 19045. 4
3 2000 25%-50% 42035. 35371. 30335. 6
4 2000 50%-75% 36089. 29802. 17724. 6
5 2000 75%-90% 53319. 54914. 19865. 4
6 2000 Highest 10% 57756. 49941. 34162. 3
7 2001 Lowest 10% 55945. 47359. 28283. 3
8 2001 10%-25% 61825. 70067. 21777. 4
9 2001 25%-50% 65088. 76340. 29960. 6
10 2001 50%-75% 57444. 53495. 32458. 6
# ... with 116 more rows
If you wanted to have an additional column with the yearly mean, you can remove the .groups="drop" from the summarize(across()) line, and then add this final line to the pipeline:
mutate(YrMean = sum(Mean_Exp*N_Firms/sum(N_Firms)))
Note that this is correctly weighted by the number of firms in each Group_Size, and thus returns the equivalent of doing this with the original data:
test %>% group_by(Year) %>% summarize(mean(Expenditures))
Input Data:
set.seed(123)
test = data.frame(
Firm = replicate(2000, sample(letters,1)),
Year = sample(2000:2020, 2000, replace=T),
Firm_size= ceiling(runif(2000,2000,5000)),
Expenditures = runif(2000, 10000,100000)
) %>% group_by(Firm,Year) %>% slice_head(n=1)
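For comparison, the same quantile-and-cut idea can be sketched in base R without a list column or rowwise(). This is an alternative under stated assumptions rather than the answer's own method: the toy data, the loop over years, and the use of aggregate() are choices made for this illustration. Skipping the list column may also sidestep the coercion error mentioned in the edit, though that is not verified here.

```r
set.seed(123)
# Toy data: 50 firms in each of two years (values are made up)
test <- data.frame(
  Year         = rep(2000:2001, each = 50),
  Firm_size    = runif(100, 2000, 5000),
  Expenditures = runif(100, 10000, 100000)
)

size_labels <- c("Lowest 10%", "10%-25%", "25%-50%",
                 "50%-75%", "75%-90%", "Highest 10%")
test$gs <- factor(rep(NA_character_, nrow(test)), levels = size_labels)

# Cut Firm_size at that year's own quantiles
for (idx in split(seq_len(nrow(test)), test$Year)) {
  breaks <- c(-Inf,
              quantile(test$Firm_size[idx], probs = c(.1, .25, .5, .75, .9)),
              Inf)
  test$gs[idx] <- cut(test$Firm_size[idx], breaks, labels = size_labels)
}

# Summary statistics per year and size group
size_summary <- aggregate(
  Expenditures ~ Year + gs, data = test,
  FUN = function(x) c(Mean = mean(x), Median = median(x),
                      SD = sd(x), N = length(x))
)
```

With 50 firms per year and no tied sizes, the six groups hold 5, 8, 12, 12, 8 and 5 firms, which matches the 10/15/25/25/15/10 percent split as closely as whole firms allow.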

How to loop with two lists in R

I have a dataset with demographic information and with questions.
DF <- data.frame(Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3))
I have two lists (made from column titles), one of demographic information (Males, Females, age group etc) and one of questions with their associated response.
Demographic <- c("Male", "Female", "Age_group_1", "Age_group_2", …)
Questions <- c("Q1", "Q2", "Q3", "Q4", …)
I need something along the lines of: if the value in a demographic column is equal to 1, then sum the scores in each of the separate question columns. But I want to do this in a loop so I have the separate question scores (~300) for all columns in the demographic list (~80). Plus I want to save the output. I have no idea how to do this and I’m getting into a loop of bad programming myself!
The end result should resemble this:
M F
Q1 24 21
Q2 16 16
Q3 23 18
I would be grateful for any help!
Thanks in advance.
UPDATE:
With help from a friend, I have found a work around my problem. How do you make this more efficient though?
df.list <- list()
for(question in questions){
question.df <- (DF[, lapply(.SD,sum, na.rm=T), by=question,
.SDcols=c(demographic)])
df.list <- append(df.list, question.df)}
list_new <- bind_cols(df.list, .id = "column_label")
library(tidyr)
library(dplyr)
df <- data.frame(
Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3)
)
df %>%
mutate(sex = ifelse(Male == 1, "M", "F")) %>%
select(-Male, -Female) %>%
pivot_longer(cols = starts_with("Q"), names_to = "Q") %>%
group_by(sex, Q) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = sex)
gives:
Q F M
<chr> <dbl> <dbl>
1 Q1 21 24
2 Q2 16 16
3 Q3 18 23
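Because the demographic columns are 0/1 indicators, the loop over two lists can also be sketched as a single matrix product: multiplying the transposed question columns by the indicator columns sums each question over the participants flagged in each demographic column. A minimal base R sketch, assuming no missing values and using only the columns shown in the question:

```r
DF <- data.frame(
  Participant = 1:10,
  Male   = c(1,0,1,1,0,1,0,0,1,0),
  Female = c(0,1,0,0,1,0,1,1,0,1),
  Q1 = c(9,6,5,4,5,1,3,5,5,2),
  Q2 = c(2,4,5,4,2,1,3,5,4,2),
  Q3 = c(6,8,2,7,5,2,1,1,6,3)
)
Demographic <- c("Male", "Female")
Questions   <- c("Q1", "Q2", "Q3")

# crossprod(X, Y) computes t(X) %*% Y: rows are questions, columns are demographics
sums <- crossprod(as.matrix(DF[Questions]), as.matrix(DF[Demographic]))
sums
#>    Male Female
#> Q1   24     21
#> Q2   16     16
#> Q3   23     18

# Save the result as CSV (the file name is arbitrary)
write.csv(sums, file.path(tempdir(), "question_sums.csv"))
```

The same call should scale to the full lists of ~300 questions and ~80 demographic columns, since only the two name vectors change. If the data contain NA values, they would need to be handled first (for example by replacing them with 0 in the question columns), because a matrix product propagates NA.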
Depending on what you want to do with the output, another approach is to use tables::tabular(), which can be used to generate additional statistics (e.g. percentages), as well as customizing row and column headings.
We'll generate a simple table using the data provided in the question.
df <- data.frame(Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3))
df$sex <- ifelse(df$Male == 1,"M","F")
library(tables)
tabular((Q1 + Q2 + Q3)~Factor(sex)*(sum),data=df)
...and the output:
> tabular((Q1 + Q2 + Q3)~Factor(sex)*(sum),data=df)
sex
F M
sum sum
Q1 21 24
Q2 16 16
Q3 18 23
Processing multiple demographic variables
In the comments to my answer a question was asked about how to use tabular() with more than one demographic variable.
We can use a combination of lapply(), paste0(), and substitute() to build the correct formula expressions for tabular().
To illustrate the process we will add a second demographic variable, Income to the data frame listed above. Then we create a vector to represent the list of demographic variables for which we will generate tables. Finally, we use the vector with lapply() to produce the tables.
df <- data.frame(Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Income = c(rep("low",5),rep("high",5)),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3))
df$Sex <- ifelse(df$Male == 1,"M","F")
library(tables)
tabular((Q1 + Q2 + Q3)~Factor(Sex)*(sum),data=df)
demoVars <- c("Sex","Income")
lapply(demoVars,function(x){
# generate a formula expression including the column variable
# and use substitute() to render it correctly within tabular()
theExpr <- paste0("(Q1 + Q2 + Q3) ~ Factor(",x,")*(sum)")
tabular(substitute(theExpr),data=df)
})
...and the output:
> lapply(demoVars,function(x){
+ # generate a formula expression including the column variable
+ # and use substitute() to render it correctly within tabular()
+ theExpr <- paste0("(Q1 + Q2 + Q3) ~ Factor(",x,")*(sum)")
+ tabular(substitute(theExpr),data=df)
+ })
[[1]]
Sex
F M
sum sum
Q1 21 24
Q2 16 16
Q3 18 23
[[2]]
Income
high low
sum sum
Q1 16 29
Q2 15 17
Q3 13 28
Note that we can enhance the solution further by saving the tables to an output object and rendering them in a printer friendly format as needed.

Programmatically create new variables which are sums of nested series of other variables

I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- tibble::tibble(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(dplyr)
library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
mutate(education = factor(education, levels = names(df)[-1])) %>%
group_by(group) %>%
mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
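In tidyr 1.0.0 and later, pivot_wider() accepts several value columns at once, so the wide layout can be sketched without leaving the tidyverse. This is a sketch, not part of the original answer; the object name wide is arbitrary, and the resulting column names (n_... and lessthan_...) come from pivot_wider()'s default naming, so they differ slightly from desired.df.

```r
library(dplyr)
library(tidyr)

df <- tibble(group = c("A", "B"),
             no.highschool = c(20, 10),
             high.school = c(70, 40),
             college = c(10, 40),
             graduate = c(0, 10))

wide <- df %>%
  pivot_longer(-group, names_to = "education", values_to = "n") %>%
  group_by(group) %>%
  # share of the group strictly below each education level
  mutate(lessthan = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  ungroup() %>%
  pivot_wider(names_from = education, values_from = c(n, lessthan))
```

pivot_longer() keeps the education levels in their original column order within each group, which is what the running sum depends on.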

R: How to output subset calculations (n, %) using ddply

I hope you can help me with this problem. For my work I have to use R to analyze survey data. The data has a number of columns by which I want to group the data and then do some calculations, e.g. how many men and women work in a certain department, and then calculate the number and percentage for each group. For example, 42 people work in department A, of whom 30 are women and 12 are men; 70 people work in department B, of whom 26 are women and 44 are men.
I currently use the following code to output the data (using ddply):
percentage_median_per_group_multiple_columns <- function(data, column_name, column_name2){
library(plyr)
descriptive <- ddply( data, column_name,
function(x){
percentage_median_per_group(x, column_name)
percentage_median_per_group(x, column_name2)
}
)
print(data.frame(descriptive))
}
## give number, percentage and median per group_value in column
percentage_median_per_group <- function(data, column_name3){
library(plyr)
descriptive <- ddply( data, column_name3,
function(x){
c(
N <- nrow(x[column_name3]), #number
pct <- (N/nrow(data))*100 #percentage
#TODO: median
)
}
)
return(descriptive)
}
#calculate
percentage_median_per_group_multiple_columns(users_surveys_full_responses, "department", "gender")
Now the data outputs like this:
Department  Sex  N   % per sex
A           f    30  71,4
            m    12  28,6
B           f    26  37,1
            m    44  62,9
But, I want the output to look like this, so calculations take place and are printed in each substep:
Department  N   % per department  Sex  N   % per sex
A           42  37,5              f    30  71,4
                                  m    12  28,6
B           70  62,5              f    26  37,1
                                  m    44  62,9
Does anyone have a suggestion of how I can do that, if possible even build it dynamic so I can potentially group it by the variables in multiple columns (e.g. department + sex + type of software + ...), but I would be happy if I can have it already like in the example =)
thanks!
EDIT
You can use this to generate example data:
n=100
sample_data = data.frame(department=sample(1:20,n,replace=TRUE), gender=sample(1:2,n,replace=TRUE))
percentage_median_per_group_multiple_columns(sample_data, "department", "gender")
V1 in the output stands for N (number) and V2 for %
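One possible way to get both levels of counts and percentages into a single table is to tabulate each level separately with table() and prop.table() and then combine them. The sketch below is base R rather than ddply, rebuilds the data from the counts given in the example, and uses invented object and column names (survey, out, N_dept, pct_dept, N_gender, pct_gender):

```r
# Counts taken from the example in the question
survey <- data.frame(
  department = rep(c("A", "B"), times = c(42, 70)),
  gender     = c(rep(c("f", "m"), times = c(30, 12)),
                 rep(c("f", "m"), times = c(26, 44)))
)

dept_n    <- table(survey$department)                  # N per department
dept_pct  <- 100 * prop.table(dept_n)                  # % per department
cross_n   <- table(survey$department, survey$gender)   # N per department and gender
cross_pct <- 100 * prop.table(cross_n, margin = 1)     # % of each gender within department

out <- as.data.frame(cross_n, stringsAsFactors = FALSE)
names(out) <- c("department", "gender", "N_gender")
out$pct_gender <- round(as.vector(cross_pct), 1)       # same cell order as cross_n
out$N_dept     <- as.vector(dept_n[out$department])
out$pct_dept   <- round(as.vector(dept_pct[out$department]), 1)
out <- out[order(out$department, out$gender),
           c("department", "N_dept", "pct_dept", "gender", "N_gender", "pct_gender")]
```

The department-level values are repeated on every row of that department instead of being left blank, which keeps the result a regular data frame. Adding a further grouping variable would mean one more table() per level, so this is a starting point rather than a fully dynamic solution.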
