Quantile estimates for subpopulations where some subpopulations only have one case, using the srvyr and survey R packages

I am trying to produce estimates of the 25th percentile of a continuous variable for a series of sub-groups, where the data is taken from a survey that uses sampling weights. I am doing this in R using the survey and srvyr packages.
The issue I face is that in a small minority of cases a sub-group has only one observation, so a 25th percentile is meaningless. That in itself would be fine, except that it produces an error which prevents the percentiles from being calculated for the subgroups that do have sufficient observations.
Error in approxfun(cum.w, xx[oo], method = method, f = f, yleft = min(xx), :
need at least two non-NA values to interpolate
The code runs when the offending groups are removed; however, I have had to identify them manually, which is far from ideal.
Is there a way to achieve the same outcome, but where single-observation groups return an NA (or just the value of that observation) rather than an error? Alternatively, is there a neat way of automatically excluding such groups from the calculation?
Below is a reproducible example to illustrate my issue using the apistrat dataset from the survey package.
library(dplyr)
library(survey)
library(srvyr)

data(api)

#25th percentile of api00 by school type and whether the school is year-round or not
apistrat %>%
  as_survey(strata = stype, weights = pw) %>%
  group_by(yr.rnd, stype, .drop = TRUE) %>%
  summarise(survey_quantile(api00, 0.25, na.rm = TRUE))
#Error in approxfun(cum.w, xx[oo], method = method, f = f, yleft = min(xx), :
#need at least two non-NA values to interpolate
apistrat %>% group_by(yr.rnd, stype) %>% tally() %>% filter(n == 1)
#one group out of 6 has only a single api00 observation and therefore a quantile can't be interpolated

#Removing that one group means the code can now run as intended
apistrat %>%
  as_survey(strata = stype, weights = pw) %>%
  filter(!(yr.rnd == "Yes" & stype == "H")) %>%
  group_by(yr.rnd, stype, .drop = TRUE) %>%
  summarise(survey_quantile(api00, 0.25, na.rm = TRUE))
#Get the same error if you do it the 'survey' package way
dstrat <- svydesign(id = ~1, strata = ~stype, data = apistrat, fpc = ~fpc)
svyby(~api99, ~stype + yr.rnd, dstrat, svyquantile, quantiles = 0.25)

One work-around is to wrap the call to svyquantile() using tryCatch()
> svyq <- function(...){ tryCatch(svyquantile(...), error = function(e) matrix(NA, 1, 1)) }
> svyby(~api99, ~stype + yr.rnd, dstrat, svyq, quantiles = 0.25, keep.var = FALSE, na.rm = TRUE)
      stype yr.rnd statistic
E.No      E     No    560.50
H.No      H     No    532.75
M.No      M     No    509.00
E.Yes     E    Yes    456.00
H.Yes     H    Yes        NA
M.Yes     M    Yes    436.00
With quantiles and svyby() you need to be explicit about whether you want standard errors; the code above doesn't request them. If you do want standard errors, the error= branch of tryCatch() would need to return an actual svyquantile object with NAs in it.
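A further option is to drop the singleton groups automatically before the design is built, so they never have to be identified by hand. A minimal sketch, assuming dplyr's add_count() and reusing the apistrat example (group_n is just an illustrative column name):

library(dplyr)
library(srvyr)

apistrat %>%
  add_count(yr.rnd, stype, name = "group_n") %>%   # number of cases in each sub-group
  filter(group_n > 1) %>%                          # keep only sub-groups with at least two cases
  as_survey(strata = stype, weights = pw) %>%
  group_by(yr.rnd, stype) %>%
  summarise(survey_quantile(api00, 0.25, na.rm = TRUE))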

Related

Using boot::boot() function with grouped variables in R

This is a question both about using the boot() function with grouped variables and about passing multiple columns of data into boot(). Almost all examples of the boot() function seem to pass a single column of data to calculate a simple bootstrap of the mean.
My specific analysis is trying to use the stats::weighted.mean(x,w) function which takes a vector 'x' of values to calculate the mean and a second vector 'w' for weights. The main point is that I need two inputs into this function - and I'm hoping the solution will generalize to any function that takes multiple arguments.
I'm also looking for a solution to use this weighted.means function in a dplyr style workflow with group_by() variables. If the answer is that "it can't be done with dplyr", that's fine, I'm just trying to figure it out.
Below I simulate a dataset with three groups (A,B,C) that each have different ranges of counts. I also attempt to come up with a function "my.function" that will be used to bootstrap the weighted average. Here might be my first mistake: is this how I would set up a function to pass in the 'count' and 'weight' columns of data into each bootstrapped sample? Is there some other way to index the data?
Inside the summarise() call, I reference the original data with "." - Possibly another mistake?
The end result shows that I was able to achieve appropriately grouped calculations using mean() and weighted.mean(), but the calls for confidence intervals using boot() have instead calculated the 95% confidence interval around the global mean of the dataset.
Suggestions on what I'm doing wrong? Why is the boot() function referencing the entire dataset and not the grouped subsets?
library(tidyverse)
library(boot)
set.seed(20)
sample.data = data.frame(letter = rep(c('A', 'B', 'C'), each = 50) %>% as.factor(),
                         counts = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
                         weights = sample(10, 150, replace = TRUE))

##Define function to bootstrap
##I'm using stats::weighted.mean() which needs to take in two arguments
##############
my.function = function(data, index){
  d = data[index, ]  #create bootstrap sample of all columns of original data?
  return(weighted.mean(d$counts, d$weights))  #calculate weighted mean using 'counts' and 'weights' columns
}
##############
##############
## group by 'letter' and calculate weighted mean, and upper/lower 95% CI limits
## I pass data to boot using "." thinking that this would only pass each grouped subset of data
##(e.g., only letter "A") to boot, but instead it seems to pass the entire dataset.
sample.data %>%
  group_by(letter) %>%
  summarise(avg = mean(counts),
            wtd.avg = weighted.mean(counts, weights),
            CI.LL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[4],
            CI.UL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[5])
Below I've calculated a rough estimate of the 95% confidence interval around the global mean, to show that this is what was going on with boot() inside my summarise() call above.
#Here is a rough 95% confidence interval estimate as +/- 1.96* Standard Error
mean(sample.data$counts) + c(-1,1) * 1.96 * sd(sample.data$counts)/sqrt(length(sample.data[,1]))
The following base R solution solves the problem of bootstrapping by groups. Note that boot::boot() is called only once per group (the OP's summarise() call runs it twice per group).
library(boot)
sp <- split(sample.data, sample.data$letter)
y <- lapply(sp, function(x){
  wtd.avg <- weighted.mean(x$counts, x$weights)
  basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
  CI.LL <- basic[4]
  CI.UL <- basic[5]
  data.frame(wtd.avg, CI.LL, CI.UL)
})
do.call(rbind, y)
# wtd.avg CI.LL CI.UL
#A 19.49044 17.77139 21.16161
#B 50.49048 48.79029 52.55376
#C 82.36993 78.80352 87.51872
Final clean-up:
rm(sp)
A dplyr solution could be the following. It also calls map_dfr from package purrr.
library(boot)
library(dplyr)
sample.data %>%
  group_split(letter) %>%
  purrr::map_dfr(
    function(x){
      wtd.avg <- weighted.mean(x$counts, x$weights)
      basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
      CI.LL <- basic[4]
      CI.UL <- basic[5]
      data.frame(wtd.avg, CI.LL, CI.UL)
    }
  )
# wtd.avg CI.LL CI.UL
#1 19.49044 17.77139 21.16161
#2 50.49048 48.79029 52.55376
#3 82.36993 78.80352 87.51872
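If staying inside a single summarise() call is preferred, the OP's pipe can also be made to work by handing boot() only the current group's rows instead of the whole data (the bare "." refers to the entire data frame entering the pipe, which is why the global confidence interval appeared). A minimal sketch, assuming dplyr >= 1.0 where cur_data() is available (newer versions supersede it with pick()):

library(boot)
library(dplyr)

sample.data %>%
  group_by(letter) %>%
  summarise(wtd.avg = weighted.mean(counts, weights),
            ci = list(boot.ci(boot(cur_data(), my.function, R = 100), type = "basic")$basic),
            CI.LL = ci[[1]][4],   # lower limit of the basic interval
            CI.UL = ci[[1]][5]) %>%  # upper limit of the basic interval
  select(-ci)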

Arranging tables in R

I am working with the dataset twinData in R. I have two questions related to data wrangling.
How would I go about listing only the combinations of cohort and zygosity where the twins’ heights are significantly similar?
My prior code was to create a new variable indicating whether the correlation coefficient between ht1 and ht2 in the particular subgroup is greater than 0.5, with 95 percent confidence.
library(dplyr)
library(broom)  # for tidy()

sig_twin_cor <- twinData %>%
  group_by(cohort, zygosity) %>%
  do(tidy(cor.test(~ ht1 + ht2, alternative = "greater", data = .))) %>%
  arrange(desc(estimate)) %>%
  mutate(Greater0.5 = ifelse(estimate > 0.5, "Yes", "No"))

sig_twin_cor
Second, I need to transform the dataset twinData into a narrow form using gather(). Can someone show me how to do this?
Thanks!
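For the gather() part, here is a minimal sketch under the assumption that "narrow form" means stacking the two twin height columns (ht1, ht2) into a key/value pair; the names twin and height are purely illustrative:

library(dplyr)
library(tidyr)

# stack ht1 and ht2 into one 'height' column, with a 'twin' column recording the source
twin_long <- twinData %>%
  gather(key = "twin", value = "height", ht1, ht2)

# Equivalent with the newer tidyr interface:
# twin_long <- twinData %>%
#   pivot_longer(c(ht1, ht2), names_to = "twin", values_to = "height")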

How to calculate the pooled standard deviation in R / how to use the pooled.sd() function in RStudio

How do I calculate the pooled standard deviation in R?
Below is the code to load my dataset (as my dataset contains many entries, I cannot copy-paste it here).
install.packages("Sleuth3")
library(Sleuth3)
View(ex0126)
To find the mean and standard deviation for each group individually (i.e., for parties R and D), I used the R code below.
library(Sleuth3)
ex0126
View(ex0126)
#Average of each group individually for party (R,D)
meanOfR <- subset(aggregate(ex0126[, 4:10], list(ex0126$Party), mean, na.rm=TRUE), Group.1=='R')
meanOfR
meanOfD <- subset(aggregate(ex0126[, 4:10], list(ex0126$Party), mean, na.rm=TRUE), Group.1=='D')
meanOfD
#Sample standard deviation for party (R,D)
sdOfR <- subset(aggregate(ex0126[, 4:10], list(ex0126$Party), sd, na.rm=TRUE), Group.1=='R')
sdOfR
sdOfD <- subset(aggregate(ex0126[, 4:10], list(ex0126$Party), sd, na.rm=TRUE), Group.1=='D')
sdOfD
But how do I find the pooled standard deviation from the above sample standard deviations for parties R and D?
It depends on which pooled estimate you want. Here is the most general estimate, which allows for unequal group sizes:
data(ex0126, package = "Sleuth3")
library(dplyr)
#' Calculate pooled variance given a data.frame with columns (var, n) for each group.
#' All other columns are ignored
pooled_var <- function(df){
  # use the sd column if present, otherwise an existing var column
  var <- if ('sd' %in% names(df)) df$sd^2 else df$var
  n_groups <- nrow(df)
  if (is.null(dim(var))) {
    # single vector of group variances
    if (length(var) != n_groups) stop('inconsistent size of variance and n')
    sum(var * (df$n - 1)) / (sum(df$n) - n_groups)
  } else {
    # one column of group variances per variable
    if (nrow(var) != n_groups) stop('inconsistent size of variance and n')
    colSums(sweep(var, 1, df$n - 1, '*')) / (sum(df$n) - n_groups)
  }
}
ex0126 %>%
  select(4:10, Party) %>%
  group_by(Party) %>%
  na.omit() %>%
  summarise(var = across(1:6, var), n = n()) %>%
  pooled_var() %>%
  sqrt()
Note that
select chooses the columns I want to use
na.omit is used to avoid including missing values in the variance calculations
group_by tells my pipe that everything needs to be done to each group in Party
summarise/summarize is used to aggregate a function across rows
across is used to perform the same action over multiple columns
The output of across is itself a tibble (a data.frame-like structure), so df$var becomes a tibble in pooled_var
By default summarise ungroups the result at the end, so all subsequent calls no longer operate within each "group"
In pooled_var I assume columns var and n exist, and simply use the standard formula to calculate the pooled variance
Within pooled_var I handle both single vectors and multiple columns, based on whether df$var has multiple dimensions or not
And sqrt is called at the end to go from the pooled variance to a pooled standard deviation.
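As a quick sanity check of pooled_var(), here is a toy input with made-up group sizes and standard deviations (the numbers are purely illustrative):

# Hypothetical group summaries: one row per group, with its sd and n
toy <- data.frame(sd = c(2, 3), n = c(10, 15))
pooled_var(toy)        # ((10 - 1) * 2^2 + (15 - 1) * 3^2) / (10 + 15 - 2) = 7.04...
sqrt(pooled_var(toy))  # pooled standard deviation, about 2.65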
Use the sample.decomp function in the utilities package
Since you have access to the underlying dataset, it is possible to compute the pooled standard deviation directly on the underlying pooled data. However, you can also compute the pooled standard deviation from the pooled moments and group sizes. This is implemented in the sample.decomp function in the utilities package. This function can compute pooled sample moments from subgroup moments, or compute missing subgroup moments from the other subgroup moments and pooled moments. It works for decompositions up to fourth order, i.e., decompositions of sample size, sample mean, sample variance/standard deviation, sample skewness, and sample kurtosis.
How to use the function: I am going to assume that, in addition to computing the moments, you can also compute the sizes of the two groups, which I will designate as sizeR and sizeD. You can then use the sample.decomp function to obtain the pooled sample moments from the subgroup sample moments.
#Input the sample statistics for the subgroups
N    <- c(sizeR, sizeD)
MEAN <- c(meanOfR, meanOfD)
SD   <- c(sdOfR, sdOfD)

#Compute the sample decomposition
library(utilities)
sample.decomp(n = N, sample.mean = MEAN, sample.sd = SD, include.sd = TRUE)
Since you have not given the values of your moments and group sizes, I cannot show you the pooled standard deviation you get as your output. However, the above code will give you a table showing the moments of the input groups and the pooled sample. This will include the pooled standard deviation.

Fama Macbeth Regression in R pmg

In the past few days I have been trying to work out how to run Fama Macbeth regressions in R. The advice is to use the plm package with pmg(); however, every attempt returns an error telling me that I have an insufficient number of time periods.
My dataset consists of 2,828,419 observations with 13 columns of variables, on which I am looking to run multiple cross-sectional regressions.
My firms are identified by seriesid, I have a variable date, and I want to run the following Fama Macbeth regressions:
totret ~ size
totret ~ momentum
totret ~ reversal
totret ~ volatility
totret ~ value + size
totret ~ value + size + momentum
totret ~ value + size + momentum + reversal + volatility
I have been using this command:
fpmg <- pmg(totret ~ momentum, Data, index = c("date", "seriesid"))
Which returns: Error in pmg(totret ~ mom, Dataset, index = c("seriesid", "datem")) : Insufficient number of time periods
I tried it with my dataset as a data.table, data.frame and pdata.frame. Switching the index does not work either.
My data contains NAs as well.
Can anyone fix this, or suggest a different way for me to do Fama Macbeth?
This is almost certainly due to having NAs in the variables in your formula. The error message is not very helpful - it is probably not a case of "too few time periods to estimate" and very likely a case of "there are firm/unit IDs that are not represented across all time periods" due to missing data being dropped.
You have two options: impute the missing data, or drop observations with missing data (the latter is a quick way to test that the model works without missing points before deciding on an approach that is valid for estimation).
If the missingness in your data is truly random, you might be okay just dropping observations with missingness. Otherwise you should probably impute. A common strategy here is to impute multiple times - at least 5 - and then estimate for each of those 5 resulting data sets and average the effect together. Amelia or mice are very strong imputation packages. I like Amelia because with one call you can impute n times for that many resulting data sets and it's easy to pass in a set of variables to not impute (e.g., id variable or time period) with the idvars parameter.
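A minimal sketch of the Amelia route described above, assuming the data frame and column names from the question (Data, seriesid, date); the idvars argument keeps the identifier and time columns out of the imputation model:

library(Amelia)

# Impute m = 5 completed data sets; seriesid and date are passed through untouched
imp <- amelia(Data, m = 5, idvars = c("seriesid", "date"))

# imp$imputations is a list of 5 completed data sets; pmg() can be fitted on each
# and the coefficient estimates averaged across the fits.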
EDIT: I dug into the source code to see where the error was triggered and here is what the issue is - again likely caused by missing data, but it does interact with your degrees of freedom:
...
# part of the code where the error is triggered below, here is the context:
#   X   = matrix of the RHS of your model including the intercept, so X[, 1] is all 1s
#   k   = number of coefficients, determined by length(coef(plm.model))
#   ind = vector of ID values
# so t here is the minimum value from a count of occurrences for each unique ID
t <- min(tapply(X[, 1], ind, length))

# then if the minimum number of times a single ID appears across time is
# less than the number of coefficients + 1, you do not have enough time
# points (for that ID/those IDs) to estimate
if (t < (k + 1))
  stop("Insufficient number of time periods")
That is what is triggering your error. So imputation is definitely a solution, but there might be a single offender in your data and importantly, once this condition is satisfied your model will run just fine with missing data.
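Based on that condition, a quick diagnostic sketch to spot the offending IDs before calling pmg(), assuming the question's Data, totret, momentum and seriesid names; k should be set to the number of coefficients in the model being fitted, and swap seriesid for date if, with your index order, pmg() groups on the time variable instead:

library(dplyr)

k <- 2  # e.g. intercept + momentum for totret ~ momentum

Data %>%
  filter(!is.na(totret), !is.na(momentum)) %>%   # rows that survive listwise deletion
  count(seriesid, name = "n_obs") %>%
  filter(n_obs < k + 1)                          # IDs that would trigger the error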
I have since managed to fix the Fama Macbeth regression in R.
Starting from a data table with all of the characteristics in the rows, the following works and gives the option to apply weights to the regressions or run them equally weighted (remove ", weights = marketcap" for the equally weighted version). totret is a total return variable and logmarket is the logarithm of market capitalization.
logmarket <- df %>%
  group_by(date) %>%
  summarise(constant = summary(lm(totret ~ logmarket, weights = marketcap))$coefficient[1],
            rsquared = summary(lm(totret ~ logmarket, weights = marketcap))$r.squared,
            beta     = summary(lm(totret ~ logmarket, weights = marketcap))$coefficient[2])
You obtain a data frame with monthly alphas (constant), betas (beta), and R squared values (rsquared).
To retrieve coefficients with t-statistics in a dataframe:
library(lmtest)  # for coeftest()

Summarystatistics <- as.data.frame(matrix(data = NA, nrow = 6, ncol = 1))
names(Summarystatistics) <- "logmarket"
row.names(Summarystatistics) <- c("constant", "t-stat", "beta", "tstat", "R^2", "observations")

Summarystatistics[1, 1] <- mean(logmarket$constant)
Summarystatistics[2, 1] <- coeftest(lm(logmarket$constant ~ 1))[1, 3]
Summarystatistics[3, 1] <- mean(logmarket$beta)
Summarystatistics[4, 1] <- coeftest(lm(logmarket$beta ~ 1))[1, 3]
Summarystatistics[5, 1] <- mean(logmarket$rsquared)
Summarystatistics[6, 1] <- nrow(subset(df, !is.na(logmarket)))
Some "seriesid" values occur only once, which is why pmg() gives the error. If you do something like the following (with the variable names you use), it will stop the error:
try2 <- try2 %>%
  group_by(cusip) %>%
  mutate(flag = (if (length(cusip) == 1) {1} else {0})) %>%
  ungroup() %>%
  filter(flag == 0)
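The same filter can be written a little more compactly; a sketch assuming dplyr's add_count(), whose helper column n is dropped afterwards:

library(dplyr)

try2 <- try2 %>%
  add_count(cusip) %>%   # n = number of rows per cusip
  filter(n > 1) %>%      # drop firms with only a single observation
  select(-n)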

Linear regression on subsets with dependent variable per column using dlply() in R

I would like to automatically produce linear regressions for a data frame for each category separately.
My data frame includes one column with time categories, one column (slope$Abs) as the dependent variable, and several columns which should be used as independent variables.
head(slope)
   timepoint   Abs      In1      In2      In3     Out1     Out2     Out3 ...
1:        t0 275.0 2.169214 2.169214 2.169214 2.069684 2.069684 2.069684
2:        t0 275.5 2.163937 2.163937 2.163937 2.063853 2.063853 2.063853
3:        t0 276.0 2.153298 2.158632 2.153298 2.052088 2.052088 2.057988
4: ...
All in all for each timepoint I have 40 variables, and I want to end up with a linear regression for each combination. Such as In1~Abs[t0], In1~Abs[t1] and so on for each column.
Of course I can do this manually, but I guess there must be a more elegant way to do the work.
I did my research and found out that dlply() might be the function I'm looking for. However, my attempt results in an error.
So I somehow tried to combine the answers from previous questions I have found:
On individual variables per column and on subsets per category
I came up with a function like this:
lm.fun <- function(x) {summary(lm(x ~ slope$Abs, data=slope))}
lm.list <- dlply(.data=slope, .variables=slope$timepoint, .fun=lm.fun )
But I get the following error:
Error in eval.quoted(.variables, data) :
envir must be either NULL, a list, or an environment.
Hope someone can help me out.
Thanks a lot in advance!
The dplyr package in R does not do well at accepting formulas of the form y ~ x in its functions, based on my research. So the alternative is to calculate the regression somewhat manually. Now let me first note that slope = cor(x, y) * sd(y) / sd(x) (reference here: http://faculty.cas.usf.edu/mbrannick/regression/regbas.html) and that intercept = mean(y) - slope * mean(x). Simple linear regression requires that we use the centroid as our point of reference when finding the intercept, because it is an unbiased estimator. Using a single point will only get you the intercept of that individual point and not the overall intercept.
Now for this explanation, I will be using the mtcars data set. I only wanted a subset of the data so I am using variables c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec') to basically mimic your dataset. In my example, my grouping variable is 'cyl', which is the equivalent of your 'timepoint' variable. The variable 'mpg' is the y-variable in this case, which is equivalent to 'Abs' in your data.
Based on my explanation of slope and intercept above, it is clear that we need three tables/datasets: a correlation dataset for your y with respect to your x for each group, a standard deviation table for each variable and group, and a table of means for each group and each variable.
To get the correlation dataset, we want to group by 'cyl' and calculate the correlation coefficients; you should use:
df <- mtcars[c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec')]
corrs <- data.frame(df %>% group_by(cyl) %>% do(head(data.frame(cor(.[, c(1, 3:7)])), n = 1)))
Because of the way my dataset is structured, the second variable (df[ ,2]) is 'cyl'. For you, you should use
do(head(data.frame(cor(.[,c(2:40)])), n = 1)))
since your first column is the grouping variable and it is not numeric. Essentially, you want to go across all numeric variables. Not using head will produce a correlation matrix, but since you are interested in finding the slope independent of each other x-variable, you only need the row that has the correlation coefficient of your y-variable equal to 1 (r_yy = 1).
To get the standard deviations and means for each group and each variable, use
sds <- data.frame(df %>% group_by(cyl) %>% summarise_each(funs(sd)))
means <- data.frame(df %>% group_by(cyl) %>% summarise_each(funs(mean)))
Your group names will be the first column, so make sure to rename your rows for each dataset corrs, sds, and means and delete column 1.
rownames(corrs) <- rownames(means) <- rownames(sds) <- corrs[ ,1]
corrs <- corrs[ ,-1]; sds <- sds[ ,-1]; means <- means[ ,-1]
Now we need to calculate sd(y)/sd(x). The best way I have done this, and seen it done, is with an apply-family function.
sdst <- data.frame(t(apply(sds, 1, function(X) X[1]/X)))
I use X[1] because the first variable in sds is my y-variable. The first variable after you have deleted timepoint is Abs which is your y-variable. So use that.
Now the rest is pretty straightforward. Since everything is saved as a data frame, to find the slope all you need to do is
slopes <- sdst*corrs
inter <- slopes*means
intercept <- data.frame(t(apply(inter, 1, function(x) x[1]-x)))
Again here, since our y-variable is in the first column, we use x[1]. To check if all is well, your slopes for your y-variable should be 1 and the intercept should be 0.
I have solved the issue with a simpler approach, so I wanted to update the answer.
To make life easier I converted the data frame structure so that all measurement columns are stacked into rows, using the melt() function of the reshape package.
slope <- melt(slope, id = c("Abs", "timepoint"), variable_name = "Sites")
The output's column name is by default "value".
Then create one column that adds both predictors with paste().
slope$FullTreat <- paste(slope$Sites,slope$timepoint, sep="_")
Run a function through the dataset to create separate models for each treatment combination.
models <- dlply(slope, ~ FullTreat, function(df) {
  lm(value ~ Abs, data = df)
})
To extract the coefficients simply run
coefs <- ldply(models, coef)
Then split the FullTreat column into separate columns again with colsplit(), also from reshape, and add the intercept and slope to the new data frame:
coefs <- cbind(colsplit(coefs$FullTreat, split = "_", names = c("Sites", "Timepoint")),
               coefs[, 2:3])
I haven't worked on a function that plots all the regressions from the models, but I guess this is feasible with the ldply() function.
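For the plotting step, one possible sketch is to skip the extracted models entirely and let ggplot2 fit and draw a line per treatment combination directly from the melted data:

library(ggplot2)

# slope is the melted data frame from above, with columns Abs, value and FullTreat
ggplot(slope, aes(x = Abs, y = value)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +   # per-facet least-squares fit
  facet_wrap(~ FullTreat)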
