Applying survey weights, and a weighted average concurrently - r

I have a survey for which I need to do two things:
I need to apply survey weights to a set of variables using the survey package to retrieve the 'weighted' mean AND
I need to find the weighted average of those variables.
I only want the final weighted mean for each variable after doing both these things.
I know how to find the survey weighted mean and the weighted average separately, but I do not know how to apply them together, or in which order to apply these weights. Here is an example below of my data, and how I could find the 'survey weighted mean' and the 'weighted average' separately.
Please see below for sample data:
library(survey)
dat_in <- read_table2("code CCS trad_sec Q1 enrolled wgt
23 TRUE sec 20 400 1.4
66 FALSE trad 40 20 3.0
34 TRUE sec 30 400 4.4
78 FALSE sec 40 25 2.2
84 TRUE trad 20 25 3.7
97 FALSE sec 10 500 4.1
110 TRUE sec 80 1000 4.5
123 FALSE trad 33 679 4.8
137 TRUE sec 34 764 5.2
150 FALSE sec 43 850 5.6
163 TRUE trad 45 935 6.0
177 FALSE trad 46 1020 6.4
190 TRUE trad 48 1105 6.7
203 FALSE trad 50 1190 7.1
217 TRUE trad 52 1276 7.5
230 FALSE trad 53 1361 7.9
243 TRUE trad 55 1446 8.3
256 FALSE trad 57 1531 8.6
270 TRUE sec 59 1616 9.0
283 FALSE sec 60 1701 9.4
296 TRUE sec 62 1787 9.8
310 FALSE sec 64 1872 10.2
")
1. To apply survey weights:
Create survey design
SurveyDesign<- svydesign(id =~code,
weights = ~wgt,
data = dat_in)
Find weighted mean and tabulations
# For CCS FALSE, sec
svymean(~Q1, subset(SurveyDesign,CCS=="FALSE" & trad_sec %in% c("sec")), na.rm = T)
# For CCS TRUE, sec
svymean(~Q1, subset(SurveyDesign,CCS=="TRUE" & trad_sec %in% c("sec")), na.rm = T)
2. To find weighted average:
Weighted average based on enrollment
*edited based on comment
dat_in %>% group_by(CCS, trad_sec) %>% mutate(wgtQ1 = weighted.mean(Q1, w = enrolled))
Possible solution that combines 1 and 2? (based on crowd-source)
generate weighted average by group
dat_in2 <- dat_in %>%
group_by(CCS, trad_sec) %>%
mutate(wgtQ1 = weighted.mean(Q1, w = enrolled)) %>%
ungroup
Create survey design
SurveyDesign2<- svydesign(id =~code,
weights = ~wgt,
data = dat_in2)
Run mean on aggregated weighted average
svymean(~wgtQ1, subset(SurveyDesign2,CCS=="FALSE" & trad_sec %in% c("sec")), na.rm = T)
My intuition is that I should apply the weighted average first and THEN apply the survey weights? The above solution seems off, because each row then holds the weighted average for its group (CCS, trad_sec), whereas the design object should be fed disaggregated data?
All suggestions much appreciated!

I assume you care about standard error estimates (since otherwise you can just multiply the two sets of weights and use weighted.mean). If so, it matters whether there is sampling uncertainty in the enrolled variable as well as in Q1, and whether the sampling weights should be applied to that variable too. If enrolled should be treated as fixed, with no sampling weights applied to it, use svyby to get the group means and svycontrast to weight them:
> means<-svyby(~Q1, ~CCS, svymean, design=subset(SurveyDesign, trad_sec %in% "sec"),
covmat=TRUE)
> means
CCS Q1 se
FALSE FALSE 50.36825 7.767602
TRUE TRUE 53.51020 6.453270
> with(subset(dat_in, trad_sec=="sec"), by(enrolled, list(CCS), sum))
: FALSE
[1] 4948
---------------------------------------------------------------------------------------
: TRUE
[1] 5967
> svycontrast(means, c(4948/(4948+5967),5967/(4948+5967)))
contrast SE
contrast 52.086 4.9845
If you want sampling weights applied to enrolled as well, I think you want svyratio to estimate a sampling-weighted version of
sum(enrolled*Q1)/sum(enrolled). You can do that one at a time:
> svyratio(~I(Q1*enrolled),~enrolled,
design=subset(SurveyDesign, trad_sec=="sec" & CCS==TRUE))
Ratio estimator: svyratio.survey.design2(~I(Q1 * enrolled), ~enrolled, design = subset(SurveyDesign,
trad_sec == "sec" & CCS == TRUE))
Ratios=
enrolled
I(Q1 * enrolled) 58.41278
SEs=
enrolled
I(Q1 * enrolled) 3.838715
> svyratio(~I(Q1*enrolled),~enrolled,
design=subset(SurveyDesign, trad_sec=="sec" & CCS==FALSE))
Ratio estimator: svyratio.survey.design2(~I(Q1 * enrolled), ~enrolled, design = subset(SurveyDesign,
trad_sec == "sec" & CCS == FALSE))
Ratios=
enrolled
I(Q1 * enrolled) 57.42204
SEs=
enrolled
I(Q1 * enrolled) 4.340065
or with svyby
> svyby(~I(Q1*enrolled),~CCS, svyratio, denom=~enrolled,
design=subset(SurveyDesign, trad_sec=="sec"))
CCS I(Q1 * enrolled)/enrolled se.I(Q1 * enrolled)/enrolled
FALSE FALSE 57.42204 4.340065
TRUE TRUE 58.41278 3.838715
(a note: it helps if you specify all the packages needed for your example code to run; in your case readr for read_table2)
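As a check on the point estimates only (no standard errors), the remark about multiplying the two sets of weights can be sketched in base R. The sec rows are retyped from the sample data in the question; the object names sec, survey_only and combined are illustrative and not part of the survey package.

```r
# sec rows from the question's sample data (codes 23, 34, 78, 97, 110, 137, 150, 270, 283, 296, 310)
sec <- data.frame(
  CCS      = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  Q1       = c(20, 30, 40, 10, 80, 34, 43, 59, 60, 62, 64),
  enrolled = c(400, 400, 25, 500, 1000, 764, 850, 1616, 1701, 1787, 1872),
  wgt      = c(1.4, 4.4, 2.2, 4.1, 4.5, 5.2, 5.6, 9.0, 9.4, 9.8, 10.2)
)

by_ccs <- split(sec, sec$CCS)

# Survey-weighted mean of Q1 per CCS group (sampling weights only)
survey_only <- sapply(by_ccs, function(g) weighted.mean(g$Q1, w = g$wgt))

# Sampling weights multiplied by enrollment, i.e. a weighted version of
# sum(enrolled * Q1) / sum(enrolled)
combined <- sapply(by_ccs, function(g) weighted.mean(g$Q1, w = g$wgt * g$enrolled))

print(round(survey_only, 5))
print(round(combined, 5))
```

These agree with the point estimates that svymean and svyratio print above; the standard errors still have to come from the design object.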

Related

How to perform repeated k-fold cross validation in R with DAAG package?

I have created a 3-fold linear regression model using the HousePrices data set of DAAG package. I have read some of the threads in here and in Cross Validated and it was mentioned multiple times that the cross validation must be repeated many times (like 50 or 100) for robustness. I'm not sure what it means? Does it mean to simply run the code 50 times and calculate the average of the overall ms?
> cv.lm(data = DAAG::houseprices, form.lm = formula(sale.price ~ area+bedrooms),
+ m = 3, dots = FALSE, seed = 29, plotit = c("Observed","Residual"),
+ main="Small symbols show cross-validation predicted values",
+ legend.pos="topleft", printit = TRUE)
Analysis of Variance Table
Response: sale.price
Df Sum Sq Mean Sq F value Pr(>F)
area 1 18566 18566 17.0 0.0014 **
bedrooms 1 17065 17065 15.6 0.0019 **
Residuals 12 13114 1093
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
fold 1
Observations in test set: 5
11 20 21 22 23
Predicted 206 249 259.8 293.3 378
cvpred 204 188 199.3 234.7 262
sale.price 215 255 260.0 293.0 375
CV residual 11 67 60.7 58.3 113
Sum of squares = 24351 Mean square = 4870 n = 5
fold 2
Observations in test set: 5
10 13 14 17 18
Predicted 220.5 193.6 228.8 236.6 218.0
cvpred 226.1 204.9 232.6 238.8 224.1
sale.price 215.0 112.7 185.0 276.0 260.0
CV residual -11.1 -92.2 -47.6 37.2 35.9
Sum of squares = 13563 Mean square = 2713 n = 5
fold 3
Observations in test set: 5
9 12 15 16 19
Predicted 190.5 286.3 208.6 193.3 204
cvpred 174.8 312.5 200.8 178.9 194
sale.price 192.0 274.0 212.0 220.0 222
CV residual 17.2 -38.5 11.2 41.1 27
Sum of squares = 4323 Mean square = 865 n = 5
Overall (Sum over all 5 folds)
ms
2816
Every time I repeat it I get this same ms=2816. Can someone please explain what exactly it means to repeat the CV 100 times? Because repeating this code 100 times doesn't seem to change the ms.
Repeating this code 100 times will not change anything. You have set a seed which means that your sets are always the same sets, which means with three folds, you will have the same three folds, so all 100 times you will get the same mean square error.
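A minimal base R sketch of the seed effect, assuming the built-in mtcars data and the formula mpg ~ wt + hp as stand-ins for the houseprices model. The helper cv_mse is hypothetical and is not how DAAG's cv.lm is implemented. With a fixed seed every run gives the same folds and the same mean square; with a different seed per run the folds change, and the runs can be averaged.

```r
# One k-fold CV run: mean squared prediction error over all held-out rows
cv_mse <- function(data, formula, k, seed) {
  set.seed(seed)
  # Random fold label for every row, as balanced as possible
  fold <- sample(rep_len(seq_len(k), nrow(data)))
  response <- all.vars(formula)[1]
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    test <- fold == i
    fit <- lm(formula, data = data[!test, , drop = FALSE])
    pred <- predict(fit, newdata = data[test, , drop = FALSE])
    sq_err[test] <- (data[[response]][test] - pred)^2
  }
  mean(sq_err)
}

# Same seed three times: identical folds, identical result
same_seed <- replicate(3, cv_mse(mtcars, mpg ~ wt + hp, k = 3, seed = 29))

# A different seed per repetition: the folds change, so the estimates vary
diff_seed <- sapply(1:50, function(s) cv_mse(mtcars, mpg ~ wt + hp, k = 3, seed = s))

print(same_seed)
print(mean(diff_seed))  # average over the 50 repetitions
```

Averaging over repetitions is different from picking the best-performing seed: every repetition contributes to the reported figure.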
You do not seem to have enough samples to support 50 or 100 folds. And there is NO set number of folds that is appropriate across all sets of data.
The number of folds should be reasonable such that you have sufficient testing data.
Also, you do not want to run multiple different CV models with different seeds, to try to find the best performing seed, because that form of error hacking is a proxy for overfitting.
You should groom your data well, engineer and transform your variables properly, pick a reasonable number of folds, set a seed so your stakeholders can repeat your findings, and then build your model.

extract beta, sd and P-value from meta-regression using meta package in r to a nice output

I am using the code below to do meta-regression in R and repeat it several times for different variables.
My data frame and code are as follows:
data<-read.table(text="Studlab PCI.total.FU CABG.total.FU PCI CABG Mean.Age Females..
A 4515 4485 45 51 65.1 22.35
B 4740 4785 74 49 65.95 23.15
C 3621.4 3598.6 41 31 63.15 28.65
D 2337 2314.2 20 29 60 30.5
E 1835.2 1835.2 20 16 66.2 22
F 2014.8 2033.2 11 6 64.45 28.55
G 1125 1125 4 5 61.95 20.65
H 1500 1500 6 3 62.25 23.5
I 976 1000 11 3 61.5 21
J 202 194 10 0 62.4 1", sep="", header=T)
library(meta);library(metafor)
mr <- metainc( PCI, PCI.total.FU,CABG, CABG.total.FU,
data = data, studlab = Studlab, method = "Inverse")
Then for meta-regression I used the following code
MEG<-metareg (mr, ~Mean.Age);MEG ;
#==================================
b = round(MEG[["b"]], digits = 2)
se = round(MEG[["se"]], digits = 2)
pval = round(MEG[["pval"]], digits = 2)
paste0(b,"±",se,", P=",pval)
# Then I repeat meta-regression with another variable
MEG<-metareg (mr, ~Females..);MEG
#==================================
b = round(MEG[["b"]], digits = 2)
se = round(MEG[["se"]], digits = 2)
pval = round(MEG[["pval"]], digits = 2)
paste0(b,"±",se,", P=",pval)
and so on. So the b, se, pval and paste0 steps are repeated every time to get the needed output.
The content of MEG is shown in the screenshot below.
My question: is there any way to repeat these steps for different variables (here I used "Mean.Age" and then "Females..")? In other words, I fit several MEG models with different variables, and I am wondering whether there is something like a macro that would call these steps repeatedly without copying and pasting the code each time.
Any advice will be greatly appreciated.
I am doing that to finally create a table like this
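One way to avoid the copy and paste is to wrap the repeated rounding and pasting steps in a function. This is only a sketch: format_fit is a hypothetical helper, and mock_fit is a made-up list that merely imitates the b, se and pval elements extracted above, so the numbers in it mean nothing.

```r
# Format any fitted object that exposes "b", "se" and "pval" elements
format_fit <- function(fit, digits = 2) {
  b    <- round(fit[["b"]], digits)
  se   <- round(fit[["se"]], digits)
  pval <- round(fit[["pval"]], digits)
  paste0(b, "±", se, ", P=", pval)
}

# Made-up stand-in for a metareg result (intercept and slope), for illustration only
mock_fit <- list(b = c(1.234, -0.056), se = c(0.5, 0.021), pval = c(0.0134, 0.0071))

out <- format_fit(mock_fit)
print(out)
```

With the real objects, the fits from the question could be collected in a list and formatted in one pass, e.g. lapply(list(metareg(mr, ~Mean.Age), metareg(mr, ~Females..)), format_fit).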

Stratifying multiple columns for cross-validation

There are many ways I've seen to stratify a sample by a single variable to use for cross-validation. The caret package does this nicely with the createFolds() function. By default it seems that caret will partition such that each fold has roughly the same target event rate.
What I want to do though is stratify by the target rate and by time. I've found a function that can partially do this: stratified() from the splitstackshape package. The issue with that function is that it returns a single sample; it doesn't split the data into k groups under the given conditions.
Here's some dummy data to reproduce.
set.seed(123)
time = rep(seq(1:10),100)
target = rbinom(n=100, size=1, prob=0.3)
data = as.data.frame(cbind(time,target))
table(data$time,data$target)
0 1
1 60 40
2 80 20
3 80 20
4 60 40
5 80 20
6 80 20
7 60 40
8 60 40
9 70 30
10 80 20
As you can see, the target event rate is not the same across time. It's 40% in time 1 and 20% in time 2, etc. I want to preserve this when creating the folds used for cross-validation. If I understand correctly, caret will partition by the overall event rate.
table(data$target)
0 1
710 290
This rate of ~30% will be preserved overall, but target event rate over time will not.
We can get one sample like this:
library(splitstackshape)
train.index <- stratified(data,c("target","time"),size=.2)
I need to repeat this though 4 more times for a 5-fold cross validation and it needs to be done such that once a row is assigned it can't be assigned again. I feel like there should be a function designed for this already. Any ideas?
I know this post is old but I just had the same problem and I couldn't find another solution. In case anyone else needs an answer, here's the solution I'm implementing.
library(data.table)
mystratified <- function(indt, group, NUM_FOLDS) {
  indt <- setDT(copy(indt))
  # allow the grouping columns to be given by position
  if (is.numeric(group))
    group <- names(indt)[group]
  temp_grp <- temp_ind <- NULL
  # row index, plus a single key identifying each stratum
  indt[, `:=`(temp_ind, .I)]
  indt[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]
  samp_sizes <- indt[, .N, by = group]
  samp_sizes[, `:=`(temp_grp, do.call(paste, .SD)), .SDcols = group]
  # row indices listed stratum by stratum
  inds <- split(indt$temp_ind, indt$temp_grp)[samp_sizes$temp_grp]
  z <- unlist(inds, use.names = FALSE)
  # deal the indices round-robin into folds so each stratum is spread over all folds
  suppressWarnings(split(z, 1:NUM_FOLDS))
}
This is basically a rewrite of splitstackshape::stratified. It works as follows, returning a list of validation indices for each fold.
myfolds = mystratified(indt = data, group = colnames(data), NUM_FOLDS = 5)
str(myfolds)
List of 5
$ 1: int [1:200] 1 91 181 261 351 441 501 591 681 761 ...
$ 2: int [1:200] 41 101 191 281 361 451 541 601 691 781 ...
$ 3: int [1:200] 51 141 201 291 381 461 551 641 701 791 ...
$ 4: int [1:200] 61 151 241 301 391 481 561 651 741 801 ...
$ 5: int [1:200] 81 161 251 341 401 491 581 661 751 841 ...
So, for instance the train and validation data for each fold are:
# first fold
train = data[-myfolds[[1]],]
valid = data[myfolds[[1]],]
# second fold
train = data[-myfolds[[2]],]
valid = data[myfolds[[2]],]
# etc...
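For comparison, a base R sketch of the same idea with the fold labels shuffled inside each stratum. The data are rebuilt in the same shape as the question's dummy data; dat, stratum and the fold column are illustrative names, and this is not how caret or splitstackshape do it.

```r
set.seed(123)
# Dummy data shaped like the question's: 10 time points, binary target
dat <- data.frame(time   = rep(1:10, 100),
                  target = rep(rbinom(n = 100, size = 1, prob = 0.3), 10))
k <- 5

# One stratum per (time, target) combination
stratum <- interaction(dat$time, dat$target, drop = TRUE)

# Within each stratum, hand out fold labels 1..k as evenly as possible, in random order
dat$fold <- ave(seq_len(nrow(dat)), stratum,
                FUN = function(i) sample(rep_len(seq_len(k), length(i))))

# Each row lands in exactly one fold; fold f is the validation set for split f
valid_rows <- split(seq_len(nrow(dat)), dat$fold)

# Event rate by time (rows) within each fold (columns)
rate_by_fold <- tapply(dat$target, list(dat$time, dat$fold), mean)
print(round(rate_by_fold, 2))
```

Because the labels are assigned per stratum, the event rate at each time point is carried into every fold, and no row can be assigned twice.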

R - caret createDataPartition returns more samples than expected

I'm trying to split the iris dataset into a training set and a test set. I used createDataPartition() like this:
library(caret)
createDataPartition(iris$Species, p=0.1)
# [1] 12 22 26 41 42 57 63 79 89 93 114 117 134 137 142
createDataPartition(iris$Sepal.Length, p=0.1)
# [1] 1 27 44 46 54 68 72 77 83 84 93 99 104 109 117 132 134
I understand the first query. I have a vector of 0.1*150 elements (150 is the number of samples in the dataset). However, I should have the same vector on the second query but I am getting a vector of 17 elements instead of 15.
Any ideas as to why I get these results?
Sepal.Length is a numeric feature; from the online documentation:
For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument.
groups: for numeric y, the number of breaks in the quantiles
with default value:
groups = min(5, length(y))
Here is what happens in your case:
Since you do not specify groups, it takes a value of min(5, 150) = 5 breaks; in that case, these breaks coincide with the quartiles, i.e. the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum, which you can see from the summary:
> summary(iris$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
For numeric features, the function will take a percentage of p = 0.1 from each one of the (4) intervals defined by the above breaks (quantiles); let's see how many samples we have per such interval:
l1 = length(which(iris$Sepal.Length >= 4.3 & iris$Sepal.Length <= 5.1)) # 41
l2 = length(which(iris$Sepal.Length > 5.1 & iris$Sepal.Length <= 5.8)) # 39
l3 = length(which(iris$Sepal.Length > 5.8 & iris$Sepal.Length <= 6.4)) # 35
l4 = length(which(iris$Sepal.Length > 6.4 & iris$Sepal.Length <= 7.9)) # 35
Exactly how many samples will be returned from each interval? Here is the catch - according to line # 140 of the source code, it will be the ceiling of the product between the no. of samples and your p; let's see what this should be in your case for p = 0.1:
p = 0.1
ceiling(l1*p) + ceiling(l2*p) + ceiling(l3*p) + ceiling(l4*p)
# 17
Bingo! :)
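The same count can be reproduced in a few lines of base R. This sketch only mirrors the reasoning above (quartile breaks, then a ceiling per interval); it is not caret's actual source, and the names y, breaks, bins and counts are illustrative.

```r
p <- 0.1
y <- iris$Sepal.Length

# Five breaks: minimum, quartiles, maximum
breaks <- quantile(y, probs = seq(0, 1, length.out = 5))

# Four intervals, the first one closed on the left so the minimum is included
bins <- cut(y, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
print(counts)

# Round up within each interval, then add
total <- sum(ceiling(counts * p))
print(total)
```

Rounding up once on the whole vector, ceiling(150 * 0.1), gives 15; rounding up separately in each interval is what produces the extra samples.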

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it, i.e.:
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. The tidyr package has analogous functions (gather(), or pivot_longer() in newer versions).
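If stacking is to be avoided altogether, one option is to apply the statistics column by column in base R. A minimal sketch, with the data retyped from the question; DF and stats are illustrative names.

```r
DF <- data.frame(
  Selidor.Bay  = c(39, 70, 13, 0, 43, 0),
  Enlades.Bay  = c(29, 370, 44, 65, 110, 30),
  Cumphrey.Bay = c(187, 50, 52, 20, 220, 266)
)

# One row of statistics per site, computed directly on the wide data
stats <- t(sapply(DF, function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    se = sd(x) / sqrt(length(x)), sum = sum(x))
}))
print(stats)
```

sapply treats each column of the data frame as one site, so no reshaping step is needed.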
