Raking Weights on Nested Data: R Output Doesn't Match Stata Output - r

Introduction
I have multilevel survey data of teachers nested in schools. I have manually calculated design weights and non-response adjustment weights based on selection probability and response rate (oldwt below). Now I want to create post-stratification weights by raking on two marginals: the teacher's sex (male or female) and employment status (full-time or part-time). With the help of kind people at Statalist (see here), I have seemingly done this successfully in Stata. However, when I try to replicate the results in R, I get vastly different output.
Sample Data
#Variables
#school : unique school id
#caseid : unique teacher id
#oldwt : the product of the design weight and the non-response adjustment
#gender : male or female
#timecat : employment status (full-time or part-time)
#scgender : a combined factor variable of school x gender
#sctime : a combined factor variable of school x timecat
#genderp : the school's true population proportion for gender
#fullp : the school's true population proportion for timecat
#Sample Data
foo <- structure(list(caseid = 1:11, school = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), oldwt = c(1.8, 1.8, 1.8, 1.8, 1.8, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3), gender = structure(c(2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"), timecat = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Full-time", "Part-time"), class = "factor"), scgender = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("1.Female", "1.Male", "2.Female", "2.Male"), class = "factor"), sctime = structure(c(2L, 2L, 1L, 1L, 1L, 4L, 4L, 3L, 3L, 3L, 3L), .Label = c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time"), class = "factor"), genderp = c(0.444, 0.556, 0.556, 0.444, 0.444, 0.25, 0.75, 0.75, 0.25, 0.75, 0.75), fullp = c(0.222, 0.222, 0.778, 0.778, 0.778, 0.375, 0.375, 0.625, 0.625, 0.625, 0.625)), .Names = c("caseid", "school", "oldwt", "gender", "timecat", "scgender", "sctime", "genderp", "fullp"), class = "data.frame", row.names = c(NA, -11L))
Raking Code
(See here and here for in-depth examples of using anesrake in R).
# extract true population proportions into a vector
genderp <- c(aggregate(foo$genderp, by=list(foo$scgender), FUN=max))
fullp <- c(aggregate(foo$fullp, by=list(foo$sctime), FUN=max))
genderp <- as.vector(genderp$x)
fullp <- as.vector(fullp$x)
# align the levels/labels of the population total with the variables
names(genderp) <- c("1.Female", "1.Male", "2.Female", "2.Male")
names(fullp) <- c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time")
# create target list of true population proportions for variables
targets <- list(genderp, fullp)
names(targets) <- c("scgender", "sctime")
# rake
library(anesrake)
outsave <- anesrake(targets, foo, caseid = foo$caseid, weightvec = foo$oldwt, verbose = F, choosemethod = "total", type = "nolim", nlim = 2, force1 = FALSE)
outsave
Comparison with Stata Output
The issue is that the output from R doesn't match the output from Stata (even if I set force1 = TRUE), and the Stata output seems to be the correct one, which makes me think my sloppy R code is wrong. Is that the case?
caseid R Stata
1 0.070 0.633
2 0.152 1.367
3 0.404 3.633
4 0.187 1.683
5 0.187 1.683
6 0.143 1.146
7 0.232 1.854
8 0.173 1.382
9 0.107 0.854
10 0.173 1.382
11 0.173 1.382

Each target distribution you pass to anesrake should sum to one and represent the distribution in your population. Yours sum to two, because every school contributes proportions that sum to one on their own. I think the force1 option will not compute the distribution you want unless each school has the same population weight. This is what force1 is doing:
targets[[1]]/sum(targets[[1]])
1.Female 1.Male 2.Female 2.Male
0.278 0.222 0.125 0.375
Is that what you want?
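As a minimal sketch of the point above, the within-school proportions can be rescaled by each school's share of the total teacher population so that the target vector sums to one. The `school_share` values here are hypothetical (the question does not give school population sizes); with equal shares the result is the same as the force1 normalisation shown above.

```r
# Within-school gender proportions from the question (one block per school)
genderp <- c("1.Female" = 0.556, "1.Male" = 0.444,
             "2.Female" = 0.25,  "2.Male" = 0.75)
sum(genderp)  # each school sums to one, so two schools sum to two

# HYPOTHETICAL: each school's share of the total teacher population
school_share <- c("1" = 0.5, "2" = 0.5)

# Recover the school id from the "school.level" names
school_of <- sub("\\..*$", "", names(genderp))

# Population-level target: within-school proportion times school share
gender_target <- genderp * unname(school_share[school_of])
sum(gender_target)
```

If the schools differ in size, replacing `school_share` with the real shares would give targets that differ from what force1 produces.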

Related

Is it possible to specify a continuous AR1 correlation structure in a GEE model using R?

I've got repeated measurements data in which patients are measured an irregular amount of times (2 through 6 times per patient) and also with unequally spaced time intervals (some subsequent measures are 6 months apart, some 3 years). Is it possible to model this in a GEE model? For example by specifying a continuous AR1 correlation structure?
I've got some example data:
library(tidyverse)
library(magrittr)
library(geepack)
library(broom)
example_data <- structure(list(pat_id = c(2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4,
4, 7, 7, 8, 8, 8, 13, 13), measurement_number = c(1L, 2L, 3L,
4L, 5L, 6L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 1L, 2L, 3L, 1L,
2L), time = c(0, 0.545, 2.168, 2.68, 3.184, 5.695, 0, 1.892,
0, 0.939, 1.451, 1.955, 4.353, 0, 4.449, 0, 0.465, 4.005, 0,
0.364), age_standardized = c(-0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -1.76464003778333, -1.76464003778333, -0.667610044472762,
-0.667610044472762, -0.667610044472762, -0.667610044472762, -0.667610044472762,
0.142696200586183, 0.142696200586183, 0.00556745142236116, 0.00556745142236116,
0.00556745142236116, 0.0554324511182961, 0.0554324511182961),
sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Female",
"Male"), class = "factor"), outcome = c(4241.943359375, 4456.4,
6533.673242397, 7255.561628906, 7594.527875667, 6416.4, 373.782029756049,
614.318359374, 6675.19041238403, 10623.94276368, 10849.01013281,
10627.30859375, 13213, 541.40780090332, 2849.5551411438,
2136.2, 2098.1, 2063.9, 5753.56313232422, 5108.199752386)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L))
head(example_data)
# A tibble: 6 x 6
  pat_id measurement_number  time age_standardized sex    outcome
   <dbl>              <int> <dbl>            <dbl> <fct>    <dbl>
1      2                  1 0              -0.0942 Female   4242.
2      2                  2 0.545          -0.0942 Female   4456.
3      2                  3 2.17           -0.0942 Female   6534.
4      2                  4 2.68           -0.0942 Female   7256.
5      2                  5 3.18           -0.0942 Female   7595.
6      2                  6 5.70           -0.0942 Female   6416.
I have actually also modelled these data with a linear mixed model (using nlme and specifying a continuous AR1), but my supervisor asked me to also explore using a GEE, which is why I ask.
I've read that, using the geepack package, it is possible to define the correlation structure yourself, but I can't code that well to see if it is possible to define the structure so that rho is adjusted for the time interval in between measurements (by making it rho^s where s is the number of time units).
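To make the structure being asked about concrete, here is a minimal base R sketch of a continuous AR1 correlation matrix for one patient, where the correlation between two measurements is rho^s and s is the time between them. The value of `rho` and the helper name `car1_matrix` are assumptions for illustration only; this does not show how to fit such a structure in geepack.

```r
# Measurement times for pat_id 2 in example_data
times <- c(0, 0.545, 2.168, 2.68, 3.184, 5.695)

# ASSUMED value, for illustration only; in a model rho would be estimated
rho <- 0.8

# Hypothetical helper: correlation rho^|t_i - t_j| for every pair of times
car1_matrix <- function(times, rho) {
  lag <- abs(outer(times, times, "-"))  # pairwise time gaps
  rho^lag
}

R <- car1_matrix(times, rho)
round(R, 3)
```

Measurements close together in time get a correlation near one, and the correlation decays as the gap grows, which is the behaviour a fixed-lag AR1 cannot represent with unequally spaced visits.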

how to plot a graph means over time?

I have 6 treatment groups (Control, pH7, pH8, pH9, pH10, pH11) from 6 different samples (1, 2, 3, 4, 5, 6) and one measured variable, od:
'data.frame': 288 obs. of 3 variables:
$ od : num 0.086 0.086 0.085 0.086 0.093 0.087 0.087 0.087 0.089 0.094 ...
$ sample: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ medium: Factor w/ 6 levels "Control","pH10",..: 1 1 1 1 1 1 1 1 4 4 ...
I would like the x axis to represent all the treatment groups and the dots on the graph to be the mean od of each sample.
How could I produce a graph similar to this one:
Sample data.
structure(list(od = c(0.086, 0.086, 0.085, 0.086, 0.093, 0.087, 0.087, 0.087, 0.089, 0.094, 0.087, 0.088, 0.09, 0.088, 0.087, 0.088, 0.086, 0.087, 0.095, 0.096), sample = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L ), medium = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), .Label = c("Control", "pH10", "pH11", "pH7", "pH8", "pH9"), class = "factor")), row.names = c(NA, 20L), class = "data.frame")
OK, this is probably not quite right yet, but I'll post it as an answer to be edited based on more info (it's a little tricky without the complete data and with just a photo!).
Looks like (in the photo) you have a data point per 'sample' that must be the mean of the 'od' for that 'medium' (treatment) for that 'sample' (day?)?
So I summarised the data to give me the means of each treatment group for each sample day:
# if you do not have the following packages, run install.packages("tidyverse")
library(ggplot2)
library(tidyverse)
### REPLACE 'df' below with your dataframe name
df_summary <- df %>%
  group_by(sample, medium) %>%
  summarise(od_mean = mean(od))
head(df_summary)
Now I plot that with od_mean as the value on y, sample on x, and separate the groups by medium with colour = medium:
plot <- ggplot(df_summary, aes(x = sample, y = od_mean, colour = medium)) +
  geom_point() +              # add points to the plot
  geom_smooth() +             # add a smoother
  xlab("Day") +               # change axis labels
  ylab("Bacterial Density")
plot
Let me know if this is approaching what you need.
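Since the question asks for the treatment groups on the x axis, here is a small alternative sketch with medium on x and one dot per sample mean. It rebuilds the 20-row sample from the question (so only sample 1 and three media appear) and uses base R `aggregate` for the means; the object names `od_data` and `od_means` are just illustrative.

```r
# The 20 sample rows from the question: 8 Control, 8 pH7, 4 pH8, all sample 1
od_data <- data.frame(
  od = c(0.086, 0.086, 0.085, 0.086, 0.093, 0.087, 0.087, 0.087,
         0.089, 0.094, 0.087, 0.088, 0.09, 0.088, 0.087, 0.088,
         0.086, 0.087, 0.095, 0.096),
  sample = factor(rep(1L, 20)),
  medium = factor(rep(c("Control", "pH7", "pH8"), times = c(8, 8, 4)))
)

# One mean od per sample and medium
od_means <- aggregate(od ~ sample + medium, data = od_data, FUN = mean)
od_means

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(od_means, aes(x = medium, y = od, colour = sample)) +
    geom_point() +
    xlab("Medium") +
    ylab("Mean od")
  print(p)
}
```

With the full 288-row data each medium would show six dots, one per sample.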

Subset using minimum number of values positioned around maximum value

Situation: I have a list of data sets, collected by different loggers, like so:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "logger1", class = "factor"), OriginalTraitValue = c(0.37968,
0.455131, 0.606376, 0.910194, 1.19499, 1.55612, 1.91735, 2.35493,
2.60147, 2.42803, 1.66277, 1.12656, 0.628537), Temp = c(11.7334,
14.627, 19.3428, 24.5959, 29.6344, 34.7809, 39.606, 44.5389,
49.7914, 54.8254, 59.6391, 64.6695, 69.7002)), class = "data.frame", row.names = c(NA,
-13L))
Task: I only want to keep data sets that have a minimum of two recorded Temp values, before and after max(OriginalTraitValue).
I hope this plot makes it clearer. Red = maximum value, Green = values required to keep a data set.
Question
How do I do this in R, e.g. using dplyr?
I have managed to identify the Temp value corresponding to max(OriginalTraitValue) using df$Temp[df$OriginalTraitValue == max(df$OriginalTraitValue)], but I'm struggling with the necessary position arguments to filter the data sets.
EDIT
The example above represents a data set I would like to keep. The full data set looks something like this:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), .Label = c("logger1", "logger2", "logger3"
), class = "factor"), OriginalTraitValue = c(3.36e-11, 3.68e-11,
5.12e-11, 6.24e-11, 6.72e-11, 8.64e-11, 1.04e-10, 1.1e-10, 1.18e-10,
90.34189, 86.332214, 108.00114, 111.190155, 114.34427, 135.1673,
139.18198, 142.76979, 145.09233, 0.002, 0.06, 0.07, 0.15, 0.17,
0.17, 0.18, 0.18, 0.15, 0.07, 0.09), Temp = c(16, 18, 20, 22,
24, 26, 28, 30, 32, 16.726307, 17.376368, 20.193129, 25.06135,
25.060663, 29.875113, 29.924177, 30.422773, 34.417274, 10, 12.5,
15, 18, 20, 22.5, 25, 27.5, 30, 32.5, 35)), class = "data.frame", row.names = c(NA,
-29L))
> summary(df)
ID OriginalTraitValue Temp
logger1: 9 Min. : 0.00 Min. :10.00
logger2: 9 1st Qu.: 0.00 1st Qu.:18.00
logger3:11 Median : 0.15 Median :25.00
Mean : 37.02 Mean :23.90
3rd Qu.: 90.34 3rd Qu.:29.92
Max. :145.09 Max. :35.00
In this data set, I would only keep ID as logger3, since only logger3 contains at least 2 values before and after max(OriginalTraitValue).
Try:
library(dplyr)
df %>%
group_by(ID) %>%
slice(which.max(OriginalTraitValue) + -2:2) %>%
filter(n() == 5)
Output:
# A tibble: 5 x 3
# Groups: ID [1]
ID OriginalTraitValue Temp
<fct> <dbl> <dbl>
1 logger1 1.92 39.6
2 logger1 2.35 44.5
3 logger1 2.60 49.8
4 logger1 2.43 54.8
5 logger1 1.66 59.6
If you'd like to keep the whole group rather than just the 5 observations in question, you could also do something like:
df %>%
group_by(ID) %>%
filter(any(cumsum(row_number() %in% c(which.max(OriginalTraitValue) + -2:2)) == 5))
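You can keep the rows within two positions of the maximum with dplyr filter (note that, on its own, this does not drop groups that have fewer than two values on either side of the maximum):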
df %>%
group_by(ID) %>%
filter(abs(which(OriginalTraitValue == max(OriginalTraitValue)) - row_number()) <= 2)
ID OriginalTraitValue Temp
<fct> <dbl> <dbl>
1 logger1 1.92 39.6
2 logger1 2.35 44.5
3 logger1 2.60 49.8
4 logger1 2.43 54.8
5 logger1 1.66 59.6
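For comparison, here is a base R sketch that checks the position of the maximum directly, which avoids relying on out-of-range row indices when the maximum sits at the start or end of a group. It assumes rows are already ordered by Temp within each ID; the helper `has_flanks` and the toy data are made up for illustration.

```r
# Hypothetical helper: TRUE if at least k values lie on each side of the maximum
has_flanks <- function(x, k = 2) {
  pos <- which.max(x)  # first position of the maximum
  pos > k && (length(x) - pos) >= k
}

# Toy data: "a" peaks in the middle, "b" peaks at the end
toy <- data.frame(
  ID = rep(c("a", "b"), each = 5),
  OriginalTraitValue = c(1, 2, 5, 3, 1,   1, 2, 3, 4, 5),
  Temp = rep(seq(10, 30, by = 5), times = 2)
)

# Evaluate the check per ID and keep only the groups that pass
keep <- tapply(toy$OriginalTraitValue, toy$ID, has_flanks)
kept <- toy[toy$ID %in% names(keep)[keep], ]
kept
```

This keeps every row of a qualifying group; subsetting to the five rows around the peak could then be done afterwards if needed.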

Group Rows Based On Column Value And Keep Row With Minimum Value In R

In the data set below, I first want to find the rows that share the same values in columns U and D. Then, within each such group of rows, I want to keep the row that has the minimum value for the columns Mean, Min and Max. In my data, these three columns always reach their minimum in the same row of a group.
I tried the group() function, but it hasn't produced the output I want. Please suggest an efficient approach.
Input Data
data <- structure(list(A = c(0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18,
0.18, NA, NA, NA, NA, NA, NA), B = c(0.33, 0.33, 0.33, 0.33,
0.33, 0.33, 0.33, 0.33, 1, 2, 2, 2, 3, 4), C = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Yes", class = "factor"),
U = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("ABC-001", "PQR-001"), class = "factor"),
D = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("ABC", "PQR"), class = "factor"),
E = structure(c(1L, 2L, 3L, 4L, 4L, 5L, 5L, 6L, 1L, 1L, 2L,
2L, 3L, 3L), .Label = c("A", "B", "C", "D", "E", "F"), class = "factor"),
F = c(22000014L, 22000031L, 22000033L, 22000025L, 22000028L,
22000020L, 22000021L, 22000015L, 11100076L, 11200076L, 11100077L,
11200077L, 11100078L, 11200078L), G = c(0, 0, 0, 0, 0, 0,
0, 0, -0.1, -0.1, -0.1, -0.1, 0.2, 0.2), H = c(100, 100,
100, 100, 100, 100, 100, 100, 1.2, 1.2, 1.2, 1.2, 0.9, 0.9
), I = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("us", "V"), class = "factor"),
Mean = c(38.72, 37.52111111, 38.44166667, 39.23666667, 39.35888889,
38.96, 38.95333333, 38.41777778, 0.691707061, 0.691554561,
0.691516833, 0.691423506, 0.763736, 0.764015761), Min = c(34.05,
33.25, 33.31, 35.14, 33.91, 33.78, 33.78, 33.75, 0.6911166,
0.6908743, 0.6908813, 0.6907286, 0.7609318, 0.7616949), Max = c(43.83,
42.12, 43.57, 44.03, 44.88, 44.03, 44.02, 43.52, 0.692533,
0.6922278, 0.6923681, 0.6919283, 0.7674736, 0.7668633)), class = "data.frame", row.names = c(NA,
-14L))
Expected Output
output <- read.table(header = TRUE, text = " A B C U D E F G H I Mean Min Max
0.18 0.33 Yes ABC-001 ABC B 22000031 0 100 us 37.52111111 33.25 42.12
NA 2 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.691423506 0.6907286 0.6919283
")
You can do this with order and duplicated, both from base R:
data = data[order(data$Mean),]
output = data[!duplicated(data[c("U","D")]),]
output
A B C U D E F G H I Mean Min Max
12 NA 2.00 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.6914235 0.6907286 0.6919283
2 0.18 0.33 Yes ABC-001 ABC B 22000031 0.0 100.0 us 37.5211111 33.2500000 42.1200000
If you want dplyr
library(dplyr)
data %>% group_by(U, D) %>% slice(which.min(Mean))
The cleanest way to do this would be with dplyr
library(dplyr)
data %>% group_by(U, D) %>% filter(Mean == min(Mean))
The output looks like this
A B C U D E F G H I Mean Min Max
<dbl> <dbl> <fct> <fct> <fct> <fct> <int> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 0.18 0.33 Yes ABC-001 ABC B 22000031 0 100 us 37.5 33.2 42.1
2 NA 2 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.691 0.691 0.692
Consider aggregating then joining back to original data. Below names() is used to re-order columns and merge omits by since all columns in aggregate resultset will be matched:
agg_df <- aggregate(cbind(Mean, Min, Max) ~ U + D, data, FUN=min)
merge(data, agg_df)[names(data)]
# A B C U D E F G H I Mean Min Max
# 1 0.18 0.33 Yes ABC-001 ABC B 22000031 0.0 100.0 us 37.5211111 33.2500000 42.1200000
# 2 NA 2.00 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.6914235 0.6907286 0.6919283
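The aggregate-and-merge approach only returns a row when the minima of Mean, Min and Max fall in the same row of a group, which the question says is always true for this data. A quick sketch of how that assumption could be checked, using made-up toy data and illustrative object names:

```r
# Toy data: in each U/D group the second row holds all three minima
toy <- data.frame(
  U    = c("a", "a", "b", "b"),
  D    = c("x", "x", "y", "y"),
  Mean = c(2.0, 1.0, 5.0, 4.0),
  Min  = c(1.5, 0.5, 4.5, 3.5),
  Max  = c(2.5, 1.5, 5.5, 4.5)
)

# For each group, do the three minima sit in the same row?
groups <- split(toy, interaction(toy$U, toy$D, drop = TRUE))
same_row <- sapply(groups, function(g) {
  which.min(g$Mean) == which.min(g$Min) &&
    which.min(g$Min) == which.min(g$Max)
})
same_row
all(same_row)
```

If any group returned FALSE, the merge would silently drop that group, whereas the order/duplicated and which.min(Mean) approaches would still pick the row with the smallest Mean.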

R: how to visualize the relationship between continuous and categorical data

I have the following data.frame which contains 3 categorical variables (different types of vascular pathology) and 1 continuous variable (Output). I'm interested in seeing the relationship between Output and the different types of vascular pathologies, i.e. is higher/lower output associated with mild/severe pathology?
> dput(df)
structure(list(Vascular_Pathology_M = structure(c(1L, 2L, 3L,
1L, 1L, 2L, 4L, 3L, 1L, 2L), .Label = c("Absent", "Mild", "Mild/Moderate",
"Moderate/Severe", "Severe"), class = "factor"), Vascular_Pathology_F = structure(c(4L,
2L, 1L, 1L, 1L, 1L, 2L, 4L, 1L, 1L), .Label = c("Absent", "Mild",
"Mild/Moderate", "Moderate/Severe", "Severe"), class = "factor"),
Vascular_Pathology_O = structure(c(1L, 3L, 4L, 3L, 1L, 2L,
1L, 1L, 1L, 2L), .Label = c("Absent", "Mild", "Mild/Moderate",
"Moderate/Severe"), class = "factor"), Output = c(1.01789418758932,
1.05627630598801, 1.49233946102323, 1.38192374975672, 1.13097652937671,
0.861306979571144, 0.707820561413699, 1.16628243128399, 0.983163398006992,
1.23972603843843)), .Names = c("Vascular_Pathology_M", "Vascular_Pathology_F",
"Vascular_Pathology_O", "Output"), row.names = c(1L, 3L, 4L,
5L, 6L, 7L, 8L, 10L, 11L, 12L), class = "data.frame")
> df
Vascular_Pathology_M Vascular_Pathology_F Vascular_Pathology_O Output
1 Absent Moderate/Severe Absent 1.0178942
3 Mild Mild Mild/Moderate 1.0562763
4 Mild/Moderate Absent Moderate/Severe 1.4923395
5 Absent Absent Mild/Moderate 1.3819237
6 Absent Absent Absent 1.1309765
7 Mild Absent Mild 0.8613070
8 Moderate/Severe Mild Absent 0.7078206
10 Mild/Moderate Moderate/Severe Absent 1.1662824
11 Absent Absent Absent 0.9831634
12 Mild Absent Mild 1.2397260
You could look at the interaction of the various pathologies. For example, with a barplot:
## Make the interaction variable
df$interact <- interaction(df[, 1:3], sep="_")
## Look at means of groups
library(dplyr)
library(ggplot2)
df %>% group_by(interact) %>%
  dplyr::summarise(Output = mean(Output)) -> means
ggplot(means, aes(interact, Output)) +
  geom_bar(stat="identity") +
  theme(axis.text=element_text(angle=90)) +
  xlab("Interaction")
or with points
ggplot(df, aes(interact, Output))+
geom_point() +
theme(axis.text=element_text(angle=45, hjust=1)) +
xlab("Interaction") +
geom_point(data=means, col="red") +
ylim(0, 1.6)
You can simply plot the output against each categorical variable; because the x variable is a factor, plot() draws a boxplot per level:
plot(df[, 1], df[, 4])
plot(df[, 2], df[, 4])
plot(df[, 3], df[, 4])
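To go with the boxplots, a numeric summary per level can help, since several levels have only one or two observations. This is a minimal base R sketch for one of the three variables; the Output values are rounded from the question's data and the object name `path_m` is just illustrative.

```r
# Vascular_Pathology_M and Output from the question (Output rounded to 3 decimals)
path_m <- data.frame(
  Vascular_Pathology_M = factor(c("Absent", "Mild", "Mild/Moderate", "Absent",
                                  "Absent", "Mild", "Moderate/Severe",
                                  "Mild/Moderate", "Absent", "Mild")),
  Output = c(1.018, 1.056, 1.492, 1.382, 1.131, 0.861, 0.708, 1.166, 0.983, 1.240)
)

# Number of observations and mean Output per pathology level
n_per_level    <- table(path_m$Vascular_Pathology_M)
mean_per_level <- tapply(path_m$Output, path_m$Vascular_Pathology_M, mean)

n_per_level
round(mean_per_level, 3)
```

With so few observations per level, the means are better read as a description of this sample than as evidence of a trend.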
You have a 4 dimensional dataset. One option is to do a scatter plot (x/y = two dimensions), in a small multiple series (there's one more dimension), and map the Output variable to something visual like size (there's a fourth dimension).
Example, after putting your data in a data.frame called my_dat (since df is already assigned to a function in R). Points are jittered to show the multiple observations per point, and colored by Y position to help make clear which point goes with which category.
library(ggplot2)
my_dat$O_with_labels <-
factor(my_dat[, 3], labels=paste('Vasc Path O:', levels(my_dat[, 3])))
ggplot(my_dat,
aes(x=Vascular_Pathology_M, y=Vascular_Pathology_F)) +
geom_jitter(aes(size=Output, color=Vascular_Pathology_F)) +
facet_wrap(~O_with_labels) +
theme_bw() +
theme(axis.text.x = element_text(angle=45, hjust=1))
