Introduction
I have multilevel survey data of teachers nested in schools. I have manually calculated design weights and non-response adjustment weights based on probability of selection and response rate (oldwt below). Now I want to create post-stratification weights by raking on two marginals: the sex (male or female) and the employment status (full-time or not full-time) of the teacher. With the help of kind people at Statalist (see here), I have seemingly done this successfully in Stata. However, in trying to replicate the results in R, I get vastly different output.
Sample Data
#Variables
#school : unique school id
#caseid : unique teacher id
#oldwt : the product of the design weight and the non-response adjustment
#gender : male or female
#timecat : employment status (full-time or part-time)
#scgender : a combined factor variable of school x gender
#sctime : a combined factor variable of school x timecat
#genderp : the school's true population proportion for gender
#fullp : the school's true population proportion for timecat
#Sample Data
foo <- structure(list(caseid = 1:11, school = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), oldwt = c(1.8, 1.8, 1.8, 1.8, 1.8, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3), gender = structure(c(2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"), timecat = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Full-time", "Part-time"), class = "factor"), scgender = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("1.Female", "1.Male", "2.Female", "2.Male"), class = "factor"), sctime = structure(c(2L, 2L, 1L, 1L, 1L, 4L, 4L, 3L, 3L, 3L, 3L), .Label = c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time"), class = "factor"), genderp = c(0.444, 0.556, 0.556, 0.444, 0.444, 0.25, 0.75, 0.75, 0.25, 0.75, 0.75), fullp = c(0.222, 0.222, 0.778, 0.778, 0.778, 0.375, 0.375, 0.625, 0.625, 0.625, 0.625)), .Names = c("caseid", "school", "oldwt", "gender", "timecat", "scgender", "sctime", "genderp", "fullp"), class = "data.frame", row.names = c(NA, -11L))
Raking Code
(See here and here for in-depth examples of using anesrake in R).
# extract true population proportions into a vector
genderp <- c(aggregate(foo$genderp, by=list(foo$scgender), FUN=max))
fullp <- c(aggregate(foo$fullp, by=list(foo$sctime), FUN=max))
genderp <- as.vector(genderp$x)
fullp <- as.vector(fullp$x)
# align the levels/labels of the population total with the variables
names(genderp) <- c("1.Female", "1.Male", "2.Female", "2.Male")
names(fullp) <- c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time")
# create target list of true population proportions for variables
targets <- list(genderp, fullp)
names(targets) <- c("scgender", "sctime")
# rake
library(anesrake)
outsave <- anesrake(targets, foo, caseid = foo$caseid, weightvec = foo$oldwt, verbose = FALSE, choosemethod = "total", type = "nolim", nlim = 2, force1 = FALSE)
outsave
Comparison with Stata Output
The issue is that the output from R doesn't match the output from Stata (even if I set force1 = TRUE), and the Stata output seems to be the correct one, which makes me think my sloppy R code is wrong. Is that the case?
caseid R Stata
1 0.070 0.633
2 0.152 1.367
3 0.404 3.633
4 0.187 1.683
5 0.187 1.683
6 0.143 1.146
7 0.232 1.854
8 0.173 1.382
9 0.107 0.854
10 0.173 1.382
11 0.173 1.382
The targets for each variable in R should sum to one and represent the distribution in your population. Look at my example. I think the force1 option will not give you the distribution you want unless each school has the same population weight. This is what force1 is doing:
targets[[1]]/sum(targets[[1]])
1.Female 1.Male 2.Female 2.Male
0.278 0.222 0.125 0.375
Is that what you want?
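To make the rescaling concrete, here is a minimal sketch in base R. The target values are the within-school gender proportions from the sample data. The school sizes in school_n are hypothetical (they are not given in the question) and only illustrate how the targets could instead be weighted by each school's share of the population.

```r
# Within-school gender proportions taken from the sample data
genderp <- c("1.Female" = 0.556, "1.Male" = 0.444,
             "2.Female" = 0.25,  "2.Male" = 0.75)

# Dividing by the sum makes the vector sum to one, which implicitly
# gives every school the same share of the population
equal_share <- genderp / sum(genderp)
round(equal_share, 3)

# Hypothetical school population sizes (assumption, not in the original data)
school_n <- c("1" = 9, "2" = 8)
school_share <- school_n / sum(school_n)

# Weight each within-school proportion by its school's population share
school_of <- sub("\\..*$", "", names(genderp))   # "1", "1", "2", "2"
pop_share <- genderp * school_share[school_of]
round(pop_share, 3)
sum(pop_share)
```

If equal_share and pop_share differ noticeably for your real school sizes, rescaling by the plain sum is probably not the target distribution you are after.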
Related
I've got repeated measurements data in which patients are measured an irregular amount of times (2 through 6 times per patient) and also with unequally spaced time intervals (some subsequent measures are 6 months apart, some 3 years). Is it possible to model this in a GEE model? For example by specifying a continuous AR1 correlation structure?
I've got some example data:
library(tidyverse)
library(magrittr)
library(geepack)
library(broom)
example_data <- structure(list(pat_id = c(2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4,
4, 7, 7, 8, 8, 8, 13, 13), measurement_number = c(1L, 2L, 3L,
4L, 5L, 6L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 1L, 2L, 3L, 1L,
2L), time = c(0, 0.545, 2.168, 2.68, 3.184, 5.695, 0, 1.892,
0, 0.939, 1.451, 1.955, 4.353, 0, 4.449, 0, 0.465, 4.005, 0,
0.364), age_standardized = c(-0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -1.76464003778333, -1.76464003778333, -0.667610044472762,
-0.667610044472762, -0.667610044472762, -0.667610044472762, -0.667610044472762,
0.142696200586183, 0.142696200586183, 0.00556745142236116, 0.00556745142236116,
0.00556745142236116, 0.0554324511182961, 0.0554324511182961),
sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Female",
"Male"), class = "factor"), outcome = c(4241.943359375, 4456.4,
6533.673242397, 7255.561628906, 7594.527875667, 6416.4, 373.782029756049,
614.318359374, 6675.19041238403, 10623.94276368, 10849.01013281,
10627.30859375, 13213, 541.40780090332, 2849.5551411438,
2136.2, 2098.1, 2063.9, 5753.56313232422, 5108.199752386)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L))
head(example_data)
# A tibble: 6 x 6
pat_id measurement_number time age_standardized sex outcome
<dbl> <int> <dbl> <dbl> <fct> <dbl>
1 2 1 0 -0.0942 Female 4242.
2 2 2 0.545 -0.0942 Female 4456.
3 2 3 2.17 -0.0942 Female 6534.
4 2 4 2.68 -0.0942 Female 7256.
5 2 5 3.18 -0.0942 Female 7595.
6 2 6 5.70 -0.0942 Female 6416.
I have actually also modelled these data with a linear mixed model (using nlme, specifying a continuous AR1), but my supervisor asked me to also explore using a GEE, which is why I ask.
I've read that, using the geepack package, it is possible to define the correlation structure yourself, but I can't code that well to see if it is possible to define the structure so that rho is adjusted for the time interval in between measurements (by making it rho^s where s is the number of time units).
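To illustrate the rho^s structure I have in mind, this base R sketch builds the implied working correlation matrix for one patient from the observed measurement times. The value of rho is an arbitrary placeholder, and the sketch only constructs the matrix; it does not show whether or how geepack can be made to use it.

```r
# Measurement times (in years) for patient 2 from the example data
time <- c(0, 0.545, 2.168, 2.68, 3.184, 5.695)

# Placeholder correlation between measurements one time unit apart (assumption)
rho <- 0.8

# Continuous AR1: correlation between measurements j and k is rho^|t_j - t_k|
lag <- abs(outer(time, time, "-"))
R <- rho^lag
round(R, 3)
```

Each patient gets a matrix of a different size, because both the number of measurements and the gaps between them vary.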
I have 6 treatment groups (Control, pH7, pH8, pH9, pH10, pH11) from 6 different samples (1, 2, 3, 4, 5, 6) & measure variable - od
'data.frame': 288 obs. of 3 variables:
$ od : num 0.086 0.086 0.085 0.086 0.093 0.087 0.087 0.087 0.089 0.094 ...
$ sample: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ medium: Factor w/ 6 levels "Control","pH10",..: 1 1 1 1 1 1 1 1 4 4 ...
I would like x axis to represent all the treatment groups & the dots on the graph pH means of each sample
How could I produce a graph similar to this one:
Sample data.
structure(list(od = c(0.086, 0.086, 0.085, 0.086, 0.093, 0.087, 0.087, 0.087, 0.089, 0.094, 0.087, 0.088, 0.09, 0.088, 0.087, 0.088, 0.086, 0.087, 0.095, 0.096), sample = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L ), medium = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), .Label = c("Control", "pH10", "pH11", "pH7", "pH8", "pH9"), class = "factor")), row.names = c(NA, 20L), class = "data.frame")
OK, this is probably not quite right yet, but I'll post it as an answer to be edited based on more info (a little tricky without the complete data and just a photo!).
Looks like (in the photo) you have a data point per 'sample' that must be the mean of the 'od' for that 'medium' (treatment) for that 'sample' (day?)?
So I summarised the data to give me the means of each treatment group for each sample day:
#if you do not have the following packages then install.packages("tidyverse")
library(ggplot2)
library(tidyverse)
### REPLACE 'df' below with your dataframe name
df_summary <- df %>%
group_by(sample, medium) %>%
summarise(od_mean = mean(od))
head(df_summary)
Now I plot that with od_mean as the value on y, sample on x, and separate the groups by medium with colour = medium:
plot <- ggplot(df_summary, aes(x = sample, y = od_mean, colour = medium))+
geom_point()+ #add points to the plot
geom_smooth()+ #add a smoother
xlab("Day")+ #change axis labels
ylab("Bacterial Density")
plot
Let me know if this is approaching what you need.
Situation: I have a list of data sets, collected by different loggers, like so:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "logger1", class = "factor"), OriginalTraitValue = c(0.37968,
0.455131, 0.606376, 0.910194, 1.19499, 1.55612, 1.91735, 2.35493,
2.60147, 2.42803, 1.66277, 1.12656, 0.628537), Temp = c(11.7334,
14.627, 19.3428, 24.5959, 29.6344, 34.7809, 39.606, 44.5389,
49.7914, 54.8254, 59.6391, 64.6695, 69.7002)), class = "data.frame", row.names = c(NA,
-13L))
Task: I only want to keep data sets that have a minimum of two recorded Temp values, before and after max(OriginalTraitValue).
I hope this plot makes it clearer. Red = maximum value, Green = values required to keep a data set.
Question
How do I do this in R, e.g. using dplyr?
I have managed to identify the Temp value corresponding to max(OriginalTraitValue) using df$Temp[df$OriginalTraitValue == max(df$OriginalTraitValue)], but I'm struggling with the necessary position arguments to filter the data sets.
EDIT
The example above represents a data set I would like to keep. The full data set looks something like this:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), .Label = c("logger1", "logger2", "logger3"
), class = "factor"), OriginalTraitValue = c(3.36e-11, 3.68e-11,
5.12e-11, 6.24e-11, 6.72e-11, 8.64e-11, 1.04e-10, 1.1e-10, 1.18e-10,
90.34189, 86.332214, 108.00114, 111.190155, 114.34427, 135.1673,
139.18198, 142.76979, 145.09233, 0.002, 0.06, 0.07, 0.15, 0.17,
0.17, 0.18, 0.18, 0.15, 0.07, 0.09), Temp = c(16, 18, 20, 22,
24, 26, 28, 30, 32, 16.726307, 17.376368, 20.193129, 25.06135,
25.060663, 29.875113, 29.924177, 30.422773, 34.417274, 10, 12.5,
15, 18, 20, 22.5, 25, 27.5, 30, 32.5, 35)), class = "data.frame", row.names = c(NA,
-29L))
> summary(df)
ID OriginalTraitValue Temp
logger1: 9 Min. : 0.00 Min. :10.00
logger2: 9 1st Qu.: 0.00 1st Qu.:18.00
logger3:11 Median : 0.15 Median :25.00
Mean : 37.02 Mean :23.90
3rd Qu.: 90.34 3rd Qu.:29.92
Max. :145.09 Max. :35.00
In this data set, I would only keep ID as logger3, since only logger3 contains at least 2 values before and after max(OriginalTraitValue).
Try:
library(dplyr)
df %>%
group_by(ID) %>%
slice(which.max(OriginalTraitValue) + -2:2) %>%
filter(n() == 5)
Output:
# A tibble: 5 x 3
# Groups: ID [1]
ID OriginalTraitValue Temp
<fct> <dbl> <dbl>
1 logger1 1.92 39.6
2 logger1 2.35 44.5
3 logger1 2.60 49.8
4 logger1 2.43 54.8
5 logger1 1.66 59.6
If you'd like to keep the whole group and not just the 5 observations in question, you could also do something like:
df %>%
group_by(ID) %>%
filter(any(cumsum(row_number() %in% c(which.max(OriginalTraitValue) + -2:2)) == 5))
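For comparison, here is a minimal base R sketch of the whole-group filter that checks the position of the maximum explicitly, so a maximum in the first or last rows of a group simply fails the check. The helper has_flanks and the toy data are made up for illustration, and rows are assumed to be ordered by Temp within each ID.

```r
# Toy data (illustrative values): only ID "a" has two rows on each side of its maximum
toy <- data.frame(
  ID = rep(c("a", "b", "c"), times = c(5, 3, 4)),
  OriginalTraitValue = c(1, 2, 3, 2, 1,   1, 2, 3,   3, 2, 1, 0.5),
  Temp = c(10, 15, 20, 25, 30,   10, 15, 20,   10, 15, 20, 25)
)

# TRUE if at least k rows lie before and after the (first) maximum of x
has_flanks <- function(x, k = 2) {
  i <- which.max(x)
  (i - 1) >= k && (length(x) - i) >= k
}

# Evaluate the check once per ID and keep the IDs that pass
ok <- tapply(toy$OriginalTraitValue, toy$ID, has_flanks)
kept <- toy[toy$ID %in% names(ok)[ok], ]
kept
```

Changing k adjusts how many values are required on each side of the maximum.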
You can achieve that with dplyr filter
df %>%
group_by(ID) %>%
filter(abs(which.max(OriginalTraitValue) - row_number()) <= 2) # which.max() returns a single position even when the maximum is tied
ID OriginalTraitValue Temp
<fct> <dbl> <dbl>
1 logger1 1.92 39.6
2 logger1 2.35 44.5
3 logger1 2.60 49.8
4 logger1 2.43 54.8
5 logger1 1.66 59.6
In the data set below, I first want to find the rows that share the same values in columns U and D. Then, within each set of rows with the same U and D, I want to keep the row that has the minimum value for the columns Mean, Min and Max. In my data, these three columns always reach their minimum in the same row within each group of rows where U and D match.
I tried the group() function, but it did not give the output I want. Please suggest an efficient approach.
Input Data
data <- structure(list(A = c(0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18,
0.18, NA, NA, NA, NA, NA, NA), B = c(0.33, 0.33, 0.33, 0.33,
0.33, 0.33, 0.33, 0.33, 1, 2, 2, 2, 3, 4), C = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Yes", class = "factor"),
U = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("ABC-001", "PQR-001"), class = "factor"),
D = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("ABC", "PQR"), class = "factor"),
E = structure(c(1L, 2L, 3L, 4L, 4L, 5L, 5L, 6L, 1L, 1L, 2L,
2L, 3L, 3L), .Label = c("A", "B", "C", "D", "E", "F"), class = "factor"),
F = c(22000014L, 22000031L, 22000033L, 22000025L, 22000028L,
22000020L, 22000021L, 22000015L, 11100076L, 11200076L, 11100077L,
11200077L, 11100078L, 11200078L), G = c(0, 0, 0, 0, 0, 0,
0, 0, -0.1, -0.1, -0.1, -0.1, 0.2, 0.2), H = c(100, 100,
100, 100, 100, 100, 100, 100, 1.2, 1.2, 1.2, 1.2, 0.9, 0.9
), I = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("us", "V"), class = "factor"),
Mean = c(38.72, 37.52111111, 38.44166667, 39.23666667, 39.35888889,
38.96, 38.95333333, 38.41777778, 0.691707061, 0.691554561,
0.691516833, 0.691423506, 0.763736, 0.764015761), Min = c(34.05,
33.25, 33.31, 35.14, 33.91, 33.78, 33.78, 33.75, 0.6911166,
0.6908743, 0.6908813, 0.6907286, 0.7609318, 0.7616949), Max = c(43.83,
42.12, 43.57, 44.03, 44.88, 44.03, 44.02, 43.52, 0.692533,
0.6922278, 0.6923681, 0.6919283, 0.7674736, 0.7668633)), class = "data.frame", row.names = c(NA,
-14L))
Expected Output
output <- read.table(header = TRUE, text = " A B C U D E F G H I Mean Min Max
0.18 0.33 Yes ABC-001 ABC B 22000031 0 100 us 37.52111111 33.25 42.12
NA 2 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.691423506 0.6907286 0.6919283
")
You can do this with order and duplicated, both from base R:
data = data[order(data$Mean),]
output = data[!duplicated(data[c("U","D")]),]
output
A B C U D E F G H I Mean Min Max
12 NA 2.00 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.6914235 0.6907286 0.6919283
2 0.18 0.33 Yes ABC-001 ABC B 22000031 0.0 100.0 us 37.5211111 33.2500000 42.1200000
If you want dplyr
library(dplyr)
data %>% group_by(U, D) %>% slice(which.min(Mean))
The cleanest way to do this would be with dplyr
library(dplyr)
data %>% group_by(U, D) %>% filter(Mean == min(Mean))
The output looks like this
A B C U D E F G H I Mean Min Max
<dbl> <dbl> <fct> <fct> <fct> <fct> <int> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 0.18 0.33 Yes ABC-001 ABC B 22000031 0 100 us 37.5 33.2 42.1
2 NA 2 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.691 0.691 0.692
Consider aggregating, then joining back to the original data. Below, names() is used to restore the original column order, and merge() omits the by argument because every column in the aggregated result is used for matching:
agg_df <- aggregate(cbind(Mean, Min, Max) ~ U + D, data, FUN=min)
merge(data, agg_df)[names(data)]
# A B C U D E F G H I Mean Min Max
# 1 0.18 0.33 Yes ABC-001 ABC B 22000031 0.0 100.0 us 37.5211111 33.2500000 42.1200000
# 2 NA 2.00 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.6914235 0.6907286 0.6919283
I have the following data.frame which contains 3 categorical variables (different types of vascular pathology) and 1 continuous variable (Output). I'm interested in seeing the relationship between Output and the different types of vascular pathologies, i.e. is higher/lower output associated with mild/severe pathology?
> dput(df)
structure(list(Vascular_Pathology_M = structure(c(1L, 2L, 3L,
1L, 1L, 2L, 4L, 3L, 1L, 2L), .Label = c("Absent", "Mild", "Mild/Moderate",
"Moderate/Severe", "Severe"), class = "factor"), Vascular_Pathology_F = structure(c(4L,
2L, 1L, 1L, 1L, 1L, 2L, 4L, 1L, 1L), .Label = c("Absent", "Mild",
"Mild/Moderate", "Moderate/Severe", "Severe"), class = "factor"),
Vascular_Pathology_O = structure(c(1L, 3L, 4L, 3L, 1L, 2L,
1L, 1L, 1L, 2L), .Label = c("Absent", "Mild", "Mild/Moderate",
"Moderate/Severe"), class = "factor"), Output = c(1.01789418758932,
1.05627630598801, 1.49233946102323, 1.38192374975672, 1.13097652937671,
0.861306979571144, 0.707820561413699, 1.16628243128399, 0.983163398006992,
1.23972603843843)), .Names = c("Vascular_Pathology_M", "Vascular_Pathology_F",
"Vascular_Pathology_O", "Output"), row.names = c(1L, 3L, 4L,
5L, 6L, 7L, 8L, 10L, 11L, 12L), class = "data.frame")
> df
Vascular_Pathology_M Vascular_Pathology_F Vascular_Pathology_O Output
1 Absent Moderate/Severe Absent 1.0178942
3 Mild Mild Mild/Moderate 1.0562763
4 Mild/Moderate Absent Moderate/Severe 1.4923395
5 Absent Absent Mild/Moderate 1.3819237
6 Absent Absent Absent 1.1309765
7 Mild Absent Mild 0.8613070
8 Moderate/Severe Mild Absent 0.7078206
10 Mild/Moderate Moderate/Severe Absent 1.1662824
11 Absent Absent Absent 0.9831634
12 Mild Absent Mild 1.2397260
You could look at the interaction of the various pathologies. For example, with a barplot
## Make the interaction variable
df$interact <- interaction(df[, 1:3], sep="_")
## Look at means of groups
library(dplyr)
library(ggplot2) # needed for the ggplot() calls below
df %>% group_by(interact) %>%
dplyr::summarise(Output = mean(Output)) -> means
ggplot(means, aes(interact, Output))+
geom_bar(stat="identity") +
theme(axis.text=element_text(angle=90)) +
xlab("Interaction")
or with points
ggplot(df, aes(interact, Output))+
geom_point() +
theme(axis.text=element_text(angle=45, hjust=1)) +
xlab("Interaction") +
geom_point(data=means, col="red") +
ylim(0, 1.6)
You can simply plot the output against the categorical variables
plot(df[, 1], df[, 4])
plot(df[, 2], df[, 4])
plot(df[, 3], df[, 4])
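To put numbers next to these plots, here is a minimal sketch that computes the mean Output per level of one pathology variable in base R. The data are the Vascular_Pathology_M and Output columns from the question, retyped so the block runs on its own; my_dat is just a placeholder name.

```r
my_dat <- data.frame(
  Vascular_Pathology_M = factor(
    c("Absent", "Mild", "Mild/Moderate", "Absent", "Absent",
      "Mild", "Moderate/Severe", "Mild/Moderate", "Absent", "Mild"),
    levels = c("Absent", "Mild", "Mild/Moderate", "Moderate/Severe", "Severe")),
  Output = c(1.0178942, 1.0562763, 1.4923395, 1.3819237, 1.1309765,
             0.8613070, 0.7078206, 1.1662824, 0.9831634, 1.2397260)
)

# Mean Output and number of observations per pathology level;
# levels with no observations (here "Severe") give NA and 0
group_means <- tapply(my_dat$Output, my_dat$Vascular_Pathology_M, mean)
group_n <- table(my_dat$Vascular_Pathology_M)
round(group_means, 3)
group_n
```

With only ten rows and one or two observations in some levels, these are descriptive summaries at most.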
You have a 4 dimensional dataset. One option is to do a scatter plot (x/y = two dimensions), in a small multiple series (there's one more dimension), and map the Output variable to something visual like size (there's a fourth dimension).
Example, after putting your data in a data.frame called my_dat (since df is already assigned to a function in R). Points are jittered to show the multiple observations per point, and colored by Y position to help make clear which point goes with which category.
library(ggplot2)
my_dat$O_with_labels <-
factor(my_dat[, 3], labels=paste('Vasc Path O:', levels(my_dat[, 3])))
ggplot(my_dat,
aes(x=Vascular_Pathology_M, y=Vascular_Pathology_F)) +
geom_jitter(aes(size=Output, color=Vascular_Pathology_F)) +
facet_wrap(~O_with_labels) +
theme_bw() +
theme(axis.text.x = element_text(angle=45, hjust=1))