I've got repeated-measures data in which patients are measured an irregular number of times (2 to 6 times per patient) and at unequally spaced time intervals (some subsequent measurements are 6 months apart, some 3 years). Is it possible to model this in a GEE model, for example by specifying a continuous AR(1) correlation structure?
I've got some example data:
library(tidyverse)
library(magrittr)
library(geepack)
library(broom)
example_data <- structure(list(pat_id = c(2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4,
4, 7, 7, 8, 8, 8, 13, 13), measurement_number = c(1L, 2L, 3L,
4L, 5L, 6L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 1L, 2L, 3L, 1L,
2L), time = c(0, 0.545, 2.168, 2.68, 3.184, 5.695, 0, 1.892,
0, 0.939, 1.451, 1.955, 4.353, 0, 4.449, 0, 0.465, 4.005, 0,
0.364), age_standardized = c(-0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -1.76464003778333, -1.76464003778333, -0.667610044472762,
-0.667610044472762, -0.667610044472762, -0.667610044472762, -0.667610044472762,
0.142696200586183, 0.142696200586183, 0.00556745142236116, 0.00556745142236116,
0.00556745142236116, 0.0554324511182961, 0.0554324511182961),
sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Female",
"Male"), class = "factor"), outcome = c(4241.943359375, 4456.4,
6533.673242397, 7255.561628906, 7594.527875667, 6416.4, 373.782029756049,
614.318359374, 6675.19041238403, 10623.94276368, 10849.01013281,
10627.30859375, 13213, 541.40780090332, 2849.5551411438,
2136.2, 2098.1, 2063.9, 5753.56313232422, 5108.199752386)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L))
head(example_data)
# A tibble: 6 x 6
pat_id measurement_number time age_standardized sex outcome
<dbl> <int> <dbl> <dbl> <fct> <dbl>
1 2 1 0 -0.0942 Female 4242.
2 2 2 0.545 -0.0942 Female 4456.
3 2 3 2.17 -0.0942 Female 6534.
4 2 4 2.68 -0.0942 Female 7256.
5 2 5 3.18 -0.0942 Female 7595.
6 2 6 5.70 -0.0942 Female 6416.
I have actually also modelled these data with a linear mixed model (using nlme, specifying a continuous AR(1)), but my supervisor asked me to also explore using a GEE; that's why I ask.
I've read that, using the geepack package, it is possible to define the correlation structure yourself, but I can't code well enough to see whether the structure can be defined so that rho is adjusted for the time interval between measurements (by making it rho^s, where s is the number of time units between two measurements).
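As a point of reference for what this asks geepack to do: under a continuous-time AR(1) structure, corr(y_i, y_j) = rho^|t_i - t_j|. A minimal base-R sketch of the implied working correlation matrix, using patient 2's measurement times from the example data and an arbitrary rho = 0.8 (illustrative only, not an estimate):

```r
# Continuous-time AR(1): corr(y_i, y_j) = rho^|t_i - t_j|
# Measurement times of patient 2 from the example data
times <- c(0, 0.545, 2.168, 2.68, 3.184, 5.695)
rho <- 0.8  # arbitrary illustrative value, not a fitted estimate

# Matrix of absolute time differences, then raise rho to those powers
R <- rho^abs(outer(times, times, "-"))
round(R, 3)
```

Note how the correlation decays with the gap between measurements: adjacent observations 0.545 units apart correlate at 0.8^0.545, while those 5.695 units apart correlate at only 0.8^5.695. A fixed rho^1 between adjacent waves (as in geepack's ordinary "ar1") would ignore these unequal gaps.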
Good morning,
I am currently working on animal body condition score (BCS). For each individual I have several rows, but the number of rows varies from one individual to another. As columns I have the animal name (factor), the date (factor) when the BCS was recorded, and the BCS (numeric) itself.
Here is an example of my data:
structure(list(name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("INDIV1",
"INDIV2", "INDIV3", "INDIV4", "INDIV5", "INDIV6",
"INDIV7", "INDIV8", "INDIV9", "INDIV10",
"INDIV11", "INDIV12", "INDIV13", "INDIV14", "INDIV15",
"INDIV16", "INDIV17", "INDIV18", "INDIV19",
"INDIV20", "INDIV21", "INDIV22", "INDIV23",
"INDIV24", "INDIV25", "INDIV26", "INDIV27",
"INDIV28", "INDIV29", "INDIV30", "INDIV31",
"INDIV32", "INDIV33", "INDIV34", "INDIV35",
"INDIV36", "INDIV37", "INDIV38", "INDIV39", "INDIV40",
"INDIV41", "INDIV42", "INDIV43", "INDIV44", "INDIV45",
"INDIV46", "INDIV47", "INDIV48", "INDIV49",
"INDIV50", "INDIV51", "INDIV52", "INDIV53",
"INDIV54", "INDIV55", "INDIV56", "INDIV57", "INDIV58",
"INDIV59", "INDIV60", "INDIV61", "INDIV62",
"INDIV63", "INDIV64", "INDIV65", "INDIV66",
"INDIV67", "INDIV68", "INDIV69", "INDIV70",
"INDIV71", "INDIV72", "INDIV73", "INDIV74",
"INDIV75", "INDIV76", "INDIV77", "INDIV78", "INDIV79",
"INDIV80", "INDIV81", "INDIV82", "INDIV83",
"INDIV84", "INDIV85", "INDIV86", "INDIV87",
"INDIV88", "INDIV89", "INDIV90",
"INDIV91", "INDIV92", "INDIV93", "INDIV94",
"INDIV95", "INDIV96", "INDIV97", "INDIV98",
"INDIV99", "INDIV100", "INDIV101", "INDIV102", "INDIV103",
"INDIV104", "INDIV105", "INDIV106", "INDIV107", "INDIV108",
"INDIV109", "INDIV110", "INDIV111", "INDIV112",
"INDIV113", "INDIV114", "INDIV115", "INDIV116",
"INDIV117", "INDIV118"), class = "factor"), date = structure(c(4L,
4L, 4L, 36L, 36L, 36L, 8L, 8L, 8L, 21L, 21L, 21L, 38L, 38L, 38L,
1L, 1L, 1L, 4L, 4L), .Label = c("03/10/2019", "03/12/2019", "04/12/2019",
"05/02/2019", "06/02/2019", "07/04/2019", "08/01/2019", "10/04/2019",
"10/12/2019", "11/02/2019", "11/09/2019", "11/12/2019", "12/08/2019",
"12/09/2019", "12/12/2019", "13/02/2019", "13/03/2019", "13/08/2019",
"13/09/2019", "14/05/2019", "14/06/2019", "14/11/2019", "15/07/2019",
"15/10/2019", "15/11/2019", "16/01/2019", "16/04/2019", "16/07/2019",
"16/10/2019", "17/05/2019", "18/06/2019", "18/10/2019", "19/03/2019",
"19/06/2019", "19/12/2019", "20/03/2019", "21/03/2019", "23/07/2019",
"25/04/2019", "26/04/2019", "27/09/2019", "28/01/2019", "28/05/2019",
"28/06/2019", "31/05/2019"), class = "factor"), BCS = c(4, 4,
4, 4, 4, 4, 4, 4, 4, 4.75, 4.75, 4.75, 4.75, 4.75, 4.75, 4.5,
4.5, 4.5, 2.25, 2.25)), row.names = c(NA, 20L), class = "data.frame")
My goal here is to identify individuals with a BCS >= 4 for every measurement.
I have tried to write functions using if and while statements but, so far, I can't get the information I am looking for...
I apologize in advance if this kind of question has been asked previously.
Thank you for your future help!
I named the data frame you provided df, so try:
df = droplevels(df)
tapply(df$BCS>=4,df$name,all)
INDIV1 INDIV2
TRUE FALSE
The step above makes a boolean out of each BCS value (TRUE if >= 4); tapply then splits this boolean vector by name, and all asks whether every element is TRUE.
The result above shows that INDIV1 has all BCS >= 4.
To get the names, do:
names(which(tapply(df$BCS>=4,df$name,all)))
[1] "INDIV1"
It's not very clear what your objective is in
to identify individuals with a BCS >= 4 for each measurement
Maybe something like the below is your desired output:
> aggregate(BCS~name,df, FUN = function(x) all(x>=4))
name BCS
1 INDIV1 TRUE
2 INDIV2 FALSE
We can use tidyverse
library(dplyr)
df %>%
group_by(name) %>%
summarise(BCS = all(BCS >= 4))
# A tibble: 2 x 2
# name BCS
# <fct> <lgl>
#1 INDIV1 TRUE
#2 INDIV2 FALSE
Situation: I have a list of data sets, collected by different loggers, like so:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "logger1", class = "factor"), OriginalTraitValue = c(0.37968,
0.455131, 0.606376, 0.910194, 1.19499, 1.55612, 1.91735, 2.35493,
2.60147, 2.42803, 1.66277, 1.12656, 0.628537), Temp = c(11.7334,
14.627, 19.3428, 24.5959, 29.6344, 34.7809, 39.606, 44.5389,
49.7914, 54.8254, 59.6391, 64.6695, 69.7002)), class = "data.frame", row.names = c(NA,
-13L))
Task: I only want to keep data sets that have at least two recorded Temp values before and after max(OriginalTraitValue).
I hope this plot makes it clearer. Red = maximum value, Green = values required to keep a data set.
Question
How do I do this in R, e.g. using dplyr?
I have managed to identify the Temp value corresponding to max(OriginalTraitValue) using df$Temp[df$OriginalTraitValue == max(df$OriginalTraitValue)], but I'm struggling with the necessary position arguments to filter the data sets.
EDIT
The example above represents a data set I would like to keep. The full data set looks something like this:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), .Label = c("logger1", "logger2", "logger3"
), class = "factor"), OriginalTraitValue = c(3.36e-11, 3.68e-11,
5.12e-11, 6.24e-11, 6.72e-11, 8.64e-11, 1.04e-10, 1.1e-10, 1.18e-10,
90.34189, 86.332214, 108.00114, 111.190155, 114.34427, 135.1673,
139.18198, 142.76979, 145.09233, 0.002, 0.06, 0.07, 0.15, 0.17,
0.17, 0.18, 0.18, 0.15, 0.07, 0.09), Temp = c(16, 18, 20, 22,
24, 26, 28, 30, 32, 16.726307, 17.376368, 20.193129, 25.06135,
25.060663, 29.875113, 29.924177, 30.422773, 34.417274, 10, 12.5,
15, 18, 20, 22.5, 25, 27.5, 30, 32.5, 35)), class = "data.frame", row.names = c(NA,
-29L))
> summary(df)
ID OriginalTraitValue Temp
logger1: 9 Min. : 0.00 Min. :10.00
logger2: 9 1st Qu.: 0.00 1st Qu.:18.00
logger3:11 Median : 0.15 Median :25.00
Mean : 37.02 Mean :23.90
3rd Qu.: 90.34 3rd Qu.:29.92
Max. :145.09 Max. :35.00
In this data set, I would only keep ID as logger3, since only logger3 contains at least 2 values before and after max(OriginalTraitValue).
Try:
library(dplyr)
df %>%
group_by(ID) %>%
slice(which.max(OriginalTraitValue) + -2:2) %>%
filter(n() == 5)
Output:
# A tibble: 5 x 3
# Groups: ID [1]
ID OriginalTraitValue Temp
<fct> <dbl> <dbl>
1 logger1 1.92 39.6
2 logger1 2.35 44.5
3 logger1 2.60 49.8
4 logger1 2.43 54.8
5 logger1 1.66 59.6
If you'd like to keep the whole group and not just the 5 observations in question, you could also do something like:
df %>%
group_by(ID) %>%
filter(any(cumsum(row_number() %in% c(which.max(OriginalTraitValue) + -2:2)) == 5))
You can achieve that with dplyr's filter:
df %>%
group_by(ID) %>%
filter(abs(which(OriginalTraitValue == max(OriginalTraitValue)) - row_number()) <= 2)
ID OriginalTraitValue Temp
<fct> <dbl> <dbl>
1 logger1 1.92 39.6
2 logger1 2.35 44.5
3 logger1 2.60 49.8
4 logger1 2.43 54.8
5 logger1 1.66 59.6
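The "at least two points on each side of the maximum" rule can also be expressed in base R, keeping whole groups. A sketch with toy data invented for illustration (loggerA peaks at its last point and is dropped; loggerB peaks in the middle with two points on each side and is kept); note which.max() takes the first occurrence of the maximum:

```r
# Toy data invented for illustration, in the question's shape
df <- data.frame(
  ID = rep(c("loggerA", "loggerB"), times = c(4, 6)),
  OriginalTraitValue = c(1, 2, 3, 4,  1, 2, 5, 4, 3, 2),
  Temp = c(10, 12, 14, 16,  10, 12, 14, 16, 18, 20)
)
# For each logger, check the position of the (first) maximum:
# it must leave at least 2 rows before it and 2 rows after it.
keep <- vapply(split(df, df$ID), function(d) {
  i <- which.max(d$OriginalTraitValue)
  i > 2 && i <= nrow(d) - 2
}, logical(1))
# Keep all rows of the qualifying loggers
df_kept <- df[df$ID %in% names(keep)[keep], ]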
In the data set below, I want to first check which rows have the same value in columns U and D. Then, for each set of rows having the same value for U and D, I want to keep the row that has the minimum value for columns Mean, Min and Max. For the data I have, these three columns will always take their minimum values in the same row out of the group of rows where U and D match.
I tried the group() function, but it hasn't yielded the output I want. Please suggest an efficient approach.
Input Data
data <- structure(list(A = c(0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18,
0.18, NA, NA, NA, NA, NA, NA), B = c(0.33, 0.33, 0.33, 0.33,
0.33, 0.33, 0.33, 0.33, 1, 2, 2, 2, 3, 4), C = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Yes", class = "factor"),
U = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("ABC-001", "PQR-001"), class = "factor"),
D = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("ABC", "PQR"), class = "factor"),
E = structure(c(1L, 2L, 3L, 4L, 4L, 5L, 5L, 6L, 1L, 1L, 2L,
2L, 3L, 3L), .Label = c("A", "B", "C", "D", "E", "F"), class = "factor"),
F = c(22000014L, 22000031L, 22000033L, 22000025L, 22000028L,
22000020L, 22000021L, 22000015L, 11100076L, 11200076L, 11100077L,
11200077L, 11100078L, 11200078L), G = c(0, 0, 0, 0, 0, 0,
0, 0, -0.1, -0.1, -0.1, -0.1, 0.2, 0.2), H = c(100, 100,
100, 100, 100, 100, 100, 100, 1.2, 1.2, 1.2, 1.2, 0.9, 0.9
), I = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("us", "V"), class = "factor"),
Mean = c(38.72, 37.52111111, 38.44166667, 39.23666667, 39.35888889,
38.96, 38.95333333, 38.41777778, 0.691707061, 0.691554561,
0.691516833, 0.691423506, 0.763736, 0.764015761), Min = c(34.05,
33.25, 33.31, 35.14, 33.91, 33.78, 33.78, 33.75, 0.6911166,
0.6908743, 0.6908813, 0.6907286, 0.7609318, 0.7616949), Max = c(43.83,
42.12, 43.57, 44.03, 44.88, 44.03, 44.02, 43.52, 0.692533,
0.6922278, 0.6923681, 0.6919283, 0.7674736, 0.7668633)), class = "data.frame", row.names = c(NA,
-14L))
Expected Output
output <- read.table(header = TRUE, text = "A B C U D E F G H I Mean Min Max
0.18 0.33 Yes ABC-001 ABC B 22000031 0 100 us 37.52111111 33.25 42.12
NA 2 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.691423506 0.6907286 0.6919283
")
You may check with order and duplicated, all from base R:
data = data[order(data$Mean),]
output = data[!duplicated(data[c("U","D")]),]
output
A B C U D E F G H I Mean Min Max
12 NA 2.00 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.6914235 0.6907286 0.6919283
2 0.18 0.33 Yes ABC-001 ABC B 22000031 0.0 100.0 us 37.5211111 33.2500000 42.1200000
If you want dplyr
library(dplyr)
data %>% group_by(U, D) %>% slice(which.min(Mean))
The cleanest way to do this would be with dplyr
library(dplyr)
data %>% group_by(U, D) %>% filter(Mean == min(Mean))
The output looks like this
A B C U D E F G H I Mean Min Max
<dbl> <dbl> <fct> <fct> <fct> <fct> <int> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 0.18 0.33 Yes ABC-001 ABC B 22000031 0 100 us 37.5 33.2 42.1
2 NA 2 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.691 0.691 0.692
Consider aggregating and then joining back to the original data. Below, names() is used to re-order columns, and merge() omits by= since all columns in the aggregate result set will be matched:
agg_df <- aggregate(cbind(Mean, Min, Max) ~ U + D, data, FUN=min)
merge(data, agg_df)[names(data)]
# A B C U D E F G H I Mean Min Max
# 1 0.18 0.33 Yes ABC-001 ABC B 22000031 0.0 100.0 us 37.5211111 33.2500000 42.1200000
# 2 NA 2.00 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.6914235 0.6907286 0.6919283
Introduction
I have multilevel survey data of teachers nested in schools. I have manually calculated design weights and non-response adjustment weights based on probability of selection and response rate (oldwt below). Now I want to create post-stratification weights by raking on two marginals: the sex (male or female) and the employment status (full-time or not full-time) of the teacher. With the help of kind people at Statalist (see here), I have seemingly done this successfully in Stata. However, when trying to replicate the results in R, I get vastly different output.
Sample Data
#Variables
#school : unique school id
#caseid : unique teacher id
#oldwt : the product of the design weight and the non-response adjustment
#gender : male or female
#timecat : employment status (full-time or part-time)
#scgender : a combined factor variable of school x gender
#sctime : a combined factor variable of school x timecat
#genderp : the school's true population for gender
#fullp : the school's true population for timecat
#Sample Data
foo <- structure(list(caseid = 1:11, school = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), oldwt = c(1.8, 1.8, 1.8, 1.8, 1.8, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3), gender = structure(c(2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"), timecat = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Full-time", "Part-time"), class = "factor"), scgender = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("1.Female", "1.Male", "2.Female", "2.Male"), class = "factor"), sctime = structure(c(2L, 2L, 1L, 1L, 1L, 4L, 4L, 3L, 3L, 3L, 3L), .Label = c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time"), class = "factor"), genderp = c(0.444, 0.556, 0.556, 0.444, 0.444, 0.25, 0.75, 0.75, 0.25, 0.75, 0.75), fullp = c(0.222, 0.222, 0.778, 0.778, 0.778, 0.375, 0.375, 0.625, 0.625, 0.625, 0.625)), .Names = c("caseid", "school", "oldwt", "gender", "timecat", "scgender", "sctime", "genderp", "fullp"), class = "data.frame", row.names = c(NA, -11L))
Raking Code
(See here and here for in-depth examples of using anesrake in R).
# extract true population proportions into a vector
genderp <- c(aggregate(foo$genderp, by=list(foo$scgender), FUN=max))
fullp <- c(aggregate(foo$fullp, by=list(foo$sctime), FUN=max))
genderp <- as.vector(genderp$x)
fullp <- as.vector(fullp$x)
# align the levels/labels of the population total with the variables
names(genderp) <- c("1.Female", "1.Male", "2.Female", "2.Male")
names(fullp) <- c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time")
# create target list of true population proportions for variables
targets <- list(genderp, fullp)
names(targets) <- c("scgender", "sctime")
# rake
library(anesrake)
outsave <- anesrake(targets, foo, caseid = foo$caseid, weightvec = foo$oldwt, verbose = F, choosemethod = "total", type = "nolim", nlim = 2, force1 = FALSE)
outsave
Comparison with Stata Output
The issue is that the output from R doesn't match the output from Stata (even if I set force1 = TRUE), and it seems that the Stata output is the right one, making me think my sloppy R code is wrong. Is that the case?
caseid R Stata
1 0.070 0.633
2 0.152 1.367
3 0.404 3.633
4 0.187 1.683
5 0.187 1.683
6 0.143 1.146
7 0.232 1.854
8 0.173 1.382
9 0.107 0.854
10 0.173 1.382
11 0.173 1.382
The distribution of your targets in R should sum to one and represent the distribution in your population. Look at my example. I don't think the force1 option will compute the distribution you want unless each school has the same population weight. This is what force1 is doing:
targets[[1]]/sum(targets[[1]])
1.Female 1.Male 2.Female 2.Male
0.278 0.222 0.125 0.375
Is that what you want?
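To sketch the normalization described above in base R (the raw values are taken from the question's targets, which sum to 2 because there is one proportion per school):

```r
# Raw school-by-gender targets from the question (max per combined level);
# they sum to 2 (one distribution per school), not 1
genderp <- c("1.Female" = 0.556, "1.Male" = 0.444,
             "2.Female" = 0.250, "2.Male" = 0.750)
# Rescale so the whole target distribution sums to one
genderp_norm <- genderp / sum(genderp)
round(genderp_norm, 3)
# 1.Female   1.Male 2.Female   2.Male
#    0.278    0.222    0.125    0.375
```

Note this implicitly gives each school equal population weight (each contributes half of the total); if the schools differ in size, the targets need to be weighted by school population before rescaling.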