Situation: I have a list of data sets, collected by different loggers, like so:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "logger1", class = "factor"), OriginalTraitValue = c(0.37968,
0.455131, 0.606376, 0.910194, 1.19499, 1.55612, 1.91735, 2.35493,
2.60147, 2.42803, 1.66277, 1.12656, 0.628537), Temp = c(11.7334,
14.627, 19.3428, 24.5959, 29.6344, 34.7809, 39.606, 44.5389,
49.7914, 54.8254, 59.6391, 64.6695, 69.7002)), class = "data.frame", row.names = c(NA,
-13L))
Task: I only want to keep data sets that have a minimum of two recorded Temp values, before and after max(OriginalTraitValue).
I hope this plot makes it clearer. Red = maximum value, Green = values required to keep a data set.
Question
How do I do this in R, e.g. using dplyr?
I have managed to identify the Temp value corresponding to max(OriginalTraitValue) using df$Temp[df$OriginalTraitValue == max(df$OriginalTraitValue)], but I'm struggling with the necessary position arguments to filter the data sets.
EDIT
The example above represents a data set I would like to keep. The full data set looks something like this:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), .Label = c("logger1", "logger2", "logger3"
), class = "factor"), OriginalTraitValue = c(3.36e-11, 3.68e-11,
5.12e-11, 6.24e-11, 6.72e-11, 8.64e-11, 1.04e-10, 1.1e-10, 1.18e-10,
90.34189, 86.332214, 108.00114, 111.190155, 114.34427, 135.1673,
139.18198, 142.76979, 145.09233, 0.002, 0.06, 0.07, 0.15, 0.17,
0.17, 0.18, 0.18, 0.15, 0.07, 0.09), Temp = c(16, 18, 20, 22,
24, 26, 28, 30, 32, 16.726307, 17.376368, 20.193129, 25.06135,
25.060663, 29.875113, 29.924177, 30.422773, 34.417274, 10, 12.5,
15, 18, 20, 22.5, 25, 27.5, 30, 32.5, 35)), class = "data.frame", row.names = c(NA,
-29L))
> summary(df)
       ID     OriginalTraitValue      Temp
 logger1: 9   Min.   :  0.00     Min.   :10.00
 logger2: 9   1st Qu.:  0.00     1st Qu.:18.00
 logger3:11   Median :  0.15     Median :25.00
              Mean   : 37.02     Mean   :23.90
              3rd Qu.: 90.34     3rd Qu.:29.92
              Max.   :145.09     Max.   :35.00
In this data set, I would only keep the rows with ID logger3, since only logger3 contains at least 2 values before and after max(OriginalTraitValue).
Try:
library(dplyr)
df %>%
group_by(ID) %>%
slice(which.max(OriginalTraitValue) + -2:2) %>%
filter(n() == 5)
Output:
# A tibble: 5 x 3
# Groups: ID [1]
ID OriginalTraitValue Temp
<fct> <dbl> <dbl>
1 logger1 1.92 39.6
2 logger1 2.35 44.5
3 logger1 2.60 49.8
4 logger1 2.43 54.8
5 logger1 1.66 59.6
If you'd like to keep the whole group and not just the 5 observations in question, you could also do something like:
df %>%
group_by(ID) %>%
filter(any(cumsum(row_number() %in% c(which.max(OriginalTraitValue) + -2:2)) == 5))
You can achieve that with dplyr's filter:
df %>%
group_by(ID) %>%
# which.max() returns a single index even when the maximum value is tied
filter(abs(which.max(OriginalTraitValue) - row_number()) <= 2)
ID OriginalTraitValue Temp
<fct> <dbl> <dbl>
1 logger1 1.92 39.6
2 logger1 2.35 44.5
3 logger1 2.60 49.8
4 logger1 2.43 54.8
5 logger1 1.66 59.6
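The requirement can also be written as a direct position check, which sidesteps the mixed negative and positive indices that slice() would receive when the maximum sits in the first two rows of a group. This is only a minimal sketch on hypothetical toy data (toy, value, and kept are illustrative names, not from the question), and it assumes that "before" and "after" mean row positions once each group is ordered by Temp:

```r
library(dplyr)

# Hypothetical toy data: three groups of five rows each
toy <- data.frame(
  ID    = rep(c("a", "b", "c"), times = c(5, 5, 5)),
  value = c(1, 2, 5, 2, 1,   # a: peak at position 3, two rows on each side
            5, 4, 3, 2, 1,   # b: peak at position 1, nothing before it
            1, 2, 3, 4, 5),  # c: peak at position 5, nothing after it
  Temp  = rep(seq(10, 50, by = 10), 3)
)

kept <- toy %>%
  group_by(ID) %>%
  arrange(Temp, .by_group = TRUE) %>%
  # keep a group only if at least two rows precede and follow its peak
  filter(which.max(value) > 2,
         n() - which.max(value) >= 2) %>%
  ungroup()

kept
```

Both conditions evaluate to a single TRUE or FALSE per group, so filter() keeps or drops each group as a whole.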
Related
I want to loop over many columns to get a dplyr summary of age for each factor level of each column. I also want to add the column name to the dplyr table I create, but I am having issues assigning that as well.
I have tried to do the following using assign:
for(var in c("Sex", "Smoke", "Diabetes", "HIV")) {
assign(paste0("mean_",var))<-df%>%group_by(var) %>%
summarise(meanAge=mean(Age), sdAge=sd(Age))
}
I basically want summary tables of age for each column (mean_Sex, mean_Smoke, mean_Diabetes, and mean_HIV)
But I am getting error:
Error in group_by_prepare():
! Must group by variables found in .data.
Column var is not found.
Run rlang::last_error() to see where the error occurred.
Can anyone help with a way to do this?
data example:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
Sex = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L,
2L, 1L), .Label = c("F", "M"), class = "factor"), Smoke = structure(c(3L,
1L, 1L, 3L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 3L), .Label = c("N",
"NA", "Y"), class = "factor"), Diabetes = structure(c(3L,
1L, 3L, 3L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L), .Label = c("N",
"NA", "Y"), class = "factor"), HIV = structure(c(1L, 1L,
2L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 2L, 1L), .Label = c("N",
"NA", "Y"), class = "factor"), Age = c(23, 24, 43, 35, 18,
29, 25, 17, 22, 20, 55, 54)), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"))
library(dplyr); library(tidyr)
df %>%
pivot_longer(Sex:HIV) %>%
group_by(name, value) %>%
summarize(meanAge = mean(Age), sdAge = sd(Age)) %>%
ungroup()
name value meanAge sdAge
<chr> <fct> <dbl> <dbl>
1 Diabetes N 21 3.61
2 Diabetes NA 36.8 20.5
3 Diabetes Y 31 8.12
4 HIV N 28.6 14.3
5 HIV NA 38.3 19.4
6 HIV Y 26.8 7.14
7 Sex F 30.8 14.4
8 Sex M 30.1 13.8
9 Smoke N 35.5 16.4
10 Smoke NA 24 5.57
11 Smoke Y 30.2 14.9
or for differently shaped output:
df %>%
pivot_longer(Sex:HIV) %>%
group_by(name, value) %>%
summarize(meanAge = mean(Age), sdAge = sd(Age), .groups = "drop") %>%
pivot_wider(names_from = value, values_from = meanAge:sdAge, names_vary = "slowest")
# A tibble: 4 × 11
name meanAge_N sdAge_N meanAge_NA sdAge_NA meanAge_Y sdAge_Y meanAge_F sdAge_F meanAge_M sdAge_M
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Diabetes 21 3.61 36.8 20.5 31 8.12 NA NA NA NA
2 HIV 28.6 14.3 38.3 19.4 26.8 7.14 NA NA NA NA
3 Sex NA NA NA NA NA NA 30.8 14.4 30.1 13.8
4 Smoke 35.5 16.4 24 5.57 30.2 14.9 NA NA NA NA
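If separate tables per column are still wanted, the loop from the question can be kept. The error arises because group_by(var) looks for a column literally named var; wrapping the name in across(all_of(var)) tells dplyr to look up the column whose name is stored in var. The following is a minimal sketch on hypothetical toy data (toy and summaries are illustrative names), storing the results in a named list rather than using assign():

```r
library(dplyr)

# Hypothetical toy data standing in for the question's df
toy <- data.frame(
  Sex   = c("M", "M", "F", "F"),
  Smoke = c("Y", "N", "Y", "N"),
  Age   = c(20, 30, 40, 50)
)

summaries <- list()
for (var in c("Sex", "Smoke")) {
  summaries[[paste0("mean_", var)]] <- toy %>%
    group_by(across(all_of(var))) %>%   # group by the column named in var
    summarise(meanAge = mean(Age), sdAge = sd(Age), .groups = "drop") %>%
    mutate(variable = var)              # record which column was summarised
}

summaries$mean_Sex
```

A named list keeps the tables together and avoids creating variables in the global environment by name.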
I've got repeated measurements data in which patients are measured an irregular number of times (2 through 6 times per patient) and also with unequally spaced time intervals (some subsequent measures are 6 months apart, some 3 years). Is it possible to model this in a GEE model? For example by specifying a continuous AR1 correlation structure?
I've got some example data:
library(tidyverse)
library(magrittr)
library(geepack)
library(broom)
example_data <- structure(list(pat_id = c(2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4,
4, 7, 7, 8, 8, 8, 13, 13), measurement_number = c(1L, 2L, 3L,
4L, 5L, 6L, 1L, 2L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 1L, 2L, 3L, 1L,
2L), time = c(0, 0.545, 2.168, 2.68, 3.184, 5.695, 0, 1.892,
0, 0.939, 1.451, 1.955, 4.353, 0, 4.449, 0, 0.465, 4.005, 0,
0.364), age_standardized = c(-0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -0.0941625479695087, -0.0941625479695087,
-0.0941625479695087, -1.76464003778333, -1.76464003778333, -0.667610044472762,
-0.667610044472762, -0.667610044472762, -0.667610044472762, -0.667610044472762,
0.142696200586183, 0.142696200586183, 0.00556745142236116, 0.00556745142236116,
0.00556745142236116, 0.0554324511182961, 0.0554324511182961),
sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Female",
"Male"), class = "factor"), outcome = c(4241.943359375, 4456.4,
6533.673242397, 7255.561628906, 7594.527875667, 6416.4, 373.782029756049,
614.318359374, 6675.19041238403, 10623.94276368, 10849.01013281,
10627.30859375, 13213, 541.40780090332, 2849.5551411438,
2136.2, 2098.1, 2063.9, 5753.56313232422, 5108.199752386)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L))
head(example_data)
# A tibble: 6 x 6
pat_id measurement_number time age_standardized sex outcome
<dbl> <int> <dbl> <dbl> <fct> <dbl>
1 2 1 0 -0.0942 Female 4242.
2 2 2 0.545 -0.0942 Female 4456.
3 2 3 2.17 -0.0942 Female 6534.
4 2 4 2.68 -0.0942 Female 7256.
5 2 5 3.18 -0.0942 Female 7595.
6 2 6 5.70 -0.0942 Female 6416.
I have actually also modelled these data with a linear mixed model (using nlme, specifying a continuous AR1), but my supervisor asked me to also explore using a GEE, that's why I ask.
I've read that, using the geepack package, it is possible to define the correlation structure yourself, but I can't code that well to see if it is possible to define the structure so that rho is adjusted for the time interval in between measurements (by making it rho^s where s is the number of time units).
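To make the rho^s idea concrete, the working correlation between two measurements of the same patient would be rho raised to the absolute time difference between them. The sketch below only builds that matrix in base R for the measurement times of patient 2 from the example data; the value of rho is an arbitrary assumption for illustration, and the sketch does not show how to pass such a matrix to geepack:

```r
# Measurement times of patient 2, taken from example_data
times <- c(0, 0.545, 2.168, 2.68, 3.184, 5.695)

# Hypothetical correlation per time unit (assumed, not estimated)
rho <- 0.8

# Absolute time lag between every pair of measurements
lags <- abs(outer(times, times, "-"))

# Continuous AR1 correlation: rho^|t_i - t_j|
R <- rho^lags
round(R, 3)
```

Measurements that are close in time get a correlation near 1, and the correlation decays as the gap grows, regardless of how many visits lie in between.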
In the data set below, I want to first check which rows have the same values in the columns U and D. Then, for each set of rows sharing the same U and D values, I want to keep the row that has the minimum value for the columns Mean, Min and Max. For the data I have, these three will always have their minimum values in the same row within each group of rows where U and D match.
I tried the group() function, but it hasn't produced the output I want. Please suggest an efficient approach.
Input Data
data <- structure(list(A = c(0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18,
0.18, NA, NA, NA, NA, NA, NA), B = c(0.33, 0.33, 0.33, 0.33,
0.33, 0.33, 0.33, 0.33, 1, 2, 2, 2, 3, 4), C = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Yes", class = "factor"),
U = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("ABC-001", "PQR-001"), class = "factor"),
D = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("ABC", "PQR"), class = "factor"),
E = structure(c(1L, 2L, 3L, 4L, 4L, 5L, 5L, 6L, 1L, 1L, 2L,
2L, 3L, 3L), .Label = c("A", "B", "C", "D", "E", "F"), class = "factor"),
F = c(22000014L, 22000031L, 22000033L, 22000025L, 22000028L,
22000020L, 22000021L, 22000015L, 11100076L, 11200076L, 11100077L,
11200077L, 11100078L, 11200078L), G = c(0, 0, 0, 0, 0, 0,
0, 0, -0.1, -0.1, -0.1, -0.1, 0.2, 0.2), H = c(100, 100,
100, 100, 100, 100, 100, 100, 1.2, 1.2, 1.2, 1.2, 0.9, 0.9
), I = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("us", "V"), class = "factor"),
Mean = c(38.72, 37.52111111, 38.44166667, 39.23666667, 39.35888889,
38.96, 38.95333333, 38.41777778, 0.691707061, 0.691554561,
0.691516833, 0.691423506, 0.763736, 0.764015761), Min = c(34.05,
33.25, 33.31, 35.14, 33.91, 33.78, 33.78, 33.75, 0.6911166,
0.6908743, 0.6908813, 0.6907286, 0.7609318, 0.7616949), Max = c(43.83,
42.12, 43.57, 44.03, 44.88, 44.03, 44.02, 43.52, 0.692533,
0.6922278, 0.6923681, 0.6919283, 0.7674736, 0.7668633)), class = "data.frame", row.names = c(NA,
-14L))
Expected Output
output <- read.table(header = TRUE, text = " A B C U D E F G H I Mean Min Max
0.18 0.33 Yes ABC-001 ABC B 22000031 0 100 us 37.52111111 33.25 42.12
NA 2 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.691423506 0.6907286 0.6919283
")
You can use order and duplicated, both from base R:
data = data[order(data$Mean),]
output = data[!duplicated(data[c("U","D")]),]
output
A B C U D E F G H I Mean Min Max
12 NA 2.00 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.6914235 0.6907286 0.6919283
2 0.18 0.33 Yes ABC-001 ABC B 22000031 0.0 100.0 us 37.5211111 33.2500000 42.1200000
If you want dplyr
library(dplyr)
data %>% group_by(U, D) %>% slice(which.min(Mean))
The cleanest way to do this would be with dplyr
library(dplyr)
data %>% group_by(U, D) %>% filter(Mean == min(Mean))
The output looks like this
A B C U D E F G H I Mean Min Max
<dbl> <dbl> <fct> <fct> <fct> <fct> <int> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 0.18 0.33 Yes ABC-001 ABC B 22000031 0 100 us 37.5 33.2 42.1
2 NA 2 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.691 0.691 0.692
Consider aggregating, then joining back to the original data. Below, names() is used to restore the original column order, and merge() omits the by argument because all columns in the aggregated result (U, D, Mean, Min, Max) are matched:
agg_df <- aggregate(cbind(Mean, Min, Max) ~ U + D, data, FUN=min)
merge(data, agg_df)[names(data)]
# A B C U D E F G H I Mean Min Max
# 1 0.18 0.33 Yes ABC-001 ABC B 22000031 0.0 100.0 us 37.5211111 33.2500000 42.1200000
# 2 NA 2.00 Yes PQR-001 PQR B 11200077 -0.1 1.2 V 0.6914235 0.6907286 0.6919283
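The approaches above lean on the question's statement that Mean, Min and Max reach their minimum in the same row of each group; the aggregate-and-merge version in particular would return no row for a group where that does not hold. A quick way to check the assumption is sketched below on hypothetical toy data (toy and check are illustrative names; the values are rounded stand-ins, not the question's data):

```r
library(dplyr)

# Hypothetical toy data with the same grouping columns as the question
toy <- data.frame(
  U    = c("ABC-001", "ABC-001", "PQR-001", "PQR-001"),
  D    = c("ABC", "ABC", "PQR", "PQR"),
  Mean = c(38.7, 37.5, 0.6917, 0.6914),
  Min  = c(34.1, 33.3, 0.6911, 0.6907),
  Max  = c(43.8, 42.1, 0.6925, 0.6919)
)

check <- toy %>%
  group_by(U, D) %>%
  # TRUE when the three minima fall in the same row of the group
  summarise(same_row = which.min(Mean) == which.min(Min) &&
                       which.min(Min) == which.min(Max),
            .groups = "drop")

check
```

If any same_row value is FALSE, the choice of which column defines "the minimum row" has to be made explicitly.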
I have these example data
Data <- structure(list(IndID = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L,
4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L), .Label = c("1",
"2", "3", "4", "5", "56", "58", "59", "60", "63"), class = "factor"),
Species = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("BHS",
"MTG"), class = "factor"), Season = structure(c(1L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L), .Label = c("Summer", "Winter"), class = "factor"),
Percent = c(0.992, 0.992, 0.996, 0.976, 0.995, 0.871, 0.996,
0.996, 0.916, 0.875, 0.652, 0.802, 0.964, 0.673, 0.956, 0.879,
0.972, 0.782, 0.968, 0.832)), .Names = c("IndID", "Species",
"Season", "Percent"), row.names = c(NA, -20L), class = "data.frame")
Which look like this
> head(Data)
IndID Species Season Percent
1 1 BHS Summer 0.992
2 1 BHS Winter 0.992
3 2 BHS Summer 0.996
4 2 BHS Winter 0.976
5 3 BHS Winter 0.995
6 3 BHS Summer 0.871
There are 10 unique individuals that belong to one of two species (BHS or MTG). For each individual (IndID), there is a Percent value for each Season (Winter and Summer).
For each Species, I want to select the two individuals that have the highest average Percent value.
EDIT Also see my note below. I did not post a specific outcome because there are multiple that would work for my needs. Because I need a measure Percent for each Season, I thought taking the average of the Percent would be the best approach to select the top individuals. Percent was measured for each season, but I want to select the highest rank IndID. I could also rank IndID by the sum of Percent (rather than average).
In addition to the 2nd chunk of code posted by @akrun, a vector of 4 IndIDs (the two top ranked for each species) would also have been a fine output.
Thanks in advance for your help.
Assuming that you would like a dplyr solution (from the tag), we group the data by 'Species', order the 'Percent' column in descending order (arrange), and use slice to get the first two rows for each 'Species':
library(dplyr)
Data %>%
group_by(Species) %>%
arrange(desc(Percent)) %>%
slice(1:2)
# IndID Species Season Percent
#1 2 BHS Summer 0.996
#2 4 BHS Summer 0.996
#3 60 MTG Summer 0.972
#4 63 MTG Summer 0.968
An expected output would have been easier. If this is based on average percentage, we group by 'Species' and 'IndID', create a new column 'AvgPercent' based on the mean of 'Percent', we group by 'Species', order the 'AvgPercent' column in descending order and get the 1st two 'IndID'
Data %>%
group_by(Species, IndID) %>%
mutate(AvgPercent=mean(Percent)) %>%
group_by(Species) %>%
arrange(desc(AvgPercent)) %>%
slice(1:4) %>%
select(-AvgPercent) %>%
filter(!duplicated(IndID))
# IndID Species Season Percent
#1 4 BHS Summer 0.996
#2 1 BHS Summer 0.992
#3 59 MTG Summer 0.956
#4 63 MTG Summer 0.968
Another option, making use of tidyr's spread:
library(dplyr)
library(tidyr)
Data %>%
spread(Season, Percent) %>%
mutate(avg = (Summer + Winter)/2) %>%
group_by(Species) %>%
arrange(desc(avg)) %>%
top_n(2)
# IndID Species Summer Winter avg
#1 4 BHS 0.996 0.996 0.9960
#2 1 BHS 0.992 0.992 0.9920
#3 59 MTG 0.956 0.879 0.9175
#4 63 MTG 0.968 0.832 0.9000
Here is a data.table approach
library(data.table)
setDT(Data)[, avg := mean(Percent), by = .(IndID, Species)]
Data[Data[Season=="Summer", .I[order(avg, decreasing = T)[1:2]], by = Species]$V1]
# IndID Species Season Percent avg
#1: 4 BHS Summer 0.996 0.9960
#2: 1 BHS Summer 0.992 0.9920
#3: 59 MTG Summer 0.956 0.9175
#4: 63 MTG Summer 0.968 0.9000
Or with plyr
library(plyr)
ddply(Data, "Species", function(x) sort(x[, "Percent"], T))[, 1:3]
Species V1 V2
1 BHS 0.996 0.996
2 MTG 0.972 0.968
A data.table solution (library(data.table)).
Wrap your Data in a data.table object:
d <- data.table(Data)
Make a new table that also lists the average Percent (between Summer and Winter for each individual):
t <- d[, meanPercent := mean(Percent), by = IndID]
Collapse to one row per IndID, keeping Species and meanPercent:
t <- t[, .SD[, list(Species, meanPercent)][1], by = IndID]
Finally, select the top two individuals by average Percent for each Species:
t[order(-meanPercent)][Species == "BHS"][1:2]
t[order(-meanPercent)][Species == "MTG"][1:2]
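With more recent dplyr versions (1.0.0 or later, which provide slice_max()), the same idea can be written by first collapsing to one row per individual and then taking the top two per species. This is a minimal sketch on hypothetical toy data (toy and top2 are illustrative names), and it assumes ties should be broken arbitrarily so that exactly two individuals are returned per species:

```r
library(dplyr)

# Hypothetical toy data: six individuals, two seasons each
toy <- data.frame(
  IndID   = rep(c("1", "2", "3", "4", "5", "6"), each = 2),
  Species = rep(c("BHS", "MTG"), each = 6),
  Season  = rep(c("Summer", "Winter"), times = 6),
  Percent = c(0.90, 0.80, 0.70, 0.90, 0.50, 0.60,
              0.95, 0.85, 0.60, 0.60, 0.70, 0.90)
)

top2 <- toy %>%
  group_by(Species, IndID) %>%
  summarise(AvgPercent = mean(Percent), .groups = "drop") %>%  # one row per individual
  group_by(Species) %>%
  slice_max(AvgPercent, n = 2, with_ties = FALSE) %>%          # two highest averages per species
  ungroup()

top2$IndID
```

Replacing mean(Percent) with sum(Percent) gives the sum-based ranking mentioned in the question; with two seasons per individual the ordering is the same.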
Introduction
I have multilevel survey data of teachers nested in schools. I have manually calculated design weights and non-response adjustment weights based on probability selection and response rate (oldwt below). Now I want to create post-stratification weights by raking on two marginals: the teacher's sex (male or female) and employment status (full-time or not full-time). With the help of kind people at Statalist (see here), I have seemingly done this in Stata successfully. However, in trying to replicate the results in R, I come up with vastly different output.
Sample Data
#Variables
#school : unique school id
#caseid : unique teacher id
#oldwt : the product of the design weight and the non-response adjustment
#gender : male or female
#timecat : employment status (full-time or part-time)
#scgender : a combined factor variable of school x gender
#sctime : a combined factor variable of school x timecat
#genderp : the school's true population for gender
#fullp : the school's true population for timecat
#Sample Data
foo <- structure(list(caseid = 1:11, school = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), oldwt = c(1.8, 1.8, 1.8, 1.8, 1.8, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3), gender = structure(c(2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"), timecat = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Full-time", "Part-time"), class = "factor"), scgender = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("1.Female", "1.Male", "2.Female", "2.Male"), class = "factor"), sctime = structure(c(2L, 2L, 1L, 1L, 1L, 4L, 4L, 3L, 3L, 3L, 3L), .Label = c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time"), class = "factor"), genderp = c(0.444, 0.556, 0.556, 0.444, 0.444, 0.25, 0.75, 0.75, 0.25, 0.75, 0.75), fullp = c(0.222, 0.222, 0.778, 0.778, 0.778, 0.375, 0.375, 0.625, 0.625, 0.625, 0.625)), .Names = c("caseid", "school", "oldwt", "gender", "timecat", "scgender", "sctime", "genderp", "fullp"), class = "data.frame", row.names = c(NA, -11L))
Raking Code
(See here and here for in-depth examples of using anesrake in R).
# extract true population proportions into a vector
genderp <- c(aggregate(foo$genderp, by=list(foo$scgender), FUN=max))
fullp <- c(aggregate(foo$fullp, by=list(foo$sctime), FUN=max))
genderp <- as.vector(genderp$x)
fullp <- as.vector(fullp$x)
# align the levels/labels of the population total with the variables
names(genderp) <- c("1.Female", "1.Male", "2.Female", "2.Male")
names(fullp) <- c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time")
# create target list of true population proportions for variables
targets <- list(genderp, fullp)
names(targets) <- c("scgender", "sctime")
# rake
library(anesrake)
outsave <- anesrake(targets, foo, caseid = foo$caseid, weightvec = foo$oldwt, verbose = F, choosemethod = "total", type = "nolim", nlim = 2, force1 = FALSE)
outsave
Comparison with Stata Output
The issue is that the output from R doesn't match the output from Stata (even if I set force1 = TRUE), and it seems that the Stata output is the one that is right, making me think my sloppy R code is wrong. Is that the case?
caseid R Stata
1 0.070 0.633
2 0.152 1.367
3 0.404 3.633
4 0.187 1.683
5 0.187 1.683
6 0.143 1.146
7 0.232 1.854
8 0.173 1.382
9 0.107 0.854
10 0.173 1.382
11 0.173 1.382
The target distribution for each variable in R should sum to one and represent the distribution in your population. Look at my example. I think that the force1 option will not compute the distribution you want unless each school has the same population weight. This is what force1 is doing:
targets[[1]]/sum(targets[[1]])
1.Female 1.Male 2.Female 2.Male
0.278 0.222 0.125 0.375
Is that what you want?
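If the schools differ in size, one way to obtain targets that sum to one is to weight each within-school proportion by that school's share of the total population. The sketch below uses the gender proportions from the question, but the school sizes in school_n are an assumption made for illustration (they are not given in the question), and whether this reproduces the Stata weights has not been checked:

```r
# Within-school gender proportions, as extracted in the question
genderp <- c("1.Female" = 0.556, "1.Male" = 0.444,
             "2.Female" = 0.25,  "2.Male" = 0.75)

# Hypothetical school population sizes (assumed, not from the question)
school_n <- c("1" = 9, "2" = 8)
school_share <- school_n / sum(school_n)

# School id is the part of each name before the first dot
school_of <- sub("\\..*$", "", names(genderp))

# Scale each within-school proportion by its school's population share
target_gender <- genderp * unname(school_share[school_of])

target_gender
sum(target_gender)
```

With equal school sizes this reduces to dividing by the number of schools, which matches the force1 result shown above.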