Rollsum with week numbers - r

This question comes from a previous one I posted a while ago:
rollsum with fixed dates
I can not make the given solution to work. I have a large data set, the interesting columns are:
id = c(145658, 145658, 145658, 145658, 145658, 145658, 145658, 145658, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
week_number = c(24, 35, 44, 71, 82, 117, 127, 142, 4, 15, 20, 24, 30, 36, 42, 46, 59, 67, 68, 71, 75, 78, 79, 86, 93, 96)
amount = c(51.9, 51.9, 51.9, 51.9, 51.9, 103.8, 51.9, 51.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 101.0, 168.9, 101.0, 101.0, 135.8, 168.9, 168.9, 67.9)
df = data.frame(id = id, week_number = week_number, amount = amount)
In reality, I have thousands of id's, and each has different week number. I want to calculate the rollsum on the "amount" column for n past weeks (including the present week) for each id.
An extreme example would be with the past 100 weeks. The results would look like:
past_100wk = c(NA, NA, NA, NA, NA, 363.3, 363.3, 363.8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Again, this is an extreme case, but it shows the the results should give NA (or -1) when the row value is not included in the week_number window (100 weeks, in this case).
Thank you!

Related

How to subset a column for each dataframe in coefficient correlation (R) calculation in R?

I have two dataframes Vobs and Vest. See the example below:
dput(head(Vobs,20))
structure(list(ID = c("LAM_1", "LAM_2", "LAM_3", "LAM_4", "LAM_5",
"LAM_6", "LAM_7", "AUR_1", "AUR_2", "AUR_3", "AUR_4", "AUR_5",
"AUR_6"), SOS = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24,
26), EOS = c(3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27)), row.names = c(NA,
-13L), class = c("tbl_df", "tbl", "data.frame"))
dput(head(Vest,30))
structure(list(ID = c("LAM", "LAM", "LAM", "LAM", "LAM", "AUR",
"AUR", "AUR", "AUR", "AUR", "AUR", "P0", "P01", "P01", "P02",
"P1", "P2", "P3", "P4", "P13", "P14", "P15", "P17", "P18", "P19",
"P20", "P22", "P23", "P24"), EVI_SOS = c(2, 6, 10, 14, NA, 20,
24, 28, 32, 36, NA, 42, 42, NA, 48, 48, 52, 56, 60, 64, 68, NA,
NA, 72, NA, 78, 82, 86, 90), EVI_EOS = c(3, 7, 11, 15, NA, 21,
25, 29, 33, 37, NA, 43, 43, NA, 49, 49, 53, 57, 61, 65, 69, NA,
NA, 73, NA, 79, 83, 87, 91), NDVI_SOS = c(4, 8, 12, 16, 18, 22,
26, 30, 34, 38, 40, 44, 44, 46, 50, 50, 54, 58, 62, 66, 70, NA,
NA, 74, 76, 80, 84, 88, 92), NDVI_EOS = c(5, 9, 13, 17, 19, 23,
27, 31, 35, 39, 41, 45, 45, 47, 51, 51, 55, 59, 63, 67, 71, NA,
NA, 75, 77, 81, 85, 89, 93)), row.names = c(NA, -29L), class = c("tbl_df",
"tbl", "data.frame"))
I want to do the correlation coefficient (R) between the two dataframes. As an example, I pretend to do the R between SOS column of Vobs and EVI_SOS column of Vest concerning the LAM ID (which exists in both dataframes).
In other words, I want to subset the data for the ID of interest. In this example, I'm interested in the LAM ID, for Vest and LAM_3 to LAM_7 (that is LAM_3, LAM_4, LAM_5, LAM_6, LAM_7) for Vobs.
I have been using this code:
cor(Vobs$SOS, Vest$EVI_SOS, use = "complete.obs") but I missed the ID subset for both columns of the two different dataframes. How can I do the subset using this code?
Any help will be much appreciated.
In your specific case, to subset a character variable with a sequential numerical suffix, try using sprint() to append the number and subset as follows:
sprintf("LAM_%s",3:7)
[1] "LAM_3" "LAM_4" "LAM_5" "LAM_6" "LAM_7"
So:
Vobs[Vobs$ID %in% sprintf("LAM_%s",3:7),"SOS"]
# SOS
# <dbl>
# 1 6
# 2 8
# 3 10
# 4 12
# 5 14
Since the Vest dataset just has LAM for the observations, you can subset easier. Try
cor(Vobs[Vobs$ID %in% sprintf("LAM_%s",3:7),"SOS"],
Vest[Vest$ID %in% "LAM","EVI_SOS"], use = "complete.obs")

How to subset a column for each dataframe in RMSE calculation in R?

I have two dataframes Vobs and Vest. See the example below:
dput(head(Vobs,20))
structure(list(ID = c("LAM_1", "LAM_2", "LAM_3", "LAM_4", "LAM_5",
"LAM_6", "LAM_7", "AUR_1", "AUR_2", "AUR_3", "AUR_4", "AUR_5",
"AUR_6"), SOS = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24,
26), EOS = c(3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27)), row.names = c(NA,
-13L), class = c("tbl_df", "tbl", "data.frame"))
dput(head(Vest,30))
structure(list(ID = c("LAM", "LAM", "LAM", "LAM", "LAM", "AUR",
"AUR", "AUR", "AUR", "AUR", "AUR", "P0", "P01", "P01", "P02",
"P1", "P2", "P3", "P4", "P13", "P14", "P15", "P17", "P18", "P19",
"P20", "P22", "P23", "P24"), EVI_SOS = c(2, 6, 10, 14, NA, 20,
24, 28, 32, 36, NA, 42, 42, NA, 48, 48, 52, 56, 60, 64, 68, NA,
NA, 72, NA, 78, 82, 86, 90), EVI_EOS = c(3, 7, 11, 15, NA, 21,
25, 29, 33, 37, NA, 43, 43, NA, 49, 49, 53, 57, 61, 65, 69, NA,
NA, 73, NA, 79, 83, 87, 91), NDVI_SOS = c(4, 8, 12, 16, 18, 22,
26, 30, 34, 38, 40, 44, 44, 46, 50, 50, 54, 58, 62, 66, 70, NA,
NA, 74, 76, 80, 84, 88, 92), NDVI_EOS = c(5, 9, 13, 17, 19, 23,
27, 31, 35, 39, 41, 45, 45, 47, 51, 51, 55, 59, 63, 67, 71, NA,
NA, 75, 77, 81, 85, 89, 93)), row.names = c(NA, -29L), class = c("tbl_df",
"tbl", "data.frame"))
I want to do the root mean square error (RMSE) between the two dataframes. As an example, I pretend to do the RMSE between SOS column of Vobs and EVI_SOS column of Vest concerning the LAM ID (which exists in both dataframes).
In other words, I want to subset the data for the ID of interest. In this example, I'm interested in the LAM ID, for Vest and LAM_3 to LAM_7 (that is LAM_3, LAM_4, LAM_5, LAM_6, LAM_7) for Vobs.
I have been using this code:
sqrt(mean((Vobs$SOS - Vest$EVI_SOS)^2, na.rm=TRUE))
but I missed the ID subset for both columns of the two different dataframes. How can I do the subset using this code?
Any help will be much appreciated.
You could get the subsets of the relevant data as:
library(stringr)
diff <- subset(Vobs, ID %in% paste0("LAM_", 3:7))$SOS -
subset(Vest, str_detect(ID, "LAM"))$EVI_SOS
sqrt(mean(diff^2, na.rm=TRUE))
#> [1] 2.44949

Looping/sapply through nlme function

I am trying to execute a loop with mixed-model effects with response variable changing. I came from here and here. I should say that I have tried sthg creating a function and then sapply or lapply (wihtout success)
I provide a small dataset (really small) just to represent my original database (much larger and similar to those of longitudinal studies)
data<- structure(list(paciente = structure(c(6134, 6099, 6457, 6164,
6470, 6323, 6550, 6082, 6476, 6044, 6509, 6539, 6234, 6555, 6383,
6127, 6507, 6513, 6486, 6080, 6101, 6007, 6023, 6516, 6001, 6198,
6510, 6530, 6351, 6181), label = "Paciente", format.spss = "F6.0"),
edad_s1 = structure(c(70, 63, 61, 71, 67, 59, 63, 69, 67,
67, 67, 72, 65, 72, 63, 65, 60, 64, 56, 63, 57, 62, 72, 60,
72, 63, 72, 68, 66, 71), label = "Edad", format.spss = "F3.0"),
sexo_s1 = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L,
2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 2L), .Label = c("Hombre", "Mujer"), label = "Sexo", class = "factor"),
time = c(2, 1, 2, 1, 0, 0, 1, 0, 2, 1, 1, 0, 1, 2, 1, 2,
1, 2, 0, 1, 1, 0, 2, 1, 0, 2, 1, 2, 2, 0), grupo_int_v00 = structure(c(1L,
1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), label = "Grupo de intervención", class = "factor"),
peso1 = c(108, 80.4, 95, 75, 92.6, 90, 82.2, 94.4, 78, 71.3,
75.1, 83.5, 87.1, 63, 73, 98.5, 90.2, 81.3, 93.4, 79.8, 114.3,
110.9, 81.5, 88.5, 82.4, 88.3, 90, 73, 79, 94.8), cintura1 = c(127,
100.5, 103.5, 108, 115, 114.5, 95.5, 115, 101, 98, 99, 108.5,
105, 99, 104, 126, 114.2, 99, 110, 104.5, 120, 126, 111.5,
102, 117, 110, 125, 100, 104, 123), tasis2_e = c(156, 129,
131, 138, 167, 138, 115, 146, 119, 148, 130, 144, 115, 122,
135, 139, 128, 119, 138, 115, 138, 151, 151, NA, 137, 147,
124, 168, 152, 156), tadias2_e = c(70, 63, 80, 67, 76, 81,
57, 68, 69, 69, 68, 78, 61, 71, 54, 77, 63, 63, 92, 73, 80,
88, 84, NA, 79, 76, 62, 90, 87, 89), p17_total = c(10, 10,
5, 9, 9, 7, 15, 11, 6, 12, 11, 4, 9, 14, 9, 9, 11, 14, 6,
5, 10, 10, 9, 13, 12, 7, 11, 12, 7, 4), geaf_tot = c(1986.01,
1286.71, 1230.77, 1510.49, 839.16, 2144.52, 5361.31, 1678.32,
4055.94, 2601.4, 3363.64, 3076.92, 5342.66, 2769.23, 2601.4,
1693.24, 4055.94, 3146.85, 3916.08, 6405.59, 2442.89, 671.33,
867.13, 1585.08, 3153.85, 3188.81, 7986.01, 839.16, 7552.45,
2937.06), glucosa = c(127, 97, 95, 102, 119, 113, 109, 105,
93, 167, 85, 108, 122, 112, 113, 120, 100, 108, 100, 86,
129, 136, 98, 97, 130, 125, 109, 102, NA, 181), albumi = c(4.47,
4.82, 4.78, 4.22, 4.59, 4.5, 4.33, 4.87, 4.83, 4.98, 4.23,
4.77, 4.76, 4.98, 4.18, 4.51, 4.72, 4.87, 4.77, 4.61, 4.55,
4.77, 4.6, 4.59, 4.25, 4.71, 4.47, 4.54, NA, 4.63), coltot = c(157,
191, 276, 248, 248, 217, 187, 301, 173, 230, 258, 238, 231,
181, 183, 243, 223, 195, 237, 245, 164, 145, 199, 234, 178,
192, 201, 198, NA, 159), hdl = c(39, 50, 57, 59, 49, 44,
60, 98, 52, 73, 58, 44, 58, 60, 48, 46, 73, 58, 39, 47, 38,
45, 59, 56, 72, 34, 78, 62, NA, 54), ldl_calc = c(91, 124,
204, 133, 155, 140, 105, 162, 91, 141, 182, 173, 155, 107,
83, 150, 132, 124, NA, 167, 101, 88, 121, 160, 84, 130, 112,
120, NA, NA), trigli = c(137, 87, 74, 282, 219, 165, 112,
203, 149, 78, 89, 105, 91, 71, 259, 236, 92, 63, 447, 157,
123, 58, 94, 90, 112, 139, 53, 80, NA, 429), hba1c = c(6.57,
5.82, 5.68, 5.96, 6.11, 5.73, 5.48, 5.8, 5.6, 7.8, 5.21,
5.73, 6.1, 5.86, 6.37, 6.27, 5.22, 5.59, 5.47, 5.95, 6.96,
NA, 5.47, 4.99, NA, 6.25, 5.79, 5.79, NA, 6.54), i_hucpeptide = c(NA,
NA, 466.64, 838.61, 847.89, 1481.03, 819.65, NA, 1298.6,
NA, 564.59, 544.2, 755.73, 1057.83, 957.43, NA, 957.33, 1002.34,
1104, NA, NA, NA, NA, 594.6, NA, 815.82, 922.08, 628.54,
NA, 1591.01), i_hughrelin = c(NA, NA, 410.97, 553.65, 453,
352.44, 527.01, NA, 328.27, NA, 1668.41, 460.06, 1072.27,
260.24, 749.03, NA, 1327.91, 363.79, 524.53, NA, NA, NA,
NA, 1051.1, NA, 143.32, 1076.49, 1565.85, NA, 607.31), i_hugip = c(NA,
NA, 2.67, 2.67, 2.67, 2.67, 2.67, NA, 2.67, NA, 2.67, 2.67,
690.74, 1165.16, 2.67, NA, 2.67, 2.67, 2.67, NA, NA, NA,
NA, 2.67, NA, 2.67, 2.67, 2.67, NA, 2.67), i_huglp1 = c(NA,
NA, 127.66, 284.34, 200.13, 59.3, 234.84, NA, 503.42, NA,
103.9, 14.14, 71.6, 56.41, 75.13, NA, 161.36, 124.19, 220.52,
NA, NA, NA, NA, 14.14, NA, 112.57, 100.52, 237.55, NA, 470.91
), i_huglucagon = c(NA, NA, 333.79, 649.94, 726.99, 395.38,
610.5, NA, 434.42, NA, 502.4, 127.62, 268.23, 10.48, 428.15,
NA, 716.02, 238.95, 320.32, NA, NA, NA, NA, 10.48, NA, 238,
487.42, 297.6, NA, 495.16), i_huinsulin = c(NA, NA, 129.24,
270.98, 299.75, 730.82, 267.54, NA, 616.91, NA, 121.26, 85.34,
224.96, 247.48, 220.75, NA, 181.85, 341.25, 551.46, NA, NA,
NA, NA, 133.42, NA, 263.87, 279.45, 94.78, NA, 573.14), i_huleptin = c(NA,
NA, 3992.49, 17806.43, 8409.76, 11511.43, 2965.92, NA, 3223.08,
NA, 9018.79, 1039.45, 2613.33, 2128.98, 7307.89, NA, 13492.13,
2883.77, 4775.98, NA, NA, NA, NA, 2602.96, NA, 2829.59, 8511.92,
3528.77, NA, 11487.15), i_hupai1 = c(NA, NA, 997.29, 2499.25,
3085.25, 1909.44, 1730.55, NA, 3333.37, NA, 1424.3, 1857.71,
2578.46, 2268.52, 2222.97, NA, 2722.92, 1300.69, 2732.11,
NA, NA, NA, NA, 1204.36, NA, 2483.08, 2289.67, 1791.79, NA,
6595.54), i_huresistin = c(NA, NA, 3044.48, 5774.77, 3221.72,
4925.57, 5170.95, NA, 3683.64, NA, 4041.32, 6771.31, 5119.11,
9521.7, 3328.41, NA, 5061.65, 3773.39, 3039.39, NA, NA, NA,
NA, 4405.17, NA, 2577.84, 3433.82, 6802.94, NA, 6461.67),
i_huvisfatin = c(NA, NA, 302.3, 2083.46, 2989.72, 1118.7,
8.64, NA, 96.03, NA, 2209.51, 8.64, 1944.37, 1415.55, 678.33,
NA, 4349.56, 8.64, 410.1, NA, NA, NA, NA, 117, NA, 8.64,
2308.8, 228.53, NA, 1766.64), col_rema = c(27, 17, 15, 56,
44, 33, 22, 41, 30, 16, 18, 21, 18, 14, 52, 47, 18, 13, NA,
31, 25, 12, 19, 18, 22, 28, 11, 16, NA, NA), homa = c(NA,
NA, 5.053, 11.374, 14.679, 33.985, 12.001, NA, 23.61, NA,
4.242, 3.793, 11.294, 11.406, 10.265, NA, 7.484, 15.167,
22.694, NA, NA, NA, NA, 5.326, NA, 13.574, 12.535, 3.978,
NA, 42.691), i_pcr = c(NA, NA, 0.41, 0.82, NA, 2.08, 0.08,
NA, 0.1, NA, 0.38, 0.05, 0.04, 0.35, 0.2, NA, 0.98, 0.02,
NA, NA, NA, NA, NA, 0.2, NA, 0.1, 0.16, 0.16, NA, 2.93)), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
Afterwards I am defining my iteration and my variables database
ex<- subset(data[, 6:30])
for (i in 1:length(ex)) {
var_1 <- ex[,i]
var_1 <- unlist(var_1)
lme_1 <- lme(var_1 ~ sexo_s1*peso1 + edad_s1 + p17_total + poly(time, 2)*grupo_int_v00,
random = ~ poly(time, 2)|paciente, control=lmeControl(opt="optim"),
data = dat_longer, subset = !is.na(var_1))
Error in model.frame.default(formula = ~time + var_1 + sexo_s1 + peso1 + :
invalid type (list) for variable 'var_1'
I have tried unlisting/as.data.frame in before running the loop
for (i in 1:length(data)) {
var_1 <- data[,i]
var_1 <- unlist(var_1) #or as.data.frame(var_1)
lme_1 <- lme(var_1 ~ sexo_s1*peso1 + edad_s1 + p17_total + poly(time, 2)*grupo_int_v00,
random = ~ poly(time, 2)|paciente, control=lmeControl(opt="optim"),
data = dat_longer, subset = !is.na(var_1))
}
Error in model.frame.default(formula = ~time + var_1 + sexo_s1 + peso1 + :
variable lengths differ (found for 'var_1')
I have also tried to develop a new function to iterate over
lme_z <- function(z){
out <- lme(z ~ sexo_s1*peso1 + edad_s1 + p17_total + poly(time, 2)*grupo_int_v00,
random = ~ poly(time, 2)|paciente, control=lmeControl(opt="optim"),
data = dat_longer, subset = !is.na(z))
}
Error
If there is some contribution to iterate in the response variable (I know Ben Bolker is an expert)
Thanks in advance
If data is a data frame containing all of the variables that you use in your formula, including all of the responses that you want to consider, then you can do:
f <- function(resp) {
fixed <- . ~ sexo_s1 * peso1 + edad_s1 + p17_total + poly(time, 2) * grupo_int_v00
fixed[[2L]] <- as.name(resp)
lme(fixed = fixed,
random = ~poly(time, 2) | paciente,
data = data,
subset = !is.na(data[[resp]]),
control = lmeControl(opt = "optim"))
}
list_of_lme_objects <- lapply(names_of_response_variables, f)
An important piece is:
fixed <- . ~ sexo_s1 * peso1 + edad_s1 + p17_total + poly(time, 2) * grupo_int_v00
fixed[[2L]] <- as.name(resp)
The second statement injects the response named resp into the left hand side of the formula template. A more transparent example:
fixed <- . ~ world
fixed[[2L]] <- as.name("hello")
fixed
## hello ~ world
Another important piece is:
subset = !is.na(data[[resp]])
Here, the right hand side actually evaluates to a logical vector of length equal to the number of rows of data. You might consider passing na.action = na.omit instead of subset, though that will also omit rows where the independent variables have missing values, so the semantics are slightly different.
The variable grupo_int_v00 is missing from your data frame. You'll have to fix that on your end in order to test the code...
I was going to suggest:
formvars <- c("sexo_s1*peso1",
"edad_s1",
"p17_total",
"poly(time, 2)")
## excluded *grupo_int_v00 since not in example data frame
respvars <- names(df)[7:30]
result <- list()
for (r in respvars) {
result[[r]] <- lme(reformulate(formvars, response = r),
random = ~ poly(time, 2)|paciente,
control=lmeControl(opt="optim"),
data = df,
na.action = na.exclude)
}
Many of #MikaelJagan's points are well taken. In particular:
grupo_int_v00 excluded since it wasn't in your example data set
this code doesn't work for your example since there are only two complete cases (i.e., observations with no missing predictors/responses) in the data set, so we can't fit a quadratic polynomial ("degree must be less than the number of unique points")
I used na.exclude, which obviates your subset argument; it excludes NA values when fitting but will re-introduce them e.g. in calculating predictions or residuals

Subset specific rows in a dataframe, but keeping the observations

I have a dataframe which look like this
y = data.frame(subdel = c(1, 2, 3, 1, 57, 14, 1, 2, 57, 57, 57, 3, 1, 1,
31, 21, 34, 56, 12, 45, 1, 63, 31, 34), muni = c("A01", "A83", "A40", NA, NA, NA, NA, NA, NA, NA, NA, "A45", "B26", "B42","B61", "B70", "B90", "C53", "C89","A45", "B26", "B42","B61", "B70"))
I'm expecting the next result:
z = data.frame(subdel = c(1, 2, 3, 57, 57, 57, 57, 3, 1, 1, 31, 21, 34, 56, 12, 45, 1, 63, 31, 34), muni = c("A01", "A83", "A40", NA, NA, NA, NA, "A45", "B26", "B42","B61", "B70", "B90", "C53", "C89", "A45", "B26", "B42","B61", "B70"))
I want to match subdel == 57 with muni == NA, but, as you can see, conservating all the another observations in the dataframe.
Any help would be appreciated.
We can use subset with a logical condition i.e. check for NA in 'muni' (is.na(muni)) and (&) where the 'subdel' is 57 (subdel == 57) or all other non-NA elements from 'muni' (!is.na(muni))
subset(y, is.na(muni) & subdel == 57 | !is.na(muni))

Simple restrictions/constraint for multiple imputation (MICE) in R

I want to perform multiple imputation for a set of variables using the MICE package in R.
# Example data
data <- data.frame(
gcs = c(3, 10, NA, NA, NA, 15, 14, 15, 15, 14, 15, NA, 13, 15, 15),
hf = c(50, 66, 78, 99, NA, NA, 56, 55, NA, 76, 98, 105, NA, NA, 65),
...
)
The minimum for gcs is 3 and the maximum is 15, and it may not be a fractional number, how can I set these constraints in MICE? Same goes for hf, but this one only has a bottom limit of 0.

Resources