Calculate internal consistency of items by grouping variables using dplyr/tidyverse - r

I’d like to calculate the internal consistency (alpha and omega) of items by grouping variables (e.g., age and raterType). Ideally I’d be able to do this using dplyr/tidyverse. My question is similar to another question (Using dplyr to nest or group two variables, then perform the Cronbach's alpha function or other statistics to the data), however I can’t get the solution to work in my case.
Here is a minimal example:
library("tidyverse")
library("psych")
library("MBESS")
mydata <- expand.grid(ID = 1:100,
age = 1:5,
raterType = c("self",
"friend",
"parent"))
set.seed(12345)
mydata$item1 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item2 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item3 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item4 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item5 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item6 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item1[sample(nrow(mydata), 100)] <- NA
mydata$item2[sample(nrow(mydata), 100)] <- NA
mydata$item3[sample(nrow(mydata), 100)] <- NA
mydata$item4[sample(nrow(mydata), 100)] <- NA
mydata$item5[sample(nrow(mydata), 100)] <- NA
mydata$item6[sample(nrow(mydata), 100)] <- NA
itemNames <- paste("item", 1:6, sep = "")
To calculate internal consistency for the entire dataset, I would calculate alpha and omega, respectively, by the following code:
alpha(mydata[,itemNames])$total$raw_alpha
ci.reliability(mydata[,itemNames], type = "omega", interval.type = "none")$est
However, I want to calculate alpha and omega for each combination of age and raterType.
Here's my attempt:
mydata %>%
pivot_longer(cols = c(-age, -raterType, -ID)) %>%
select(-ID) %>%
nest_by(age, raterType) %>%
mutate(alpha = alpha(data)$total$raw_alpha,
omega = ci.reliability(data, type = "omega", interval.type = "none")$est)
This throws an error. For some reason, the code provides incorrect estimates for omega and throws an error for alpha:
> # This provides the wrong estimates:
> mydata %>%
+ pivot_longer(cols = c(-age, -raterType, -ID)) %>%
+ select(-ID) %>%
+ nest_by(age, raterType) %>%
+ mutate(omega = ci.reliability(data, type = "omega", interval.type = "none")$est)
# A tibble: 15 × 4
# Rowwise: age, raterType
age raterType data omega
<int> <fct> <list<tibble[,2]>> <dbl>
1 1 self [600 × 2] 0.218
2 1 friend [600 × 2] 0.257
3 1 parent [600 × 2] 0.261
4 2 self [600 × 2] 0.196
5 2 friend [600 × 2] 0.257
6 2 parent [600 × 2] 0.209
7 3 self [600 × 2] 0.179
8 3 friend [600 × 2] 0.225
9 3 parent [600 × 2] 0.247
10 4 self [600 × 2] 0.224
11 4 friend [600 × 2] 0.252
12 4 parent [600 × 2] 0.218
13 5 self [600 × 2] 0.248
14 5 friend [600 × 2] 0.218
15 5 parent [600 × 2] 0.202
>
> # This throws an error:
> mydata %>%
+ pivot_longer(cols = c(-age, -raterType, -ID)) %>%
+ select(-ID) %>%
+ nest_by(age, raterType) %>%
+ mutate(alpha = alpha(data)$total$raw_alpha)
Number of categories should be increased in order to count frequencies.
Error in `mutate()`:
! Problem while computing `alpha = alpha(data)$total$raw_alpha`.
ℹ The error occurred in row 1.
Caused by error in `FUN()`:
! only defined on a data frame with all numeric-alike variables
Run `rlang::last_error()` to see where the error occurred.
Warning messages:
1: Problem while computing `alpha = alpha(data)$total$raw_alpha`.
ℹ NAs introduced by coercion
ℹ The warning occurred in row 1.
2: Problem while computing `alpha = alpha(data)$total$raw_alpha`.
ℹ Item = name had no variance and was deleted but still is counted in the score
ℹ The warning occurred in row 1.
The omega values above do not correspond to the values obtained from running the ci.reliability() function on the respective subset of the data:
> alpha(mydata[which(mydata$age == 3 & mydata$raterType == "self"), itemNames])$total$raw_alpha
[1] -0.3018416
> ci.reliability(mydata[which(mydata$age == 3 & mydata$raterType == "self"), itemNames], type = "omega", interval.type = "none")$est
[1] 0.00836356

Perhaps this helps
out1 <- mydata %>%
group_by(age, raterType) %>%
summarise(alpha = alpha(across(all_of(itemNames)))$total$raw_alpha,
omega = ci.reliability(across(all_of(itemNames)),
type = "omega", interval.type = "none")$est, .groups = 'drop')
-output
> out1
# A tibble: 15 × 4
age raterType alpha omega
<int> <fct> <dbl> <dbl>
1 1 self -0.135 2.76
2 1 friend 0.138 0.231
3 1 parent -0.229 255.
4 2 self -0.421 NA
5 2 friend 0.0650 58.7
6 2 parent 0.153 NA
7 3 self -0.302 0.00836
8 3 friend 0.147 0.334
9 3 parent 0.196 0.132
10 4 self -0.0699 NA
11 4 friend 0.118 0.214
12 4 parent -0.0303 31.1
13 5 self -0.0166 0.246
14 5 friend -0.192 0.0151
15 5 parent 0.0847 NA
Or may be this
out2 <- mydata %>%
nest_by(age, raterType) %>%
mutate(alpha = alpha(data[, itemNames])$total$raw_alpha,
omega = ci.reliability(data[, itemNames], type = "omega",
interval.type = "none")$est)
-output
out2
# A tibble: 15 × 5
# Rowwise: age, raterType
age raterType data alpha omega
<int> <fct> <list<tibble[,7]>> <dbl> <dbl>
1 1 self [100 × 7] -0.135 2.76
2 1 friend [100 × 7] 0.138 0.231
3 1 parent [100 × 7] -0.229 255.
4 2 self [100 × 7] -0.421 NA
5 2 friend [100 × 7] 0.0650 58.7
6 2 parent [100 × 7] 0.153 NA
7 3 self [100 × 7] -0.302 0.00836
8 3 friend [100 × 7] 0.147 0.334
9 3 parent [100 × 7] 0.196 0.132
10 4 self [100 × 7] -0.0699 NA
11 4 friend [100 × 7] 0.118 0.214
12 4 parent [100 × 7] -0.0303 31.1
13 5 self [100 × 7] -0.0166 0.246
14 5 friend [100 × 7] -0.192 0.0151
15 5 parent [100 × 7] 0.0847 NA

Related

Accessing results from a list column when some elements are NA

Problem: List column contains a few missing values
Data
Consider the following tibble that contains the results of 2 model fits:
> Model_fits
# A tibble: 4 x 4
cyl data model1 model2
<dbl> <list<tibble[,2]>> <list> <list>
1 2 [5 x 2] <dbl [1]> <dbl [1]>
2 4 [11 x 2] <lm> <lm>
3 6 [7 x 2] <lm> <dbl [1]>
4 8 [14 x 2] <lm> <lm>
The data for cyl==2 was missing in this example. Therefore, model1 contains NA_real_ in the first row. Similarly, model2 contains NA_real_ in rows 1 and 3.
Extracting model results
I want to extract the results of model fit using broom::glance. But it does not work due to the missing values:
> Model_fits %>%
+ mutate(summary_res = map(model1, broom::glance))
Error: Problem with `mutate()` input `summary_res`.
x No glance method for objects of class numeric
i Input `summary_res` is `map(model1, broom::glance)`.
Attempt at solution
So, I try to use purrr::possibly, but that does not work either:
> Model_fits %>%
+ mutate(summary_res1 = map(model1, ~ possibly(broom::glance(.x),
+ otherwise = NA_real_)))
Error: Problem with `mutate()` input `summary_res1`.
x No glance method for objects of class numeric
i Input `summary_res1` is `map(model1, ~possibly(broom::glance(.x), otherwise = NA_real_))`.
Expected outcome
I want to get the broom::glance results for all non-missing values and NA_real_ for all missing values. Please guide me how can I get these results?
Code for creating Model_fits
Please note that I created the following as a reproducible example. But this is not my original data/model results.
library(tidyverse)
new_data <- tibble(mpg = rep(NA_real_, 5),
cyl = rep(2, 5),
disp = rep(NA_real_, 5))
mtcars2 <- mtcars %>%
dplyr::select(mpg, cyl, disp)
mt <- bind_rows(mtcars2,
new_data)
model_res_list <- map(mtcars2 %>% group_split(cyl), ~lm(mpg ~ disp, data = .x))
lizt <- list(NA_real_, model_res_list[[1]], model_res_list[[2]], model_res_list[[3]])
lizt2 <- list(NA_real_, model_res_list[[1]], NA_real_, model_res_list[[3]])
Model_fits <- mt %>%
group_nest(cyl) %>%
mutate(model1 = lizt,
model2 = lizt2)
One more thing you could do about this is using tryCatch function, So that you define in case of an error occurring what would be the output of your function. In this case it will not bring the execution of the function to a halt.
Model_fits %>%
mutate(mod01 = map(model1, ~ tryCatch(glance(.x),
error = function(cond) {
NA_real_
}))) %>%
unnest(mod01)
# A tibble: 4 x 17
cyl data model1 model2 mod01 r.squared adj.r.squared sigma statistic p.value df
<dbl> <list<tibbl> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 [5 x 2] <dbl [~ <dbl ~ NA NA NA NA NA NA NA
2 4 [11 x 2] <lm> <lm> NA 0.648 0.609 2.82 16.6 0.00278 1
3 6 [7 x 2] <lm> <dbl ~ NA 0.0106 -0.187 1.58 0.0537 0.826 1
4 8 [14 x 2] <lm> <lm> NA 0.270 0.209 2.28 4.44 0.0568 1
# ... with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
If we want to use possibly or safely instead of tryCatch we should first write a custom function that wraps glance in general and before applying on our data set:
poss_glance <- possibly(glance, otherwise = NA_real_)
Model_fits %>%
mutate(mod01 = map(model1, ~ poss_glance(.x))) %>%
unnest(mod01)
# A tibble: 4 x 17
cyl data model1 model2 mod01 r.squared adj.r.squared sigma statistic p.value df
<dbl> <list<tibbl> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 [5 x 2] <dbl [~ <dbl ~ NA NA NA NA NA NA NA
2 4 [11 x 2] <lm> <lm> NA 0.648 0.609 2.82 16.6 0.00278 1
3 6 [7 x 2] <lm> <dbl ~ NA 0.0106 -0.187 1.58 0.0537 0.826 1
4 8 [14 x 2] <lm> <lm> NA 0.270 0.209 2.28 4.44 0.0568 1
# ... with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
Or even we could use safely in place of possibly so that our function returns an enhanced output in this case NA_real_:
safe_glance <- safely(glance, otherwise = NA_real_)
Model_fits %>%
mutate(mod01 = map(model1, ~ safe_glance(.x)))
You can just check if the value is NA in a wrapper function that you pass to map.
Model_fits %>%
mutate(summary_res = map(model1, function(x) if (length(x) == 1 && is.na(x)) NA_real_ else broom::glance(x)))

When using purrr::map() to fit a model onto a new list-column, return indicative strings if fitting errors

I have data that I want to model per group using purrr::map(). Sometimes, fitting the model fails for one (or more) of the subgroups. The error could be, for example, contrasts can be applied only to factors with 2 or more levels, or a warning such as glm.fit: algorithm did not converge, or otherwise.
Because an error fails the entire code, I'd like to create a condition: if there's a fitting error for a sub-group, return "string-of-choice" for that sub-group; but for sub-groups that do result with a model, for them return the model object. And even if there's a warning about convergence, I prefer having "string-of-choice-2" than a non-converging model.
While my question is general, I'm providing some example of toy data for demonstration.
Example
Here's a function for generating data. In this data, 3 columns correspond to 3 questions that people answered about:
whether they love coconut (yes or no, coded as 0/1.)
whether they love eggplant
whether they love tomatoes
In addition, we have an id column and gender column.
In the following two versions of this data, the column about loving tomatoes could be either all NA or all 0.
generate_data <- function(x) {
data.frame(id = 1:2000,
do_u_love_coconut = sample(c(0, 1, NA), 2000, replace = TRUE, prob = c(0.2, 0.4, 0.4)),
do_u_love_eggplant = sample(c(0, 1, NA), 2000, replace = TRUE, prob = c(0.1, 0.5, 0.4)),
do_u_love_tomatoes = rep(x, 2000),
gender = sample(c("male", "female"), 2000, replace = TRUE))
}
## generate the data
set.seed(2021)
df_tomatoes_is_NA <- generate_data(NA)
df_tomatoes_is_zero <- generate_data(0)
## preview the data
library(tibble)
as_tibble(df_tomatoes_is_NA)
## # A tibble: 2,000 x 5
## id do_u_love_coconut do_u_love_eggplant do_u_love_tomatoes gender
## <int> <dbl> <dbl> <lgl> <chr>
## 1 1 NA NA NA male
## 2 2 NA NA NA male
## 3 3 NA NA NA male
## 4 4 1 1 NA female
## 5 5 NA 1 NA female
## 6 6 NA NA NA male
## 7 7 NA NA NA female
## 8 8 1 1 NA male
## 9 9 0 1 NA female
## 10 10 0 1 NA female
## # ... with 1,990 more rows
as_tibble(df_tomatoes_is_zero)
## # A tibble: 2,000 x 5
## id do_u_love_coconut do_u_love_eggplant do_u_love_tomatoes gender
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1 NA 0 0 male
## 2 2 NA NA 0 male
## 3 3 1 NA 0 female
## 4 4 0 1 0 female
## 5 5 1 0 0 male
## 6 6 NA 0 0 female
## 7 7 1 1 0 male
## 8 8 1 NA 0 male
## 9 9 1 NA 0 male
## 10 10 0 1 0 female
## # ... with 1,990 more rows
Fitting the model
So now I want to fit a model of each coconut/eggplant/tomatoes by gender.
version 1 of the data
library(tidyr)
library(purrr)
library(dplyr)
df_tomatoes_is_NA %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
Error: Problem with mutate() input fit_and_predict.
x contrasts can be applied only to factors with 2 or more levels
i Input fit_and_predict is map(data, ~glm(formula = value ~ gender, data = .x, family = "binomial")).
i The error occurred in group 3: name = "do_u_love_tomatoes".
version 2 of the data
df_tomatoes_is_zero %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
# A tibble: 3 x 3
# Groups: name [3]
name data fit_and_predict
<chr> <list> <list>
1 do_u_love_coconut <tibble [2,000 x 3]> <glm>
2 do_u_love_eggplant <tibble [2,000 x 3]> <glm>
3 do_u_love_tomatoes <tibble [2,000 x 3]> <glm>
Warning message:
Problem with mutate() input fit_and_predict.
i glm.fit: algorithm did not converge
i Input fit_and_predict is map(data, ~glm(formula = value ~ gender, data = .x, family = "binomial")).
i The error occurred in group 3: name = "do_u_love_tomatoes".
My question
I want to account for potential fitting errors and decide what should be the return value in such failures. For example, for an error such as contrasts can be applied only to factors with 2 or more levels I want to have "contrasts_error" as the return value. An expected output for example:
df_tomatoes_is_NA %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
## # A tibble: 3 x 3
## # Groups: name [3]
## name data fit_and_predict
## <chr> <list> <list>
## 1 do_u_love_coconut <tibble [2,000 x 3]> <glm>
## 2 do_u_love_eggplant <tibble [2,000 x 3]> <glm>
## 3 do_u_love_tomatoes <tibble [2,000 x 3]> <chr[1]> <-- "contrasts_error"
And if there's a convergence warning, for example glm.fit: algorithm did not converge, I'd like to have:
df_tomatoes_is_zero %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,
~ glm(formula = value ~ gender, data = .x, family = "binomial")))
## # A tibble: 3 x 3
## # Groups: name [3]
## name data fit_and_predict
## <chr> <list> <list>
## 1 do_u_love_coconut <tibble [2,000 x 3]> <glm>
## 2 do_u_love_eggplant <tibble [2,000 x 3]> <glm>
## 3 do_u_love_tomatoes <tibble [2,000 x 3]> <chr[1]> <-- "convergence_warning"
EDIT
Just to clarify, since there are many potential fitting errors and warnings, a solution will always require me to specify in the code each potential error/warning and its respective string. I gave two examples above (contrasts error and convergence warning).
You can handle this with a tryCatch and capture all warnings and errors and return corresponding output for them.
apply_glm <- function(data, formula) {
tryCatch(glm(formula = formula, data = data, family = "binomial"), error = function(e) {
if(e$message == "contrasts can be applied only to factors with 2 or more levels")
return('contrasts error')
}, warning = function(w) {
if(w$message == "glm.fit: algorithm did not converge")
return('convergence warning')
})
}
You can extend handling the error and warnings messages with if/else if or case_when statements.
Apply the function for dataset df_tomatoes_is_NA :
library(tidyverse)
df_tomatoes_is_NA %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data, apply_glm, value ~ gender)) -> result1
result1
# name data fit_and_predict
# <chr> <list> <list>
#1 do_u_love_coconut <tibble [2,000 × 3]> <glm>
#2 do_u_love_eggplant <tibble [2,000 × 3]> <glm>
#3 do_u_love_tomatoes <tibble [2,000 × 3]> <chr [1]>
result1$fit_and_predict
[[1]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
0.68837 -0.08838
Degrees of Freedom: 1214 Total (i.e. Null); 1213 Residual
(785 observations deleted due to missingness)
Null Deviance: 1564
Residual Deviance: 1564 AIC: 1568
[[2]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
1.61343 -0.01754
Degrees of Freedom: 1218 Total (i.e. Null); 1217 Residual
(781 observations deleted due to missingness)
Null Deviance: 1101
Residual Deviance: 1101 AIC: 1105
[[3]]
[1] "contrasts error"
For dataset df_tomatoes_is_zero :
df_tomatoes_is_zero %>%
pivot_longer(starts_with("do_u")) %>%
group_by(name) %>%
nest() %>%
mutate(fit_and_predict = map(data,apply_glm)) -> result2
result2
# name data fit_and_predict
# <chr> <list> <list>
#1 do_u_love_coconut <tibble [2,000 × 3]> <glm>
#2 do_u_love_eggplant <tibble [2,000 × 3]> <glm>
#3 do_u_love_tomatoes <tibble [2,000 × 3]> <chr [1]>
result2$fit_and_predict
[[1]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
0.49372 0.07442
Degrees of Freedom: 1190 Total (i.e. Null); 1189 Residual
(809 observations deleted due to missingness)
Null Deviance: 1570
Residual Deviance: 1570 AIC: 1574
[[2]]
Call: glm(formula = value ~ gender, family = "binomial", data = data)
Coefficients:
(Intercept) gendermale
1.60539 -0.03636
Degrees of Freedom: 1177 Total (i.e. Null); 1176 Residual
(822 observations deleted due to missingness)
Null Deviance: 1073
Residual Deviance: 1073 AIC: 1077
[[3]]
[1] "convergence warning"

Rolling window with slide_dbl() on grouped data

This is an extension to following question: Rolling window slider::slide() with grouped data
I want to mutate a column of my grouped tibble with slide_dbl(), i.e. applying slide_dbl() on all groups, but only within them, not across them.
When running the solution of linked question I receive following error message:
Error: Problem with `mutate()` input `rollreg`.
x Inapplicable method for 'mutate_' applied to object of class "c('double', 'numeric')".
My tibble has following structure:
tibble [450,343 x 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ company: num [1:450343] 1 1 1 1 1 ...
$ date: Date[1:450343], format: "2011-11-30" "2011-12-31" "2012-01-31" "2012-02-29" ...
$ result: num [1:450343] NA NA NA 12.5981 -2.9023 ...
- attr(*, "groups")= tibble [3,339 x 2] (S3: tbl_df/tbl/data.frame)
..$ company: num [1:3339] 1 2 3 4 5 ...
..$ .rows : list<int> [1:3339]
To complete, this is the code I ran according to the linked solution:
testtest <- data %>%
group_by(company) %>% nest() %>%
mutate(rollreg = map(data, ~ .x %>% mutate(result_2 = slide_dbl(.x = .$result, .f = ~prod(1+.)-1, .before = 11, .after = -1, complete=TRUE)))) %>%
select(-data) %>% unnest(rollreg)
Here, above mentioned error message occurs. I guess it's because of the data structure. Yet, I can't figure any solution (also not with similar functions like group_map() or group_modify()). Can anyone help? Thanks in advance!
An option is group_split by the grouping column (in the example, using 'case', loop over the list of datasets with map, create new column in mutate by applying the slide_dbl
library(dplyr)
library(tidyr)
library(purrr)
data %>%
group_split(case) %>%
map_dfr(~ .x %>%
mutate(out = slide_dbl(r1, .f = ~ prod(1 + .x) - 1,
.before = 5, .after = -1, complete = TRUE)))
-output
# A tibble: 30 x 6
# t case r1 r2 r3 out
# <int> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 a -0.294 -0.164 1.33 0
# 2 2 a 0.761 1.01 0.115 -0.294
# 3 3 a -0.781 -0.499 0.290 0.243
# 4 4 a -0.0732 -0.110 0.289 -0.728
# 5 5 a -0.528 0.707 0.181 -0.748
# 6 6 a -1.35 -0.411 -1.47 -0.881
# 7 7 a -0.397 -1.28 0.172 -1.06
# 8 8 a 1.68 0.956 -2.81 -1.02
# 9 9 a -0.0167 -0.0727 -1.08 -1.24
#10 10 a 1.25 -0.326 1.61 -1.26
## … with 20 more rows
Or if we need to use the nest_by, it creates an attribute rowwise, so, it is better to ungroup before applying
out1 <- data %>%
select(-t) %>%
nest_by(case) %>%
ungroup %>%
mutate(data = map(data, ~ .x %>%
mutate(out = slide_dbl(r1, .f = ~ prod(1 + .x) - 1,
.before = 5, .after = -1, complete = TRUE))))
-output
out1
# A tibble: 3 x 2
# case data
# <chr> <list>
#1 a <tibble [10 × 4]>
#2 b <tibble [10 × 4]>
#3 c <tibble [10 × 4]>
Now, we unnest the structure
out1 %>%
unnest(data)
# A tibble: 30 x 5
# case r1 r2 r3 out
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a -0.294 -0.164 1.33 0
# 2 a 0.761 1.01 0.115 -0.294
# 3 a -0.781 -0.499 0.290 0.243
# 4 a -0.0732 -0.110 0.289 -0.728
# 5 a -0.528 0.707 0.181 -0.748
# 6 a -1.35 -0.411 -1.47 -0.881
# 7 a -0.397 -1.28 0.172 -1.06
# 8 a 1.68 0.956 -2.81 -1.02
# 9 a -0.0167 -0.0727 -1.08 -1.24
#10 a 1.25 -0.326 1.61 -1.26
# … with 20 more rows
data
data <- tibble(t = rep(1:10, 3),
case = c(rep("a", 10), rep("b", 10), rep("c", 10)),
r1 = rnorm(30),
r2 = rnorm(30),
r3 = rnorm(30))
i also got a question regarding the slide_Dbl function. I would like to check out other rollingregressions. My data is already fixed with an 8 weak week, but if i would like to look at for example 16 or 24 weeks, should i change the (before= ) from 8 to 16? The reason why i am asking is that i dont have the original dataset, but its already fixed with 8 weeks, so if i add the (before= ) with an additional 8 will it be 16?
new8 <- new%>%mutate( across(
where(is.double),
function(x) slide_dbl(x, mean, na.rm = TRUE, .before = 8L) %>% lag()))
Or should i put
new16 <- new%>%mutate(across(
where(is.double),
function(x) slide_dbl(x, mean, na.rm = TRUE, .before = 16L) %>% lag()))

Problem with running paired t-test within nested dplyr dataset

I have gone through the vignette for row-wise operations for the new dplyr v1.0.0 and am intrigued by the possibilities of the nest_by function for modelling within different silos of a dataset.
However I am having difficulty getting a repeated-measures analysis to work.
Here's an example to illustrate when it does work
df1 <- data.frame(group = factor(rep(LETTERS[1:3],10)),
pred = factor(rep(letters[1:2],each=5,length.out=30)),
out = rnorm(30))
Now create the nesting based on the group variable.
library(dplyr)
nest1 <- df1 %>% nest_by(group)
nest
We can view this new special nested data frame
# A tibble: 3 x 2
# Rowwise: group
# group data
# <fct> <list<tbl_df[,2]>>
# a [10 x 2]
# b [10 x 2]
# c [10 x 2]
Now we can perform operations on it, like a linear regression, regressing out on pred within each level of the original group variable.
mods <- nest1 %>% mutate(mod = list(lm(out ~ pred, data = data)))
In this new object we have added a new column to the original nested dataset containing the lm() object
mods
# # A tibble: 3 x 3
# # Rowwise: group
# group data mod
# <fct> <list<tbl_df[,2]>> <list>
# 1 A [10 x 2] <lm>
# 2 B [10 x 2] <lm>
# 3 C [10 x 2] <lm>
And we can view the results of these models
library(broom)
mods %>% summarise(broom::tidy(mod))
# A tibble: 6 x 6
# Groups: group [3]
# group term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 A (Intercept) 0.0684 0.295 0.232 0.823
# 2 A predb -0.231 0.418 -0.553 0.595
# 3 B (Intercept) -0.159 0.447 -0.356 0.731
# 4 B predb 0.332 0.633 0.524 0.615
# 5 C (Intercept) -0.385 0.245 -1.57 0.154
# 6 C predb 0.891 0.346 2.58 0.0329
Now I would like to be able to do the same thing but with a repeated measures t-test.
# dataset with grouping factor and two columns, each representing a measure at one of two timepoints
df2 <- data.frame(group = factor(rep(letters[1:3],10)),
t1 = rnorm(30),
t2 = rnorm(30))
# nest by grouping factor
nest2 <- df2 %>% nest_by(group)
nest2
# A tibble: 3 x 2
# Rowwise: group
# group data
# <fct> <list<tbl_df[,2]>>
# 1 a [10 x 2]
# 2 b [10 x 2]
# 3 c [10 x 2]
Now when I try to perform a paired t-test at each level of the new nested dataset, using a similar procedure to the linear model...
mods2 <- nest2 %>% mutate(t = list(t.test(t1, t2, data = data)))
...I get the following error message
Error: Problem with `mutate()` input `t`.
x object 't1' not found
i Input `t` is `list(t.test(t1, t2, data = data))`.
i The error occured in row 1.
Run `rlang::last_error()` to see where the error occurred.
Can anyone help me?
The data option is used with the formula method, while 's3' method with x, y as argument, we can wrap using with
library(dplyr)
library(purrr)
nest2 %>%
mutate(t = list(with(data, t.test(t1, t2))))
# A tibble: 3 x 3
# Rowwise: group
# group data t
# <fct> <list<tbl_df[,2]>> <list>
#1 a [10 × 2] <htest>
#2 b [10 × 2] <htest>
#3 c [10 × 2] <htest>
Or use extractors ($, [[)
nest2 %>%
mutate(t = list(t.test(data$t1, data$t2)))

How to run a linear regression by row and fill data only in elements with NA in R?

I have a data frame containing data on population densities in cities dependent on the distance to the city center ('spatial distance profiles').
The dataframe looks like this (example):
set.seed(1)
data <- data.frame(cities = c("city1","city2","city3"),
km1 = runif(3,6,7),
km2 = runif(3,5,6),
km3 = runif(3,4,5),
km4 = c(3.5,3.2,NA),
km5 = c(NA,NA,NA)
)
"cities" contains the cityname or an identifier while variables "km1-km4" contain the log of the population density in that distance. Note that observation 3 in the example does not have data for km4; all cities do not have data for km5.
What I am trying to achieve is to extrapolate how many kilometers the city spreads out, when the population density follows an exponential function.
To do so, I want to first run a linear regression y~x for every row of the table, where y is variables km1-kmX and x the corresponding distance to the city center (1,2,3, ...).
lm(km1-kmX ~ distance)
(Variable "distance" is not defined in the example, as I do not know how to incorparate it in the dataframe. But I hope the idea comes through)
So, for city1 and city2 variables km1-km4 should be used, while for city3 obviously only km1-km3.
The resulting coefficients beta_0 and beta_1 should then be stored as variables in the corresponding row.
Next, I want to use the coefficients to calulcate the logarithm of population density for the missing variables, kind of:
km4 = beta_0 * exp(beta_1*4) #for observation 3
km5 = beta_0 * exp(beta_1*5) #for all observations
I know that the description is a little vague; I wanted to be as precice as possible with amplifying all of the details of the idea. Appreciate any help.
Although probably not necessary, this is platform I use R on:
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Assuming that the objective is to fill in the NA values using a linear regression of the non-NA values on the same line vs. the digit in the heading, first extract the numeric values in the column names forming x. Then define a function which regresses the row values against x and predicts the NAs from that. Finally apply it to each row.
x <- as.numeric(gsub("\\D", "", names(data)[-1])) # c(1, 2, 3, 4, 5)
na.lm <- function(r, x) ifelse(is.na(r), predict(lm(r ~ x), list(x = x)), r)
cbind(data[1], t(apply(data[-1], 1, na.lm, x = x)))
giving:
cities km1 km2 km3 km4 km5
1 city1 6.265509 5.908208 4.944675 3.50000 2.839583
2 city2 6.372124 5.201682 4.660798 3.20000 2.344337
3 city3 6.572853 5.898390 4.629114 3.75638 2.784510
First, I think we need to reshape your data from "wide" to "tall" format. This will satisfy Ryan's comment that you cannot do linear regression with a single row -- he's technically right, but I think he's missing the point that you actually have 4-5 observations per row, not 1. (Comment since deleted.)
(Second: never name a variable data. If you forget to create it in a new R session, all functions that depend on it will fail in a curious and often unintuitive way instead of the expected simpler error message Error: object 'data' not found. I will use dat with your creation code.)
This is demonstrated using a few packages from the tidyverse:
library(dplyr)
library(tidyr)
library(purrr)
Reshaping: first, you enumerate as km1, km2, etc, but those are categorical variables, not numbers, and I'm inferring that you want the numbers stored in them. So what you have as a column name (km1) really should be data (km = 1). (Oh, and I remove the NA, since they don't help feeding the model. We'll bring them back later.)
datlong <- dat %>%
gather(km, dens, -cities) %>%
mutate(km = as.numeric(gsub("km", "", km))) %>%
rename(city = cities) %>%
filter(complete.cases(.))
datlong
# city km dens
# 1 city1 1 6.265509
# 2 city2 1 6.372124
# 3 city3 1 6.572853
# 4 city1 2 5.908208
# 5 city2 2 5.201682
# 6 city3 2 5.898390
# 7 city1 3 4.944675
# 8 city2 3 4.660798
# 9 city3 3 4.629114
# 10 city1 4 3.500000
# 11 city2 4 3.200000
Now the problem is how to do a regression on each city. First, let's just "tidy" things up a little by putting all of a city's data in one "cell" of the frame.
datnested <- datlong %>%
group_by(city) %>%
nest(.key = "citydat")
datnested
# # A tibble: 3 x 2
# city citydat
# <fct> <list>
# 1 city1 <tibble [4 x 2]>
# 2 city2 <tibble [4 x 2]>
# 3 city3 <tibble [3 x 2]>
Now we can run a regression on each dataset:
datmodel <- datnested %>%
mutate(model = map(citydat, ~ lm(dens ~ km, data = .x)))
datmodel
# # A tibble: 3 x 3
# city citydat model
# <fct> <list> <list>
# 1 city1 <tibble [4 x 2]> <S3: lm>
# 2 city2 <tibble [4 x 2]> <S3: lm>
# 3 city3 <tibble [3 x 2]> <S3: lm>
Notice the embedded models in the frame? Each looks something like this:
datmodel$model[[1]]
# Call:
# lm(formula = dens ~ km, data = .x)
# Coefficients:
# (Intercept) km
# 7.470 -0.926
Now that can be used elsewhere. Let's run a prediction:
predkm <- 1:5
datpred <- datmodel %>%
mutate(pred = map(model, ~ data_frame(km = predkm, preddens = predict(.x, newdata = data.frame(km=predkm)))))
datpred
# # A tibble: 3 x 4
# city citydat model pred
# <fct> <list> <list> <list>
# 1 city1 <tibble [4 x 2]> <S3: lm> <tibble [5 x 2]>
# 2 city2 <tibble [4 x 2]> <S3: lm> <tibble [5 x 2]>
# 3 city3 <tibble [3 x 2]> <S3: lm> <tibble [5 x 2]>
Similarly:
datpred$pred[[1]]
# # A tibble: 5 x 2
# km preddens
# <int> <dbl>
# 1 1 6.54
# 2 2 5.62
# 3 3 4.69
# 4 4 3.77
# 5 5 2.84
Okay, so how do we get a single resulting frame?
datpredonly <- datpred %>%
select(city, pred) %>%
unnest()
datpredonly
# # A tibble: 15 x 3
# city km preddens
# <fct> <int> <dbl>
# 1 city1 1 6.54
# 2 city1 2 5.62
# 3 city1 3 4.69
# 4 city1 4 3.77
# 5 city1 5 2.84
# 6 city2 1 6.37
# 7 city2 2 5.36
# 8 city2 3 4.36
# 9 city2 4 3.35
# 10 city2 5 2.34
# 11 city3 1 6.67
# 12 city3 2 5.70
# 13 city3 3 4.73
# 14 city3 4 3.76
# 15 city3 5 2.78
If you want to compare with the original (for errors, etc), try:
full_join(datlong, datpredonly, by = c("city", "km")) %>%
arrange(city, km)
# city km dens preddens
# 1 city1 1 6.265509 6.543607
# 2 city1 2 5.908208 5.617601
# 3 city1 3 4.944675 4.691595
# 4 city1 4 3.500000 3.765589
# 5 city1 5 NA 2.839583
# 6 city2 1 6.372124 6.367239
# 7 city2 2 5.201682 5.361514
# 8 city2 3 4.660798 4.355788
# 9 city2 4 3.200000 3.350063
# 10 city2 5 NA 2.344337
# 11 city3 1 6.572853 6.671989
# 12 city3 2 5.898390 5.700119
# 13 city3 3 4.629114 4.728249
# 14 city3 4 NA 3.756380
# 15 city3 5 NA 2.784510
So you discussed using an exponential regression: this is handled in the single call to lm earlier in the run. Feel free to change from dens ~ km to specific exponential formulas.
I had broken all of that into components. Here's the long chain.
predkm <- 1:5
datnestedmodels <- datlong %>%
group_by(city) %>%
nest(.key = "citydat") %>%
mutate(
model = map(citydat, ~ lm(dens ~ km, data = .x)),
pred = map(model, ~ data_frame(km = predkm,
preddens = predict(.x, newdata = data.frame(km=predkm))))
)
datnestedmodels %>%
select(city, pred) %>%
unnest()
If you prefer (or need) it in the "wide" format:
datnestedmodels %>%
select(city, pred) %>%
unnest() %>%
spread(km, preddens, sep = "")
# # A tibble: 3 x 6
# city km1 km2 km3 km4 km5
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 city1 6.54 5.62 4.69 3.77 2.84
# 2 city2 6.37 5.36 4.36 3.35 2.34
# 3 city3 6.67 5.70 4.73 3.76 2.78

Resources