Rolling window with slide_dbl() on grouped data - r

This is an extension of the following question: Rolling window slider::slide() with grouped data
I want to mutate a column of my grouped tibble with slide_dbl(), i.e. apply slide_dbl() to every group, but only within each group, not across groups.
When I run the solution from the linked question, I receive the following error message:
Error: Problem with `mutate()` input `rollreg`.
x Inapplicable method for 'mutate_' applied to object of class "c('double', 'numeric')".
My tibble has the following structure:
tibble [450,343 x 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ company: num [1:450343] 1 1 1 1 1 ...
$ date: Date[1:450343], format: "2011-11-30" "2011-12-31" "2012-01-31" "2012-02-29" ...
$ result: num [1:450343] NA NA NA 12.5981 -2.9023 ...
- attr(*, "groups")= tibble [3,339 x 2] (S3: tbl_df/tbl/data.frame)
..$ company: num [1:3339] 1 2 3 4 5 ...
..$ .rows : list<int> [1:3339]
For completeness, this is the code I ran, following the linked solution:
testtest <- data %>%
group_by(company) %>% nest() %>%
mutate(rollreg = map(data, ~ .x %>% mutate(result_2 = slide_dbl(.x = .$result, .f = ~prod(1+.)-1, .before = 11, .after = -1, complete=TRUE)))) %>%
select(-data) %>% unnest(rollreg)
This is where the error message above occurs. I guess it's because of the data structure, yet I can't figure out a solution (also not with similar functions like group_map() or group_modify()). Can anyone help? Thanks in advance!

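One option is to group_split() by the grouping column ('case' in the example below), loop over the resulting list of data frames with map_dfr(), and create the new column in mutate() by applying slide_dbl(). Note that the slider argument is spelled .complete; complete = TRUE, as written here, is passed through ... to the function and has no effect, which is why the first rows of each group are not NA in the output.
library(dplyr)
library(tidyr)
library(purrr)
library(slider)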
data %>%
group_split(case) %>%
map_dfr(~ .x %>%
mutate(out = slide_dbl(r1, .f = ~ prod(1 + .x) - 1,
.before = 5, .after = -1, complete = TRUE)))
-output
# A tibble: 30 x 6
# t case r1 r2 r3 out
# <int> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 a -0.294 -0.164 1.33 0
# 2 2 a 0.761 1.01 0.115 -0.294
# 3 3 a -0.781 -0.499 0.290 0.243
# 4 4 a -0.0732 -0.110 0.289 -0.728
# 5 5 a -0.528 0.707 0.181 -0.748
# 6 6 a -1.35 -0.411 -1.47 -0.881
# 7 7 a -0.397 -1.28 0.172 -1.06
# 8 8 a 1.68 0.956 -2.81 -1.02
# 9 9 a -0.0167 -0.0727 -1.08 -1.24
#10 10 a 1.25 -0.326 1.61 -1.26
## … with 20 more rows
Or, if we need to use nest_by(), note that it returns a rowwise data frame, so it is better to ungroup() before applying map():
out1 <- data %>%
select(-t) %>%
nest_by(case) %>%
ungroup %>%
mutate(data = map(data, ~ .x %>%
mutate(out = slide_dbl(r1, .f = ~ prod(1 + .x) - 1,
.before = 5, .after = -1, complete = TRUE))))
-output
out1
# A tibble: 3 x 2
# case data
# <chr> <list>
#1 a <tibble [10 × 4]>
#2 b <tibble [10 × 4]>
#3 c <tibble [10 × 4]>
Now, we unnest the structure
out1 %>%
unnest(data)
# A tibble: 30 x 5
# case r1 r2 r3 out
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a -0.294 -0.164 1.33 0
# 2 a 0.761 1.01 0.115 -0.294
# 3 a -0.781 -0.499 0.290 0.243
# 4 a -0.0732 -0.110 0.289 -0.728
# 5 a -0.528 0.707 0.181 -0.748
# 6 a -1.35 -0.411 -1.47 -0.881
# 7 a -0.397 -1.28 0.172 -1.06
# 8 a 1.68 0.956 -2.81 -1.02
# 9 a -0.0167 -0.0727 -1.08 -1.24
#10 a 1.25 -0.326 1.61 -1.26
# … with 20 more rows
data
data <- tibble(t = rep(1:10, 3),
case = c(rep("a", 10), rep("b", 10), rep("c", 10)),
r1 = rnorm(30),
r2 = rnorm(30),
r3 = rnorm(30))
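Since mutate() on a grouped tibble evaluates its expressions once per group, a plain group_by() followed by mutate() may already be enough here, without splitting or nesting. The sketch below assumes the same toy layout as above (only r1 is kept, and the seed is arbitrary) and uses .complete = TRUE so that windows with fewer than five previous rows give NA.

```r
library(dplyr)
library(slider)

set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
data <- tibble(t = rep(1:10, 3),
               case = rep(c("a", "b", "c"), each = 10),
               r1 = rnorm(30))

out <- data %>%
  group_by(case) %>%
  # the window covers the 5 rows before the current one, within each group
  mutate(out = slide_dbl(r1, ~ prod(1 + .x) - 1,
                         .before = 5, .after = -1, .complete = TRUE)) %>%
  ungroup()

out
```

The first five rows of every group are NA, and the sixth row of each group only uses that group's own first five values.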

I also have a question regarding the slide_dbl() function. I would like to try other rolling regressions. My data is already set up with an 8-week window, but if I want to look at, for example, 16 or 24 weeks, should I change .before from 8 to 16? The reason I am asking is that I don't have the original dataset; it is already fixed at 8 weeks. So if I increase .before by an additional 8, will the window be 16 weeks?
new8 <- new%>%mutate( across(
where(is.double),
function(x) slide_dbl(x, mean, na.rm = TRUE, .before = 8L) %>% lag()))
Or should I use:
new16 <- new%>%mutate(across(
where(is.double),
function(x) slide_dbl(x, mean, na.rm = TRUE, .before = 16L) %>% lag()))
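One way to check what .before does is to count how many elements each window contains. The sketch below uses a made-up vector, not the data from the question: the window is the current element plus .before previous elements, and the trailing lag() in the code above then shifts the result down by one row. Whether that corresponds to 16 weeks depends on what each row of the data represents, which cannot be told from the question.

```r
library(slider)

x <- seq_len(30)  # toy input; only its length matters here

# number of elements in each window
n_before_8  <- slide_int(x, length, .before = 8L)
n_before_16 <- slide_int(x, length, .before = 16L)

max(n_before_8)   # 9: the current element plus 8 before it
max(n_before_16)  # 17: the current element plus 16 before it
```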

Related

Calculate internal consistency of items by grouping variables using dplyr/tidyverse

I’d like to calculate the internal consistency (alpha and omega) of items by grouping variables (e.g., age and raterType). Ideally I’d be able to do this using dplyr/tidyverse. My question is similar to another question (Using dplyr to nest or group two variables, then perform the Cronbach's alpha function or other statistics to the data), however I can’t get the solution to work in my case.
Here is a minimal example:
library("tidyverse")
library("psych")
library("MBESS")
mydata <- expand.grid(ID = 1:100,
age = 1:5,
raterType = c("self",
"friend",
"parent"))
set.seed(12345)
mydata$item1 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item2 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item3 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item4 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item5 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item6 <- sample(1:7, nrow(mydata), replace = TRUE)
mydata$item1[sample(nrow(mydata), 100)] <- NA
mydata$item2[sample(nrow(mydata), 100)] <- NA
mydata$item3[sample(nrow(mydata), 100)] <- NA
mydata$item4[sample(nrow(mydata), 100)] <- NA
mydata$item5[sample(nrow(mydata), 100)] <- NA
mydata$item6[sample(nrow(mydata), 100)] <- NA
itemNames <- paste("item", 1:6, sep = "")
To calculate internal consistency for the entire dataset, I would calculate alpha and omega, respectively, by the following code:
alpha(mydata[,itemNames])$total$raw_alpha
ci.reliability(mydata[,itemNames], type = "omega", interval.type = "none")$est
However, I want to calculate alpha and omega for each combination of age and raterType.
Here's my attempt:
mydata %>%
pivot_longer(cols = c(-age, -raterType, -ID)) %>%
select(-ID) %>%
nest_by(age, raterType) %>%
mutate(alpha = alpha(data)$total$raw_alpha,
omega = ci.reliability(data, type = "omega", interval.type = "none")$est)
For some reason, this code returns incorrect estimates for omega and throws an error for alpha:
> # This provides the wrong estimates:
> mydata %>%
+ pivot_longer(cols = c(-age, -raterType, -ID)) %>%
+ select(-ID) %>%
+ nest_by(age, raterType) %>%
+ mutate(omega = ci.reliability(data, type = "omega", interval.type = "none")$est)
# A tibble: 15 × 4
# Rowwise: age, raterType
age raterType data omega
<int> <fct> <list<tibble[,2]>> <dbl>
1 1 self [600 × 2] 0.218
2 1 friend [600 × 2] 0.257
3 1 parent [600 × 2] 0.261
4 2 self [600 × 2] 0.196
5 2 friend [600 × 2] 0.257
6 2 parent [600 × 2] 0.209
7 3 self [600 × 2] 0.179
8 3 friend [600 × 2] 0.225
9 3 parent [600 × 2] 0.247
10 4 self [600 × 2] 0.224
11 4 friend [600 × 2] 0.252
12 4 parent [600 × 2] 0.218
13 5 self [600 × 2] 0.248
14 5 friend [600 × 2] 0.218
15 5 parent [600 × 2] 0.202
>
> # This throws an error:
> mydata %>%
+ pivot_longer(cols = c(-age, -raterType, -ID)) %>%
+ select(-ID) %>%
+ nest_by(age, raterType) %>%
+ mutate(alpha = alpha(data)$total$raw_alpha)
Number of categories should be increased in order to count frequencies.
Error in `mutate()`:
! Problem while computing `alpha = alpha(data)$total$raw_alpha`.
ℹ The error occurred in row 1.
Caused by error in `FUN()`:
! only defined on a data frame with all numeric-alike variables
Run `rlang::last_error()` to see where the error occurred.
Warning messages:
1: Problem while computing `alpha = alpha(data)$total$raw_alpha`.
ℹ NAs introduced by coercion
ℹ The warning occurred in row 1.
2: Problem while computing `alpha = alpha(data)$total$raw_alpha`.
ℹ Item = name had no variance and was deleted but still is counted in the score
ℹ The warning occurred in row 1.
The omega values above do not correspond to the values obtained from running the ci.reliability() function on the respective subset of the data:
> alpha(mydata[which(mydata$age == 3 & mydata$raterType == "self"), itemNames])$total$raw_alpha
[1] -0.3018416
> ci.reliability(mydata[which(mydata$age == 3 & mydata$raterType == "self"), itemNames], type = "omega", interval.type = "none")$est
[1] 0.00836356
Perhaps this helps
out1 <- mydata %>%
group_by(age, raterType) %>%
summarise(alpha = alpha(across(all_of(itemNames)))$total$raw_alpha,
omega = ci.reliability(across(all_of(itemNames)),
type = "omega", interval.type = "none")$est, .groups = 'drop')
-output
> out1
# A tibble: 15 × 4
age raterType alpha omega
<int> <fct> <dbl> <dbl>
1 1 self -0.135 2.76
2 1 friend 0.138 0.231
3 1 parent -0.229 255.
4 2 self -0.421 NA
5 2 friend 0.0650 58.7
6 2 parent 0.153 NA
7 3 self -0.302 0.00836
8 3 friend 0.147 0.334
9 3 parent 0.196 0.132
10 4 self -0.0699 NA
11 4 friend 0.118 0.214
12 4 parent -0.0303 31.1
13 5 self -0.0166 0.246
14 5 friend -0.192 0.0151
15 5 parent 0.0847 NA
Or maybe this
out2 <- mydata %>%
nest_by(age, raterType) %>%
mutate(alpha = alpha(data[, itemNames])$total$raw_alpha,
omega = ci.reliability(data[, itemNames], type = "omega",
interval.type = "none")$est)
-output
out2
# A tibble: 15 × 5
# Rowwise: age, raterType
age raterType data alpha omega
<int> <fct> <list<tibble[,7]>> <dbl> <dbl>
1 1 self [100 × 7] -0.135 2.76
2 1 friend [100 × 7] 0.138 0.231
3 1 parent [100 × 7] -0.229 255.
4 2 self [100 × 7] -0.421 NA
5 2 friend [100 × 7] 0.0650 58.7
6 2 parent [100 × 7] 0.153 NA
7 3 self [100 × 7] -0.302 0.00836
8 3 friend [100 × 7] 0.147 0.334
9 3 parent [100 × 7] 0.196 0.132
10 4 self [100 × 7] -0.0699 NA
11 4 friend [100 × 7] 0.118 0.214
12 4 parent [100 × 7] -0.0303 31.1
13 5 self [100 × 7] -0.0166 0.246
14 5 friend [100 × 7] -0.192 0.0151
15 5 parent [100 × 7] 0.0847 NA
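The attempt in the question appears to fail because pivot_longer() turns the items into a two-column name/value table (the [600 × 2] tibbles in its output), whereas both functions expect one column per item. The sketch below illustrates the same per-group pattern with group_modify() and a hypothetical helper, cronbach_alpha(), that computes raw alpha from the covariance matrix of complete rows. It is not psych::alpha() and will not necessarily agree with it when data are missing; the toy data and seed are made up.

```r
library(dplyr)

# Hypothetical helper (not from psych): raw Cronbach's alpha computed from
# the item covariance matrix, using complete rows only.
cronbach_alpha <- function(items) {
  items <- na.omit(as.data.frame(items))
  k <- ncol(items)
  v <- cov(items)
  (k / (k - 1)) * (1 - sum(diag(v)) / sum(v))
}

set.seed(42)  # arbitrary seed
toy <- expand.grid(ID = 1:50, age = 1:2, raterType = c("self", "parent"))
toy$item1 <- sample(1:7, nrow(toy), replace = TRUE)
toy$item2 <- sample(1:7, nrow(toy), replace = TRUE)
toy$item3 <- sample(1:7, nrow(toy), replace = TRUE)
items <- c("item1", "item2", "item3")

res <- toy %>%
  group_by(age, raterType) %>%
  # .x is the wide data of one group: one column per item
  group_modify(~ tibble(alpha = cronbach_alpha(.x[, items]))) %>%
  ungroup()

res
```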

problem with `replace_na()` from tidyr package

I wrote a function that has five arguments to calculate random numbers from a normal distribution. It has two steps:
replace NA with 0 in tibble column
replace 0 with a random number
My problems are:
line three doesn't replace NA value with 0
line five doesn't replace 0 with a random number
I get this error:
! Must subset columns with a valid subscript vector.
x Subscript `col` has the wrong type `function`.
It must be logical, numeric, or character.
Here is my code:
whithout=function(col,min,max,mean,sd){
for(i in 1:4267){
continuous_dataset=continuous_dataset %>% replace_na(continuous_dataset[,col]=0)
if(is.na(continuous_dataset[,col])){
continuous_dataset[i,col]=round(rtruncnorm(1,min,max,mean,sd))
}
}
}
There's no need to write a function that loops across both columns and observations.
I assume you have no zeroes in your dataset to begin with, in which case I can skip replacing NA with 0 and go straight to generating the replacement value.
My solution is based on the tidyverse.
First, generate some test data.
library(tidyverse)
set.seed(123)
df <- tibble(x=runif(5), y=runif(5), z=runif(5))
df$x[3] <- NA
df$y[4] <- NA
df$z[5] <- NA
df
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 NA 0.892 0.678
4 0.883 NA 0.573
5 0.940 0.457 NA
Now solve the problem.
df %>%
mutate(
across(
everything(),
function(.x, mean, sd) ifelse(is.na(.x), rnorm(length(.x), mean, sd), .x),
mean=500,
sd=100
)
)
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 669. 0.892 0.678
4 0.883 629. 0.573
5 0.940 0.457 467.
By avoiding looping through columns and rows, the code is more compact, more robust and (though I've not tested) faster.
If you don't want to process every column, simply replace everything() with a vector of columns that you do want to process. For example
df %>%
mutate(
across(
c(x, y),
function(.x, mean, sd) ifelse(is.na(.x), rnorm(length(.x), mean, sd), .x),
mean=500,
sd=100
)
)
# A tibble: 5 × 3
x y z
<dbl> <dbl> <dbl>
1 0.288 0.0456 0.957
2 0.788 0.528 0.453
3 669. 0.892 0.678
4 0.883 629. 0.573
5 0.940 0.457 NA
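If the intermediate step of replacing NA with 0 is still wanted, the sketch below shows the form of call that tidyr::replace_na() accepts for a data frame: a named list of replacement values, one entry per column. The small tibble and the col_name variable are made up for illustration.

```r
library(dplyr)
library(tidyr)

df_na <- tibble(x = c(1, NA, 3), y = c(NA, 2, 3))

# replace NA by 0 in both columns
df_zero <- df_na %>% replace_na(list(x = 0, y = 0))

# the same for a single column whose name is stored in a string
col_name <- "x"
df_zero_x <- df_na %>% replace_na(setNames(list(0), col_name))

df_zero
```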

How to pass tibble of variable names and function calls to tibble

I'm trying to go from a tibble of variable names and functions like this:
N <- 100
dat <-
tibble(
variable_name = c("a", "b"),
variable_value = c("rnorm(N)", "rnorm(N)")
)
to a tibble with two variables a and b of length N
dat2 <-
tibble(
a = rnorm(N),
b = rnorm(N)
)
is there a !!! or rlang-y way to accomplish this?
We can evaluate the strings
library(dplyr)
library(purrr)
library(tibble)
deframe(dat) %>%
map_dfc(~ eval(rlang::parse_expr(.x)))
-output
# A tibble: 100 x 2
a b
<dbl> <dbl>
1 0.0750 2.55
2 -1.65 -1.48
3 1.77 -0.627
4 0.766 -0.0411
5 0.832 0.200
6 -1.91 -0.533
7 -0.0208 -0.266
8 -0.409 1.08
9 -1.38 -0.181
10 0.727 0.252
# … with 90 more rows
Here is a base way with a pipe and an as_tibble call.
Map(function(x) eval(str2lang(x)), setNames(dat$variable_value, dat$variable_name)) %>%
as_tibble
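Since the question asks about !!!, here is a minimal sketch that first evaluates each string into a named list and then splices that list into tibble(). The intermediate name cols is introduced only for illustration. As with the answers above, evaluating strings with eval() should only be done on input you trust.

```r
library(tibble)

N <- 100
dat <- tibble(
  variable_name = c("a", "b"),
  variable_value = c("rnorm(N)", "rnorm(N)")
)

# evaluate each string; the names of the list become the column names
cols <- lapply(
  setNames(dat$variable_value, dat$variable_name),
  function(s) eval(str2lang(s))
)

# splice the named list into tibble() as individual columns
dat2 <- tibble(!!!cols)
dat2
```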

r studio: simulate my code 1000 times and pick the things which p value<0.05

Here is my original code:
x = rbinom(1000,1,0.5)
z = log(1.3)*x
pr = 1/(1+exp(-z))
y = rbinom(1000,1,pr)
k=glm(y~x,family="binomial")$coef
t=exp(k)
How can I simulate it 1000 times and pick the one with a p-value<0.05?
This is a perfect application for the tidyverse and its list columns. Please see the explanation in the inline comments.
library(tidyverse)
library(broom)
# create a tibble with an id column for each simulation and x wrapped in list()
sim <- tibble(id = 1:1000,
x = list(rbinom(1000,1,0.5))) %>%
# to generate z, pr, y, k use map and map2 from the purrr package to loop over the list column x
# `~ ... ` is similar to `function(.x) {...}`
# `.x` represents the variable you are using map on
mutate(z = map(x, ~ log(1.3) * .x),
pr = map(z, ~ 1 / (1 + exp(-.x))),
y = map(pr, ~ rbinom(1000, 1, .x)),
k = map2(x, y, ~ glm(.y ~ .x, family="binomial")),
# use broom::tidy to get the model summary in form of a tibble
sum = map(k, broom::tidy)) %>%
# select id and sum and unnest the tibbles
select(id, sum) %>%
unnest(cols = c(sum)) %>%
# drop the intercepts and keep only the .x terms with p < 0.05
filter(term !="(Intercept)",
p.value < 0.05)
sim
#> # A tibble: 545 x 6
#> id term estimate std.error statistic p.value
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 3 .x 0.301 0.127 2.37 0.0176
#> 2 7 .x 0.263 0.127 2.06 0.0392
#> 3 8 .x 0.293 0.127 2.31 0.0211
#> 4 11 .x 0.377 0.128 2.96 0.00312
#> 5 12 .x 0.265 0.127 2.08 0.0373
#> 6 13 .x 0.366 0.127 2.88 0.00403
#> 7 14 .x 0.461 0.128 3.61 0.000305
#> 8 17 .x 0.274 0.127 2.16 0.0309
#> 9 18 .x 0.394 0.127 3.09 0.00200
#> 10 19 .x 0.371 0.127 2.92 0.00354
#> # … with 535 more rows
Created on 2020-05-18 by the reprex package (v0.3.0)
I won't do this for you, but these are the steps you'll probably want to go through:
Write your code as a function that returns the value you're interested in (presumably t)
Use something like replicate to run this function many times and record all the answers
Use something like quantile to extract the percentile you're interested in
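The steps above can be sketched in base R as follows. The function name sim_once(), its arguments, and the seed are assumptions made for illustration. It returns both the odds ratio (t in the question) and the p-value of x, so the runs with p < 0.05 can be picked out afterwards.

```r
# Hypothetical helper: one simulated data set, one logistic regression
sim_once <- function(n = 1000, or = 1.3) {
  x <- rbinom(n, 1, 0.5)
  pr <- 1 / (1 + exp(-log(or) * x))
  y <- rbinom(n, 1, pr)
  fit <- glm(y ~ x, family = "binomial")
  coefs <- summary(fit)$coefficients
  c(odds_ratio = exp(coefs["x", "Estimate"]),
    p_value = coefs["x", "Pr(>|z|)"])
}

set.seed(2020)  # arbitrary seed
# replicate() returns a 2 x 1000 matrix; transpose to one row per simulation
res <- t(replicate(1000, sim_once()))

# keep the simulations whose p-value for x is below 0.05
significant <- res[res[, "p_value"] < 0.05, , drop = FALSE]

# percentiles of the simulated odds ratios
quantile(res[, "odds_ratio"], c(0.025, 0.5, 0.975))
```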

How to write a function that conducts paired t-tests on all group/variable combinations in a data frame

I have a data frame similar to data created below:
ID <- data.frame(ID=rep(c(12,122,242,329,595,130,145,245,654,878),each=5))
Var <- data.frame(Variable=c("Copper","Iron","Lead","Zinc","CaCO"))
n <- 10
Variable <- do.call("rbind",replicate(n,Var,simplify=F))
Location <- rep(c("Alpha","Beta","Gamma"), times=c(20,20,10))
Location <- data.frame(Location)
set.seed(1)
FirstPt<- data.frame(FirstPt=sample(1:100,50,replace=T))
LastPt <- data.frame(LastPt=sample(1:100,50,replace=T))
First3<- data.frame(First3=sample(1:100,50,replace=T))
First5<- data.frame(First5=sample(1:100,50,replace=T))
First7<- data.frame(First7=sample(1:100,50,replace=T))
First10<- data.frame(First10=sample(1:100,50,replace=T))
Last3<- data.frame(Last3=sample(1:100,50,replace=T))
Last5<- data.frame(Last5=sample(1:100,50,replace=T))
Last7<- data.frame(Last7=sample(1:100,50,replace=T))
Last10<- data.frame(Last10=sample(1:100,50,replace=T))
data <- cbind(ID,Location,Variable,FirstPt,LastPt,First3,First5,First7,
First10,Last3,Last5,Last7,Last10)
This may be a two part question, but I want to write a function that groups all Variables that are the same (for instance, all the observations that are Copper) and conducts a paired t test between all possible combinations of the numeric columns (FirstPt:Last10). I want it to return the p values in a data frame like this:
Test P-Value
FirstPt.vs.LastPt …
FirstPt.vs.First3 …
etc. …
This will likely be a second function, but I also want to do this after the observations are grouped by Location so that the output data frame will look like this:
Test P-Value
FirstPt.vs.LastPt.InAlpha
FirstPt.vs.LastPt.InBeta
etc.
You can do both of these with one function:
library(tidyverse)
t.test.by.group.combos <- function(.data, groups){
by <- gsub(x = rlang::quo_get_expr(enquo(groups)), pattern = "\\((.*)?\\)", replacement = "\\1")[-1]
.data %>%
group_by(!!!groups) %>%
select_if(is.integer) %>%
group_split() %>%
map(.,
~pivot_longer(., cols = (FirstPt:Last10), names_to = "name", values_to = "val") %>%
nest(data = val) %>%
full_join(.,.,by = by) %>%
filter(name.x != name.y) %>%
mutate(test = paste(name.x, "vs",name.y, !!!groups, sep = "."),
p.value = map2_dbl(data.x,data.y, ~t.test(unlist(.x), unlist(.y))$p.value)) %>%
select(test,p.value)%>%
filter(!duplicated(p.value))
) %>%
bind_rows()
}
t.test.by.group.combos(data, vars(Variable))
#> # A tibble: 225 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO 0.511
#> 2 FirstPt.vs.First3.CaCO 0.184
#> 3 FirstPt.vs.First5.CaCO 0.494
#> 4 FirstPt.vs.First7.CaCO 0.354
#> 5 FirstPt.vs.First10.CaCO 0.893
#> 6 FirstPt.vs.Last3.CaCO 0.496
#> 7 FirstPt.vs.Last5.CaCO 0.909
#> 8 FirstPt.vs.Last7.CaCO 0.439
#> 9 FirstPt.vs.Last10.CaCO 0.146
#> 10 LastPt.vs.First3.CaCO 0.578
#> # … with 215 more rows
t.test.by.group.combos(data, vars(Variable, Location))
#> # A tibble: 674 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO.Alpha 0.850
#> 2 FirstPt.vs.First3.CaCO.Alpha 0.822
#> 3 FirstPt.vs.First5.CaCO.Alpha 0.895
#> 4 FirstPt.vs.First7.CaCO.Alpha 0.810
#> 5 FirstPt.vs.First10.CaCO.Alpha 0.645
#> 6 FirstPt.vs.Last3.CaCO.Alpha 0.870
#> 7 FirstPt.vs.Last5.CaCO.Alpha 0.465
#> 8 FirstPt.vs.Last7.CaCO.Alpha 0.115
#> 9 FirstPt.vs.Last10.CaCO.Alpha 0.474
#> 10 LastPt.vs.First3.CaCO.Alpha 0.991
#> # … with 664 more rows
This is kind of a lengthy function, but in general we group by the groups argument, then we select the groups and any integer columns, then we split the dataframe by the groups. After, we map all the combinations of variables and perform t.tests for each combo. Lastly, we rejoin all the groups into one dataframe.
I think this is what you want. The key was to use group_by and do from tidyverse.
df <- NULL
for(i in (4:(ncol(data)-1))){
for(j in ((i+1):ncol(data))){
df <- rbind(df,data %>%
group_by(Location) %>%
do(data.frame(pval = t.test(.[[i]],.[[j]], data = .)$p.value)) %>%
ungroup() %>%
mutate(Test = paste0(colnames(data)[i],'.vs.',colnames(data)[j]))
)
}
}
df$Test <- paste0(df$Test,'.In',df$Location)
You can probably achieve what you want using the code below:
library(dplyr)
library(tidyr)
data %>%
pivot_longer(cols = FirstPt:Last10) %>%
group_by(Variable) %>%
summarise(p_value = list(combn(name, 2, function(x)
t.test(value[name == x[1]], value[name == x[2]])$p.value)),
test = list(combn(name, 2, paste, collapse = "_"))) %>%
unnest(cols = c(test, p_value))
# Variable p_value test
# <fct> <dbl> <chr>
# 1 CaCO 0.915 FirstPt_LastPt
# 2 CaCO 0.529 FirstPt_First3
# 3 CaCO 0.337 FirstPt_First5
# 4 CaCO 0.350 FirstPt_First7
# 5 CaCO 0.395 FirstPt_First10
# 6 CaCO 0.765 FirstPt_Last3
# 7 CaCO 0.204 FirstPt_Last5
# 8 CaCO 0.873 FirstPt_Last7
# 9 CaCO 0.479 FirstPt_Last10
#10 CaCO 1 FirstPt_FirstPt
# … with 24,740 more rows
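To group by Location as well, add it to the group_by() call and keep the rest of the code as it is. Note that combn(name, 2, ...) runs over every element of name within a group rather than over its distinct values, which is why the output above has 24,750 rows and includes self-pairs such as FirstPt_FirstPt; using combn(unique(name), 2, ...) instead limits it to the 45 distinct column pairs per group.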
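The question asks for paired tests, while the answers above call t.test() without paired = TRUE, which gives Welch two-sample tests. A minimal sketch of the paired version on wide data follows, assuming a small made-up data set with only three of the numeric columns; the names toy, num_cols and pairs are introduced for illustration.

```r
library(dplyr)
library(purrr)

set.seed(1)  # arbitrary seed
toy <- tibble(
  Variable = rep(c("Copper", "Iron"), each = 10),
  FirstPt = sample(1:100, 20, replace = TRUE),
  LastPt  = sample(1:100, 20, replace = TRUE),
  First3  = sample(1:100, 20, replace = TRUE)
)

num_cols <- c("FirstPt", "LastPt", "First3")
# every unordered pair of distinct columns
pairs <- combn(num_cols, 2, simplify = FALSE)

paired_p <- toy %>%
  group_by(Variable) %>%
  group_modify(function(d, key) {
    tibble(
      Test = map_chr(pairs, ~ paste(.x, collapse = ".vs.")),
      # paired = TRUE pairs the two columns row by row within the group
      p_value = map_dbl(pairs, ~ t.test(d[[.x[1]]], d[[.x[2]]],
                                        paired = TRUE)$p.value)
    )
  }) %>%
  ungroup()

paired_p
```

Adding Location to group_by() would give one row per Variable, Location and column pair in the same way.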
