Bootstrapping by multiple groups in the tidyverse: rsample vs. broom - r

In this SO question, bootstrapping by several groups and subgroups seemed easy: call broom::bootstrap with the by_group argument set to TRUE.
My desired output is a nested tibble with n rows, where the data column contains the bootstrapped data generated by each bootstrap call (and each group and subgroup has the same number of cases as in the original data).
In broom I did the following:
# packages
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(rsample)
library(broom)
# some data to bootstrap
set.seed(123)
data <- tibble(
  group = rep(c('group1', 'group2', 'group3', 'group4'), 25),
  subgroup = rep(c('subgroup1', 'subgroup2', 'subgroup3', 'subgroup4'), 25),
  v1 = rnorm(100),
  v2 = rnorm(100)
)
# the actual approach using broom::bootstrap
tibble(id = 1:100) %>%
  mutate(data = map(id, ~ {
    data %>%
      group_by(group, subgroup) %>%
      broom::bootstrap(100, by_group = TRUE)
  }))
Since the broom::bootstrap function is deprecated, I rebuilt my approach to produce the same output using rsample::bootstraps. Getting there seems to be much more complicated. Am I doing something wrong, or have grouped bootstraps become more complicated in the tidyverse?
data %>%
  dplyr::mutate(group2 = group,
                subgroup2 = subgroup) %>%
  tidyr::nest(-group2, -subgroup2) %>%
  dplyr::mutate(boot = map(data, ~ rsample::bootstraps(., 100))) %>%
  pull(boot) %>%
  purrr::map(., "splits") %>%
  transpose %>%
  purrr::map(., ~ purrr::map_dfr(., rsample::analysis)) %>%
  tibble(id = 1:length(.), data = .)
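One possible simplification, sketched here with dplyr::slice_sample() instead of rsample (so this is a swapped-in technique, not the rsample API): resample every group/subgroup cell with replacement at its original size, and repeat that once per bootstrap id. The helper name boot_by_group is hypothetical, and the sketch assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(123)
data <- tibble(
  group = rep(c("group1", "group2", "group3", "group4"), 25),
  subgroup = rep(c("subgroup1", "subgroup2", "subgroup3", "subgroup4"), 25),
  v1 = rnorm(100),
  v2 = rnorm(100)
)

# Hypothetical helper: one bootstrap replicate that keeps every
# group/subgroup cell at its original number of rows.
boot_by_group <- function(df) {
  df %>%
    group_by(group, subgroup) %>%
    slice_sample(prop = 1, replace = TRUE) %>%  # resample within each cell
    ungroup()
}

# 100 replicates, stored as a list column next to their id
boot_list <- map(1:100, ~ boot_by_group(data))
boots <- tibble(id = 1:100, data = boot_list)
```

Each element of boots$data is a full resampled copy of the data, so counting rows by group and subgroup in any replicate should reproduce the original cell sizes.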

Related

How to modify N in gtsummary/ how to shape one-hot-encoded data for gtsummary

I have survey data that is one-hot encoded, which I then reshaped into long format so that I can compare variables within groups. The problem is that this has "magically" inflated my n. I have kept an id column in my data frame, so I can easily get the real n from unique(id), i.e. the number of different people who provided data.
However, the N given in the table is based on the number of rows. Is there a way to make tbl_summary() report N based on the unique ids? Note that I have been dropping the id column before calling tbl_summary() to avoid getting summary statistics for it.
The other thing I have been wondering is whether there is a better way to shape my data so that it pairs well with gtsummary.
drug1_dose = rnorm(100)
drug2_dose = rnorm(100)
# add a row id so each person can still be identified after reshaping
df <- data.frame(drug1_dose, drug2_dose) %>%
  rowid_to_column("id")
df <- df %>%
  rename(drug1 = drug1_dose) %>%
  rename(drug2 = drug2_dose) %>%
  pivot_longer(c(drug1, drug2), names_to = "drug", values_to = "dose", values_drop_na = TRUE) %>%
  select(-id) %>%
  tbl_summary()
It is worth mentioning that in my data there are several cases with data for only drug1 or only drug2, as the two groups overlap but are not the same. I was not sure how to show this in a reprex.
Thank you in advance!
You can use the modify_header() function to change the header to whatever you'd like. Details at http://www.danieldsjoberg.com/gtsummary/reference/modify.html
library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.4.0'
drug1_dose = rnorm(100)
drug2_dose = rnorm(100)
df <-
  data.frame(drug1_dose, drug2_dose) %>%
  rowid_to_column("id") %>%
  rename(drug1 = drug1_dose) %>%
  rename(drug2 = drug2_dose) %>%
  pivot_longer(c(drug1, drug2), names_to = "drug", values_to = "dose", values_drop_na = TRUE)
tbl <-
  df %>%
  select(-id) %>%
  tbl_summary() %>%
  modify_header(stat_0 = "**N = 100**")
Created on 2021-04-21 by the reprex package (v2.0.0)
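Rather than typing the count by hand, the header text could be built from the number of distinct ids in the long data. This is only a sketch of that step: the names n_people and header are made up for illustration, and the seed is arbitrary.

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(1)
df <- data.frame(drug1 = rnorm(100), drug2 = rnorm(100)) %>%
  rowid_to_column("id") %>%
  pivot_longer(c(drug1, drug2), names_to = "drug", values_to = "dose",
               values_drop_na = TRUE)

# Rows were doubled by the reshape, but the number of people was not
n_people <- n_distinct(df$id)

# Markdown-style header string built from the real number of people
header <- paste0("**N = ", n_people, "**")
```

The resulting string can then be passed to modify_header(stat_0 = header) in place of the hard-coded "**N = 100**".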

Fit loess smoothers for multiple groups across multiple numeric variables

I need to fit many loess smoothers by the grouping variable (Animal) across multiple numeric columns (Var1, Var2), and extract the fitted values.
I found code to do this task one variable at a time:
# Create dataframe 1
OneVarDF <- data.frame(Day = c(replicate(1,sample(1:50,200,rep=TRUE))),
Animal = c(c(replicate(100,"Greyhound"), c(replicate(100,"Horse")))),
Var1 = c(c(replicate(1,sample(2:10,100,rep=TRUE))), c(replicate(1,sample(15:20,100,rep=TRUE)))))
library(dplyr)
library(tidyr)
library(purrr)
# Get fitted values from each model
Models <- OneVarDF %>%
  tidyr::nest(-Animal) %>%
  dplyr::mutate(m = purrr::map(data, loess, formula = Var1 ~ Day, span = 0.30),
                fitted = purrr::map(m, `[[`, "fitted")
  )
# Create prediction column
Results <- Models %>%
  dplyr::select(-m) %>%
  tidyr::unnest()
This "Results" dataframe is essential for downstream tasks (detrending many non-parametric distributions).
How can we achieve this with a dataframe with multiple numeric columns (code below), and extract a "Results" dataframe? Thank you.
# Create dataframe 2
TwoVarDF <- data.frame(Day = c(replicate(1,sample(1:50,200,rep=TRUE))),
Animal = c(c(replicate(100,"Greyhound"), c(replicate(100,"Horse")))),
Var1 = c(c(replicate(1,sample(2:10,100,rep=TRUE))), c(replicate(1,sample(15:20,100,rep=TRUE)))),
Var2 = c(c(replicate(1,sample(22:27,100,rep=TRUE))), c(replicate(1,sample(29:35,100,rep=TRUE)))))
We can get the data into long format using pivot_longer(), group by Animal and the column name, and apply loess to each combination.
library(dplyr)
library(tidyr)
TwoVarDF %>%
  pivot_longer(cols = starts_with('Var')) %>%
  group_by(Animal, name) %>%
  mutate(model = loess(value~Day, span = 0.3)$fitted)
Add a gather() step first, then proceed much as in your previous code.
Models2 <- TwoVarDF %>%
  gather(varName, varVal, 3:4) %>%
  tidyr::nest(-Animal, -varName) %>%
  dplyr::mutate(m = purrr::map(data, loess, formula = varVal ~ Day, span = 0.30),
                fitted = purrr::map(m, `[[`, "fitted")
  )
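To end up with a flat "Results" data frame for several variables, the fitted values can be unnested next to the rows they were fitted on. The following is a minimal sketch, assuming tidyr 1.0.0 or later (for the named nest() and unnest() syntax); Results2 is a hypothetical name and the seed is arbitrary.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)
TwoVarDF <- data.frame(
  Day    = sample(1:50, 200, replace = TRUE),
  Animal = rep(c("Greyhound", "Horse"), each = 100),
  Var1   = c(sample(2:10, 100, replace = TRUE), sample(15:20, 100, replace = TRUE)),
  Var2   = c(sample(22:27, 100, replace = TRUE), sample(29:35, 100, replace = TRUE))
)

Results2 <- TwoVarDF %>%
  # one row per Animal / variable / observation
  pivot_longer(starts_with("Var"), names_to = "varName", values_to = "varVal") %>%
  # one nested data frame per Animal and variable
  nest(data = c(Day, varVal)) %>%
  mutate(
    m      = map(data, ~ loess(varVal ~ Day, data = .x, span = 0.30)),
    fitted = map(m, ~ .x$fitted)
  ) %>%
  select(-m) %>%
  # put the fitted values back beside the original observations
  unnest(c(data, fitted))
```

Results2 has one row per Animal, variable and observation, with the loess fit in the fitted column.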

Mapping over a list and taking the colmeans and rowmeans in r

I am trying to compute the column means and row means of some data I have.
It's similar to the following:
library(rsample)
library(tidyquant)
library(tidyverse)
library(tsibble)
aapl <- tq_get("AAPL", start_date = "2000-01-01")
aapl_monthly_nested <- aapl %>%
mutate(ym = yearmonth(date)) %>%
nest(-ym)
aapl_rolled <- aapl_monthly_nested %>%
rolling_origin(cumulative = FALSE)
map(aapl_rolled$splits, ~ analysis(.x)) %>%
head
I tried using summarise_all() after mapping over the data, but I cannot seem to get the column means. I have also replaced colMeans with mean, without luck.
x <- map(aapl_rolled$splits, ~analysis(.x),
~map(data,
~summarise_all(.funs(colMeans))))
x[[1]]$data
I would like a single observation of the column means for each of the splits.
EDIT:
I think I got it: I believe I forgot to unnest the data after nesting it earlier.
x <- map(aapl_rolled$splits, ~ analysis(.x) %>%
  unnest() %>%
  as_tibble(.) %>%
  select(-ym) %>%  # drop the nesting column, which was named ym above
  summarise_all(mean))
If you have a better solution, please let me know.
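The same idea can be sketched on made-up data, so that it runs without downloading prices. Everything here is an assumption for illustration: the prices tibble, its open and close columns, the window settings (initial = 5, assess = 1) and the name split_means. It also assumes dplyr 1.0.0 or later for across().

```r
library(dplyr)
library(tidyr)
library(purrr)
library(rsample)

set.seed(1)
# Made-up stand-in for the price data: 12 "months" with 5 rows each
prices <- tibble(
  ym    = rep(1:12, each = 5),
  open  = runif(60, 90, 110),
  close = runif(60, 90, 110)
)

nested <- prices %>% nest(data = -ym)

# Sliding (non-cumulative) windows of 5 months each
rolled <- rolling_origin(nested, initial = 5, assess = 1, cumulative = FALSE)

# One row of column means per split
split_means <- rolled$splits %>%
  map(~ analysis(.x) %>%
        unnest(data) %>%
        summarise(across(-ym, mean))) %>%
  bind_rows(.id = "split")
```

split_means then holds a single observation of column means for each split, with the split number in the first column.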

r: select(everything()) not working with broom::augment for modeling results

I am looking to create a clean data frame with reordered columns from the results of a linear model (lm). My eventual aim is to write the data frame to Excel to chart and audit model residuals. First, the sample data:
df1 <- cbind.data.frame(dt = seq.Date(as.Date('2019-01-01'),
as.Date('2019-01-10'),
by = 'day' ),
depVar = rnorm(10,2,1),
indepVar1 = rnorm(10,4,3),
indepVar2 = rnorm(10,7,2)
)
Now run the model:
modRes <- lm(depVar~ indepVar1, data=df1)
avf1 <- broom::augment(modRes)
library(dplyr)
avf1 <- avf1 %>%
# drop what we don't need
select(-c(.se.fit, .hat, .sigma, .cooksd, .std.resid)) %>%
cbind(df1)
The above runs well, but I want to add another pipe step, select(dt, everything()), so I can reorder the columns. The code below returns an error:
avf1 <- avf1 %>%
select(-c(.se.fit, .hat, .sigma, .cooksd, .std.resid)) %>% # drop what we don't need
cbind(df1) %>%
select(dt, everything())
Error: Can't bind data because some arguments have the same name
Call `rlang::last_error()` to see a backtrace
Why is this failing?
avf1 <- modRes %>%
  broom::augment() %>%  # only dplyr is attached above, so call broom explicitly
  select(-c(.se.fit, .hat, .sigma, .cooksd, .std.resid)) %>%
  bind_cols(df1) %>%
  select(dt, everything())
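The likely cause is that augment() already returns the model variables (depVar and indepVar1), so cbind(df1) yields a data frame with duplicated column names, which select() refuses to handle. A small sketch that makes the overlap visible and avoids it by keeping only the new columns from augment(); the names aug, dup and out are hypothetical, and the seed is arbitrary.

```r
library(dplyr)

set.seed(1)
df1 <- data.frame(
  dt        = seq.Date(as.Date("2019-01-01"), as.Date("2019-01-10"), by = "day"),
  depVar    = rnorm(10, 2, 1),
  indepVar1 = rnorm(10, 4, 3),
  indepVar2 = rnorm(10, 7, 2)
)

modRes <- lm(depVar ~ indepVar1, data = df1)
aug <- broom::augment(modRes)

# Columns present in both tables: these are what cbind() would duplicate
dup <- intersect(names(aug), names(df1))

# Keep only the fitted values and residuals, then attach the original data
out <- aug %>%
  select(.fitted, .resid) %>%
  bind_cols(df1) %>%
  select(dt, everything())
```

Because no column name appears twice in out, the final select(dt, everything()) can reorder the columns without complaint.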

Transpose data frame variables and add null, unique counts in [r]

I am trying to build a summary table of a data frame, like DataProfile below.
The idea is to turn each column into a row and add variables for the count, nulls, non-nulls and unique values, plus further quantities derived from those.
It seems like there should be a better, faster way to do this. Is there a function that does this?
#trying to write the functions within dplyr & magrittr framework
library(tidyverse)
library(reshape2) # provides melt()
mtcars[2,2] <- NA # Add a null to test completeness
#
total <- mtcars %>% summarise_all(funs(n())) %>% melt
nulls <- mtcars %>% summarise_all(funs(sum(is.na(.)))) %>% melt
filled <- mtcars %>% summarise_all(funs(sum(!is.na(.)))) %>% melt
uniques <- mtcars %>% summarise_all(funs(length(unique(.)))) %>% melt
mtcars %>% summarise_all(funs(n_distinct(.))) %>% melt
#Build a Data Frame from names of mtcars and add variables with mutate
DataProfile <- as.data.frame(names(mtcars))
DataProfile <- DataProfile %>% mutate(Total = total$value,
Nulls = nulls$value,
Filled = filled $value,
Complete = Filled/Total,
Cardinality = uniques$value,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
#These are other attempts with Base R, but they are harder to read and don't play well with summarise_all
sapply(mtcars, function(x) length(unique(x[!is.na(x)]))) %>% melt
rapply(mtcars,function(x)length(unique(x))) %>% melt
The summarise_all() function can process more than one function at a time, so you can consolidate code by doing it in one pass then formatting your data to get to the type of "profile" per variable that you want.
library(tidyverse)
library(reshape2) # provides melt()
mtcars[2,2] <- NA # Add a null to test completeness
DataProfile <- mtcars %>%
  summarise_all(funs("Total" = n(),
                     "Nulls" = sum(is.na(.)),
                     "Filled" = sum(!is.na(.)),
                     "Cardinality" = length(unique(.)))) %>%
  melt() %>%
  separate(variable, into = c('variable', 'measure'), sep="_") %>%
  spread(measure, value) %>%
  mutate(Complete = Filled/Total,
         Uniqueness = Cardinality/Total,
         Distinctness = Cardinality/Filled)
DataProfile
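Since funs() is deprecated in current dplyr, the same one-pass profile can also be sketched with across() and the tidyr pivot functions. This assumes dplyr 1.0.0 or later and column names without underscores (true for mtcars); the copy named cars is introduced here only to avoid modifying mtcars itself.

```r
library(dplyr)
library(tidyr)

cars <- mtcars
cars[2, 2] <- NA  # add a null to test completeness

DataProfile <- cars %>%
  # one pass: four summaries per column, named <column>_<measure>
  summarise(across(everything(), list(
    Total       = ~ length(.x),
    Nulls       = ~ sum(is.na(.x)),
    Filled      = ~ sum(!is.na(.x)),
    Cardinality = ~ n_distinct(.x)
  ))) %>%
  # split <column>_<measure> into two columns, one row per pair
  pivot_longer(everything(),
               names_to = c("variable", "measure"),
               names_sep = "_") %>%
  # one row per original column, one column per measure
  pivot_wider(names_from = measure, values_from = value) %>%
  mutate(Complete     = Filled / Total,
         Uniqueness   = Cardinality / Total,
         Distinctness = Cardinality / Filled)
```

As with length(unique(.)), n_distinct() counts NA as its own value, so Cardinality includes the missing value where one is present.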
