iterate dependent variable using purrr and map (tidyverse) - r

I have a simple dataset that I want to iterate the dependent variable using aov and tidyverse. From those outputs I then want to compute Tukey HSD tests. I have this working in a for loop structure, but am trying my hardest to migrate from that mentality. I saw this post on iterating aovfunctions with the independent variables. Tried to incorporate this logic into my workflow, but not working out so well. Any tidyverse aficionados that could steer me in the right direction here?
library(tidyverse)
library(data.table)
pfuel <- fread("data/CFL.csv") %>%
mutate(AFCL = AFCL*10,
LCW = LCW*10,
DCW = DCW*10,
LiDe = ifelse(Status == "Li", "Live", "Dead")) %>%
filter(S.F == "S") %>%
group_by(Site, Year, Age, Plot) %>%
select(LiFol, DeFol, Li.1hr, De.1hr, Li.10hr, De.10hr, Li.100hr, De.100hr) %>%
summarise_all(sum) %>%
ungroup() %>%
mutate(sb_age = paste0(Year, Age))
aov.models = pfuel %>%
select (-c(Year, Age)) %>%
select(LiFol, DeFol, Li.1hr, De.1hr, Li.10hr, De.10hr, Li.100hr, De.100hr, Site, Plot, sb_age) %>%
map(~ aov(.x ~ sb_age + Site/Plot, data = pfuel))
When the aov.models runs I generate this error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
In addition: Warning message:
In model.response(mf, "numeric") : NAs introduced by coercion
I haven't gotten to the Tukey test yet, as I cannot get past the aov function. Any suggestions would be GREATLY appreciated!
You can find the data here: https://www.dropbox.com/s/yb8rh860fc7fff2/CFL.csv?dl=0
Thanks!

It may be easier to convert the data to long form, split by response, then fit models and feed the output to the HSD.test function, e.g.,
aov.models <- pfuel %>%
select(-Year, -Age) %>%
gather(variable, value, -sb_age, -Site, -Plot) %>%
split(.$variable) %>%
map(~ aov(value ~ sb_age + Site/Plot, data = .x)) %>%
map(HSD.test, trt = 'sb_age')
I also removed one of the select() statements, as it was selecting all of the columns.

#Z.Lin With your guidance I figured out a solution to the first part of my question. Probably not the most elegant, but it is at least working now! Any refinement would be welcomed, but thank you.
pfuel_var <- pfuel %>%
select(Site, Plot, sb_age) %>%
mutate(Site = as.factor(Site),
Plot = as.factor( Plot),
sb_age = as.factor(sb_age))
aov.models <- pfuel %>%
select(LiFol, DeFol, Li.1hr, De.1hr, Li.10hr, De.10hr, Li.100hr, De.100hr) %>%
map(~ aov(.x ~ pfuel_var$sb_age + pfuel_var$Site/pfuel_var$Plot, data = pfuel))
The second part of my question was how to feed this output into HSD.test from the agricolae package. Anyone have thoughts on that?
What I was thinking would be:
t <- aov.models %>%
map(~ HSD.test(.x, "pfuel_var$sb_age", alpha=0.1))
But that is not working properly. Thoughts very much appreciated.

Related

Is there a way to loop through different levels of a factor for anomaly detection

I am using the 'anomalize' package for anomaly detection. My data consists of three columns, the date, an agent (this is where the different levels come from), and the number of schedules that agent had on a particular day. I can run the anomaly detection just fine when I remove the 'agent' column and sum the number of consults by day using this code:
df <- scheds %>%
group_by(date) %>%
summarise(
new_scheds = sum(new_scheds)
)
df_ts <- df %>% rownames_to_column() %>% as_tibble() %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>% select(-one_of('rowname'))
df_ts <- df_ts[order(df_ts$date),]
########## TS Decomp ###############
df_ts %>%
time_decompose(new_scheds, method = "stl", frequency = 5, trend = "auto") %>%
anomalize(remainder, method = "gesd", alpha = 0.05, max_anoms = 0.2) %>%
plot_anomaly_decomposition()
But I cannot find out how I would do this same type of thing for each agent individually without manually typing everything out and using filter(). I have tried the following loop with no luck:
agents <- levels(ts_agents$agent)
results <- matrix(NA, length(agents))
for(i in 1:length(agents)){
ts_agents %>%
time_decompose(new_scheds)[i] %>%
anomalize(remainder)[i] %>%
time_recompose()[i] %>%
plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)[i] }
but I get the following error:
'Error in time_decompose(new_scheds) : object 'new_scheds' not found'
Any tips or pointers would be greatly appreciated!
The reason for your error is that the pipe operator %>% doesn't work right when you try to subset your data.
If you enclose it in brackets and use . to refer to the input, you will avoid this error:
for(i in 1:length(agents)){
ts_agents %>% {
time_decompose(., new_scheds)[i]
} %>% {
...
This fixes the immediate problem of the error but I'm not sure how well the subsetting will work. It may be that you need filter() in the loop, or even group_by(df, agent) without any loop at all. (If you provide a full reproducible example including data, it will be easier to help).

Passing arguments dynamically in Expss tables with user-defined functions

I have a (new) question related to expss tables. I wrote a very simple UDF (that relies on few expss functions), as follows:
library(expss)
z_indices <- function(x, m_global, std_global, weight=NULL){
if(is.null(weight)) weight = rep(1, length(x))
z <- (w_mean(x, weight)-m_global)/std_global
indices <- 100+(z*100)
return(indices)
}
Reproducible example, based on infert dataset (plus a vector of arbitrary weights):
data(infert)
infert$w <- as.vector(x=rep(2, times=nrow(infert)), mode='numeric')
infert %>%
tab_cells(age, parity) %>%
tab_cols(total(), education, case %nest% list(total(), education)) %>%
tab_weight(w) %>%
tab_stat_valid_n(label="N") %>%
tab_stat_mean(label="Mean") %>%
tab_stat_fun(label="Z", function(x, m_global, std_global, weight=NULL){
z_indices(x, m_global=w_mean(infert$age, infert$w),std_global=w_sd(infert$age, infert$w))
}) %>%
tab_pivot(stat_position="inside_columns")
The table is computed and the output for the first line is (almost) as expected.
Then things go messy for the second line, since both arguments of z_indices explicitely refer to infert$age, where infert$parity is expected.
My question: is there a way to dynamically pass the variables of tab_cells as function argument within tab_stat_fun to match the variable being processed? I guess this happens inside function declaration but have not clue how to proceed...
Thanks!
EDIT April 28th 2020:
Answer from #Gregory Demin works great in the scope of infert dataset, although for better scalability to larger dataframes I wrote the following loop:
var_df <- data.frame("age"=infert$age, "parity"=infert$parity)
tabZ=infert
for(each in names(var_df)){
tabZ = tabZ %>%
tab_cells(var_df[each]) %>%
tab_cols(total(), education) %>%
tab_weight(w) %>%
tab_stat_valid_n(label="N") %>%
tab_stat_mean(label="Mean") %>%
tab_stat_fun(label="Z", function(x, m_global, std_global, weight=NULL){
z_indices(x, m_global=w_mean(var_df[each], infert$w),std_global=w_sd(var_df[each], infert$w))
})
}
tabZ = tabZ %>% tab_pivot()
Hope this inspires other expss users in the future!
There is no universal solution for this case. Function in the tab_stat_fun is always calculated inside cell so you can't get global values in it.
However, in your case we can calculate z-index before summarizing. Not so flexible solution but it works:
# function for weighted z-score
w_z_index = function(x, weight = NULL){
if(is.null(weight)) weight = rep(1, length(x))
z <- (x - w_mean(x, weight))/w_sd(x, weight)
indices <- 100+(z*100)
return(indices)
}
data(infert)
infert$w <- rep(2, times=nrow(infert))
infert %>%
tab_cells(age, parity) %>%
tab_cols(total(), education, case %nest% list(total(), education)) %>%
tab_weight(w) %>%
tab_stat_valid_n(label="N") %>%
tab_stat_mean(label="Mean") %>%
# here we get z-index instead of original variables
tab_cells(age = w_z_index(age, w), parity = w_z_index(parity, w)) %>%
tab_stat_mean(label="Z") %>%
tab_pivot(stat_position="inside_columns")
UPDATE.
A little more scalable approach:
w_z_index = function(x, weight = NULL){
if(is.null(weight)) weight = rep(1, length(x))
z <- (x - w_mean(x, weight))/w_sd(x, weight)
indices <- 100+(z*100)
return(indices)
}
w_z_index_df = function(df, weight = NULL){
df[] = lapply(df, w_z_index, weight = weight)
df
}
data(infert)
infert$w <- rep(2, times=nrow(infert))
infert %>%
tab_cells(age, parity) %>%
tab_cols(total(), education, case %nest% list(total(), education)) %>%
tab_weight(w) %>%
tab_stat_valid_n(label="N") %>%
tab_stat_mean(label="Mean") %>%
# here we get z-index instead of original variables
# we process a lot of variables at once
tab_cells(w_z_index_df(data.frame(age, parity))) %>%
tab_stat_mean(label="Z") %>%
tab_pivot(stat_position="inside_columns")

maping over a list and taking the colmeans and rowmeans in r

I am trying to compute the column means and row means of some data I have.
Its similar to the following:
library(rsample)
library(tidyquant)
library(tidyverse)
library(tsibble)
aapl <- tq_get("AAPL", start_date = "2000-01-01")
aapl_monthly_nested <- aapl %>%
mutate(ym = yearmonth(date)) %>%
nest(-ym)
aapl_rolled <- aapl_monthly_nested %>%
rolling_origin(cumulative = FALSE)
map(aapl_rolled$splits, ~ analysis(.x)) %>%
head
I try using the summarise_all function once I have mapped over the data but I cannot seem to get the colMeans. I have replaced colMeans with mean without luck.
x <- map(aapl_rolled$splits, ~analysis(.x),
~map(data,
~summarise_all(.funs(colMeans))))
x[[1]]$data
I would like a single observation of the column means for each of the splits.
EDIT:
I think I got it. - I believe I forgot the unnest the data after nesting it previously.
x <- map(aapl_rolled$splits, ~ analysis(.x) %>%
unnest() %>%
as_tibble(.) %>%
select(-year_month) %>%
summarise_all(mean))
If you have a better solution please let me know.

bootstrap by group and calculate statistics

I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
mtcars %>%
group_by(cyl) %>%
do({
datsplit=crossv_mc(.,10)
mdls=list(map(datsplit$train, ~glm(hp~disp,data=.,family=gaussian(link='identity'))))
data_frame(datsplit=list(datsplit),mdls)
})
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object
'.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to #aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})
One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
Here is another option that avoids the nested lists all together:
mtcars %>%
split(.$cyl) %>%
map_df(crossv_mc, 10, .id = "cyl") %>%
mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
rsq = map2_dbl(models, test, rsquare))
#aosmith answered my question but here is a simpler solution overall
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})

dplyr: How to use select and filter inside functions; (...) not working for arguments

I'm trying to build some functions for creating standard tables from a questionnaire, using dplyr for the data manipulation. This question was very helpful for the group_by function, passing arguments (in this case, the name of the variable I want to use to make the table) to (...), but that seems to break down when trying to pass the same arguments to other dplyr commands, specifically 'select' and 'filter'. The error message I get is '...' used in an incorrect context'.
Does anyone have any ideas on this? Thank you
For the sake of completeness (and any other hints - I'm very new to writing functions), here is the code I would like to use:
myTable <- function(x, ...) {
df <-
x %>%
group_by(Var1, ...) %>%
filter(!is.na(...) & ... != '') %>% # To remove missing values: Not working!
summarise(value = n()) %>%
group_by(Var1) %>%
mutate(Tot = sum(value)) %>%
group_by(Var1, ...) %>%
summarise(num = sum(value), total = sum(Tot), proportion = num/total*100) %>%
select(Var1, ..., proportion) # To select desired columns: Not working!
tab <- dcast(df, Var1 ~ ..., value.var = 'proportion')
tab[is.na(tab)] <- 0
print(tab)
}

Resources