I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
mtcars %>%
group_by(cyl) %>%
do({
datsplit=crossv_mc(.,10)
mdls=list(map(datsplit$train, ~glm(hp~disp,data=.,family=gaussian(link='identity'))))
data_frame(datsplit=list(datsplit),mdls)
})
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object
'.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to #aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})
One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
Here is another option that avoids the nested lists all together:
mtcars %>%
split(.$cyl) %>%
map_df(crossv_mc, 10, .id = "cyl") %>%
mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
rsq = map2_dbl(models, test, rsquare))
#aosmith answered my question but here is a simpler solution overall
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})
Related
I'm trying to do a Wilcoxon test on long-formatted data. I want to use dplyr::group_by() to specify the subsets I'd like to do the test on.
The final result would be a new column with the p-value of the Wilcoxon test appended to the original data frame. All of the techniques I have seen require summarizing the data frame. I DO NOT want to summarize the data frame.
Please see an example reformatting the iris dataset to mimic my data, and finally my attempts to perform the task.
I am getting close, but I want to preserve all of my original data from before the Wilcoxon test.
# Reformatting Iris to mimic my data.
long_format <- iris %>%
gather(key = "attribute", value = "measurement", -Species) %>%
mutate(descriptor =
case_when(
str_extract(attribute, pattern = "\\.(.*)") == ".Width" ~ "Width",
str_extract(attribute, pattern = "\\.(.*)") == ".Length" ~ "Length")) %>%
mutate(Feature =
case_when(
str_extract(attribute, pattern = "^(.*?)\\.") == "Sepal." ~ "Sepal",
str_extract(attribute, pattern = "^(.*?)\\.") == "Petal." ~ "Petal"))
# Removing no longer necessary column.
cleaned_up <- long_format %>% select(-attribute)
# Attempt using do(), but I lose important info like "measurement"
cleaned_up %>%
group_by(Species, Feature) %>%
do(w = wilcox.test(measurement~descriptor, data=., paired=FALSE)) %>%
mutate(Wilcox = w$p.value)
# This is an attempt with the dplyr experimental group_map function. If only I could just make this a new column appended to the original df in one step.
cleaned_up %>%
group_by(Species, Feature) %>%
group_map(~ wilcox.test(measurement~descriptor, data=., paired=FALSE)$p.value)
Thanks for your help.
The model object can be wrapped in a list
library(tidyverse)
cleaned_up %>%
group_by(Species, Feature) %>%
nest %>%
mutate(model = map(data, ~
.x %>%
transmute(w = list(wilcox.test(measurement~descriptor,
data=., paired=FALSE)))))
Or another option is group_split into a list, then map through the list, elements create the 'pval' column after applying the model
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
Another option is to avoid the data argument entirely. The wilcox.test function only requires a data argument when the variables being tested aren't in the calling scope, but functions called within mutate have all the columns from the data frame in scope.
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
Same as akrun's output (thanks to his correction in the comments above)
akrun <-
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
me <-
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
all.equal(akrun, me)
# [1] TRUE
Lets say I want to split out mtcars into 3 csv files based on their cyl grouping. I can use mutate to do this, but it will create a NULL column in the output.
library(tidyverse)
by_cyl = mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
mutate(unused = map2(data, cyl, function(x, y) write.csv(x, paste0(y, '.csv'))))
is there a way to do this on the by_cyl object without calling mutate?
Here is an option using purrr without mutate from dplyr.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update
This drops the cyl column before saving the output.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
map(~ .x %>% select(-cyl)) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update2
library(tidyverse)
by_cyl <- mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x[["data"]][[1]], paste0(.y, '.csv')))
Here's a solution with do and group_by, so if your data is already grouped as it should, you save one line:
mtcars %>%
group_by(cyl) %>%
do(data.frame(write.csv(.,paste0(.$cyl[1],".csv"))))
data.frame is only used here because do needs to return a data.frame, so it's a little hack.
I have a simple dataset that I want to iterate the dependent variable using aov and tidyverse. From those outputs I then want to compute Tukey HSD tests. I have this working in a for loop structure, but am trying my hardest to migrate from that mentality. I saw this post on iterating aovfunctions with the independent variables. Tried to incorporate this logic into my workflow, but not working out so well. Any tidyverse aficionados that could steer me in the right direction here?
library(tidyverse)
library(data.table)
pfuel <- fread("data/CFL.csv") %>%
mutate(AFCL = AFCL*10,
LCW = LCW*10,
DCW = DCW*10,
LiDe = ifelse(Status == "Li", "Live", "Dead")) %>%
filter(S.F == "S") %>%
group_by(Site, Year, Age, Plot) %>%
select(LiFol, DeFol, Li.1hr, De.1hr, Li.10hr, De.10hr, Li.100hr, De.100hr) %>%
summarise_all(sum) %>%
ungroup() %>%
mutate(sb_age = paste0(Year, Age))
aov.models = pfuel %>%
select (-c(Year, Age)) %>%
select(LiFol, DeFol, Li.1hr, De.1hr, Li.10hr, De.10hr, Li.100hr, De.100hr, Site, Plot, sb_age) %>%
map(~ aov(.x ~ sb_age + Site/Plot, data = pfuel))
When the aov.models runs I generate this error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
In addition: Warning message:
In model.response(mf, "numeric") : NAs introduced by coercion
I haven't gotten to the Tukey test yet, as I cannot get past the aov function. Any suggestions would be GREATLY appreciated!
You can find the data here: https://www.dropbox.com/s/yb8rh860fc7fff2/CFL.csv?dl=0
Thanks!
It may be easier to convert the data to long form, split by response, then fit models and feed the output to the HSD.test function, e.g.,
aov.models <- pfuel %>%
select(-Year, -Age) %>%
gather(variable, value, -sb_age, -Site, -Plot) %>%
split(.$variable) %>%
map(~ aov(value ~ sb_age + Site/Plot, data = .x)) %>%
map(HSD.test, trt = 'sb_age')
I also removed one of the select() statements, as it was selecting all of the columns.
#Z.Lin With your guidance I figured out a solution to the first part of my question. Probably not the most elegant, but it is at least working now! Any refinement would be welcomed, but thank you.
pfuel_var <- pfuel %>%
select(Site, Plot, sb_age) %>%
mutate(Site = as.factor(Site),
Plot = as.factor( Plot),
sb_age = as.factor(sb_age))
aov.models <- pfuel %>%
select(LiFol, DeFol, Li.1hr, De.1hr, Li.10hr, De.10hr, Li.100hr, De.100hr) %>%
map(~ aov(.x ~ pfuel_var$sb_age + pfuel_var$Site/pfuel_var$Plot, data = pfuel))
The second part of my question was how to feed this output into HSD.test from the agricolae package. Anyone have thoughts on that?
What I was thinking would be:
t <- aov.models %>%
map(~ HSD.test(.x, "pfuel_var$sb_age", alpha=0.1))
But that is not working properly. Thoughts very much appreciated.
Variants of this question have been asked a lot, I also read about NSE.
Still I cannot figure this out.
This is easy:
library(dplyr)
data(cars)
cars %>%
group_by(speed) %>%
summarise(d = mean(dist))
Now I want to use variable x to pass the dist column to mean
x <- "dist"
Of course this does not work:
cars %>%
group_by(speed) %>%
summarise(d = mean(x))
So I use SE version of summarise:
cars %>%
group_by(speed) %>%
summarise_(d = mean(x))
Ok, does not work, so I have to add ~ as well:
cars %>%
group_by(speed) %>%
summarise_(d = ~mean(x))
Still does not work, but if use dist instead of x:
cars %>%
group_by(speed) %>%
summarise_(d = ~mean(dist))
This works, but doesn't use x.
cars %>%
group_by(speed) %>%
summarise_(d = ~mean(~x))
This also doesn't work.
I'm basically monkeying around without any idea how to make this work, or why it fails.
cars %>%
group_by(speed) %>%
summarise_each_(funs(mean), vars(matches(x)))
This works great (see it as a solution for using list() instead of vars() here):
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(disp, hp), list(~weighted.mean(., wt)))
However, in a very similar situation using summarize_if(), it does not work:
mtcars %>%
group_by(cyl) %>%
summarize_if(is.numeric, list(~weighted.mean(., wt)))
Error in weighted.mean.default(., wt) :
'x' and 'w' must have the same length
Why?
I believe this has to do with what you are naming this new variable. This works:
mtcars %>%
group_by(cyl) %>%
summarize_if(is.numeric, list(tmp = ~weighted.mean(., wt)))
See the naming section here and issues that have been noted here for more details.