This works great (see it as a solution for using list() instead of funs() here):
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(disp, hp), list(~weighted.mean(., wt)))
However, in a very similar situation using summarize_if(), it does not work:
mtcars %>%
group_by(cyl) %>%
summarize_if(is.numeric, list(~weighted.mean(., wt)))
Error in weighted.mean.default(., wt) :
'x' and 'w' must have the same length
Why?
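I believe this has to do with how the new variables are named. Without a name, each summary overwrites its input column, so once wt itself has been collapsed to a single value per group, the columns that come after it no longer match it in length. Naming the function writes the results to new columns (disp_tmp, hp_tmp, ...) and leaves wt intact. This works: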
mtcars %>%
group_by(cyl) %>%
summarize_if(is.numeric, list(tmp = ~weighted.mean(., wt)))
See the naming section here and issues that have been noted here for more details.
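A minimal sketch of the mechanism that seems to be at play, written with plain summarise() rather than summarize_if(). The output names wt_avg and qsec_avg are made up for illustration. Within a single summarise() call, later expressions see the columns created by earlier ones, so reusing the name wt replaces the weights with one value per group:

```r
library(dplyr)

# Reusing the name `wt` overwrites the weights with one value per group,
# so the next expression pairs a full-length column with a length-1 `wt`.
overwritten <- tryCatch(
  mtcars %>%
    group_by(cyl) %>%
    summarise(wt = weighted.mean(wt, wt),
              qsec = weighted.mean(qsec, wt)),
  error = function(e) e
)
inherits(overwritten, "error")

# Writing to new names leaves the original `wt` column untouched.
renamed <- mtcars %>%
  group_by(cyl) %>%
  summarise(wt_avg = weighted.mean(wt, wt),
            qsec_avg = weighted.mean(qsec, wt))
renamed
```

This is only an illustration of the overwrite behaviour, not a trace of what summarize_if() does internally.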
I have a set of chains of pipe operators (%>%) doing different things with different datasets.
For instance:
dataset %>%
mutate(...) %>%
filter(...) %>%
rowwise() %>%
summarise() %>%
etc...
If I want to reuse some parts of these chains, is there a way to do it, without just wrapping it into a function?
For instance (in pseudocode obviously):
subchain <- filter(...) %>%
rowwise() %>%
summarise()
# and then instead of the chain above it would be:
dataset %>%
mutate(...) %>%
subchain() %>%
etc...
Similar in syntax to desired pseudo-code:
library(dplyr)
subchain <- . %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld)
all.equal(
starwars %>%
group_by(gender) %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld),
starwars %>%
group_by(gender) %>%
subchain()
)
This works by using a dot (.) as the start of the piping sequence. In effect it is close to wrapping the steps in a function; magrittr calls the result a functional sequence. See ?functions and try magrittr::functions(subchain).
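A small sketch of what a functional sequence looks like when inspected, using base R steps on mtcars so it runs without dplyr. The name count_big is hypothetical:

```r
library(magrittr)

# A pipeline that starts with `.` is stored as a function
# instead of being evaluated immediately.
count_big <- . %>% subset(mpg > 25) %>% nrow()

is.function(count_big)
class(count_big)

# functions() returns the individual steps of the sequence as a list.
steps <- functions(count_big)
length(steps)

# The sequence can be applied like any other function.
count_big(mtcars)
```

Because the result is an ordinary function, it can also be used as one step inside a longer pipe, as subchain() is above.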
I posted a question previously here, and for some reason if I run the code now:
mtcars %>%
filter(gear == 4) %>%
select(vs, am) %>%
pivot_longer(everything()) %>%
count(name, value) %>%
mutate(perc = n/sum(n) * 100)
It is now returning:
Error in count(., name, value) : Argument 'x' must be a vector: list
It was functional just a month ago, so I am baffled as to what is causing this.
Most probably this is a case of masking: a function with the same name from a different package got loaded accidentally and is being called instead. If we use :: to specify the package, it should work:
mtcars %>%
dplyr::filter(gear == 4) %>%
dplyr::select(vs, am) %>%
tidyr::pivot_longer(everything()) %>%
dplyr::count(name, value) %>%
dplyr::mutate(perc = n/sum(n) * 100)
Here, we used :: on each function because select/filter/mutate/count are found in more than one package.
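To check whether masking is actually involved, base R can report where a name is defined. This is a sketch using filter as the example name; which package masked count in the original session is not known from the error alone:

```r
library(dplyr)

# find() lists every attached package that defines a name, in search-path
# order; the first entry is the one a bare call would pick up.
find("filter")

# environmentName() reports which namespace the visible function belongs to.
environmentName(environment(filter))

# conflicts() lists names that are defined in more than one attached place.
"filter" %in% conflicts()
```

Running find("count") in the failing session should show whether another attached package sits ahead of dplyr.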
I have some code that specifies a grouping variable as a string.
group_var <- "cyl"
My current code for using this grouping variable in a dplyr pipeline is:
mtcars %>%
group_by_(group_var) %>%
summarize(mean_mpg = mean(mpg))
My best guess as to how to replace the deprecated group_by_ function with group_by is:
mtcars %>%
group_by(!!as.name(group_var)) %>%
summarize(mean_mpg = mean(mpg))
This works but is not explicitly mentioned in the programming with dplyr vignette.
Is using !!as.name() the preferred way to replace group_by_() with group_by()?
Is this within a function? If not, I think the !!as.name() part is unnecessary and I would stick with the group_by_at(group_var) suggestion by @aosmith for simplicity's sake. If it is, I would set it up like so:
examplr <- function(data, group_var){
group_var <- as.name(group_var)
data %>%
group_by(!!group_var) %>%
summarize(mean_mpg = mean(mpg))
}
examplr(data = mtcars,
group_var = "cyl")
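For comparison, here is a sketch of another spelling, assuming dplyr 1.0.0 or later where across() is available. It is offered as an alternative, not as the documented replacement for group_by_(). The names by_across and by_symbol are made up for illustration:

```r
library(dplyr)

group_var <- "cyl"

# all_of() looks the column up by the value of the string.
by_across <- mtcars %>%
  group_by(across(all_of(group_var))) %>%
  summarize(mean_mpg = mean(mpg))

# The same summary, built by unquoting a symbol made from the string.
by_symbol <- mtcars %>%
  group_by(!!as.name(group_var)) %>%
  summarize(mean_mpg = mean(mpg))

all.equal(by_across, by_symbol)
```

The across(all_of()) form also accepts a character vector with several column names, which the as.name() form does not.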
Let's say I want to split mtcars into 3 csv files based on their cyl grouping. I can use mutate to do this, but it will create a NULL column in the output.
library(tidyverse)
by_cyl = mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
mutate(unused = map2(data, cyl, function(x, y) write.csv(x, paste0(y, '.csv'))))
Is there a way to do this on the by_cyl object without calling mutate?
Here is an option using purrr without mutate from dplyr.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update
This drops the cyl column before saving the output.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
map(~ .x %>% select(-cyl)) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update2
library(tidyverse)
by_cyl <- mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x[["data"]][[1]], paste0(.y, '.csv')))
Here's a solution with do and group_by, so if your data is already grouped as it should be, you save one line:
mtcars %>%
group_by(cyl) %>%
do(data.frame(write.csv(.,paste0(.$cyl[1],".csv"))))
data.frame is only used here because do needs to return a data.frame: write.csv returns NULL, and wrapping it in data.frame() yields an empty data frame, so it's a little hack.
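A sketch of the same idea without the data.frame() workaround, assuming a version of dplyr that provides group_walk(). The out_dir variable is a hypothetical output location added for illustration so the files land in a temporary directory:

```r
library(dplyr)

out_dir <- tempdir()  # hypothetical output location, chosen for illustration

mtcars %>%
  group_by(cyl) %>%
  # .x holds the rows of one group (grouping column dropped by default),
  # .y is a one-row tibble holding that group's key.
  group_walk(~ write.csv(.x, file.path(out_dir, paste0(.y$cyl, ".csv"))))

list.files(out_dir, pattern = "\\.csv$")
```

group_walk() is called for its side effects and returns its input invisibly, so nothing has to be wrapped to satisfy a return type.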
I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
mtcars %>%
group_by(cyl) %>%
do({
datsplit=crossv_mc(.,10)
mdls=list(map(datsplit$train, ~glm(hp~disp,data=.,family=gaussian(link='identity'))))
data_frame(datsplit=list(datsplit),mdls)
})
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object
'.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to @aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
group_by(cyl) %>%
do({
# the value of this assignment is what do() receives
datsplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare),
# mean absolute error as a percentage of the mean hp in the training data
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=TRUE)*100})
)
})
One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
Here is another option that avoids the nested lists altogether:
mtcars %>%
split(.$cyl) %>%
map_df(crossv_mc, 10, .id = "cyl") %>%
mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
rsq = map2_dbl(models, test, rsquare))
@aosmith answered my question, but here is a simpler solution overall:
mtcars %>%
group_by(cyl) %>%
do({
# the value of this assignment is what do() receives
datsplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare),
# mean absolute error as a percentage of the mean hp in the training data
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=TRUE)*100})
)
})