Use of other columns as arguments to function in summarize_if()


This works great (it was suggested here as a way to use list() instead of the deprecated funs()):
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(disp, hp), list(~weighted.mean(., wt)))
However, in a very similar situation using summarize_if(), it does not work:
mtcars %>%
group_by(cyl) %>%
summarize_if(is.numeric, list(~weighted.mean(., wt)))
Error in weighted.mean.default(., wt) :
'x' and 'w' must have the same length
Why?

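I believe this has to do with how the new variables are named. With an unnamed function, each summary overwrites its source column, so once wt itself has been reduced to a single value per group, the columns summarized after it are weighted by a length-one wt and the lengths no longer match. Naming the function gives the results a suffix (wt_tmp and so on) and leaves wt intact. This works: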
mtcars %>%
group_by(cyl) %>%
summarize_if(is.numeric, list(tmp = ~weighted.mean(., wt)))
See the naming section here and issues that have been noted here for more details.
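Another way to avoid the clash, sketched here on the assumption that dplyr 1.0.0 or later is installed, is to keep wt out of the summarized columns altogether so it is never overwritten. across() and where() are the newer dplyr interface rather than something the original answer used, and res is just an illustrative name:

```r
library(dplyr)

# Summarize every numeric column except wt, so wt keeps its
# per-row values and can still serve as the weight vector.
res <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(where(is.numeric) & !wt, ~ weighted.mean(.x, wt)))

res
```

The grouping column cyl is skipped by across() automatically, so only wt needs to be excluded by hand.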

Related

How to reuse parts of long chain of pipe operators in R?

I have a set of chains of pipe operators (%>%) doing different things with different datasets.
For instance:
dataset %>%
mutate(...) %>%
filter(...) %>%
rowwise() %>%
summarise() %>%
etc...
If I want to reuse some parts of these chains, is there a way to do it, without just wrapping it into a function?
For instance (in pseudocode obviously):
subchain <- filter(...) %>%
rowwise() %>%
summarise()
# and then instead of the chain above it would be:
dataset %>%
mutate(...) %>%
subchain() %>%
etc...

Similar in syntax to desired pseudo-code:
library(dplyr)
subchain <- . %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld)
all.equal(
starwars %>%
group_by(gender) %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld),
starwars %>%
group_by(gender) %>%
subchain()
)
Starting a pipe sequence with a dot (.) creates what magrittr calls a functional sequence. In effect this is close to wrapping the steps in a function. See ?functions and try magrittr::functions(subchain)
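A smaller sketch of the same idea, using magrittr's own arithmetic aliases add() and multiply_by(); the name add_then_double is made up for illustration:

```r
library(magrittr)

# A functional sequence: the leading dot means "no input yet",
# so the result is a function of one argument.
add_then_double <- . %>% add(1) %>% multiply_by(2)

add_then_double(3)   # (3 + 1) * 2 = 8

# Inspect the individual steps stored in the sequence.
steps <- functions(add_then_double)
length(steps)        # 2
```

Because the result is an ordinary function, it can be dropped into any later pipe as a single step.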

Count causing Argument 'x' must be a vector: list

I posted a question previously (here), and for some reason when I run the code now:
mtcars %>%
filter(gear == 4) %>%
select(vs, am) %>%
pivot_longer(everything()) %>%
count(name, value) %>%
mutate(perc = n/sum(n) * 100)
It is now returning:
Error in count(., name, value) : Argument 'x' must be a vector: list
It was functional just a month ago, so I am baffled as to what is causing this.

Most probably, this is a case of masking: a function with the same name from a different package got loaded accidentally and is being called instead of the dplyr one. If we use :: to specify the package, it should work:
mtcars %>%
dplyr::filter(gear == 4) %>%
dplyr::select(vs, am) %>%
tidyr::pivot_longer(everything()) %>%
dplyr::count(name, value) %>%
dplyr::mutate(perc = n/sum(n) * 100)
Here, we used :: on each function because select/filter/mutate/count are found in more than one package.
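To check whether masking is really the cause before rewriting the pipeline, one option is to ask R where a name resolves. This is a minimal sketch using filter as the example, since stats also defines it; the variable names are illustrative:

```r
library(dplyr)

# Every attached package that defines an object called "filter";
# the first entry is the one a bare call to filter() will use.
where_found <- find("filter")
where_found

# The namespace that the currently visible filter() belongs to.
owner <- environmentName(environment(filter))
owner
```

Running the same two calls with "count" in a session that shows the error should reveal which package is shadowing dplyr::count.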

Replacing group_by_ with group_by when the argument is a string in dplyr

I have some code that specifies a grouping variable as a string.
group_var <- "cyl"
My current code for using this grouping variable in a dplyr pipeline is:
mtcars %>%
group_by_(group_var) %>%
summarize(mean_mpg = mean(mpg))
My best guess as to how to replace the deprecated group_by_ function with group_by is:
mtcars %>%
group_by(!!as.name(group_var)) %>%
summarize(mean_mpg = mean(mpg))
This works but is not explicitly mentioned in the programming with dplyr vignette.
Is using !!as.name() the preferred way to replace group_by_() with group_by()?

Is this within a function? If not, I think the !!as.name() part is unnecessary, and I would stick with the group_by_at(group_var) suggestion by @aosmith for simplicity's sake. Within a function, I would set it up like so:
examplr <- function(data, group_var){
group_var <- as.name(group_var)
data %>%
group_by(!!group_var) %>%
summarize(mean_mpg = mean(mpg))
}
examplr(data = mtcars,
group_var = "cyl")
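For comparison, here is a sketch of the same function without !! at all, assuming dplyr 1.0.0 or later, where across() together with all_of() accepts a character vector of column names. The function name examplr2 is hypothetical:

```r
library(dplyr)

# group_var is a string (or character vector) naming the grouping column(s).
examplr2 <- function(data, group_var) {
  data %>%
    group_by(across(all_of(group_var))) %>%
    summarize(mean_mpg = mean(mpg))
}

out <- examplr2(mtcars, "cyl")
out
```

all_of() raises an error if the named column does not exist, which makes typos in the string fail early.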

Using nest and purrr::map outside of mutate

Let's say I want to split mtcars into 3 csv files based on their cyl grouping. I can use mutate to do this, but it will create a NULL column in the output.
library(tidyverse)
by_cyl = mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
mutate(unused = map2(data, cyl, function(x, y) write.csv(x, paste0(y, '.csv'))))
Is there a way to do this on the by_cyl object without calling mutate?

Here is an option using purrr without mutate from dplyr.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update
This drops the cyl column before saving the output.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
map(~ .x %>% select(-cyl)) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update2
library(tidyverse)
by_cyl <- mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x[["data"]][[1]], paste0(.y, '.csv')))

Here's a solution with do and group_by, so if your data is already grouped as it should be, you save one line:
mtcars %>%
group_by(cyl) %>%
do(data.frame(write.csv(.,paste0(.$cyl[1],".csv"))))
data.frame is only used here because do needs to return a data.frame (write.csv itself returns NULL), so it's a little hack.
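Since the nested by_cyl object already holds the data and the cyl values in parallel columns, those two columns can also be passed straight to walk2(). This is a sketch rather than one of the posted answers, and it writes to tempdir() (an assumption made here so the example leaves the working directory untouched):

```r
library(dplyr)
library(tidyr)
library(purrr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest()

out_dir <- tempdir()  # illustrative output location

# walk2() is called only for its side effect, so nothing is added to by_cyl.
walk2(by_cyl$data, by_cyl$cyl,
      ~ write.csv(.x, file.path(out_dir, paste0(.y, ".csv"))))
```

Each file contains the rows for one cyl value without the cyl column itself, because nest() moves the grouping column out of the nested data.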

bootstrap by group and calculate statistics

I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
mtcars %>%
group_by(cyl) %>%
do({
datsplit=crossv_mc(.,10)
mdls=list(map(datsplit$train, ~glm(hp~disp,data=.,family=gaussian(link='identity'))))
data_frame(datsplit=list(datsplit),mdls)
})
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object
'.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to @aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
  group_by(cyl) %>%
  do({
    # 10 Monte Carlo train/test splits per group
    datplit = crossv_mc(., 10) %>%
      mutate(mdls = map(train, ~glm(hp ~ disp, data = .)),
             r2 = map2_dbl(mdls, test, rsquare),
             # mean absolute error as a percentage of the mean training response
             pctmae = map2_dbl(mdls, test, function(model, data) {mae(model, data) / mean(model$model$hp, na.rm = T) * 100})
      )
  })

One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
Here is another option that avoids the nested lists altogether:
mtcars %>%
split(.$cyl) %>%
map_df(crossv_mc, 10, .id = "cyl") %>%
mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
rsq = map2_dbl(models, test, rsquare))

@aosmith answered my question, but here is a simpler solution overall:
mtcars %>%
  group_by(cyl) %>%
  do({
    # 10 Monte Carlo train/test splits per group
    datplit = crossv_mc(., 10) %>%
      mutate(mdls = map(train, ~glm(hp ~ disp, data = .)),
             r2 = map2_dbl(mdls, test, rsquare),
             # mean absolute error as a percentage of the mean training response
             pctmae = map2_dbl(mdls, test, function(model, data) {mae(model, data) / mean(model$model$hp, na.rm = T) * 100})
      )
  })
