I'm having a little trouble figuring out quasiquotation. Specifically, I have a function whose argument specifies which variable should go into a model, and that model is then fitted inside a purrr::map call.
I've been working from: https://dplyr.tidyverse.org/articles/programming.html
# libs
library(tidyverse)
library(broom)
# dummy data
df <- data.frame(
"a"=rep(c("alpha","beta"),50),
"b"=rnorm(100),
"value1"=rnorm(100),
"value2"=rnorm(100)
)
model <- function(var) {
  var <- enquo(var)
  df %>%
    group_by(a) %>%
    nest() %>%
    mutate(model = map(data, ~ lm(b ~ (!! var), data = .)))
}
model(value1)
> Error in mutate_impl(.data, dots) : Evaluation error: invalid model formula.
Putting the name in directly works as expected:
df %>%
group_by(a) %>%
nest() %>%
mutate(model=map(data, ~ lm(b ~ value1,data=.))) %>%
unnest(model %>% map(glance))
I can use !! var within a function:
modelX <- function(var,df=df) {
var <- enquo(var)
df %>%
select(!! var)
}
modelX(value1,df)
I'm assuming this has something to do with the fact that !! var refers to a value in the nested tibble data. I've been poking around with rlang::qq_show() but haven't been able to figure it out so far.
enquo() will attempt to track the environment of the symbol you pass in, but you don't really want that included in the formula you are passing to lm. It is better to capture the argument as a symbol rather than as a quosure. Try this:
model <- function(var) {
  var <- ensym(var)
  df %>%
    group_by(a) %>%
    nest() %>%
    mutate(model = map(data, ~ lm(b ~ !!var, data = .)))
}
This worked for me with dplyr 0.7.6 and purrr 0.2.5.
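If you would rather not unquote inside the formula at all, one alternative (not what the answer above does) is to build the formula up front with base R's reformulate() and pass the variable name as a string. This is only a sketch: fit_by_group and its dat argument are hypothetical names, and it assumes reasonably recent dplyr, tidyr and purrr.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
df <- data.frame(
  a = rep(c("alpha", "beta"), 50),
  b = rnorm(100),
  value1 = rnorm(100),
  value2 = rnorm(100)
)

# fit_by_group is a hypothetical helper; var is a column name given as a string
fit_by_group <- function(dat, var) {
  # reformulate() builds the formula b ~ <var> without any quasiquotation
  f <- reformulate(var, response = "b")
  dat %>%
    group_by(a) %>%
    nest() %>%
    mutate(model = map(data, ~ lm(f, data = .x)))
}

fits <- fit_by_group(df, "value1")
```

The trade-off is that the caller writes "value1" in quotes, but the formula handed to lm is then an ordinary formula with no quosure inside it.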
Related
I am trying to write a function that indexes variable names.
In particular, in my function I use mutate to encode a variable that I have without changing its name. Does anyone know how I can index a variable on the left-hand side of mutate?
Here is an example
library(tidyverse)
# first create relevant dataset
iris <- iris%>% group_by(Species) %>% mutate(mean_Length=mean(Sepal.Length))
# second create my function
userfunction <- function(var){
newdata <- iris %>%
select(mean_Length,{var}) %>% distinct() %>%
mutate(get(var)= # this is what causes my function to fail. How can i refer to the `var` here?
factor(get(var),get(var))) %>%
arrange(get(var)) #
return(newdata)
}
# this function produces the following error # Error: unexpected '}' in "}"
#note that if I change the reference to its original string the function works
userfunction2 <- function(var){
newdata <- iris %>%
select(mean_Length,{var}) %>% distinct() %>%
mutate(Species= # without reference it works, but I am unable to use the function for multiple variables.
factor(get(var),get(var))) %>%
arrange(get(var)) #
return(newdata)
}
encodedata<- userfunction2("Species")
Thanks a lot in advance for your help
Best
Here is a working example that goes into a similar direction as Limey's answer:
iris <- datasets::iris %>%
group_by(Species) %>%
mutate(mean_Length=mean(Sepal.Length)) %>%
ungroup()
userfunction <- function(var){
  iris %>%
    transmute(mean_Length, "temp" = iris[[var]]) %>%
    distinct() %>%
    mutate("{var}" := factor(temp)) %>%
    arrange(temp) %>%
    select(-temp)
}
userfunction("Petal.Length")
I don't think var is your problem. I think it's the =. If you have an enquoted variable on the left-hand side of the assignment (which is effectively what you have with get()), you need :=, not =.
See here for more details.
I would have written your function slightly differently:
userfunction <- function(data, var){
  qVar <- enquo(var)
  newdata <- data %>%
    select(mean_Length, !! qVar) %>% distinct() %>%
    mutate(!! qVar := factor(!! qVar, !! qVar)) %>%
    arrange(!! qVar)
  return(newdata)
}
The inclusion of the data parameter means you can include it in a pipe:
encodedata <- iris %>% userfunction(Species)
encodedata
# A tibble: 3 x 2
# Groups:   Species [3]
  mean_Length Species
        <dbl> <fct>
1        5.01 setosa
2        5.94 versicolor
3        6.59 virginica
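If you want to keep passing the column name as a string, as in the original question, the same := idea can be sketched with a glue-style name on the left-hand side and the .data pronoun on the right. This assumes a recent dplyr (1.0 or later); encode_column and iris2 are hypothetical names used only for illustration.

```r
library(dplyr)

# Same prepared data as in the question, under a new name
iris2 <- datasets::iris %>%
  group_by(Species) %>%
  mutate(mean_Length = mean(Sepal.Length)) %>%
  ungroup()

# encode_column is a hypothetical helper; var is a column name given as a string
encode_column <- function(data, var) {
  data %>%
    select(mean_Length, all_of(var)) %>%
    distinct() %>%
    # "{var}" := names the result after the string in var;
    # .data[[var]] looks up the column with that name
    mutate("{var}" := factor(.data[[var]])) %>%
    arrange(.data[[var]])
}

encoded <- encode_column(iris2, "Species")
```

Because nothing is captured with enquo() here, the function can be called in a loop over several column names.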
I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
  mtcars %>%
  group_by(cyl) %>%
  do({
    datsplit = crossv_mc(., 10)
    mdls = list(map(datsplit$train, ~glm(hp ~ disp, data = ., family = gaussian(link = 'identity'))))
    data_frame(datsplit = list(datsplit), mdls)
  })
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object
'.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to @aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
  group_by(cyl) %>%
  do({
    datplit = crossv_mc(., 10) %>%
      mutate(mdls = map(train, ~glm(hp ~ disp, data = .)),
             r2 = map2_dbl(mdls, test, rsquare),
             pctmae = map2_dbl(mdls, test, function(model, data) {mae(model, data) / mean(model$model$hp, na.rm = TRUE) * 100})
      )
  })
One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
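The map(datsplit, "test") call relies on purrr's shortcut where a string in place of a function extracts the element with that name from each list item. A toy sketch with made-up data (the nested list is purely illustrative):

```r
library(purrr)

# Two made-up "splits", each holding a train and a test element
nested <- list(
  list(train = 1:3, test = 4:5),
  list(train = 6:8, test = 9:10)
)

# A string instead of a function plucks the element of that name from each item
tests <- map(nested, "test")
```

The same extraction applied to the datsplit list column is what hands the inner map2_dbl its test sets.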
Here is another option that avoids the nested lists altogether:
mtcars %>%
  split(.$cyl) %>%
  map_df(crossv_mc, 10, .id = "cyl") %>%
  mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
         rsq = map2_dbl(models, test, rsquare))
@aosmith answered my question, but here is a simpler solution overall:
mtcars %>%
  group_by(cyl) %>%
  do({
    datplit = crossv_mc(., 10) %>%
      mutate(mdls = map(train, ~glm(hp ~ disp, data = .)),
             r2 = map2_dbl(mdls, test, rsquare),
             pctmae = map2_dbl(mdls, test, function(model, data) {mae(model, data) / mean(model$model$hp, na.rm = TRUE) * 100})
      )
  })
I would like to define functions similar to those in the broom package:
library(dplyr)
library(broom)
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
glance(model)
This works fine. But how do I define custom functions like
myglance <- function(x, ...) {
  s <- summary(x)
  ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
  ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance(model)
Error in eval(substitute(expr), data, enclos = parent.frame()) :
invalid 'envir' argument of type 'character'
glance works this way because the broom package defines a method for rowwise data frames here. If you were willing to bring in that whole .R file (along with the col_name utility from here), you could use my code to do the same thing:
myglance_df <- wrap_rowwise_df(wrap_rowwise_df_(myglance))
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance_df(model)
There's also a workaround that doesn't require adding so much code from broom: change the class of each of your models, and define your own glance function on that class.
glance.mylm <- function(x, ...) {
  s <- summary(x)
  ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
  ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
mutate(model = list(structure(model, class = c("mylm", class(model))))) %>%
glance(model)
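The class trick works because S3 dispatch picks the method for the first matching class, so prepending a class lets your own method win while the object still behaves as an lm everywhere else. A minimal base R sketch of that mechanism, using a hypothetical generic called describe (not part of broom):

```r
# describe is a hypothetical generic, defined here only to show dispatch
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"
describe.mylm <- function(x, ...) "custom method"

fit <- lm(mpg ~ wt, data = mtcars)

# Prepend the custom class; "lm" stays in the class vector
tagged <- structure(fit, class = c("mylm", class(fit)))

plain_result <- describe(fit)
tagged_result <- describe(tagged)
```

Here describe(fit) falls through to the default method, while describe(tagged) hits describe.mylm, which mirrors what glance.mylm does in the answer above.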
Finally, you also have the option of performing myglance on the model right away.
mtcars %>%
group_by(am) %>%
do(myglance(lm(mpg ~ wt, .)))
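In later dplyr releases do() has been superseded, so the "right away" variant could also be sketched with group_modify(). This assumes dplyr 0.8.1 or later and reuses the myglance definition from the question.

```r
library(dplyr)

# myglance as defined in the question
myglance <- function(x, ...) {
  s <- summary(x)
  with(s, data.frame(r2 = adj.r.squared, a = coefficients[1], b = coefficients[2]))
}

# group_modify() calls the function once per group and row-binds the results,
# putting the grouping column back in front
res <- mtcars %>%
  group_by(am) %>%
  group_modify(~ myglance(lm(mpg ~ wt, data = .x)))
```

The result has one row per value of am, with the r2, a and b columns next to it.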
Here is my take on how it would work. The approach is:
1. Extract the appropriate column from the data frame. (My solution is based on this answer; there must be a better way, and I hope someone will correct me!)
2. Run lapply on the result and construct the variables that you wanted in the myglance function you have above.
3. Run do.call with rbind to return a data.frame.
myglance <- function(df, ...) {
  # step 1
  s <- collect(select(df, ...))[[1]] # based on this answer: https://stackoverflow.com/a/21629102/1992167
  # step 2
  lapply(s, function(x) {
    data.frame(r2 = summary(x)$adj.r.squared,
               a = summary(x)$coefficients[1],
               b = summary(x)$coefficients[2])
  }) %>% do.call(rbind, .) # step 3
}
Output:
> mtcars %>%
+ group_by(am) %>%
+ do(model = lm(mpg ~ wt, .)) %>%
+ myglance(model)
         r2        a         b
1 0.5651357 31.41606 -3.785908
2 0.8103194 46.29448 -9.084268
I'm very new to R. I'm trying to define a function that groups a data set (group_by) and then creates summary statistics based on the groupings (dplyr, summarise_each).
Without defining a function the following works:
sum_stat <- data %>%
group_by(column) %>%
summarise_each(funs(mean), var1:var20)
The following function form does not work:
sum_stat <- function(data, column){
  data %>%
    group_by(column) %>%
    summarise_each(funs(mean), var1:var20)
}
sum_stat(data, column)
The error message returned is:
Error: unknown column 'column'
This is the usual way you'd do this:
foo <- function(data,column){
  data %>%
    group_by_(.dots = column) %>%
    summarise_each(funs(mean))
}
foo(mtcars,"cyl")
foo(mtcars,"gear")
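Note that group_by_() and summarise_each() were deprecated in later dplyr releases. With dplyr 1.0 or later, the same idea could be sketched with across() and all_of(); group_means is a hypothetical name, and the column is still passed as a string.

```r
library(dplyr)

# group_means is a hypothetical helper; column is a column name given as a string
group_means <- function(data, column) {
  data %>%
    group_by(across(all_of(column))) %>%
    # across(everything(), mean) averages every non-grouping column
    summarise(across(everything(), mean), .groups = "drop")
}

res_cyl <- group_means(mtcars, "cyl")
res_gear <- group_means(mtcars, "gear")
```

As with foo above, the grouping column comes first in the result, followed by the per-group means of the remaining columns.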
I'm trying to build some functions for creating standard tables from a questionnaire, using dplyr for the data manipulation. This question was very helpful for group_by: it showed how to pass arguments (in this case, the name of the variable I want to use to make the table) through (...). That approach seems to break down when I try to pass the same arguments to other dplyr commands, specifically select and filter. The error message I get is "'...' used in an incorrect context".
Does anyone have any ideas on this? Thank you
For the sake of completeness (and any other hints - I'm very new to writing functions), here is the code I would like to use:
myTable <- function(x, ...) {
  df <-
    x %>%
    group_by(Var1, ...) %>%
    filter(!is.na(...) & ... != '') %>% # To remove missing values: Not working!
    summarise(value = n()) %>%
    group_by(Var1) %>%
    mutate(Tot = sum(value)) %>%
    group_by(Var1, ...) %>%
    summarise(num = sum(value), total = sum(Tot), proportion = num/total*100) %>%
    select(Var1, ..., proportion) # To select desired columns: Not working!
  tab <- dcast(df, Var1 ~ ..., value.var = 'proportion')
  tab[is.na(tab)] <- 0
  print(tab)
}