passing column name as variable in dplyr - r

Variants of this question have been asked a lot, and I have also read about NSE (non-standard evaluation).
Still I cannot figure this out.
This is easy:
library(dplyr)
data(cars)
cars %>%
group_by(speed) %>%
summarise(d = mean(dist))
Now I want to use the variable x to pass the dist column to mean:
x <- "dist"
Of course this does not work:
cars %>%
group_by(speed) %>%
summarise(d = mean(x))
So I use SE version of summarise:
cars %>%
group_by(speed) %>%
summarise_(d = mean(x))
Ok, does not work, so I have to add ~ as well:
cars %>%
group_by(speed) %>%
summarise_(d = ~mean(x))
Still does not work, but if I use dist instead of x:
cars %>%
group_by(speed) %>%
summarise_(d = ~mean(dist))
This works, but doesn't use x. Wrapping x in its own formula:
cars %>%
group_by(speed) %>%
summarise_(d = ~mean(~x))
This also doesn't work.
I'm basically monkeying around without any idea how to make this work, or why it fails.

cars %>%
group_by(speed) %>%
summarise_each_(funs(mean), vars(matches(x)))
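The attempts above fail because x evaluates to the string "dist" rather than to the column, so mean() receives a character vector. The underscore verbs and funs() are deprecated in current dplyr, so here is a minimal sketch of two documented alternatives, assuming dplyr 1.0 or later: the .data pronoun, and across() with all_of(). The object names by_pronoun and by_across are illustrative only.

```r
library(dplyr)

x <- "dist"

# Option 1: look the column up by its name through the .data pronoun
by_pronoun <- cars %>%
  group_by(speed) %>%
  summarise(d = mean(.data[[x]]))

# Option 2: select the column by name with all_of() inside across();
# the result column keeps the original name ("dist") here
by_across <- cars %>%
  group_by(speed) %>%
  summarise(across(all_of(x), mean))
```

Both should give the same per-speed means as summarise(d = mean(dist)); the second form extends naturally when x holds several column names.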

Related

How to get this function to work with the pipe in r?

I have created this function that quickly does some summarization operations (mean, median, geometric mean and arranges them in descending order). This is the function:
summarize_values <- function(tbl, variable){
  tbl %>%
    summarize(summarized_mean = mean({{variable}}),
              summarized_median = median({{variable}}),
              geom_mean = exp(mean(log({{variable}}))),
              n = n()) %>%
    arrange(desc(n))
}
I can do this and it works:
summarize_values(data, lifeExp)
However, I would like to be able to do this:
data %>%
select(year, lifeExp) %>%
summarize_values()
or something like this
data %>%
summarize_values(year, lifeExp)
What am I missing to make this work?
thanks
With the pipe, the left-hand side is supplied as the first argument (tbl), so only the column needs to be passed:
library(dplyr)
data %>%
summarize_values(lifeExp)
A reproducible example:
> mtcars %>%
    summarize_values(gear)
  summarized_mean summarized_median geom_mean  n
1          3.6875                 4  3.619405 32
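The question also asks about passing more than one column, as in summarize_values(year, lifeExp). One possible sketch, assuming dplyr 1.0 or later, forwards the columns through ... into across(). The function name summarize_many is hypothetical and not part of the original answer, and mtcars stands in for the asker's data.

```r
library(dplyr)

# Hypothetical variant of summarize_values() that accepts several columns
summarize_many <- function(tbl, ...) {
  tbl %>%
    summarize(
      # apply each summary function to every column passed through ...
      across(c(...), list(mean = mean, median = median),
             .names = "{.col}_{.fn}"),
      n = n()
    ) %>%
    arrange(desc(n))
}

out <- mtcars %>% summarize_many(gear, carb)
```

Each selected column gets one result column per function (for example gear_mean and gear_median), so the output stays a single row per group.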

grouped statistical test tidyverse

I'm trying to do a Wilcoxon test on long-formatted data. I want to use dplyr::group_by() to specify the subsets I'd like to do the test on.
The final result would be a new column with the p-value of the Wilcoxon test appended to the original data frame. All of the techniques I have seen require summarizing the data frame. I DO NOT want to summarize the data frame.
Please see an example reformatting the iris dataset to mimic my data, and finally my attempts to perform the task.
I am getting close, but I want to preserve all of my original data from before the Wilcoxon test.
# Reformatting Iris to mimic my data.
library(tidyverse)  # provides dplyr, tidyr::gather() and stringr::str_extract()
long_format <- iris %>%
gather(key = "attribute", value = "measurement", -Species) %>%
mutate(descriptor =
case_when(
str_extract(attribute, pattern = "\\.(.*)") == ".Width" ~ "Width",
str_extract(attribute, pattern = "\\.(.*)") == ".Length" ~ "Length")) %>%
mutate(Feature =
case_when(
str_extract(attribute, pattern = "^(.*?)\\.") == "Sepal." ~ "Sepal",
str_extract(attribute, pattern = "^(.*?)\\.") == "Petal." ~ "Petal"))
# Removing no longer necessary column.
cleaned_up <- long_format %>% select(-attribute)
# Attempt using do(), but I lose important info like "measurement"
cleaned_up %>%
group_by(Species, Feature) %>%
do(w = wilcox.test(measurement~descriptor, data=., paired=FALSE)) %>%
mutate(Wilcox = w$p.value)
# This is an attempt with the dplyr experimental group_map function. If only I could just make this a new column appended to the original df in one step.
cleaned_up %>%
group_by(Species, Feature) %>%
group_map(~ wilcox.test(measurement~descriptor, data=., paired=FALSE)$p.value)
Thanks for your help.
The model object can be wrapped in a list:
library(tidyverse)
cleaned_up %>%
group_by(Species, Feature) %>%
nest %>%
mutate(model = map(data, ~
.x %>%
transmute(w = list(wilcox.test(measurement~descriptor,
data=., paired=FALSE)))))
Another option is to group_split() the data into a list, then map over its elements, creating the 'pval' column in each after applying the test:
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
Another option is to avoid the data argument entirely. The wilcox.test function only requires a data argument when the variables being tested aren't in the calling scope, but functions called within mutate have all the columns from the data frame in scope.
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
Same as akrun's output (thanks to his correction in the comments above)
akrun <-
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
me <-
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
all.equal(akrun, me)
# [1] TRUE
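The same pattern can be checked on a built-in dataset. This is a minimal sketch, assuming ToothGrowth as stand-in data (it is not part of the question) and passing exact = FALSE only to avoid the ties warning. It computes one Wilcoxon p-value per dose group inside mutate(), so every original row is kept.

```r
library(dplyr)

# One p-value per dose group, repeated on every row of that group
with_pval <- ToothGrowth %>%
  group_by(dose) %>%
  mutate(pval = wilcox.test(len ~ supp, exact = FALSE)$p.value) %>%
  ungroup()

# Direct computation for a single group, for comparison
p_low <- wilcox.test(len ~ supp,
                     data = subset(ToothGrowth, dose == 0.5),
                     exact = FALSE)$p.value
```

The row count of with_pval matches ToothGrowth, and the pval values for dose 0.5 should equal p_low.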

Labels not parsed in Expss for loop

I'm new to R and trying to explore my variables by group. I'm using a for loop to pass all the relevant variable names to expss.
Here is a reproducible example:
require(expss)
require(dplyr)
colnoms <- as.data.frame(HairEyeColor) %>% names(.)
expss_digits(2)
for (i in colnoms){
as.data.frame(HairEyeColor) %>%
tab_cells(get(i)) %>%
tab_cols(Eye) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot() %>%
set_caption(i) %>%
htmlTable() %>%
print()
}
I expect the name of the variable in the output (Hair, Eye, Sex, Freq), but instead I only get "get(i)".
Thanks for any advice
After get() we cannot recover the original variable name. The simplest way to show the original name is to set the variable name as its label:
require(expss)
data(HairEyeColor)
HairEyeColor <- as.data.frame(HairEyeColor)
colnoms <- names(HairEyeColor)
expss_digits(2)
for (i in colnoms){
# if we don't have label we assign name as label
if(is.null(var_lab(HairEyeColor[[i]]))) var_lab(HairEyeColor[[i]]) = i
HairEyeColor %>%
tab_cells(get(i)) %>%
tab_cols(Eye) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot() %>%
set_caption(i) %>%
htmlTable() %>%
print()
}

R: dynamic variable name comparisons

I recoded a bunch of variables in a dataset and gave the newly recoded variables the prefix "r_".
I'd like to run table on the pairs to ensure the recoding was correct. Something like table(v1, r_v1), but I need to do it for lots of variables. They are not in any particular order, so I couldn't use indexing.
Here is a reproducible example of data one can use (also any tips on optimizing that code are appreciated!).
mtcars %>% select(c(disp,hp)) %>%
mutate_all(funs(if_else(.>100,1,0))) %>%
rename_(.dots=setNames(names(.), paste0('r_', names(.)))) %>%
cbind(mtcars,.)
Any ideas?
I would just use the variable names and a simple for loop. Calling your modified data dd:
orig = c("disp", "hp")
trans = paste0("r_", orig)
check_list = list()
for (i in seq_along(orig)) {
check_list[[i]] = table(dd[[orig[i]]], dd[[trans[i]]])
# or whatever other check you want to do
}
check_list
You can then examine the check_list contents one at a time.
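For a version that runs end to end, here is a minimal sketch assuming dplyr 1.0 or later. It builds dd with across() instead of the deprecated funs() and rename_(), and uses Map() in place of the for loop so the resulting list is named after the original variables. The dimension labels original and recoded are illustrative.

```r
library(dplyr)

# Recode disp and hp to 0/1 and store the results with an "r_" prefix
dd <- mtcars %>%
  mutate(across(c(disp, hp), ~ if_else(.x > 100, 1, 0), .names = "r_{.col}"))

orig  <- c("disp", "hp")
trans <- paste0("r_", orig)

# Cross-tabulate each original variable against its recoded version
check_list <- Map(
  function(o, r) table(original = dd[[o]], recoded = dd[[r]]),
  orig, trans
)
```

check_list$hp then holds the table for hp against r_hp, and likewise for disp.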
To keep things in the tidy format with which you started:
library(purrr)
library(tidyr)
mtcars %>%
select(disp,hp) %>%
mutate_all(funs(r = if_else(.>100,1,0))) %>%
mutate(index = row_number()) %>%
gather(key = key, value = value, -index) %>%
separate(key, c("Variable", "Type")) %>%
mutate(Type = ifelse(is.na(Type), "Original", "Recode")) %>%
spread(key = Type, value = value) %>%
select(-index) %>%
split(.$Variable) %>%
map(~ select(.,-Variable)) %>%
map(~ table(.))

bootstrap by group and calculate statistics

I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
mtcars %>%
group_by(cyl) %>%
do({
datsplit=crossv_mc(.,10)
mdls=list(map(datsplit$train, ~glm(hp~disp,data=.,family=gaussian(link='identity'))))
data_frame(datsplit=list(datsplit),mdls)
})
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object '.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to @aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
  group_by(cyl) %>%
  do({
    crossv_mc(., 10) %>%
      mutate(mdls = map(train, ~glm(hp ~ disp, data = .)),
             r2 = map2_dbl(mdls, test, rsquare),
             pctmae = map2_dbl(mdls, test, function(model, data) {
               mae(model, data) / mean(model$model$hp, na.rm = TRUE) * 100
             }))
  })
One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
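The nested map2 pattern is easier to see on plain lists. In this minimal sketch the lists models and tests are toy stand-ins (numbers rather than fitted models and resamples), and addition stands in for rsquare. The outer map2() pairs up the two lists group by group, and the inner map2_dbl() pairs up their elements.

```r
library(purrr)

# Toy stand-ins: two groups, each holding two "models" and two "test sets"
models <- list(a = list(1, 2), b = list(3, 4))
tests  <- list(a = list(10, 20), b = list(30, 40))

# Outer map2 walks the groups; inner map2_dbl walks the pairs within a group
res <- map2(models, tests, ~ map2_dbl(.x, .y, function(m, t) m + t))
```

The result is a list with one numeric vector per group, which is the same shape the rsq column has in the answer above.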
Here is another option that avoids the nested lists altogether:
mtcars %>%
split(.$cyl) %>%
map_df(crossv_mc, 10, .id = "cyl") %>%
mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
rsq = map2_dbl(models, test, rsquare))
@aosmith answered my question, but here is a simpler solution overall:
mtcars %>%
  group_by(cyl) %>%
  do({
    crossv_mc(., 10) %>%
      mutate(mdls = map(train, ~glm(hp ~ disp, data = .)),
             r2 = map2_dbl(mdls, test, rsquare),
             pctmae = map2_dbl(mdls, test, function(model, data) {
               mae(model, data) / mean(model$model$hp, na.rm = TRUE) * 100
             }))
  })
