Dplyr and multiple t test (keeping the same IV)

I'm using this pretty nice code to perform a multiple t.test keeping the independent variable constant!
data(mtcars)
library(dplyr)
vars_to_test <- c("disp","hp","drat","wt","qsec")
iv <- "vs"
mtcars %>%
summarise_each_(
funs_(
sprintf("stats::t.test(.[%s == 0], .[%s == 1])$p.value",iv,iv)
),
vars = vars_to_test)
Unfortunately, dplyr was updated and I now get this deprecation message:
summarise_each() is deprecated. Use summarise_all(),
summarise_at() or summarise_if() instead. To map funs over a
selection of variables, use summarise_at()
When I change the code to use summarise_all(), summarise_at() or summarise_if(), it no longer works. I'm looking for some advice, and thanks very much for your support.
Thanks

Instead of building a string expression with sprintf and then evaluating it, we can convert 'vs' to a symbol with rlang::sym and unquote it with !! so that it is evaluated as a column:
library(dplyr)
mtcars %>%
summarise_at(vars(vars_to_test), funs(
try(stats::t.test(.[(!! rlang::sym(iv)) == 0], .[(!! rlang::sym(iv)) == 1])$p.value)
))
# disp hp drat wt qsec
#1 2.476526e-06 1.819806e-06 0.01285342 0.0007281397 3.522404e-06
If we really wanted to parse an expression from a string, we can use rlang::parse_expr and rlang::eval_tidy along with sym:
library(rlang)
eval_tidy(parse_expr("mtcars %>% summarise_at(vars(vars_to_test),
funs(t.test(.[(!!sym(iv))==0],
.[(!!sym(iv))==1])$p.value ))"))
# disp hp drat wt qsec
#1 2.476526e-06 1.819806e-06 0.01285342 0.0007281397 3.522404e-06
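Since funs() was itself deprecated in later dplyr releases and the scoped verbs were superseded by across(), the same idea can be sketched with across(). This is a minimal sketch assuming dplyr 1.0.0 or later; the helper vector grp and the result name p_values are illustrative names, not part of the original answer.

```r
library(dplyr)

vars_to_test <- c("disp", "hp", "drat", "wt", "qsec")
iv <- "vs"

# Pull the grouping column out once so the lambda can index with it directly
# (grp is a hypothetical helper name)
grp <- mtcars[[iv]]

# One Welch t-test per selected column, splitting each column by the IV
p_values <- mtcars %>%
  summarise(across(
    all_of(vars_to_test),
    ~ t.test(.x[grp == 0], .x[grp == 1])$p.value
  ))

p_values
```

Extracting the grouping column beforehand avoids any unquoting inside the lambda, at the cost of assuming the data frame is not regrouped or filtered between the two steps.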

Related

How to add name to vector before creating a dataframe of means sorted by group and variable

Thanks for looking at this!
I want a function to build tables showing stats (such as the mean) for specific variables segregated into groups.
Below is a start of a function that works up to a point! I use an example using the built in data for mtcars.
MeansbyGroup<-function(var){
M1<-mtcars %>% group_by(cyl)
n1=deparse(substitute(var))
r1<-transpose(M1 %>% summarise(disp=mean(var)))[2,]
}
# EXAMPLE using mtcars
df=MeansbyGroup(mtcars$disp)
df[nrow(df) + 1,] =MeansbyGroup(mtcars$drat)
df
# The above will output
V1 V2 V3
2 230.721875 230.721875 230.721875
2.1 3.596563 3.596563 3.596563
#which is not even the right means!
#below are the correct values...but I can't automate a table like I want
M1<-mtcars %>% group_by(cyl)
transpose(M1 %>% summarise(disp=mean(disp)))[2,]
transpose(M1 %>% summarise(disp=mean(drat)))[2,]
## Here is my desired output of means disaggregated into columns by the group "cyl"
## if the function worked right with the above example
V1 V2 V3
disp 105.1364 183.3143 353.1
drat 4.070909 3.585714 3.229286
As you will see, in the function I have "n1=deparse(substitute(var))" to capture the variable name which I would like to have in the first column, instead of 2 and 2.1 as shown in the example output.
I've tried a few techniques, but when I try to add n1 to the vector, it destroys the values of the means!
Also, I'd like to make the function more generalizable. For this example, I'd prefer the function call to look like MeansbyGroup(var,group,dataframe), which in the above example would be called by MeansbyGroup(disp,cyl,mtcars).
Thanks!

Here's how I would code your table outside of a function:
library(dplyr)
library(tibble)
mtcars %>%
group_by(cyl) %>%
summarize(across(c(disp, drat), mean)) %>%
column_to_rownames("cyl") %>%
t
# 4 6 8
# disp 105.136364 183.314286 353.100000
# drat 4.070909 3.585714 3.229286
Using across is quite nice if you might have multiple variables. To put this inside a function, we need deparse(substitute()) because column_to_rownames requires a string argument for the column. For the other arguments we can use the friendly {{ operator:
foo = function(data, group, vars) {
grp_name = deparse(substitute(group))
data %>%
group_by({{group}}) %>%
summarize(across({{vars}}, mean)) %>%
column_to_rownames(grp_name) %>%
t
}
foo(data = mtcars, group = cyl, vars = c(disp, drat))
# 4 6 8
# disp 105.136364 183.314286 353.100000
# drat 4.070909 3.585714 3.229286

Mapping pipes to multiple columns in tidyverse

I'm working with a table for which I need to count the number of rows satisfying some criterion and I ended up with basically multiple repetitions of the same pipe differing only in the variable name.
Say I want to know how many cars are better than Valiant in mtcars on each of the variables there. An example of the code with two variables is below:
library(tidyverse)
reference <- mtcars %>%
slice(6)
mpg <- mtcars %>%
filter(mpg > reference$mpg) %>%
count() %>%
pull()
cyl <- mtcars %>%
filter(cyl > reference$cyl) %>%
count() %>%
pull()
tibble(mpg, cyl)
Except, suppose I need to do it for something like 100 variables, so there must be a better way than repeating the process by hand.
What would be the way to rewrite the code above more concisely (maybe using map() or anything else that works nicely with pipes), so that the result is a tibble with the counts for all the variables in mtcars?
I feel the solution should be very easy but I'm stuck.
Thank you!

Or:
library(tidyverse)
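# Compare every value in a column with the value in row 6 (Valiant)
map_dfc(mtcars, ~sum(.x[6] < .x))
# Equivalent, using the one-row `reference` data frame from the question
map2_dfc(mtcars, reference, ~sum(.y < .x))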

You could use summarise + across to count observations greater than a certain value in each column.
library(dplyr)
mtcars %>%
summarise(across(everything(), ~ sum(. > .[6])))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18 14 15 22 30 11 1 0 13 17 25
A base R solution:
# (1)
colSums(mtcars > mtcars[rep(6, nrow(mtcars)), ])
# (2)
colSums(sweep(as.matrix(mtcars), 2, mtcars[6, ], ">"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 18 14 15 22 30 11 1 0 13 17 25

You can do it in a loop for example. Like this:
library(tidyverse)
reference <- mtcars %>%
slice(6)
# Empty list to save outcome
list_outcome <- list()
# Get the columnnames to loop over
loop_var <- colnames(reference)
for(i in loop_var){
nr <- mtcars %>%
filter(mtcars[, i] > reference[, i]) %>%
count() %>%
pull()
# Save every iteration in the loop as the ith element of the list
list_outcome[[i]] <- data.frame(Variable = i, Value = nr)
}
# combine all the data frames in the list to one final data frame
df_result <- do.call(rbind, list_outcome)

Feeding vector into summarise_at

I'm sure it is something obvious since I'm an R novice, but I cannot figure out why the first approach is working while the second is not. Something is wrong with my use of "paste".
library(dplyr)
data(mtcars)
characteristics <- c('disp', 'hp')
summarise_at(df, .vars = vars(characteristics), mean)
characteristics <- paste('disp hp', collapse = ",")
summarise_at(df, .vars = vars(characteristics), mean)

If you want to summarise over disp and hp of mtcars, why not use a simpler and more straightforward approach, like so?
mtcars %>%
summarise(across(c('disp', 'hp'), mean))
disp hp
1 230.7219 146.6875
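Of course, you can also 'feed' your vector into the across operation (wrapping it in all_of() tells dplyr that it is an external character vector of column names):
characteristics <- c('disp', 'hp')
mtcars %>%
summarise(across(all_of(characteristics), mean))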
disp hp
1 230.7219 146.6875
Using summarise(across(...)) also takes into account that the so-called scoped dplyr verbs have now essentially been superseded by across().

With help from a friend, I found the answer.
library('dplyr')
library('stringr') # provides str_split()
data(mtcars)
characteristics <- unlist(str_split('disp hp', ' '))
# the line above replaced characteristics <- paste('disp hp', collapse = ",")
summarise_at(mtcars, .vars = vars(characteristics), mean)
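To see why the paste() version fails, it may help to inspect what each call returns. The sketch below uses base strsplit() rather than stringr, and the names joined and split_names are illustrative only.

```r
library(dplyr)

# paste() with collapse only joins the elements of its input vector;
# a single string has nothing to join, so it comes back unchanged
joined <- paste("disp hp", collapse = ",")

# strsplit() breaks the string into one element per column name
split_names <- strsplit("disp hp", " ")[[1]]

length(joined)       # one element: no column is called "disp hp"
length(split_names)  # two elements: "disp" and "hp"

# The split vector can then be used to select columns
summarise(mtcars, across(all_of(split_names), mean))
```

In other words, the column selection needs a character vector with one name per element, not a single string that merely looks like a list of names.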

Can't rename columns with dplyr

I'm trying to rename columns in data frame from Characteristics..genotype. to genotype and from Characteristics..age. to age:
pData(raw_data) %>%
rename(
age = Characteristics..age.,
genotype = Characteristics..genotype.
)
I get the following error:
Error in rename(., age = Characteristics..age., genotype = Characteristics..genotype.) : object 'Characteristics..age.' not found
Which doesn't make sense since columns exist in the data frame:
pData(raw_data)$Characteristics..genotype.
Output of the above:
[1] N171-HD82Q N171-HD82Q N171-HD82Q wt wt wt N171-HD82Q N171-HD82Q N171-HD82Q wt wt
[12] wt N171-HD82Q N171-HD82Q N171-HD82Q wt wt wt
Levels: N171-HD82Q wt
What am I missing?

An option would be backquotes
library(dplyr)
pData(raw_data) %>%
rename(
age = `Characteristics..age.`,
genotype = `Characteristics..genotype.`
)
Or, based on the error (reproduced with plyr::rename), it would be better to use :: to specify the package the function comes from, which avoids masking:
pData(raw_data) %>%
dplyr::rename(
age = Characteristics..age.,
genotype = Characteristics..genotype.
)
However, when testing with dplyr 0.8.3, it works fine without backquotes as well:
data(mtcars)
raw_data <- head(mtcars)
names(raw_data)[1] <- "Characteristics..genotype."
raw_data %>%
dplyr::rename(genotype = Characteristics..genotype.)
# genotype cyl disp hp drat wt qsec vs am gear carb
# ...
The issue is likely that plyr also includes a rename function, so if that package was loaded after dplyr, its version would mask dplyr::rename:
raw_data %>%
plyr::rename(genotype = Characteristics..genotype.)
Error in plyr::rename(., genotype = Characteristics..genotype.) :
unused argument (genotype = Characteristics..genotype.)
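One way to check for this kind of masking is to ask R which namespace the bare name rename currently resolves to. This is a small sketch assuming only dplyr is attached; the variable name rename_source is illustrative.

```r
library(dplyr)

# Namespace that the unqualified `rename` resolves to in this session;
# "dplyr" is expected here, whereas "plyr" would indicate masking
rename_source <- environmentName(environment(rename))
rename_source

# List objects that exist in more than one place on the search path
conflicts(detail = TRUE)
```

If the reported namespace is not dplyr, calling dplyr::rename() explicitly sidesteps the problem regardless of load order.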

You could use rename_all and do the renaming with a function, e.g. use stringr::str_remove_all to remove all instances of "Characteristics.." at the start or "." at the end (periods escaped with \\).
library(tidyverse) # dplyr and stringr
df %>%
rename_all(str_remove_all, '^Characteristics\\.\\.|\\.$')

Tidy approach to regression models, ideally with dplyr

Reading the documentation for do() in dplyr, I've been impressed by the ability to create regression models for groups of data and was wondering whether it would be possible to replicate it using different independent variables rather than groups of data.
So far I've tried
require(dplyr)
data(mtcars)
models <- data.frame(var = c("cyl", "hp", "wt"))
models <- models %>% do(mod = lm(mpg ~ as.name(var), data = mtcars))
Error in as.vector(x, "symbol") :
cannot coerce type 'closure' to vector of type 'symbol'
models <- models %>% do(mod = lm(substitute(mpg ~ i, as.name(.$var)), data = mtcars))
Error in substitute(mpg ~ i, as.name(.$var)) :
invalid environment specified
The desired final output would be something like
var slope standard_error_slope
1 cyl -2.87 0.32
2 hp -0.07 0.01
3 wt -5.34 0.56
I'm aware that something similar is possible using a lapply approach, but find the apply family largely inscrutable. Is there a dplyr solution?

There's nothing particularly complicated about the approach in the linked page. The use of substitute and as.name is a bit arcane, but that's easily rectified.
varlist <- names(mtcars)[-1]
models <- lapply(varlist, function(x) {
form <- formula(paste("mpg ~", x))
lm(form, data=mtcars)
})
dplyr is not the be-all and end-all of R programming. I'd suggest getting familiar with the *apply functions as they'll be of use in many situations where dplyr doesn't work.
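Building on the lapply idea above, the slope and standard error table requested in the question could be assembled roughly as follows. This is a sketch in base R only; the names vars and results are illustrative, and reformulate() is used in place of pasting the formula together.

```r
vars <- c("cyl", "hp", "wt")

# Fit one simple regression per predictor and keep the slope row
results <- do.call(rbind, lapply(vars, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  # Row 2 of the coefficient table is the predictor; columns 1:2 are
  # the estimate and its standard error
  est <- summary(fit)$coefficients[2, 1:2]
  data.frame(
    var = v,
    slope = est[["Estimate"]],
    standard_error_slope = est[["Std. Error"]]
  )
}))

results
```

Each iteration returns a one-row data frame, so rbind stacks them into a table with one row per predictor.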

This isn't pure "dplyr", but rather, "dplyr" + "tidyr" + "data.table". Still, I think it should be pretty easily readable.
library(data.table)
library(dplyr)
library(tidyr)
mtcars %>%
gather(var, val, cyl:carb) %>%
as.data.table %>%
.[, as.list(summary(lm(mpg ~ val))$coefficients[2, 1:2]), by = var]
# var Estimate Std. Error
# 1: cyl -2.87579014 0.322408883
# 2: disp -0.04121512 0.004711833
# 3: hp -0.06822828 0.010119304
# 4: drat 7.67823260 1.506705108
# 5: wt -5.34447157 0.559101045
# 6: qsec 1.41212484 0.559210130
# 7: vs 7.94047619 1.632370025
# 8: am 7.24493927 1.764421632
# 9: gear 3.92333333 1.308130699
# 10: carb -2.05571870 0.568545640
If you really just wanted a few variables, start with a vector, not a data.frame.
models <- c("cyl", "hp", "wt")
mtcars %>%
select(all_of(c("mpg", models))) %>% # replaces the deprecated select_(.dots = ...)
gather(var, val, -mpg) %>%
as.data.table %>%
.[, as.list(summary(lm(mpg ~ val))$coefficients[2, 1:2]), by = var]
# var Estimate Std. Error
# 1: cyl -2.87579014 0.3224089
# 2: hp -0.06822828 0.0101193
# 3: wt -5.34447157 0.5591010
