Can't rename columns with dplyr - r

I'm trying to rename columns in a data frame from Characteristics..genotype. to genotype and from Characteristics..age. to age:
pData(raw_data) %>%
rename(
age = Characteristics..age.,
genotype = Characteristics..genotype.
)
I get the following error:
Error in rename(., age = Characteristics..age., genotype = Characteristics..genotype.) : object 'Characteristics..age.' not found
Which doesn't make sense, since the columns exist in the data frame:
pData(raw_data)$Characteristics..genotype.
Output of the above:
[1] N171-HD82Q N171-HD82Q N171-HD82Q wt wt wt N171-HD82Q N171-HD82Q N171-HD82Q wt wt
[12] wt N171-HD82Q N171-HD82Q N171-HD82Q wt wt wt
Levels: N171-HD82Q wt
What am I missing?

An option would be backquotes
library(dplyr)
pData(raw_data) %>%
rename(
age = `Characteristics..age.`,
genotype = `Characteristics..genotype.`
)
Or, based on the error (which I could reproduce with plyr::rename), it would be better to use :: to state which package rename should come from, so that masking is not an issue:
pData(raw_data) %>%
dplyr::rename(
age = Characteristics..age.,
genotype = Characteristics..genotype.
)
But, while testing on dplyr_0.8.3, it works fine without backquotes as well
data(mtcars)
raw_data <- head(mtcars)
names(raw_data)[1] <- "Characteristics..genotype."
raw_data %>%
dplyr::rename(genotype = Characteristics..genotype.)
# genotype cyl disp hp drat wt qsec vs am gear carb
# ...
The issue is probably that plyr also includes a rename function, so if that package was loaded as well (after dplyr), its rename could mask dplyr::rename
raw_data %>%
plyr::rename(genotype = Characteristics..genotype.)
Error in plyr::rename(., genotype = Characteristics..genotype.) :
unused argument (genotype = Characteristics..genotype.)
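To check whether masking is the culprit in your own session, it can help to ask R which package the unqualified rename comes from. The sketch below is only an illustration: pheno is a made-up stand-in for pData(raw_data), since the original data is not available.

```r
library(dplyr)

# Hypothetical stand-in for pData(raw_data); not the original data
pheno <- data.frame(
  Characteristics..age. = c(10, 12),
  Characteristics..genotype. = c("wt", "N171-HD82Q")
)

# Which package does a bare rename() resolve to right now?
# "dplyr" is what you want; "plyr" would mean dplyr::rename is masked.
environmentName(environment(rename))

# Qualifying the call sidesteps masking, whichever package was attached last
renamed <- pheno %>%
  dplyr::rename(
    age = Characteristics..age.,
    genotype = Characteristics..genotype.
  )
names(renamed)
```

If the environment name comes back as plyr, either attach dplyr after plyr or keep the dplyr:: prefix on the call.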

You could use rename_all and do the renaming with a function, e.g. use stringr::str_remove_all to remove all instances of "Characteristics.." at the start or "." at the end (periods escaped with \\).
library(tidyverse) # dplyr and stringr
df %>%
rename_all(str_remove_all, '^Characteristics\\.\\.|\\.$')
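In dplyr 1.0.0 and later, the scoped verbs such as rename_all are superseded, and rename_with does the same job. A minimal sketch, assuming a small made-up data frame (df below is hypothetical, not the asker's data):

```r
library(dplyr)
library(stringr)

# Hypothetical data frame with the same kind of column names as in the question
df <- data.frame(
  Characteristics..age. = c(10, 12),
  Characteristics..genotype. = c("wt", "N171-HD82Q")
)

# rename_with() applies a function to the column names;
# the regex strips the "Characteristics.." prefix and the trailing "."
cleaned <- df %>%
  rename_with(~ str_remove_all(.x, "^Characteristics\\.\\.|\\.$"))
names(cleaned)
```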

Related

Mapping pipes to multiple columns in tidyverse

I'm working with a table for which I need to count the number of rows satisfying some criterion and I ended up with basically multiple repetitions of the same pipe differing only in the variable name.
Say I want to know how many cars are better than Valiant in mtcars on each of the variables there. An example of the code with two variables is below:
library(tidyverse)
reference <- mtcars %>%
slice(6)
mpg <- mtcars %>%
filter(mpg > reference$mpg) %>%
count() %>%
pull()
cyl <- mtcars %>%
filter(cyl > reference$cyl) %>%
count() %>%
pull()
tibble(mpg, cyl)
Except, suppose I need to do it for something like 100 variables, so there must be a better way than repeating the same pipe by hand.
What would be the way to rewrite the code above more compactly (maybe using map() or anything else that works nicely with pipes), so that the result is a tibble with the counts for all the variables in mtcars?
I feel the solution should be very easy but I'm stuck.
Thank you!
Or:
library(tidyverse)
map_dfc(mtcars, ~sum(.x[6] < .x))
map2_dfc(mtcars, reference, ~sum(.y < .x))
You could use summarise + across to count observations greater than a certain value in each column.
library(dplyr)
mtcars %>%
summarise(across(everything(), ~ sum(. > .[6])))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18 14 15 22 30 11 1 0 13 17 25
base solution:
# (1)
colSums(mtcars > mtcars[rep(6, nrow(mtcars)), ])
# (2)
colSums(sweep(as.matrix(mtcars), 2, mtcars[6, ], ">"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 18 14 15 22 30 11 1 0 13 17 25
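If you would rather look the reference car up by name than by row position, the summarise + across idea can be combined with cur_column(), which returns the name of the column currently being processed. This is a sketch that assumes dplyr 1.0.0 or later; the object names reference and counts are only illustrative.

```r
library(dplyr)

# Look the reference car up by row name instead of by position
reference <- mtcars["Valiant", ]

# For each column, count the rows whose value exceeds the reference value
counts <- mtcars %>%
  summarise(across(everything(), ~ sum(.x > reference[[cur_column()]])))
counts
```

Because Valiant is row 6 of mtcars, this should agree with the positional versions above.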
You can do it in a loop for example. Like this:
library(tidyverse)
reference <- mtcars %>%
slice(6)
# Empty list to save outcome
list_outcome <- list()
# Get the columnnames to loop over
loop_var <- colnames(reference)
for(i in loop_var){
nr <- mtcars %>%
filter(mtcars[, i] > reference[, i]) %>%
count() %>%
pull()
# Save every iteration in the loop as the ith element of the list
list_outcome[[i]] <- data.frame(Variable = i, Value = nr)
}
# combine all the data frames in the list to one final data frame
df_result <- do.call(rbind, list_outcome)

Feeding vector into summarise_at

I'm sure it is something obvious since I'm an R novice, but I cannot figure out why the first approach is working while the second is not. Something is wrong with my use of "paste".
library(dplyr)
data(mtcars)
characteristics <- c('disp', 'hp')
summarise_at(mtcars, .vars = vars(characteristics), mean)
characteristics <- paste('disp hp', collapse = ",")
summarise_at(mtcars, .vars = vars(characteristics), mean)
If you want to summarise over disp and hp of mtcars, why not use a simpler and more straightforward approach, like so?
mtcars %>%
summarise(across(c('disp', 'hp'), mean))
disp hp
1 230.7219 146.6875
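Of course, you can also 'feed' your vector into the across operation (wrapping it in all_of() tells dplyr that it is an external character vector of column names):
characteristics <- c('disp', 'hp')
mtcars %>%
summarise(across(all_of(characteristics), mean))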
disp hp
1 230.7219 146.6875
Using summarise(across(...)) would also take into account that the so-called scoped dplyr verbs have now essentially been superseded by across().
With help from a friend, I found the answer.
library('dplyr')
library('stringr') # needed for str_split
data(mtcars)
characteristics <- unlist(str_split('disp hp', ' '))
# the line above replaced characteristics <- paste('disp hp', collapse = ",")
summarise_at(mtcars, .vars = vars(characteristics), mean)
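To see why the paste version failed, it helps to inspect what each call returns. The sketch below uses base strsplit rather than stringr, and the names bad, good and means are only illustrative.

```r
library(dplyr)

# paste() with collapse glues its pieces into ONE string;
# here there is only one piece, so the input comes back unchanged
bad <- paste("disp hp", collapse = ",")
length(bad)   # a single element, not two column names

# strsplit() (base R) turns the string into a character vector of names
good <- strsplit("disp hp", " ")[[1]]
length(good)

# A character vector of column names can then be passed via all_of()
means <- mtcars %>%
  summarise(across(all_of(good), mean))
means
```

So the second approach in the question asked dplyr for one column literally called "disp hp", which does not exist.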

How to apply a function that outputs several column means for all the data frame objects in a list?

I know this must be a simple question, but I keep struggling with it.
I have this list of 124 data frames called "kks"
I want to input each one of the 124 data frames into the following function:
mytest_function <- function(df){
data.numcols <- df[, sapply(df, is.numeric)]
all.means <- apply(data.numcols, 2, mean)
all.means <- colMeans(data.numcols)
all.means
}
Essentially, I want the means of every column in all 124 data frames from a list of dataframes.
I've tried:
lapply(kks,mytest_function(df))
AND:
lapply(kks,mytest_function(kks))
and I can't figure it out. I keep getting error messages saying "Error in colMeans(df) : 'x' must be an array of at least two dimensions"
What should I do from here?
I think you could also use purrr::map() along with dplyr::summarise_if(), depending on the output format you desire. This would also eliminate the need for your custom function.
library(purrr)
library(dplyr)
kks %>%
map(summarise_if, .predicate = is.numeric, .funs = mean)
Using mtcars as sample data.
library(purrr)
library(dplyr)
kks <- list(mtcars, mtcars, mtcars)
kks %>%
map(summarise_if, .predicate = is.numeric, .funs = mean)
[[1]]
mpg cyl disp hp drat wt qsec vs am gear
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875
carb
1 2.8125
[[2]]
mpg cyl disp hp drat wt qsec vs am gear
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875
carb
1 2.8125
[[3]]
mpg cyl disp hp drat wt qsec vs am gear
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875
carb
1 2.8125
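apply with margin 2 and colMeans do the same thing, so you only need one of them. Adding drop = FALSE keeps the subset a data frame even when only one column is numeric; otherwise colMeans would receive a plain vector and fail with "'x' must be an array of at least two dimensions".
mytest_function <- function(df){
data.numcols <- df[, sapply(df, is.numeric), drop = FALSE]
colMeans(data.numcols, na.rm = TRUE)
}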
and use lapply
output <- lapply(kks, mytest_function)
You can also use summarise_if from dplyr library
library(dplyr)
mytest_function <- function(df){
df %>% summarise_if(is.numeric, mean, na.rm = TRUE)
}
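The attempts in the question fail because lapply expects the function itself as its second argument, not a call to it: mytest_function(df) is evaluated once, up front, instead of once per data frame. A minimal sketch, assuming a small stand-in list (the real kks is not available, and col_means is just an illustrative name):

```r
# Stand-in for the real list of 124 data frames (an assumption for illustration)
kks <- list(a = mtcars, b = iris, c = head(mtcars, 10))

col_means <- function(df) {
  # drop = FALSE keeps a data frame even if only one column is numeric
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  colMeans(num, na.rm = TRUE)
}

# Pass the function by name; lapply calls it once per list element
out <- lapply(kks, col_means)
out$b
```

The result is a list with one named numeric vector per data frame; non-numeric columns such as iris$Species are skipped.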

Dplyr and multiple t test (keeping the same IV)

I'm using this pretty nice code to perform a multiple t.test keeping the independent variable constant!
data(mtcars)
library(dplyr)
vars_to_test <- c("disp","hp","drat","wt","qsec")
iv <- "vs"
mtcars %>%
summarise_each_(
funs_(
sprintf("stats::t.test(.[%s == 0], .[%s == 1])$p.value",iv,iv)
),
vars = vars_to_test)
Unfortunately, dplyr was updated and I now get this message
summarise_each() is deprecated. Use summarise_all(),
summarise_at() or summarise_if() instead. To map funs over a
selection of variables, use summarise_at()
When I change the code to use _all, _at or _if, the function does not work any more. I'm looking for some advice, and thanks very much for your support.
Thanks
Instead of creating a string expression with sprintf and then evaluating it, we can convert the column name stored in iv ('vs') to a symbol with rlang::sym and evaluate it with !!
library(dplyr)
mtcars %>%
summarise_at(vars(vars_to_test), funs(
try(stats::t.test(.[(!! rlang::sym(iv)) == 0], .[(!! rlang::sym(iv)) == 1])$p.value)
))
# disp hp drat wt qsec
#1 2.476526e-06 1.819806e-06 0.01285342 0.0007281397 3.522404e-06
If we really wanted to parse an expression, we could use rlang::parse_expr and rlang::eval_tidy along with sym
library(rlang)
eval_tidy(parse_expr("mtcars %>% summarise_at(vars(vars_to_test),
funs(t.test(.[(!!sym(iv))==0],
.[(!!sym(iv))==1])$p.value ))"))
# disp hp drat wt qsec
#1 2.476526e-06 1.819806e-06 0.01285342 0.0007281397 3.522404e-06
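Since funs() has itself been deprecated in later dplyr releases, the same idea can be sketched with across(). This version assumes dplyr 1.0.0 or later and ungrouped data; the helper objects grp and p_values are illustrative names, not part of the original code.

```r
library(dplyr)

vars_to_test <- c("disp", "hp", "drat", "wt", "qsec")
iv <- "vs"

# Pull the grouping column out once; this relies on the data being ungrouped,
# so that each .x below is the full column and lines up with grp
grp <- mtcars[[iv]]

# One Welch t-test per variable, comparing rows with iv == 0 against iv == 1
p_values <- mtcars %>%
  summarise(across(
    all_of(vars_to_test),
    ~ t.test(.x[grp == 0], .x[grp == 1])$p.value
  ))
p_values
```

Extracting grp beforehand avoids any tidy evaluation, at the cost of not working inside group_by().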

Tidy approach to regression models, ideally with dplyr

Reading the documentation for do() in dplyr, I've been impressed by the ability to create regression models for groups of data and was wondering whether it would be possible to replicate it using different independent variables rather than groups of data.
So far I've tried
require(dplyr)
data(mtcars)
models <- data.frame(var = c("cyl", "hp", "wt"))
models <- models %>% do(mod = lm(mpg ~ as.name(var), data = mtcars))
Error in as.vector(x, "symbol") :
cannot coerce type 'closure' to vector of type 'symbol'
models <- models %>% do(mod = lm(substitute(mpg ~ i, as.name(.$var)), data = mtcars))
Error in substitute(mpg ~ i, as.name(.$var)) :
invalid environment specified
The desired final output would be something like
var slope standard_error_slope
1 cyl -2.87 0.32
2 hp -0.07 0.01
3 wt -5.34 0.56
I'm aware that something similar is possible using a lapply approach, but find the apply family largely inscrutable. Is there a dplyr solution?
There's nothing particularly complicated about the approach in the linked page. The use of substitute and as.name is a bit arcane, but that's easily rectified.
varlist <- names(mtcars)[-1]
models <- lapply(varlist, function(x) {
form <- formula(paste("mpg ~", x))
lm(form, data=mtcars)
})
dplyr is not the be-all and end-all of R programming. I'd suggest getting familiar with the *apply functions as they'll be of use in many situations where dplyr doesn't work.
This isn't pure "dplyr", but rather, "dplyr" + "tidyr" + "data.table". Still, I think it should be pretty easily readable.
library(data.table)
library(dplyr)
library(tidyr)
mtcars %>%
gather(var, val, cyl:carb) %>%
as.data.table %>%
.[, as.list(summary(lm(mpg ~ val))$coefficients[2, 1:2]), by = var]
# var Estimate Std. Error
# 1: cyl -2.87579014 0.322408883
# 2: disp -0.04121512 0.004711833
# 3: hp -0.06822828 0.010119304
# 4: drat 7.67823260 1.506705108
# 5: wt -5.34447157 0.559101045
# 6: qsec 1.41212484 0.559210130
# 7: vs 7.94047619 1.632370025
# 8: am 7.24493927 1.764421632
# 9: gear 3.92333333 1.308130699
# 10: carb -2.05571870 0.568545640
If you really just wanted a few variables, start with a vector, not a data.frame.
models <- c("cyl", "hp", "wt")
mtcars %>%
select_(.dots = c("mpg", models)) %>%
gather(var, val, -mpg) %>%
as.data.table %>%
.[, as.list(summary(lm(mpg ~ val))$coefficients[2, 1:2]), by = var]
# var Estimate Std. Error
# 1: cyl -2.87579014 0.3224089
# 2: hp -0.06822828 0.0101193
# 3: wt -5.34447157 0.5591010
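For completeness, the table asked for in the question can also be built with base R only. This is a sketch rather than a dplyr solution: reformulate() builds each formula from the variable name, and the names vars, fits and result are just illustrative.

```r
vars <- c("cyl", "hp", "wt")

# Fit one simple regression of mpg on each predictor
fits <- lapply(vars, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})

# Row 2 of the coefficient table holds the slope and its standard error
result <- data.frame(
  var = vars,
  slope = sapply(fits, function(m) coef(summary(m))[2, "Estimate"]),
  standard_error_slope = sapply(fits, function(m) coef(summary(m))[2, "Std. Error"])
)
result
```

The slopes and standard errors should match the Estimate and Std. Error columns shown above for cyl, hp and wt.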
