Dplyr Non Standard Evaluation -- Help Needed

I am taking my first steps with non-standard evaluation (NSE) in dplyr.
Consider the following snippet: it takes a tibble, sorts it in descending order of one column, keeps the k largest values of that column as factor levels, and replaces the remaining n - k values with "Other".
library(dplyr)
df <- cars %>% as_tibble()
k <- 3
df2 <- df %>%
  arrange(desc(dist)) %>%
  mutate(dist2 = factor(c(dist[1:k],
                          rep("Other", n() - k)),
                        levels = c(dist[1:k], "Other")))
What I would like is a function such that:
df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)
produces the same result, where old_column is "dist" (the column I use to sort the data set), new_column is "dist2" (the column I generate) and levels_to_keep is "k" (the number of values I explicitly retain).
I am getting lost in enquo, quo_name etc...
Any suggestion is appreciated.

You can do:
library(dplyr)
sort_keep <- function(df, old_column, new_column, levels_to_keep) {
  # capture the sorting column as a quosure and the new column name as a string
  old_column <- enquo(old_column)
  new_column <- as.character(substitute(new_column))
  df %>%
    arrange(desc(!!old_column)) %>%
    mutate(use = !!old_column,
           !!new_column := factor(c(use[1:levels_to_keep],
                                    rep("Other", n() - levels_to_keep)),
                                  levels = c(use[1:levels_to_keep], "Other")),
           use = NULL)  # drop the temporary helper column
}
df %>% sort_keep(dist, dist2, 3)
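As a variation on the answer above, here is a minimal sketch that uses the curly-curly operator instead of enquo and substitute. It assumes dplyr 1.0 or later (for the "{{ }}" := name syntax); the use of head() instead of a temporary column is my own choice for illustration, not part of the original answer.

```r
library(dplyr)

# Sketch: same idea as sort_keep above, written with {{ }} (assumes dplyr >= 1.0)
sort_keep <- function(df, old_column, new_column, levels_to_keep) {
  df %>%
    arrange(desc({{ old_column }})) %>%
    mutate("{{ new_column }}" := factor(
      # the first levels_to_keep sorted values, then "Other" for every remaining row
      c(head({{ old_column }}, levels_to_keep),
        rep("Other", n() - levels_to_keep)),
      levels = c(head({{ old_column }}, levels_to_keep), "Other")
    ))
}

df2bis <- cars %>% as_tibble() %>% sort_keep(dist, dist2, 3)
```

Like the original snippet, this assumes the top levels_to_keep values are distinct, since factor() rejects duplicated levels.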

Something like this?
old_column = "dist"
new_column = "dist2"
levels_to_keep = 3
command = "df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)"
command = gsub('old_column', old_column, command)
command = gsub('new_column', new_column, command)
command = gsub('levels_to_keep', levels_to_keep, command)
eval(parse(text=command))

Related

Check if single column is equal to any multiple others

My question seems simple, but I just can't do it. I have a dataframe with multiple columns with the name starting with coa and another column p with values like A, D, F, and so on, which changes according to the id.
All I found is how to do this matching with a fixed value, let's say "A", as below:
df <-df %>%
mutate(ly = any(str_detect(c_across(starts_with("coa")), "A")))
However, in my case, I want to compare to the column p specifically, where p changes, something like this:
df <-df %>%
mutate(ly = any(str_detect(c_across(starts_with("coa")), p)))
In this case, I get the error:
x no applicable method for 'type' applied to an object of class "factor"
Any thoughts? Thanks!
If we need to create a column, use if_any
library(dplyr)
library(stringr)
df <- df %>%
mutate(ly = if_any(starts_with("coa"), ~ str_detect(.x, p)))
I think this is a good place to use dplyr::across. You can run vignette('colwise') for a more comprehensive guide, but the key point here is that we can mutate all columns starting with "coa" simultaneously using the function == and we can pass a second argument, p, to == using the ... option provided by across.
library(dplyr)
df <- tibble(p = 1:10, coa1 = 1:10, coa2 = 11:20)
df %>%
mutate(across(.cols = starts_with('coa'), .fns = `==`, p))
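If a single TRUE/FALSE column is wanted and p is a factor (as the error message in the question suggests), a minimal sketch could look like the following. The toy data and the as.character() conversion are assumptions made for illustration; exact equality is used here rather than str_detect.

```r
library(dplyr)

# Hypothetical toy data: p is a factor, the coa* columns are character
df <- tibble(
  p    = factor(c("A", "D", "F")),
  coa1 = c("A", "B", "C"),
  coa2 = c("X", "D", "Y")
)

# ly is TRUE when any coa* column equals p in the same row
out <- df %>%
  mutate(ly = if_any(starts_with("coa"), ~ .x == as.character(p)))
```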

ifelse in a mutate function in r

I am trying to add a column with a condition using the mutate function in R, but I keep getting an error. The code is straight from the teacher's lecture, yet an error occurs. The LineItem column is of class factor; I am not sure if that makes a difference.
Please advise on what I am missing.
Thank you,
Avi
df <- read.csv('ities_short.csv')
colSums(is.na(df))
sl <- str_length(df$LineItem)
avg <- mean(str_length(df$LineItem))
df <- df %>% mutate(LineItem_LongName = ifelse(sl > avg), 1, 0)
Error in ifelse(sl > avg) : argument "yes" is missing, with no default
You have placed the closing ')' in the wrong place, so ifelse only receives the condition. The general syntax for ifelse is:
ifelse(cond,value if true, value if false)
df <- read.csv('ities_short.csv')
colSums(is.na(df))
sl <- str_length(df$LineItem)
avg <- mean(str_length(df$LineItem))
df <- df %>% mutate(LineItem_LongName = ifelse(sl > avg, 1, 0))
@Nirbhay Singh's answer is correct. However, when you compare two vectors, it is generally better to use dplyr::if_else because it is stricter: both branches must have the same type, and NA values can be handled explicitly through its missing argument:
df <- df %>% mutate(LineItem_LongName = if_else(sl > avg, 1, 0))
See the doc
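To make the difference concrete, here is a small sketch on a made-up vector (x and the replacement values are hypothetical, not taken from the question's data).

```r
library(dplyr)

x <- c(1, NA, 3)

# base ifelse() silently coerces mixed branch types to character
base_result <- ifelse(x > 2, 1, "no")

# if_else() requires compatible branch types and lets you choose a value for NA
strict_result <- if_else(x > 2, 1, 0, missing = -1)

# mixing a number and a string in if_else() raises an error instead of coercing
mixed <- tryCatch(if_else(x > 2, 1, "no"), error = function(e) "error")
```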
Don't create separate objects and then use them in the data frame; keep them as columns of the data frame itself. You can remove the columns you don't need later. Moreover, you can do this without ifelse.
library(dplyr)
library(stringr)
df %>%
mutate(temp = str_length(LineItem),
LineItem_LongName = as.integer(temp > mean(temp)))
Or in base R :
# nchar() does not accept factors, so convert LineItem to character first
df$temp <- nchar(as.character(df$LineItem))
df <- transform(df, LineItem_LongName = +(temp > mean(temp)))

Passing variable in function to other function variables in R

I am trying to pass a variable Phyla (which is also the name of a df column of interest) into other functions. However, I get the error: Error: Column `tax_level` is unknown. I understand why. It would just be more convenient to state the column I want to use once in the function, since this will be repeated numerous times in the script. I have tried using OTU_melt_grouped[,1], since this will always be the first column to use in the dcast function, but I get the error: Error: Must use a vector in `[`, not an object of class matrix. Moreover, it does not solve my problem in the group_by function, since I want to be able to specify Phyla, Class, Order, etc.
I am sure there must be a simple solution, but I don't know where to start!
taxa_specific_columns_func <- function(data, tax_level = Phyla) {
OTU_melt_grouped <- data %>%
group_by(tax_level, variable) %>%
summarise(value = sum(value))
taxa_cols <- dcast(OTU_melt_grouped, variable ~ tax_level)
rownames(taxa_cols) <- meta_data$site
taxa_cols <- taxa_cols[-1]
return(taxa_cols)
}
tax_test <- taxa_specific_columns_func(OTU_melt)
As we are passing an unquoted variable, we could make use of curly-curly ({{..}}) operator in group_by
library(dplyr)
library(tidyr)
library(tibble)
taxa_specific_columns_func <- function(data, tax_level = Phyla) {
data %>%
group_by({{tax_level}}, variable) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = {{tax_level}}, values_from = value) %>%
column_to_rownames("variable")
}
taxa_specific_columns_func(OTU_melt)
# A B C D E
#a 0.01859254 0.42141238 -0.196961 -0.1859115 -0.2901680
#b -0.64700080 NA -0.161108 NA NA
#c -0.03297331 0.05871052 -1.963341 NA 0.7608218
data
set.seed(48)
OTU_melt <- data.frame(Phyla = rep(LETTERS[1:5], each = 3),
variable = sample(letters[1:3], 15, replace = TRUE), value = rnorm(15))
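If the taxonomic level is more convenient to pass as a string (for example when looping over "Phyla", "Class", "Order"), a sketch along these lines may help. It assumes dplyr 1.0 or later and tidyr; the function name taxa_cols_by_name and the small data frame are hypothetical and only serve the illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical string-based variant: tax_level is a column name given as text
taxa_cols_by_name <- function(data, tax_level = "Phyla") {
  data %>%
    group_by(across(all_of(c(tax_level, "variable")))) %>%
    summarise(value = sum(value), .groups = "drop") %>%
    pivot_wider(names_from = all_of(tax_level), values_from = value)
}

# Small deterministic example data (made up for this sketch)
toy <- data.frame(Phyla    = c("A", "A", "B", "B"),
                  variable = c("a", "b", "a", "a"),
                  value    = c(1, 2, 3, 4))
res <- taxa_cols_by_name(toy, "Phyla")
```

Combinations that do not occur in the data, such as Phyla B with variable b here, come out as NA.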

Calculate mode for each column in dataframe using lapply dplyr

I'm trying to create a function that gets me the mode, or "mode-X" (the 2nd to Xth most common value), and the associated counts for each column in a data frame.
I can't figure out what I am missing and I'm looking for some assistance. I believe it has to do with passing a variable into a dplyr function.
library(tidyverse)
myfunct_get_mode = function(x, rank=1){
mytable = dplyr::count(rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = table %>% dplyr::slice(rlang::sym(rank))
return(result)
}
mtcars %>% lapply(. %>% (function(x) myfunct_get_mode(x, rank=2)))
There are some problems with your function:
Your function call is not doing what you think. Check with mtcars %>% lapply(. %>% (function(x) print(x))): x is actually a whole column of mtcars, not its name. To iterate over the column names, apply the function to names(mtcars), but then you also have to pass the data frame you are working on.
To evaluate the symbol returned by sym, you need to put !! in front of rlang::sym(x).
rank is not a variable name, so there is no need for rlang::sym there.
table should be mytable in the second-to-last line of your function.
So here is how it could work (although there are probably better ways):
myfunct_get_mode = function(df, x, rank=1){
mytable = count(df, !!rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = mytable %>% slice(rank)
return(result)
}
names(mtcars) %>% lapply(function(x) myfunct_get_mode(mtcars, x, rank=2))
If we need this in a list, we can use map
f1 <- function(dat, rank = 1) {
purrr::imap(dat, ~
dat %>%
count(!! rlang::sym(.y)) %>%
rename_all(~ c('variable', 'counts')) %>%
arrange(desc(counts)) %>%
slice(seq_len(rank))) #%>%
#bind_cols - convert to a data.frame
}
f1(mtcars, 2)
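Since lapply already hands each column to the function as a vector, the ranking can also be sketched in base R without any tidy evaluation. The helper name nth_mode is hypothetical, and ties between equally frequent values are not treated specially.

```r
# Hypothetical helper: return the rank-th most frequent value of a vector
nth_mode <- function(x, rank = 1) {
  counts <- sort(table(x), decreasing = TRUE)  # frequency table, most common first
  data.frame(variable = names(counts)[rank],
             counts   = as.integer(counts[rank]))
}

# one small data frame per column of mtcars
second_modes <- lapply(mtcars, nth_mode, rank = 2)
```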

How to pass multiple columns as string to function in dplyr::mutate_at

I have the following (heavily simplified) dplyr example for mutate:
xx <- data.frame(x = 1:10, y = c(rep(1,4),rep(2,6)))
bla_fun <- function(x,y){cat(x," ",y,"\n"); min(x,y)}
xx %>% rowwise() %>%
mutate( z = bla_fun(x,y))
I would like to get it working with mutate_at, which lets me pass the column names as strings.
xx %>% rowwise() %>%
mutate_at( c("x","y"), funs("bla_fun") )
But this does not work. How to get it working?
mutate_at mutates every single column separately.
Your particular example can be solved (assuming the min is a placeholder that cannot be replaced by pmin) like this:
xx %>%
mutate(z = map2(!!sym("x"), !!sym("y"), !!sym("bla_fun")))
syms <- rlang::syms(c("x", "y"))
xx %>%
rowwise() %>%
mutate( z = bla_fun(!!! syms))
Side note 1: mutate_at is typically for applying n unary functions to n variables, not 1 n-ary function to n variables. mutate does the job.
Side note 2: there is no need to group rowwise. You could more simply mutate(xx, z = purrr::map2_dbl(x, y, bla_fun)) or rewrite/vectorize bla_fun with pmin() to mutate directly.
Combine this with the use of syms for strings: mutate(xx, z = mapply(bla_fun, !!! syms)) for instance, or mutate(xx, z = purrr::pmap_dbl(list(!!! syms), bla_fun)).
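For the vectorized route mentioned in side note 2, a minimal sketch, assuming bla_fun can indeed be replaced by pmin and that the column names arrive as a character vector (cols is a name chosen for this example):

```r
library(dplyr)

xx <- data.frame(x = 1:10, y = c(rep(1, 4), rep(2, 6)))
cols <- c("x", "y")  # column names supplied as strings

# splice the symbols into pmin(), which is already vectorized over rows
res <- xx %>% mutate(z = pmin(!!!rlang::syms(cols)))
```

No rowwise() is needed here because pmin() computes the element-wise minimum across the supplied columns.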
