How to use string manipulation functions inside .names argument in dplyr::across

I searched to check whether this is a duplicate, but I could not find a similar question (there is one that looks related, but it is somewhat different from my requirement).
My question is whether we can use a string manipulation function such as substr or stringr::str_remove inside the .names argument of dplyr::across. As a reproducible example, consider this:
library(dplyr)
iris %>%
summarise(across(starts_with('Sepal'), mean, .names = '{.col}_mean'))
Sepal.Length_mean Sepal.Width_mean
1 5.843333 3.057333
Now my problem is that I want to rename the output columns, say with str_remove(.col, 'Sepal'), so that the output column names are just Length_mean and Width_mean. I am asking because the description of this argument states that
.names
A glue specification that describes how to name the output columns. This can use {.col} to stand for the selected column name, and {.fn} to stand for the name of the function being applied. The default (NULL) is equivalent to "{.col}" for the single function case and "{.col}_{.fn}" for the case where a list is used for .fns.
I have tried many possibilities, including the following, but none of them works:
library(tidyverse)
library(glue)
iris %>%
summarise(across(starts_with('Sepal'), mean,
.names = glue('{xx}_mean', xx = str_remove(.col, 'Sepal'))))
Error: Problem with `summarise()` input `..1`.
x argument `str` should be a character vector (or an object coercible to)
i Input `..1` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
Run `rlang::last_error()` to see where the error occurred.
#OR
iris %>%
summarise(across(starts_with('Sepal'), mean,
.names = glue('{xx}_mean', xx = str_remove(glue('{.col}'), 'Sepal'))))
I know that this can be solved by adding another step using rename_with, so I am not looking for that answer.

This works, but probably with a few caveats. You can use functions inside a glue specification, so you can clean up the strings that way. However, when I tried escaping the ".", I got an error, which I assume has something to do with how across parses the string. If you need something more dynamic, you might want to dig into the source code at that point.
In order to use the {.fn} helper, at least in conjunction with creating the glue string on the fly like this, the function needs a name; otherwise you get a number for the function's index in the .fns argument. I tested this out with a second function and using lst for automatic naming.
library(dplyr)
iris %>%
summarise(across(starts_with('Sepal'), .fns = lst(mean, max),
.names = '{stringr::str_remove(.col, "^[A-Za-z]+.")}_{.fn}'))
#> Length_mean Length_max Width_mean Width_max
#> 1 5.843333 7.9 3.057333 4.4
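For the single-function case from the original example, the same trick applies. Here is a minimal sketch; the hard-coded _mean suffix and the unescaped "Sepal." pattern are my choices (the unescaped dot happens to match the literal period, which sidesteps the escaping issue mentioned above):
library(dplyr)
iris %>%
  summarise(across(starts_with('Sepal'), mean,
                   .names = '{stringr::str_remove(.col, "Sepal.")}_mean'))
#>   Length_mean Width_mean
#> 1    5.843333   3.057333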

Related

Recoding a discrete variable

I have a discrete variable with scores from 1-3. I would like to change it so 1=2, 2=1, 3=3.
I have tried
recode(Data$GEB43, "c(1=2; 2=1; 3=3")
But that doesn't work.
I know this is an overly simple question that can be solved in Excel within seconds, but I am trying to learn how to do basics like this in R.
We should always provide a minimal reproducible example:
df <- data.frame(x=c(1,1,2,2,3,3))
You didn't specify the package for recode so I assumed dplyr. ?dplyr::recode tells us how the arguments should be passed to the function. In the original question "c(1=2; 2=1; 3=3" is a string (i.e. not an R expression but the character string "c(1=2; 2=1; 3=3"). To make it an R expression we have to get rid of the double quotes and replace each ; with ,. Additionally, we need a closing bracket, i.e. c(1=2, 2=1, 3=3). But still, as ?dplyr::recode tells us, this is not the way to pass this information to recode:
Solution using dplyr::recode:
dplyr::recode(df$x, "1"=2, "2"=1, "3"=3)
Returns:
[1] 2 2 1 1 3 3
Assuming you mean dplyr::recode, the syntax is
recode(.x, ..., .default = NULL, .missing = NULL)
From the documentation it says
.x - A vector to modify
... - Replacements. For character and factor .x, these should be named and replacement is based only on their name. For numeric .x, these can be named or not. If not named, the replacement is done based on position i.e. .x represents positions to look for in replacements
So when you have numeric value you can replace based on position directly
recode(1:3, 2, 1, 3)
#[1] 2 1 3
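If you want to keep the recoded values inside the data frame rather than as a standalone vector, here is a small sketch using the example data from above; the mutate() wrapper and the backtick-named arguments are just one idiomatic way to write it:
library(dplyr)
df <- data.frame(x = c(1, 1, 2, 2, 3, 3))
df %>% mutate(x = recode(x, `1` = 2, `2` = 1, `3` = 3))
#>   x
#> 1 2
#> 2 2
#> 3 1
#> 4 1
#> 5 3
#> 6 3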

generalization of a chaining operator

I know that the %>% operator allows one to pass the LHS as the first argument of the RHS (so that xxx %>% fun() is equivalent to fun(xxx)), which allows us to "chain" functions together, but is there a way to generalize this operation so that I can pass the LHS to the "nth" argument of the RHS? I am using the R programming language.
You use the . to pass the LHS into the desired argument on the right, by position or by name. If you want to replace 'hey' with 'ho' in 'hey ho' using gsub(pattern, replacement, x), then you could do any of the following. Note that %>% does not pass the LHS into the first argument of the function, but into the first argument that has not already been matched by name (see the third example below).
'hey ho' %>% gsub('hey','ho',.)
'hey ho' %>% gsub('hey','ho',x=.)
'hey ho' %>% gsub(pattern='hey',replacement='ho')
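A small sketch of the same idea with a different function, just to make the placement explicit (seq() and seq_len() are used here only for illustration): when the dot appears as a direct argument, the usual first-argument insertion is suppressed, so the LHS lands wherever the dot is.
library(magrittr)
10 %>% seq(1, 100, by = .)   # becomes seq(1, 100, by = 10)
#> [1]  1 11 21 31 41 51 61 71 81 91
10 %>% seq_len()             # no dot: the LHS becomes the first argument
#> [1]  1  2  3  4  5  6  7  8  9 10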

Dplyr Filter Multiple Like Conditions

I am trying to do a filter in dplyr where a column matches any of several patterns. I can use sqldf as
Test <- sqldf("select * from database
Where SOURCE LIKE '%ALPHA%'
OR SOURCE LIKE '%BETA%'
OR SOURCE LIKE '%GAMMA%'")
I tried to use the following which doesn't return any results:
database %>% dplyr::filter(SOURCE %like% c('%ALPHA%', '%BETA%', '%GAMMA%'))
Thanks
You can use grepl with ALPHA|BETA|GAMMA, which will match if any of the three patterns is contained in the SOURCE column.
database %>% filter(grepl('ALPHA|BETA|GAMMA', SOURCE))
If you want it to be case insensitive, add ignore.case = TRUE to grepl.
%like% is from the data.table package. You're probably also seeing this warning message:
Warning message:
In grepl(pattern, vector) :
argument 'pattern' has length > 1 and only the first element will be used
The %like% operator is just a wrapper around the grepl function, which does string matching using regular expressions. So the % wildcards aren't necessary, and in fact they would be treated as literal percent signs.
You can only supply one pattern to match at a time, so either combine them using the regex 'ALPHA|BETA|GAMMA' (as Psidom suggests) or break the tests into three statements:
database %>%
dplyr::filter(
SOURCE %like% 'ALPHA' |
SOURCE %like% 'BETA' |
SOURCE %like% 'GAMMA'
)
Building on Psidom's and Nathan Werth's responses, for a tidyverse-friendly and concise method, we can do:
library(data.table); library(tidyverse)
database %>%
dplyr::filter(SOURCE %ilike% "ALPHA|BETA|GAMMA") # %ilike% = case-insensitive %like%
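If the patterns live in a character vector, they can be collapsed into a single regular expression before filtering; a short sketch (database and SOURCE are the objects from the question):
library(dplyr)
patterns <- c("ALPHA", "BETA", "GAMMA")
database %>% filter(grepl(paste(patterns, collapse = "|"), SOURCE, ignore.case = TRUE))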

group_by with non-scalar character vectors using tidyeval

Using R 3.2.2 and dplyr 0.7.2 I'm trying to figure out how to effectively use group_by with fields supplied as character vectors.
Selecting is easy. I can select a field via a string like this
(function(field) {
mpg %>% dplyr::select(field)
})("cyl")
Multiple fields via multiple strings like this
(function(...) {
mpg %>% dplyr::select(!!!quos(...))
})("cyl", "hwy")
and multiple fields via one character vector with length > 1 like this
(function(fields) {
mpg %>% dplyr::select(fields)
})(c("cyl", "hwy"))
With group_by I cannot really find a way to do this for more than one string, because when I do get output it ends up grouping by the literal string I supply.
I managed to group by one string like this
(function(field) {
mpg %>% group_by(!!field := .data[[field]]) %>% tally()
})("cyl")
Which is already quite ugly.
Does anyone know what I have to write so I can run
(function(field) {...})("cyl", "hwy")
and
(function(field) {...})(c("cyl", "hwy"))
respectively? I tried all sorts of combinations of !!, !!!, UQ, enquo, quos, unlist, etc... and saving them in intermediate variables because that sometimes seems to make a difference, but cannot get it to work.
select() is very special in dplyr. It doesn't accept columns, but column names or positions. So that's about the only main verb that accepts strings. (Technically when you supply a bare name like cyl to select, it actually gets evaluated as its own name, not as the vector inside the data frame.)
If you want your function to take simple strings, as opposed to bare expressions or symbols, you don't need quosures. Just create symbols from the strings and unquote them:
myselect <- function(...) {
syms <- syms(list(...))
select(mtcars, !!! syms)
}
mygroup <- function(...) {
syms <- syms(list(...))
group_by(mtcars, !!! syms)
}
myselect("cyl", "disp")
mygroup("cyl", "disp")
To debug the unquoting, wrap with expr() and check that the expression looks right:
syms <- syms(list("cyl", "disp"))
expr(group_by(mtcars, !!! syms))
#> group_by(mtcars, cyl, disp) # yup, looks right!
See this talk for more on this (we'll update the programming vignette to make the concepts clearer): https://schd.ws/hosted_files/user2017/43/tidyeval-user.pdf.
Finally, note that many verbs have a _at suffix variant that accepts strings and character vectors without fuss:
group_by_at(mtcars, c("cyl", "disp"))
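For the character-vector form asked about in the question, the same syms() idea works, since syms() also accepts a character vector; a small sketch (the wrapper name mygroup_vec is mine):
library(dplyr)
library(rlang)
mygroup_vec <- function(data, fields) {
  group_by(data, !!!syms(fields))
}
mygroup_vec(mtcars, c("cyl", "disp")) %>% tally()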

Data Management in R

So I have this code where I am trying to unite separate grade columns (pre-K through 12) into one column called Grade. I have employed the tidyr package and used this line of code to perform said task:
unite(dta, "Grade",
c(Gradeprek,
dta$Gradek, dta$Grade1, dta$Grade2,
dta$Grade3, dta$Grade4, dta$Grade5,
dta$Grade6, dta$Grade7, dta$Grade8,
dta$Grade9, dta$Grade10, dta$Grade11,
dta$Grade12),
sep="")
However, I have been getting an error saying this:
error: All select() inputs must resolve to integer column positions.
The following do not: * c(Gradeprek, dta$Gradek, dta$Grade1, dta$Grade2, dta$Grade3, dta$Grade4, dta$Grade5, dta$Grade6, ...
Penny for your thoughts on how I can resolve the situation.
You are mixing and matching the two syntax options for unite and unite_ - you need to pick one and stick with it. In both cases, do not use data$column - they take a data argument so you don't need to re-specify which data frame your columns come from.
Option 1: NSE The default non-standard evaluation means bare column names - no quotes! And no c().
unite(dta, Grade, Gradeprek, Gradek, Grade1, Grade2, Grade3, ...,
Grade12, sep = "")
There are tricks you can do with this. For example, if all your Grade columns are in this order next to each other in your data frame, you could do
unite(dta, Grade, Gradeprek:Grade12, sep = "")
You could also use starts_with("Grade") to get all columns that begin with that string. See ?unite and its link to ?select for more details.
Option 2: Standard Evaluation You can use unite_() for a standard-evaluating alternative which will expect column names in a character vector. This has the advantage in this case of letting you use paste() to build column names in the order you want:
unite_(dta, col = "Grade", c("Gradeprek", "Gradek", paste0("Grade", 1:12)), sep = "")
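A tiny runnable sketch of the NSE range form on made-up data, just to show the shape of the result (the column names and values are purely illustrative):
library(tidyr)
dta <- data.frame(Gradeprek = "", Gradek = "", Grade1 = "X", Grade2 = "")
unite(dta, Grade, Gradeprek:Grade2, sep = "")
#>   Grade
#> 1     X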
