How to use dplyr SE with "invalid" names (i.e. containing spaces)?

I can't figure out how to use SE dplyr function with invalid variable names, for example selecting a variable with a space in it.
Example:
df <- dplyr::data_frame(`a b` = 1)
myvar <- "a b"
If I want to select a b variable, I can do it with dplyr::select(df, `a b`), but how do I do that with select_?
I suppose I just need to find a function that "wraps" a string in backticks, so that I can call dplyr::select_(df, backtick(myvar))

As MyFlick said in the comments, this behaviour should generally be avoided, but if you want to make it work you can make your own backtick wrapper
backtick <- function(x) paste0("`", x, "`")
dplyr::select_(df, backtick(myvar))
EDIT: Hadley replied to my tweets about this and showed me that simply using as.name will work for this instead of using backticks:
df <- dplyr::data_frame(`a b` = 1)
myvar <- "a b"
dplyr::select_(df, as.name(myvar))
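To see why as.name works where a plain string does not, it may help to look at what it returns. The following is a minimal base-R sketch (no dplyr involved; the data frame is rebuilt with data.frame and check.names = FALSE purely for illustration): as.name turns the string into a symbol, and a symbol can refer to a column whatever characters its name contains.

```r
# Illustrative base-R sketch; df and myvar mirror the question's example.
df <- data.frame(`a b` = 1, check.names = FALSE)
myvar <- "a b"

sym <- as.name(myvar)        # a symbol, not a character string
is.symbol(sym)               # TRUE

# Evaluating the symbol inside the data frame returns the column itself.
col_value <- eval(sym, df)
col_value                    # 1
```

Note that select_() and the other underscore verbs have been deprecated since dplyr 0.7.0, so in current code the same idea is usually expressed with select() and a tidyselect helper such as all_of().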

My solution was to exploit the ability of select to use column positions. The as.name solution did not appear to work for some of my columns.
select(df, which(names(df) %in% myvar))
or even more succinctly if already in a pipe:
df %>%
select(which(names(.) %in% myvar))
Although this uses select, in my view it does not rely on NSE.
Note that if there are no matches, all columns will be dropped with no error or warning.
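One way to avoid the silent "no matches" case is to check the names before subsetting. The sketch below uses base R only; select_by_name is a hypothetical helper written for this illustration, not a dplyr function.

```r
# select_by_name() is a hypothetical helper, not part of dplyr.
select_by_name <- function(df, vars) {
  missing_vars <- setdiff(vars, names(df))
  if (length(missing_vars) > 0) {
    # Fail loudly instead of silently returning zero columns
    stop("Unknown column(s): ", paste(missing_vars, collapse = ", "))
  }
  # match() keeps the order given in vars; drop = FALSE keeps a data frame
  df[, match(vars, names(df)), drop = FALSE]
}

df <- data.frame(`a b` = 1, c = 2, check.names = FALSE)
names(select_by_name(df, "a b"))   # "a b"
```

Calling select_by_name(df, "zzz") stops with an error rather than returning an empty data frame.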

Related

Dplyr Filter Multiple Like Conditions

I am trying to do a filter in dplyr where a column is like certain observations. I can use sqldf as
Test <- sqldf("select * from database
Where SOURCE LIKE '%ALPHA%'
OR SOURCE LIKE '%BETA%'
OR SOURCE LIKE '%GAMMA%'")
I tried to use the following which doesn't return any results:
database %>% dplyr::filter(SOURCE %like% c('%ALPHA%', '%BETA%', '%GAMMA%'))
Thanks
You can use grepl with ALPHA|BETA|GAMMA, which will match if any of the three patterns is contained in SOURCE column.
database %>% filter(grepl('ALPHA|BETA|GAMMA', SOURCE))
If you want it to be case insensitive, add ignore.case = TRUE in grepl.
%like% is from the data.table package. You're probably also seeing this warning message:
Warning message:
In grepl(pattern, vector) :
argument 'pattern' has length > 1 and only the first element will be used
The %like% operator is just a wrapper around the grepl function, which does string matching using regular expressions. So % aren't necessary, and in fact they represent literal percent signs.
You can only supply one pattern to match at a time, so either combine them using the regex 'ALPHA|BETA|GAMMA' (as Psidom suggests) or break the tests into three statements:
database %>%
dplyr::filter(
SOURCE %like% 'ALPHA' |
SOURCE %like% 'BETA' |
SOURCE %like% 'GAMMA'
)
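If the patterns live in a character vector, the alternation can be built rather than typed by hand. This is a small base-R sketch with made-up data (the database data frame here is an assumption for illustration); it also shows that SQL-style % wildcards are treated as literal characters by grepl.

```r
# Toy data standing in for the question's `database` (assumed for illustration)
database <- data.frame(
  SOURCE = c("ALPHA one", "two beta", "x%GAMMA%y", "delta"),
  stringsAsFactors = FALSE
)

patterns <- c("ALPHA", "BETA", "GAMMA")
regex <- paste(patterns, collapse = "|")   # "ALPHA|BETA|GAMMA"

# Keep rows whose SOURCE contains any of the patterns, ignoring case
hits <- database[grepl(regex, database$SOURCE, ignore.case = TRUE), , drop = FALSE]
hits$SOURCE

# "%" is not a wildcard in a regular expression; it matches a literal percent sign
grepl("%GAMMA%", c("GAMMA", "x%GAMMA%y"))  # FALSE TRUE
```

If the patterns may contain regex metacharacters (such as . or +), they would need escaping before being pasted together.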
Building on Psidom's and Nathan Werth's responses, a concise, tidyverse-friendly method is:
library(data.table); library(tidyverse)
database %>%
dplyr::filter(SOURCE %ilike% "ALPHA|BETA|GAMMA") # %ilike% = case-insensitive regex match

group_by with non-scalar character vectors using tidyeval

Using R 3.2.2 and dplyr 0.7.2 I'm trying to figure out how to effectively use group_by with fields supplied as character vectors.
Selecting is easy: I can select a field via a string like this
(function(field) {
mpg %>% dplyr::select(field)
})("cyl")
Multiple fields via multiple strings like this
(function(...) {
mpg %>% dplyr::select(!!!quos(...))
})("cyl", "hwy")
and multiple fields via one character vector with length > 1 like this
(function(fields) {
mpg %>% dplyr::select(fields)
})(c("cyl", "hwy"))
With group_by I cannot really find a way to do this for more than one string because if I manage to get an output it ends up grouping by the string I supply.
I managed to group by one string like this
(function(field) {
mpg %>% group_by(!!field := .data[[field]]) %>% tally()
})("cyl")
Which is already quite ugly.
Does anyone know what I have to write so I can run
(function(field) {...})("cyl", "hwy")
and
(function(field) {...})(c("cyl", "hwy"))
respectively? I tried all sorts of combinations of !!, !!!, UQ, enquo, quos, unlist, etc... and saving them in intermediate variables because that sometimes seems to make a difference, but cannot get it to work.
select() is very special in dplyr. It doesn't accept columns, but column names or positions. So that's about the only main verb that accepts strings. (Technically when you supply a bare name like cyl to select, it actually gets evaluated as its own name, not as the vector inside the data frame.)
If you want your function to take simple strings, as opposed to bare expressions or symbols, you don't need quosures. Just create symbols from the strings and unquote them:
myselect <- function(...) {
syms <- syms(list(...))
select(mtcars, !!! syms)
}
mygroup <- function(...) {
syms <- syms(list(...))
group_by(mtcars, !!! syms)
}
myselect("cyl", "disp")
mygroup("cyl", "disp")
To debug the unquoting, wrap with expr() and check that the expression looks right:
syms <- syms(list("cyl", "disp"))
expr(group_by(mtcars, !!! syms))
#> group_by(mtcars, cyl, disp) # yup, looks right!
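For intuition, the same "strings to symbols to call" idea can be reproduced in base R without rlang. This is only an analogue of what syms() and !!! achieve, not how rlang implements them; the call is built but never evaluated, so dplyr does not need to be loaded.

```r
fields <- c("cyl", "disp")

# Turn each string into a symbol (base-R counterpart of rlang::syms)
field_syms <- lapply(fields, as.name)

# Splice the symbols into an unevaluated call: group_by(mtcars, <symbols...>)
built <- as.call(c(list(as.name("group_by"), as.name("mtcars")), field_syms))

deparse(built)   # "group_by(mtcars, cyl, disp)"
```

Because fields is an ordinary character vector, this covers both the "several strings" and the "one vector of length > 1" cases from the question.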
See this talk for more on this (we'll update the programming vignette to make the concepts clearer): https://schd.ws/hosted_files/user2017/43/tidyeval-user.pdf.
Finally, note that many verbs have a _at suffix variant that accepts strings and character vectors without fuss:
group_by_at(mtcars, c("cyl", "disp"))

How to specify "does not contain" in dplyr filter

I am quite new to R.
Using the table called SE_CSVLinelist_clean, I want to extract the rows where the Variable called where_case_travelled_1 DOES NOT contain the strings "Outside Canada" OR "Outside province/territory of residence but within Canada". Then create a new table called SE_CSVLinelist_filtered.
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
where_case_travelled_1 %in% -c('Outside Canada','Outside province/territory of residence but within Canada'))
The code above works when I just use "c" and not "-c".
So, how do I specify the above when I really want to exclude rows that contain either of those values (travel outside of the country or province)?
Note that %in% returns a logical vector of TRUE and FALSE. To negate it, you can use ! in front of the logical statement:
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
!where_case_travelled_1 %in%
c('Outside Canada','Outside province/territory of residence but within Canada'))
Regarding your original approach with -c(...), - is a unary operator that "performs arithmetic on numeric or complex vectors (or objects which can be coerced to them)" (from help("-")). Since you are dealing with a character vector that cannot be coerced to numeric or complex, you cannot use -.
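The behaviour of ! and unary - can be checked on a plain vector. This is a base-R sketch with invented values (the travel vector is an assumption for illustration):

```r
travel <- c("Outside Canada", "Within province", NA)
excluded <- c("Outside Canada",
              "Outside province/territory of residence but within Canada")

# %in% returns TRUE/FALSE (never NA), so ! simply flips it
keep <- !(travel %in% excluded)
keep   # FALSE TRUE TRUE

# Unary minus on a character vector is an error; capture it instead of stopping
minus_result <- tryCatch(-excluded, error = function(e) e)
inherits(minus_result, "error")   # TRUE
```

One consequence worth noting: because NA %in% excluded is FALSE, rows with a missing value are kept by the negated test.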
Try wrapping the condition in parentheses, as shown below. The parenthesised %in% test returns a logical vector, and comparing it with FALSE keeps only the rows whose value does not match any of the options in the vector.
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
(where_case_travelled_1 %in% c('Outside Canada','Outside province/territory of residence but within Canada')) == FALSE)
Just be careful with the previous solutions, since they require you to type out EXACTLY the string you are trying to detect.
Ask yourself if the word "Outside", for example, is sufficient. If so, then:
data_filtered <- data %>%
filter(!str_detect(where_case_travelled_1, "Outside"))
A reprex version:
library(dplyr)
library(stringr)
iris %>%
filter(!str_detect(Species, "versicolor"))
Quick fix. First define the opposite of %in%:
'%ni%' <- Negate("%in%")
Then apply:
SE_CSVLinelist_filtered <- filter(
SE_CSVLinelist_clean,
where_case_travelled_1 %ni% c('Outside Canada',
'Outside province/territory of residence but within Canada'))

Using column names with signs of a data frame in a qplot

I have a dataset and unfortunately some of the column labels in my dataframe contain signs (- or +). This doesn't seem to bother the dataframe, but when I try to plot this with qplot it throws me an error:
x <- 1:5
y <- x
names <- c("1+", "2-")
mydf <- data.frame(x, y)
colnames(mydf) <- names
mydf
qplot(1+, 2-, data = mydf)
and if I enclose the column names in quotes it will just give me a category (or something to that effect, it'll give me a plot of "1+" vs. "2-" with one point in the middle).
Is it possible to do this easily? I looked at aes_string but didn't quite understand it (at least not enough to get it to work).
Thanks in advance.
P.S. I have searched for a solution online but can't quite find anything that helps me with this (it could be due to some aspect I don't understand), so I reason it might be because this is a completely retarded naming scheme I have :p.
Since you have non-standard column names, you need to use backticks (`) in your column references.
For example:
mydf$`1+`
[1] 1 2 3 4 5
So, your qplot() call should look like this:
qplot(`1+`, `2-`, data = mydf)
You can find more information in ?Quotes and ?names
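A quick way to tell whether a name needs backticks is to compare it with what make.names would produce. The sketch below is base R only; is_syntactic is a hypothetical helper defined here for illustration.

```r
mydf <- data.frame(1:5, 1:5)
colnames(mydf) <- c("1+", "2-")

# Hypothetical helper: a name is syntactic if make.names() leaves it unchanged
is_syntactic <- function(x) make.names(x) == x
is_syntactic(names(mydf))          # FALSE FALSE

# Non-syntactic names still work with [[ ]] and with backticks
mydf[["1+"]]
total <- with(mydf, `1+` + `2-`)
total                              # 2 4 6 8 10
```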
As said in the other answer, you have a problem because you don't have standard names. One solution that avoids backtick notation is to convert the column names to a standard form. Another motivation for converting the names to regular ones is that you can't use backticks in a lattice plot, for example. Using gsub you can do this:
gsub('(^[0-9]+)[+|-]+|[+|-]+','a\\1',c("1+", "2-","a--"))
[1] "a1" "a2" "aa"
Hence, applying this to your example :
colnames(mydf) <- gsub('(^[0-9]+)[+|-]+|[+|-]+','a\\1',colnames(mydf))
qplot(a1,a2,data = mydf)
EDIT
You can also use make.names with the option unique = TRUE:
make.names(c("10+", "20-", "10-", "a30++"),unique=T)
[1] "X10." "X20." "X10..1" "a30.."
If you don't like R's naming rules, here is a custom version using gsubfn:
library(gsubfn)
gsubfn("[+|-]|^[0-9]+",
function(x) switch(x,'+'= 'a','-' ='b',paste('x',x,sep='')),
c("10+", "20-", "10-", "a30++"))
"x10a" "x20b" "x10b" "a30aa" ## note x10b looks better than X10..1

What do the %op% operators in R mean? For example "%in%"?

I tried to do this simple search but couldn't find anything on the percent (%) symbol in R.
What does %in% mean in the following code?
time(x) %in% time(y) where x and y are matrices.
How do I look up help on %in% and similar functions that follow the %stuff% pattern, as I cannot locate the help file?
Related questions:
What does eg %+% do? in R
The R %*% operator
What does %*% mean in R
What does %||% do in R?
What does %>% mean in R
I didn't think GSee's or Sathish's answers went far enough because "%" does have meaning all by itself and not just in the context of the %in% operator. It is the mechanism for defining new infix operators by users. It is a much more general issue than the virtues of the %in% infix operator or its more general prefix ancestor match. It could be as simple as making a pairwise "s"(um) operator:
`%s%` <- function(x,y) x + y
Or it could be more interesting, say making a second derivative operator:
`%DD%` <- function(expr, nam="x") { D(D( bquote(.(expr)), nam), nam) }
expression(x^4) %DD% "x"
# 4 * (3 * x^2)
The %-character also has importance in the parsing of Date, date-time, and C-type format functions like strptime, formatC and sprintf.
Since that was originally written we have seen the emergence of the magrittr package with the dplyr elaboration that demonstrates yet another use for %-flanked operators.
So the most general answer is that % symbols are handled specially by the R parser. Since the parser is used to process plotmath expressions, you will also see extensive options for graphics annotations at the ?plotmath help page.
%op% denotes an infix binary operator. There are several built-in operators using %, and you can also create your own.
(A single % sign isn't a keyword in R. You can see a list of keywords on the ?Reserved help page.)
How do I get help on binary operators?
As with anything that isn't a standard variable name, you have to enclose the term in quotes or backquotes.
?"%in%"
?`%in%`
Credit: GSee's answer.
What does %in% do?
As described on the ?`%in%` help page (which is actually the ?match help page, since %in% is really only an infix version of match),
[%in%] returns a logical vector indicating if there is a match or not for its left operand
It is most commonly used with categorical variables, though it can be used with numbers as well.
c("a", "A") %in% letters
## [1] TRUE FALSE
1:4 %in% c(2, 3, 5, 7, 11)
## [1] FALSE TRUE TRUE FALSE
Credit: GSee's answer, Ari's answer, Sathish's answer.
How do I create my own infix binary operators?
These are functions, and can be defined in the same way as any other function, with a couple of restrictions.
It's a binary operator, so the function must take exactly two arguments.
Since the name is non-standard, it must be written with quotes or backquotes.
For example, this defines a matrix power operator.
`%^%` <- function(x, y) matrixcalc::matrix.power(x, y)
matrix(1:4, 2) %^% 3
Credit: BondedDust's answer, Ari's answer.
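If matrixcalc is not installed, a similar operator can be sketched in base R by repeated multiplication. The name %pow% is made up here to avoid clashing with the %^% defined above, and this naive version only handles positive whole-number exponents.

```r
# %pow% is a hypothetical operator name chosen for this illustration.
`%pow%` <- function(x, n) {
  stopifnot(is.matrix(x), nrow(x) == ncol(x), n >= 1, n == round(n))
  # Multiply n copies of x together: x %*% x %*% ... %*% x
  Reduce(`%*%`, replicate(n, x, simplify = FALSE))
}

m <- matrix(1:4, 2)
m %pow% 3
```

The result should agree with writing m %*% m %*% m out by hand.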
What other % operators are there?
In base R:
%/% and %% perform integer division and modulus (remainder after division) respectively, and are described on the ?Arithmetic help page.
%o% gives the outer product of arrays.
%*% performs matrix multiplication.
%x% performs the Kronecker product of arrays.
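A few one-liners illustrating the base R operators just listed (a quick sketch; the inputs are arbitrary):

```r
7 %/% 2                            # 3   (integer division)
7 %% 2                             # 1   (remainder)
(-7) %/% 2                         # -4  (rounds towards minus infinity)
(-7) %% 2                          # 1   (result takes the sign of the divisor)

dim(1:3 %o% 1:2)                   # 3 2 (outer product of two vectors)
1:2 %*% 1:2                        # 1 x 1 matrix containing 5 (inner product)
dim(diag(2) %x% matrix(1, 1, 2))   # 2 4 (Kronecker product)
```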
In ggplot2:
%+% replaces the data frame in a ggplot.
%+replace% modifies theme elements in a ggplot.
%inside% (internal) checks for values in a range.
%||% (internal) provides a default value in case of NULL values. This function also appears internally in devtools, reshape2, roxygen2 and knitr. (In knitr it is called %n%.)
In magrittr:
%>% pipes the left-hand side into an expression on the right-hand side.
%<>% pipes the left-hand side into an expression on the right-hand side, and then assigns the result back into the left-hand side object.
%T>% pipes the left-hand side into an expression on the right-hand side, which it uses only for its side effects, returning the left-hand side.
%,% builds a functional sequence.
%$% exposes columns of a data.frame or members of a list.
In data.table:
%between% checks for values in a range.
%chin% is like %in%, optimised for character vectors.
%like% checks for regular expression matches.
In Hmisc:
%nin% returns the opposite of %in%.
In devtools:
%:::% (internal) gets a variable from a namespace passed as a string.
In sp:
%over% performs a spatial join (e.g., which polygon corresponds to some points?)
In rebus:
%R% concatenates elements of a regex object.
More generally, you can find all the operators in all the packages installed on your machine using:
library(magrittr)
ip <- installed.packages() %>% rownames
(ops <- setNames(ip, ip) %>%
lapply(
function(pkg)
{
rdx_file <- system.file("R", paste0(pkg, ".rdx"), package = pkg)
if(file.exists(rdx_file))
{
rdx <- readRDS(rdx_file)
fn_names <- names(rdx$variables)
fn_names[grepl("^%", fn_names)]
}
}
) %>%
unlist
)
Put quotes around it to find the help page. Either of these works:
> help("%in%")
> ?"%in%"
Once you get to the help page, you'll see that
‘%in%’ is currently defined as
‘"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0’
Since time is a generic, I don't know what time(X2) returns without knowing what X2 is. But, %in% tells you which items from the left hand side are also in the right hand side.
> c(1:5) %in% c(3:8)
[1] FALSE FALSE TRUE TRUE TRUE
See also, intersect
> intersect(c(1:5), c(3:8))
[1] 3 4 5
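The relationship between %in%, match and intersect can be checked directly. A minimal base-R sketch using the same vectors as above:

```r
x <- 1:5
y <- 3:8

# %in% gives the same answer as the match()-based definition from the help page
same_as_match <- identical(x %in% y, match(x, y, nomatch = 0L) > 0L)
same_as_match                      # TRUE

# Using the logical vector to subset x gives the common elements
common <- x[x %in% y]
common                             # 3 4 5
identical(common, intersect(x, y)) # TRUE
```

The two approaches differ when x contains repeated values: subsetting with %in% keeps the repeats, whereas intersect drops duplicates.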
More generally, %foo% is the syntax for a binary operator. Binary operators in R are really just functions in disguise, and take two arguments (the one before and the one after the operator become the first two arguments of the function).
For example:
> `%in%`(1:5,4:6)
[1] FALSE FALSE FALSE TRUE TRUE
While %in% is defined in base R, you can also define your own binary function:
`%hi%` <- function(x,y) cat(x,y,"\n")
> "oh" %hi% "my"
oh my
%in% can also be used to find and subset multiple occurrences of the same name or value in a vector.
Example 1: Subsetting with duplicated names
set.seed(133)
x <- runif(5)
names(x) <- letters[1:5]
x[c("a", "d")]
# a d
# 0.5360112 0.4231022
Now you change the name of "d" to "a"
names(x)[4] <- "a"
If you try to extract both elements named "a" using the previous kind of subscript, it will not work. Name-based indexing only ever returns the first match, so the result repeats element [1] and never includes element [4].
x[c("a", "a")]
# a a
# 0.5360112 0.5360112
So, you can extract the two "a"s from different position in a variable by using %in% binary operator.
names(x) %in% "a"
# [1] TRUE FALSE FALSE TRUE FALSE
#assign it to a variable called "vec"
vec <- names(x) %in% "a"
#extract the values of two "a"s
x[vec]
# a a
# 0.5360112 0.4231022
Example 2: Subsetting multiple values from a column
Refer to this site for an example

Resources