dplyr filter with multiple LIKE conditions - r

I am trying to do a filter in dplyr where a column is like certain observations. I can use sqldf as
Test <- sqldf("select * from database
Where SOURCE LIKE '%ALPHA%'
OR SOURCE LIKE '%BETA%'
OR SOURCE LIKE '%GAMMA%'")
I tried to use the following which doesn't return any results:
database %>% dplyr::filter(SOURCE %like% c('%ALPHA%', '%BETA%', '%GAMMA%'))
Thanks

You can use grepl with the pattern ALPHA|BETA|GAMMA, which will match if any of the three patterns is contained in the SOURCE column.
database %>% filter(grepl('ALPHA|BETA|GAMMA', SOURCE))
If you want it to be case-insensitive, add ignore.case = TRUE in grepl.
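As a quick check of the grepl() approach, here's a minimal sketch with an invented database data frame (the SOURCE values are made up):

```r
library(dplyr)

# Invented stand-in for the poster's data
database <- data.frame(
  SOURCE = c("x-ALPHA-1", "beta feed", "GAMMA", "delta"),
  stringsAsFactors = FALSE
)

# Keep rows whose SOURCE contains any of the three patterns
hits <- database %>% filter(grepl("ALPHA|BETA|GAMMA", SOURCE))

# The case-insensitive version also picks up "beta feed"
hits_ci <- database %>%
  filter(grepl("ALPHA|BETA|GAMMA", SOURCE, ignore.case = TRUE))
```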

%like% is from the data.table package. You're probably also seeing this warning message:
Warning message:
In grepl(pattern, vector) :
argument 'pattern' has length > 1 and only the first element will be used
The %like% operator is just a wrapper around the grepl function, which does string matching using regular expressions. So the % wildcards aren't necessary; in a regular expression they match literal percent signs.
You can only supply one pattern to match at a time, so either combine them using the regex 'ALPHA|BETA|GAMMA' (as Psidom suggests) or break the test into three statements:
database %>%
dplyr::filter(
SOURCE %like% 'ALPHA' |
SOURCE %like% 'BETA' |
SOURCE %like% 'GAMMA'
)
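To see that %like% really is a thin wrapper around grepl(), and that % is treated as a literal character rather than a wildcard, here's a small sketch (the input strings are made up):

```r
library(data.table)

src <- c("main ALPHA feed", "BETA", "other")

# Plain substring/regex match: no % wildcards needed
like_hits <- src %like% "ALPHA"

# SQL-style wildcards look for a literal "%", which never occurs here
wrong_hits <- src %like% "%ALPHA%"
```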

Building on Psidom's and Nathan Werth's responses, for a concise, pipe-friendly method, we can do:
library(data.table); library(tidyverse)
database %>%
dplyr::filter(SOURCE %ilike% "ALPHA|BETA|GAMMA") # %ilike% = case-insensitive %like%

Related

Can I use case_when (or anything else) to write with a non-static string?

I have a data frame in R. I have this working splendidly at the moment as a test of my initial regex. For reference, I have dplyr and magrittr installed, largely for other reasons, and I am following some project-wide conventions as far as whitespace and closing parentheses are concerned:
frame %<>% mutate(
columnA = case_when(
grepl("WXYZ *[1-9]{1,2}", columnB) == TRUE ~ 'HOORAY'
)
)
The thing is, I would like to replace 'HOORAY' with whatever grepl actually found. Right now, I am of course searching for strings containing WXYZ followed by any number of spaces (0 included) and then a single- or double-digit integer.
If, for example, grepl found the string "WXYZ 22", I want the corresponding entry in columnA to be written as "WXYZ 22". But then if it finds "WXYZ5" later, I want it to write "WXYZ5" in its own corresponding entry.
I want, in pseudocode TRUE ~ <what grepl found>.
Can I do this with case_when? If so, is there a better way?
If the case_when structure is necessary, this solution using stringr works:
grepl("WXYZ *[1-9]{1,2}", columnB) ~ str_extract(columnB, "WXYZ *[1-9]{1,2}")
Depending on what the bigger problem setup looks like, you could also just do:
mutate(columnA = str_extract(columnB, "WXYZ *[1-9]{1,2}"))
Note that columnA would be NA for situations where it fails to match. Also note that while grep expects the pattern first and then the target string, stringr functions expect the opposite.
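A self-contained sketch of the str_extract() approach, with a made-up frame and columnB values mirroring the question:

```r
library(dplyr)
library(stringr)

frame <- data.frame(
  columnB = c("WXYZ 22", "WXYZ5", "nothing here"),
  stringsAsFactors = FALSE
)

# Extract the matched text itself; non-matching rows get NA
frame <- frame %>%
  mutate(columnA = str_extract(columnB, "WXYZ *[1-9]{1,2}"))
```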

Data Management in R

So I have this code where I am trying to unite separate columns called grade prek-12 into one column called Grade. I have employed the tidyr package and used this line of code to perform said task:
unite(dta, "Grade",
c(Gradeprek,
dta$Gradek, dta$Grade1, dta$Grade2,
dta$Grade3, dta$Grade4, dta$Grade5,
dta$Grade6, dta$Grade7, dta$Grade8,
dta$Grade9, dta$Grade10, dta$Grade11,
dta$Grade12),
sep="")
However, I have been getting an error saying this:
error: All select() inputs must resolve to integer column positions.
The following do not: * c(Gradeprek, dta$Gradek, dta$Grade1, dta$Grade2, dta$Grade3, dta$Grade4, dta$Grade5, dta$Grade6, ...
Penny for your thoughts on how I can resolve the situation.
You are mixing and matching the two syntax options for unite() and unite_(); you need to pick one and stick with it. In both cases, do not use dta$column: they take a data argument, so you don't need to re-specify which data frame your columns come from.
Option 1: NSE. The default non-standard evaluation means bare column names: no quotes, and no c().
unite(dta, Grade, Gradeprek, Gradek, Grade1, Grade2, Grade3, ...,
Grade12, sep = "")
There are tricks you can do with this. For example, if all your Grade columns are in this order next to each other in your data frame, you could do
unite(dta, Grade, Gradeprek:Grade12, sep = "")
You could also use starts_with("Grade") to get all columns that begin with that string. See ?unite and its link to ?select for more details.
Option 2: Standard Evaluation. You can use unite_() for a standard-evaluating alternative, which expects column names in a character vector. This has the advantage in this case of letting you use paste0() to build column names in the order you want:
unite_(dta, col = "Grade", c("Gradeprek", "Gradek", paste0("Grade", 1:12)), sep = "")
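A runnable sketch of the NSE form with a tiny fabricated dta (only three of the Grade columns, but the range syntax is the same):

```r
library(tidyr)

dta <- data.frame(
  Gradeprek = c("", "X"),
  Gradek    = c("X", ""),
  Grade1    = c("", ""),
  stringsAsFactors = FALSE
)

# Collapse the consecutive Grade columns into a single Grade column;
# the input columns are removed by default
out <- unite(dta, Grade, Gradeprek:Grade1, sep = "")
```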

How to specify "does not contain" in dplyr filter

I am quite new to R.
Using the table called SE_CSVLinelist_clean, I want to extract the rows where the variable called where_case_travelled_1 DOES NOT contain the strings "Outside Canada" OR "Outside province/territory of residence but within Canada". Then create a new table called SE_CSVLinelist_filtered.
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
where_case_travelled_1 %in% -c('Outside Canada','Outside province/territory of residence but within Canada'))
The code above works when I just use "c" and not "-c".
So, how do I specify the above when I really want to exclude rows that contains that outside of the country or province?
Note that %in% returns a logical vector of TRUE and FALSE. To negate it, you can use ! in front of the logical statement:
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
!where_case_travelled_1 %in%
c('Outside Canada','Outside province/territory of residence but within Canada'))
Regarding your original approach with -c(...), - is a unary operator that "performs arithmetic on numeric or complex vectors (or objects which can be coerced to them)" (from help("-")). Since you are dealing with a character vector that cannot be coerced to numeric or complex, you cannot use -.
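A minimal sketch of the negated %in% filter, with shortened stand-in values for where_case_travelled_1:

```r
library(dplyr)

SE_CSVLinelist_clean <- data.frame(
  where_case_travelled_1 = c("Outside Canada",
                             "Within province",
                             "Outside Canada"),
  stringsAsFactors = FALSE
)

# Keep only rows whose value is NOT in the exclusion vector
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
  !where_case_travelled_1 %in% c("Outside Canada"))
```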
Try wrapping the search condition in parentheses, as shown below. The expression inside the parentheses returns the result of the conditional query; comparing it to FALSE keeps only the rows that do not match any of the options in the vector.
SE_CSVLinelist_filtered <- filter(SE_CSVLinelist_clean,
(where_case_travelled_1 %in% c('Outside Canada','Outside province/territory of residence but within Canada')) == FALSE)
Just be careful with the previous solutions, since they require you to type out EXACTLY the string you are trying to detect.
Ask yourself if the word "Outside", for example, is sufficient. If so, then:
library(stringr)
data_filtered <- data %>%
filter(!str_detect(where_case_travelled_1, "Outside"))
A reprex version:
iris
iris %>%
filter(!str_detect(Species, "versicolor"))
Quick fix. First define the opposite of %in%:
'%ni%' <- Negate("%in%")
Then apply:
SE_CSVLinelist_filtered <- filter(
SE_CSVLinelist_clean,
where_case_travelled_1 %ni% c('Outside Canada',
'Outside province/territory of residence but within Canada'))
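A quick sketch confirming the Negate() trick behaves as expected on a toy vector:

```r
# %ni% is TRUE exactly where %in% is FALSE
'%ni%' <- Negate('%in%')

x <- c("a", "b", "c")
res <- x %ni% c("a", "c")
```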

Filter contents of data.table based on MULTIPLE regex matches

I am trying to accomplish the same goal as is resolved in this question, but I want to filter the table by two grep statements. When I try this:
DT[grep("word1", column) | grep("word2", column)]
I get this error:
Warning message:
In grep("word1", column) | grep("word2", column) :
longer object length is not a multiple of shorter object length
And when I try to combine this logic with an assignment := in the j argument of the data.table, I get all kinds of weirdness. Basically, it's apparent that the OR operator | doesn't work with grep() calls in the i argument of a data.table.
I came up with a messy workaround:
DT.a <- DT[grep("word1", column)]
DT.b <- DT[grep("word2", column)]
DT.all <- rbind(DT.a,DT.b)
but I'm hoping there's a better way to accomplish this goal. Any ideas?
The issue here turned out to be a combination of function choice and where the OR operator | is placed. In DT[grep("word1", column) | grep("word2", column)], each grep() returns an integer vector of matching indices, and those vectors can have different lengths depending on the data, so | recycles one against the other and produces the warning. grepl() is the appropriate function here because it returns a logical vector with one TRUE/FALSE per row, and the alternation can simply go inside the regex pattern string.
Solution:
DT[grepl("word1|word2", column)]
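A toy data.table makes the fix concrete (the column name and words are invented). Note that grepl("word1", column) | grepl("word2", column) also works, since both sides are now logical vectors of equal length:

```r
library(data.table)

DT <- data.table(
  column = c("has word1", "has word2", "neither", "word1 and word2")
)

# Alternation inside the pattern
res  <- DT[grepl("word1|word2", column)]

# Equivalent: OR two logical vectors of the same length
res2 <- DT[grepl("word1", column) | grepl("word2", column)]
```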

select columns based on first letter of columns using grep or grepl in r

I am attempting to use the dplyr package to select all columns that start with i. I have the following code:
dat <- select(newdat1,starts_with("i"))
and the colnames for my data are:
> colnames(newdat)
[1] "i22" "i21" "i20" "i24"
It is just a coincidence in this case they all start with i, as in other cases there will be a larger variety; thus, I want to automate the process. The issue is it appears my code using dplyr is correct; however, I am having issues with the package, so I was wondering if/how to accomplish the same task with grep or grepl, or anything really using the base package. Thanks!
With base R, you can use grep() to match column names. You can use
dat <- newdat1[, grep("^i", colnames(newdat1))]
to do a starts-with-style query. You can use any regular expression you want as the pattern in grep().
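For instance (newdat1 and its column names are invented here, with one non-matching column added to show the filtering):

```r
# Base-R equivalent of dplyr::select(newdat1, starts_with("i"))
newdat1 <- data.frame(i22 = 1, i21 = 2, x20 = 3, i24 = 4)

# "^i" anchors the match to the start of the column name
dat <- newdat1[, grep("^i", colnames(newdat1))]
```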