Issue with gsub and str_replace_all in R

I need to replace 38 different types of expressions in the following format "IDENTIFIER:ABC:DEF", "IDENTIFIER:GHI:JKL", etc. with plain strings like "apple" and "banana". I've tried using str_replace_all in the following format:
df$column <- df$column %>% str_replace_all("IDENTIFIER:ABC:DEF", "apple")
df$column <- df$column %>% str_replace_all("IDENTIFIER:GHI:JKL", "banana")
However, for some reason, R only processes about half of my requests. I've checked and double checked for errors and tried to break up the code but no success.
So then I tried the same with gsub:
df$column <- gsub(df$column, "IDENTIFIER:ABC:DEF", "apple")
df$column <- gsub(df$column, "IDENTIFIER:GHI:JKL", "banana")
and I get this error: "In gsub(df$column ...): argument 'pattern' has length > 1 and only the first element will be used".
I'm really not sure what to do next. Any advice?

gsubfn, in the package of the same name, provides a superset of gsub functionality. In particular, it can take a list as the replacement instead of a string: for each match of the regular expression, if the match equals one of the list names, it is replaced with the corresponding list value.
library(gsubfn)
x <- "xyz IDENTIFIER:ABC:DEF abc IDENTIFIER:GHI:JKL def" # test input
L <- list("IDENTIFIER:ABC:DEF" = "apple", "IDENTIFIER:GHI:JKL" = "banana")
gsubfn("\\y\\S+\\y", L, x)
## [1] "xyz apple abc banana def"
This also works:
gsubfn("\\b\\S+\\b", L, x, perl = TRUE)
## [1] "xyz apple abc banana def"
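As a side note on the warning in the question: gsub's arguments are gsub(pattern, replacement, x), so passing df$column first makes the whole column the pattern. Below is a minimal base R sketch of the corrected call. It assumes a hypothetical lookup vector named replacements and a stand-in vector column, and it assumes the identifiers should be matched literally (hence fixed = TRUE).

```r
# Hypothetical test data standing in for df$column
column <- c("xyz IDENTIFIER:ABC:DEF abc", "IDENTIFIER:GHI:JKL def", "no match")

# Hypothetical lookup: names are the strings to find, values the replacements
replacements <- c("IDENTIFIER:ABC:DEF" = "apple",
                  "IDENTIFIER:GHI:JKL" = "banana")

# gsub() takes (pattern, replacement, x); fixed = TRUE matches literally
for (id in names(replacements)) {
  column <- gsub(id, replacements[[id]], column, fixed = TRUE)
}
column
```

If one identifier is a prefix of another, it is probably safer to replace the longer one first.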

Related

How to pass multiple necessary patterns to str_subset?

I am trying to find elements in a character vector that match two words in no particular order, not just any single one of them, using the stringr::str_subset function. In other words, I'm looking for the intersection, not the union of the two words.
I tried using the "or" (|) operator but this only gives me either one of the two words and returns too many results. I also tried just passing a character vector with the two words as the pattern argument. This just raises the warning that "longer object length is not a multiple of shorter object length" and only returns the values that match the second one of the two words.
character_vector <- c("abc ghi jkl mno def", "pqr abc def", "abc jkl pqr")
pattern <- c("def", "pqr")
str_subset(character_vector, pattern)
I'm looking for the pattern that will return only the second element of the character vector, i.e. "pqr abc def".
One option is str_detect. Loop over 'pattern', test each element against 'character_vector', combine the resulting logical vectors with &, and use the result to extract the matching elements of 'character_vector'.
library(tidyverse)
map(pattern, str_detect, string = character_vector) %>%
reduce(`&`) %>%
magrittr::extract(character_vector, .)
#[1] "pqr abc def"
Or using str_subset
map(pattern, str_subset, string = character_vector) %>%
reduce(intersect)
#[1] "pqr abc def"
You can use base R without a loop, using a regular expression built from lookaheads. The code is like this:
character_vector[grepl(paste0("(?=.*",pattern,")",collapse = ""), character_vector, perl = TRUE)]
Here grepl returns a logical vector that flags the elements satisfying every lookahead built by paste0, and that vector is used to subset character_vector.
As you are looking for the intersection, you can use the function intersect() and spell out the 2 patterns you are looking for
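To make the lookahead approach easier to follow, here is the same idea split into steps. The intermediate names combined and hits are only for illustration.

```r
character_vector <- c("abc ghi jkl mno def", "pqr abc def", "abc jkl pqr")
pattern <- c("def", "pqr")

# One lookahead per word, concatenated so that every one must succeed
combined <- paste0("(?=.*", pattern, ")", collapse = "")
combined
# [1] "(?=.*def)(?=.*pqr)"

# Lookaheads need the PCRE engine, hence perl = TRUE
hits <- grepl(combined, character_vector, perl = TRUE)
hits
# [1] FALSE  TRUE FALSE

character_vector[hits]
# [1] "pqr abc def"
```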
pattern_1 <- 'pqr'
pattern_2 <- 'def'
intersect(
str_subset(character_vector, pattern_1),
str_subset(character_vector, pattern_2)
)
Will this work?
character_vector %>% purrr::reduce(pattern, str_subset, .init = . )
[1] "pqr abc def"

Removing text between parentheses with unmatched pairs

I am trying to remove characters/numbers between parentheses. Firstly, the numbered parentheses - i.e. ("(3)") - at the start, and then anything in the second pair of parentheses. Sometimes this second pair of parentheses has an unmatched bracket which complicates things. An example:
library(qdapRegex)
n <- c("(1) Apple (Pe(ar)", "(2) Apple (Or(ang)e)", "(3) Banana (Hot(dog)")
c <- rm_between(n,"(",")", extract = TRUE)
To ideally get:
c
> "Apple" "Apple" "Banana"
It seems that you always need the second word. If that is the case then here are a couple of (straightforward) ways of doing it:
#Base R
sapply(strsplit(n, ' '), `[`, 2)
[1] "Apple" "Apple" "Banana"
#The always fun, word() from stringr package
stringr::word(n, 2)
[1] "Apple" "Apple" "Banana"
If you want to use regex, you could replace every match of this pattern with an empty string:
[^A-Za-z ]
Or with insensitive flag
(?i)[^a-z ]
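Note that a character class such as [^A-Za-z ] only strips the non-letters, so the letters inside the parentheses stay behind. A hedged base R sketch that removes the parenthesised parts instead, assuming the numbered prefix is always at the start and the wanted text never contains an opening parenthesis:

```r
n <- c("(1) Apple (Pe(ar)", "(2) Apple (Or(ang)e)", "(3) Banana (Hot(dog)")

# The character-class approach keeps the letters inside the parentheses
gsub("[^A-Za-z ]", "", n[1])
# [1] " Apple Pear"

# Step 1: drop the leading "(number)" and any whitespace after it
no_prefix <- sub("^\\([0-9]+\\)[[:space:]]*", "", n)

# Step 2: drop everything from the first remaining "(" to the end,
# which works whether or not the brackets are balanced
result <- sub("[[:space:]]*\\(.*$", "", no_prefix)
result
# [1] "Apple"  "Apple"  "Banana"
```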

Using Regex to edit a column in R [duplicate]

I've got a column people$food that has entries like chocolate or apple-orange-strawberry.
I want to split people$food by - and get the first entry from the split.
In python, the solution would be food.split('-')[0], but I can't find an equivalent for R.
If you need to extract the first (or nth) entry from each split, use:
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word,"-"), `[`, 1)
#[1] "apple" "chocolate"
Or faster and more explicitly:
vapply(strsplit(word,"-"), `[`, 1, FUN.VALUE=character(1))
#[1] "apple" "chocolate"
Both versions can select any position in the split list, and return NA when the requested position is out of range:
vapply(strsplit(word,"-"), `[`, 2, FUN.VALUE=character(1))
#[1] "orange" NA
For example
word <- 'apple-orange-strawberry'
strsplit(word, "-")[[1]][1]
[1] "apple"
or, equivalently
unlist(strsplit(word, "-"))[1]
Essentially the idea is that strsplit returns a list, whose elements have to be accessed either by indexing into the list (the former case) or by unlisting it (the latter).
If you want to apply the method to an entire column:
first.word <- function(my.string){
unlist(strsplit(my.string, "-"))[1]
}
words <- c('apple-orange-strawberry', 'orange-juice')
sapply(words, first.word)
apple-orange-strawberry            orange-juice
                "apple"                "orange"
I would use sub() instead. Since you want the first "word" before the split, we can simply remove everything after the first - and that's what we're left with.
sub("-.*", "", people$food)
Here's an example -
x <- c("apple", "banana-raspberry-cherry", "orange-berry", "tomato-apple")
sub("-.*", "", x)
# [1] "apple" "banana" "orange" "tomato"
Otherwise, if you want to use strsplit() you can round up the first elements with vapply()
vapply(strsplit(x, "-", fixed = TRUE), "[", "", 1)
# [1] "apple" "banana" "orange" "tomato"
I would suggest using head rather than [ in R.
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word, "-"), head, 1)
# [1] "apple" "chocolate"
dplyr/magrittr approach:
library(magrittr)
library(dplyr)
word = c('apple-orange-strawberry', 'chocolate')
strsplit(word, "-") %>% sapply(extract2, 1)
# [1] "apple" "chocolate"
Using str_remove() to delete everything from the first - onward, plus a few alternatives:
df <- data.frame(words = c('apple-orange-strawberry', 'chocolate'))
dplyr::mutate(df, short = stringr::str_remove(words, "-.*")) # mutate method
#                     words     short
# 1 apple-orange-strawberry     apple
# 2               chocolate chocolate
stringr::str_remove(df$words, "-.*") # str_remove example
stringr::str_replace(df$words, "-.*", "") # str_replace method
stringr::str_split_fixed(df$words, "-", n=2)[,1] # str_split method similar to original question's Python code
tidyr::separate(df, words, into = c("short", NA)) # using the separate function
stringr 1.5.0 introduced str_split_i to do this easily:
library(stringr)
str_split_i(c('apple-orange-strawberry','chocolate'), "-", 1)
[1] "apple" "chocolate"
The third argument represents the index you want to extract. Also new is that you can use negative values to index from the right:
str_split_i(c('apple-orange-strawberry','chocolate'), "-", -1)
[1] "strawberry" "chocolate"
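For stringr versions older than 1.5.0, indexing from the right can be approximated in base R. This is only a sketch: the names pieces and last_piece are illustrative, and it assumes no input string is empty (an empty string splits into zero pieces and would make vapply fail).

```r
x <- c('apple-orange-strawberry', 'chocolate')

# Split literally on "-" and keep the last piece of each element
pieces <- strsplit(x, "-", fixed = TRUE)
last_piece <- vapply(pieces, function(p) p[length(p)], character(1))
last_piece
# [1] "strawberry" "chocolate"
```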

Changing column names in dataframe using gsub

I have an atomic vector like:
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
I'd like to have _ between words, have them all lower case, except first letters of words (following R Style for dataframes from advanced R). I'd like to have something like this:
new_col_names <- c("Production_Date", "Percent_Load_At_Current_Speed", sprintf("Sensor_%02d", 1:18))
Assume that my words are limited to this list:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
I am thinking of an algorithm that uses gsub, puts _ wherever it finds a word from the above list and then Capitalizes the first letter of each word. Although I can do this manually, I'd like to learn how this can be done more beautifully using gsub. Thanks.
You can paste the list of words into a look-behind, (?<=...), so that an underscore is inserted after each word. On its own this would also match the "AT" inside "DATE", because "at" is in the list of words, so I added the look-ahead (?=.{2,}): a word only gets an underscore after it when at least two more characters follow.
The second gsub just does the capitalization.
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
(pattern <- sprintf('(?i)(?<=%s)(?=.{2,})', paste(list_of_words, collapse = '|')))
# [1] "(?i)(?<=production|speed|percent|load|at|current|sensor)(?=.{2,})"
(split_words <- gsub(pattern, '_', tolower(col_names_to_be_changed), perl = TRUE))
# [1] "production_date" "speed_rpm" "percent_load_at_current_speed"
# [4] "sensor_01" "sensor_02" "sensor_03"
gsub('(?<=^|_)([a-z])', '\\U\\1', split_words, perl = TRUE)
# [1] "Production_Date" "Speed_Rpm" "Percent_Load_At_Current_Speed"
# [4] "Sensor_01" "Sensor_02" "Sensor_03"
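To see what the (?=.{2,}) look-ahead guards against, here is a small comparison on a single name with a shortened, illustrative word list. The names without_guard and with_guard are made up for this sketch.

```r
words <- c("production", "at")

# Look-behind only: an underscore goes after every listed word
without_guard <- sprintf("(?<=%s)", paste(words, collapse = "|"))
# Look-behind plus look-ahead: at least two characters must follow
with_guard <- paste0(without_guard, "(?=.{2,})")

gsub(without_guard, "_", "productiondate", perl = TRUE)
# [1] "production_dat_e"

gsub(with_guard, "_", "productiondate", perl = TRUE)
# [1] "production_date"
```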

How to pass list of grep/regex commands to R function

I have a list of roughly 50 somewhat complicated grepl command strings that I would like to pass to a function in R. I am using these commands to subset a dataframe. Here is an example of 2 of the commands:
t <- subDF[grepl("(extreme|abnormal|unseasonably|unusually|record|excessive) (heat|warm|high temp)",subDF$EVTYPE),]
t <- subDF[grepl("fl(oo)?d",subDF$EVTYPE) & !grepl("flash",subDF$EVTYPE) & !grepl("(tidal|beach|(c(oa)?sta?l))(/tidal)? ?(flood)",subDF$EVTYPE),]
So, in this example I would like to pass these 2 grepl commands to a function that will do this subsetting on dataframe subDF (plus pass the other 48 or so).
Any elegant way to do this?
Here's an example that uses quote to create two unevaluated grepl calls. They are then evaluated in an sapply call with eval.
> fruits <- c("one apple", "two pears", "three bananas")
> QQ <- list(q1 = quote(grepl("(one)|(apple)", fruits)),
q2 = quote(grepl("apple", fruits) | grepl("bananas|one", fruits)))
> sapply(QQ, function(x) fruits[eval(x)])
#$q1
#[1] "one apple"
#
#$q2
#[1] "one apple" "three bananas"
A look at QQ
#$q1
#grepl("(one)|(apple)", fruits)
#
#$q2
#grepl("apple", fruits) | grepl("bananas|one", fruits)
Something else that is useful is
> as.list(quote(grepl("one|apple", fruits)))
# [[1]]
# grepl
#
# [[2]]
# [1] "one|apple"
#
# [[3]]
# fruits
With this, you can replace the regular expression (or call or x) in every iteration by way of [[ indexing.
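A minimal sketch of that [[ replacement, assuming a template call whose second element is the pattern. The names template, patterns and matches are illustrative only.

```r
fruits <- c("one apple", "two pears", "three bananas")

# Unevaluated call; element [[2]] is the pattern argument
template <- quote(grepl("placeholder", fruits))
patterns <- c("apple", "bananas|one")

matches <- lapply(patterns, function(p) {
  cl <- template
  cl[[2]] <- p       # swap in the current regular expression
  fruits[eval(cl)]   # evaluate the modified call and subset
})
matches
```

Here matches[[1]] holds "one apple" and matches[[2]] holds "one apple" and "three bananas".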
It looks like you're after Reduce. Using the iris dataset as an example:
mygrep <- function(x, df) df[grepl(x, df$Species), ]
pat <- c("setosa", "(setosa|virginica)")
Reduce(mygrep, pat, iris, right=TRUE)
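Note that the Reduce call above narrows the data frame with each pattern in turn, so it behaves like an AND of the patterns. If each of the 50 or so conditions should instead produce its own subset, one possible sketch is to store every condition as a function of the column. The data frame events and the list filters are hypothetical stand-ins for subDF and the real conditions.

```r
# Hypothetical stand-in for subDF
events <- data.frame(
  EVTYPE = c("record heat", "flash flood", "river flood", "coastal flood", "hail"),
  stringsAsFactors = FALSE
)

# Each filter takes the column and returns a logical vector
filters <- list(
  heat = function(x) {
    grepl("(extreme|abnormal|unseasonably|unusually|record|excessive) (heat|warm|high temp)", x)
  },
  flood = function(x) {
    grepl("fl(oo)?d", x) & !grepl("flash", x) &
      !grepl("(tidal|beach|(c(oa)?sta?l))(/tidal)? ?(flood)", x)
  }
)

# One subset of the data frame per filter
subsets <- lapply(filters, function(f) events[f(events$EVTYPE), , drop = FALSE])
subsets$flood$EVTYPE
# [1] "river flood"
```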

Resources