In a previous question (replace string in R giving a vector of patterns and vector of replacements) I found that mgsub takes its patterns as literal strings that do not need to be escaped. That is good when you want to replace text like '[%.+%]' as a literal string, but it is a bad thing if you need to pass a real regular expression like:
library('stringr')
library('qdap')
tt_ori <- 'I have VAR1 and VAR2'
ttl <- list(ttregex='VAR([12])', val="val-\\1")
ttl
# OK
stringr::str_replace_all( tt_ori, perl( ttl$ttregex), ttl$val)
# [1] "I have val-1 and val-2"
# OK
mapply(gsub, ttl$ttregex, ttl$val, tt_ori, perl=T)
# [1] "I have val-1 and val-2"
# FAIL
qdap::mgsub(ttl$ttregex, ttl$val, tt_ori)
# [1] "I have VAR1 and VAR2"
How can I pass a regular expression to mgsub?
[UPDATE]
@BondeDust is right: with this oversimplified example the question does not make sense. The reason for wanting to use mgsub is its ability to take a vector of patterns and a vector of replacements and make all the substitutions in a single string.
For example:
> tt_ori <- 'I have VAR1 and VAR2 at CARTESIAN'
> ttl <- list( ttregex=c('VAR([12])', 'CARTESIAN')
+ , valregex=c("val-\\1", "XY")
+ , tt=c('VAR1', 'VAR2', 'CARTESIAN')
+ , val=c('val-1', 'val-2', 'XY')
+ )
> ttl
$ttregex
[1] "VAR([12])" "CARTESIAN"
$valregex
[1] "val-\\1" "XY"
$tt
[1] "VAR1" "VAR2" "CARTESIAN"
$val
[1] "val-1" "val-2" "XY"
# str_replace and gsub return multiple strings with partial substitutions
> stringr::str_replace_all( tt_ori, perl( ttl$ttregex), ttl$valregex)
[1] "I have val-1 and val-2 at CARTESIAN" "I have VAR1 and VAR2 at XY"
> mapply(gsub, ttl$ttregex, ttl$valregex, tt_ori, perl=T)
VAR([12]) CARTESIAN
"I have val-1 and val-2 at CARTESIAN" "I have VAR1 and VAR2 at XY"
# qdap (passing regexes) FAIL
> qdap::mgsub(ttl$ttregex, ttl$valregex, tt_ori)
[1] "I have VAR1 and VAR2 at XY"
# qdap (passing strings) is OK
> qdap::mgsub(ttl$tt, ttl$val, tt_ori)
[1] "I have val-1 and val-2 at XY"
I want to take advantage of regexes where possible rather than write out all the possible strings (sometimes I don't know them in advance).
Change fixed = TRUE to fixed = FALSE in the mgsub() call; by default mgsub treats its patterns as literal strings.
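As a hedged sketch (assuming the qdap version in use exposes the `fixed` argument), the regex example from the question then works, backreferences included:

```r
# qdap::mgsub() treats patterns as fixed strings by default (fixed = TRUE);
# with fixed = FALSE each pattern is handed to gsub() as a regular expression.
library(qdap)

tt_ori   <- "I have VAR1 and VAR2 at CARTESIAN"
ttregex  <- c("VAR([12])", "CARTESIAN")
valregex <- c("val-\\1", "XY")

qdap::mgsub(ttregex, valregex, tt_ori, fixed = FALSE)
# [1] "I have val-1 and val-2 at XY"
```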
Related
I am trying to count the number of | characters in a string. This is my code, but it gives the incorrect answer of 32 instead of 2. Why is this happening, and how do I get a function that returns 2? Thanks!
> levels
[1] "Completely|Partially|Not at all"
> str_count(levels, '|')
[1] 32
Also how do I separate the string by the | character? I would like the output to be a character vector of length 3: 'Completely', 'Partially', 'Not at all'.
The | is meaningful in regex as an "or"-like operator. Escape it with backslashes.
stringr::str_count("Completely|Partially|Not at all", "\\|")
# [1] 2
To show what | is normally used for, let's count the occurrences of el and al:
stringr::str_count("Completely|Partially|Not at all", "al")
# [1] 2
stringr::str_count("Completely|Partially|Not at all", "el")
# [1] 1
stringr::str_count("Completely|Partially|Not at all", "el|al")
# [1] 3
To look for the literal | symbol, it needs to be escaped.
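If you would rather not escape at all, stringr can also be told to treat the pattern as a literal string with its fixed() modifier:

```r
library(stringr)

# fixed("|") matches the pipe byte-for-byte instead of compiling it as a regex
str_count("Completely|Partially|Not at all", fixed("|"))
# [1] 2
```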
To split the string by the | symbol, we can use strsplit (base R) or stringr::str_split:
strsplit("Completely|Partially|Not at all", "\\|")
# [[1]]
# [1] "Completely" "Partially" "Not at all"
It's returned as a list, because the argument may be a vector. For instance, it might be more clear if we do
vec <- c("Completely|Partially|Not at all", "something|else")
strsplit(vec, "\\|")
# [[1]]
# [1] "Completely" "Partially" "Not at all"
# [[2]]
# [1] "something" "else"
The pipe | character is a regex metacharacter and needs to be escaped:
levels <- "Completely|Partially|Not at all"
str_count(levels, '\\|')
Another general trick you can use here is to compare the length of the input against the same with all pipes stripped:
nchar(levels) - nchar(gsub("|", "", levels, fixed=TRUE))
[1] 2
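Another base-R route to the same count (just a sketch of an alternative) is to take the match positions from gregexpr() and count them:

```r
levels <- "Completely|Partially|Not at all"

# gregexpr() finds every match position; regmatches() pulls out the matches
length(regmatches(levels, gregexpr("\\|", levels))[[1]])
# [1] 2
```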
Addendum: Use strsplit:
unlist(strsplit(levels, "\\|"))
[1] "Completely" "Partially" "Not at all"
I have this string:
235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things
I want to split the string by the 6-digit numbers. I.e. - I want this:
235072,testing,some252f4,14084-things
224072,and,other2524,14084-thingies
223552,testing,some/2wr24,14084-things
How do I do this with regex? The following does not work (using stringr package):
> blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
> test <- str_split(blahblah, "([0-9]{6}.*)")
> test
[[1]]
[1] "" ""
What am I missing??
Here's an approach with base R using a lookbehind and a positive lookahead, with thanks to @thelatemail for the correction:
strsplit(blahblah, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
An alternative approach with str_extract_all. Note I've used .*? to do 'non-greedy' matching, otherwise .* expands to grab everything:
> str_extract_all(blahblah, "[0-9]{6}.*?(?=[0-9]{6}|$)")[[1]]
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
An easy-to-understand approach is to insert a marker and then split on the marker locations. This has the advantage of only looking for 6-digit sequences, without relying on any other features of the surrounding text, which may change as you add new and unvetted data.
library(stringr)
library(magrittr)
str <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
out <-
str_replace_all(str, "(\\d{6})", "#SPLIT_HERE#\\1") %>%
str_split("#SPLIT_HERE#") %>%
unlist()
out
[1] ""                                      "235072,testing,some252f4,14084-things"
[3] "224072,and,other2524,14084-thingies"   "223552,testing,some/2wr24,14084-things"
If your match occurs at the start or end of a string, str_split() will insert empty strings in the results vector to indicate that (as it did above). If you don't need that information, you can easily remove them:
out[nchar(out) != 0]
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies"
[3] "223552,testing,some/2wr24,14084-things"
With less complex regex, you can do as following:
s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
l <- str_locate_all(string = s, "[0-9]{6}")
str_sub(string = s, start = as.data.frame(l)$start,
end = c(tail(as.data.frame(l)$start, -1) - 1, nchar(s)) )
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
I want to combine the following commands using AND operator:
grep("^ab", strings, value = TRUE)
grep("ab$", strings, value = TRUE)
Here is an example for OR operator
http://r.789695.n4.nabble.com/grep-for-multiple-pattern-td4685244.html#a4685247
Would you please advise?
The search for an AND operator in regex (whether in R or elsewhere) can be a long and sad search. The boolean AND means that both of two statements have to be true. How would you apply that to regex? Consider the regex pattern "ab", in grep("ab", strings). Even this simple pattern has several requirements, ALL of which have to be true. It has to have an "a", AND it has to have a "b", AND the "b" has to follow the "a" directly.
strings <- c("abraham, not ahab", "no it was ahab",
"abraham was the one they left on ceti alpha V",
"You're talking about Sherlock Holmes", "He tasks me", "ab")
grep("ab", strings, value = TRUE)
# [1] "abraham, not ahab"
# [2] "no it was ahab"
# [3] "abraham was the one they left on ceti alpha V"
# [4] "You're talking about Sherlock Holmes"
# [5] "ab"
If what you'd like is to match strings that BOTH start with "ab" AND end with "ab", then @r2evans' pattern will work for you: grep("^ab.*ab$", strings, value = TRUE) will show them to you. This means the string starts with "ab", has zero or more other characters, and then ends with "ab".
grep("^ab.*ab$", strings, value = TRUE)
# [1] "abraham, not ahab"
# NOTICE THAT THIS DOESN'T MATCH "ab", despite "ab" being at the beginning
# AND the end
If what you'd like is to match all the strings that start with an "a" immediately followed by a "b", AND ALSO all those that end with an "a" immediately followed by a "b", then you actually want grep("^ab|ab$", strings, value = TRUE):
grep("^ab|ab$", strings, value = TRUE)
# [1] "abraham, not ahab"
# [2] "no it was ahab"
# [3] "abraham was the one they left on ceti alpha V"
# [4] "ab"
So what about that solitary "ab" case? What regex pattern would match that and only that?
grep("^ab$", strings, value = TRUE)
# [1] "ab"
In this case, we wanted all of the matches to BOTH start AND end with "ab", but it had to be the same "ab". Of course, we could combine this with the other "AND" version, and get all of the matches where ab was at the start and ab was at the end:
grep("^ab$|^ab.*ab$", strings, value = TRUE)
# [1] "abraham, not ahab" "ab"
...and one more thing:
We can use @r2evans' comment to demonstrate a sort of De Morgan's law with regex. Notice that the pattern with the | metacharacter produces the same thing you get by subsetting the strings object with the logical vector produced by combining both individual regex matches with a boolean AND:
strings[grepl("^ab", strings) & grepl("ab$", strings)]
# [1] "abraham, not ahab" "ab"
Here grepl returns a logical vector, and we use it twice. The first is TRUE for every element of strings that matches "^ab", and the second for every element that matches "ab$". Combining those logical vectors with an & operator produces the same thing as a pattern with a | metacharacter.
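The grepl-and-& idea generalises to any number of patterns. As a sketch (match_all is a hypothetical helper, not part of any package), combine one logical vector per pattern with Reduce():

```r
strings <- c("abraham, not ahab", "no it was ahab",
             "abraham was the one they left on ceti alpha V",
             "You're talking about Sherlock Holmes", "He tasks me", "ab")

# Keep only the elements that match ALL of the patterns (boolean AND)
match_all <- function(patterns, x) {
  x[Reduce(`&`, lapply(patterns, grepl, x = x))]
}

match_all(c("^ab", "ab$"), strings)
# [1] "abraham, not ahab" "ab"
```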
You may use
grep("^ab(.*ab)?$", strings, value = TRUE)
The pattern matches a string that starts with ab and then has an optional substring ending with ab and then end of string should follow:
^ - start of string
ab - an ab substring
(.*ab)? - 1 or 0 repetitions (due to ? quantifier) of
.* - any 0+ chars, as many as possible
ab - an ab substring
$ - end of string.
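Run against the example strings used earlier in this thread, the pattern keeps exactly the strings that both start and end with ab, including the solitary "ab":

```r
strings <- c("abraham, not ahab", "no it was ahab",
             "abraham was the one they left on ceti alpha V",
             "You're talking about Sherlock Holmes", "He tasks me", "ab")

# (.*ab)? makes the trailing ab optional, so the lone "ab" matches too
grep("^ab(.*ab)?$", strings, value = TRUE)
# [1] "abraham, not ahab" "ab"
```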
I have a data frame with strings that I'd like to remove stop words from. I'm trying to avoid using the tm package as it's a large data set and tm seems to run a bit slowly. I am using the tm stopword dictionary.
library(plyr)
library(tm)
stopWords <- stopwords("en")
class(stopWords)
df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."
head(df1)
df1$string1 <- tolower(df1$string1)
str1 <- strsplit(df1$string1[5], " ")
> !(str1 %in% stopWords)
[1] TRUE
This is not the answer I'm looking for. I'm trying to get a vector or string of the words NOT in the stopWords vector.
What am I doing wrong?
You are not accessing the list properly and you're not getting the elements back from the result of %in% (which gives a logical vector of TRUE/FALSE). You should do something like this:
unlist(str1)[!(unlist(str1) %in% stopWords)]
(or)
str1[[1]][!(str1[[1]] %in% stopWords)]
For the whole data.frame df1, you could do something like:
'%nin%' <- Negate('%in%')
lapply(df1[,2], function(x) {
t <- unlist(strsplit(x, " "))
t[t %nin% stopWords]
})
# [[1]]
# [1] "string" "string."
#
# [[2]]
# [1] "string" "slightly" "string."
#
# [[3]]
# [1] "string" "string."
#
# [[4]]
# [1] "string" "slightly" "shorter" "string."
#
# [[5]]
# [1] "string" "string" "strings."
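Since the stop-word list already comes from tm, it may also be worth knowing that tm itself ships removeWords(), which drops the words in one vectorised call. Note it leaves doubled spaces where words were removed, which you can squeeze out with gsub:

```r
library(tm)

stopWords <- stopwords("en")
cleaned <- removeWords("this string is the longest string of all the other strings.",
                       stopWords)

# Collapse the runs of spaces left behind where words were removed
gsub("\\s+", " ", cleaned)
```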
First, you should unlist str1 (or use lapply), since strsplit() returns a list:
!(unlist(str1) %in% stopWords)
#> [1] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
Second, a complete solution:
string <- c("This string is a string.",
"This string is a slightly longer string.",
"This string is an even longer string.",
"This string is a slightly shorter string.",
"This string is the longest string of all the other strings.")
rm_words <- function(string, words) {
stopifnot(is.character(string), is.character(words))
splitted <- strsplit(string, " ", fixed = TRUE) # fixed = TRUE for speedup
vapply(splitted, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1))
}
rm_words(string, tm::stopwords("en"))
#> [1] "string string." "string slightly longer string." "string even longer string."
#> [4] "string slightly shorter string." "string longest string strings."
Came across this question when I was working on something similar.
Though it has been answered already, here is a concise line of code which I used for my own problem; it eliminates all the stop words directly in your data frame:
df1$string1 <- unlist(lapply(df1$string1, function(x) {paste(unlist(strsplit(x, " "))[!(unlist(strsplit(x, " ")) %in% stopWords)], collapse=" ")}))
Is there a way to split camel case strings in R?
I have attempted:
string.to.split = "thisIsSomeCamelCase"
unlist(strsplit(string.to.split, split="[A-Z]") )
# [1] "this" "s" "ome" "amel" "ase"
string.to.split = "thisIsSomeCamelCase"
gsub("([A-Z])", " \\1", string.to.split)
# [1] "this Is Some Camel Case"
strsplit(gsub("([A-Z])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Looking at Ramnath's answer and mine, I can say my initial impression that this was an under-specified question has been borne out.
And give Tommy and Ramnath upvotes for pointing out [:upper:]:
strsplit(gsub("([[:upper:]])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Here is one way to do it
split_camelcase <- function(...){
strings <- unlist(list(...))
strings <- gsub("^[^[:alnum:]]+|[^[:alnum:]]+$", "", strings)
strings <- gsub("(?!^)(?=[[:upper:]])", " ", strings, perl = TRUE)
return(strsplit(tolower(strings), " ")[[1]])
}
split_camelcase("thisIsSomeGood")
# [1] "this" "is" "some" "good"
Here's an approach using a single regex (a Lookahead and Lookbehind):
strsplit(string.to.split, "(?<=[a-z])(?=[A-Z])", perl = TRUE)
## [[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
Here is a one-liner using the gsubfn package's strapply. The regular expression matches the beginning of the string (^) followed by one or more lower case letters ([[:lower:]]+) or (|) an upper case letter ([[:upper:]]) followed by zero or more lower case letters ([[:lower:]]*) and processes the matched strings with c (which concatenates the individual matches into a vector). As with strsplit it returns a list so we take the first component ([[1]]) :
library(gsubfn)
strapply(string.to.split, "^[[:lower:]]+|[[:upper:]][[:lower:]]*", c)[[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
I think my other answer is better than the following, but if you only need a one-liner to split... here we go:
library(snakecase)
unlist(strsplit(to_parsed_case(string.to.split), "_"))
#> [1] "this" "Is" "Some" "Camel" "Case"
The beginnings of an answer is to split all the characters:
sp.x <- strsplit(string.to.split, "")
Then find which string positions are upper case:
ind.x <- lapply(sp.x, function(x) which(!tolower(x) == x))
Then use that to split out each run of characters...
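One hedged way to finish that sketch (assuming a split should happen at every upper-case letter): prepend position 1, then take substrings between consecutive upper-case positions.

```r
string.to.split <- "thisIsSomeCamelCase"

chars  <- strsplit(string.to.split, "")[[1]]
starts <- unique(c(1, which(chars %in% LETTERS)))  # start of each run
ends   <- c(starts[-1] - 1, length(chars))         # end of each run
substring(string.to.split, starts, ends)
# [1] "this" "Is" "Some" "Camel" "Case"
```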
Here is an easy solution via snakecase plus some tidyverse helpers:
install.packages("snakecase")
library(snakecase)
library(magrittr)
library(stringr)
library(purrr)
string.to.split = "thisIsSomeCamelCase"
to_parsed_case(string.to.split) %>%
str_split(pattern = "_") %>%
purrr::flatten_chr()
#> [1] "this" "Is" "Some" "Camel" "Case"
GitHub link to snakecase: https://github.com/Tazinho/snakecase