R: remove stopwords from a character vector using %in%

I have a data frame with strings that I'd like to remove stop words from. I'm trying to avoid using the tm package as it's a large data set and tm seems to run a bit slowly. I am using the tm stopword dictionary.
library(plyr)
library(tm)
stopWords <- stopwords("en")
class(stopWords)
df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."
head(df1)
df1$string1 <- tolower(df1$string1)
str1 <- strsplit(df1$string1[5], " ")
> !(str1 %in% stopWords)
[1] TRUE
This is not the answer I'm looking for. I'm trying to get a vector or string of the words NOT in the stopWords vector.
What am I doing wrong?

You are not accessing the list properly, and %in% does not return the elements themselves, only a logical vector of TRUE/FALSE. You need to use that logical vector to subset the words, for example:
unlist(str1)[!(unlist(str1) %in% stopWords)]
(or)
str1[[1]][!(str1[[1]] %in% stopWords)]
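To see what is going on, note that %in% on its own only answers, word by word, whether each element is a stop word; the subsetting step is what actually drops them. A small illustration (words5 and keep are just throwaway names):
words5 <- str1[[1]]                 # character vector of words from the fifth string
keep <- !(words5 %in% stopWords)    # logical vector, one TRUE/FALSE per word
words5[keep]                        # only the words that are not stop words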
For the whole data.frame df1, you could do something like:
'%nin%' <- Negate('%in%')
lapply(df1[,2], function(x) {
  t <- unlist(strsplit(x, " "))
  t[t %nin% stopWords]
})
# [[1]]
# [1] "string" "string."
#
# [[2]]
# [1] "string" "slightly" "string."
#
# [[3]]
# [1] "string" "string."
#
# [[4]]
# [1] "string" "slightly" "shorter" "string."
#
# [[5]]
# [1] "string" "string" "strings."

First, you need to unlist str1 (or use lapply), because strsplit returns a list:
!(unlist(str1) %in% stopWords)
#> [1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
Second, a complete solution:
string <- c("This string is a string.",
"This string is a slightly longer string.",
"This string is an even longer string.",
"This string is a slightly shorter string.",
"This string is the longest string of all the other strings.")
rm_words <- function(string, words) {
  stopifnot(is.character(string), is.character(words))
  splitted <- strsplit(string, " ", fixed = TRUE) # fixed = TRUE for speedup
  vapply(splitted, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1))
}
rm_words(string, tm::stopwords("en"))
#> [1] "string string." "string slightly longer string." "string even longer string."
#> [4] "string slightly shorter string." "string longest string strings."

I came across this question while working on something similar.
Although it has already been answered, here is the concise line of code I used for my own problem; it removes all the stop words directly in your data frame:
df1$string1 <- unlist(lapply(df1$string1, function(x) {paste(unlist(strsplit(x, " "))[!(unlist(strsplit(x, " ")) %in% stopWords)], collapse=" ")}))

Related

Extract matching words from strings in order

If I have two strings that look like this:
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
Is there an easy way to compare the words from left to right, build a new string from the matching words, and stop as soon as the words no longer match, so that the output would look like:
> "Here is a"
I don't want to find all matching words between the two strings, just the words that match in order. So "words and stuff." is in both strings, but I don't want that to be selected.
Split the strings, compute the minimum of the two split lengths, take that many words from the head of each split, and append a FALSE so that a non-match is guaranteed somewhere when the corresponding words are compared. Then use which.min to find the position of the first non-match, take that many words minus one from the first string, and paste them back together.
L <- strsplit(c(x, y), " +")
wx <- which.min(c(do.call(`==`, lapply(L, head, min(lengths(L)))), FALSE))
paste(head(L[[1]], wx - 1), collapse = " ")
## [1] "Here is a"
Another way is to work with the cumulative count of mismatching words. First split both strings and truncate them to the shorter length:
xvec <- strsplit(x, " +")[[1]]
yvec <- strsplit(y, " +")[[1]]
(len <- min(c(length(xvec), length(yvec))))
# [1] 8
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)))
list(xvec[1:i], yvec[1:i])
# [[1]]
# [1] "Here" "is" "a" "test" "of" "words" "and" "stuff."
# [[2]]
# [1] "Here" "is" "a" "better" "test" "of" "words" "and"
cumsum(head(xvec, len) != head(yvec, len))
# [1] 0 0 0 1 2 3 4 5
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)) > 0)
list(xvec[1:(i-1)], yvec[1:(i-1)])
# [[1]]
# [1] "Here" "is" "a"
# [[2]]
# [1] "Here" "is" "a"
From here, we can easily derive the leading string:
paste(xvec[1:(i-1)], collapse = " ")
# [1] "Here is a"
and the remaining strings with
paste(xvec[-(1:(i-1))], collapse = " ")
# [1] "test of words and stuff."
I wrote a function which will check the string and return the desired output:
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
z <- "This string doesn't match"
library(purrr)
check_str <- function(inp, pat, delimiter = "\\s") {
  inp <- unlist(strsplit(inp, delimiter))
  pat <- unlist(strsplit(pat, delimiter))
  ln_diff <- length(inp) - length(pat)
  if (ln_diff < 0) {
    inp <- append(inp, rep("", abs(ln_diff)))
  }
  if (ln_diff > 0) {
    pat <- append(pat, rep("", abs(ln_diff)))
  }
  idx <- map2_lgl(inp, pat, ~ identical(.x, .y))
  rle_idx <- rle(idx)
  if (rle_idx$values[1]) {
    idx2 <- seq_len(rle_idx$lengths[1])
  } else {
    idx2 <- 0
  }
  paste0(inp[idx2], collapse = delimiter)
}
check_str(x, y, " ")
#> [1] "Here is a"
check_str(x, z, " ")
#> [1] ""
Created on 2023-02-13 with reprex v2.0.2
You could write a helper function to do the check for you
common_start <- function(x, y) {
  i <- 1
  last <- NA
  while (i <= nchar(x) && i <= nchar(y)) {
    if (substr(x, i, i) == substr(y, i, i)) {
      if (grepl("[[:space:][:punct:]]", substr(x, i, i), perl = TRUE)) {
        last <- i
      }
    } else {
      break
    }
    i <- i + 1
  }
  if (!is.na(last)) {
    substr(x, 1, last - 1)
  } else {
    NA
  }
}
and use it with your sample strings:
common_start(x,y)
# [1] "Here is a"
The idea is to check every character, keeping track of the last non-word character at which the strings still match. Using a while loop may not be fancy, but it means you can break out as soon as a mismatch is found instead of processing the whole string.
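For example (a sketch), a word that only partially matches is not included, because last only ever points at a matching space or punctuation character:
common_start("Here is a test", "Here is a testing")
# [1] "Here is a"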

Split vector, categorized by regex

I am searching for a method to split a character vector based on a RegEx pattern.
Example of input:
input <- c("a_foo","b_foo", "c_bar", "d_bar")
split_by <- c("foo", "bar")
The result I am searching for:
$foo
[1] "a_foo" "b_foo"
$bar
[1] "c_bar" "d_bar"
EDIT
Based on the comments and answers, some clarification is needed:
split_by can have any number of elements;
the RegEx pattern varies from case to case; and
an element in input may be assigned to 0 (no matches), 1, or multiple splits depending on the match.
Hence, the following input:
input <- c("foo_bar", "nothing", "a_foo", "c_bar")
split_by <- c("foo", "bar")
Could return:
$foo
[1] "foo_bar" "a_foo"
$bar
[1] "foo_bar" "c_bar"
In the real case, can you extract the split_by values from the input data?
This works for the example shared.
split(input, sub('.*_', '', input))
#$bar
#[1] "c_bar" "d_bar"
#$foo
#[1] "a_foo" "b_foo"
where
sub('.*_', '', input) #returns
#[1] "foo" "foo" "bar" "bar"
lapply(split_by, grep, x = input, value = TRUE)
# [[1]]
# [1] "a_foo" "b_foo"
#
# [[2]]
# [1] "c_bar" "d_bar"
To get named output you could do:
lapply(setNames(split_by, split_by), grep, x = input, value = TRUE)
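As a quick check against the question's second example (a sketch; input2 is just that input under another name), the same call also handles multiple matches, since grep matches anywhere in the string:
input2 <- c("foo_bar", "nothing", "a_foo", "c_bar")
lapply(setNames(split_by, split_by), grep, x = input2, value = TRUE)
# $foo
# [1] "foo_bar" "a_foo"
# $bar
# [1] "foo_bar" "c_bar"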
split.regex <- function(input, split_by, pattern, add_names = TRUE) {
  out <- lapply(split_by, function(x) {
    input[grepl(sprintf(pattern, x), input)]
  })
  if (add_names) {
    names(out) <- split_by
  }
  return(out)
}
First, the pattern must be defined. Since foo and bar occur at the end of the string, sprintf("%s$", split_by) can be used. The sprintf call happens inside the function, so the pattern argument is simply the sprintf template "%s$".
First example
By defining input and split_by as in the question's first example, and then running:
split.regex(input=input, split_by=split_by, pattern="%s$", add_names=TRUE)
We get the desired result:
$foo
[1] "a_foo" "b_foo"
$bar
[1] "c_bar" "d_bar"
Second example
By defining input and split_by as in the question's second example, and then running:
split.regex(input=input, split_by=split_by, pattern="(%s)", add_names=TRUE)
We get the desired result:
$foo
[1] "foo_bar" "a_foo"
$bar
[1] "foo_bar" "c_bar"
Since "nothing" didn't match either pattern, it was correctly excluded from the split, whereas "foo_bar" matched both patterns and so was correctly added to both splits.

Locate position of first number in string [R]

How can I create a function in R that locates the word position of the first number in a string?
For example:
string1 <- "Hello I'd like to extract where the first 1010 is in this string"
#desired_output for string1
9
string2 <- "80111 is in this string"
#desired_output for string2
1
string3 <- "extract where the first 97865 is in this string"
#desired_output for string3
5
I would just use grep and strsplit here for a base R option:
sapply(input, function(x) grep("\\d+", strsplit(x, " ")[[1]]))
Hello I'd like to extract where the first 1010 is in this string
9
80111 is in this string
1
extract where the first 97865 is in this string
5
Data:
input <- c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string")
Here is a way to return your desired output:
library(stringr)
min(which(!is.na(suppressWarnings(as.numeric(str_split(string, " ", simplify = TRUE))))))
This is how it works:
str_split(string, " ", simplify = TRUE) # converts your string to a vector/matrix, splitting at space
as.numeric(...) # tries to convert each element to a number, returning NA when it fails
suppressWarnings(...) # suppresses the warnings generated by as.numeric
!is.na(...) # returns true for the values that are not NA (i.e. the numbers)
which(...) # returns the position for each TRUE values
min(...) # returns the first position
The output:
min(which(!is.na(suppressWarnings(as.numeric(str_split(string1, " ", simplify = TRUE))))))
[1] 9
min(which(!is.na(suppressWarnings(as.numeric(str_split(string2, " ", simplify = TRUE))))))
[1] 1
min(which(!is.na(suppressWarnings(as.numeric(str_split(string3, " ", simplify = TRUE))))))
[1] 5
Here I'll leave a fully tidyverse approach:
library(purrr)
library(stringr)
map_dbl(str_split(strings, " "), str_which, "\\d+")
#> [1] 9 1 5
map_dbl(str_split(strings[1], " "), str_which, "\\d+")
#> [1] 9
Note that it works both with one and multiple strings.
Where strings is:
strings <- c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string")
Here is another approach. We can trim off the remaining characters after the first digit of the first number. Then, just find the position of the last word. \\b matches word boundaries while \\S+ matches one or more non-whitespace characters.
first_numeric_word <- function(x) {
  x <- substr(x, 1L, regexpr("\\b\\d+\\b", x))
  lengths(gregexpr("\\b\\S+\\b", x))
}
Output
> first_numeric_word(x)
[1] 9 1 5
Data
x <- c(
"Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string",
"extract where the first 97865 is in this string"
)
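One edge case (a sketch; first_numeric_word_safe is just an illustrative name): for a string with no number, regexpr() returns -1, substr() then yields an empty string, and gregexpr() still reports a single (non-)match, so the function above returns 1 instead of signalling that nothing was found. A small guard fixes that:
first_numeric_word_safe <- function(x) {
  pos <- regexpr("\\b\\d+\\b", x)
  out <- lengths(gregexpr("\\b\\S+\\b", substr(x, 1L, pos)))
  out[pos == -1L] <- NA_integer_  # flag strings that contain no number
  out
}
first_numeric_word_safe(c(x, "no numbers here"))
# [1]  9  1  5 NA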
Here is a base solution that uses rapply() with grep() to recurse through the results of strsplit(); it works with a vector of strings.
Note: swap " " and fixed = TRUE with "\\s+" and fixed = FALSE (the default) if you want to split the strings on any whitespace instead of a literal space.
rapply(strsplit(strings, " ", fixed = TRUE), function(x) grep("[0-9]+", x))
[1] 9 1 5
Data:
strings = c("Hello I'd like to extract where the first 1010 is in this string",
"80111 is in this string", "extract where the first 97865 is in this string")
Try the following:
library(stringr)
position_first_number <- function(string) {
  min(which(str_detect(str_split(string, "\\s+", simplify = TRUE), "[0-9]+")))
}
With your example strings:
> string1 <- "Hello I'd like to extract where the first 1010 is in this string"
> position_first_number(string1)
[1] 9
> string2 <- "80111 is in this string"
> position_first_number(string2)
[1] 1
> string3 <- "extract where the first 97865 is in this string"
> position_first_number(string3)
[1] 5

Replace string in R with patterns and replacements both vectors

Let's say I have two vectors like so:
a <- c("this", "is", "test")
b <- c("that", "was", "boy")
I also have a string variable like so:
string <- "this is a story about a test"
I want to replace values in string so that it becomes the following:
string <- "that was a story about a boy"
I could do this using a for loop but I want this to be vectorized. How should I do this?
If you're open to using a non-base package, stringi will work really well here:
stringi::stri_replace_all_fixed(string, a, b, vectorize_all = FALSE)
#[1] "that was a story about a boy"
Note that this also works the same way for input strings of length > 1.
To be on the safe side, you can adapt this - similar to RUser's answer - to check for word boundaries before replacing:
stri_replace_all_regex(string, paste0("\\b", a, "\\b"), b, vectorize_all = FALSE)
This would ensure that you don't accidentally replace his with hwas, for example.
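To see the difference (a sketch using a string that contains "his"), compare the fixed and the word-boundary versions:
stringi::stri_replace_all_fixed("this is his story", a, b, vectorize_all = FALSE)
# [1] "that was hwas story"
stringi::stri_replace_all_regex("this is his story", paste0("\\b", a, "\\b"), b, vectorize_all = FALSE)
# [1] "that was his story"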
Here are some solutions. Each of them will work even if string is a character vector of several strings, in which case the substitutions are done on each component.
1) Reduce This uses no packages.
Reduce(function(x, i) gsub(paste0("\\b", a[i], "\\b"), b[i], x), seq_along(a), string)
## [1] "that was a story about a boy"
2) gsubfn gsubfn is like gsub but the replacement argument can be a list of substitutions (or certain other objects).
library(gsubfn)
gsubfn("\\w+", setNames(as.list(b), a), string)
## [1] "that was a story about a boy"
3) loop This isn't vectorized but has been added for comparison. No packages are used.
out <- string
for(i in seq_along(a)) out <- gsub(paste0("\\b", a[i], "\\b"), b[i], out)
out
## [1] "that was a story about a boy"
Note: There is some question of whether cycles are possible. For example, if
a <- c("a", "A")
b <- rev(a)
do we want
"a" to be replaced with "A" and then back to "a" again, or
"a" and "A" to be swapped.
All the solutions shown above assume the first case. If we want the second case, perform the operation twice. We illustrate with (2) because it is the shortest, but the same idea applies to them all:
# swap "a" and "A"
a <- c("a", "A")
b <- rev(a)
tmp <- gsubfn("\\w+", setNames(as.list(seq_along(a)), a), string)
gsubfn("\\w+", setNames(as.list(b), seq_along(a)), tmp)
## [1] "this is A story about A test"
> library(stringi)
> stri_replace_all_regex(string, "\\b" %s+% a %s+% "\\b", b, vectorize_all=FALSE)
#[1] "that was a story about a boy"
Chipping in as well with a little function that relies only on R base:
repWords <- function(string, toRep, Rep, sep = '\\s') {
  wrds <- unlist(strsplit(string, sep))
  ix <- match(toRep, wrds)
  wrds[ix] <- Rep
  return(paste0(wrds, collapse = ' '))
}
a <- c("this", "is", "test")
b <- c("that", "was", "boy")
string <- "this is a story about a test"
> repWords(string,a,b)
[1] "that was a story about a boy"
Note:
This assumes you have a matching number of replacements. You can define the separator with sep.
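One more caveat (a sketch; repWords2 is just an illustrative name): match() returns NA for any toRep word that is absent from the string, and an NA index in the assignment throws an error, so a guard is needed when the patterns are not guaranteed to be present:
repWords2 <- function(string, toRep, Rep, sep = '\\s') {
  wrds <- unlist(strsplit(string, sep))
  ix <- match(toRep, wrds)
  ok <- !is.na(ix)                  # keep only the patterns actually found
  wrds[ix[ok]] <- Rep[ok]
  paste0(wrds, collapse = ' ')
}
repWords2("this is a story", a, b)
# [1] "that was a story"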
Talking of external packages, here's another one:
a <- c("this", "is", "test")
b <- c("that", "was", "boy")
x <- "this is a story about a test"
library(qdap)
mgsub(a,b,x)
which gives:
"that was a story about a boy"

R: Using paste depending on the number of words in a string within a function

I have a list where each component is a character vector of length 1, containing one or more words separated by spaces (the original list is much larger):
> f <- list("one", "two three", "four", "five six seven")
> f
[[1]]
[1] "one"
[[2]]
[1] "two three"
[[3]]
[1] "four"
[[4]]
[1] "five six seven"
What I need to do is paste text before and after the string in each component, depending on whether it contains one word or more. The result I'm looking for is something like this:
[[1]]
[1] "Single number: one."
[[2]]
[1] "Multiple numbers: two three."
[[3]]
[1] "Single number: four."
[[4]]
[1] "Multiple numbers: five six seven."
I tried the following, counting the number of words in each string with str_count from the stringr package:
x <- lapply(f, function(j) {
if(str_count(string = f[[j]], pattern = "\\S+") == 1) {
xx[[j]] <- paste("Single number: ", f[[j]], ".", sep = "")
} else {
xx[[j]] <- paste("Multiple numbers: ", f[[j]], ".", sep = "")
}
})
However, I get the following error:
Error in if (str_count(string = f[[j]], pattern = "\\S+") == 1) { :
argument is of length zero
Can someone help?
f[[j]] only works when you index into the list, i.e. lapply(seq_along(f), ...); here the loop is over f itself, so j is already the element. Just use str_count(j, ...):
library(stringr)
lapply(f, function(j) {
  if (str_count(j, '\\S+') > 1) {
    paste("Multiple numbers: ", j, '.', sep = "")
  } else paste("Single number: ", j, ".", sep = "")
})
#[[1]]
#[1] "Single number: one."
#[[2]]
#[1] "Multiple numbers: two three."
#[[3]]
#[1] "Single number: four."
#[[4]]
#[1] "Multiple numbers: five six seven."
NOTE: This could be done without using any external packages too.
You can take advantage of R’s vectorisation to simplify this; however, this requires using a vector as an input instead of a list — which is OK in your example:
f = unlist(f)
prefix = ifelse(str_count(f, '\\S+') > 1, 'Multiple numbers: ', 'Single number: ')
paste0(prefix, f, '.')
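Putting it together end to end (a sketch, assuming stringr is loaded; fv just avoids overwriting f) and wrapping the result back into a list, since the input was a list:
library(stringr)
f <- list("one", "two three", "four", "five six seven")
fv <- unlist(f)
prefix <- ifelse(str_count(fv, '\\S+') > 1, 'Multiple numbers: ', 'Single number: ')
as.list(paste0(prefix, fv, '.'))  # gives the list shown in the question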
Given a string, the function prefix produces either "Multiple numbers:" or "Single number:". lapply it over every component of f and then use Map to paste the corresponding prefixes and f components together. No packages are used:
prefix <- function(x) if (any(grepl(" ", x))) "Multiple numbers:" else "Single number:"
Map(paste, lapply(f, prefix), f)
giving:
[[1]]
[1] "Single number: one"
[[2]]
[1] "Multiple numbers: two three"
[[3]]
[1] "Single number: four"
[[4]]
[1] "Multiple numbers: five six seven"
The last line could alternately be written like this:
as.list(paste(sapply(f, prefix), f))
and if it's not important that the result be a list, then as.list could be omitted.
