Extract matching words from strings in order - r

If I have two strings that look like this:
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
Is there an easy way to check the words from left to right and create a new string of matching words and then stop when the words no longer match so the output would look like:
> "Here is a"
I don't want to find all matching words between the two strings, just the words that match in order. So "words and stuff." is in both strings, but I don't want that to be selected.

Split the strings, compute the minimum of the length of the two splits, take that number of words from the head of each and append a FALSE to ensure a non-match can occur when matching the corresponding words. Then use which.min to find the first non-match and take that number minus 1 of the words and paste back together.
L <- strsplit(c(x, y), " +")
wx <- which.min(c(do.call(`==`, lapply(L, head, min(lengths(L)))), FALSE))
paste(head(L[[1]], wx - 1), collapse = " ")
## [1] "Here is a"

Split both strings into words and compare them position by position. A first attempt with which.max on the running mismatch count:
xvec <- strsplit(x, " +")[[1]]
yvec <- strsplit(y, " +")[[1]]
(len <- min(c(length(xvec), length(yvec))))
# [1] 8
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)))
list(xvec[1:i], yvec[1:i])
# [[1]]
# [1] "Here" "is" "a" "test" "of" "words" "and" "stuff."
# [[2]]
# [1] "Here" "is" "a" "better" "test" "of" "words" "and"
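That selects all 8 positions because which.max picks the largest running mismatch count. Testing the count for > 0 instead locates the first mismatch:
cumsum(head(xvec, len) != head(yvec, len))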
# [1] 0 0 0 1 2 3 4 5
i <- which.max(cumsum(head(xvec, len) != head(yvec, len)) > 0)
list(xvec[1:(i-1)], yvec[1:(i-1)])
# [[1]]
# [1] "Here" "is" "a"
# [[2]]
# [1] "Here" "is" "a"
From here, we can easily derive the leading string:
paste(xvec[1:(i-1)], collapse = " ")
# [1] "Here is a"
and the remaining strings with
paste(xvec[-(1:(i-1))], collapse = " ")
# [1] "test of words and stuff."
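One caveat with indexing by 1:(i-1): when the very first words differ, or when there is no mismatch at all, which.max returns 1 and 1:0 counts down to c(1, 0), so the first word is returned regardless. A minimal sketch that sidesteps this with seq_len follows; the function name common_prefix_words is made up for illustration and is not part of the answers above.

```r
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."

# Hypothetical helper: longest run of leading words shared by two strings
common_prefix_words <- function(a, b) {
  aw <- strsplit(a, " +")[[1]]
  bw <- strsplit(b, " +")[[1]]
  n <- min(length(aw), length(bw))
  same <- aw[seq_len(n)] == bw[seq_len(n)]
  # cumprod drops to 0 at the first mismatch, so the sum counts leading matches
  k <- sum(cumprod(same))
  # seq_len(0) is empty, so no words are selected when nothing matches
  paste(aw[seq_len(k)], collapse = " ")
}

common_prefix_words(x, y)
# [1] "Here is a"
common_prefix_words(x, "This string doesn't match")
# [1] ""
```

Because seq_len(0) is an empty index, the no-match case yields an empty string, and two identical strings yield the whole string.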

I wrote a function which will check the string and return the desired output:
x <- "Here is a test of words and stuff."
y <- "Here is a better test of words and stuff."
z <- "This string doesn't match"
library(purrr)
# delimiter is used both to split and to re-join, so default to a literal space rather than a regex
check_str <- function(inp, pat, delimiter = " ") {
inp <- unlist(strsplit(inp, delimiter))
pat <- unlist(strsplit(pat, delimiter))
ln_diff <- length(inp) - length(pat)
if (ln_diff < 0) {
inp <- append(inp, rep("", abs(ln_diff)))
}
if (ln_diff > 0) {
pat <- append(pat, rep("", abs(ln_diff)))
}
idx <- map2_lgl(inp, pat, ~ identical(.x, .y))
rle_idx <- rle(idx)
if (rle_idx$values[1]) {
idx2 <- seq_len(rle_idx$length[1])
} else {
idx2 <- 0
}
paste0(inp[idx2], collapse = delimiter)
}
check_str(x, y, " ")
#> [1] "Here is a"
check_str(x, z, " ")
#> [1] ""
Created on 2023-02-13 with reprex v2.0.2

You could write a helper function to do the check for you
common_start<-function(x, y) {
i <- 1
last <- NA
while (i <= nchar(x) && i <= nchar(y)) {
if (substr(x,i,i) == substr(y,i,i)) {
if (grepl("[[:space:][:punct:]]", substr(x,i,i), perl=T)) {
last <- i
}
} else {
break;
}
i <- i + 1
}
if (!is.na(last)) {
substr(x, 1, last-1)
} else {
NA
}
}
and use that with your sample strings
common_start(x,y)
# [1] "Here is a"
The idea is to check every character, keeping track of the last non-word character that still matches. Using a while loop may not be fancy but it does mean you get to break early without processing the whole string as soon as a mismatch is found.

Related

strsplit(rquote, split = "")[[1]] in R

rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
This question has been asked before on this forum and has one answer on it but I couldn't understand anything from that answer, so here I am asking this question again.
In the above code what is the meaning of [[1]] ?
The program that I'm trying to run:
rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
rcount <- 0
for (char in chars) {
if (char == "r") {
rcount <- rcount + 1
}
if (char == "u") {
break
}
}
print(rcount)
When I don't use [[1]] I get the following warning message in for loop and I get a wrong output of 1 for rcount instead of 5:
Warning message: the condition has length > 1 and only the first element will be used
strsplit is vectorized: it splits each element of its input vector into a vector of pieces. To hold this vector of vectors it returns a list in which each slot (indexed by [[) corresponds to one element of the input vector.
If you use the function on a one-element vector (a single string, as you do), you get a one-slot list. Using [[1]] right after strsplit() selects the first slot of the list, which is the character vector you expected.
Unfortunately, without [[1]] your list chars still works in a for loop: you get a single iteration over its one slot. Inside the if you then compare the whole vector of letters against "r", which throws the warning. Since the first element of the comparison is TRUE, the condition holds and rcount is incremented by 1, which is your result. Since you are iterating over the one-slot list rather than over the letters, the loop ends after that single iteration.
Maybe if you run something like strsplit(c("one", "two"), split=""), the outcome will be clearer.
> strsplit(c("one", "two"), split="")
[[1]]
[1] "o" "n" "e"
[[2]]
[1] "t" "w" "o"
> strsplit(c("one", "two"), split="")[[1]]
[1] "o" "n" "e"
> strsplit(c("one"), split="")[[1]][2]
[1] "n"
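As a small sketch of the same point, the snippet below inspects what strsplit returns for the question's string and then counts the "r"s before the first "u" without a loop. It assumes the string contains at least one "u"; otherwise match() would return NA.

```r
rquote <- "r's internals are irrefutably intriguing"

res <- strsplit(rquote, split = "")
is.list(res)   # TRUE: one slot per input string
length(res)    # 1

chars <- res[[1]]              # the character vector inside the first slot
first_u <- match("u", chars)   # position of the first "u" (assumed to exist)
r_before_u <- sum(chars[seq_len(first_u - 1)] == "r")
r_before_u
# [1] 5
```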
We'll start with the below as data, without [[1]]:
rquote <- "r's internals are irrefutably intriguing"
chars2 <- strsplit(rquote, split = "")
class(chars2)
[1] "list"
It is always good to have an estimate of your return value, your above '5'. We have both length and lengths.
length(chars2)
[1] 1 # our list
lengths(chars2)
[1] 40 # elements within our list
We'll use lengths in our for loop for counter, and, as you did, establish a receiver vector outside the loop,
rcount2 <- 0
for (i in 1:lengths(chars2)) {
if (chars2[[1]][i] == 'r') {
rcount2 <- rcount2 +1
}
if (chars2[[1]][i] == 'u') {
break
}
}
print(rcount2)
[1] 6
length(which(chars2[[1]] == 'r')) # as a check, and another way to estimate
[1] 6
Now supposing, rather than list, we have a character vector:
chars1 <- strsplit(rquote, split = '')[[1]]
length(chars1)
[1] 40
rcount1 <- 0
for(i in 1:length(chars1)) {
if(chars1[i] == 'r') {
rcount1 <- rcount1 +1
}
if (chars1[i] == 'u') {
break
}
}
print(rcount1)
[1] 5
length(which(chars1 == 'r'))
[1] 6
Hey, there's your '5'. So why did the list version seem to give 6?
all.equal(chars1, unlist(chars2))
[1] TRUE
The data are identical, and the break should stop both loops at the first 'u', after 5 'r's. Whether the letters sit in a list slot or in a plain vector does not matter, so the final 'r' cannot make it into rcount2.
And indeed, re-running the list version gives 5 as well; the 6 printed for rcount2 above was a mistake on my part. As a final note, if you want to watch this happen, put browser() inside your for loop and step through it:
Browse[1]> i
[1] 24
Browse[1]> n
debug at #7: break
Browse[1]> chars2[[1]][i] == 'u'
[1] TRUE
Browse[1]> n
> rcount2
[1] 5

R: Using paste depending on the number of words in a string within a function

I have a list where each list component has one string vector. Each string vector is of length 1 and contains one or more words separated with spaces (the original list is much larger):
> f <- list("one", "two three", "four", "five six seven")
> f
[[1]]
[1] "one"
[[2]]
[1] "two three"
[[3]]
[1] "four"
[[4]]
[1] "five six seven"
What I need to do is to paste strings before and after the string in each component depending on whether it contains one or more words. The result I look for is something like this:
[[1]]
[1] "Single number: one."
[[2]]
[1] "Multiple numbers: two three."
[[3]]
[1] "Single number: four."
[[4]]
[1] "Multiple numbers: five six seven."
I tried the following, counting the number of words in each string with str_count from the stringr package:
x <- lapply(f, function(j) {
if(str_count(string = f[[j]], pattern = "\\S+") == 1) {
xx[[j]] <- paste("Single number: ", f[[j]], ".", sep = "")
} else {
xx[[j]] <- paste("Multiple numbers: ", f[[j]], ".", sep = "")
}
})
However, I get the following error:
Error in if (str_count(string = f[[j]], pattern = "\\S+") == 1) { :
argument is of length zero
Can someone help?
f[[j]] would be the right way to index if we were looping over the positions of the list, i.e. lapply(seq_along(f), ...), but here we are looping over f itself, so j is already the string. Just use str_count(j, ...):
library(stringr)
lapply(f, function(j) {
if(str_count(j, '\\S+') >1) {
paste("Multiple numbers: ", j, '.', sep="")
} else paste("Single number: ", j, ".", sep="")
})
#[[1]]
#[1] "Single number: one."
#[[2]]
#[1] "Multiple numbers: two three."
#[[3]]
#[1] "Single number: four."
#[[4]]
#[1] "Multiple numbers: five six seven."
NOTE: This could be done without using any external packages too.
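A minimal base R sketch of that idea, counting words with strsplit instead of str_count. It assumes words are separated by whitespace, and the result name res is only for illustration.

```r
f <- list("one", "two three", "four", "five six seven")

res <- lapply(f, function(s) {
  # number of whitespace-separated words in the string
  n_words <- lengths(strsplit(trimws(s), "\\s+"))
  prefix <- if (n_words > 1) "Multiple numbers: " else "Single number: "
  paste0(prefix, s, ".")
})

res[[2]]
# [1] "Multiple numbers: two three."
```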
You can take advantage of R’s vectorisation to simplify this; however, this requires using a vector as an input instead of a list — which is OK in your example:
f = unlist(f)
prefix = ifelse(str_count(f, '\\S+') > 1, 'Multiple numbers: ', 'Single number: ')
paste0(prefix, f, '.')
Given a string, the function prefix produces either "Multiple numbers:" or "Single number:". lapply it to every component of f and then use Map to paste the corresponding prefixes and f components together. No packages are used:
prefix <- function(x) if (any(grepl(" ", x))) "Multiple numbers:" else "Single number:"
Map(paste, lapply(f, prefix), f)
giving:
[[1]]
[1] "Single number: one"
[[2]]
[1] "Multiple numbers: two three"
[[3]]
[1] "Single number: four"
[[4]]
[1] "Multiple numbers: five six seven"
The last line could alternately be written like this:
as.list(paste(sapply(f, prefix), f))
and if it's not important that the result be a list then as.list could be omitted.

How to look for a certain part in a string and only keep that part

What is the cleanest way of finding for example the string ": [1-9]*" and only keeping that part?
You can work with regexec to get the starting points, but isn't there a cleaner way to get the value directly?
For example:
test <- c("surface area: 458", "bedrooms: 1", "whatever")
regexec(": [1-9]*", test)
How do I get immediately just
c(": 458",": 1", NA )
You can use base R which handles this just fine.
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> r <- regmatches(x, gregexpr(':.*', x))
> unlist({r[sapply(r, length)==0] <- NA; r})
# [1] ": 458" ": 1" NA
Although, I find it much simpler to just do...
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> sapply(strsplit(x, '\\b(?=:)', perl=T), '[', 2)
# [1] ": 458" ": 1" NA
library(stringr)
str_extract(test, ":.*")
#[1] ": 458" ": 1" NA
Or for a faster approach stringi
library(stringi)
stri_extract_first_regex(test, ":.*")
#[1] ": 458" ": 1" NA
If you need to keep the values of the ones that don't have a match
gsub(".*(:.*)", "\\1", test)
#[1] ": 458" ": 1" "whatever"
Try any of these. The first two use only base R. The last one assumes that we want to return a numeric vector.
1) sub
s <- sub(".*:", ":", test)
ifelse(test == s, NA, s)
## [1] ": 458" ": 1" NA
If there can be more than one : in a string then replace the pattern with "^[^:]*:" .
2) strsplit
sapply(strsplit(test, ":"), function(x) c(paste0(":", x), NA)[2])
## [1] ": 458" ": 1" NA
Do not use this one if there can be more than one : in a string.
3) strapplyc
library(gsubfn)
s <- strapplyc(test, "(:.*)|$", simplify = TRUE)
ifelse(s == "", NA, s)
## [1] ": 458" ": 1" NA
We can omit the ifelse line if "" is ok instead of NA.
4) strapply If the idea is really that there are some digits on the line and we want to return the numbers or NA then try this:
library(gsubfn)
strapply(test, "\\d+|$", as.numeric, simplify = TRUE)
## [1] 458 1 NA
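For comparison, here is a base R sketch that keeps the question's narrower pattern (a colon, a space, then digits) and fills NA where nothing matches. It uses [0-9] rather than the question's [1-9] so that values containing a zero are captured whole; that change is an assumption about the intent.

```r
test <- c("surface area: 458", "bedrooms: 1", "whatever")

m <- regexpr(": [0-9]*", test)            # -1 where there is no match
out <- rep(NA_character_, length(test))   # start with all NA
out[m != -1] <- regmatches(test, m)       # regmatches returns matched elements only
out
# [1] ": 458" ": 1"   NA
```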

Merging specific strings in character vector

I have a character vector where each element is a word. It has been generated from a text in which some segments are marked up with angle brackets. These segments vary in length. I need the marked-up segments to be merged in the vector.
The input looks like this:
c("This","is","some","text","with","<marked","up","chunks>[L]","in","it")
I need the output to look like this:
c("This","is","some","text","with","<marked up chunks>[L]","in","it")
Thanks.
Here's an approach that also works with multiple chunks in a vector:
vec <- c("This","is","some","text","with","<marked","up","chunks>[L]","in","it")
from <- grep("<", vec)
to <- grep(">", vec)
idx <- mapply(seq, from, to, SIMPLIFY = FALSE)
new_strings <- sapply(idx, function(x)
paste(vec[x], collapse = " "))
replacement <- unlist(mapply(function(x, y) c(y, rep(NA, length(x) - 1)),
idx, new_strings, SIMPLIFY = FALSE))
new_vec <- "attributes<-"(na.omit(replace(vec, unlist(idx), replacement)), NULL)
[1] "This" "is"
[3] "some" "text"
[5] "with" "<marked up chunks>[L]"
[7] "in" "it"
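An alternative sketch of the same merge, using a grouping index instead of NA placeholders. It assumes chunks are not nested and that every "<" has a matching ">" later in the vector; the variable names are illustrative.

```r
vec <- c("This","is","some","text","with","<marked","up","chunks>[L]","in","it")

opens  <- grepl("<", vec, fixed = TRUE)
closes <- grepl(">", vec, fixed = TRUE)

# TRUE from an opening word up to and including its closing word
inside <- cumsum(opens) - cumsum(c(FALSE, head(closes, -1))) > 0
# words that continue a chunk started earlier
continuation <- inside & !opens
# every non-continuation word starts a new group
group <- cumsum(!continuation)

merged <- unname(vapply(split(vec, group), paste, character(1), collapse = " "))
merged[6]
# [1] "<marked up chunks>[L]"
```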

R remove stopwords from a character vector using %in%

I have a data frame with strings that I'd like to remove stop words from. I'm trying to avoid using the tm package as it's a large data set and tm seems to run a bit slowly. I am using the tm stopword dictionary.
library(plyr)
library(tm)
stopWords <- stopwords("en")
class(stopWords)
df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."
head(df1)
df1$string1 <- tolower(df1$string1)
str1 <- strsplit(df1$string1[5], " ")
> !(str1 %in% stopWords)
[1] TRUE
This is not the answer I'm looking for. I'm trying to get a vector or string of the words NOT in the stopWords vector.
What am I doing wrong?
You are not accessing the list properly and you're not getting the elements back from the result of %in% (which gives a logical vector of TRUE/FALSE). You should do something like this:
unlist(str1)[!(unlist(str1) %in% stopWords)]
(or)
str1[[1]][!(str1[[1]] %in% stopWords)]
For the whole data.frame df1, you could do something like:
'%nin%' <- Negate('%in%')
lapply(df1[,2], function(x) {
t <- unlist(strsplit(x, " "))
t[t %nin% stopWords]
})
# [[1]]
# [1] "string" "string."
#
# [[2]]
# [1] "string" "slightly" "string."
#
# [[3]]
# [1] "string" "string."
#
# [[4]]
# [1] "string" "slightly" "shorter" "string."
#
# [[5]]
# [1] "string" "string" "strings."
First, unlist str1 (or use lapply if str1 holds more than one string); words below is the stop-word vector:
!(unlist(str1) %in% words)
#> [1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
Second. Complex solution:
string <- c("This string is a string.",
"This string is a slightly longer string.",
"This string is an even longer string.",
"This string is a slightly shorter string.",
"This string is the longest string of all the other strings.")
rm_words <- function(string, words) {
stopifnot(is.character(string), is.character(words))
spltted <- strsplit(string, " ", fixed = TRUE) # fixed = TRUE for speedup
vapply(spltted, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1))
}
rm_words(string, tm::stopwords("en"))
#> [1] "string string." "string slightly longer string." "string even longer string."
#> [4] "string slightly shorter string." "string longest string strings."
Came across this question when I was working on something similar.
Though this has been answered already, I just thought to put up a concise line of code which I used for my problem as well - which will help eliminate all the stop words directly in your dataframe:
df1$string1 <- unlist(lapply(df1$string1, function(x) {paste(unlist(strsplit(x, " "))[!(unlist(strsplit(x, " ")) %in% stopWords)], collapse=" ")}))
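Note that in the outputs above a word with trailing punctuation, such as "string.", is compared with the punctuation attached, so a stop word at the end of a sentence would be kept. A minimal sketch that strips punctuation before the lookup follows; the function name remove_stop_words and the short stop-word vector are made up for illustration and stand in for tm's list.

```r
# Small illustrative stop-word list (an assumption, not tm::stopwords("en"))
stop_words <- c("this", "is", "a", "an", "the", "of", "all", "other")

remove_stop_words <- function(strings, stop_words) {
  vapply(strsplit(strings, " +"), function(words) {
    # compare a lower-cased, punctuation-free copy, but keep the original words
    bare <- tolower(gsub("[[:punct:]]", "", words))
    paste(words[!bare %in% stop_words], collapse = " ")
  }, character(1))
}

cleaned <- remove_stop_words(c("This string is a string.", "The end."), stop_words)
cleaned
# [1] "string string." "end."
```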
