replace range of numbers with single numbers in a character string - r

Is there any way to replace a range of numbers with the individual numbers in a character string? The range can be any n-n, most probably around 1-15; 4-10 is also possible.
the range could be indicated with a) -
a <- "I would like to buy 1-3 cats"
or with a word b) for example: to, bis, jusqu'à
b <- "I would like to buy 1 jusqu'à 3 cats"
The results should look like
"I would like to buy 1,2,3 cats"
I found this: Replace range of numbers with certain number but could not really use it in R.

gsubfn in the gsubfn package is like gsub but instead of replacing the match with a replacement string it allows the user to specify a function (possibly in formula notation as done here). It then passes the matches to the capture groups in the regular expression, i.e. the matches to the parenthesized parts of the regular expression, as separate arguments and replaces the entire match with the output of the function. Thus we match "(\\d+)(-| to | bis | jusqu'à )(\\d+)" which results in three capture groups so 3 arguments to the function. In the function we use seq with the first and third of these. Note that seq can take character arguments and interpret them as numeric so we did not have to convert the arguments to numeric.
Thus we get this one-liner:
library(gsubfn)
s <- c(a, b) # test input strings
gsubfn("(\\d+)(-| to | bis | jusqu'à )(\\d+)", ~ paste(seq(..1, ..3), collapse = ","), s)
giving:
[1] "I would like to buy 1,2,3 cats" "I would like to buy 1,2,3 cats"

Not the most efficient, but ...
s <- c("I would like to buy 1-3 cats",
"I would like to buy 1 jusqu'à 3 cats",
"foo 22-33",
"quux 11-3 bar")
gre <- gregexpr("([0-9]+(-| to | bis | jusqu'à )[0-9]+)", s)
gre2 <- gregexpr('[0-9]+', regmatches(s, gre))
regmatches(s, gre) <- lapply(regmatches(regmatches(s, gre), gre2),
function(a) paste(do.call(seq, as.list(as.integer(a))), collapse = ","))
s
# [1] "I would like to buy 1,2,3 cats" "I would like to buy 1,2,3 cats"
# [3] "foo 22,23,24,25,26,27,28,29,30,31,32,33" "quux 11,10,9,8,7,6,5,4,3 bar"
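The nested regmatches call above seems to rely on each string containing exactly one range. A minimal sketch of a variant that expands several ranges per string is shown here; expand_ranges is a hypothetical helper name, and the separator list is assumed to be the same as in the pattern above.

```r
# Hypothetical helper (not from the answers above): expand every numeric
# range in each string, assuming the same separators as the pattern above.
expand_ranges <- function(x, sep = "-| to | bis | jusqu'à ") {
  pattern <- paste0("[0-9]+(", sep, ")[0-9]+")
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    vapply(hits, function(h) {
      # pull the two endpoints out of one matched range, e.g. "1-3"
      ends <- as.integer(regmatches(h, gregexpr("[0-9]+", h))[[1]])
      paste(seq(ends[1], ends[2]), collapse = ",")
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

expand_ranges("buy 1-3 cats and 5 to 7 dogs")
# [1] "buy 1,2,3 cats and 5,6,7 dogs"
```

Strings without any range pass through unchanged, since regmatches returns an empty match set for them.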

This is, in fact, a little tricky, unless someone has already written a package that does this (that I'm not aware of).
a <- "I would like to buy 1-3 cats"
pos <- unlist(gregexpr("\\d+\\D+", a))
a_split <- unlist(strsplit(a, ""))
replacement <- paste(seq.int(a_split[pos[1]], a_split[pos[2]]), collapse = ",")
gsub("\\d+\\D+\\d+", replacement, a)
# [1] "I would like to buy 1,2,3 cats"
EDIT: To show that the same solution works for arbitrary non digit characters between two numbers:
b <- "I would like to buy 1 jusqu'à 3 cats"
pos_b <- unlist(gregexpr("\\d+\\D+", b))
b_split <- unlist(strsplit(b, ""))
replacement <- paste(seq.int(b_split[pos_b[1]], b_split[pos_b[2]]), collapse = ",")
gsub("\\d+\\D+\\d+", replacement, b)
# [1] "I would like to buy 1,2,3 cats"
You can add arbitrary requirements for the run of nondigit characters if you'd like. If you need help with that, just share what the limits on the words or symbols that are between the numbers are!
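One caveat: because the snippets above index single characters of the split string, they only work when both endpoints are single digits. A sketch that also copes with multi-digit endpoints, assuming the first two runs of digits in the string are the endpoints of the range:

```r
a <- "I would like to buy 10-12 cats"
# first two runs of digits in the string, converted to integers
ends <- as.integer(regmatches(a, gregexpr("[0-9]+", a))[[1]][1:2])
replacement <- paste(seq(ends[1], ends[2]), collapse = ",")
# sub() replaces only the first digits / non-digits / digits run
result <- sub("[0-9]+[^0-9]+[0-9]+", replacement, a)
result
# [1] "I would like to buy 10,11,12 cats"
```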

Related

R: using \\b and \\B in regex

I read about regex and came across word boundaries. I found a question that is about the difference between \b and \B. Using the code from this question does not give the expected output. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?
You can use regexpr and regmatches to extract the match; grep only reports which elements contain a match (or, with value = TRUE, returns those elements whole). You can also use sub.
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than 1 match use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
grep returns the whole string because it only checks whether the pattern is present in the string. If you want to extract cat, you need to use other functions, such as str_extract from the stringr package:
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
The difference between \\b and \\B is that \\b marks a word boundary whereas \\B is its negation. That is, \\bcat\\b matches only if cat stands alone as a word (bounded by non-word characters such as spaces or punctuation, or by the start or end of the string), whereas \\Bcat\\B matches only if cat is inside a word. For example:
str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.
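A small base R sketch of the same distinction, using made-up test words and perl = TRUE so that the boundary semantics are those of PCRE:

```r
words <- c("cat", "a cat.", "(cat)", "scattered", "concat")
# \\bcat\\b: "cat" as a whole word, whatever non-word character surrounds it
standalone <- grepl("\\bcat\\b", words, perl = TRUE)
# \\Bcat\\B: "cat" with word characters on both sides
inside <- grepl("\\Bcat\\B", words, perl = TRUE)
standalone
# [1]  TRUE  TRUE  TRUE FALSE FALSE
inside
# [1] FALSE FALSE FALSE  TRUE FALSE
```

Note that "concat" matches neither pattern: there is no boundary before cat, but there is one after it.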

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" while preserving abbreviations?

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" without splitting abbreviations?
I know how to split the string at every uppercase letter, but doing that would split initialisms/abbreviations, like CA or USSR or even U.S.A. and I need to preserve those.
So I'm thinking some type of logical like if a word in a string isn't an initialism then split the word with a space where a lowercase character is followed by an uppercase character.
My snippet of code below inserts a space before every capital letter, but it also breaks initialisms: CA becomes C A.
s <- "WeLiveInCA"
trimws(gsub('([[:upper:]])', ' \\1', s))
# "We Live In C A"
or another example...
s <- c("IDon'tEatKittensFYI", "YouKnowYourABCs")
trimws(gsub('([[:upper:]])', ' \\1', s))
# "I Don't Eat Kittens F Y I" "You Know Your A B Cs"
The results I'd want would be:
"We Live In CA"
#
"I Don't Eat Kittens FYI" "You Know Your ABCs"
But this needs to be widely applicable (not just for my example)
Try with base R gregexpr/regmatches.
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
regmatches(s, gregexpr('[[:upper:]]+[^[:upper:]]*', s))
#[[1]]
#[1] "We" "Live" "In" "CA"
#
#[[2]]
#[1] "IDon't" "Eat" "Kittens" "FYI"
#
#[[3]]
#[1] "You" "Know" "Your" "ABCs"
Explanation.
[[:upper:]]+ matches one or more upper case letters;
[^[:upper:]]* matches zero or more occurrences of anything but upper case letters.
In sequence these two regular expressions match words starting with upper case letter(s) followed by something else.
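To get back single strings rather than a list of pieces, the matches can be pasted together again. A minimal sketch using the same pattern:

```r
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
pieces <- regmatches(s, gregexpr("[[:upper:]]+[^[:upper:]]*", s))
# join the pieces of each input string with single spaces
spaced <- vapply(pieces, paste, character(1), collapse = " ")
spaced
# [1] "We Live In CA"          "IDon't Eat Kittens FYI" "You Know Your ABCs"
```

As the output shows, a one-letter word directly followed by a capitalized word ("IDon't") is treated as one run of upper-case letters and is not split by this pattern.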

R count the number of words starts with given letter in a phrase

I would like to count how many times a word in a given string starts with a given letter.
For example, in the phrase "that pattern is great but pigs likes milk",
if I want the number of words starting with "g", there is only 1 ("great"), but right now I get 2 ("great" and "pigs").
this is the code i use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)
We need either a space or a word boundary to keep the letter from matching anywhere other than the start of a word. In addition, it may be better to use ignore.case = TRUE, as some words may begin with an uppercase letter.
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat){
lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1
You can also do it with stringr library
library(stringr)
str_count(x, "\\bg") # str_count works on the string directly, no split needed
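If counts for several starting letters are needed at once, one option is to tabulate the first letter of every word. A base R sketch, assuming words are separated by whitespace:

```r
x <- "that pattern is great but pogintless"
words <- strsplit(x, "[[:space:]]+")[[1]]
# lower-case first letter of each word, then count occurrences
first_letters <- tolower(substr(words, 1, 1))
counts <- table(first_letters)
counts[["g"]]
# [1] 1
counts[["p"]]
# [1] 2
```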

How to return the index of a vector that contains at least a string in another vector in R

I have a list containing verbs. I have another list containing sentences. How do I return the index of the sentence list that contains at least a verb in the verb list for that sentence?
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
I want it to return indexes 1, 3, and 4
Using no additional packages, we can sort of "or" different search terms together using | as follows:
Original question:
verbList <- list("punching, kicking, jumping, hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- gsub(", ", "|", verbList)
grep(v, sentenceList)
New question:
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- paste(verbList, collapse = "|")
grep(v, sentenceList)
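Note that a plain alternation also matches inside longer words. A sketch with a made-up sentence vector, wrapping the alternation in word boundaries to avoid that:

```r
verbList <- c("punching", "kicking", "jumping", "hopping")
sentences <- c("I am punching", "I went shopping", "I am hopping")
loose <- paste(verbList, collapse = "|")
# same alternation, but each verb must stand alone as a word
strict <- paste0("\\b(", loose, ")\\b")
grep(loose, sentences)
# [1] 1 2 3
grep(strict, sentences)
# [1] 1 3
```

The loose pattern picks up "shopping" because it contains "hopping"; the strict one does not.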
A solution using stringr and rebus: or1() builds an alternation pattern from verbList, and str_which() returns the indices of the sentences that match it.
library(stringr)
library(rebus)
# Check the index
result <- str_which(sentenceList, or1(verbList))
result
# [1] 1 3 4

Count the number of all words in a string

Is there a function to count the number of words in a string?
For example:
str1 <- "How many words are in this sentence"
to return a result of 7.
Use the regular expression symbol \\W to match non-word characters, using + to indicate one or more in a row, along with gregexpr to find all matches in a string. Words are the number of word separators plus 1.
lengths(gregexpr("\\W+", str1)) + 1
This will fail with blank strings at the beginning or end of the character vector, when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, \\S+, [[:alpha:]], etc., but there will always be edge cases with a regex approach), etc. It is likely more efficient than strsplit solutions, which will allocate memory for each word. Regular expressions are described in ?regex.
Update: As noted in the comments and in a different answer by @Andri, the approach fails with zero- and one-word strings, and with trailing punctuation:
str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3
Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think the caveat about the 'notion of one word' in my original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So, taking a 'positive' approach to finding words, one might use
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3
Again the regular expression might be refined for different notions of 'word'.
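One way to make that refinement explicit is to pass the pattern as an argument. A minimal sketch; count_words is a hypothetical helper name, and it counts matches via regmatches so that strings without a match yield 0 directly:

```r
# Hypothetical helper: count non-overlapping matches of `pattern` per string
count_words <- function(x, pattern = "[[:alpha:]]+") {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

str1 <- c("", "x", "x y", "x y!", "x y! z")
count_words(str1)
# [1] 0 1 2 2 3

# a looser notion of 'word': any run of non-space characters
count_words("one, two three 4,,,, 5 6", pattern = "[^[:space:]]+")
# [1] 6
```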
I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like @user813966, but with a regular expression to delimit words) and making use of the original notion of delimiting words is
lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3
This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.
The simplest way would be:
require(stringr)
str_count("one, two three 4,,,, 5 6", "\\S+")
... counting all sequences of non-space characters (\\S+).
But what about a little function that lets us also decide which kind of words we would like to count and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo = FALSE) {
  # pseudo = TRUE counts every run of non-space characters;
  # pseudo = FALSE counts only runs of letters
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6
I use the str_count function from the stringr library with the escape sequence \w that represents:
any ‘word’ character (letter, digit or underscore in the current
locale: in UTF-8 mode only ASCII letters and digits are considered)
Example:
> str_count("How many words are in this sentence", '\\w+')
[1] 7
Of all other 9 answers that I was able to test, only two (by Vincent Zoonekynd, and by petermeissner) worked for all inputs presented here so far, but they also require stringr.
But only this solution works with all inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".
Benchmark:
library(stringr)
questions <-
c(
"", "x", "x y", "x y!", "x y! z",
"foo+bar+baz~spam+eggs",
"one, two three 4,,,, 5 6",
"How many words are in this sentence",
"How many words are in this sentence",
"Combien de mots sont dans cette phrase ?",
"
Day after day, day after day,
We stuck, nor breath nor motion;
"
)
answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)
score <- function(f) sum(unlist(lapply(questions, f)) == answers)
funs <-
c(
function(s) sapply(gregexpr("\\W+", s), length) + 1,
function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
function(s) length(str_match_all(s, "\\S+")[[1]]),
function(s) str_count(s, "\\S+"),
function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
function(s) length(unlist(strsplit(s," "))),
function(s) sapply(strsplit(s, " "), length),
function(s) str_count(s, '\\w+')
)
unlist(lapply(funs, score))
Output (11 is the maximum possible score):
6 10 10 8 9 9 7 6 6 11
You can use strsplit and sapply functions
sapply(strsplit(str1, " "), length)
str2 <- gsub(' {2,}',' ',str1)
length(strsplit(str2,' ')[[1]])
The gsub(' {2,}',' ',str1) makes sure all words are separated by one space only, by replacing all occurrences of two or more spaces with one space.
The strsplit(str2,' ') splits the sentence at every space and returns the result in a list. The [[1]] grabs the vector of words out of that list. The length counts how many words there are.
> str1 <- "How many words are in this sentence"
> str2 <- gsub(' {2,}',' ',str1)
> str2
[1] "How many words are in this sentence"
> strsplit(str2,' ')
[[1]]
[1] "How" "many" "words" "are" "in" "this" "sentence"
> strsplit(str2,' ')[[1]]
[1] "How" "many" "words" "are" "in" "this" "sentence"
> length(strsplit(str2,' ')[[1]])
[1] 7
You can use str_match_all, with a regular expression that would identify your words.
The following works with initial, final and duplicated spaces.
library(stringr)
s <- "
Day after day, day after day,
We stuck, nor breath nor motion;
"
m <- str_match_all( s, "\\S+" ) # Sequences of non-spaces
length(m[[1]])
Try this function from stringi package
require(stringi)
> s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
+ "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
+ "Cras vel lorem. Etiam pellentesque aliquet tellus.",
+ "")
> stri_stats_latex(s)
CharsWord CharsCmdEnvir CharsWhite Words Cmds Envirs
133 0 30 24 0 0
Also from the stringi package, the straightforward function stri_count_words:
stringi::stri_count_words(str1)
#[1] 7
You can use wc function in library qdap:
> str1 <- "How many words are in this sentence"
> wc(str1)
[1] 7
You can remove double spaces and count the number of " " in the string to get the count of words. Use stringr and rm_white {qdapRegex}
str_count(rm_white(s), " ") +1
Try this
length(unlist(strsplit(str1," ")))
require(stringr)
str_count(x,"\\w+")
will be fine with double/triple spaces between words
All other answers have issues with more than one space between the words.
Solution 7 does not give the correct result when there is just one word.
You should not just count the elements in gregexpr's result (which is -1 if there were no matches) but count the elements > 0.
Ergo:
sapply(gregexpr("\\W+", str1), function(x) sum(x>0) ) + 1
require(stringr)
Define a very simple function
str_words <- function(sentence) {
str_count(sentence, " ") + 1
}
Check
str_words("This is a sentence with six words")
Use nchar
if vector of strings is called x
(nchar(x) - nchar(gsub(' ','',x))) + 1
Find out number of spaces then add one
You could use stringr functions str_split() and boundary(), which will recognize the boundaries of words while ignoring punctuation and any extra spaces
sapply(str_split("It's 12 o'clock already", boundary("word")), length)
#[1] 4
sapply(str_split(" It's >12 o'clock already ?! ", boundary("word")), length)
#[1] 4
With stringr package, one can also write a simple script that could traverse a vector of strings for example through a for loop.
Let's say
df$text
contains a vector of strings that we are interested in analysing. First, we add additional columns to the existing dataframe df as below:
df$strings = as.integer(NA)
df$characters = as.integer(NA)
Then we run a for-loop over the vector of strings as below:
for (i in seq_len(nrow(df)))
{
df$strings[i] = str_count(df$text[i], '\\S+') # counts the strings
df$characters[i] = str_count(df$text[i]) # counts the characters & spaces
}
The resulting columns, strings and characters, will contain the counts of words and characters, and this is achieved in one go for a vector of strings.
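Since these counting functions are vectorised, the loop is not strictly necessary. A base R sketch of the same two columns, using a made-up data frame and swapping in regmatches/gregexpr and nchar for str_count (str_count is vectorised as well, so the same idea should carry over):

```r
# made-up example data; the column name `text` follows the answer above
df <- data.frame(text = c("one two", "  three  four five ", ""),
                 stringsAsFactors = FALSE)
# number of runs of non-space characters per element
df$strings <- lengths(regmatches(df$text, gregexpr("[^[:space:]]+", df$text)))
# number of characters (including spaces) per element
df$characters <- nchar(df$text)
df$strings
# [1] 2 3 0
```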
I've found the following function and regex useful for word counts, especially in dealing with single vs. double hyphens, where the former generally should not count as a word break, eg, well-known, hi-fi; whereas double hyphen is a punctuation delimiter that is not bounded by white-space--such as for parenthetical remarks.
txt <- "Don't you think e-mail is one word--and not two!" #10 words
words <- function(txt) {
length(attributes(gregexpr("(\\w|\\w\\-\\w|\\w\\'\\w)+",txt)[[1]])$match.length)
}
words(txt) #10 words
Stringi is a useful package. But it over-counts words in this example due to hyphen.
stringi::stri_count_words(txt) #11 words
There's a simple solution in Python (not R) using split and len:
text = 'This is a test for counting words'
# default separator: space
result = len(text.split())
print("There are " + str(result) + " words.")
You can get more details at
https://www.delftstack.com/howto/python/python-count-words-in-string/
