Selecting only strings with 2 words in vector via regex - r

I have a simple character string like:
test<-c("two words", "three more words", "something else", "this has a lot of words", "more of this", "pick me")
I would need a function that returns the indices of test where there are only 2 words in the element (in this example this would be index 1, 3 and 6, but 2, 4 and 5 are completely uninteresting). More context: I am searching for "real" names of persons among a large vector that is mixed also with company names (which have often 3 or more words). I have no clue how to perhaps get regex (or any other technique) to do this...

We can use grep to match the word (\\w+) followed by a space followed by other word (\\w+) from the start (^) to end ($) of the string
grep("^\\w+ \\w+$", test)
[#1] 1 3 6
Or with str_count
library(stringr)
which(str_count(test, "\\w+") == 2)
#[1] 1 3 6

One option involving stringr could be.
which(is.na(word(test, 1, 3, fixed(" "))))
[1] 1 3 6

Related

How to extract first 2 words from a string in R?

I need to extract first 2 words from a string. If the string contains more than 2 words, it should return the first 2 words else if the string contains less than 2 words it should return the string as it is.
I've tried using 'word' function from stringr package but it's not giving the desired output for cases where len(string) < 2.
word(dt$var_containing_strings, 1,2, sep=" ")
Example:
Input String: Auto Loan (Personal)
Output: Auto Loan
Input String: Others
Output: Others
If you want to use stringr::word(), you can do:
ifelse(is.na(word(x, 1, 2)), x, word(x, 1, 2))
[1] "Auto Loan" "Others"
Sample data:
x <- c("Auto Loan (Personal)", "Others")
Something like this?
a <- "this is a character string"
unlist(strsplit(a, " "))[1:2]
[1] "this" "is"
EDIT:
To add the part where original string is returned if number of worlds is less than 2, a simple if-else function can be used:
a <- "this is a character string"
words <- unlist(strsplit(a, " "))
if (length(words) > 2) {
words[1:2]
} else {
a
}
You could use regex in base R using sub
sub("(\\w+\\s+\\w+).*", "\\1", "Auto Loan (Personal)")
#[1] "Auto Loan"
which will also work if you have only one word in the text
sub("(\\w+\\s+\\w+).*", "\\1", "Auto")
#[1] "Auto"
Explanation :
Here we extract the pattern shown inside round brackets which is (\\w+\\s+\\w+) which means :
\\w+ One word followed by \\s+ whitespace followed by \\w+ another word, so in total we extract two words. Extraction is done using backreference \\1 in sub.

Capture entire substring using regex if there is a match on a number

I have been unable to find the answer to this specific question, I am using R to clean some survey data.
I have some messy survey data with question names as columns, that sometimes include a number and sometimes don't. When they include a number, it will often contain some subcharacters as well indicating the question. Example, I have this vector:
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
I want to extract the substrings that contain numbers, and return no results if there is no such match. Desired result (using R)
"1"
"1.a."
NA
"2"
"2.a."
"2.b."
NA
I know I can capture the first number, using
stri_extract_first_regex(questions, "[0-9]+")
But I am at a loss how to modify it to capture the whole string until the first whitespace if it finds a match using this pattern.
For you example data you might use:
[0-9]+(?:\.[a-z]\.)?
That will match:
[0-9]+ Match 1+ digits
(?: Non capturing group
\.[a-z]\. Match a dot, lowercase character and a dot
)? Close non capturing group and make it optional
For example:
questions <- c(
"1 question 1 what do you think?",
"1.a. question 1a further details on what you think",
"Please explain",
"2 question 2 what is your motivation",
"2.a. further details",
"2.b. even further details",
"Please explain")
print(stri_extract_first_regex(questions, "[0-9]+(?:\\.[a-z]\\.)?"))
# [1] "1" "1.a." NA "2" "2.a." "2.b." NA
This might work:
hasnumber <- grepl("[0-9]+",questions)
firstspaces <- sapply(gregexpr(" ", questions), function(x) x[[1]])
res <- ifelse(hasnumber, substr(questions,1,firstspaces-1), NA)
> res
[1] "1" "1.a." NA "2" "2.a." "2.b." NA
The most difficult part I guess is to define where are the first spaces in each question, which could be done with loops or here sapply
You may use
questions <- sub("^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.*", "\\1", questions)
questions[questions==""] <- NA
questions
# => [1] "1" "1.a." NA "2" "2.a." "2.b." NA
The ^(\\d+(?:\\.[a-z0-9]+)*\\.?).*|.* matches
^ - start of string
(\\d+(?:\\.[a-z0-9]+)*) - Capturing group 1:
\\d+ - 1+ digits
(?:\\.[a-z0-9]+)* - 0 or more repetitions of
\\. - a dot
[a-z0-9]+ - 1 or more lowercase ASCII letters or digits
\\.? - an optional dot
.* - any 0+ chars to the end of the string
| - or
.* - the whole string.
Replaces with the contents of Group 1. If the second alternative matches, the result is an empty string, questions[questions==""] <- NA replaces these elements with NAs.

regular expression: ".*\\s([0-9]+)\\snomination.*$"

Could someone explain why "Won 1 Oscar." can be picked out according to the regular expression given as below
awards <- c("Won 1 Oscar.",
"Won 1 Oscar. Another 9 wins & 24 nominations.",
"1 win and 2 nominations.",
"2 wins & 3 nominations.",
"Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
"4 wins & 1 nomination.")
sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
I can only get that the pattern is "abcd (any number 0 -9 ) nominationabcd". Once the pattern is matched, the number will replace the whole string. The matched "Won 1 Oscar" comes from the second element. What I am confused is that there is no nomination.* following "Won 1 " and why there seems to be no replacement.
The gsub function takes the regex (or a plain string if you use fixed=TRUE) and tries to find a match in the input character vector. If the match is found, this match is replaced with the replacement string/pattern. If the match is not found, thecurrent character (string) is returned unchanged.
Since you want to get the only nominations value from each element of the character vector, you need to extract them, rather than replace the matches.
You may rely on the stringr str_extract:
> library(stringr)
> str_extract(awards, "[0-9]+(?=\\s*nomination)")
[1] NA "24" "2" "3" "2" "1"
The [0-9]+(?=\\s*nomination) pattern finds 1 or more digits but only those that are followed with 0+ whitespaces and nomination char sequence (these whitespaces and the "nomination" word are excluded from the matches as this is a pattern inside a positive lookahead ((?=...)) construct that is non-consuming, i.e. not putting the matched text into the match value).

Extract Between Parts of a String

I have a string of names in the following format:
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
I am trying to extract the single digit after the second hyphen. There are instances where there will be a third hyphen and an additional digit at the end of the name. The desired output is:
1, 2, 1, 2
I assume that I will need to use sub/gsub but am not sure where to start. Any suggestions?
We can use sub to match the pattern of zero or more characters that are not a - ([^-]*) from the start (^) of the string followed by a - followed by zero or more characters that are not a - followed by a - and the number that follows being captured as a group. In the replacement, we use the backreference of the captured group (\\1)
as.integer(sub("^[^-]*-[^-]*-(\\d).*", "\\1", names))
#[1] 1 2 1 2
Or this can be modified to
as.integer(sub("^([^-]*-){2}(\\d).*", "\\2", names))
#[1] 1 2 1 2
Here's an alternative using stringr
library("stringr")
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
output = str_split_fixed(names, pattern = "-", n = 4)[,3]

Count the number of all words in a string

Is there a function to count the number of words in a string?
For example:
str1 <- "How many words are in this sentence"
to return a result of 7.
Use the regular expression symbol \\W to match non-word characters, using + to indicate one or more in a row, along with gregexpr to find all matches in a string. Words are the number of word separators plus 1.
lengths(gregexpr("\\W+", str1)) + 1
This will fail with blank strings at the beginning or end of the character vector, when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, \\S+, [[:alpha:]], etc., but there will always be edge cases with a regex approach), etc. It is likely more efficient than strsplit solutions, which will allocate memory for each word. Regular expressions are described in ?regex.
Update As noted in the comments and in a different answer by #Andri the approach fails with (zero) and one-word strings, and with trailing punctuation
str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3
Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think my answer's caveat about 'notion of one word' in the original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero and one word cases are a problem; #Andri's solution fails to distinguish between zero and one words. So taking a 'positive' approach to finding words one might
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
Leading to
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3
Again the regular expression might be refined for different notions of 'word'.
I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like #user813966, but with a regular expression to delimit words) and making use of the original notion of delimiting words is
lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3
This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.
Most simple way would be:
require(stringr)
str_count("one, two three 4,,,, 5 6", "\\S+")
... counting all sequences on non-space characters (\\S+).
But what about a little function that lets us also decide which kind of words we would like to count and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo=F){
ifelse( pseudo,
pattern <- "\\S+",
pattern <- "[[:alpha:]]+"
)
str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6
I use the str_count function from the stringr library with the escape sequence \w that represents:
any ‘word’ character (letter, digit or underscore in the current
locale: in UTF-8 mode only ASCII letters and digits are considered)
Example:
> str_count("How many words are in this sentence", '\\w+')
[1] 7
Of all other 9 answers that I was able to test, only two (by Vincent Zoonekynd, and by petermeissner) worked for all inputs presented here so far, but they also require stringr.
But only this solution works with all inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".
Benchmark:
library(stringr)
questions <-
c(
"", "x", "x y", "x y!", "x y! z",
"foo+bar+baz~spam+eggs",
"one, two three 4,,,, 5 6",
"How many words are in this sentence",
"How many words are in this sentence",
"Combien de mots sont dans cette phrase ?",
"
Day after day, day after day,
We stuck, nor breath nor motion;
"
)
answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)
score <- function(f) sum(unlist(lapply(questions, f)) == answers)
funs <-
c(
function(s) sapply(gregexpr("\\W+", s), length) + 1,
function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
function(s) length(str_match_all(s, "\\S+")[[1]]),
function(s) str_count(s, "\\S+"),
function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
function(s) length(unlist(strsplit(s," "))),
function(s) sapply(strsplit(s, " "), length),
function(s) str_count(s, '\\w+')
)
unlist(lapply(funs, score))
Output (11 is the maximum possible score):
6 10 10 8 9 9 7 6 6 11
You can use strsplit and sapply functions
sapply(strsplit(str1, " "), length)
str2 <- gsub(' {2,}',' ',str1)
length(strsplit(str2,' ')[[1]])
The gsub(' {2,}',' ',str1) makes sure all words are separated by one space only, by replacing all occurences of two or more spaces with one space.
The strsplit(str,' ') splits the sentence at every space and returns the result in a list. The [[1]] grabs the vector of words out of that list. The length counts up how many words.
> str1 <- "How many words are in this sentence"
> str2 <- gsub(' {2,}',' ',str1)
> str2
[1] "How many words are in this sentence"
> strsplit(str2,' ')
[[1]]
[1] "How" "many" "words" "are" "in" "this" "sentence"
> strsplit(str2,' ')[[1]]
[1] "How" "many" "words" "are" "in" "this" "sentence"
> length(strsplit(str2,' ')[[1]])
[1] 7
You can use str_match_all, with a regular expression that would identify your words.
The following works with initial, final and duplicated spaces.
library(stringr)
s <- "
Day after day, day after day,
We stuck, nor breath nor motion;
"
m <- str_match_all( s, "\\S+" ) # Sequences of non-spaces
length(m[[1]])
Try this function from stringi package
require(stringi)
> s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
+ "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
+ "Cras vel lorem. Etiam pellentesque aliquet tellus.",
+ "")
> stri_stats_latex(s)
CharsWord CharsCmdEnvir CharsWhite Words Cmds Envirs
133 0 30 24 0 0
Also from stringi package, the straight forward function stri_count_words
stringi::stri_count_words(str1)
#[1] 7
You can use wc function in library qdap:
> str1 <- "How many words are in this sentence"
> wc(str1)
[1] 7
You can remove double spaces and count the number of " " in the string to get the count of words. Use stringr and rm_white {qdapRegex}
str_count(rm_white(s), " ") +1
Try this
length(unlist(strsplit(str1," ")))
require(stringr)
str_count(x,"\\w+")
will be fine with double/triple spaces between words
All other answers have issues with more than one space between the words.
The solution 7 does not give the correct result in the case there's just one word.
You should not just count the elements in gregexpr's result (which is -1 if there where not matches) but count the elements > 0.
Ergo:
sapply(gregexpr("\\W+", str1), function(x) sum(x>0) ) + 1
require(stringr)
Define a very simple function
str_words <- function(sentence) {
str_count(sentence, " ") + 1
}
Check
str_words(This is a sentence with six words)
Use nchar
if vector of strings is called x
(nchar(x) - nchar(gsub(' ','',x))) + 1
Find out number of spaces then add one
You could use stringr functions str_split() and boundary(), which will recognize the boundaries of words while ignoring punctuation and any extra spaces
sapply(str_split("It's 12 o'clock already", boundary("word")), length)
#[1] 4
sapply(str_split(" It's >12 o'clock already ?! ", boundary("word")), length)
#[1] 4
With stringr package, one can also write a simple script that could traverse a vector of strings for example through a for loop.
Let's say
df$text
contains a vector of strings that we are interested in analysing. First, we add additional columns to the existing dataframe df as below:
df$strings = as.integer(NA)
df$characters = as.integer(NA)
Then we run a for-loop over the vector of strings as below:
for (i in 1:nrow(df))
{
df$strings[i] = str_count(df$text[i], '\\S+') # counts the strings
df$characters[i] = str_count(df$text[i]) # counts the characters & spaces
}
The resulting columns: strings and character will contain the counts of words and characters and this will be achieved in one-go for a vector of strings.
I've found the following function and regex useful for word counts, especially in dealing with single vs. double hyphens, where the former generally should not count as a word break, eg, well-known, hi-fi; whereas double hyphen is a punctuation delimiter that is not bounded by white-space--such as for parenthetical remarks.
txt <- "Don't you think e-mail is one word--and not two!" #10 words
words <- function(txt) {
length(attributes(gregexpr("(\\w|\\w\\-\\w|\\w\\'\\w)+",txt)[[1]])$match.length)
}
words(txt) #10 words
Stringi is a useful package. But it over-counts words in this example due to hyphen.
stringi::stri_count_words(txt) #11 words
There's a simple solution using split and len:
text = 'This is a test for counting words'
# default separator: space
result = len(text.split())
print("There are " + str(result) + " words.")
You can get more details at
https://www.delftstack.com/howto/python/python-count-words-in-string/

Resources