How can I extract the words/sentence next to a specific word? Example:
"On June 28, Jane went to the cinema and ate popcorn"
I would like to choose 'Jane' and get [-2,2], meaning:
"June 28, Jane went to"
We could make a function to help out. This might make it a little more dynamic.
library(tidyverse)
txt <- "On June 28, Jane went to the cinema and ate popcorn"
grab_text <- function(text, target, before, after){
  # split into words, locate the target, then take the surrounding window
  words <- str_split(text, "\\s")[[1]]
  pos <- which(grepl(target, words))
  paste(words[(pos - before):(pos + after)], collapse = " ")
}
grab_text(text = txt, target = "Jane", before = 2, after = 2)
#> [1] "June 28, Jane went to"
First we split the sentence, then we find the position of the target word, then we grab the specified number of words before and after it, and finally we collapse the words back into a sentence.
I have a shorter version using str_extract from stringr
library(stringr)
txt <- "On June 28, Jane went to the cinema and ate popcorn"
str_extract(txt,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")
[1] "June 28, Jane went to"
The function str_extract extracts the pattern from the string. The regex \\s matches whitespace, and [^\\s] is its negation, so anything but whitespace. The whole pattern is therefore Jane preceded and followed by two repetitions of a whitespace plus a run of non-whitespace characters (i.e. a word).
The advantage is that it is already vectorized, and if you have a vector of text you can use str_extract_all:
s <- c("On June 28, Jane went to the cinema and ate popcorn.
The next day, Jane hiked on a trail.",
"an indeed Jane loved it a lot")
str_extract_all(s,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")
[[1]]
[1] "June 28, Jane went to" "next day, Jane hiked on"
[[2]]
[1] "an indeed Jane loved it"
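If the window size needs to vary, the same regex can be generated on the fly. A small sketch with sprintf (the helper name context_pattern is made up here), assuming the target word contains no regex metacharacters:

```r
library(stringr)

# Hypothetical helper: build the same pattern for any target word and
# window size; the target must not contain regex metacharacters.
context_pattern <- function(target, before, after) {
  sprintf("([^\\s]+\\s+){%d}%s(\\s+[^\\s]+){%d}", before, target, after)
}

txt <- "On June 28, Jane went to the cinema and ate popcorn"
str_extract(txt, context_pattern("Jane", 2, 2))
# [1] "June 28, Jane went to"
```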
Here's an example with an expansion for multiple occurrences. Basically, split on whitespace, find the word, expand the indices, then make a list of results.
s <- "On June 28, Jane went to the cinema and ate popcorn. The next day, Jane hiked on a trail."
words <- strsplit(s, '\\s+')[[1]]
inds <- grep('Jane', words)
lapply(inds, FUN = function(i) {
  # max/min clamp the window so it stays inside the string
  paste(words[max(1, i-2):min(length(words), i+2)], collapse = ' ')
})
#> [[1]]
#> [1] "June 28, Jane went to"
#>
#> [[2]]
#> [1] "next day, Jane hiked on"
Created on 2019-09-17 by the reprex package (v0.3.0)
This should work ({2} grabs two words on each side, matching the [-2,2] window asked for; with a larger count than the words available before the target, the result would be NA):
stringr::str_extract(txt, "(?:[^\\s]+\\s){2}Jane(?:\\s[^\\s]+){2}")
I want to extract the dates along with a regex pattern (dates come after the pattern) from sentences.
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
The pattern is 'number of subscribers', followed by the date in 'Month Day, Year' format. Sometimes 'as of' or 'in' (or nothing at all) sits between the pattern and the date.
I have tried the following script.
find_dates <- function(text){
  pattern <- "\\bnumber\\s+of\\s+subscribers\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
  str_extract(text, pattern)
}
However, this extracts the in-between words too, which I would like to ignore.
Desired output:
find_dates(text1)
'number of subscribers December 31, 2022'
find_dates(text2)
'number of subscribers January 10, 2023'
An approach using stringr
library(stringr)
find_Dates <- function(x) {
  # the first alternative ends with a trailing space, so collapse = ""
  # joins the two matches into one clean string
  paste0(str_extract_all(x, "\\bnumber\\b (\\b\\S+\\b ){2}|\\b\\S+\\b \\d{2}, \\d{4}")[[1]],
         collapse = "")
}
find_Dates(text1)
[1] "number of subscribers December 31, 2022"
# all texts
lapply(c(text1, text2), find_Dates)
[[1]]
[1] "number of subscribers December 31, 2022"
[[2]]
[1] "number of subscribers January 10, 2023"
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
find_dates <- function(text){
  # pattern <- "(\\bnumber\\s+of\\s+subscribers)\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
  pattern <- "(\\bnumber\\s+of\\s+subscribers)(?:\\s+as\\s+of\\s|\\s+in\\s+)?(\\S+(\\s+\\S+){2})" # key phrase, then the 3-word date
  str_extract(text, pattern, 1:2)
}
find_dates(text1)
# [1] "number of subscribers" "December 31, 2022"
find_dates(text2)
# [1] "number of subscribers" "January 10, 2023"
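Since str_extract() with a group vector returns the two pieces separately, a final paste() gives the single string from the desired output (using the find_dates() and text1 defined just above):

```r
# collapse the two captured pieces into one string
paste(find_dates(text1), collapse = " ")
# [1] "number of subscribers December 31, 2022"
```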
In the example below I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I intended since result 2 includes "of the United States".
I assume the error is due to the .*? part since . can also match 'of the United States'. Any ideas how to exclude it?
I guess more generally speaking, the question is how to include an optional 'element' into a lookbehind (which seems not to be possible since ? makes it a non-fixed length input).
Many thanks!
library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")
str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"
#> [2] " of the United States decided on 5 March 2011"
Created on 2021-12-09 by the reprex package (v2.0.1)
I also tried
str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
however the result is the same.
In this case, I would prefer the PCRE engine available in base R (perl = TRUE), rather than the ICU engine that stringr/stringi use, because PCRE supports \K, which discards everything matched so far from the final match.
pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"
You can do this with str_match_all and group capture:
str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>%
.[[1]] %>% .[, 2]
[1] " decided on 2 April 2020" " decided on 5 March 2011"
I have a data frame of tweets for a sentiment analysis I am working on. I want to remove references to some proper names (for example, "Jeff Smith"). Is there a way to remove all or partial references to a name in the same command? Right now I am doing it the long way:
library(stringr)
str_detect(text, c('(Jeff Smith) | (Jeff) | (Smith)' ))
But that obviously gets cumbersome as I add more names. Ideally there'd be some way to feed just "Jeff Smith" and then be able to match all or some of it. Does anybody have any ideas?
Some sample code if you would like to play with it:
tweets = data.frame(text = c('Smith said he’s not counting on Monday being a makeup day.',
"Williams says that Steve Austin will miss the rest of the week",
"Weird times: Jeff Smith just got thrown out attempting to steal home",
"Rest day for Austin today",
"Jeff says he expects to bat leadoff", "Jeff", "No reference to either name"))
name = c("Jeff Smith", "Steve Austin")
Based on the data shown, every row except the last should be TRUE:
library(dplyr)
library(stringr)
pat <- str_c(gsub(" ", "\\b|\\b", str_c("\\b", name, "\\b"),
fixed = TRUE), collapse="|")
tweets %>%
mutate(ind = str_detect(text, pat))
-output
# text ind
#1 Smith said he’s not counting on Monday being a makeup day. TRUE
#2 Williams says that Steve Austin will miss the rest of the week TRUE
#3 Weird times: Jeff Smith just got thrown out attempting to steal home TRUE
#4 Rest day for Austin today TRUE
#5 Jeff says he expects to bat leadoff TRUE
#6 Jeff TRUE
#7 No reference to either name FALSE
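A sketch of an equivalent, perhaps more transparent way to build the same pattern: split each full name into its words first, then wrap every word in word boundaries. Matching each word individually also covers the full "Jeff Smith" case:

```r
library(stringr)

name <- c("Jeff Smith", "Steve Austin")

# one \b-wrapped alternative per word, across all names
pat <- paste0("\\b", unlist(strsplit(name, " ")), "\\b", collapse = "|")
pat
# [1] "\\bJeff\\b|\\bSmith\\b|\\bSteve\\b|\\bAustin\\b"

str_detect("Rest day for Austin today", pat)
# [1] TRUE
```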
Not a beauty, but it works.
#example data
namelist <- c('Jeff Smith', 'Kevin Arnold')
namelist_spreaded <- strsplit(namelist, split = ' ')
f <- function(x) {
  # note: no spaces around "|" -- inside a regex they would become
  # part of the alternatives themselves
  paste0('(',
         paste(x, collapse = ' '),
         ')|(',
         paste(x, collapse = ')|('),
         ')')
}
lapply(namelist_spreaded, f)
I have a group of names worded in a bizarre fashion. Here is a sample:
Sammy WatkinsS. Watkins
Buffalo BillsBUF
New England PatriotsNE
Tre'Quan SmithT. Smith
JuJu Smith-SchusterJ. Smith-Schuster
My goal is to clean it so that either first and last name are shown for players, or just the team name is returned for teams. Here is what I have tried:
df$name <- sub("^(.*[a-z])[A-Z]", "\\1", "\\1", df$name)
This is what I'm getting returned
Sammy WatkinsS. Watkins
Buffalo BillsBUF
New England PatriotsNE
Tre'Quan SmithT. Smith
JuJu Smith-SchusterJ. Smith-Schuster
To be clear, goal would be to have this:
Sammy Watkins
Buffalo Bills
New England Patriots
Tre'Quan Smith
JuJu Smith-Schuster
data
df <- data.frame(name = c(
"Sammy WatkinsS. Watkins",
"Buffalo BillsBUF",
"New England PatriotsNE",
"Tre'Quan SmithT. Smith",
"JuJu Smith-SchusterJ. Smith-Schuster"),
stringsAsFactors = FALSE)
I suggest
df$name <- sub("\\B[A-Z]+(?:\\.\\s+\\S+)*$", "", df$name)
Pattern details
\B - a non-word boundary (there must be a letter, digit or _ right before)
[A-Z]+ - 1+ ASCII uppercase letters (use \p{Lu} to match any Unicode uppercase letters)
(?:\.\s+\S+)* - 0 or more sequences of:
\. - a dot
\s+ - 1+ whitespaces
\S+ - 1+ non-whitespaces
$ - end of string.
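Applied to the question's data, the pattern strips the duplicated abbreviation in every row:

```r
df <- data.frame(name = c("Sammy WatkinsS. Watkins", "Buffalo BillsBUF",
                          "New England PatriotsNE", "Tre'Quan SmithT. Smith",
                          "JuJu Smith-SchusterJ. Smith-Schuster"),
                 stringsAsFactors = FALSE)

sub("\\B[A-Z]+(?:\\.\\s+\\S+)*$", "", df$name)
# [1] "Sammy Watkins"        "Buffalo Bills"        "New England Patriots"
# [4] "Tre'Quan Smith"       "JuJu Smith-Schuster"
```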
What about:
(?<=[a-z])[A-Z](?=[.\sA-Z]).*
Without experience in R I'm unsure if this would be accepted. Also, there may be neater patterns, as I'm rather new to regex.
I've also included a (possibly unlikely) sample: Sammy J. WatkinsJ.S. Watkins
Two passes:
df$name <- gsub(".\\. .*", "", df$name)
df$name <- gsub("[A-Z]*$", "", df$name)
The first line removes all cases of the form "x. surname" and the second removes all capital letters at the end of the string.
Another way :
sub("(.*?\\s.*?[a-z](?=[A-Z])).*", "\\1", df$name, perl = TRUE)
#> [1] "Sammy Watkins" "Buffalo Bills" "New England Patriots"
#> [4] "Tre'Quan Smith" "JuJu Smith-Schuster"
sub(".*?\\s.*?[a-z](?=[A-Z])", "", df$name, perl = TRUE)
#> [1] "S. Watkins" "BUF" "NE"
#> [4] "T. Smith" "J. Smith-Schuster"
We're splitting between a lower case character and an upper case character, but not before we see a space.
You could also use unglue :
library(unglue)
unglue_unnest(df, name, "{name1=.*?\\s.*?[a-z]}{name2=[A-Z].*?}")
#> name1 name2
#> 1 Sammy Watkins S. Watkins
#> 2 Buffalo Bills BUF
#> 3 New England Patriots NE
#> 4 Tre'Quan Smith T. Smith
#> 5 JuJu Smith-Schuster J. Smith-Schuster
I have a dataframe with a character column with names in the following format: "Lastname Middlename Title". I need to swap "Lastname" and "Title" and it varies how many middle names there are for each row.
Examples of input:
Doe John Mr.
Smith John Doe Mr.
Desired output:
Mr. John Doe
Mr. John Doe Smith
You can do it with sub and backreferences. Using data x <- c("Doe John Mr.", "Smith John Doe Mr."):
sub("^(\\w+)( .* )(\\w+\\.?)$", "\\3\\2\\1", x)
#### OUTPUT ####
[1] "Mr. John Doe" "Mr. John Doe Smith"
This captures three groups: 1) the first word in the string ^(\\w+), 2) everything between the first word and the last word ( .* ), and 3) the last word in the string with 0 or 1 periods (\\w+\\.?)$. It then swaps groups 1 and 3 while leaving 2 where it is.
We may use strsplit.
str1 <- "Doe John Mr."
str2 <- "Smith John Doe Mr."
Reduce(paste, el(strsplit(str1, " "))[3:1])
# [1] "Mr. John Doe"
Reduce(paste, el(strsplit(str2, " "))[c(4, 2, 3, 1)])
# [1] "Mr. John Doe Smith"
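The index vectors above are hard-coded per string. A small sketch of a generalized version (the name swap_title is made up here) that moves the last word (the title) to the front and the first word (the last name) to the end, for any number of middle names:

```r
swap_title <- function(x) {
  vapply(strsplit(x, " "), function(p) {
    n <- length(p)
    # title first, middle names in original order, last name last
    paste(c(p[n], p[-c(1, n)], p[1]), collapse = " ")
  }, character(1))
}

swap_title(c("Doe John Mr.", "Smith John Doe Mr."))
# [1] "Mr. John Doe"       "Mr. John Doe Smith"
```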
I used the tokenizers package to split up the input string and then reversed the token order. I noticed your example is in reverse order, so that's what I'm working off of. If you have other examples where they're not in reverse order, all you have to do is arrange the tokens in the order that you need.
library(tokenizers)
string <- "Doe John Mr. Smith Doe John Mr."
# lowercase = FALSE keeps the original capitalization (the periods are still stripped)
y <- tokenize_words(string, lowercase = FALSE, strip_punct = TRUE, simplify = TRUE)
rev(y)