How to extract conversational utterances from a single string in R

I have a conversation between several speakers recorded as a single string:
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
I also have a vector of the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
Using this vector as a component of my regex pattern I'm doing relatively well with this extraction:
library(stringr)
str_extract_all(convers,
paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya" "hi how wz your weekend" "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff" "yeah i know thats erm al" "hey guys how s it goin"
[7] "Great" "where ve you been last week"
However, the first part of the third speaker's name (al) is contained in one of the extracted utterances (yeah i know thats erm al), and the last utterance by speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances get matched and extracted correctly?

What if you take another approach?
Remove all speakers from the text and split the string on '\\s*:\\s*'
strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]
# [1] "" "hiya"
# [3] "hi how wz your weekend" "ahh still got a headache An you party a lot"
# [5] "nuh you know my kid s sick n stuff" "yeah i know thats erm"
# [7] "hey guys how s it goin" "Great"
# [9] "where ve you been last week" "ah you know camping with my girl friend"
You can clean up the output a bit to remove the first empty value from it.
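One way to do that cleanup, as a minimal sketch on a shortened version of the string (nzchar() keeps only the non-empty elements). Keep in mind that this approach also deletes a speaker's name wherever it occurs inside an utterance, because gsub() removes every occurrence.

```r
speakers <- c("Peter", "Mary", "al hamshi")
# shortened conversation, for illustration only
convers <- "Peter: hiya Mary: hi al hamshi: hey guys"

# remove the names, then split on the colons that are left behind
parts <- strsplit(gsub(paste(speakers, collapse = "|"), "", convers),
                  "\\s*:\\s*")[[1]]

# drop the empty string produced by the leading ": "
utterances <- parts[nzchar(parts)]
utterances
# [1] "hiya"     "hi"       "hey guys"
```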

A correct splitting approach would look like
p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")
strsplit(sub("^\\W+", "", gsub(p2, "", convers, perl=TRUE)), "\\s*:\\s*")[[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
The regex to remove speakers from the string will look like
\s*\b(?:Peter|Mary|al hamshi)(?=:)
It will match
\s* - 0+ whitespaces
\b - a word boundary
(?:Peter|Mary|al hamshi) - one of the speaker names
(?=:) - that must be followed with a : char.
Then, the non-word chars at the start are removed with the sub("^\\W+", "", ...) call, and then the whole string is split with \s*:\s* regex that matches a : enclosed with 0+ whitespaces.
Alternatively, you can use
(?<=(?:Peter|Mary|al hamshi):\s).*?(?=\s*(?:Peter|Mary|al hamshi):|\z)
Details:
(?<=(?:Peter|Mary|al hamshi):\s) - a location immediately preceded with any speaker name, a : and a whitespace
.*? - any 0+ chars (other than line break chars, use (?s) at the pattern start to make it match any chars) as few as possible
(?=\s*(?:Peter|Mary|al hamshi):|\z) - a location immediately followed with 0+ whitespaces, then any speaker name and a : or end of string.
In R, you can use
library(stringr)
speakers <- c("Peter", "Mary", "al hamshi")
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
p = paste0("(?<=(?:",paste(speakers, collapse="|"),"):\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)")
str_extract_all(convers, p)
# => [[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
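If you also need to know who said what, the same building blocks can be combined. The following is only a sketch in base R on a shortened string; the names name_pat, who, said and turns are made up for this example, and it assumes every utterance is introduced by one of the known speaker names followed by a colon.

```r
speakers <- c("Peter", "Mary", "al hamshi")
# shortened conversation, for illustration only
convers <- "Peter: hiya Mary: hi how wz your weekend al hamshi: hey guys"

# a speaker name that is directly followed by a colon
name_pat <- paste0("\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")

# the speakers, in order of appearance
who <- regmatches(convers, gregexpr(name_pat, convers, perl = TRUE))[[1]]

# the utterances, using the splitting approach from above
no_names <- gsub(paste0("\\s*", name_pat), "", convers, perl = TRUE)
said <- strsplit(sub("^\\W+", "", no_names), "\\s*:\\s*", perl = TRUE)[[1]]

turns <- data.frame(speaker = who, utterance = said)
turns
```

Both vectors have one entry per turn here, so they line up row by row in the data frame.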

Related

Regex - remove sentences starting with certain words only if it is the last sentence

As per the title, I am trying to clean a large compilation of short texts by removing sentences that start with certain words, but only when such a sentence is the last of more than one sentence in that text.
Suppose I want to cut out the last sentence if it begins with 'Jack is ...'
Here is an example with varied cases:
test_strings <- c("Jack is the tallest person.",
"and Jack is the one who said, let there be fries.",
"There are mirrors. And Jack is there to be suave.",
"There are dogs. And jack is there to pat them. Very cool.",
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz",
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
And here is the regex I currently have: "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$"
map_chr(test_strings, ~str_replace(.x, "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$", "[TRIM]"))
Producing these results:
[1] "[TRIM]"
[2] "and [TRIM]"
[3] "There are mirrors. And [TRIM]"
[4] "There are dogs. And [TRIM]"
[5] "Jack is your lumberjack. [TRIM]"
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
## Basically my current regex is still too greedy.
## No trimming should happen for the first 4 examples.
## 5 - 7th examples are correct.
## Explanations:
# (1) Wrong. Only one sentence; do not trim, but current regex trims it.
# (2) Wrong. It is a sentence but does not start with 'Jack is'.
# (3) Wrong. Same situation as (2) -- the sentence starts with 'And' instead of 'Jack is'
# (4) Wrong. Same as (2) (3), but this time test with lowercase `jack`
# (5) Correct. Trim the second sentence as it is the last. Optional ',' removal is tested here.
# (6) Correct.
# (7) Correct. Sometimes texts do not begin with alphabets.
Thanks for any help!
gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?$", "\\1 [TRIM]", test_strings, ignore.case = TRUE)
# [1] "Jack is the tallest person."
# [2] "and Jack is the one who said, let there be fries."
# [3] "There are mirrors. And Jack is there to be suave."
# [4] "There are dogs. And jack is there to pat them. Very cool."
# [5] "Jack is your lumberjack. [TRIM]"
# [6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
# [7] "'Jack is so cool!' Jack is cool. [TRIM]"
Break-down:
^(.*\\.)\\s*: since we need there to be at least one sentence before what we trim out, we need to find a preceding dot \\.;
Jack,? is from your regex
[^.]*\\.?$: zero or more "not .-dots" followed by an optional .-dot and end-of-string; if you want to allow blank space after the last period, then you can change this to [^.]*\\.?\\s*$, which didn't seem necessary in your example
You can match a dot (or match more chars using a character class such as [.!?]) and then match the last sentence that starts with Jack and ends with a dot (or, again, use the character class to match more chars):
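As a quick check of the trailing-whitespace variant mentioned in the last point, here is a small sketch. The inputs are two of the test strings with spaces appended, which is my own modification for illustration.

```r
# two of the test strings, padded with trailing spaces (made-up variation)
padded <- c("Jack is your lumberjack. Jack, is super awesome.  ",
            "Jack is the tallest person. ")

# same pattern as above, with \\s* added before the end-of-string anchor
res <- gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?\\s*$", "\\1 [TRIM]",
            padded, ignore.case = TRUE)
res
# [1] "Jack is your lumberjack. [TRIM]" "Jack is the tallest person. "
```

The single-sentence string is left alone because there is no earlier dot for the first group to end on.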
\.\K\h*[Jj]ack,? is[^.\n]*\.$
The pattern matches:
\.\K Match a . and forget what is matched so far
\h*[Jj]ack,? is Match optional horizontal whitespace chars, then Jack or jack, and optional comma and is
[^.\n]*\. Optionally match any char except a . or a newline, then match a .
$ End of string
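The character-class variant mentioned above could look like the following sketch. The inputs are my own variations on the test strings (sentences ending in ! or ?), so treat them as hypothetical.

```r
# hypothetical inputs with other sentence terminators
x <- c("'Jack is so cool!' Jack is cool. Jack is also cold!",
       "Is it late? Jack is asleep.",
       "Jack is the tallest person!")

# [.!?] replaces the literal dot in all three places of the pattern
res <- sub("[.!?]\\K\\h*[Jj]ack,? is[^.!?\\n]*[.!?]$", " [TRIM]", x, perl = TRUE)
res
# [1] "'Jack is so cool!' Jack is cool. [TRIM]"
# [2] "Is it late? [TRIM]"
# [3] "Jack is the tallest person!"
```

perl = TRUE is required because \K and \h are PCRE features.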
Example code:
test_strings <- c("Jack is the tallest person.",
"and Jack is the one who said, let there be fries.",
"There are mirrors. And Jack is there to be suave.",
"There are dogs. And jack is there to pat them. Very cool.",
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz",
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
sub("\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$", " [TRIM]", test_strings, perl=TRUE)
Output
[1] "Jack is the tallest person."
[2] "and Jack is the one who said, let there be fries."
[3] "There are mirrors. And Jack is there to be suave."
[4] "There are dogs. And jack is there to pat them. Very cool."
[5] "Jack is your lumberjack. [TRIM]"
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"

Split 2 numbers character into its own new line

I've a character object with 84 elements.
> head(output.by.line)
[1] "\n17"
[2] "Now when Joseph saw that his father"
[3] "laid his right hand on the head of"
[4] "Ephraim, it displeased him; so he took"
[5] "hold of his father's hand to remove it"
[6] "from Ephraim's head to Manasseh's"
But there is a line that has 2 numbers (49) that is not on its own line:
[35] "49And Jacob called his sons and"
I'd like to transform this into:
[35] "\n49"
[36] "And Jacob called his sons and"
And insert this in the correct numeration, after object 34.
Dput Output:
dput(output.by.line)
c("\n17", "Now when Joseph saw that his father", "laid his right hand on the head of",
"Ephraim, it displeased him; so he took", "hold of his father's hand to remove it",
"from Ephraim's head to Manasseh's", "head.", "\n18", "And Joseph said to his father, \"Not so,",
"my father, for this one is the firstborn;", "put your right hand on his head.\"",
"\n19", "But his father refused and said, \"I", "know, my son, I know. He also shall",
"become a people, and he also shall be", "great; but truly his younger brother shall",
"be greater than he, and his descendants", "shall become a multitude of nations.\"",
"\n20", "So he blessed them that day, saying,", "\"By you Israel will bless, saying, \"May",
"God make you as Ephraim and as", "Manasseh!\"' And thus he set Ephraim",
"before Manasseh.", "\n21", "Then Israel said to Joseph, \"Behold, I",
"am dying, but God will be with you and", "bring you back to the land of your",
"fathers.", "\n22", "Moreover I have given to you one", "portion above your brothers, which I",
"took from the hand of the Amorite with", "my sword and my bow.\"",
"49And Jacob called his sons and", "said, \"Gather together, that I may tell",
"you what shall befall you in the last", "days:", "\n2", "\"Gather together and hear, you sons of",
"Jacob, And listen to Israel your father.", "\n3", "\"Reuben, you are my firstborn, My",
"might and the beginning of my strength,", "The excellency of dignity and the",
"excellency of power.", "\n4", "Unstable as water, you shall not excel,",
"Because you went up to your father's", "bed; Then you defiled it-- He went up to",
"my couch.", "\n5", "\"Simeon and Levi are brothers;", "Instruments of cruelty are in their",
"dwelling place.", "\n6", "Let not my soul enter their council; Let",
"not my honor be united to their", "assembly; For in their anger they slew a",
"man, And in their self-will they", "hamstrung an ox.", "\n7",
"Cursed be their anger, for it is fierce;", "And their wrath, for it is cruel! I will",
"divide them in Jacob And scatter them", "in Israel.", "\n8",
"\"Judah, you are he whom your brothers", "shall praise; Your hand shall be on the",
"neck of your enemies; Your father's", "children shall bow down before you.",
"\n9", "Judah is a lion's whelp; From the prey,", "my son, you have gone up. He bows",
"down, he lies down as a lion; And as a", "lion, who shall rouse him?",
"\n10", "The scepter shall not depart from", "Judah, Nor a lawgiver from between his",
"feet, Until Shiloh comes; And to Him", "shall be the obedience of the people.",
"\n11", "Binding his donkey to the vine, And his", "donkey's colt to the choice vine, He"
)
Please, check this:
library(tidyverse)
split_line_number <- function(x) {
x %>%
str_replace("^([0-9]+)", "\n\\1\b") %>%
str_split("\b")
}
output.by.line %>%
map(split_line_number) %>%
unlist()
# Output:
# [35] "\n49"
# [36] "And Jacob called his sons and"
# [37] "said, \"Gather together, that I may tell"
# [38] "you what shall befall you in the last"
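The same mark-then-split idea can be sketched in base R without any packages. This version uses a tab as the temporary marker, which assumes no line contains a tab, and it assumes there are no empty strings in the vector (strsplit() would silently drop them).

```r
# a few lines from the data, enough to show the effect
x <- c("my sword and my bow.\"",
       "49And Jacob called his sons and",
       "\n2",
       "\"Gather together and hear, you sons of")

# put "\n" before a leading number that runs into a letter, and a tab after it
marked <- sub("^([0-9]+)(?=[[:alpha:]])", "\n\\1\t", x, perl = TRUE)

# split on the tab; lines without a marker come back unchanged
out <- unlist(strsplit(marked, "\t", fixed = TRUE))
out
```

Here out has five elements, with "\n49" and "And Jacob called his sons and" as the second and third.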
An option using stringr::str_match is to match two components of an optional number followed by everything. Get the captured output from the matched matrix (2:3) and create a new vector of strings by dropping NAs and empty strings.
vals <- c(t(stringr::str_match(output.by.line, "(\n?\\d+)?(.*)")[, 2:3]))
output <- vals[!is.na(vals) & vals != ""]
output[32:39]
#[1] "portion above your brothers, which I"
#[2] "took from the hand of the Amorite with"
#[3] "my sword and my bow.\""
#[4] "49"
#[5] "And Jacob called his sons and"
#[6] "said, \"Gather together, that I may tell"
#[7] "you what shall befall you in the last"
#[8] "days:"
We'll make use of the stringr package:
library(stringr)
Modify the object:
output.by.line <- unlist(
ifelse(grepl('[[:digit:]][[:alpha:]]', output.by.line), str_split(gsub('([[:digit:]]+)([[:alpha:]])', paste0('\n', '\\1 \\2'), output.by.line), '[[:blank:]]', n = 2), output.by.line)
)
Print the results:
dput(output.by.line)
#[32] "portion above your brothers, which I"
#[33] "took from the hand of the Amorite with"
#[34] "my sword and my bow.\""
#[35] "\n49"
#[36] "And Jacob called his sons and"
#[37] "said, \"Gather together, that I may tell"
#[38] "you what shall befall you in the last"

Changing "firstname lastname" to "lastname, firstname"

I have a list of names that I need to convert from "Firstname Lastname" to "Lastname, Firstname".
Barack Obama
Donald J. Trump
J. Edgar Hoover
Beyonce Knowles-Carter
Sting
I used G. Grothendieck's answer to "last name, first name" -> "first name last name" in serialized strings to get to gsub("([^ ]*) ([^ ]*)", "\\2, \\1", str) which gives me -
Obama, Barack
J., DonaldTrump,
Edgar, J.Hoover,
Knowles-Carter, Beyonce
Sting
What I would like to get -
Obama, Barack
Trump, Donald J.
Hoover, J. Edgar
Knowles-Carter, Beyonce
Sting
I would like a regex answer.
There is an esoteric function called person designed for holding names, a conversion function as.person which does this parsing for you, and a format method to make use of it afterwards (with a creative use of the braces argument). It even works with complex surnames (e.g. van Nistelrooy), but the single-name result is unsatisfactory. It can be fixed with a quick ending sub though.
x <- c("Barack Obama","Donald J. Trump","J. Edgar Hoover","Beyonce Knowles-Carter","Sting", "Ruud van Nistelrooy", "John von Neumann")
y <- as.person(x)
format(y, include=c("family","given"), braces=list(family=c("",",")))
[1] "Obama, Barack" "Trump, Donald J."
[3] "Hoover, J. Edgar" "Knowles-Carter, Beyonce"
[5] "Sting," "van Nistelrooy, Ruud"
[7] "von Neumann, John"
## fix for single names - curse you Sting!
sub(",$", "", format(y, include=c("family","given"), braces=list(family=c("",","))))
[1] "Obama, Barack" "Trump, Donald J."
[3] "Hoover, J. Edgar" "Knowles-Carter, Beyonce"
[5] "Sting" "van Nistelrooy, Ruud"
[7] "von Neumann, John"
Use
gsub("(.*[^van])\\s(.*)", "\\2, \\1", people)
The regex:
(.*[^van]) \\s (.*)
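Any amount of characters ending in a character other than v, a or n... the last whitespace that follows such a character... the last name, containing any characters. Note that [^van] is a character class: it rejects a single character before the space, not the word "van", so a first or middle name that ends in v, a or n will not be split at that space.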
Data:
people <- c("Barack Obama",
"Donald J. Trump",
"J. Edgar Hoover",
"Beyonce Knowles-Carter",
"Sting",
"Ruud van Nistelrooy",
"Xi Jinping",
"Hans Zimvanmer")
Result:
[1] "Obama, Barack" "Trump, Donald J." "Hoover, J. Edgar"
[4] "Knowles-Carter, Beyonce" "Sting" "van Nistelrooy, Ruud"
[7] "Jinping, Xi" "Zimvanmer, Hans"
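To illustrate what the [^van] class does, here is a small sketch. "Anna Smith" is a made-up name added for the demonstration, and the second pattern is only one possible alternative: it treats the last whitespace-separated word as the surname.

```r
people <- c("Barack Obama", "Donald J. Trump", "J. Edgar Hoover",
            "Beyonce Knowles-Carter", "Sting",
            "Anna Smith")  # hypothetical name whose first name ends in "a"

# pattern from above: no split is possible when v, a or n precedes the space
old <- gsub("(.*[^van])\\s(.*)", "\\2, \\1", people)
old[6]
# [1] "Anna Smith"

# alternative sketch: everything up to the last whitespace, then the last word
new <- sub("^(.*)\\s+(\\S+)$", "\\2, \\1", people, perl = TRUE)
new
# [1] "Obama, Barack"           "Trump, Donald J."
# [3] "Hoover, J. Edgar"        "Knowles-Carter, Beyonce"
# [5] "Sting"                   "Smith, Anna"
```

The trade-off is that the alternative gives "Nistelrooy, Ruud van" for "Ruud van Nistelrooy", so surname particles would need extra handling.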

Structure character data into data frame

I used the rvest package in R to scrape some web data, but I am having a lot of trouble getting it into a usable format.
My data currently looks like this:
test
[1] "v. Philadelphia"
[2] "TD GardenRegular Season"
[3] "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving"
[4] "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons"
[5] "100.7 - 83.4"
[6] "# Toronto"
[7] "Air Canada Centre Regular Season"
[8] "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford"
[9] "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet"
[10] "115.6 - 103.3"
Can someone help me get it into the following shape (as a data frame) and provide the code? I would really appreciate it:
Opponent Venue
Philadelphia TD Garden
Toronto Air Canada Centre
I do not need any of the other information.
Let me know if there are any issues :)
# put your data in here
input <- c("v. Philadelphia", "TD GardenRegular Season",
"", "", "",
"# Toronto", "Air Canada Centre Regular Season",
"", "", "")
index <- seq_along(input)
# raw table format
out_raw <- data.frame(Opponent = input[index%%5==1],
Venue = input[index%%5==2])
# using stringi package
# install.packages("stringi")  # run once if stringi is not installed
library(stringi)
# copy and clean up
out_clean <- out_raw
out_clean$Opponent <- stri_extract_last_regex(out_raw$Opponent, "(?<=\\s).*$")
out_clean$Venue <- trimws(gsub("Regular Season", "", out_raw$Venue))
out_clean
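The same result can be sketched in base R by reshaping the vector into a matrix first. This assumes, as in the sample, that every game takes up exactly five consecutive elements; the names m and games are made up for this example.

```r
test <- c("v. Philadelphia",
          "TD GardenRegular Season",
          "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving",
          "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons",
          "100.7 - 83.4",
          "# Toronto",
          "Air Canada Centre Regular Season",
          "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford",
          "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet",
          "115.6 - 103.3")

# one row per game, one column per field
m <- matrix(test, ncol = 5, byrow = TRUE)

games <- data.frame(
  Opponent = sub("^\\S+\\s+", "", m[, 1]),               # drop the "v." / "#" prefix
  Venue    = trimws(sub("Regular Season$", "", m[, 2])), # drop the trailing label
  stringsAsFactors = FALSE
)
games
```

If the number of elements per game varies, the matrix step will misalign the rows, so check the length of the scraped vector first.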

How to fix misplaced text words in R

I have a data set in which the column title contains the names of movies. In some of the rows, the words of the movie name have been misplaced.
[1] "Killer Shrews, The (1959)" [2] "Kronos (1957)"
[3] "Kronos (1973)" [4] "Phantom of the Opera, The (1943)"
[5] "Runaway (1984)" [6] "Slumber Party Massacre, The (1982)"
For example, the first one should be The Killer Shrews (1959).
I don't know how to fix this problem. Any thoughts?
We can use sub. Capture the characters as groups in the pattern argument and shuffle the backreferences in the replacement (assuming that the expected output pattern is similar to the one shown for the first element).
sub("([^,]+),\\s+([^( ]+)\\s+(.*)", "\\2 \\1 \\3", v1)
#[1] "The Killer Shrews (1959)" "Kronos (1957)"
#[3] "Kronos (1973)" "The Phantom of the Opera (1943)"
#[5] "Runaway (1984)" "The Slumber Party Massacre (1982)"
data
v1 <- c("Killer Shrews, The (1959)", "Kronos (1957)", "Kronos (1973)",
"Phantom of the Opera, The (1943)", "Runaway (1984)",
"Slumber Party Massacre, The (1982)" )
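If some titles contain a comma of their own, a stricter pattern may be safer. The following is only a sketch: it assumes the misplaced word is one of The, A or An and that every title ends with a four-digit year in parentheses. The last title is not from the question's data; it is added to show the difference.

```r
v2 <- c("Killer Shrews, The (1959)",
        "Kronos (1957)",
        "Good, the Bad and the Ugly, The (1966)")  # hypothetical extra title

# pattern from above: splits at the first comma
loose <- sub("([^,]+),\\s+([^( ]+)\\s+(.*)", "\\2 \\1 \\3", v2)
loose[3]
# [1] "the Good Bad and the Ugly, The (1966)"

# stricter sketch: only move a trailing article that sits right before the year
strict <- sub("^(.*), (The|A|An) (\\([0-9]{4}\\))$", "\\2 \\1 \\3", v2)
strict
# [1] "The Killer Shrews (1959)"
# [2] "Kronos (1957)"
# [3] "The Good, the Bad and the Ugly (1966)"
```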
