I have a data set in which the title column contains the names of movies. In some of the rows the movie names are misplaced.
[1] "Killer Shrews, The (1959)" [2] "Kronos (1957)"
[3] "Kronos (1973)" [4] "Phantom of the Opera, The (1943)"
[5] "Runaway (1984)" [6] "Slumber Party Massacre, The (1982)"
For example, the first one should be "The Killer Shrews (1959)".
I don't know how to fix this problem. Any thoughts?
We can use sub. Capture the characters as groups in the pattern argument and rearrange the backreferences in the replacement (assuming that the expected output pattern is similar to the one shown for the first element).
sub("([^,]+),\\s+([^( ]+)\\s+(.*)", "\\2 \\1 \\3", v1)
#[1] "The Killer Shrews (1959)" "Kronos (1957)"
#[3] "Kronos (1973)" "The Phantom of the Opera (1943)"
#[5] "Runaway (1984)" "The Slumber Party Massacre (1982)"
data
v1 <- c("Killer Shrews, The (1959)", "Kronos (1957)", "Kronos (1973)",
"Phantom of the Opera, The (1943)", "Runaway (1984)",
"Slumber Party Massacre, The (1982)" )
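As a hypothetical extension (not part of the question), the same capture-and-rearrange idea can move other leading articles such as "A" or "An" by anchoring on the year; "An American in Paris" below is a made-up example, not from the question's data.

```r
# Move any trailing article (The/A/An) to the front, anchored on the "(yyyy)" year
v2 <- c("Killer Shrews, The (1959)", "American in Paris, An (1951)", "Kronos (1957)")
sub("^(.*), (The|A|An) (\\(\\d{4}\\))$", "\\2 \\1 \\3", v2)
# -> "The Killer Shrews (1959)", "An American in Paris (1951)", "Kronos (1957)"
```

Titles without a trailing article (like "Kronos (1957)") are left unchanged, since the pattern simply fails to match them.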
As per the title, I am trying to clean a large compilation of short texts to remove sentences that start with certain words -- but only if it is the last of more than one sentence in that text.
Suppose I want to cut out the last sentence if it begins with 'Jack is ...'
Here is an example with varied cases:
test_strings <- c("Jack is the tallest person.",
"and Jack is the one who said, let there be fries.",
"There are mirrors. And Jack is there to be suave.",
"There are dogs. And jack is there to pat them. Very cool.",
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz",
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
And here is the regex I currently have: "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$"
library(purrr)
library(stringr)
map_chr(test_strings, ~str_replace(.x, "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$", "[TRIM]"))
Producing these results:
[1] "[TRIM]"
[2] "and [TRIM]"
[3] "There are mirrors. And [TRIM]"
[4] "There are dogs. And [TRIM]"
[5] "Jack is your lumberjack. [TRIM]"
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
## Basically my current regex is still too greedy.
## No trimming should happen for the first 4 examples.
## 5 - 7th examples are correct.
## Explanations:
# (1) Wrong. Only one sentence; do not trim, but current regex trims it.
# (2) Wrong. It is a sentence, but it does not start with 'Jack is'.
# (3) Wrong. Same situation as (2) -- the sentence starts with 'And' instead of 'Jack is'.
# (4) Wrong. Same as (2) and (3), but this time tested with lowercase jack.
# (5) Correct. Trim the second sentence, as it is the last one. Optional ',' removal is tested here.
# (6) Correct.
# (7) Correct. Sometimes texts do not begin with a letter.
Thanks for any help!
gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?$", "\\1 [TRIM]", test_strings, ignore.case = TRUE)
# [1] "Jack is the tallest person."
# [2] "and Jack is the one who said, let there be fries."
# [3] "There are mirrors. And Jack is there to be suave."
# [4] "There are dogs. And jack is there to pat them. Very cool."
# [5] "Jack is your lumberjack. [TRIM]"
# [6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
# [7] "'Jack is so cool!' Jack is cool. [TRIM]"
Break-down:
^(.*\\.)\\s*: since we need there to be at least one sentence before what we trim out, we need to find a preceding dot \\.;
Jack,? is from your regex
[^.]*\\.?$: zero or more non-dot characters, followed by an optional .-dot and end-of-string; if you want to allow blank space after the last period, you can change this to [^.]*\\.?\\s*$, which didn't seem necessary in your examples
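If the preceding sentence can end in ! or ? as well, the leading dot in the capture group can be widened to a character class. A sketch under that assumption:

```r
# Assumption: any of . ! ? counts as a sentence boundary before the trimmed part
gsub("^(.*[.!?])\\s*[Jj]ack,? is[^.]*\\.?$", "\\1 [TRIM]",
     c("Wow! Jack is loud.", "Jack is the tallest person."))
# -> "Wow! [TRIM]", "Jack is the tallest person."
```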
You can match a dot (or more characters, using a character class such as [.!?]) and then match the last sentence containing Jack, ending with a dot (or, again, the character class to match more characters):
\.\K\h*[Jj]ack,? is[^.\n]*\.$
The pattern matches:
\.\K Match a . and forget what is matched so far
\h*[Jj]ack,? is Match optional horizontal whitespace chars, then Jack or jack, and optional comma and is
[^.\n]*\. Match zero or more chars other than a . or a newline, then match the closing .
$ End of string
Example code:
test_strings <- c("Jack is the tallest person.",
"and Jack is the one who said, let there be fries.",
"There are mirrors. And Jack is there to be suave.",
"There are dogs. And jack is there to pat them. Very cool.",
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz",
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
sub("\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$", " [TRIM]", test_strings, perl=TRUE)
Output
[1] "Jack is the tallest person."
[2] "and Jack is the one who said, let there be fries."
[3] "There are mirrors. And Jack is there to be suave."
[4] "There are dogs. And jack is there to pat them. Very cool."
[5] "Jack is your lumberjack. [TRIM]"
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
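If perl = TRUE is not available, the \K trick can be replaced by a capture group and a backreference; a sketch of the equivalent idea with base sub:

```r
# Capture the preceding dot and put it back in the replacement instead of using \K
sub("(\\.)\\s*[Jj]ack,? is[^.]*\\.$", "\\1 [TRIM]", test_strings)
```

This produces the same output as the \K version above.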
I have a conversation between several speakers recorded as a single string:
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
I also have a vector of the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
Using this vector as a component of my regex pattern I'm doing relatively well with this extraction:
library(stringr)
str_extract_all(convers,
paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya" "hi how wz your weekend" "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff" "yeah i know thats erm al" "hey guys how s it goin"
[7] "Great" "where ve you been last week"
However, the first part of the third speaker's name (al) is contained in one of the extracted utterances (yeah i know thats erm al), and the last utterance by speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances are matched and extracted correctly?
What if you take another approach?
Remove all speakers from the text and split the string on '\\s*:\\s*'
strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]
# [1] "" "hiya"
# [3] "hi how wz your weekend" "ahh still got a headache An you party a lot"
# [5] "nuh you know my kid s sick n stuff" "yeah i know thats erm"
# [7] "hey guys how s it goin" "Great"
# [9] "where ve you been last week" "ah you know camping with my girl friend"
You can clean up the output a bit to remove the first empty value from it.
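For example, the empty first element can be dropped with nzchar(), which keeps only non-empty strings:

```r
# Split as above, then keep only the non-empty utterances
res <- strsplit(gsub(paste(speakers, collapse = "|"), '', convers), '\\s*:\\s*')[[1]]
res[nzchar(res)]
```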
A correct splitting approach would look like
p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")
strsplit(sub("^\\W+", "", gsub(p2, "", convers, perl=TRUE)), "\\s*:\\s*")[[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
The regex to remove speakers from the string will look like
\s*\b(?:Peter|Mary|al hamshi)(?=:)
It will match:
\s* - 0+ whitespaces
\b - a word boundary
(?:Peter|Mary|al hamshi) - one of the speaker names
(?=:) - that must be followed with a : char.
Then, the non-word chars at the start are removed with the sub("^\\W+", "", ...) call, and then the whole string is split with \s*:\s* regex that matches a : enclosed with 0+ whitespaces.
Alternatively, you can use
(?<=(?:Peter|Mary|al hamshi):\s).*?(?=\s*(?:Peter|Mary|al hamshi):|\z)
Details:
(?<=(?:Peter|Mary|al hamshi):\s) - a location immediately preceded with any speaker name and a whitespace
.*? - any 0+ chars (other than line break chars, use (?s) at the pattern start to make it match any chars) as few as possible
(?=\s*(?:Peter|Mary|al hamshi):|\z) - a location immediately followed with 0+ whitespaces, then any speaker name and a : or end of string.
In R, you can use
library(stringr)
speakers <- c("Peter", "Mary", "al hamshi")
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
p = paste0("(?<=(?:",paste(speakers, collapse="|"),"):\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)")
str_extract_all(convers, p)
# => [[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
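As a possible extension (not asked for in the question), str_match_all can capture the speaker together with each utterance in one pass; the pattern below is an assumption built from the same speaker alternation:

```r
library(stringr)
alt <- paste(speakers, collapse = "|")
# Capture the speaker name and their utterance, stopping before the next speaker label
p3 <- paste0("\\b(", alt, "):\\s*(.*?)(?=\\s*\\b(?:", alt, "):|$)")
m <- str_match_all(convers, p3)[[1]]
# m[, 2] holds the speaker, m[, 3] the corresponding utterance
m[1, 2:3]
# -> "Peter" "hiya"
```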
I have a dataframe (df) with user_names and text for each user. I have another data_frame with important words. I want to create a for loop that iterates over each user and counts how often the important words appear in their text.
Data:
important_words = c("marcus", "yesterday", "democrat", "republican", "trump", "hillary")
df$user_names
[1] "marc12"
[2] "jon"
[3] "67han"
[4] "XXmark"
[5] "mark"
[6] "mark"
df$text
[1] "hi my name is marcus and i am a republican"
[2] "i support hillary"
[3] "go trump!"
[4] "tomorrow i will vote democrat"
[5] "i don't think so"
[6] "yesterday was ok"
We can extract all the important_words for each of the user_names and count the number of unique important words each user has.
library(dplyr)
library(stringr)
df %>%
  group_by(user_names) %>%
  summarise(unique_imp_word = n_distinct(unlist(str_extract_all(
    tolower(text),
    str_c('\\b', tolower(important_words), '\\b', collapse = "|")))))
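If the goal is the total number of occurrences rather than the number of distinct words, str_count can be applied with the same pattern. A sketch using the question's data (constructing the text vector directly here is an assumption, since df itself is not shown in full):

```r
library(stringr)
important_words <- c("marcus", "yesterday", "democrat", "republican", "trump", "hillary")
text <- c("hi my name is marcus and i am a republican", "i support hillary",
          "go trump!", "tomorrow i will vote democrat", "i don't think so",
          "yesterday was ok")
# One alternation of word-bounded important words, then count matches per row
pat <- str_c("\\b", tolower(important_words), "\\b", collapse = "|")
str_count(tolower(text), pat)
# -> 2 1 1 1 0 1
```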
I have multiple files, each with a different title, and I want to extract the title from each file. Here is an example of one file:
[1] "<START" "ID=\"CMP-001\"" "NO=\"1\">"
[4] "<NAME>Plasma-derived" "vaccine" "(PDV)"
[7] "versus" "placebo" "by"
[10] "intramuscular" "route</NAME>" "<DIC"
[13] "CHI2=\"3.6385\"" "CI_END=\"0.6042\"" "CI_START=\"0.3425\""
[16] "CI_STUDY=\"95\"" "CI_TOTAL=\"95\"" "DF=\"3.0\""
[19] "TOTAL_1=\"0.6648\"" "TOTAL_2=\"0.50487622\"" "BLE=\"YES\""
.
.
.
[789] "TOTAL_2=\"39\"" "WEIGHT=\"300.0\"" "Z=\"1.5443\">"
[792] "<NAME>Local" "adverse" "events"
[795] "after" "each" "injection"
[798] "of" "vaccine</NAME>" "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>"
[801] "</GROUP_LABEL_2>" "<GRAPH_LABEL_1>" "PDV</GRAPH_LABEL_1>"
the extracted expected title is
Plasma-derived vaccine (PDV) versus placebo by intramuscular route
Note that the title has a different length in each file.
Here is a solution using stringr. It first collapses the vector into one long string, and then captures all characters (other than a newline \n) between every pair of "<NAME>" and "</NAME>". In the future, people will be able to help you more easily if you provide a reproducible example (e.g., using dput()). Hope this helps!
Note: if you just want the first title, you can use str_match() instead of str_match_all().
library(stringr)
str_match_all(paste0(string, collapse = " "), "<NAME>(.*?)</NAME>")[[1]][,2]
[1] "Plasma-derived vaccine (PDV) versus placebo by intramuscular route"
[2] "Local adverse events after each injection of vaccine"
Data:
string <- c("<START", "ID=\"CMP-001\"", "NO=\"1\">", "<NAME>Plasma-derived", "vaccine", "(PDV)", "versus", "placebo", "by", "intramuscular", "route</NAME>", "<DIC", "CHI2=\"3.6385\"", "CI_END=\"0.6042\"", "CI_START=\"0.3425\"", "CI_STUDY=\"95\"", "CI_TOTAL=\"95\"", "DF=\"3.0\"", "TOTAL_1=\"0.6648\"", "TOTAL_2=\"0.50487622\"", "BLE=\"YES\"",
"TOTAL_2=\"39\"", "WEIGHT=\"300.0\"", "Z=\"1.5443\">", "<NAME>Local", "adverse", "events", "after", "each", "injection", "of", "vaccine</NAME>", "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>", "</GROUP_LABEL_2>", "<GRAPH_LABEL_1>", "PDV</GRAPH_LABEL_1>")
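The same extraction also works in base R with gregexpr()/regmatches(), assuming the string vector above; a sketch:

```r
# Collapse the tokens, pull every <NAME>...</NAME> span, then strip the tags
s <- paste0(string, collapse = " ")
m <- regmatches(s, gregexpr("<NAME>(.*?)</NAME>", s, perl = TRUE))[[1]]
gsub("</?NAME>", "", m)
# -> "Plasma-derived vaccine (PDV) versus placebo by intramuscular route"
#    "Local adverse events after each injection of vaccine"
```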
I have a dataframe twts consisting of movie tweets retrieved from Twitter. I want to extract the elements of the data frame that contain a word like "music" and store them in another data frame object. I thought of using for loops and parsing every sentence word by word. Is there a more efficient way, or a built-in function, to do this?
input : twt (initial data.frame)
[1] "Music is so nice"
[2] "the movie rocked"
[3] "the hero is the best"
[4] "theme music at its peak"
output : music (new data.frame)
[1] "Music is so nice"
[2] "theme music at its peak"
Thanks in advance :)
> twt <- c("Music is so nice", "the movie rocked", "the hero is the best", "theme music at its peak")
> twt[grep("music", twt, ignore.case = TRUE)]
[1] "Music is so nice" "theme music at its peak"
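To keep the result as a data frame rather than a vector (as the question asks), the same grepl() idea can be applied to a column; a sketch, assuming the tweet column is named text:

```r
# Hypothetical twts data frame with a `text` column holding the tweets
twts <- data.frame(text = c("Music is so nice", "the movie rocked",
                            "the hero is the best", "theme music at its peak"))
music <- subset(twts, grepl("music", text, ignore.case = TRUE))
music$text
# -> "Music is so nice" "theme music at its peak"
```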