Shifting text around in a sentence via R

I have an R dataframe with movie names like so:
Shawshank Redemption, The
Godfather II, The
Band of Brothers
I would like to display these names as:
The Shawshank Redemption
The Godfather II
Band of Brothers
Can anyone help with how to check each row of the dataframe to see if there is a 'The' after a comma (like above) and, if there is, shift it to the front of the sentence?

You can use gsub:
df$movies2 = gsub("^([\\w\\s]+),*\\s*([Tt]he*($|(?=\\s\\(\\d{4}\\))))", "\\2 \\1", df$movies, perl = TRUE)
Result:
> df
                            movies                         movies2
1 Shawshank Redemption, The (1994) The Shawshank Redemption (1994)
2                Godfather II, The                The Godfather II
3                 Band of Brothers                Band of Brothers
4               Dora, The Explorer              Dora, The Explorer
5             Kill Bill Vol. 2 The            Kill Bill Vol. 2 The
6                  ,The Highlander                 ,The Highlander
7                   Happening, the                   the Happening
Data:
df = data.frame(movies = c("Shawshank Redemption, The (1994)",
"Godfather II, The",
"Band of Brothers",
"Dora, The Explorer",
"Kill Bill Vol. 2 The",
",The Highlander",
"Happening, the"), stringsAsFactors = FALSE)
Notes:
The goal of the entire regex is to group the first part (part before ,) and the second part ('The' after , and only when it's at the end or before (year)) into separate capture groups which I can swap with \\2 and \\1
^([\\w\\s]+) matches any word character or spaces one or more times starting from the beginning of the string
,*\\s* matches comma and space both zero or more times
[Tt]he* matches "Th" or "th" followed by zero or more "e" characters; the * applies only to the e, so both "The" and "the" match
Notice that it is followed by ($|(?=\\s\\(\\d{4}\\))) which matches the "end of string", $, or a positive lookahead, which checks whether the previous pattern is followed by \\s\\(\\d{4}\\)
\\s\\(\\d{4}\\) matches a space and (4 digits) including the parentheses. Double backslashes are needed to escape a single backslash
So ([Tt]he*($|(?=\\s\\(\\d{4}\\)))) matches "The" or "the" either at the end of string or if it is followed by (4 digits)
Each pair of plain parentheses is a capture group (the lookahead (?=...) captures nothing), so \\2 \\1 swaps the first capture group, ([\\w\\s]+), with the second, ([Tt]he*($|(?=\\s\\(\\d{4}\\))))
Now, if a string doesn't end in "The" or "the" in this way, the pattern doesn't match at all, so gsub replaces nothing and returns the original string.
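A quick check of two points in the notes, plus a stricter variant. This is only a sketch: the strict pattern is my own assumption about what is wanted (a literal ", The" or ", the" at the end, optionally followed by a year in parentheses), not the pattern from the answer above.

```r
# The * in [Tt]he* binds only to the final "e".
e_check <- grepl("^[Tt]he*$", c("Th", "The", "Theee", "TheThe"))
e_check
# [1]  TRUE  TRUE  TRUE FALSE

# The comma is optional in the answer's pattern, so a word that merely
# ends in "the" is rewritten as well.
answer_pat <- "^([\\w\\s]+),*\\s*([Tt]he*($|(?=\\s\\(\\d{4}\\))))"
bathe <- gsub(answer_pat, "\\2 \\1", "Bathe", perl = TRUE)
bathe
# [1] "the Ba"

# Hypothetical stricter pattern: requires ", The" / ", the" at the end,
# with an optional " (year)" kept in a third group.
movies <- c("Shawshank Redemption, The (1994)", "Godfather II, The",
            "Band of Brothers", "Dora, The Explorer",
            "Kill Bill Vol. 2 The", ",The Highlander", "Happening, the")
strict <- "^(.*), ([Tt]he)( \\(\\d{4}\\))?$"
moved <- sub(strict, "\\2 \\1\\3", movies, perl = TRUE)
moved
# [1] "The Shawshank Redemption (1994)" "The Godfather II"
# [3] "Band of Brothers"                "Dora, The Explorer"
# [5] "Kill Bill Vol. 2 The"            ",The Highlander"
# [7] "the Happening"
```

Titles that do not match the strict pattern come back unchanged, because sub() only rewrites what it matches.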

This seems to work for me:
#create a vector of movies
x=c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers")
#use grep to find those with ", The" at the end
the.end=grep(", The$",x)
#trim movie titles to remove ", The"
trimmed=strtrim(x[the.end],nchar(x[the.end])-5)
#add "The " to the beginning of the trimmed titles
final=paste("The",trimmed)
#replace the trimmed elements of the movie vector
x[the.end]<-final
#take a look
x
Note that this doesn't remove ", The" from anywhere in the name other than the end... I think that's the behaviour that you want. It will also miss any "The" without the comma, or lower case "the". To see what I mean, try this as your initial movie vector:
#create a vector of movies
x=c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers",
"Dora, The Explorer", "Kill Bill Vol. 2 The", ",The Highlander",
"Happening, the")

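For what it's worth, the same steps can be written a little more compactly. This is just a sketch of the grep approach above, with sub() swapped in for strtrim() so the trailing ", The" is dropped by pattern rather than by counting characters.

```r
x <- c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers",
       "Dora, The Explorer", "Kill Bill Vol. 2 The", ",The Highlander",
       "Happening, the")

# indices of titles that end in ", The"
the.end <- grep(", The$", x)

# strip the trailing ", The" and put "The " in front
x[the.end] <- paste("The", sub(", The$", "", x[the.end]))
x
# [1] "The Shawshank Redemption" "The Godfather II"
# [3] "Band of Brothers"         "Dora, The Explorer"
# [5] "Kill Bill Vol. 2 The"     ",The Highlander"
# [7] "Happening, the"
```

Only the first two titles change; the lower-case "the" and the titles without a comma are left alone, as described above.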
Related

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" while preserving abbreviations?

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" without splitting abbreviations?
I know how to split the string at every uppercase letter, but doing that would split initialisms/abbreviations, like CA or USSR or even U.S.A. and I need to preserve those.
So I'm thinking of some kind of rule: if a word in the string isn't an initialism, insert a space wherever a lowercase character is followed by an uppercase character.
My snippet of code below inserts a space before every capital letter, but it breaks initialisms: CA undesirably becomes C A.
s <- "WeLiveInCA"
trimws(gsub('([[:upper:]])', ' \\1', s))
# "We Live In C A"
or another example...
s <- c("IDon'tEatKittensFYI", "YouKnowYourABCs")
trimws(gsub('([[:upper:]])', ' \\1', s))
# "I Don't Eat Kittens F Y I" "You Know Your A B Cs"
The results I'd want would be:
"We Live In CA"
#
"I Don't Eat Kittens FYI" "You Know Your ABCs"
But this needs to be widely applicable (not just for my example)
Try with base R gregexpr/regmatches.
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
regmatches(s, gregexpr('[[:upper:]]+[^[:upper:]]*', s))
#[[1]]
#[1] "We" "Live" "In" "CA"
#
#[[2]]
#[1] "IDon't" "Eat" "Kittens" "FYI"
#
#[[3]]
#[1] "You" "Know" "Your" "ABCs"
Explanation.
[[:upper:]]+ matches one or more upper case letters;
[^[:upper:]]* matches zero or more occurrences of anything but upper case letters.
In sequence these two regular expressions match words starting with upper case letter(s) followed by something else.
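If a single string with spaces is wanted rather than a list of pieces, one possible alternative (not part of the answer above) is a zero-width match with gsub and perl = TRUE: insert a space wherever a lowercase letter is immediately followed by an uppercase one. The sketch assumes plain ASCII letters.

```r
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")

# (?<=[a-z]) looks behind for a lowercase letter,
# (?=[A-Z]) looks ahead for an uppercase letter;
# nothing is consumed, so only a space is inserted.
spaced <- gsub("(?<=[a-z])(?=[A-Z])", " ", s, perl = TRUE)
spaced
# [1] "We Live In CA"          "IDon't Eat Kittens FYI" "You Know Your ABCs"
```

Like the regmatches version, this keeps "IDon't" together: there is no lowercase letter between "I" and "D", so a purely case-based rule has nothing to split on.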

R text mining - remove special characters and quotes

I'm doing a text mining task in R.
Tasks:
1) count sentences
2) identify and save quotes in a vector
Problems :
False full stops like "..." and periods in titles like "Mr." have to be dealt with.
There's definitely quotes in the text body data, and there'll be "..." in them. I was thinking to extract those quotes from the main body and save them in a vector. (there's some manipulation to be done with them too.)
IMPORTANT TO NOTE : My text data is in a Word document. I use readtext("path to .docx file") to load it in R. When I view the text, quotes appear as plain " and not as \", unlike in the reproducible text below.
path <- "C:/Users/.../"
a <- readtext(paste(path, "Text.docx", sep = ""))
title <- a$doc_id
text <- a$text
reproducible text
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
# splitting by "."
unlist(strsplit(text, "\\."))
The problem is it's splitting by false full-stops
Solution I tried:
# getting rid of . in titles
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)
The problem with this is that it's not replacing [Miss. by [Miss
To identify quotes :
stri_extract_all_regex(text, '"\\S+"')
but that's not working either. (The same pattern does work with \" in the code below.)
stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
The exact expected vector is :
sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")
I wanted the sentences separated (so I can count how many sentences in each paragraph).
And quotes also separated.
quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""
You may match all your current vec values using
gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
That is, \w+ matches 1 or more word chars and \. matches a dot.
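If you'd rather not depend on gsubfn, roughly the same idea can be sketched in base R by building one alternation from the titles. The names titles, pattern and sample_text are made up for this illustration, and the sample string is a shortened stand-in for the real text.

```r
titles <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")

# \\b(Mr|Mrs|...)\\. : a listed title at a word boundary, followed by a dot
pattern <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")

sample_text <- "Mr. and Mrs. Keyboard [Miss. Keyboard is o.k. ...]"

# keep the title (group 1) and drop the dot
cleaned <- gsub(pattern, "\\1", sample_text, perl = TRUE)
cleaned
# [1] "Mr and Mrs Keyboard [Miss Keyboard is o.k. ...]"
```

Because the pattern is anchored on a word boundary rather than on whitespace, "[Miss." is handled, while "o.k." and "..." are left as they are.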
Next, if you just want to extract quotes, use
regmatches(text, gregexpr('"[^"]*"', text))
The " matches a " and [^"]* matches 0 or more chars other than ".
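A small comparison may help to see why the original '"\\S+"' attempt fails on real quotes: \S+ cannot cross a space, so only one-word quotes are found. The sample string x is invented for the illustration.

```r
x <- 'She said "I will never get it..." and then "ok".'

# \\S+ stops at the first space, so only the one-word quote matches
strict <- regmatches(x, gregexpr('"\\S+"', x))[[1]]
strict
# [1] "\"ok\""

# [^"]* accepts anything except a double quote, spaces included
loose <- regmatches(x, gregexpr('"[^"]*"', x))[[1]]
loose
# [1] "\"I will never get it...\"" "\"ok\""
```

The backslashes in the printed result are only R's way of displaying a double quote inside a string; the strings themselves contain plain " characters.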
If you plan to match your sentences together with quotes, you might consider
regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
Details
\\s* - 0+ whitespaces
"[^"]*" - a ", 0+ chars other than " and a "
| - or
[^"?!.]+ - 1+ chars other than ?, ", ! and .
[[:space:]?!.]+ - 1 or more whitespace, ?, ! or . chars
[^"[:alnum:]]* - 0+ chars other than " and alphanumeric chars
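Since the first task was to count sentences, here is a minimal sketch of how the pieces could be counted once they are extracted. The short string x is a made-up example, chosen so that the result is easy to follow by hand.

```r
pattern <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'

x <- 'One. Two! "Three four."'

# each match is either a quoted passage or a run of text up to
# (and including) its closing punctuation and whitespace
pieces <- regmatches(x, gregexpr(pattern, x))[[1]]
pieces
# [1] "One. "            "Two! "            "\"Three four.\""

# number of sentence-like pieces
length(pieces)
# [1] 3
```

For a vector with one paragraph per element, lengths() applied to the full regmatches() result should give the count per paragraph.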
R sample code:
> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "
[2] "Keyboard Jr and Miss Keyboard. ... \n"
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""

Extract only the sentence portion of a section header

I have a small problem.
I have text that looks like:
B.1 My name is John
I want to only obtain:
My name is John
I'm having difficulty leaving out both the B and the 1, at the same time
You can do this with sub and a regular expression.
TestStrings = c("B.1 My name is John", "A.12 This is another sentence")
sub("\\b[A-Z]\\.\\d+\\s+", "", TestStrings)
[1] "My name is John" "This is another sentence"
The \\b indicates a word boundary (to eliminate multiple letters)
[A-Z] will match a single capital letter.
\\. will match a period
\\d+ will match one or more digits
\\s+ will match the one or more whitespace characters that follow the number.
The part that is matched will be replaced with the empty string.
If you are sure that every string starts with a prefix of the same length (here four characters, such as "B.1 "), you can simply take the substring from the fifth character onward:
> a<-"B.1 My name is John"
> substr(a, 5, nchar(a))
[1] "My name is John"
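To see where the two answers differ, it may help to run both on headers whose prefixes have different lengths. A small sketch, reusing the test strings from the first answer:

```r
headers <- c("B.1 My name is John", "A.12 This is another sentence")

# fixed position: assumes the prefix is always four characters long
by_position <- substr(headers, 5, nchar(headers))
by_position
# [1] "My name is John"           " This is another sentence"

# pattern: adapts to the length of the section number
by_pattern <- sub("\\b[A-Z]\\.\\d+\\s+", "", headers)
by_pattern
# [1] "My name is John"          "This is another sentence"
```

With the five-character prefix "A.12 ", the fixed offset leaves a leading space behind, while the regular expression removes the whole prefix.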

Removing parentheses, text preceding comma, and the comma in a string using stringr

I have a string that contains a person's name and city. It's formatted like this:
mock <- "Joe Smith (Cleveland, OH)"
I simply want the state abbreviation to remain, so in this case the only remaining string would be "OH".
I can get rid of the parentheses and comma with
[(.*?),]
Which gives me:
"Joe Smith Cleveland OH"
But I can't figure out how to combine all of it. For the record, all of the records will look like that, where it ends with ", two letter capital state abbreviation" (ex: ", OH", ", KY", ", MD" etc...)
You may use
mock <- "Joe Smith (Cleveland, OH)"
sub(".+,\\s*([A-Z]{2})\\)$","\\1",mock)
## => [1] "OH"
## With stringr:
library(stringr)
str_extract(mock, "[A-Z]{2}(?=\\)$)")
Details
.+,\\s*([A-Z]{2})\\)$ - matches any 1+ chars as many as possible, then ,, 0+ whitespaces, and then captures 2 uppercase ASCII letters into Group 1 (referred to with \1 from the replacement pattern) and then matches ) at the end of string
[A-Z]{2}(?=\)$) - matches 2 uppercase ASCII letters if followed with the ) at the end of the string.
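The lookahead version does not strictly need stringr. A base R sketch of the same extraction, using regexpr with perl = TRUE and regmatches (the second record is a made-up example in the same format):

```r
mock <- c("Joe Smith (Cleveland, OH)", "Ann Lee (Louisville, KY)")

# two uppercase letters, only if followed by ")" at the end of the string
state <- regmatches(mock, regexpr("[A-Z]{2}(?=\\)$)", mock, perl = TRUE))
state
# [1] "OH" "KY"
```

One difference to keep in mind: regmatches() silently drops elements with no match, whereas str_extract() returns NA for them, so the result can be shorter than the input if some records are malformed.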
How about this. If they are all formatted the same, then this should work.
mock <- "Joe Smith (Cleveland, OH)"
substr(mock, (nchar(mock) - 2), (nchar(mock) - 1))
If the general case is that the state occupies the third-to-last and second-to-last characters, then match everything with .*, capture two characters with (..), match one more character with ., and replace the whole match with the capture group:
sub(".*(..).", "\\1", mock)
## [1] "OH"

R sub replacing part of identified string

Hi, I have a large dataframe of addresses which I need to clean. One of the problems is an unwanted whitespace between a number and its letter suffix, which I wish to remove as follows:
original <- c("73 A Acacia Avenue","656 B East Street", " FLAT 1 D High Road", "66B West Street")
corrected <- c("73A Acacia Avenue","656B East Street", " FLAT 1D High Road", "66B West Street")
I can identify and isolate what I wish to change using grep and regexpr, but am not sure how to remove the offending space and replace the correction in the original dataframe
reg <- "([0-9]+ [A-Z] )"
match <- grep(reg, original, value = TRUE, perl = TRUE) # finds matching strings
grep(reg, original, perl = TRUE) # finds matching rows
r <- regexpr(reg, match, perl = TRUE) # finds position of the match
findstr <- regmatches(match, r) # shows the relevant substring
So my final stage is to remove the whitespace and apply the correction.
Any help appreciated
Thank you
You may use gsub with a slightly modified version of your regex and a \1\2 replacement:
original <- c("73 A Acacia Avenue","656 B East Street", " FLAT 1 D High Road", "66B West Street")
reg <- "([0-9]+)\\s([A-Z]\\s+)"
gsub(reg, "\\1\\2", original)
## => [1] "73A Acacia Avenue"  "656B East Street"   " FLAT 1D High Road"
##    [4] "66B West Street"
Details:
([0-9]+) - Group 1 matching one or more digits
\\s - a whitespace
([A-Z]\\s+) - Group 2 matching an uppercase ASCII letter and then 1 or more whitespaces.
The replacement is \1\2 where \1 is the value of the first group and \2 references the value in the second group.
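Since the question was about correcting the values inside a dataframe, here is a minimal sketch of applying the same gsub to a column in place. The dataframe name addresses and the column name address are assumptions for the example; substitute your own.

```r
# hypothetical dataframe holding the addresses in a character column
addresses <- data.frame(
  address = c("73 A Acacia Avenue", "656 B East Street",
              " FLAT 1 D High Road", "66B West Street"),
  stringsAsFactors = FALSE
)

reg <- "([0-9]+)\\s([A-Z]\\s+)"

# overwrite the column with the corrected strings
addresses$address <- gsub(reg, "\\1\\2", addresses$address)
addresses$address
# [1] "73A Acacia Avenue"  "656B East Street"   " FLAT 1D High Road"
# [4] "66B West Street"
```

Rows that do not match, such as "66B West Street", are returned unchanged, so there is no need to locate the matching rows first.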

Resources