R text mining - remove special characters and quotes

I'm doing a text mining task in R.
Tasks:
1) count sentences
2) identify and save quotes in a vector
Problems:
False full stops like "..." and periods in titles like "Mr." have to be dealt with.
There are definitely quotes in the text body, and they will contain "...". I was thinking of extracting those quotes from the main body and saving them in a vector (there is some manipulation to be done with them too).
IMPORTANT TO NOTE: My text data is in a Word document. I use readtext("path to .docx file") to load it into R. When I view the text, quotes are shown as just " and not \", unlike in the reproducible text below.
library(readtext)
path <- "C:/Users/.../"
a <- readtext(paste(path, "Text.docx", sep = ""))
title <- a$doc_id
text <- a$text
Reproducible text:
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
# splitting by "."
unlist(strsplit(text, "\\."))
The problem is it's splitting by false full-stops
Solution I tried:
# getting rid of . in titles
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)
The problem with this is that it's not replacing [Miss. by [Miss
To identify quotes:
library(stringi)
stri_extract_all_regex(text, '"\\S+"')
but that's not working either. (It does work with \" in the code below.)
stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
The exact expected vector is:
sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")
I want the sentences separated (so I can count how many sentences there are in each paragraph), and the quotes separated as well.
quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""

You may match all your current vec values using
gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
That is, \w+ matches 1 or more word chars and \. matches a dot.
Next, if you just want to extract quotes, use
regmatches(text, gregexpr('"[^"]*"', text))
The " matches a " and [^"]* matches 0 or more chars other than ".
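On the \" versus " point from the question: the backslash is only how R prints a double quote inside a string, while the stored character is a plain ". The short sketch below (the string x is made up for illustration) shows that, and also why '"\\S+"' likely finds nothing once a quote contains spaces.

```r
x <- "some text \"a longer quote\" more text"  # hypothetical sample string

# print() shows the escaped form; cat() shows the characters actually stored
print(x)
cat(x, "\n")

# "\\S+" cannot cross whitespace, so a multi-word quote is missed
no_match <- regmatches(x, gregexpr('"\\S+"', x))[[1]]

# [^"]* accepts any character except a double quote, spaces included
quoted <- regmatches(x, gregexpr('"[^"]*"', x))[[1]]
```

Here no_match is empty while quoted holds the whole quoted span, including its surrounding double quotes.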
If you plan to match your sentences together with quotes, you might consider
regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
Details
\\s* - 0+ whitespaces
"[^"]*" - a ", 0+ chars other than " and a "
| - or
[^"?!.]+ - 1+ chars other than ?, ", ! and .
[[:space:]?!.]+ - 1 or more whitespace, ?, ! or . chars
[^"[:alnum:]]* - 0+ chars other than alphanumeric chars and "
R sample code:
> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "
[2] "Keyboard Jr and Miss Keyboard. ... \n"
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
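For the counting task from the question, the same steps can be sketched in base R only. This is a hedged variant, not the code above: it swaps gsubfn for a plain gsub with a capture group, and the names titles, title_pat, clean and n_sentences are made up for illustration.

```r
text <- paste0(
  "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...\n",
  "However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]\n",
  "\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
)

# Drop the dot after known titles only; other dots (e.g. "o.k.") are left alone
titles <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
title_pat <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")
clean <- trimws(gsub(title_pat, "\\1", text, perl = TRUE))

# Same sentence/quote pattern as in the answer, matched against one variable
sentence_pat <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'
sentences <- regmatches(clean, gregexpr(sentence_pat, clean, perl = TRUE))[[1]]
quotes <- regmatches(clean, gregexpr('"[^"]*"', clean))[[1]]

n_sentences <- length(sentences)  # number of pieces found in this text
```

With one string per paragraph, lengths() on the full regmatches() result would give a count per paragraph.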

Related

Remove whitespace after a symbol (hyphen) in R

I'm trying to remove the hyphen that splits a word within a string, for example in "for exam- ple this".
a <- "for exam- ple this"
How could I join the two parts?
I have tried to remove the hyphen using this command:
str_replace_all(a, "-", "")
But I got this back:
"for exam ple this"
It does not return the word joined. I have also tried this:
str_replace_all(a, "- ", "")
but I get nothing.
Therefore I thought of first removing the white space after the hyphen to get the following
"for exam-ple this"
and then eliminating the hyphen.
Can you help me?

Here is an option with sub where we match the - followed by zero or more spaces (\\s*) and replace with -
sub("-\\s*", "-", a)
#[1] "for exam-ple this"
If the string contains several such hyphens and all of them should be handled, use gsub instead of sub
gsub("-\\s*", "-", a)

str_replace_all(a, "- ", "-")

If you are just trying to remove the whitespace after a symbol then Ricardo's answer is sufficient. If you want to remove an unknown amount of whitespace after a hyphen consider
str_replace_all(a, "- +", "-")
#[1] "for exam-ple this"
b <- "for exam-    ple this"
str_replace_all(b, "- +", "-")
#[1] "for exam-ple this"
EDIT --- Explanation
The "+" is part of regular expression syntax and tells R how to match a string. Specifically, "+" means to match the preceding character (or group/set) 1 or more times. R's ?regex help page covers regular expressions in more detail.
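Since the end goal in the question is the joined word, the two steps can probably be collapsed into one replacement. A minimal base R sketch, assuming every hyphen followed by whitespace marks a broken word (a real dash followed by a space would be removed too):

```r
a <- "for exam- ple this"

# Remove the hyphen together with all whitespace that follows it
joined <- gsub("-\\s+", "", a)
print(joined)
```

The stringr equivalent would be str_replace_all(a, "-\\s+", "").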

Remove a list of whole words that may contain special chars from a character vector without matching parts of words

I have a list of words in R as shown below:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
And I want to remove the words which are found in the above list from the text as below:
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
After removing the unwanted myList words, the myText should look like:
This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.
I was using :
stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")
But this is not helping me. What should I do?

You may use a PCRE regex with a gsub base R function (it will also work with ICU regex in str_replace_all):
\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)
Details
\s* - 0 or more whitespaces
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items inside the character vector with the words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
NOTE: We cannot use \b word boundary here because the items in the myList character vector may start/end with non-word characters while \b meaning is context-dependent.
R demo:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
Details
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated alternative list from the search term vector.
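The approach above can be wrapped in a small reusable function. This is only a sketch: escape_regex and remove_words are hypothetical helper names, and the escaping uses a PCRE character class written slightly differently from escape_for_pcre. The sample sentence is made up to show that "at" inside "later" and "ax" inside "taxes" are left alone.

```r
# Escape characters that are special in a PCRE pattern (hypothetical helper)
escape_regex <- function(s) {
  gsub("([.\\\\|()\\[\\]{}^$*+?-])", "\\\\\\1", s, perl = TRUE)
}

# Remove whole "words" only, even if they start or end with non-word chars
remove_words <- function(x, words) {
  alternatives <- paste(escape_regex(words), collapse = "|")
  pat <- paste0("\\s*(?<!\\w)(?:", alternatives, ")(?!\\w)")
  gsub(pat, "", x, perl = TRUE)
}

cleaned <- remove_words(
  "This is at Sample ax Text, later taxes are -1.00.",
  c("at", "ax", "-1.00")
)
print(cleaned)
```

A plain alternation without the lookarounds would also cut "at" out of "later", which is the situation the lookbehind and lookahead guard against.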

gsub(paste0(myList, collapse = "|"), "", myText)
gives:
[1] "This is Sample Text, which is better and cleaned , where is not equal to . This is messy text ."

Removing parentheses, text preceding comma, and the comma in a string using string

I have a string that contains a person's name and city. It's formatted like this:
mock <- "Joe Smith (Cleveland, OH)"
I simply want the state abbreviation to remain, so in this case the only remaining string would be "OH".
I can get rid of the parentheses and comma with
[(.*?),]
Which gives me:
"Joe Smith Cleveland OH"
But I can't figure out how to combine all of it. For the record, all of the records will look like that, where it ends with ", two letter capital state abbreviation" (ex: ", OH", ", KY", ", MD" etc...)

You may use
mock <- "Joe Smith (Cleveland, OH)"
sub(".+,\\s*([A-Z]{2})\\)$","\\1",mock)
## => [1] "OH"
## With stringr:
library(stringr)
str_extract(mock, "[A-Z]{2}(?=\\)$)")
Details
.+,\\s*([A-Z]{2})\\)$ - matches any 1+ chars as many as possible, then ,, 0+ whitespaces, and then captures 2 uppercase ASCII letters into Group 1 (referred to with \1 from the replacement pattern) and then matches ) at the end of string
[A-Z]{2}(?=\)$) - matches 2 uppercase ASCII letters if followed with the ) at the end of the string.
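Because sub() returns its input unchanged when the pattern does not match, a guard may be useful when running this over many records. A minimal sketch with made-up records (the second and third entries are hypothetical), returning NA where no state is found:

```r
records <- c("Joe Smith (Cleveland, OH)",
             "Ann Lee (Louisville, KY)",
             "no state here")

pat <- ".+,\\s*([A-Z]{2})\\)$"

# Extract the capture group only where the pattern matches; NA otherwise
states <- ifelse(grepl(pat, records), sub(pat, "\\1", records), NA_character_)
print(states)
```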

How about this. If they are all formatted the same, then this should work.
mock <- "Joe Smith (Cleveland, OH)"
substr(mock, (nchar(mock) - 2), (nchar(mock) - 1))

If the general case is that the state is in the second and third last characters then match everything, .*, and then a capture group of two characters (..) and then another character . and replace that with the capture group:
sub(".*(..).", "\\1", mock)
## [1] "OH"

separating last sentence from a string in R

I have a vector of strings and I want to separate the last sentence from each string in R.
Sentences may end with full stops (.) or even exclamation marks (!), so I am unsure how to separate the last sentence from a string in R.

You can use strsplit to get the last sentence from each string, as shown:
## paragraph <- "Your vector here"
result <- strsplit(paragraph, "\\.|\\!|\\?")
last.sentences <- sapply(result, function(x) {
  trimws(x[length(x)])
})

Provided that your input is clean enough (in particular, that there are spaces between the sentences), you can use:
sub(".*(\\.|\\?|\\!) ", "", trimws(yourvector))
It finds the longest substring ending with a punctuation mark and a space and removes it.
I added trimws just in case there are trailing spaces in some of your strings.
Example:
u <- c("This is a sentence. And another sentence!",
"By default R regexes are greedy. So only the last sentence is kept. You see ? ",
"Single sentences are not a problem.",
"What if there are no spaces between sentences?It won't work.",
"You know what? Multiple marks don't break my solution!!",
"But if they are separated by spaces, they do ! ! !")
sub(".*(\\.|\\?|\\!) ", "", trimws(u))
# [1] "And another sentence!"
# [2] "You see ?"
# [3] "Single sentences are not a problem."
# [4] "What if there are no spaces between sentences?It won't work."
# [5] "Multiple marks don't break my solution!!"
# [6] "!"

This regex anchors to the end of the string with $ and allows an optional '.' or '!' at the end. At the front, the positive lookbehind (?<=...) requires ". " or "! " (the end of the prior sentence) immediately before the match without including it in the match; the ^ alternative provides for a single sentence. The body (?:(?![.!]\\s).)+ only consumes characters that do not start a ". " or "! " sequence, so the match cannot reach back across a sentence boundary.
s <- "Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R."
library(stringr)
str_extract(s, "(?<=(\\.\\s|\\!\\s|^))(?:(?![.!]\\s).)+(\\.|\\!)?$")
yields
# [1] "Hence i am confused as to how to separate the last sentence from a string in R."
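A base R variant that keeps the closing punctuation is to split on whitespace that follows a sentence mark and take the last piece. This is a sketch under the same assumption as above (sentences are separated by whitespace); last_sentence is a hypothetical helper name and the sample vector is shortened from the earlier example.

```r
last_sentence <- function(x) {
  # Split on whitespace that is preceded by ., ! or ? (punctuation is kept)
  parts <- strsplit(trimws(x), "(?<=[.!?])\\s+", perl = TRUE)
  vapply(parts,
         function(p) if (length(p) == 0) "" else p[length(p)],
         character(1))
}

u <- c("This is a sentence. And another sentence!",
       "Single sentences are not a problem.",
       "You know what? Multiple marks don't break my solution!!")
last <- last_sentence(u)
print(last)
```

As with the other answers, abbreviations such as "Mr. " would be treated as sentence ends.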

Problems in a regular expression to extract names using stringr

I cannot fully understand why my regular expression does not work to extract the info I want. I have an unlisted vector that looks like this:
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
"Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
I would like to create a regular expression to extract only the name of the "Town", even if the town has a long name as the one written in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:
reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
towns <- unlist(str_extract_all(text, reg.town))
but I extract everything around the ",".
Thanks in advance,

It looks like a town name starts with a capital letter ([[:upper:]]), ends with a comma (or continues to the end of text if there is no comma) ([^,]+) and should be at the start of the input text (^). The corresponding regex in this case would be:
^[[:upper:]][^,]+
Demo: https://regex101.com/r/QXYtyv/1

I have solved the problem thanks to @Dmitry Egorov's demo posted in the comments. The regular expression is this one: ([[:upper:]].+?, [[:digit:]])
Thanks for your quick replies!!

You may use the following regex:
> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator" "Town"
[3] "A Town with a Long Name"
The regex matches:
\\b - a leading word boundary
\\p{Lu} - an uppercase letter
[^,]++ - 1+ chars other than a , (possessively, due to ++ quantifier, with no backtracking into this pattern for a more efficient matching)
(?=, \\d) - a positive lookahead that requires a ,, then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
Note you may get the same results with base R using the same regex with a PCRE option enabled:
> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator" "Town"
[3] "A Town with a Long Name"
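If "Senator" should be excluded as well, one possible tightening is to require that the whole element has the form name, comma, space, digits. That requirement is an assumption about the data, not something stated in the question. A base R sketch (pat and keep are illustrative names):

```r
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
          "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")

# Name starts with an uppercase letter; after ", " only digits may follow
pat <- "^([[:upper:]][^,]*), \\d+$"

keep <- grepl(pat, text)
towns <- sub(pat, "\\1", text[keep])
print(towns)
```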
