R regex: trimming whitespace before punctuation

I have a string that I downloaded from the web:
x = "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."
The source parsed the string such that "contracted words like (can't) are separated into two parts (ca n't) and punctuation is separated from words (eye level . As her)."
I want to make the string back to normal, for example:
x = "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."
How do I trim the space before the punctuation?
I have thought about using str_remove_all with a regex:
str_remove_all(x,"\\s[[:punct:]]'")
but it will also remove the punctuation.
Any ideas?

With back referencing:
x <- "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."
gsub("(\\s+)([[:punct:]])", "\\2", x, perl = TRUE)
# [1] "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."

You may use
str_remove_all(x,"\\s+(?=[[:punct:]])")
str_remove_all(x,"\\s+(?=[\\p{S}\\p{P}])")
Or base R:
gsub("\\s+(?=[\\p{S}\\p{P}])", "", x, perl=TRUE)
Details
\s+ - 1 or more whitespace chars
(?=[[:punct:]]) - a positive lookahead that matches a location that is immediately followed with a punctuation character.
Please check "R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?" before choosing the variant with [[:punct:]].
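A quick way to see the difference that question describes. The base R line below runs as written; the stringr line is commented out and reflects my understanding of ICU semantics (where [:punct:] covers only \p{P}, so a symbol like '+' is not matched), so treat it as an assumption to verify:

```r
x <- "a + b , c"

# Base R PCRE: [[:punct:]] includes symbols such as '+'
gsub("\\s+(?=[[:punct:]])", "", x, perl = TRUE)
# [1] "a+ b, c"

# stringr/ICU: '+' is \p{S}, not [:punct:], so the space before it survives
# stringr::str_remove_all(x, "\\s+(?=[[:punct:]])")   # "a + b, c"
```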

Related

How to find two words patterns knowing only the first one in r

I have a dataset with news articles scraped from the web.
For every article, I would like to write a code that individuates the source, so that I can add it to the dataframe in a separate column.
The problem is that I cannot write a command that works; I have tried grep, but I think I am not writing the correct regex.
Example:
title content
Art 1 This is article one. Source: The Guardian.
Art 2 This is article two. Source: New York Times.
Art 3 This is article three. Source: The Washington Post.
Expected result:
title source
Art 1 The Guardian
Art 2 New York Times
Art 3 Washington Post
Here is what I tried (the pattern is always constituted by the word Source followed by : followed by one to three words and finishes with a full-stop):
source <- grep("(Source:)([:alpha:]{*})(.))", df, perl = TRUE)
Here is the error message I get:
Error in grep("(Source:)([:alpha:]{*})(.))", df, perl = TRUE) :
invalid regular expression '(Source:)([:alpha:]{*})(.))'
In addition: Warning message:
In grep("(Source:)([:alpha:]{*})(.))", df, perl = TRUE) :
PCRE pattern compilation error
'POSIX named classes are supported only within a class'
at '[:alpha:]{*})(.))'
I have only limited experience with regex and I cannot find anywhere how to accomplish what I have in mind.
Use str_extract and positive lookbehind ("If you see on the left..."):
content <- "This is article one. Source: The Guardian."
library(stringr)
source <- str_extract(content, "(?<=Source: )[^.]*")
# [1] "The Guardian"
Alternatively, use sub and backreference:
source <- sub(".*Source: (.*)\\.$", "\\1", content)
# [1] "The Guardian"
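Since the question asks for a separate column, note that str_extract is vectorized, so the same call fills the whole column at once. A minimal sketch, with the data frame reconstructed from the question's example:

```r
library(stringr)

# Data frame mirroring the question's example
df <- data.frame(
  title   = c("Art 1", "Art 2", "Art 3"),
  content = c("This is article one. Source: The Guardian.",
              "This is article two. Source: New York Times.",
              "This is article three. Source: The Washington Post."),
  stringsAsFactors = FALSE
)

# One vectorized call populates the new column
df$source <- str_extract(df$content, "(?<=Source: )[^.]*")
df$source
# [1] "The Guardian"        "New York Times"      "The Washington Post"
```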
You seem to want to extract substrings from character vectors. grep only tells you which whole elements match (returning indices, or the full elements with value = TRUE), not the matched substrings, so grep alone won't work here.
You may use regmatches with regexpr to get the substrings. Assuming you have
content <- "Art 1 This is article one. Source: The Guardian."
df <- data.frame(content)
you may extract the source column using
df$source <- regmatches(df$content, regexpr("Source:\\s*\\K.+\\b", df$content, perl=TRUE))
Regex details
Source: - matches a literal text
\s* - 0+ whitespaces
\K - a match reset operator
.+ - 1 or more characters other than line break characters, as many as possible, up to the last...
\b - word boundary (this will "truncate" the trailing punctuation from the match).

Removing square brackets and their contents

Suppose I have some text like this,
text<-c("[McCain]: We need tax policies that respect the wage earners and job creators. [Obama]: It's harder to save. It's harder to retire. [McCain]: The biggest problem with American healthcare system is that it costs too much. [Obama]: We will have a healthcare system, not a disease-care system. We have the chance to solve problems that we've been talking about... [Text on screen]: Senators McCain and Obama are talking about your healthcare and financial security. We need more than talk. [Obama]: ...year after year after year after year. [Announcer]: Call and make sure their talk turns into real solutions. AARP is responsible for the content of this advertising.")
and I would like to remove all of the text between the [ and ] (and the brackets themselves). What's the best way to do this? Here is my feeble attempt using regex and the stringr package:
str_extract(text, "\\[[a-z]*\\]")
Thanks for any help!
With this:
gsub("\\[[^\\]]*\\]", "", text, perl = TRUE)
What the regex means:
\[      # a literal '['
[^\]]*  # any character except ']' (0 or more times, greedy)
\]      # a literal ']'
The following should do the trick. The ? forces a lazy match, which matches as few . as possible before the subsequent ].
gsub('\\[.*?\\]', '', text)
Here'a another approach:
library(qdap)
bracketX(text, "square")
I think this technically answers what you've asked, but you probably want to add a \\: to the end of the regex for prettier text (removing the colon and space).
library(stringr)
str_replace_all(text, "\\[.+?\\]", "")
#> [1] ": We need tax policies that respect the wage earners..."
vs...
str_replace_all(text, "\\[.+?\\]\\: ", "")
#> [1] "We need tax policies that respect the wage earners..."
Created on 2018-08-16 by the reprex package (v0.2.0).
No need to use a PCRE regex with a negated character class / bracket expression, a "classic" TRE regex will work, too:
subject <- "Some [string] here and [there]"
gsub("\\[[^][]*]", "", subject)
## => [1] "Some here and "
Details:
\\[ - a literal [ (must be escaped or used inside a bracket expression like [[] to be parsed as a literal [)
[^][]* - a negated bracket expression that matches 0+ chars other than [ and ] (note that the ] at the start of the bracket expression is treated as a literal ])
] - a literal ] (this character is not special in either PCRE or TRE regexps and does not have to be escaped).
If you want to only replace the square brackets with some other delimiters, use a capturing group with a backreference in the replacement pattern:
gsub("\\[([^][]*)\\]", "{\\1}", subject)
## => [1] "Some {string} here and {there}"
The (...) parenthetical construct forms a capturing group, and its contents can be accessed with a backreference \1 (as the group is the first one in the pattern, its ID is set to 1).

How to remove urls without http in a text document using r

I am trying to remove URLs that may or may not start with http/https from a large text file, which I saved in urldoc in R. The URLs may look like tinyurl.com/ydyzzlkk or aclj.us/2y6dQKw or pic.twitter.com/ZH08wej40K. Basically, I want to remove the data from the preceding space up to the '/', and the data after the '/' up to the next space. I have tried many patterns and searched in many places, but couldn't complete the task. It would help me a lot if you could give some input.
This is the last statement I tried and got stuck for the above problem.
urldoc = gsub("?[a-z]+\..\/.[\s]$","", urldoc)
Input would be: A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk
Output I am expecting is: A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:
Thanks.
According to your specs, you may use the following regex:
\s*[^ /]+/[^ /]+
Details
\s* - 0 or more whitespace chars
[^ /]+ (or [^[:space:]/]+) - 1 or more chars other than a space (or any whitespace) and /
/ - a slash
[^ /]+ (or [^[:space:]/]+) - 1 or more chars other than a space (or any whitespace) and /.
R demo:
urldoc = gsub("\\s*[^ /]+/[^ /]+","", urldoc)
If you want to account for any whitespace, replace the literal space with [:space:]:
urldoc = gsub("\\s*[^[:space:]/]+/[^[:space:]/]+","", urldoc)
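To sanity-check against the question's own input/output pair, here is the same call end to end:

```r
# Input taken verbatim from the question
urldoc <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"

# Each match consumes the leading space plus the token containing a slash
cleaned <- gsub("\\s*[^ /]+/[^ /]+", "", urldoc)
cleaned
# [1] "A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:"
```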
Already answered, but here is an alternative if you've not come across stringi before:
# most complete package for string manipulation
library(stringi)
# text and regex
text <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
pattern <- "(?:\\s)[^\\s\\.]*\\.[^\\s]+"
# see what is captured
stringi::stri_extract_all_regex(text, pattern)
# remove (replace with "")
stringi::stri_replace_all_regex(text, pattern, "")
This might work:
text <- " http:/thisisanurl.wde , thisaint , nope , uihfs/yay"
words <- strsplit(text, " ")[[1]]
isurl <- sapply(words, function(x) grepl("/",x))
result <- paste0(words[!isurl], collapse = " ")
result
[1] " , thisaint , nope ,"

R - Regular expression to match all punctuation except that inside of a URL

Basically, I'm looking for a regular expression to select all punctuation except for that which is inside of a URL.
In essence, if I have the string:
This is a URL: https://test.com/ThisIsAURL !
And remove all matches it should become:
This is a URL https://test.com/ThisIsAURL
gsub("[[:punct:]]", "", x) removes all punctuation including from URLs. I've tried using negative look behinds to select punctuation used after https but this was unsuccessful.
In the situation I need it for, all URLs are Twitter link-style URLs (https://t.co/). They do not end in .com, nor do they have more than one path segment after the slash (/ThisIsAURL). However, ideally, I'd like the regex to be as versatile as possible, able to perform this operation successfully on any URL.
You may match and capture into Group 1 a URL-like pattern like https?://\S* and then match any punctuation and replace with a backreference to Group 1 to restore the URL in the resulting string:
x <- "This is a URL: https://test.com/ThisIsAURL !"
trimws(gsub("(https?://\\S*)|[[:punct:]]+", "\\1", x, ignore.case=TRUE))
## => [1] "This is a URL https://test.com/ThisIsAURL"
The regex is
(https?://\S*)|[[:punct:]]+
Details
(https?://\S*) - Group 1 (referenced to with \1 from the replacement pattern):
https?:// - https:// or http://
\S* - 0+ non-whitespace chars
| - or
[[:punct:]]+ - 1+ punctuation (proper punctuation, symbols and _)
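One caveat worth noting (my own observation, not part of the original answer): when the removed punctuation is flanked by spaces on both sides, a doubled internal space is left behind. A second pass can squeeze those runs; since URLs never contain spaces, this pass cannot damage them:

```r
x <- "Hello , world : https://t.co/abc !"

# First pass: keep URLs via the capture group, drop other punctuation
step1 <- trimws(gsub("(https?://\\S*)|[[:punct:]]+", "\\1", x, ignore.case = TRUE))

# Second pass: collapse runs of spaces left where ' , ' etc. used to be
gsub(" {2,}", " ", step1)
# [1] "Hello world https://t.co/abc"
```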

Negation in R: how can I replace words following a negation?

I'm following up on a question that has been asked here about how to add the prefix "not_" to a word following a negation.
In the comments, MrFlick proposed a solution using a regular expression gsub("(?<=(?:\\bnot|n't) )(\\w+)\\b", "not_\\1", x, perl=T).
I would like to edit this regular expression in order to add the not_ prefix to all the words following "not" or "n't" until there is some punctuation.
If I'm editing cptn's example, I'd like:
x <- "They didn't sell the company, and it went bankrupt"
To be transformed into:
"They didn't not_sell not_the not_company, and it went bankrupt"
Can the use of backreference still do the trick here? If so, any example would be much appreciated. Thanks!
You may use
(?:\bnot|n't|\G(?!\A))\s+\K(\w+)\b
and replace with not_\1.
Details
(?:\bnot|n't|\G(?!\A)) - either of the three alternatives:
\bnot - whole word not
n't - n't
\G(?!\A) - the end of the previous successful match position
\s+ - 1+ whitespaces
\K - match reset operator that discards the text matched so far
(\w+) - Group 1 (referenced to with \1 from the replacement pattern): 1+ word chars (digits, letters or _)
\b - a word boundary.
R demo:
x <- "They didn't sell the company, and it went bankrupt"
gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", x, perl=TRUE)
## => [1] "They didn't not_sell not_the not_company, and it went bankrupt"
First you should split the string on the punctuation you want. For example:
x <- "They didn't sell the company, and it went bankrupt. Then something else"
x_split <- strsplit(x, split = "[,.]")
[[1]]
[1] "They didn't sell the company" " and it went bankrupt" " Then something else"
and then apply the regex to every element of the list x_split. Finally merge all the pieces (if needed).
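A minimal sketch of that split-then-apply idea in base R. tag_clause is a hypothetical helper name of my own; it prefixes every word after the first "not"/"n't" within a single clause, and the clauses are rejoined with their delimiters intact:

```r
# Prefix every word after the first "not"/"n't" in one clause
tag_clause <- function(clause) {
  m <- regexpr("\\bnot\\b|n't\\b", clause)
  if (m == -1) return(clause)                 # no negation: leave untouched
  cut  <- m + attr(m, "match.length") - 1
  head <- substr(clause, 1, cut)
  rest <- substr(clause, cut + 1, nchar(clause))
  paste0(head, gsub("\\b(\\w+)\\b", "not_\\1", rest))
}

x <- "They didn't sell the company, and it went bankrupt. Then something else"

# Split on , and . but keep the delimiters (zero-width lookbehind split)
pieces <- strsplit(x, "(?<=[,.])", perl = TRUE)[[1]]
paste(vapply(pieces, tag_clause, character(1)), collapse = "")
# [1] "They didn't not_sell not_the not_company, and it went bankrupt. Then something else"
```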
This is not ideal, but gets the job done:
x <- "They didn't sell the company, and it did not go bankrupt. That's it"
gsub("((^|[[:punct:]]).*?(not|n't)|[[:punct:]].*?((?<=\\s)[[:punct:]]|$))(*SKIP)(*FAIL)|\\s",
" not_", x,
perl = TRUE)
# [1] "They didn't not_sell not_the not_company, and it did not not_go not_bankrupt. That's it"
Notes:
This uses the (*SKIP)(*FAIL) trick to avoid any pattern you don't want the regex to match. It basically replaces every space with " not_" except for those spaces that fall between:
Start of string or punctuation and "not" or "n't" or
Punctuation and punctuation (not followed by a space), or the end of the string.
