I am trying to remove URLs that may or may not start with http/https from a large text file, which I saved in urldoc in R. A URL may look like tinyurl.com/ydyzzlkk or aclj.us/2y6dQKw or pic.twitter.com/ZH08wej40K. Basically, I want to remove everything between a space and a '/', and everything after the '/' up to the next space. I tried many patterns and searched in many places, but couldn't complete the task. It would help me a lot if you could give some input.
This is the last statement I tried and got stuck for the above problem.
urldoc = gsub("?[a-z]+\..\/.[\s]$","", urldoc)
Input would be: A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk
Output I am expecting is: A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:
Thanks.
According to your specs, you may use the following regex:
\s*[^ /]+/[^ /]+
Details
\s* - 0 or more whitespace chars
[^ /]+ (or [^[:space:]/]+) - any 1 or more chars other than a space (or any whitespace) and /
/ - a slash
[^ /]+ (or [^[:space:]/]+) - any 1 or more chars other than a space (or any whitespace) and /.
R demo:
urldoc = gsub("\\s*[^ /]+/[^ /]+","", urldoc)
If you want to account for any whitespace, replace the literal space with [:space:]:
urldoc = gsub("\\s*[^[:space:]/]+/[^[:space:]/]+","", urldoc)
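For reference, running the first pattern on the sample input from the question reproduces the expected output:
urldoc <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
gsub("\\s*[^ /]+/[^ /]+", "", urldoc)
# [1] "A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:"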
This has already been answered, but here is an alternative if you've not come across stringi before.
# most complete package for string manipulation
library(stringi)
# text and regex
text <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
pattern <- "(?:\\s)[^\\s\\.]*\\.[^\\s]+"
# see what is captured
stringi::stri_extract_all_regex(text, pattern)
# remove (replace with "")
stringi::stri_replace_all_regex(text, pattern, "")
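For reference, the extraction call above should return just the three URL-like tokens, each with its leading space, so the replacement removes them without leaving double spaces:
# [[1]]
# [1] " pic.twitter.com/ZH08wej40K" " goo.gl/YmNELW" " tinyurl.com/ydyzzlkk"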
This might work:
text <- " http:/thisisanurl.wde , thisaint , nope , uihfs/yay"
words <- strsplit(text, " ")[[1]]
isurl <- sapply(words, function(x) grepl("/",x))
result <- paste0(words[!isurl], collapse = " ")
result
[1] " , thisaint , nope ,"
Related (I already saw this one, but it is not quite what I need):
regex multiple pattern with singular replacement
Situation: Using gsub, I want to clean up strings. These are my conditions:
Keep words only (no digits nor "weird" symbols)
Keep words joined by just one of ' - _ $ as a single word. For example: don't, re-loading, come_home, something$col
Keep specific names, such as package::function or package::function()
So, I have the following:
[^A-Za-z]
([a-z]+)(-|'|_|$)([a-z]+)
([a-z]+(_*)[a-z]+)(::)([a-z]+(_*)[a-z]+)(\(\))*
Examples:
If I have the following:
# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't
# Needs to handle NA for desc::desc_get()
# Update href of toc anchors , use "-" instead "."
# Keep something$col or here_you::must_stay
I would like to have
Re-loading pkgdown while it's running causes weird behaviour with the context cache don't
Needs to handle NA for desc::desc_get()
Update href of toc anchors use instead
Keep something$col or here_you::must_stay
Problems: I have several:
A. The second expression is not working properly. Right now, it only works with - or '
B. How do I combine all of these in a single gsub in R? I want to do something like gsub(myPatterns, myText), but don't know how to fix and combine all of this.
You can use
trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))
Or, to also replace multiple whitespaces with a single space, use
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
Details
(?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F): match either of the two patterns:
\w+::\w+(?:\(\))? - 1+ word chars, ::, 1+ word chars and an optional () substring
| - or
\p{L}+ - one or more Unicode letters
(?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ and then 1+ Unicode letters
(*SKIP)(*F) - discards the match and prevents backtracking into it, so the matched text is skipped and left in the string untouched
| - or
[^\p{L}\s] - any char but a Unicode letter and whitespace
See the R demo:
myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
Output:
[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"
[3] "Update href of toc anchors use instead"
[4] "Keep something$col or here_you::must_stay"
Alternatively,
txt <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
"# Update href of toc anchors , use \"-\" instead \".\"",
"# Keep something$col or here_you::must_stay")
expect <- c("Re-loading pkgdown while it's running causes weird behaviour with the context cache don't",
"Needs to handle NA for desc::desc_get()",
"Update href of toc anchors use instead",
"Keep something$col or here_you::must_stay")
leadspace <- grepl("^ ", txt)  # remember which strings began with a space to start with
gre <- gregexpr("\\b(\\s?[[:alpha:]]*(::|[-'_$.])?[[:alpha:]]*(\\(\\))?)\\b", txt)  # locate the tokens to keep
regmatches(txt, gre, invert = TRUE) <- ""  # delete everything that is not a kept token
txt[!leadspace] <- gsub("^ ", "", txt[!leadspace])  # drop the leading space the deletion leaves behind
identical(expect, txt)
# [1] TRUE
I have nearly 100,000 rows of scraped data that I have converted to data frames. One column is a string of text characters but is behaving strangely. In the example below, there is text that has bracketed information I want to remove, and I also want to remove " (c)". However, the space in front is not technically a space (is it considered whitespace?).
I am not sure how to reproduce the example here, because when I copy/paste a record it is treated like normal and works, but in the scraped data it does not. As a gut check I counted the spaces and got 4, which means the space in front of ( is not a true space. I do not know how to remove it!
My code that I usually would run is as follows. Again, works this way, but does not work in my scraped data.
test<-c("Barry Windham (c) & Mike Rotundo (c)")
test<-gsub("[ ][(]c[)]","",test)
You can consider using:
test<-c("Barry Windham (c) & Mike Rotundo (c)")
gsub("(*UCP)\\s+\\(c\\)", "", test, perl=TRUE)
# => [1] "Barry Windham & Mike Rotundo"
Details
(*UCP) - makes all shorthand character classes in the PCRE regex (it is PCRE due to perl=TRUE) Unicode aware
\\s+ - any one or more Unicode whitespaces
\\(c\\) - (c) substring.
If you need to keep (c), capture it and use a backreference in the replacement:
gsub("(*UCP)\\s+(\\(c\\))", "\\1", test, perl=TRUE)
I have a string that downloaded from the web:
x = "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."
They parsed the string so that contracted words like (can't) are separated into two parts (ca n't) and punctuation is separated from words (eye level . As her).
I want to make the string back to normal, for example:
x = "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."
How do I trim the space before the punctuation?
I have thought about using str_remove_all with a regex:
str_remove_all(x,"\\s[[:punct:]]'")
but it will also remove the punctuation.
Any ideas?
With a backreference:
x <- "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."
gsub("(\\s+)([[:punct:]])", "\\2", x, perl = TRUE)
# [1] "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."
You may use
str_remove_all(x,"\\s+(?=[[:punct:]])")
str_remove_all(x,"\\s+(?=[\\p{S}\\p{P}])")
Or base R:
gsub("\\s+(?=[\\p{S}\\p{P}])", "", x, perl=TRUE)
Details
\s+ - 1 or more whitespace chars
(?=[[:punct:]]) - a positive lookahead that matches a location that is immediately followed with a punctuation character.
Please check "R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?" before choosing the variant with [[:punct:]].
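On the sample sentence, the lookahead variants and the capturing-group approach above give the same result:
library(stringr)
x <- "the company 's newly launched cryptocurrency , Libra , hasn 't been contacted by Facebook , according to a report ."
str_remove_all(x, "\\s+(?=[[:punct:]])")
# [1] "the company's newly launched cryptocurrency, Libra, hasn't been contacted by Facebook, according to a report."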
I have a dataset with news articles scraped from the web.
For every article, I would like to write a code that individuates the source, so that I can add it to the dataframe in a separate column.
The problem is that I cannot write a command that works; I have tried to use grep, but I think I am not writing the correct regex.
Example:
title content
Art 1 This is article one. Source: The Guardian.
Art 2 This is article two. Source: New York Times.
Art 3 This is article three. Source: The Washington Post.
Expected result:
title source
Art 1 The Guardian
Art 2 New York Times
Art 3 Washington Post
Here is what I tried (the pattern always consists of the word Source, followed by a colon, then one to three words, and finishes with a full stop):
source <- grep("(Source:)([:alpha:]{*})(.))", df, perl = TRUE)
Here is the error message I get:
Error in grep("(Source:)([:alpha:]{*})(.))", df, perl = TRUE) :
invalid regular expression '(Source:)([:alpha:]{*})(.))'
In addition: Warning message:
In grep("(Source:)([:alpha:]{*})(.))", df, perl = TRUE) :
PCRE pattern compilation error
'POSIX named classes are supported only within a class'
at '[:alpha:]{*})(.))'
I have only limited experience with regex and I cannot find anywhere how to accomplish what I have in mind.
Use str_extract and a positive lookbehind ("match only if you see ... on the left"):
content <- "This is article one. Source: The Guardian."
library(stringr)
source <- str_extract(content, "(?<=Source: )[^.]*")
[1] "The Guardian"
Alternatively, use sub and backreference:
source <- sub(".*Source: (.*)\\.$", "\\1", content)
[1] "The Guardian"
You seem to want to extract substrings from character vectors. grep can only tell you which whole elements of a character vector match, so you can't use grep here.
You may use regmatches with regexpr to get the substrings. Assuming you have
content <- "Art 1 This is article one. Source: The Guardian."
df <- data.frame(content)
you may extract the source column using
df$source <- regmatches(df$content, regexpr("Source:\\s*\\K.+\\b", df$content, perl=TRUE))
Regex details
Source: - matches a literal text
\s* - 0+ whitespaces
\K - a match reset operator
.+ - any 1 or more characters other than line break characters, as many as possible, up to the last...
\b - word boundary (this will "truncate" the trailing punctuation from the match).
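For instance, with all three rows from the question (a sketch; the title column is omitted):
df <- data.frame(content = c("This is article one. Source: The Guardian.",
                             "This is article two. Source: New York Times.",
                             "This is article three. Source: The Washington Post."),
                 stringsAsFactors = FALSE)
df$source <- regmatches(df$content, regexpr("Source:\\s*\\K.+\\b", df$content, perl=TRUE))
df$source
# [1] "The Guardian"        "New York Times"      "The Washington Post"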
I have text, an example of which is as follows
Input
c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
The expected output is
,At the end of the study everything was great\n,Some other sentence\nThe test ended.
,Not sure how to get this regex sorted\n\nHow do I do this
I tried:
x[, y] <- gsub(".*[Bb]ut .*?(\\.|\n|:)", "", x[, y])
but it eradicated the whole sentence. How do I remove the phrase with 'but' in it and keep the rest of the phrases in each sentence?
You may use
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.", ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)
The PCRE pattern matches:
.* - any 0 or more chars other than line break chars, as many as possible
\\bbut\\b - a whole word but (\b are word boundaries)
.* - any 0 or more chars other than line break chars, as many as possible
[\r\n]* - 0 or more line break chars.
Note that the first gsub has a perl=TRUE argument that makes R use the PCRE regex engine to parse the pattern, and . does not match a line break char there. The second gsub uses a TRE (default) regex engine, and one needs to use (?n) inline modifier to make . fail to match line break chars there.
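For reference, on the sample input both calls should return
# [1] ",At the end of the study everything was great\nSome other sentence\n The test ended."
# [2] ",Not sure how to get this regex sorted\nHow do I do this"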
Note that you mixed up "\n" and "/n", which I did correct.
My idea for a solution:
1) Simply catch all chars which are not a line break ([^\n]) before and after the "but".
2) (Edit) To address the issue Wiktor found, we also have to check that a non-letter char ([^a-zA-Z]) is directly before and after the "but".
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",
",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
> gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", x)
[1] ",At the end of the study everything was great\n\nSome other sentence\n The test ended."
[2] ",Not sure how to get this regex sorted\n\nHow do I do this"