Clean string using gsub and multiple conditions - r

I already saw this one, but it is not quite what I need:
regex multiple pattern with singular replacement
Situation: Using gsub, I want to clean up strings. These are my conditions:
Keep words only (no digits or "weird" symbols)
Keep words joined by exactly one of ' - _ $ . as a single word. For example: don't, re-loading, come_home, something$col
Keep qualified names such as package::function or package::function()
So, I have the following:
[^A-Za-z]
([a-z]+)(-|'|_|$)([a-z]+)
([a-z]+(_*)[a-z]+)(::)([a-z]+(_*)[a-z]+)(\(\))*
Examples:
If I have the following:
# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't
# Needs to handle NA for desc::desc_get()
# Update href of toc anchors , use "-" instead "."
# Keep something$col or here_you::must_stay
I would like to have
Re-loading pkgdown while it's running causes weird behaviour with the context cache don't
Needs to handle NA for desc::desc_get()
Update href of toc anchors use instead
Keep something$col or here_you::must_stay
Problems: I have several:
A. The second expression is not working properly. Right now, it only works with - or '
B. How do I combine all of these in a single gsub in R? I want to do something like gsub(myPatterns, myText), but I don't know how to fix and combine them.

You can use
trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))
Or, to also replace multiple whitespaces with a single space, use
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
Details
(?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F): match either of the two patterns:
\w+::\w+(?:\(\))? - 1+ word chars, ::, 1+ word chars and an optional () substring
| - or
\p{L}+ - one or more Unicode letters
(?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ and then 1+ Unicode letters
(*SKIP)(*F) - discards the match and skips past it, so the matched text is left untouched
| - or
[^\p{L}\s] - any char but a Unicode letter and whitespace
See the R demo:
myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
Output:
[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"
[3] "Update href of toc anchors use instead"
[4] "Keep something$col or here_you::must_stay"

Alternatively,
txt <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
"# Update href of toc anchors , use \"-\" instead \".\"",
"# Keep something$col or here_you::must_stay")
expect <- c("Re-loading pkgdown while it's running causes weird behaviour with the context cache don't",
"Needs to handle NA for desc::desc_get()",
"Update href of toc anchors use instead",
"Keep something$col or here_you::must_stay")
leadspace <- grepl("^ ", txt)
gre <- gregexpr("\\b(\\s?[[:alpha:]]*(::|[-'_$.])?[[:alpha:]]*(\\(\\))?)\\b", txt)
regmatches(txt, gre, invert = TRUE) <- ""
txt[!leadspace] <- gsub("^ ", "", txt[!leadspace])
identical(expect, txt)
# [1] TRUE
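If the regmatches()<- replacement idiom is unfamiliar: with invert = TRUE the assignment overwrites everything between the matches, so whatever the pattern matches is what survives. A minimal sketch on made-up data:
s <- "keep1 -- keep2 ?? keep3"
m <- gregexpr("[[:alnum:]]+", s)        # match the parts we want to keep
regmatches(s, m, invert = TRUE) <- " "  # overwrite the gaps between matches
trimws(s)
# [1] "keep1 keep2 keep3"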

Related

R: remove every word that ends with ".htm"

I have a df = desc with a variable "value" that holds long text, and I would like to remove every word in that variable that ends with ".htm". I have looked around here and at regex references for a long time and cannot find a solution.
Can anyone help? Thank you so much!
I tried things like:
library(stringr)
desc <- str_replace_all(desc$value, "\*.htm*$", "")
But I get:
Error: '\*' is an unrecognized escape in character string starting ""\*"
This regex:
catches everything that ends with .htm
will not catch instances with .html
is not dependent on position in the string.
strings <- c("random text shouldbematched.htm notremoved.html matched.htm random stuff")
gsub("\\w+\\.htm\\b", "", strings)
Output:
[1] "random text notremoved.html random stuff"
I am not sure what exactly you would like to accomplish, but I guess one of those is what you are looking for:
words <- c("apple", "test.htm", "friend.html", "remove.htm")
# just replace the ".htm" from every string
str_replace_all(words, ".htm", "")
# exclude all words that contain .htm anywhere
words[!grepl(pattern = ".htm", words)]
# exclude all words that END with .htm
words[substr(words, nchar(words)-3, nchar(words)) != ".htm"]
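For that last check, base R's endsWith() would be a simpler alternative to the substr() arithmetic:
words[!endsWith(words, ".htm")]
# [1] "apple"       "friend.html"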
I am not sure that * tells R to match any value inside a string (in a regex it is a quantifier, not a wildcard), so I would remove it first. Also, in your code you are assigning the change made to your "value" variable to the entire df.
So I would suggest the following:
desc$value <- str_replace(desc$value, ".htm", "")
By doing so, you are telling R to remove all .htm that you have in the desc$value variable alone. I hope it works!
Let's assume you have, as you say, a variable "value" that holds long text and you want to remove every word that ends in .html. Based on these assumptions you can use str_remove_all:
The main point here is to wrap the pattern into word boundary markers \\b:
library(stringr)
str_remove_all(value, "\\b\\w+\\.html\\b")
[1] "apple and test2.html01" "the word must etc. and as well" "we want to remove .htm"
Data:
value <- c("apple test.html and test2.html01",
"the word friend.html must etc. and x.html as well",
"we want to remove .htm")
To achieve what you want just do:
desc$value <- str_replace(desc$value, ".*\\.htm$", "")
You are trying to escape the star, and it is useless. You get an error because \* is not a valid escape sequence in R strings; only \n, \t, etc. exist.
\. is not a valid escape sequence in R strings either. But \\ is, and it produces a single \ in the resulting string used for the regular expression. Therefore, when you escape something in an R regexp you have to escape it twice:
In my regexp, .* means any chars and \\. means a literal dot. It has to be escaped twice because the \ itself first needs to be escaped in the R string.
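A quick way to see the two escape layers at work:
nchar("\\.")         # 2: the string holds the two characters \ and .
grepl("\\.", "a.b")  # TRUE: the regex engine receives \. (a literal dot)
grepl("\\.", "ab")   # FALSE: there is no literal dot to match
grepl(".", "ab")     # TRUE: an unescaped . matches any character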

How to remove urls without http in a text document using r

I am trying to remove urls that may or may not start with http/https from a large text file, which I saved in urldoc in R. The urls may look like tinyurl.com/ydyzzlkk or aclj.us/2y6dQKw or pic.twitter.com/ZH08wej40K. Basically I want to remove everything from the space before a "/" up to the space after it. I have tried many patterns and searched many places but couldn't complete the task. It would help me a lot if you could give some input.
This is the last statement I tried before getting stuck:
urldoc = gsub("?[a-z]+\..\/.[\s]$","", urldoc)
Input would be: A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk
Output I am expecting is: A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:
Thanks.
According to your specs, you may use the following regex:
\s*[^ /]+/[^ /]+
Details
\s* - 0 or more whitespace chars
[^ /]+ (or [^[:space:]/]) - any 1 or more chars other than space (or whitespace) and /
/ - a slash
[^ /]+ (or [^[:space:]/]) - any 1 or more chars other than space (or whitespace) and /.
R demo:
urldoc = gsub("\\s*[^ /]+/[^ /]+","", urldoc)
If you want to account for any whitespace, replace the literal space with [:space:],
urldoc = gsub("\\s*[^[:space:]/]+/[^[:space:]/]+","", urldoc)
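As a quick check against the sample input from the question:
urldoc <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
gsub("\\s*[^ /]+/[^ /]+", "", urldoc)
# [1] "A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:"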
Already answered, but here is an alternative in case you've not come across stringi before.
# most complete package for string manipulation
library(stringi)
# text and regex
text <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
pattern <- "(?:\\s)[^\\s\\.]*\\.[^\\s]+"
# see what is captured
stringi::stri_extract_all_regex(text, pattern)
# remove (replace with "")
stringi::stri_replace_all_regex(text, pattern, "")
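For reference, on the text above the extract step captures the three URL-like tokens, each with its leading space, so the replace step returns the sentence without them:
# [[1]]
# [1] " pic.twitter.com/ZH08wej40K" " goo.gl/YmNELW"
# [3] " tinyurl.com/ydyzzlkk"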
This might work:
text <- " http:/thisisanurl.wde , thisaint , nope , uihfs/yay"
words <- strsplit(text, " ")[[1]]
isurl <- sapply(words, function(x) grepl("/",x))
result <- paste0(words[!isurl], collapse = " ")
result
[1] " , thisaint , nope ,"

gsub replace text after set of characters

I have a lot of error messages that I am trying to clean up.
Some of the errors end with the text "(sec): 0.xxx".
I'm trying to use gsub to remove everything after (sec):
data$Message <- gsub("(sec).*", "", data$Message, perl = TRUE)
but this removes everything from sec onward, leaving the opening ( behind.
I know it would be easy to just use ":" or ")", but that would affect other errors that I do not want to change.
Is there a way to make gsub look at several characters - like "(sec)" - instead of just one?
On a related note, is there a symbol that represents any digit (excluding text), similar to "."?
You can use a regex lookbehind (?<=...) to keep sec) from being removed while still asserting that the removed pattern follows it, so (?<=sec\\)).* will remove everything after sec) but not sec) itself:
gsub("(?<=sec\\)).*", "", "(sec): 0.xxx", perl = TRUE)
# [1] "(sec)"
Alternatively, you can capture the first part of the expression in parentheses and omit the rest:
gsub('(^.*\\(sec\\)).*', '\\1', '(sec): 0.xxx')
## [1] "(sec)"

R utf-8 and replace a word from a sentence based on ending character

I have a requirement where I am working with a large dataset containing double-byte characters (Korean text). I want to look for a character and replace it. In order to display the Korean text correctly in the browser I have changed the locale settings in R, but I am not sure whether that also applies to the code. Below is my code to change the locale to Korean; the Korean text displays properly in the viewer, but printing to the console gives junk characters:
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is in a data.table format that contains a column with text in korean. example -
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the first word, which ends with the "시" character, and then of the "(남동,(지하))" at the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to search the end of the word, I am using \\b, and since I want to replace any word ending with the "시" character, I am giving it as \\b시. Is this not the right way? And how do I take care of the () at the end of the sentence?
What would be a good reference for regular expressions?
Is a UTF-8 setting needed for the script as well, and how do I set that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
Now, if you need to combine both, note that R's PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with the (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb in the PCRE man page.
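Putting this back into the data.table workflow from the question (the New_trunc column name is the asker's):
library(data.table)
data <- data.table(New = "광주광역시 동구 제봉로 49 (남동,(지하))")
data[, New_trunc := trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl = TRUE))]
data$New_trunc
# [1] "동구 제봉로 49"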

Removing white space from data frame in R

I have scraped some data and stored it in a data frame. Some rows contain unwanted information within square brackets. Example "[N] Team Name".
I want to keep just the part containing the team name, so first I use the code below to remove the brackets and any text contained within them
gsub( " *\\(.*?\\) *", "", x)
This leaves me with " Team Name" (notice the space before the T).
Now I am trying to remove the white space before the T using trimws or the method shown here, but it is not working.
Could someone please help me remove the extra white space?
Note: if I write the string containing the space manually and apply trimws to it, it works. However, when the string is obtained directly from the data frame it doesn't. Also, when running the code snippet below (where df[1,1] is the same string retrieved from the data frame), I get FALSE. This gives me reason to believe that the string in the data frame is not the same as the manually typed string.
" team name" == df[1,1]
You could try
gsub( "\\[[^]]*\\]\\W*", "", "[N] Team Name")
We can use
sub(".*\\]\\s+", "", x)
#[1] "Team Name"
Or just
sub("\\S+\\s+", "", x)
#[1] "Team Name"
data
x <- '[N] Team Name';
You should be able to remove the bracketed piece as well as any following whitespace with a single regex substitution. Your regex is correct as-is, and should successfully accomplish this. (Note: I've ignored the unexplained discrepancy between your use of parentheses vs. square brackets in your question. I've assumed square brackets for my answer.)
Strangely, this seems to be a case where the default regex engine is failing, but adding perl=T gets it working:
x <- '[N] Team Name';
gsub(' *\\[.*?\\] *','',x);
## [1] " Team Name"
gsub(perl=T,' *\\[.*?\\] *','',x);
## [1] "Team Name"
In the past I have run across cases where the default regex engine flakes out, but I have never encountered this with perl=T, so I suggest you use that. I really think there is something broken in the default regex implementation.
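A plausible cause of the symptom in the question (trimws() working on a hand-typed string but not on the scraped one) is a non-breaking space, U+00A0, which scraped HTML often contains. That is an assumption the question does not confirm, but it is easy to check:
s <- "\u00a0Team Name"        # hypothetical scraped value starting with an NBSP
utf8ToInt(substr(s, 1, 1))    # 160, not 32: an NBSP, not an ordinary space
" Team Name" == s             # FALSE, as in the question's comparison
gsub("^[\u00a0 ]+", "", s)    # strip NBSPs and ordinary spaces explicitly
# [1] "Team Name"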
