I'd like to match the following
My best email gmail.com
email com
email.com
to become
My best email
email com
*nothing*
Specifically, I'm using Regex for R, so I know there are different rules for escaping certain characters. I'm very new to Regex, but so far I have
\ .*(com)
which makes the same input
My
But this code does not work for instances where there are no spaces like the third example, and removes everything past the first space of a line if the line has a ".com"
Use the following solution:
x <- c("My best email gmail.com","email com", "email.com", "smail.com text here")
trimws(gsub("\\S+\\.com\\b", "", x))
## => [1] "My best email" "email com" "" "text here"
See the R demo.
The \\S+\\.com\\b pattern matches 1+ non-whitespace chars followed by a literal .com followed by the word boundary.
The trimws function will trim all the resulting strings (as, e.g. with "smail.com text here", when a space will remain after smail.com removal).
Note that TRE regex engine does not support shorthand character classes inside bracket expressions.
Related
I have the following string:
x = "marchTextIWantToDisplayWithSpacesmarch"
I would like to delete the 'march' portion at the beginning of the string and then add a space before each uppercase letter in the remainder to yield the following result:
"Text I Want To Display With Spacesmarch"
To insert whitepace, I used gsub("([a-z]?)([A-Z])", "\\1 \\2", x, perl= T) but I have no clue how to modify the pattern so that the first 'march' is excluded from the returned string. I'm trying to get better at this so any help would be greatly appreciated.
An option would be to capture the upper case letter as a group ((...)) and in the replacement create a space followed by the backreference (\\1) of the captured group
gsub("([A-Z])", " \\1", x)
#[1] "march Text I Want To Display With Spacesmarch"
If we need to remove the 'march'
sub("\\b[a-z]\\w+\\s+", "", gsub("([A-Z])", " \\1", x))
[#1] "Text I Want To Display With Spacesmarch"
data
x <- "marchTextIWantToDisplayWithSpacesmarch"
No, you can't achieve your replacement using single gsub because in one of your requirement, you want to remove all lowercase letters starting from the beginning, and your second requirement is to introduce a space before every capital letter except the first capital letter of the resultant string after removing all lowercase letters from the beginning of text.
Doing it in single gsub call would have been possible in cases where somehow we can re-use some of the existing characters to make the conditional replace which can't be the case here. So in first step, you can use ^[a-z]+ regex to get rid of all lowercase letters only from the beginning of string,
sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch")
leaving you with this,
[1] "TextIWantToDisplayWithSpacesmarch"
And next step you can use this (?<!^)(?=[A-Z]) regex to insert a space before every capital letter except the first one as you might not want an extra space before your sentence. But you can combine both and write them as this,
gsub('(?<!^)(?=[A-Z])', ' ', sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch"), perl=TRUE)
which will give you your desired string,
[1] "Text I Want To Display With Spacesmarch"
Edit:
Explanation of (?<!^)(?=[A-Z]) pattern
First, let's just take (?=[A-Z]) pattern,
See the pink markers in this demo
As you can see, in the demo, every capital letter is preceded by a pink mark which is the place where a space will get inserted. But we don't want space to be inserted before the very first letter as that is not needed. Hence we need a condition in regex, which will not select the first capital letter which appears at the start of string. And for that, we need to use a negative look behind (?<!^) which means that Do not select the position which is preceded by start of string and hence this (?<!^) helps in discarding the upper case letter that is preceded by just start of string.
See this demo where the pink marker is gone from the very first uppercase letter
Hope this clarifies how every other capital letter is selected but not the very first. Let me know if you have any queries further.
You may use a single regex call to gsub coupled with trimws to trim the resulting string:
trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl=TRUE))
## => [1] "Text I Want To Display With Spacesmarch"
It also supports all Unicode lowercase (\p{Ll}) and uppercase (\p{Lu}) letters.
See the R demo online and the regex demo.
Details
^\\p{Ll}+ - 1 or more lowercase letters at the string start
| - or
(?<=.)(?=\\p{Lu}) - any location between any char but linebreak chars and an uppercase letter.
Here is an altenative with a single call to gsubfn regex with some ifelse logic:
> gsubfn("^\\p{Ll}*(\\p{L})|(?<=.)(?=\\p{Lu})", function(n) ifelse(nchar(n)>0,n," "), x, perl=TRUE,backref=-1)
[1] "Text I Want To Display With Spacesmarch"
Here, the ^\\p{Ll}*(\\p{L}) part matches 0+ lowercase letters and captures the next uppercase into Group 1 that will be accessed by passing n argument to the anonymous function. If n length is non-zero, this alternative matched and the we need to replace with this value. Else, we replace with a space.
Since this is tagged perl, my 2 cents:
Can you chain together the substitutions inside sub() and gsub()? In newer perl versions an /r option can be added to the s/// substitution so the matched string can be returned "non-destructively" and then matched again. This allows hackish match/substitution/rematches without mastering advanced syntax, e.g.:
perl -E '
say "marchTextIWantToDisplayWithSpacesmarch" =~
s/\Amarch//r =~ s/([[:upper:]])/ $1/gr =~ s/\A\s//r;'
Output
Text I Want To Display With Spacesmarch
This seems to be what #pushpesh-kumar-rajwanshi and #akrun are doing by wrapping gsub inside sub() (and vice versa). In general I don't thinkperl = T captures the full magnificently advanced madness of perl regexps ;-) but gsub/sub must be fast operating on vectors, no?
How would I find all substrings, which are between "##" and "\n“ or “{“?
For example, facing "## Test\n" or "## Test {", I would like to get back "Test".
I am not experienced in using Regex but started in trying
str_match("## Test\n", "## (.*?) \n")
using the stringr-package. But it seems as there is an issue with the line-break.
The following should work:
str_match("## Test\n", "##\s*([^\n{]*)[\n{]")
## matches ##
\s* matches any number of whitespace characters
([^\n{]*) will match and capture any number of characters that are not \n or {
[\n{] ends the pattern on either \n or {
You can use an assertion and a negated class.
Putting a \r carriage return in the class will take care of line break issues.
(?<=##)[^\r\n{]*
What it matches is what you need.
Expanded
(?<= \#\# )
[^\r\n{]*
Also, if you anticipate ## not separated by line breaks or {,
and is valid, use something like this
(?<=##)(?:(?!##|[\r\n{])[\S\s])*
I have a requirement where I am working on a large data which is having double byte characters, in korean text. i want to look for a character and replace it. In order to display the korean text correctly in the browser I have changed the locale settings in R. But not sure if it gets updated for the code as well. below is my code to change locale to korean and the korean text gets visible properly in viewer, however in console it gives junk character on printing-
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is in a data.table format that contains a column with text in korean. example -
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the 1st word which ends with "시" character. Then I want to get rid of the "(남동,(지하))" an the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to search the end of word, I am using \\b and since I want to replace any word ending with "시" character I am giving it as \\b시.....is this not the way to give? How to take care of () at the end of the sentence.
What would be a good source to refer to for regular expressions.
Is a utf-8 setting needed for the script as well?How to do that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
Now, if you need to combine both, you would see that R PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.
I having a dataframe of part of speech tagged strings
Example:
best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ
I want to remove the tags after/and the '_' so that I have the output
best phone only issue camera sensor have mind own
I am using R and I couldn't find an appropriate regex for the gsub function.
I tried this.
sentence= c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
o1=gsub("\\_.*","",sentence, perl = T)
But This removes entire string after the first underscore. Thanks in Advance
You may use _[A-Z]+ TRE pattern with gsub:
sentence <- c("best_JJS phone_NN only_RB issue_NN camera_NN sensor_NN have_VB mind_NN own_JJ")
gsub("_[A-Z]+","",sentence)
[1] "best phone only issue camera sensor have mind own"
See the R demo
The _[A-Z]+ pattern matches an underscore (_, note it does not have to be escaped in a regex pattern) and one or more (+) uppercase ASCII letters ([A-Z]).
You may further precise the pattern, say, to only match the _ if it is preceded with a word char and match uppercase letters only when followed with a word boundary:
"\\B_[A-Z]+\\b
In case you want to create a very specific regex for the POS values, you may use alternation:
"\\B_(JJ|NN|CC|[VR]B)\\b"
And continue adding |<code> to the regex pattern.
I have a character string in which I would like to only remove the line breaks followed immediately by a lowercase letter. For example, my string might contain:
one line of text \r\n another line \r\nof text,
which would show up as:
one line of text
another line
of text.
In this example, I would only want to remove the second line break, so that the text would then read:
one line of text
another line of text
I know that the pattern is "\r\n[a-z]", and so the code should be something like
gsub("\r\n[a-z]","")
but I cannot come up with code that removes the line break while retaining the lowercase letter.
Thanks!
We can use a regex lookaround
txtN <- gsub("\r\n(?=[a-z])", "", txt, perl = TRUE)
cat(txtN, sep="\n")
# one line of text
# another line of text,
You may achieve what you need without lookarounds and use a TRE regex like
s <- "one line of text \r\n another line \r\nof text,"
res <- gsub("\r?\n([a-z])","\\1", s)
cat(res)
See the IDEONE demo
If you use the (...) around a pattern you define a capturing group the contents of which you may reference from the replacement pattern.
Pattern details:
\r?\n - a linebreak (either \r\n or \n)
([a-z]) - a lowercase ASCII letter inside Group 1.
Replacement:
\1 - a numbered backreference to the Group 1 contents.
More information about:
Capturing groups
Backreferences
P.S.: If you are keen to use PCRE regex, there is one very nice construct other than a lookahead support - a \R that matches any style linebreak. Then, I'd suggest:
gsub("\\R(?=[a-z])", "", txt, perl = TRUE)
You need to use a positive lookahead for this.
For instance:
text = "one line of text \r\n another line \r\nof text,"
fixed = gsub("\r\n(?=[a-z])", "", text, perl = T)
cat(fixed)
#> one line of text
#> another line of text,