Removing certain regular expressions in r - r

I have a character string in which I would like to only remove the line breaks followed immediately by a lowercase letter. For example, my string might contain:
one line of text \r\n another line \r\nof text,
which would show up as:
one line of text
another line
of text.
In this example, I would only want to remove the second line break, so that the text would then read:
one line of text
another line of text
I know that the pattern is "\r\n[a-z]", and so the code should be something like
gsub("\r\n[a-z]","")
but I cannot come up with code that removes the line break while retaining the lowercase letter.
Thanks!

We can use a regex lookaround
txtN <- gsub("\r\n(?=[a-z])", "", txt, perl = TRUE)
cat(txtN, sep="\n")
# one line of text
# another line of text,

You may achieve what you need without lookarounds and use a TRE regex like
s <- "one line of text \r\n another line \r\nof text,"
res <- gsub("\r?\n([a-z])","\\1", s)
cat(res)
See the IDEONE demo
If you use the (...) around a pattern you define a capturing group the contents of which you may reference from the replacement pattern.
Pattern details:
\r?\n - a linebreak (either \r\n or \n)
([a-z]) - a lowercase ASCII letter inside Group 1.
Replacement:
\1 - a numbered backreference to the Group 1 contents.
More information about:
Capturing groups
Backreferences
P.S.: If you are keen to use PCRE regex, there is one very nice construct other than a lookahead support - a \R that matches any style linebreak. Then, I'd suggest:
gsub("\\R(?=[a-z])", "", txt, perl = TRUE)

You need to use a positive lookahead for this.
For instance:
text = "one line of text \r\n another line \r\nof text,"
fixed = gsub("\r\n(?=[a-z])", "", text, perl = T)
cat(fixed)
#> one line of text
#> another line of text,

Related

Applying a regular expression to a string in R

I'm just getting to know the language R, previously worked with python. The challenge is to replace the last character of each word in the string with *.
How it should look: example text in string, and result work: exampl* tex* i* strin*
My code:
library(tidyverse)
library(stringr)
string_example = readline("Enter our text:")
string_example = unlist(strsplit(string_example, ' '))
string_example
result = str_replace(string_example, pattern = "*\b", replacement = "*")
result
I get an error:
> result = str_replace(string_example, pattern = "*\b", replacement = "*")
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=``)
Help solve the task
Oh, I noticed an error, the pattern should be .\b. this is how the code is executed, but there is no replacement in the string
If you mean words consisting of letters only, you can use
string_example <- "example text in string"
library(stringr)
str_replace_all(string_example, "\\p{L}\\b", "*")
## => [1] "exampl* tex* i* strin*"
See the R demo and the regex demo.
Details:
\p{L} - a Unicode category (propery) class matching any Unicode letter
\b - a word boundary, in this case, it makes sure there is no other word character immediately on the right. It will fails the match if the letter matched with \p{L} is immediately followed with a letter, digit or _ (these are all word chars). If you want to limit this to a letter check, replace \b with (?!\p{L}).
Note the backslashes are doubled because in regular string literals backslashes are used to form string escape sequences, and thus need escaping themselves to introduce literal backslashes in string literals.
Some more things to consider
If you do not want to change one-letter words, add a non-word boundary at the start, "\\B\\p{L}\\b"
If you want to avoid matching letters that are followed with - + another letter (i.e. some compound words), you can add a lookahead check: "\\p{L}\\b(?!-)".
You may combine the lookarounds and (non-)word boundaries as you need.

how to remove decimal point between numbers in R

I am trying to remove the decimal points in decimal numbers in R. Please note I want to keep the full stop of strings.
Example:
data= c("It's 6.00pm, and is late.")
I know that I have to use regex for this, but I am struggling. My desired output is:
"It's 6 00pm, and is late."
Thank you in advance.
Try this:
sub("(?<=\\d)\\.(?=\\d)", " ", data, perl = TRUE)
This solution uses lookbehind (?<=...) and lookahead (?=...)to assert that the period you wish to remove be enclosed by digits (thus avoiding matching the period at the sentence end). If you have several such cases within strings, then use gsubinstead of sub.
I suggest using a simple pattern to find the target text, then adding parenthesis to identify the parts of the matching text that you want to retain.
# Test data
data <- c("It's 6.00pm, and is late.")
The target pattern is a literal dot with a string of digits before and after it. \\d+ matches one or more digits and \\. matches a literal dot. Testing the pattern to see if it works:
grepl("\\d+\\.\\d+", data)
Result
TRUE
If we wanted too eliminate the whole thing we could do a simple replacement with an empty string. Testing if this targets the correct text:
sub("\\d+\\.\\d+", "", data)
Result
"It's pm, and is late."
Instead, to discard only a section of matched text we can identify the parts we want to keep, which is done by surrounding them with parenthesis. Once done we can refer to the captured text in the replacement. \\1 refers to the first chunk of text captured and \\2 refers to the second chunk of text, corresponding to the first and second sets of parenthesis
# pattern replacement
sub("(\\d+)\\.(\\d+)", "\\1\\2", data)
Result
[1] "It's 600pm, and is late."
This effectively removes the dot by omitting it from the replacement text.

Cleaning a text with several quotation marks leaving just a pair of them

How one could replace a text that contains several of the following pattern with just the necessary quotation marks?
Provide we with """"""""""""""""""""""""""""""""This is what matters"""""""""""""""""""""""""""""""".
The result should be:
Provide we with "This is what matters".
I already have tried this, but it didn't work well:
gsub("\"\"", "\"", txt)
Also, these texts are not with the same number of quotes, so there are ones with fewer quotes while other with even more quotes.
Replacing each pair of "" with " when you have multiple consecutive occurrences will result in several consecutive double quotation marks to still remain in the string. You want either to match 1 or more " chars and replace with a single ", or match and remove any " that is followed with ".
You may use
gsub('"+', '"', txt)
See the R demo
The "+ pattern matches one or more double quotation marks and replaces the chunks with a single quotation mark.
With stringr::str_remove_all, you can use a regex that will match any " that is followed with ":
library(stringr)
str_remove_all(txt, '"(?=")')
See the regex demo. The regex here contains a (?=") positive lookahead that requires the presence of " immediately to the right of the current location.
Same concept may be conveyed in base R with a PCRE regex (use perl=TRUE):
gsub('"(?=")', '"', txt, perl=TRUE)
An option with str_remove_all
library(stringr)
str_remove_all(txt, '"+')

How to add the removed space in a sentence?

I have the following string:
x = "marchTextIWantToDisplayWithSpacesmarch"
I would like to delete the 'march' portion at the beginning of the string and then add a space before each uppercase letter in the remainder to yield the following result:
"Text I Want To Display With Spacesmarch"
To insert whitepace, I used gsub("([a-z]?)([A-Z])", "\\1 \\2", x, perl= T) but I have no clue how to modify the pattern so that the first 'march' is excluded from the returned string. I'm trying to get better at this so any help would be greatly appreciated.
An option would be to capture the upper case letter as a group ((...)) and in the replacement create a space followed by the backreference (\\1) of the captured group
gsub("([A-Z])", " \\1", x)
#[1] "march Text I Want To Display With Spacesmarch"
If we need to remove the 'march'
sub("\\b[a-z]\\w+\\s+", "", gsub("([A-Z])", " \\1", x))
[#1] "Text I Want To Display With Spacesmarch"
data
x <- "marchTextIWantToDisplayWithSpacesmarch"
No, you can't achieve your replacement using single gsub because in one of your requirement, you want to remove all lowercase letters starting from the beginning, and your second requirement is to introduce a space before every capital letter except the first capital letter of the resultant string after removing all lowercase letters from the beginning of text.
Doing it in single gsub call would have been possible in cases where somehow we can re-use some of the existing characters to make the conditional replace which can't be the case here. So in first step, you can use ^[a-z]+ regex to get rid of all lowercase letters only from the beginning of string,
sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch")
leaving you with this,
[1] "TextIWantToDisplayWithSpacesmarch"
And next step you can use this (?<!^)(?=[A-Z]) regex to insert a space before every capital letter except the first one as you might not want an extra space before your sentence. But you can combine both and write them as this,
gsub('(?<!^)(?=[A-Z])', ' ', sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch"), perl=TRUE)
which will give you your desired string,
[1] "Text I Want To Display With Spacesmarch"
Edit:
Explanation of (?<!^)(?=[A-Z]) pattern
First, let's just take (?=[A-Z]) pattern,
See the pink markers in this demo
As you can see, in the demo, every capital letter is preceded by a pink mark which is the place where a space will get inserted. But we don't want space to be inserted before the very first letter as that is not needed. Hence we need a condition in regex, which will not select the first capital letter which appears at the start of string. And for that, we need to use a negative look behind (?<!^) which means that Do not select the position which is preceded by start of string and hence this (?<!^) helps in discarding the upper case letter that is preceded by just start of string.
See this demo where the pink marker is gone from the very first uppercase letter
Hope this clarifies how every other capital letter is selected but not the very first. Let me know if you have any queries further.
You may use a single regex call to gsub coupled with trimws to trim the resulting string:
trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl=TRUE))
## => [1] "Text I Want To Display With Spacesmarch"
It also supports all Unicode lowercase (\p{Ll}) and uppercase (\p{Lu}) letters.
See the R demo online and the regex demo.
Details
^\\p{Ll}+ - 1 or more lowercase letters at the string start
| - or
(?<=.)(?=\\p{Lu}) - any location between any char but linebreak chars and an uppercase letter.
Here is an altenative with a single call to gsubfn regex with some ifelse logic:
> gsubfn("^\\p{Ll}*(\\p{L})|(?<=.)(?=\\p{Lu})", function(n) ifelse(nchar(n)>0,n," "), x, perl=TRUE,backref=-1)
[1] "Text I Want To Display With Spacesmarch"
Here, the ^\\p{Ll}*(\\p{L}) part matches 0+ lowercase letters and captures the next uppercase into Group 1 that will be accessed by passing n argument to the anonymous function. If n length is non-zero, this alternative matched and the we need to replace with this value. Else, we replace with a space.
Since this is tagged perl, my 2 cents:
Can you chain together the substitutions inside sub() and gsub()? In newer perl versions an /r option can be added to the s/// substitution so the matched string can be returned "non-destructively" and then matched again. This allows hackish match/substitution/rematches without mastering advanced syntax, e.g.:
perl -E '
say "marchTextIWantToDisplayWithSpacesmarch" =~
s/\Amarch//r =~ s/([[:upper:]])/ $1/gr =~ s/\A\s//r;'
Output
Text I Want To Display With Spacesmarch
This seems to be what #pushpesh-kumar-rajwanshi and #akrun are doing by wrapping gsub inside sub() (and vice versa). In general I don't thinkperl = T captures the full magnificently advanced madness of perl regexps ;-) but gsub/sub must be fast operating on vectors, no?

How to delete only (anyword).com in Regex?

I'd like to match the following
My best email gmail.com
email com
email.com
to become
My best email
email com
*nothing*
Specifically, I'm using Regex for R, so I know there are different rules for escaping certain characters. I'm very new to Regex, but so far I have
\ .*(com)
which makes the same input
My
But this code does not work for instances where there are no spaces like the third example, and removes everything past the first space of a line if the line has a ".com"
Use the following solution:
x <- c("My best email gmail.com","email com", "email.com", "smail.com text here")
trimws(gsub("\\S+\\.com\\b", "", x))
## => [1] "My best email" "email com" "" "text here"
See the R demo.
The \\S+\\.com\\b pattern matches 1+ non-whitespace chars followed by a literal .com followed by the word boundary.
The trimws function will trim all the resulting strings (as, e.g. with "smail.com text here", when a space will remain after smail.com removal).
Note that TRE regex engine does not support shorthand character classes inside bracket expressions.

Resources