R regex: remove apostrophes except the ones preceded and followed by a letter

I'm cleaning a text and I'd like to remove every apostrophe except those preceded and followed by letters, as in i'm, i'll, he's, etc.
I have the following preliminary solution, which handles many cases, but I want a better one:
rmAps <- function(x) gsub("^\'+| \'+|\'+ |[^[:alpha:]]\'+(a-z)*|\\b\'*$", " ", x)
rmAps("'i'm '' ' 'we end' '")
[1] " i'm we end "
I also tried:
(?<![a-z])'(?![a-z])
But I think I am still missing something.

gsub("'(?!\\w)|(?<!\\w)'", "", x, perl = TRUE)
#[1] "i'm we end "
Remove occasions when your character is not followed by a word character: '(?!\\w).
Remove occasions when your character is not preceded by a word character: (?<!\\w)'.
If either of those situations occurs, you want to remove the apostrophe, so '(?!\\w)|(?<!\\w)' should do the trick. Just note that \\w includes the underscore (and digits), so adjust as necessary.
Another option is
gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", x, perl = TRUE)
In this case, you match any instances when ' is surrounded by word characters: \\w'\\w, and then force that match to fail with (*SKIP)(*FAIL). But, also look for ' using |'. The result is that only occurrences of ' not wrapped in word characters will be matched and substituted out.
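As a quick check of the two patterns above, the following sketch runs both on the sample string from the question. The variable names and the underscore example are illustrative assumptions, not part of the original answer; swapping \\w for [[:alpha:]] is just one possible way to make the adjustment mentioned above.

```r
# Sample input taken from the question.
x <- "'i'm '' ' 'we end' '"

# Lookaround version: drop an apostrophe when either neighbour is not a word character.
by_lookaround <- gsub("'(?!\\w)|(?<!\\w)'", "", x, perl = TRUE)

# SKIP/FAIL version: protect word'word, then drop every remaining apostrophe.
by_skip_fail <- gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", x, perl = TRUE)

# Assumed variant: letters only, so an underscore no longer protects an apostrophe.
letters_only <- gsub("'(?![[:alpha:]])|(?<![[:alpha:]])'", "", "_'a b'_ c'd", perl = TRUE)

print(by_lookaround)
print(by_skip_fail)
print(letters_only)
```

Both patterns only delete apostrophes, so the spaces that surrounded them stay in place; collapse repeated whitespace in a separate step if that matters.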

You can use the following regular expression:
(?<=\w)'(?=\w)
(?<=) is a positive lookbehind. Everything inside needs to match before the next selector
(?=) is a positive lookahead. Everything inside needs to match after the previous selector
\w any alphanumeric character and the underscore
You could also switch \w to e.g. [a-zA-Z] if you want to restrict the results.
→ Here is your example on regex101 for live testing.
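One thing worth checking with this pattern is which apostrophes it actually selects. The sketch below (input string taken from the question; the "*" marker is only an illustrative choice) replaces each match with a visible marker. It suggests that the pattern picks out the apostrophes between word characters, i.e. the ones the question wants to keep, so it seems better suited to locating them than to deleting the others.

```r
# Sample input taken from the question.
x <- "'i'm '' ' 'we end' '"

# Mark every apostrophe that has a word character on both sides.
marked <- gsub("(?<=\\w)'(?=\\w)", "*", x, perl = TRUE)

print(marked)
```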

Related

How do I add a space between two characters using regex in R?

I want to add a space between two punctuation characters (+ and -).
I have this code:
s <- "-+"
str_replace(s, "([:punct:])([:punct:])", "\\1\\s\\2")
It does not work.
May I have some help?
There are several issues here:
In the ICU regex flavor, [:punct:] does not match math symbols (\p{S}); it only matches punctuation proper (\p{P}). To match both, combine the two classes: [\p{P}\p{S}]
The "\\1\\s\\2" replacement contains a \s regex escape sequence, which is not supported in replacement patterns; you need to use a literal space
str_replace only replaces the first occurrence; use str_replace_all to handle all matches
Even with all of the above suggestions, it still won't work for strings like -+?/. You need to make the second part of the regex a zero-width assertion, a positive lookahead, so that the second punctuation character is not consumed.
So, you can use
library(stringr)
s <- "-+?="
str_replace_all(s, "([\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", "\\1 ")
str_replace_all(s, "(?<=[\\p{P}\\p{S}])(?=[\\p{P}\\p{S}])", " ")
gsub("(?<=[[:punct:]])(?=[[:punct:]])", " ", s, perl=TRUE)
See the R demo online, all three lines yield [1] "- + ? =" output.
Note that in PCRE regex flavor (used with gsub and perl=TRUE) the POSIX character class must be put inside a bracket expression, hence the use of double brackets in [[:punct:]].
Also, (?<=[[:punct:]]) is a positive lookbehind that checks for the presence of its pattern immediately on the left, and since it is non-consuming there is no need of any backreference in the replacement.
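To see why the second character must not be consumed, here is a small base R sketch (the variable names are illustrative assumptions) that compares a pattern consuming both punctuation characters with the lookahead version on the same input.

```r
s <- "-+?="

# Consuming both characters: the matches are "-+" and "?=",
# so the boundary between "+" and "?" is never examined.
consuming <- gsub("([[:punct:]])([[:punct:]])", "\\1 \\2", s, perl = TRUE)

# Lookahead: only the first character is consumed,
# so every adjacent pair gets a space.
with_lookahead <- gsub("([[:punct:]])(?=[[:punct:]])", "\\1 ", s, perl = TRUE)

print(consuming)
print(with_lookahead)
```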

Remove all dots but first in a string using R

I have some errors in some numbers, which show up like "59.34343.23". I know the first dot is correct but the second one (or any after the first) should be removed. How can I remove those?
I tried using gsub in R:
gsub("(?<=\\..*)\\.", "", "59.34343.23", perl=T)
or
gsub("(?<!^[^.]*)\\.", "", "59.34343.23", perl=T)
However it gets the following error "invalid regular expression". But I have been trying the same code in a regex tester and it works.
What is my mistake here?
You can use
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23")
gsub("^([^.]*\\.)|\\.", "\\1", "59.34343.23", perl=TRUE)
See the R demo online and the regex demo.
Details:
^([^.]*\.) - Capturing group 1 (referred to as \1 from the replacement pattern): any zero or more chars from the start of string and then a . char (the first in the string)
| - or
\. - any other dot in the string.
Since the replacement, \1, refers to Group 1, and Group 1 only contains a value after the text before and including the first dot is matched, the replacement is either this part of text, or empty string (i.e. the second and all subsequent occurrences of dots are removed).
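The capture-or-drop idea described above can be tried on a few more inputs. This is a minimal sketch: the helper name keep_first_dot and the extra test strings are assumptions added for illustration.

```r
# Hypothetical helper wrapping the pattern from this answer.
keep_first_dot <- function(x) {
  # Group 1 holds "text up to and including the first dot"; for any later dot
  # the group is empty, so replacing with \\1 deletes that dot.
  gsub("^([^.]*\\.)|\\.", "\\1", x, perl = TRUE)
}

inputs <- c("59.34343.23", "1.2.3.4", "42", ".5.")
cleaned <- keep_first_dot(inputs)

print(cleaned)
```

Strings without any dot pass through unchanged, since neither alternative matches.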
We may use
gsub("^[^.]+\\.(*SKIP)(*FAIL)|\\.", "", str1, perl = TRUE)
[1] "59.3434323"
data
str1 <- "59.34343.23"
By specifying perl = TRUE you can convert matches of the following regular expression to empty strings:
^[^.]*\.[^.]*\K.|\.
Start your engine!
If you are unfamiliar with \K hover over it in the regular expression at the link to see an explanation of its effect.
There is always the option of writing the dot back only if it is the first one in the line.
The key is to consume the other dots without writing them back.
The effect is to delete every dot after the first.
The pattern below uses a branch reset to accomplish this (Perl mode).
(?m)(?|(^[^.\n]*\.)|()\.+)
Replace $1
https://regex101.com/r/cHcu4j/1
(?m)
(?|
( ^ [^.\n]* \. ) # (1)
| ( ) # (1)
\.+
)
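For R users, the branch-reset pattern above can presumably be passed to gsub with perl = TRUE, writing the replacement as "\\1" instead of $1. The sketch below is an assumption about how that translation would look, not something taken from the original answer; the second input is added to exercise the (?m) multiline flag.

```r
# Branch reset: both alternatives fill group 1 (the second one with an empty string).
branch_reset <- "(?m)(?|(^[^.\\n]*\\.)|()\\.+)"

inputs <- c("59.34343.23", "1.2.3\n4.5.6")

# Writing back \\1 keeps the first dot of each line and drops the rest.
cleaned <- gsub(branch_reset, "\\1", inputs, perl = TRUE)

print(cleaned)
```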
The pattern that you tried does not match, because there is an infinite quantifier in the lookbehind (?<=\\..*) that is not supported.
Another variation using \G to get continuous matches after the first dot:
(?:^[^.]*\.|\G(?!^))[^.]*\K\.
In parts, the pattern matches:
(?: Non capture group for the alternation |
^[^.]*\. Start of string, match any char except ., then match .
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
)[^.]* Optionally match any char except .
\K\. Clear the match buffer and match the dot (to be removed)
Regex demo | R demo
gsub("(?:^[^.]*\\.|\\G(?!^))[^.]*\\K\\.", "", "59.34343.23", perl=T)
Output
[1] "59.3434323"

Cleaning a text with several quotation marks leaving just a pair of them

How one could replace a text that contains several of the following pattern with just the necessary quotation marks?
Provide we with """"""""""""""""""""""""""""""""This is what matters"""""""""""""""""""""""""""""""".
The result should be:
Provide we with "This is what matters".
I already have tried this, but it didn't work well:
gsub("\"\"", "\"", txt)
Also, these texts do not all have the same number of quotes: some have fewer quotes, while others have even more.
Replacing each pair of "" with " when there are many consecutive occurrences still leaves several consecutive double quotation marks in the string. You want either to match one or more " chars and replace them with a single ", or to match and remove any " that is followed by another ".
You may use
gsub('"+', '"', txt)
See the R demo
The "+ pattern matches one or more double quotation marks and replaces the chunks with a single quotation mark.
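To make the difference concrete, the sketch below rebuilds the sample text with strrep and compares the pairwise replacement from the question with the "+ approach. The run length of 32 quotes and the variable names are assumptions chosen for illustration.

```r
# Rebuild a sample like the one in the question: 32 quotes on each side.
txt <- paste0("Provide we with ", strrep('"', 32),
              "This is what matters", strrep('"', 32), ".")

# Pairwise replacement only halves each run (32 quotes become 16).
pairwise <- gsub('""', '"', txt)

# Collapsing every run of quotes to a single quote gives the desired result.
collapsed <- gsub('"+', '"', txt)

print(nchar(txt) - nchar(pairwise))
print(collapsed)
```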
With stringr::str_remove_all, you can use a regex that will match any " that is followed with ":
library(stringr)
str_remove_all(txt, '"(?=")')
See the regex demo. The regex here contains a (?=") positive lookahead that requires the presence of " immediately to the right of the current location.
Same concept may be conveyed in base R with a PCRE regex (use perl=TRUE):
gsub('"(?=")', '', txt, perl=TRUE)
An option with str_replace_all
library(stringr)
str_replace_all(txt, '"+', '"')

How to add the removed space in a sentence?

I have the following string:
x = "marchTextIWantToDisplayWithSpacesmarch"
I would like to delete the 'march' portion at the beginning of the string and then add a space before each uppercase letter in the remainder to yield the following result:
"Text I Want To Display With Spacesmarch"
To insert whitespace, I used gsub("([a-z]?)([A-Z])", "\\1 \\2", x, perl = T) but I have no clue how to modify the pattern so that the first 'march' is excluded from the returned string. I'm trying to get better at this so any help would be greatly appreciated.
An option would be to capture the upper case letter as a group ((...)) and in the replacement create a space followed by the backreference (\\1) of the captured group
gsub("([A-Z])", " \\1", x)
#[1] "march Text I Want To Display With Spacesmarch"
If we need to remove the 'march'
sub("\\b[a-z]\\w+\\s+", "", gsub("([A-Z])", " \\1", x))
#[1] "Text I Want To Display With Spacesmarch"
data
x <- "marchTextIWantToDisplayWithSpacesmarch"
No, you can't achieve this with a single gsub call, because you have two different requirements: first, remove all lowercase letters from the beginning of the string; second, insert a space before every capital letter except the first one in the string that remains.
Doing it in a single gsub call would be possible only if some of the existing characters could be reused to make the replacement conditional, which is not the case here. So, as a first step, you can use the ^[a-z]+ regex to get rid of the lowercase letters at the beginning of the string only,
sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch")
leaving you with this,
[1] "TextIWantToDisplayWithSpacesmarch"
And next step you can use this (?<!^)(?=[A-Z]) regex to insert a space before every capital letter except the first one as you might not want an extra space before your sentence. But you can combine both and write them as this,
gsub('(?<!^)(?=[A-Z])', ' ', sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch"), perl=TRUE)
which will give you your desired string,
[1] "Text I Want To Display With Spacesmarch"
Edit:
Explanation of (?<!^)(?=[A-Z]) pattern
First, let's just take (?=[A-Z]) pattern,
See the pink markers in this demo
As you can see in the demo, every capital letter is preceded by a pink mark, which is the place where a space will be inserted. But we don't want a space inserted before the very first letter, as it is not needed. Hence we need a condition in the regex that rejects the position before a capital letter at the start of the string. For that, we use a negative lookbehind, (?<!^), which means: do not select a position that is immediately preceded by the start of the string. This is how (?<!^) discards the uppercase letter that sits at the very start of the string.
See this demo where the pink marker is gone from the very first uppercase letter
Hope this clarifies how every other capital letter is selected but not the very first. Let me know if you have any queries further.
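The two steps from this answer can be wrapped into one small function so they apply to a whole character vector. This is only a sketch: the name split_camel and the second input string are assumptions made for illustration.

```r
# Hypothetical helper combining the two steps described above.
split_camel <- function(x) {
  # Step 1: drop the leading run of lowercase letters.
  stripped <- sub("^[a-z]+", "", x)
  # Step 2: insert a space before each capital letter, except at the string start.
  gsub("(?<!^)(?=[A-Z])", " ", stripped, perl = TRUE)
}

result <- split_camel(c("marchTextIWantToDisplayWithSpacesmarch", "aprilHelloWorld"))
print(result)
```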
You may use a single regex call to gsub coupled with trimws to trim the resulting string:
trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl=TRUE))
## => [1] "Text I Want To Display With Spacesmarch"
It also supports all Unicode lowercase (\p{Ll}) and uppercase (\p{Lu}) letters.
See the R demo online and the regex demo.
Details
^\\p{Ll}+ - 1 or more lowercase letters at the string start
| - or
(?<=.)(?=\\p{Lu}) - any location between any char but linebreak chars and an uppercase letter.
Here is an alternative with a single call to gsubfn regex with some ifelse logic:
> gsubfn("^\\p{Ll}*(\\p{L})|(?<=.)(?=\\p{Lu})", function(n) ifelse(nchar(n)>0,n," "), x, perl=TRUE,backref=-1)
[1] "Text I Want To Display With Spacesmarch"
Here, the ^\\p{Ll}*(\\p{L}) part matches 0+ lowercase letters and captures the next letter (the first uppercase one in this example) into Group 1, which is passed to the anonymous function as the n argument. If the length of n is non-zero, this alternative matched and we need to replace with this value. Else, we replace with a space.
Since this is tagged perl, my 2 cents:
Can you chain together the substitutions inside sub() and gsub()? In newer perl versions an /r option can be added to the s/// substitution so the matched string can be returned "non-destructively" and then matched again. This allows hackish match/substitution/rematches without mastering advanced syntax, e.g.:
perl -E '
say "marchTextIWantToDisplayWithSpacesmarch" =~
s/\Amarch//r =~ s/([[:upper:]])/ $1/gr =~ s/\A\s//r;'
Output
Text I Want To Display With Spacesmarch
This seems to be what #pushpesh-kumar-rajwanshi and #akrun are doing by wrapping gsub inside sub() (and vice versa). In general I don't think perl = T captures the full magnificently advanced madness of perl regexps ;-) but gsub/sub must be fast operating on vectors, no?

Add a white-space between number and special character condition R

I'm trying to use stringr or R base calls to conditionally add a white-space for instances in a large vector where there is a numeric value then a special character - in this case a $ sign without a space. str_pad doesn't appear to allow for reference vectors.
For example, for:
$6.88$7.34
I'd like to add a whitespace after the last number and before the next dollar sign:
$6.88 $7.34
Thanks!
If there is only one instance, then use sub to capture digit and the $ separately and in the replacement add the space between the backreferences of the captured group
sub("([0-9])([$])", "\\1 \\2", v1)
#[1] "$6.88 $7.34"
Or with a regex lookaround
gsub("(?<=[0-9])(?=[$])", " ", v1, perl = TRUE)
data
v1 <- "$6.88$7.34"
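The difference between the sub and gsub variants only shows up when an element contains more than one digit-dollar boundary. The following sketch uses a made-up vector (an assumption, not data from the question) to illustrate it.

```r
# Illustrative inputs: one boundary, two boundaries, and an already spaced value.
v <- c("$6.88$7.34", "$1.00$2.00$3.00", "$5 $6")

# sub() rewrites only the first digit-then-dollar pair in each element.
first_only <- sub("([0-9])([$])", "\\1 \\2", v)

# gsub() with lookarounds inserts a space at every such boundary.
all_matches <- gsub("(?<=[0-9])(?=[$])", " ", v, perl = TRUE)

print(first_only)
print(all_matches)
```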
This will work if you are working with a vectored string:
mystring<-as.vector('$6.88$7.34 $8.34$4.31')
gsub("(?<=\\d)\\$", " $", mystring, perl=T)
[1] "$6.88 $7.34 $8.34 $4.31"
This includes cases where there is already space as well.
Regarding the question asked in the comments:
mystring2<-as.vector('Regular_Distribution_Type† Income Only" "Distribution_Rate 5.34%" "Distribution_Amount $0.0295" "Distribution_Frequency Monthly')
gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]]+)", "_", mystring2, perl=T)
[1] "Regular_Distribution_Type<U+2020> Income_Only\" \"Distribution_Rate 5.34%\" \"Distribution_Amount $0.0295\" \"Distribution_Frequency_Monthly"
Note that the \ appears due to nested quotes in the vector, should not make a difference. Also <U+2020> appears due to encoding the special character.
Explanation of regex:
(?<=[[:alpha:]]) This first part is a positive look-behind created by ?<=, this basically looks behind anything we are trying to match to make sure what we define in the look behind is there. In this case we are looking for [[:alpha:]] which matches an alphabetic character.
We then check for a blank space with \s, in R we have to use a double escape so \\s, this is what we are trying to match.
Finally we use (?=[[:alpha:]]+), which is a positive look-ahead defined by ?= that checks to make sure our match is followed by another letter as explained above.
The logic is to find a blank space between letters, and match the space, which then is replaced by gsub, with a _
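A shorter input makes the behaviour of this pattern easier to follow. The sketch below uses a simplified string (an assumption, loosely based on the data above) and also tries the lookahead without the + quantifier, which appears to be redundant here because a single following letter is enough for the assertion to succeed.

```r
s <- "Income Only 5.34% Monthly"

# Pattern from the answer: a whitespace character with a letter on both sides.
joined <- gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]]+)", "_", s, perl = TRUE)

# Same idea without the quantifier inside the lookahead.
simpler <- gsub("(?<=[[:alpha:]])\\s(?=[[:alpha:]])", "_", s, perl = TRUE)

print(joined)
print(simpler)
```

The spaces next to "5.34%" are left alone because one of their neighbours is not a letter.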
See all the regex here
