Replace single quotes without changing the apostrophe - r

I have a data frame with a column Title, and I want to replace the single quotes with double quotes without changing the apostrophe. For example, 'I don't go to work tomorrow' should become "I don't go to work tomorrow".
I tried like this:
gsub("(\\w'\\w+) |, ", "\\1", "I don't go to work tomorrow")
I have tried a couple of ways but have not got the expected result.
I also tried str_replace_all() from stringr, but it replaces every ' with ". Any recommendation would be appreciated.

I think your rule is perhaps as simple as: if an apostrophe has something (non-space) before and after it, then don't replace it; otherwise, replace it.
gsub("^'|(?<= )'|'(?= )|'$", '"', "'I don't go to work tomorrow'", perl = TRUE)
# [1] "\"I don't go to work tomorrow\""
(Updated so that it does not consume the preceding/following space, if present.)
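Since the question mentions a data frame column, here is a minimal sketch of applying the same pattern to a whole column. The data frame df and its contents are made up for illustration; only the column name Title comes from the question.

```r
# Hypothetical data frame; only the column name "Title" comes from the question.
df <- data.frame(
  Title = c("'I don't go to work tomorrow'", "'It's fine'"),
  stringsAsFactors = FALSE
)

# Replace a quote at the string start/end or next to a space; a quote with
# non-space characters on both sides (an apostrophe) is left alone.
pattern <- "^'|(?<= )'|'(?= )|'$"
df$Title <- gsub(pattern, '"', df$Title, perl = TRUE)

print(df$Title)
```

gsub() is vectorised over its input, so no loop over the rows is needed.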

Patterns
To match an apostrophe only at the start/end of the string:
^'|'$
See the regex demo
If the apostrophe should be matched only outside a word, you may use
\b'\b(*SKIP)(*FAIL)|'
See this regex demo. Here, the ' is matched only if it is not enclosed on both ends with letters, digits or underscores since all ' that are enclosed with word chars are skipped/failed.
If you need to match a ' only when it is not between two letters, use
'(?!(?<=[A-Za-z]')[A-Za-z]) # ASCII only
'(?!(?<=\p{L}')\p{L}) # Any Unicode letters
See this regex demo.
Usage
gsub("^'|'$", '"', "'I don't go to work tomorrow 2'5.'")
## => "I don't go to work tomorrow 2'5."
gsub("\\b'\\b(*SKIP)(*FAIL)|'", '"', "'I don't go to work tomorrow 2'5.'", perl=TRUE)
## => "I don't go to work tomorrow 2'5."
gsub("'(?!(?<=\\p{L}')\\p{L})", '"', "'I don't go to work tomorrow 2'5.'", perl=TRUE)
## => "I don't go to work tomorrow 2"5."
See the R demo online.
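To see where the three patterns diverge, a small side-by-side sketch may help. The input string s is an invented example that contains inner quotes, a contraction and a digit'digit case.

```r
# Hypothetical input: outer quotes, inner quotes, a contraction, and 5'9.
s <- "'say 'no', isn't 5'9'"

patterns <- c(
  edges      = "^'|'$",
  non_word   = "\\b'\\b(*SKIP)(*FAIL)|'",
  non_letter = "'(?!(?<=\\p{L}')\\p{L})"
)

# perl = TRUE is required for the (*SKIP)(*FAIL) verbs and the lookarounds;
# the first pattern works with it as well.
res <- vapply(patterns, function(p) gsub(p, '"', s, perl = TRUE), character(1))
print(res)
```

Only the start/end pattern leaves the inner 'no' untouched, and only the letter-based pattern replaces the quote in 5'9.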

Related

How to add the removed space in a sentence?

I have the following string:
x = "marchTextIWantToDisplayWithSpacesmarch"
I would like to delete the 'march' portion at the beginning of the string and then add a space before each uppercase letter in the remainder to yield the following result:
"Text I Want To Display With Spacesmarch"
To insert whitespace, I used gsub("([a-z]?)([A-Z])", "\\1 \\2", x, perl= T) but I have no clue how to modify the pattern so that the first 'march' is excluded from the returned string. I'm trying to get better at this so any help would be greatly appreciated.
An option would be to capture the upper case letter as a group ((...)) and in the replacement create a space followed by the backreference (\\1) of the captured group
gsub("([A-Z])", " \\1", x)
#[1] "march Text I Want To Display With Spacesmarch"
If we need to remove the 'march'
sub("\\b[a-z]\\w+\\s+", "", gsub("([A-Z])", " \\1", x))
#[1] "Text I Want To Display With Spacesmarch"
data
x <- "marchTextIWantToDisplayWithSpacesmarch"
No, you can't achieve this replacement with a single gsub, because you have two separate requirements: first, remove all lowercase letters at the beginning of the string; second, insert a space before every capital letter except the first capital letter of the string that remains.
A single gsub call would be possible only if some of the existing characters could be reused to make the replacement conditional, which is not the case here. So, as a first step, you can use the ^[a-z]+ regex to get rid of the lowercase letters at the beginning of the string,
sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch")
leaving you with this,
[1] "TextIWantToDisplayWithSpacesmarch"
As the next step, you can use the (?<!^)(?=[A-Z]) regex to insert a space before every capital letter except the first one, since you probably do not want an extra space before your sentence. You can combine both steps and write them like this,
gsub('(?<!^)(?=[A-Z])', ' ', sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch"), perl=TRUE)
which will give you your desired string,
[1] "Text I Want To Display With Spacesmarch"
Edit:
Explanation of (?<!^)(?=[A-Z]) pattern
First, let's just take (?=[A-Z]) pattern,
See the pink markers in this demo
As you can see in the demo, every capital letter is preceded by a pink mark, which is the place where a space will be inserted. But we don't want a space inserted before the very first letter, as it is not needed. Hence we need a condition in the regex that will not select the first capital letter when it appears at the start of the string. For that, we use a negative lookbehind, (?<!^), which means: do not select a position that is preceded by the start of the string. So (?<!^) discards the position before an uppercase letter that sits right at the start of the string.
See this demo where the pink marker is gone from the very first uppercase letter
Hope this clarifies how every other capital letter is selected but not the very first. Let me know if you have any queries further.
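The pink markers from the demos can be imitated in R by inserting a visible character instead of a space. This is only a sketch; the short string s and the choice of "|" as the marker are assumptions made for illustration.

```r
# Shortened, hypothetical input; "|" stands in for the pink markers.
s <- "TextIWant"

# Lookahead only: a marker appears before every capital, including the first.
lookahead_only <- gsub("(?=[A-Z])", "|", s, perl = TRUE)

# With the negative lookbehind, the position at the string start is excluded.
with_lookbehind <- gsub("(?<!^)(?=[A-Z])", "|", s, perl = TRUE)

print(lookahead_only)
print(with_lookbehind)
```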
You may use a single regex call to gsub coupled with trimws to trim the resulting string:
trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl=TRUE))
## => [1] "Text I Want To Display With Spacesmarch"
It also supports all Unicode lowercase (\p{Ll}) and uppercase (\p{Lu}) letters.
See the R demo online and the regex demo.
Details
^\\p{Ll}+ - 1 or more lowercase letters at the string start
| - or
(?<=.)(?=\\p{Lu}) - any location between any char but linebreak chars and an uppercase letter.
Here is an alternative with a single call to gsubfn and some ifelse logic:
> gsubfn("^\\p{Ll}*(\\p{L})|(?<=.)(?=\\p{Lu})", function(n) ifelse(nchar(n)>0,n," "), x, perl=TRUE,backref=-1)
[1] "Text I Want To Display With Spacesmarch"
Here, the ^\\p{Ll}*(\\p{L}) part matches 0+ lowercase letters and captures the next letter into Group 1, which is passed as the n argument to the anonymous function. If the length of n is non-zero, this alternative matched and we need to replace with this value. Otherwise, we replace with a space.
Since this is tagged perl, my 2 cents:
Can you chain together the substitutions inside sub() and gsub()? In newer perl versions an /r option can be added to the s/// substitution so the matched string can be returned "non-destructively" and then matched again. This allows hackish match/substitution/rematches without mastering advanced syntax, e.g.:
perl -E '
say "marchTextIWantToDisplayWithSpacesmarch" =~
s/\Amarch//r =~ s/([[:upper:]])/ $1/gr =~ s/\A\s//r;'
Output
Text I Want To Display With Spacesmarch
This seems to be what @pushpesh-kumar-rajwanshi and @akrun are doing by wrapping gsub inside sub() (and vice versa). In general I don't think perl = T captures the full magnificently advanced madness of perl regexps ;-) but gsub/sub must be fast operating on vectors, no?
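For comparison, the same three substitutions can be written in R as separate steps rather than nested calls. This is a rough R counterpart of the Perl one-liner, not something taken from the answers above; the step1 to step3 names are only illustrative.

```r
x <- "marchTextIWantToDisplayWithSpacesmarch"

# Mirrors s/\Amarch//r : drop the leading "march".
step1 <- sub("^march", "", x)

# Mirrors s/([[:upper:]])/ $1/gr : put a space before every uppercase letter.
step2 <- gsub("([[:upper:]])", " \\1", step1)

# Mirrors s/\A\s//r : remove the single leading space created by step 2.
step3 <- sub("^[[:space:]]", "", step2)

print(step3)
```

Because sub() and gsub() return new strings instead of modifying x, every intermediate result stays available for inspection.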

Negation in R, how can I replace words following a negation in R?

I'm following up on a question that has been asked here about how to add the prefix "not_" to a word following a negation.
In the comments, MrFlick proposed a solution using a regular expression gsub("(?<=(?:\\bnot|n't) )(\\w+)\\b", "not_\\1", x, perl=T).
I would like to edit this regular expression in order to add the not_ prefix to all the words following "not" or "n't" until there is some punctuation.
If I'm editing cptn's example, I'd like:
x <- "They didn't sell the company, and it went bankrupt"
To be transformed into:
"They didn't not_sell not_the not_company, and it went bankrupt"
Can the use of backreference still do the trick here? If so, any example would be much appreciated. Thanks!
You may use
(?:\bnot|n't|\G(?!\A))\s+\K(\w+)\b
and replace with not_\1. See the regex demo.
Details
(?:\bnot|n't|\G(?!\A)) - either of the three alternatives:
\bnot - whole word not
n't - n't
\G(?!\A) - the end of the previous successful match position
\s+ - 1+ whitespaces
\K - match reset operator that discards the text matched so far
(\w+) - Group 1 (referenced to with \1 from the replacement pattern): 1+ word chars (digits, letters or _)
\b - a word boundary.
R demo:
x <- "They didn't sell the company, and it went bankrupt"
gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", x, perl=TRUE)
## => [1] "They didn't not_sell not_the not_company, and it went bankrupt"
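A sketch of the same pattern wrapped in a helper and applied to a vector, to check that the \G chain stops at punctuation. The function name negate_after and the second test sentence are invented for this example.

```r
# Hypothetical helper around the pattern above.
negate_after <- function(x) {
  gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", x, perl = TRUE)
}

x <- c(
  "They didn't sell the company, and it went bankrupt",
  "I do not like it. We agree"
)

res <- negate_after(x)
print(res)
```

In the second sentence the chain ends at the full stop: \G can only continue where the previous match ended, and there \s+ finds a "." instead of whitespace.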
First you should split the string on the punctuation you want. For example:
x <- "They didn't sell the company, and it went bankrupt. Then something else"
x_split <- strsplit(x, split = "[,.]")
[[1]]
[1] "They didn't sell the company" " and it went bankrupt" " Then something else"
and then apply the regex to every element of the list x_split. Finally merge all the pieces (if needed).
This is not ideal, but gets the job done:
x <- "They didn't sell the company, and it did not go bankrupt. That's it"
gsub("((^|[[:punct:]]).*?(not|n't)|[[:punct:]].*?((?<=\\s)[[:punct:]]|$))(*SKIP)(*FAIL)|\\s",
" not_", x,
perl = TRUE)
# [1] "They didn't not_sell not_the not_company, and it did not not_go not_bankrupt. That's it"
Notes:
This uses the (*SKIP)(*FAIL) trick to avoid any pattern you don't want the regex to match. This basically replaces every space with " not_", except for the spaces that fall between:
Start of string or punctuation and "not" or "n't" or
Punctuation and Punctuation (not followed by space) or end of string

how to remove sentences with conjunctions in R

I have text, an example of which is as follows
Input
c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
The expected output is
,At the end of the study everything was great\n,Some other sentence\nThe test ended.
,Not sure how to get this regex sorted\n\nHow do I do this
I tried:
x[, y] <- gsub(".*[Bb]ut .*?(\\.|\n|:)", "", x[, y])
but it eradicated the whole sentence. How do I remove the phrase with 'but' in it and keep the rest of the phrases in each sentence?
You may use
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.", ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)
See the R demo online
The PCRE pattern matches:
.* - 0 or more chars other than line break chars, as many as possible
\\bbut\\b - a whole word but (\b are word boundaries)
.* - 0 or more chars other than line break chars, as many as possible
[\r\n]* - 0 or more line break chars.
Note that the first gsub has a perl=TRUE argument that makes R use the PCRE regex engine to parse the pattern, and . does not match a line break char there. The second gsub uses a TRE (default) regex engine, and one needs to use (?n) inline modifier to make . fail to match line break chars there.
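The engine difference described above can be checked on a tiny string. This is only a sketch: the two-line input s is an assumption, and the (?n) line relies on the behaviour stated in the note above.

```r
s <- "a\nb"

# PCRE (perl = TRUE): "." does not match the line break.
pcre_dot <- gsub(".", "-", s, perl = TRUE)

# TRE (default engine): "." matches the line break as well.
tre_dot <- gsub(".", "-", s)

# TRE with the (?n) inline modifier, as used in the second gsub call above.
tre_n <- gsub("(?n).", "-", s)

print(c(pcre = pcre_dot, tre = tre_dot, tre_n = tre_n))
```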
Note that you mixed up "\n" and "/n", which I did correct.
My idea for a solution:
1) Simply catch all chars which are no linebreak ([^\n]) before and after the "but".
2) (Edit) To address the issue Wiktor found, we also have to check that no letter is directly before or after the "but"; the [^a-zA-Z] on each side requires a non-letter character there.
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",
",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
> gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", x)
[1] ",At the end of the study everything was great\n\nSome other sentence\n The test ended."
[2] ",Not sure how to get this regex sorted\n\nHow do I do this"
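Both answers make sure that "but" is matched as a whole word. A short sketch of why that matters, using a made-up three-line input and, for contrast, a naive pattern without any boundary check (the naive pattern is not from either answer):

```r
# Hypothetical input: "butter" contains "but" but is not the conjunction.
s <- "butter is fine\nbut not this\nkeep"

# Naive pattern without a whole-word check: the "butter" line is removed too.
naive <- gsub(".*but.*[\r\n]*", "", s, ignore.case = TRUE, perl = TRUE)

# Whole-word pattern from the first answer: only the "but" line is removed.
whole_word <- gsub(".*\\bbut\\b.*[\r\n]*", "", s, ignore.case = TRUE, perl = TRUE)

print(naive)
print(whole_word)
```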

Remove several strings between two specific characters

I need help with regex in R.
I have a bunch of strings each of which has a structure similar to this one:
mytext <- "\"Dimitri. It has absolutely no meaning,\": Allow me to him|\"realize that\": Poor Alice! It |\"HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes\": |\"same for the Dislikes. Thank you very much for completing this\": ME.' 'You!' sai"
Notice that this string contains substrings within "" followed by a ":" and some text without quotation marks - until we encounter a "|" - then a new quotation mark appears etc.
Notice also that at the very end there is text after a ":" - but at the VERY end there is no "|"
My objective is to completely eliminate all text starting with any ":" (and INCLUDING ":") and until the next "|" (but "|" has to stay). I also need to eliminate all text that comes after the very last ":"
Finally (that's more of a bonus) - I want to get rid of all "\" characters and all quotation marks - because in the final solution I need to have "clean text": A bunch of strings separated only by "|" characters.
Is it possible?
Here is my awkward first attempt:
gsub('\\:.*?\\|', '', mytext)
This method uses 3 passes of g?sub.
sub("\\|$", "", gsub("[\\\\\"]", "", gsub(":.*?(\\||$)", "|", mytext)))
[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"
The first pass strips out the text between ":" and "|" inclusive and replaces it with "|". The second pass removes backslash and double quote characters, and the third pass removes the "|" at the end.
With a single gsub you can match text after a : (including the :), so long as it doesn't contain a pipe: :[^|]*. This matches the case at the end of the string, too. You can also match double quotes by searching for another pattern after the alternation character (|): [\"]
gsub(":[^|]*|[\"]", "", mytext)
#[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"
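A minimal sketch of the single-gsub approach on a shortened string, followed by a split on "|" to get the clean pieces. The string s and the variable names are invented for illustration.

```r
# Shortened, hypothetical input with the same "quoted": note|"quoted": note layout.
s <- "\"first part\": note one|\"second part\": note two"

# Remove everything from a ":" up to (not including) the next "|",
# and remove the double quotes.
clean <- gsub(":[^|]*|[\"]", "", s)

# fixed = TRUE treats "|" as a literal separator rather than regex alternation.
pieces <- strsplit(clean, "|", fixed = TRUE)[[1]]

print(clean)
print(pieces)
```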

R regex remove apostrophe except the ones preceded and followed by letter

I'm cleaning a text and I'd like to remove any apostrophe except for the ones preceded and followed by letters, such as in: i'm, i'll, he's, etc.
I have the following preliminary solution, which handles many cases, but I want a better one:
rmAps <- function(x) gsub("^\'+| \'+|\'+ |[^[:alpha:]]\'+(a-z)*|\\b\'*$", " ", x)
rmAps("'i'm '' ' 'we end' '")
[1] " i'm we end "
I also tried:
(?<![a-z])'(?![a-z])
But I think I am still missing something.
gsub("'(?!\\w)|(?<!\\w)'", "", x, perl = TRUE)
#[1] "i'm we end "
Remove occasions when your character is not followed by a word character: '(?!\\w).
Remove occasions when your character is not preceded by a word character: (?<!\\w)'.
If either of those situations occurs, you want to remove the apostrophe, so '(?!\\w)|(?<!\\w)' should do the trick. Just note that \\w includes the underscore, and adjust as necessary.
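To see what each half of the alternation contributes, the two lookarounds can be run separately. A small sketch; the input x here is a shortened, made-up string rather than the one from the question.

```r
# Hypothetical short input.
x <- "'i'm' ok"

# Only apostrophes NOT followed by a word character are removed.
not_followed <- gsub("'(?!\\w)", "", x, perl = TRUE)

# Only apostrophes NOT preceded by a word character are removed.
not_preceded <- gsub("(?<!\\w)'", "", x, perl = TRUE)

# Either condition is enough for removal.
combined <- gsub("'(?!\\w)|(?<!\\w)'", "", x, perl = TRUE)

print(c(not_followed, not_preceded, combined))
```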
Another option is
gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", x, perl = TRUE)
In this case, you match any instances when ' is surrounded by word characters: \\w'\\w, and then force that match to fail with (*SKIP)(*FAIL). But, also look for ' using |'. The result is that only occurrences of ' not wrapped in word characters will be matched and substituted out.
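One difference between the two approaches may be worth checking on your own data: \\w'\\w consumes the word character after the apostrophe, so in words with two nearby apostrophes the second one no longer has a word character available before it. A sketch with an invented input:

```r
# Hypothetical input with two apostrophes separated by a single letter.
x <- "rock'n'roll"

# Lookarounds consume nothing, so both apostrophes are kept.
lookaround <- gsub("'(?!\\w)|(?<!\\w)'", "", x, perl = TRUE)

# (*SKIP)(*FAIL) skips past "k'n"; the next attempt starts at the second
# apostrophe, which then matches the plain ' alternative and is removed.
skip_fail <- gsub("\\w'\\w(*SKIP)(*FAIL)|'", "", x, perl = TRUE)

print(c(lookaround, skip_fail))
```

For ordinary contractions such as i'm or he's the two patterns give the same result.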
You can use the following regular expression:
(?<=\w)'(?=\w)
(?<=) is a positive lookbehind. Everything inside needs to match before the next selector
(?=) is a positive lookahead. Everything inside needs to match after the previous selector
\w any alphanumeric character and the underscore
You could also switch \w to e.g. [a-zA-Z] if you want to restrict the results.
→ Here is your example on regex101 for live testing.
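Note that this expression matches the apostrophes that should stay, so it is suited to locating them rather than to removing the others. A minimal sketch with gregexpr(); the input x and the variable name kept are assumptions for illustration.

```r
# Hypothetical short input.
x <- "'i'm' ok"

# Positions of apostrophes that have a word character on both sides.
kept <- as.integer(gregexpr("(?<=\\w)'(?=\\w)", x, perl = TRUE)[[1]])

print(kept)
```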

Resources