Mid sentence carriage return with regex - r

I have text as follows.
mytext<-c("There is a\nlot of stuff","There is a\nlot of stuff\n","There is a\n lot of stuff","Stuff is everywhere\n\n\n\n around here. Clean it\n up")
I'd like to get rid of the \n in the middle of the sentence with the output being:
There is a lot of stuff
There is a lot of stuff\n
There is a lot of stuff
Stuff is everywhere around here. Clean it up
I have tried:
gsub("([a-z]\\s*)\n+(\\s*[a-z])", "\\1 \\2", mytext)
but it gives the output:
[1] "There is a lot of stuff" "There is a lot of stuff"
[3] "There is a lot of stuff" "Stuff is everywhere\n\n\n around here. Clean it up"
I don't seem to be able to get rid of the mid sentence \n when there are multiples of them. Using the greedy operator with \n gives me odd results.

You may use
gsub("(?:\\h*\\R)++(?!\\z)\\h*", " ", mytext, perl=TRUE)
See the regex demo and the R demo online.
Details
(?:\\h*\\R)++ - 1 or more occurrences (matched possessively thanks to ++ quantifier, so that no backtracking could occur into the non-capturing group pattern) of:
\\h* - 0 or more horizontal whitespaces.
\\R - any line break sequence
(?!\\z) - not at the very end of string.
\\h* - 0 or more horizontal whitespaces.
Since it is a PCRE pattern, perl=TRUE is required.
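As a quick check, here is the pattern applied to the sample vector from the question. The values in the comments were worked out by hand from the pattern, so treat this as a sketch rather than captured console output; `cleaned` is just an illustrative name.

```r
mytext <- c("There is a\nlot of stuff", "There is a\nlot of stuff\n",
            "There is a\n lot of stuff",
            "Stuff is everywhere\n\n\n\n around here. Clean it\n up")

# Collapse each run of line breaks (plus surrounding horizontal whitespace)
# into one space, unless the run reaches the very end of the string.
cleaned <- gsub("(?:\\h*\\R)++(?!\\z)\\h*", " ", mytext, perl = TRUE)

cleaned[2]  # trailing line break is kept: "There is a lot of stuff\n"
cleaned[4]  # "Stuff is everywhere around here. Clean it up"
```

Because the whole run is consumed in one match, no extra spaces are left behind.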

I think we can use a negative lookahead here.
gsub('\n(?!$)', ' ', mytext, perl = TRUE)
#[1] "There is a lot of stuff"  "There is a lot of stuff\n"
#[3] "There is a  lot of stuff" "Stuff is everywhere     around here. Clean it  up"
This replaces every \n except the ones at the end of the string. Each \n becomes its own space, so a run of line breaks (or a \n followed by a space) leaves several spaces in a row.
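If the one-for-one replacement is used, a second pass can squeeze the leftover spaces. This is a minimal sketch; the names `single` and `squeezed` are illustrative, and the second gsub assumes no other runs of spaces in the text need to be preserved.

```r
mytext <- c("There is a\nlot of stuff", "There is a\nlot of stuff\n",
            "There is a\n lot of stuff",
            "Stuff is everywhere\n\n\n\n around here. Clean it\n up")

# Step 1: every "\n" that is not at the end of the string becomes one space.
single <- gsub("\n(?!$)", " ", mytext, perl = TRUE)

# Step 2: collapse runs of two or more spaces into a single space.
squeezed <- gsub(" {2,}", " ", single)

squeezed[4]  # "Stuff is everywhere around here. Clean it up"
```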

Related

Put spaces around all punctuation but excluding apostrophes

I'm new to this, so I'm sorry if this is a stupid question... I need help with a bit of code in R...
I have a bit of code (below) which puts a space around all my punctuation in all txt files in a folder. It's lovely, but I don't want it to add space around apostrophes (') -
Can anybody help me exclude apostrophes in that bit, gsub("(\\.+|[[:punct:]])", " \\1 ", tx)?
Or is that how you would do it? (with [^ ?)
I get this:
"I want : spaces around all these marks ; : ! ? . but i didn ’ t want it there in didn ’ t"
I want this:
"I want : spaces around all these marks ; : ! ? . but i didn’t want it there in didn’t"
for (file in filelist) {
  tx <- readLines(file)
  tx2 <- gsub("(\\.+|[[:punct:]])", " \\1 ", tx)
  writeLines(tx2, con = file)
}
You can use
tx <- "I want: spaces around all these marks;:!?.but i didn’t want it there in didn't"
gsub("\\s*(\\.+|[[:punct:]])(?<!\\b['’]\\b)\\s*", " \\1 ", tx, perl=TRUE)
## => [1] "I want : spaces around all these marks ; : ! ? . but i didn’t want it there in didn't"
The perl=TRUE only means that the regex is handled with the PCRE library (note that PCRE regex engine is not the same as Perl regex engine).
See the R demo online and the regex demo.
Details:
\s* - zero or more whitespaces
(\.+|[[:punct:]]) - Group 1 (\1): one or more dots, or a punctuation char
(?<!\b['’]\b) - immediately on the left, there must be no ' or ’ enclosed with word chars
\s* - zero or more whitespaces
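The breakdown above can be tried on a small string. This sketch uses an ASCII-only variant of the pattern (straight apostrophe only) so it does not depend on locale or encoding; `space_punct` is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: pad punctuation with spaces, but leave an apostrophe
# alone when it sits between two word characters (as in "didn't").
space_punct <- function(x) {
  gsub("\\s*(\\.+|[[:punct:]])(?<!\\b'\\b)\\s*", " \\1 ", x, perl = TRUE)
}

space_punct("Hello,world! I didn't know")
# "Hello , world ! I didn't know"
```

The same helper could be called on the lines returned by readLines() inside the file loop from the question.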
We may also match the ’ and SKIP it, then match all other punctuation marks:
gsub("’(*SKIP)(*FAIL)|([[:punct:].])", " \\1 ", tx, perl = TRUE)
-output
[1] "I want : spaces around all these marks ; : ! ? . but i didn’t want it there in didn’t"
data
tx <- "I want:spaces around all these marks;:!?. but i didn’t want it there in didn’t"

Replace single quotes without changing the apostrophe

I have a data frame with a column Title, and I want to replace the single quotes with double quotes without changing the apostrophe. For example, 'I don't go to work tomorrow' should become "I don't go to work tomorrow".
I tried like this:
gsub("(\\w'\\w+) |, ", "\\1", "I don't go to work tomorrow")
I have tried a couple of ways, but have not got the result as expected.
I tried str_replace_all() in stringr, but it replaces every ' with ". Any recommendation would be appreciated.
I think your rule is perhaps as simple as: if an apostrophe has something (non-space) before and after it, then don't replace it; otherwise, replace it.
gsub("^'|(?<= )'|'(?= )|'$", '"', "'I don't go to work tomorrow'", perl = TRUE)
# [1] "\"I don't go to work tomorrow\""
(Updated so that it does not consume the preceding/following space, if present.)
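Since the question is about a data frame column, here is a minimal sketch of the same call applied column-wise. The data frame `df` and its second row are made up for illustration; only the column name Title comes from the question.

```r
# Hypothetical data; only the column name `Title` is taken from the question.
df <- data.frame(Title = c("'I don't go to work tomorrow'", "'It's a test'"),
                 stringsAsFactors = FALSE)

# Replace a quote at the start/end of the string or next to a space;
# an apostrophe with non-space characters on both sides is left alone.
df$Title <- gsub("^'|(?<= )'|'(?= )|'$", '"', df$Title, perl = TRUE)

writeLines(df$Title)
# "I don't go to work tomorrow"
# "It's a test"
```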
Patterns
To match an apostrophe only at the start/end of the string:
^'|'$
See the regex demo
If the apostrophe should be matched only outside a word, you may use
\b'\b(*SKIP)(*FAIL)|'
See this regex demo. Here, the ' is matched only if it is not enclosed on both ends with letters, digits or underscores since all ' that are enclosed with word chars are skipped/failed.
If you need to match a ' only when it is not between two letters, use
'(?!(?<=[A-Za-z]')[A-Za-z]) # ASCII only
'(?!(?<=\p{L}')\p{L}) # Any Unicode letters
See this regex demo.
Usage
gsub("^'|'$", '"', "'I don't go to work tomorrow 2'5.'")
## => "I don't go to work tomorrow 2'5."
gsub("\\b'\\b(*SKIP)(*FAIL)|'", '"', "'I don't go to work tomorrow 2'5.'", perl=TRUE)
## => "I don't go to work tomorrow 2'5."
gsub("'(?!(?<=\\p{L}')\\p{L})", '"', "'I don't go to work tomorrow 2'5.'", perl=TRUE)
## => "I don't go to work tomorrow 2"5."
See the R demo online.

How to replace words between two punctuations

I have a dataset that looks like the following
sentence <-
"active ingredients: avobenzone, octocrylene, octyl salicylate.
other stuff inactive ingredients: water, glycerin, edta."
And I am trying to get
"avobenzone, octocrylene, octyl salicylate, water, glycerin, edta."
The logic I'm thinking of, in plain English, is: match anything between a punctuation mark and a colon and remove it, OR match anything between the beginning of the string and a colon and remove it. I am using gsub in R and have gotten this far:
gsub("([:punct:][^:]*:)|^([^:]*:)", "", sentence)
but my result is this...
[1] " avobe water, glycerin, edta."
Why is this catching everything from the first word all the way to the last colon instead of the first? Can someone point me in the right direction to understand this logic?
Thank you!
At least one way is:
gsub(".*?:\\s*(.*?)\\.", "\\1, ", sentence)
[1] "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta, "
Notice the ? after .* which makes the match non-greedy (lazy). Without the ?, .* matches as much as possible.
Addition:
The idea is to replace everything except the part that you want with nothing. You said that you wanted to stop at punctuation marks, but you obviously did not want to stop at commas, so I took the liberty of interpreting the problem as finding the parts of the string between a colon and a period. In my expression, .*?: matches everything up to the first colon. I put in \\s* to also cut out any blank spaces that might follow the colon. We want everything after that up to the next period, which is represented by .*?\\. But we want to keep that part, so I put it in parentheses to make it a 'capture group'. Because it is in parens, whatever is between the colon and the period is stored in a backreference called \1 (but you have to type \\1 to get the string \1). I also added ", " (comma-blank) after the capture group in the replacement to help separate it from whatever comes next. So this will take
active ingredients: avobenzone, octocrylene, octyl salicylate.
and replace it with
avobenzone, octocrylene, octyl salicylate, 
Since I used gsub (global substitution), it will then start over and do the same thing to the rest of the string, replacing
other stuff inactive ingredients: water, glycerin, edta.
with
water, glycerin, edta, 
Sorry about the ugly trailing ", ".
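One way to tidy the trailing ", " is a second substitution. This is a sketch, not the answerer's code: it switches to perl = TRUE with the (?s) flag so that . is guaranteed to match the embedded line break under PCRE, and `res` is an illustrative name.

```r
sentence <- "active ingredients: avobenzone, octocrylene, octyl salicylate.
other stuff inactive ingredients: water, glycerin, edta."

# Keep only the text between each colon and the next period.
# (?s) makes . match the newline under PCRE (an assumption of this sketch).
res <- gsub("(?s).*?:\\s*(.*?)\\.", "\\1, ", sentence, perl = TRUE)

# Swap the trailing ", " for the final period.
res <- sub(", $", ".", res)
res
# "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta."
```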

Negation in R, how can I replace words following a negation in R?

I'm following up on a question that has been asked here about how to add the prefix "not_" to a word following a negation.
In the comments, MrFlick proposed a solution using a regular expression gsub("(?<=(?:\\bnot|n't) )(\\w+)\\b", "not_\\1", x, perl=T).
I would like to edit this regular expression in order to add the not_ prefix to all the words following "not" or "n't" until there is some punctuation.
If I'm editing cptn's example, I'd like:
x <- "They didn't sell the company, and it went bankrupt"
To be transformed into:
"They didn't not_sell not_the not_company, and it went bankrupt"
Can the use of backreference still do the trick here? If so, any example would be much appreciated. Thanks!
You may use
(?:\bnot|n't|\G(?!\A))\s+\K(\w+)\b
and replace with not_\1. See the regex demo.
Details
(?:\bnot|n't|\G(?!\A)) - either of the three alternatives:
\bnot - whole word not
n't - n't
\G(?!\A) - the end of the previous successful match position
\s+ - 1+ whitespaces
\K - match reset operator that discards the text matched so far
(\w+) - Group 1 (referenced to with \1 from the replacement pattern): 1+ word chars (digits, letters or _)
\b - a word boundary.
R demo:
x <- "They didn't sell the company, and it went bankrupt"
gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", x, perl=TRUE)
## => [1] "They didn't not_sell not_the not_company, and it went bankrupt"
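To see the \G chaining stop at punctuation and restart at a later negation, here is a sketch on a sentence with two negations. The wrapper name `mark_negated` and the test sentence are made up for illustration.

```r
# Hypothetical wrapper around the pattern above.
mark_negated <- function(x) {
  gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", x, perl = TRUE)
}

mark_negated("I do not like it, but she didn't mind at all.")
# "I do not not_like not_it, but she didn't not_mind not_at not_all."
```

Each chain ends at the first character that is not whitespace after a word (here "," and "."), because \s+ can no longer match at the position where the previous match ended.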
First you should split the string on the punctuation you want. For example:
x <- "They didn't sell the company, and it went bankrupt. Then something else"
x_split <- strsplit(x, split = "[,.]")
[[1]]
[1] "They didn't sell the company" " and it went bankrupt" " Then something else"
and then apply the regex to every element of the list x_split. Finally merge all the pieces (if needed).
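A minimal sketch of this split, apply, and merge idea follows. It departs from the strsplit() call above in one respect: it extracts the clauses with regmatches() so that each "," or "." stays attached and the pieces can be pasted back unchanged. `negate_clause` is a hypothetical helper, and it prefixes every run of word characters after the first negation in a clause.

```r
x <- "They didn't sell the company, and it went bankrupt. Then something else"

# Split into clauses, keeping each trailing "," or "." with its clause.
pieces <- regmatches(x, gregexpr("[^,.]+[,.]?", x))[[1]]

# Hypothetical helper: prefix every word that follows the first negation.
negate_clause <- function(piece) {
  m <- regexpr("\\bnot\\b|n't", piece, perl = TRUE)
  if (m[1] == -1L) return(piece)                     # no negation here
  end_neg <- m[1] + attr(m, "match.length") - 1L     # last char of the negation
  head <- substr(piece, 1L, end_neg)
  rest <- substr(piece, end_neg + 1L, nchar(piece))
  paste0(head, gsub("(\\w+)", "not_\\1", rest, perl = TRUE))
}

# Apply to each clause and merge the pieces again.
result <- paste(vapply(pieces, negate_clause, character(1)), collapse = "")
result
# "They didn't not_sell not_the not_company, and it went bankrupt. Then something else"
```

Words containing an apostrophe after the negation would be split into two prefixed parts, so the inner pattern may need adjusting for such text.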
This is not ideal, but gets the job done:
x <- "They didn't sell the company, and it did not go bankrupt. That's it"
gsub("((^|[[:punct:]]).*?(not|n't)|[[:punct:]].*?((?<=\\s)[[:punct:]]|$))(*SKIP)(*FAIL)|\\s",
" not_", x,
perl = TRUE)
# [1] "They didn't not_sell not_the not_company, and it did not not_go not_bankrupt. That's it"
Notes:
This uses the (*SKIP)(*FAIL) trick to skip any pattern you don't want the regex to match. It basically replaces every space with " not_" except for those spaces that fall between:
Start of string or punctuation and "not" or "n't" or
Punctuation and Punctuation (not followed by space) or end of string

How to remove sentences with conjunctions in R

I have text, an example of which is as follows
Input
c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
The expected output is
,At the end of the study everything was great\n,Some other sentence\nThe test ended.
,Not sure how to get this regex sorted\n\nHow do I do this
I tried:
x[, y] <- gsub(".*[Bb]ut .*?(\\.|\n|:)", "", x[, y])
but it eradicated the whole sentence. How do I remove the phrase with 'but' in it and keep the rest of the phrases in each sentence?
You may use
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.", ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)
See the R demo online
The PCRE pattern matches:
.* - 0 or more chars other than line break chars, as many as possible
\\bbut\\b - a whole word but (\b are word boundaries)
.* - 0 or more chars other than line break chars, as many as possible
[\r\n]* - 0 or more line break chars.
Note that the first gsub has a perl=TRUE argument that makes R use the PCRE regex engine to parse the pattern, and . does not match a line break char there. The second gsub uses a TRE (default) regex engine, and one needs to use (?n) inline modifier to make . fail to match line break chars there.
Note that you mixed up "\n" and "/n", which I did correct.
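A small made-up input shows the effect of the word boundaries and of ignore.case in the PCRE version. The vector `y` is hypothetical and only there to illustrate the behaviour.

```r
# Hypothetical input: one line starts with "But", another contains "butter".
y <- "Keep this\nBut drop this\nbutter stays"

# Remove every line containing the whole word "but" (any case),
# together with the line break(s) that follow it.
gsub(".*\\bbut\\b.*[\r\n]*", "", y, ignore.case = TRUE, perl = TRUE)
# "Keep this\nbutter stays"
```

The line with "butter" survives because \b requires a non-word character (or the string edge) right after "but".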
My idea for a solution:
1) Simply catch all chars which are no linebreak ([^\n]) before and after the "but".
2) (Edit) To address the issue Wiktor found, we also have to check that a non-letter character ([^a-zA-Z]) comes directly before and after the "but".
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",
",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
> gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", x)
[1] ",At the end of the study everything was great\n\nSome other sentence\n The test ended."
[2] ",Not sure how to get this regex sorted\n\nHow do I do this"
