How to remove sentences with conjunctions in R - r

I have text, an example of which is as follows
Input
c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
The expected output is
,At the end of the study everything was great\n,Some other sentence\nThe test ended.
,Not sure how to get this regex sorted\n\nHow do I do this
I tried:
x[, y] <- gsub(".*[Bb]ut .*?(\\.|\n|:)", "", x[, y])
but it eradicated the whole sentence. How do I remove the phrase with 'but' in it and keep the rest of the phrases in each sentence?

You may use
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.", ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)
See the R demo online
The PCRE pattern matches:
.* - 0 or more chars other than line break chars, as many as possible
\\bbut\\b - the whole word but (\b is a word boundary)
.* - 0 or more chars other than line break chars, as many as possible
[\r\n]* - 0 or more line break chars.
Note that the first gsub has a perl=TRUE argument that makes R use the PCRE regex engine to parse the pattern, and . does not match a line break char there. The second gsub uses a TRE (default) regex engine, and one needs to use (?n) inline modifier to make . fail to match line break chars there.
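As a minimal sketch of the behaviour described above, the PCRE variant can be checked on a small input. The sample strings and variable names are made up for illustration and are not taken from the question:

```r
# Hypothetical three-line input; only the middle line contains the word "but".
lines_in <- "first line\nsecond line but with a conjunction\nthird line"

# perl = TRUE: "." stops at "\n", so only the line containing "but" is consumed,
# and [\r\n]* then removes the line break that follows it.
lines_out <- gsub(".*\\bbut\\b.*[\r\n]*", "", lines_in,
                  ignore.case = TRUE, perl = TRUE)
print(lines_out)

# The word boundaries prevent partial-word matches such as "butter".
unchanged <- gsub(".*\\bbut\\b.*[\r\n]*", "", "butter is fine",
                  ignore.case = TRUE, perl = TRUE)
print(unchanged)
```

Here the first call should leave only the first and third lines, and the second call should return its input untouched.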

Note that you mixed up "\n" and "/n", which I did correct.
My idea for a solution:
1) Simply catch all chars which are no linebreak ([^\n]) before and after the "but".
2) (Edit) To address the issue Wiktor found, we also have to check that no letter is directly before or after the "but", i.e. that it is surrounded by non-letters ([^a-zA-Z]).
x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",
",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
> gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", x)
[1] ",At the end of the study everything was great\n\nSome other sentence\n The test ended."
[2] ",Not sure how to get this regex sorted\n\nHow do I do this"
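One thing to keep in mind with the character-class approach: [^a-zA-Z] has to consume an actual character, so a "but" that is the very first word of the string is not matched. The sketch below compares it with a word-boundary variant; that variant is an assumption of mine rather than part of the answer above, and the input string is hypothetical:

```r
# Hypothetical input where "but" is the very first word of the string.
y <- "but at the start\nkeep me"

# Pattern from the answer above: it needs a non-letter before "but",
# and there is no character at all before it here, so nothing matches.
res_class <- gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", y)

# Assumed alternative: \\b (with perl = TRUE) is zero-width and also
# matches at the start of the string.
res_boundary <- gsub("[^\n]*\\bbut\\b[^\n]*", "", y, perl = TRUE)

print(res_class)
print(res_boundary)
```

With the word boundary, the first line is removed and the line break is kept, as in the outputs shown above.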

Related

Passing function to R regex-based tools

This string manipulation problem has evaded my best efforts. I have a string, e.g.
eg_str="[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"
for which I would like to replace all spaces in the wildcard for ([wildcard].md) with underscores. My first thought was to use either gsub or stringr's str_replace_all to pass the appropriate substrings to a simple function. Something like
convert_space_to_underscore<-function(string){
return(str_replace(string," ","_"))
}
normal_eg_str<-gsub("\\((.+?)md\\)",paste0("(",convert_space_to_underscore("\\1"),"md)"),normal_eg_str)
or
normal_eg_str<-str_replace_all(document,"\\((.+?)md\\)",paste0("(",convert_space_to_underscore("\\1"),".md)"))
When I run these however, it appears that the argument to convert_space_to_underscore is being passed, rather than the output, because the string returns unchanged (if you make an error in the paste0 component, say have paste0("(",convert_space_to_underscore("\\1"),".m)"), then the string returns as
eg_str="[probability space](posts/probability space.m) is ... [Sigma Field](posts/Sigma Field.m)"
so I'm quite sure that what is happening is that str_replace_all and gsub are simply not evaluating the function).
Is there a way to force evaluation? This would be most ideal, as it would allow for the regex component to remain somewhat readable. However, I would welcome any pure-regex solutions as well; my attempts have all led to greedy errors, no matter where I seem to sprinkle ? and {0} special characters. (Word of caution: there will be some matching substrings with more than one space, e.g. [Dynklin's Pi Lambda](posts/dynklins pi lambda.md))
You can use
library(stringr)
eg_str <- "[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"
str_replace_all(eg_str, "\\([^()]+\\.md\\)", function(x) gsub(" ", "_", x, fixed=TRUE) )
## => [1] "[probability space](posts/probability_space.md) is ... [Sigma Field](posts/Sigma_Field.md)"
See online R demo.
NOTE: To replace one or more whitespace chunks with a single underscore, you will need a regex in gsub: gsub("\\s+", "_", x).
The first regex finds all strings that
\( - start with (
[^()]+ - have one or more chars other than ( and )
\.md - a .md string
\) - and end with )
Then, the match is passed to an anonymous function that replaces each regular space with a _ (with gsub(" ", "_", x, fixed=TRUE)).
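If stringr is not available, the same "match, then transform each match with a function" idea can be sketched in base R with gregexpr and regmatches<-. This is only an illustration under that assumption; the variable names are hypothetical, and the second link is the multi-space example from the question:

```r
eg_str <- "[probability space](posts/probability space.md) is ... [Dynklin's Pi Lambda](posts/dynklins pi lambda.md)"

# Locate every "(....md)" chunk that contains no nested parentheses.
m <- gregexpr("\\([^()]+\\.md\\)", eg_str)

# Extract the matches, replace the spaces inside each one, and write them back.
fixed_str <- eg_str
regmatches(fixed_str, m) <- lapply(
  regmatches(fixed_str, m),
  function(s) gsub(" ", "_", s, fixed = TRUE)
)
print(fixed_str)
```

Only the text inside the parenthesised link targets changes; the link labels in square brackets keep their spaces.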
A base R solution (less readable, but using a plain regex):
eg_str <- "[probability space](posts/probability space.md) is ... [Sigma Field](posts/Sigma Field.md)"
gsub("(?:\\G(?!^)|\\()[^()\\s]*\\K\\s+(?=[^()]*\\.md\\))", "_", eg_str, perl=TRUE)
See this R demo online. See this regex demo. Details:
(?:\G(?!^)|\() - end of the preceding match or a ( char
[^()\s]* - any 0 or more chars other than (, ) and whitespace
\K - match reset operator that discards all text matched so far from the overall match memory buffer
\s+ - one or more whitespaces
(?=[^()]*\.md\)) - there should be zero or more chars other than ( and ) followed with .md) immediately to the right of the current location.

How to strip off a file ending but only when it starts with a non-digit? (a regex problem related to file_path_sans_ext)

I would like to write a function rm_ext similar to tools::file_path_sans_ext but does not strip off file endings if they start with a digit. By replacing [:alnum:] by [:alpha:] in tools::file_path_sans_ext I almost got there, but if the base name of the file ends in a dot itself, it fails:
rm_ext <- function(x) sub("([^.]+)\\.[[:alpha:]]+$", "\\1", x) # adapted from tools::file_path_sans_ext()
rm_ext("test.string.with.dots.but.ending.alpha=0.25.rda") # works
rm_ext("test.string.with.dots.but.without.ending.alpha=0.25") # works
rm_ext("test.string.with.dots.but.ending.alpha=0.25.") # fails (should remove the final . too)
I tried to match [:alpha:] or EOL, but that didn't make the last case work.
Note: As a comparison, tools::file_path_sans_ext (of course) fails, see tools::file_path_sans_ext("test.string.with.dots.but.without.ending=0.25"). Also note that this is somewhat related but different.
You may use
\.(?:[^0-9.][^.]*)?$
See the regex demo.
Details
\. - a dot
(?:[^0-9.][^.]*)? - an optional sequence of a char other than a dot and a digit and then any 0+ chars other than a dot
$ - end of string.
In the code:
sub("\\.(?:[^0-9.][^.]*)?$", "", x)
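Wrapped into the rm_ext function from the question, the pattern can be checked against the three test cases. This is a sketch; perl = TRUE is added here as an assumption so that the non-capturing group is handled by PCRE, and the "archive.tar.gz" input is a made-up extra case:

```r
# rm_ext: strip the last extension unless it starts with a digit;
# a lone trailing dot is removed as well.
rm_ext <- function(x) sub("\\.(?:[^0-9.][^.]*)?$", "", x, perl = TRUE)

files <- c(
  "test.string.with.dots.but.ending.alpha=0.25.rda",
  "test.string.with.dots.but.without.ending.alpha=0.25",
  "test.string.with.dots.but.ending.alpha=0.25.",
  "archive.tar.gz"  # hypothetical extra case
)
stripped <- rm_ext(files)
print(stripped)
```

The second input is left alone because its last dot is followed by a digit, while the third one only loses its trailing dot.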

How to add the removed space in a sentence?

I have the following string:
x = "marchTextIWantToDisplayWithSpacesmarch"
I would like to delete the 'march' portion at the beginning of the string and then add a space before each uppercase letter in the remainder to yield the following result:
"Text I Want To Display With Spacesmarch"
To insert whitespace, I used gsub("([a-z]?)([A-Z])", "\\1 \\2", x, perl= T) but I have no clue how to modify the pattern so that the first 'march' is excluded from the returned string. I'm trying to get better at this so any help would be greatly appreciated.
An option would be to capture the upper case letter as a group ((...)) and in the replacement create a space followed by the backreference (\\1) of the captured group
gsub("([A-Z])", " \\1", x)
#[1] "march Text I Want To Display With Spacesmarch"
If we need to remove the 'march'
sub("\\b[a-z]\\w+\\s+", "", gsub("([A-Z])", " \\1", x))
#[1] "Text I Want To Display With Spacesmarch"
data
x <- "marchTextIWantToDisplayWithSpacesmarch"
No, you can't achieve this with a single gsub call, because you have two separate requirements: first, remove all lowercase letters at the beginning of the string; second, insert a space before every capital letter except the first capital letter of the string that remains after that removal.
A single gsub call would only be possible if some of the existing characters could be reused to make the replacement conditional, which is not the case here. So, as a first step, you can use the ^[a-z]+ regex to get rid of the lowercase letters at the beginning of the string,
sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch")
leaving you with this,
[1] "TextIWantToDisplayWithSpacesmarch"
As the next step, you can use the (?<!^)(?=[A-Z]) regex to insert a space before every capital letter except the first one, since you probably don't want an extra space at the start of your sentence. You can combine both steps and write them like this,
gsub('(?<!^)(?=[A-Z])', ' ', sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch"), perl=TRUE)
which will give you your desired string,
[1] "Text I Want To Display With Spacesmarch"
Edit:
Explanation of (?<!^)(?=[A-Z]) pattern
First, let's just take (?=[A-Z]) pattern,
See the pink markers in this demo
As you can see in the demo, every capital letter is preceded by a pink mark, which is the place where a space will be inserted. But we don't want a space to be inserted before the very first letter, as that is not needed. Hence we need a condition in the regex that excludes a capital letter appearing at the start of the string. For that, we use a negative lookbehind, (?<!^), which means "do not select a position that is preceded by the start of the string", so it discards the uppercase letter that directly follows the start of the string.
See this demo where the pink marker is gone from the very first uppercase letter
Hope this clarifies how every other capital letter is selected but not the very first. Let me know if you have any queries further.
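The two steps can also be wrapped into a small helper and applied to a vector. This is a sketch only; the function name strip_prefix_and_space and the second and third inputs are hypothetical:

```r
# Step 1: drop the leading run of lowercase letters.
# Step 2: insert a space at every position that is followed by an uppercase
#         letter, except at the very start of the string.
strip_prefix_and_space <- function(s) {
  no_prefix <- sub("^[a-z]+", "", s)
  gsub("(?<!^)(?=[A-Z])", " ", no_prefix, perl = TRUE)
}

inputs <- c("marchTextIWantToDisplayWithSpacesmarch", "aprilHelloWorld", "HelloWorld")
spaced <- strip_prefix_and_space(inputs)
print(spaced)
```

Because both sub and gsub are vectorised, the helper works on whole character vectors without an explicit loop.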
You may use a single regex call to gsub coupled with trimws to trim the resulting string:
trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl=TRUE))
## => [1] "Text I Want To Display With Spacesmarch"
It also supports all Unicode lowercase (\p{Ll}) and uppercase (\p{Lu}) letters.
See the R demo online and the regex demo.
Details
^\\p{Ll}+ - 1 or more lowercase letters at the string start
| - or
(?<=.)(?=\\p{Lu}) - any location between any char but linebreak chars and an uppercase letter.
Here is an alternative with a single call to gsubfn regex with some ifelse logic:
> gsubfn("^\\p{Ll}*(\\p{L})|(?<=.)(?=\\p{Lu})", function(n) ifelse(nchar(n)>0,n," "), x, perl=TRUE,backref=-1)
[1] "Text I Want To Display With Spacesmarch"
Here, the ^\\p{Ll}*(\\p{L}) part matches 0+ lowercase letters and captures the next letter into Group 1, which is passed to the anonymous function as its n argument. If the length of n is non-zero, this alternative matched and we need to replace with this value. Otherwise, we replace with a space.
Since this is tagged perl, my 2 cents:
Can you chain together the substitutions inside sub() and gsub()? In newer perl versions an /r option can be added to the s/// substitution so the matched string can be returned "non-destructively" and then matched again. This allows hackish match/substitution/rematches without mastering advanced syntax, e.g.:
perl -E '
say "marchTextIWantToDisplayWithSpacesmarch" =~
s/\Amarch//r =~ s/([[:upper:]])/ $1/gr =~ s/\A\s//r;'
Output
Text I Want To Display With Spacesmarch
This seems to be what #pushpesh-kumar-rajwanshi and #akrun are doing by wrapping gsub inside sub() (and vice versa). In general I don't think perl = T captures the full magnificently advanced madness of perl regexps ;-) but gsub/sub must be fast operating on vectors, no?

Negation in R, how can I replace words following a negation in R?

I'm following up on a question that has been asked here about how to add the prefix "not_" to a word following a negation.
In the comments, MrFlick proposed a solution using a regular expression gsub("(?<=(?:\\bnot|n't) )(\\w+)\\b", "not_\\1", x, perl=T).
I would like to edit this regular expression in order to add the not_ prefix to all the words following "not" or "n't" until there is some punctuation.
If I'm editing cptn's example, I'd like:
x <- "They didn't sell the company, and it went bankrupt"
To be transformed into:
"They didn't not_sell not_the not_company, and it went bankrupt"
Can the use of backreference still do the trick here? If so, any example would be much appreciated. Thanks!
You may use
(?:\bnot|n't|\G(?!\A))\s+\K(\w+)\b
and replace with not_\1. See the regex demo.
Details
(?:\bnot|n't|\G(?!\A)) - either of the three alternatives:
\bnot - whole word not
n't - n't
\G(?!\A) - the end of the previous successful match position
\s+ - 1+ whitespaces
\K - match reset operator that discards the text matched so far
(\w+) - Group 1 (referenced to with \1 from the replacement pattern): 1+ word chars (digits, letters or _)
\b - a word boundary.
R demo:
x <- "They didn't sell the company, and it went bankrupt"
gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", x, perl=TRUE)
## => [1] "They didn't not_sell not_the not_company, and it went bankrupt"
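To see that the chain of \G matches really stops at punctuation, here is a short sketch on a two-sentence input. The sentence and the variable names are made up for illustration:

```r
neg_pattern <- "(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b"

# Hypothetical input: the negation should only affect the first sentence.
s <- "I do not like it. But she does"
negated <- gsub(neg_pattern, "not_\\1", s, perl = TRUE)
print(negated)
```

After "it", the next char is a period rather than whitespace, so \s+ fails, the \G chain is broken, and the words of the second sentence are left alone.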
First you should split the string on the punctuation you want. For example:
x <- "They didn't sell the company, and it went bankrupt. Then something else"
x_split <- strsplit(x, split = "[,.]")
[[1]]
[1] "They didn't sell the company" " and it went bankrupt" " Then something else"
and then apply the regex to every element of the list x_split. Finally merge all the pieces (if needed).
This is not ideal, but gets the job done:
x <- "They didn't sell the company, and it did not go bankrupt. That's it"
gsub("((^|[[:punct:]]).*?(not|n't)|[[:punct:]].*?((?<=\\s)[[:punct:]]|$))(*SKIP)(*FAIL)|\\s",
" not_", x,
perl = TRUE)
# [1] "They didn't not_sell not_the not_company, and it did not not_go not_bankrupt. That's it"
Notes:
This uses the (*SKIP)(*FAIL) trick to skip any pattern you don't want the regex to match. This basically replaces every space with not_ except for those spaces that fall between:
Start of string or punctuation and "not" or "n't" or
Punctuation and Punctuation (not followed by space) or end of string

R utf-8 and replace a word from a sentence based on ending character

I have a requirement where I am working on a large data set that contains double-byte characters in Korean text. I want to look for a character and replace it. In order to display the Korean text correctly in the browser, I have changed the locale settings in R, but I am not sure whether this also applies to the code. Below is my code to change the locale to Korean; the Korean text is displayed properly in the viewer, but the console prints junk characters:
Sys.setlocale(category = "LC_ALL", locale = "korean")
My data is in a data.table that contains a column with Korean text. Example:
"광주광역시 동구 제봉로 49 (남동,(지하))"
I want to get rid of the 1st word which ends with the "시" character. Then I want to get rid of the "(남동,(지하))" at the end. I was trying gsub, but it does not seem to be working.
New <- c("광주광역시 동구 제봉로 49 (남동,(지하))")
data <- as.data.table(New)
data[,New_trunc := gsub("\\b시", "", data$New)]
Please let me know where I am going wrong. Since I want to search for the end of a word, I am using \\b, and since I want to replace any word ending with the "시" character, I am writing it as \\b시. Is this not the right way to write it? And how do I take care of the () at the end of the sentence?
What would be a good source to refer to for regular expressions?
Is a UTF-8 setting needed for the script as well? How do I do that?
Since you need to match the letter you have at the end of the word, you need to place \b (word boundary) after the letter, so as to require a transition from a letter to a non-letter (or end of string) after that letter. A PCRE pattern that will handle this is
"\\s*\\b\\p{L}*시\\b"
Details
\\s* - zero or more whitespaces
\\b - a leading word boundary
\\p{L}* - zero or more letters
시 - your specific letter
\\b - end of the word
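The effect of the word boundary position can be illustrated with an ASCII analogue, which avoids any dependence on the locale. This is a sketch under that assumption: the letter "y" stands in for the Korean letter, [a-z]* stands in for \\p{L}*, and the input is made up:

```r
addr <- "city yard"

# Boundary BEFORE the letter: matches a "y" that starts a word.
before_letter <- sub("\\by", "", addr, perl = TRUE)

# Boundary AFTER the letter: matches a whole word that ends in "y".
after_letter <- sub("\\s*\\b[a-z]*y\\b", "", addr, perl = TRUE)

print(before_letter)
print(after_letter)
```

The first call only strips the initial "y" of "yard", while the second removes the whole first word "city", which mirrors what the question is after.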
The second issue is that you need to remove a set of nested parentheses at the end of the string. You need again to rely on the PCRE regex (perl=TRUE) that can handle recursion with the help of a subroutine call.
> sub("\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] "광주광역시 동구 제봉로 49"
Details:
\\s* - zero or more whitespaces
(\\((?:[^()]++|(?1))*\\)) - Group 1 (will be recursed) matching
\\( - a literal (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1 or more chars other than ( and ) (possessively)
| - or
(?1) - a subroutine call that repeats the whole Group 1 subpattern
\\) - a literal )
$ - end of string.
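The recursive part can be tried on plain ASCII input as well. This is a sketch with made-up strings, showing that only a balanced, possibly nested, parenthesised group at the very end of the string is removed:

```r
paren_pattern <- "\\s*(\\((?:[^()]++|(?1))*\\))$"

# Nested parentheses at the end are removed as one unit.
nested <- sub(paren_pattern, "", "Main St 49 (south,(underground))", perl = TRUE)

# Parentheses that are not at the end of the string are left alone.
middle <- sub(paren_pattern, "", "a (b) c", perl = TRUE)

# Only the last group is removed when there are several.
several <- sub(paren_pattern, "", "a (b) (c(d))", perl = TRUE)

print(c(nested, middle, several))
```

The $ anchor is what restricts the removal to the trailing group.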
Now, if you need to combine both, you would see that R PCRE-powered gsub does not handle Unicode chars in the pattern so easily. You must tell it to use Unicode mode with (*UCP) PCRE verb.
> gsub("(*UCP)\\b\\p{L}*시\\b|\\s*(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE)
[1] " 동구 제봉로 49"
Or using trimws to get rid of the leading/trailing whitespace:
> trimws(gsub("(*UCP)\\b\\p{L}*시\\b|(\\((?:[^()]++|(?1))*\\))$", "", New, perl=TRUE))
[1] "동구 제봉로 49"
See more details about the verb at PCRE Man page.
