Remove several strings between two specific characters - r

I need help with regex in R.
I have a bunch of strings each of which has a structure similar to this one:
mytext <- "\"Dimitri. It has absolutely no meaning,\": Allow me to him|\"realize that\": Poor Alice! It |\"HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes\": |\"same for the Dislikes. Thank you very much for completing this\": ME.' 'You!' sai"
Notice that this string contains substrings within "" followed by a ":" and some text without quotation marks - until we encounter a "|" - then a new quotation mark appears etc.
Notice also that at the very end there is text after a ":" - but at the VERY end there is no "|"
My objective is to completely eliminate all text starting with any ":" (and INCLUDING ":") and until the next "|" (but "|" has to stay). I also need to eliminate all text that comes after the very last ":"
Finally (that's more of a bonus) - I want to get rid of all "\" characters and all quotation marks - because in the final solution I need to have "clean text": A bunch of strings separated only by "|" characters.
Is it possible?
Here is my awkward first attempt:
gsub('\\:.*?\\|', '', mytext)

This method uses 3 passes of g?sub.
sub("\\|$", "", gsub("[\\\\\"]", "", gsub(":.*?(\\||$)", "|", mytext)))
[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"
The first pass strips out the text between ":" and "|" (inclusive) and replaces it with "|". The second pass removes backslashes and double quotes, and the third pass removes the "|" left at the end.

With a single gsub you can match text after a : (including the :), so long as it doesn't contain a pipe: :[^|]*. This matches the case at the end of the string, too. You can also match double quotes by searching for another pattern after the alternation character (|): [\"]
gsub(":[^|]*|[\"]", "", mytext)
#[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"
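A minimal sketch of how the single-pass pattern could be applied to a whole vector of strings. The helper name clean_segments and the demo strings are made up for illustration. The character class is extended with a literal backslash to cover the "bonus" request, on the assumption that real backslashes may occur in the data:

```r
# Hypothetical helper: drops ":" plus everything up to (not including) the
# next "|", and also removes double quotes and literal backslashes.
clean_segments <- function(x) {
  gsub(":[^|]*|[\"\\\\]", "", x)
}

# Made-up input with the same "quoted text": comment | ... structure
demo <- c("\"first part\": drop me|\"second part\": drop me too",
          "\"only part\": trailing text")

cleaned <- clean_segments(demo)
cleaned
# [1] "first part|second part" "only part"
```

Because gsub is vectorised, no loop is needed. Note that a ":" inside the quoted text itself would also trigger a removal, so this only suits data shaped like the example.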

Related

how to remove decimal point between numbers in R

I am trying to remove the decimal points in decimal numbers in R. Please note I want to keep the full stop of strings.
Example:
data= c("It's 6.00pm, and is late.")
I know that I have to use regex for this, but I am struggling. My desired output is:
"It's 6 00pm, and is late."
Thank you in advance.
Try this:
sub("(?<=\\d)\\.(?=\\d)", " ", data, perl = TRUE)
This solution uses lookbehind (?<=...) and lookahead (?=...) to assert that the period you wish to remove is enclosed by digits (thus avoiding a match on the period at the end of the sentence). If you have several such cases within a string, use gsub instead of sub.
I suggest using a simple pattern to find the target text, then adding parentheses to identify the parts of the matching text that you want to retain.
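As a small illustration of the gsub variant, here is a sketch on made-up strings that contain more than one decimal point. The vector name times is hypothetical:

```r
# Made-up examples with several digit.digit occurrences per string
times <- c("It's 6.00pm, and is late.",
           "Open 9.30am to 5.45pm. Closed Sunday.")

# Replace a period only when it sits between two digits
spaced <- gsub("(?<=\\d)\\.(?=\\d)", " ", times, perl = TRUE)
spaced
# [1] "It's 6 00pm, and is late."
# [2] "Open 9 30am to 5 45pm. Closed Sunday."
```

The sentence-ending periods survive because they are not followed by a digit.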
# Test data
data <- c("It's 6.00pm, and is late.")
The target pattern is a literal dot with a string of digits before and after it. \\d+ matches one or more digits and \\. matches a literal dot. Testing the pattern to see if it works:
grepl("\\d+\\.\\d+", data)
Result
TRUE
If we wanted to eliminate the whole thing we could do a simple replacement with an empty string. Testing if this targets the correct text:
sub("\\d+\\.\\d+", "", data)
Result
"It's pm, and is late."
Instead, to discard only a section of the matched text we can identify the parts we want to keep, which is done by surrounding them with parentheses. Once done we can refer to the captured text in the replacement. \\1 refers to the first chunk of text captured and \\2 refers to the second chunk of text, corresponding to the first and second sets of parentheses.
# pattern replacement
sub("(\\d+)\\.(\\d+)", "\\1\\2", data)
Result
[1] "It's 600pm, and is late."
This effectively removes the dot by omitting it from the replacement text.
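The same capture-group idea can also produce the spaced output the question asked for, by putting a space between the two backreferences. This is a sketch on a made-up string, using gsub so that every occurrence is handled:

```r
# Made-up input with two decimal times
x <- "Between 6.00pm and 7.30pm."

# Keep both digit groups, swap the dot between them for a space
spaced <- gsub("(\\d+)\\.(\\d+)", "\\1 \\2", x)
spaced
# [1] "Between 6 00pm and 7 30pm."
```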

Remove whitespace before bracket " (" in R

I have nearly 100,000 rows of scraped data that I have converted to data frames. One column is a string of text characters but is operating strangely. In the example below, there is text, that has bracketed information that I want to remove, and I also want to remove " (c)". However the space in front is not technically a space (is it considered whitespace?).
I am not sure how to reproduce the example here because when I copy/paste a record, it is treated like normal and works, but in the scraped data, it does not. Gut check was to count spaces and it gave me 4, which means the space in front of ( is not a true space. I do not know how to remove this!
My code that I usually would run is as follows. Again, works this way, but does not work in my scraped data.
test<-c("Barry Windham (c) & Mike Rotundo (c)")
test<-gsub("[ ][(]c[)]","",test)
You can consider using:
test<-c("Barry Windham (c) & Mike Rotundo (c)")
gsub("(*UCP)\\s+\\(c\\)", "", test, perl=TRUE)
# => [1] "Barry Windham & Mike Rotundo"
See an online R demo
Details
(*UCP) - makes all shorthand character classes in the PCRE regex (it is PCRE due to perl=TRUE) Unicode aware
\\s+ - any one or more Unicode whitespaces
\\(c\\) - (c) substring.
If you need to keep (c), capture it and use a backreference in the replacement:
gsub("(*UCP)\\s+(\\(c\\))", "\\1", test, perl=TRUE)
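The question could not reproduce the problem by copy/paste. One way to simulate it is to build the string with an explicit non-breaking space (U+00A0). That this is the character in the scraped data is only an assumption, and the variable names below are made up for the sketch:

```r
# Assumption: the odd "space" is a non-breaking space, written as \u00a0
scraped <- "Barry Windham\u00a0(c) & Mike Rotundo\u00a0(c)"

# Inspect the code point of the 14th character (160 means U+00A0, 32 a plain space)
code_point <- utf8ToInt(substr(scraped, 14, 14))

# The plain-space pattern from the question finds nothing to replace
plain_attempt <- gsub("[ ][(]c[)]", "", scraped)

# The Unicode-aware pattern removes the whitespace together with "(c)"
unicode_attempt <- gsub("(*UCP)\\s+\\(c\\)", "", scraped, perl = TRUE)
unicode_attempt
# [1] "Barry Windham & Mike Rotundo"
```

Checking code points with utf8ToInt on a real scraped record should show which whitespace character is actually present.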

How to add the removed space in a sentence?

I have the following string:
x = "marchTextIWantToDisplayWithSpacesmarch"
I would like to delete the 'march' portion at the beginning of the string and then add a space before each uppercase letter in the remainder to yield the following result:
"Text I Want To Display With Spacesmarch"
To insert whitespace, I used gsub("([a-z]?)([A-Z])", "\\1 \\2", x, perl= T) but I have no clue how to modify the pattern so that the first 'march' is excluded from the returned string. I'm trying to get better at this so any help would be greatly appreciated.
An option would be to capture the upper case letter as a group ((...)) and in the replacement create a space followed by the backreference (\\1) of the captured group
gsub("([A-Z])", " \\1", x)
#[1] "march Text I Want To Display With Spacesmarch"
If we need to remove the 'march'
sub("\\b[a-z]\\w+\\s+", "", gsub("([A-Z])", " \\1", x))
#[1] "Text I Want To Display With Spacesmarch"
data
x <- "marchTextIWantToDisplayWithSpacesmarch"
No, you can't do this with a single gsub call, because you have two separate requirements: the first is to remove all lowercase letters at the beginning of the string, and the second is to insert a space before every capital letter except the first capital letter of the string that remains after that removal.
A single gsub call would be possible if some of the existing characters could be reused to make the replacement conditional, which is not the case here. So, as a first step, you can use the ^[a-z]+ regex to get rid of the lowercase letters at the beginning of the string only,
sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch")
leaving you with this,
[1] "TextIWantToDisplayWithSpacesmarch"
And next step you can use this (?<!^)(?=[A-Z]) regex to insert a space before every capital letter except the first one as you might not want an extra space before your sentence. But you can combine both and write them as this,
gsub('(?<!^)(?=[A-Z])', ' ', sub('^[a-z]+', '', "marchTextIWantToDisplayWithSpacesmarch"), perl=TRUE)
which will give you your desired string,
[1] "Text I Want To Display With Spacesmarch"
Edit:
Explanation of (?<!^)(?=[A-Z]) pattern
First, let's just take (?=[A-Z]) pattern,
See the pink markers in this demo
As you can see in the demo, every capital letter is preceded by a pink mark, which is the place where a space will be inserted. But we don't want a space inserted before the very first letter, as it is not needed there. Hence we need a condition in the regex that skips the first capital letter, which appears at the start of the string. For that we use a negative lookbehind, (?<!^), which means: do not select a position that is immediately preceded by the start of the string. This (?<!^) therefore discards the uppercase letter that is preceded only by the start of the string.
See this demo where the pink marker is gone from the very first uppercase letter
Hope this clarifies how every other capital letter is selected but not the very first. Let me know if you have any queries further.
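The marker positions described above can also be made visible directly in R by inserting a visible character instead of a space. This is only a sketch: the short string s and the "|" marker are chosen for illustration.

```r
s <- "TextIWant"

# Lookahead only: a marker lands before every capital, including the first
all_capitals <- gsub("(?=[A-Z])", "|", s, perl = TRUE)
all_capitals
# [1] "|Text|I|Want"

# With the negative lookbehind, the position at the string start is skipped
skip_first <- gsub("(?<!^)(?=[A-Z])", "|", s, perl = TRUE)
skip_first
# [1] "Text|I|Want"
```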
You may use a single regex call to gsub coupled with trimws to trim the resulting string:
trimws(gsub("^\\p{Ll}+|(?<=.)(?=\\p{Lu})", " ", x, perl=TRUE))
## => [1] "Text I Want To Display With Spacesmarch"
It also supports all Unicode lowercase (\p{Ll}) and uppercase (\p{Lu}) letters.
See the R demo online and the regex demo.
Details
^\\p{Ll}+ - 1 or more lowercase letters at the string start
| - or
(?<=.)(?=\\p{Lu}) - any location between any char but linebreak chars and an uppercase letter.
Here is an alternative with a single call to gsubfn (from the gsubfn package) with some ifelse logic:
> library(gsubfn)
> gsubfn("^\\p{Ll}*(\\p{L})|(?<=.)(?=\\p{Lu})", function(n) ifelse(nchar(n)>0,n," "), x, perl=TRUE,backref=-1)
[1] "Text I Want To Display With Spacesmarch"
Here, the ^\\p{Ll}*(\\p{L}) part matches 0+ lowercase letters and captures the following letter (here, the first uppercase one) into Group 1, which is passed to the anonymous function as the n argument. If the length of n is non-zero, this alternative matched and we need to replace the match with that value. Otherwise, we replace with a space.
Since this is tagged perl, my 2 cents:
Can you chain together the substitutions inside sub() and gsub()? In newer perl versions an /r option can be added to the s/// substitution so the matched string can be returned "non-destructively" and then matched again. This allows hackish match/substitution/rematches without mastering advanced syntax, e.g.:
perl -E '
say "marchTextIWantToDisplayWithSpacesmarch" =~
s/\Amarch//r =~ s/([[:upper:]])/ $1/gr =~ s/\A\s//r;'
Output
Text I Want To Display With Spacesmarch
This seems to be what #pushpesh-kumar-rajwanshi and #akrun are doing by wrapping gsub inside sub() (and vice versa). In general I don't think perl = T captures the full magnificently advanced madness of perl regexps ;-) but gsub/sub must be fast operating on vectors, no?

How to replace words between two punctuations

I have a dataset that looks like the following
sentence <-
"active ingredients: avobenzone, octocrylene, octyl salicylate.
other stuff inactive ingredients: water, glycerin, edta."
And I am trying to get
"avobenzone, octocrylene, octyl salicylate, water, glycerin, edta."
The logic that I'm thinking of in plain English is: match anything that is between a punctuation mark and a colon and remove it, OR match everything between the beginning of the string and a colon and remove it. I am using gsub in R and have gotten as far as this:
gsub("([:punct:][^:]*:)|^([^:]*:)", "", sentence)
but my result is this...
[1] " avobe water, glycerin, edta."
Why is this catching everything between the first word all the way to the last colon instead of the first? Can someone point me in the right direction to understand this logic?
Thank you!
At least one way is:
gsub(".*?:\\s*(.*?)\\.", "\\1, ", sentence)
[1] "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta, "
Notice the ? after .* That makes the matching be not greedy. Without the ?, .* matches as much as possible.
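A short sketch of the greedy versus non-greedy difference, on a made-up string with two colons:

```r
s <- "a: one. b: two."

# Greedy: .* runs as far as it can, so the match ends at the LAST colon
greedy <- sub(".*:", "", s)
greedy
# [1] " two."

# Non-greedy: .*? stops at the FIRST colon
lazy <- sub(".*?:", "", s)
lazy
# [1] " one. b: two."
```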
Addition:
The idea of this is to replace everything except the part that you want with nothing. You said that you wanted to stop at punctuation marks, but you obviously did not want to stop at commas, so I took the liberty of interpreting the problem as finding the parts of the string between colon and period. In my expression, .*?: matches everything up to the first colon. I put in \\s* to also cut out any blank spaces that might follow the colon. We want everything after that up to the next period. That is represented by .*?\\. BUT we want to keep that part so I put it in parentheses to make it a 'capture group'. Because it is in parens, whatever is between the colon and the period will be stored in the variable called \1 (but you have to type \\1 to get the string \1). I also added ", " (comma-blank) to the end of the capture group to help separate it from whatever comes next. So this will take
active ingredients: avobenzone, octocrylene, octyl salicylate. and replace it with avobenzone, octocrylene, octyl salicylate, . Since I used gsub (global substitution), it will then start over and try to do the same thing to the rest of the string, replacing other stuff inactive ingredients: water, glycerin, edta. with water, glycerin, edta, . Sorry about the ugly trailing ", ".
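One possible way to tidy the trailing ", " is a second pass that swaps it for a final period. This is a sketch layered on top of the call above, not part of the original answer; the names step1 and cleaned are made up:

```r
sentence <- "active ingredients: avobenzone, octocrylene, octyl salicylate.\nother stuff inactive ingredients: water, glycerin, edta."

# First pass: the substitution from the answer (leaves a trailing ", ")
step1 <- gsub(".*?:\\s*(.*?)\\.", "\\1, ", sentence)

# Second pass: replace the trailing comma and any whitespace with a period
cleaned <- sub(",\\s*$", ".", step1)
cleaned
# [1] "avobenzone, octocrylene, octyl salicylate, water, glycerin, edta."
```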

remove/replace specific words or phrases from character strings - R

I looked around both here and elsewhere, I found many similar questions but none which exactly answer mine. I need to clean up naming conventions, specifically replace/remove certain words and phrases from a specific column/variable, not the entire dataset. I am migrating from SPSS to R, I have an example of the code to do this in SPSS below, but I am not sure how to do it in R.
EG:
"Acadia Parish" --> "Acadia" (removes Parish and space before Parish)
"Fifth District" --> "Fifth" (removes District and space before District)
SPSS syntax:
COMPUTE county=REPLACE(county,' Parish','').
There are only a few instances of this issue in the column with 32,000 cases, and what needs replacing/removing varies and the cases can repeat (there are dozens of instances of a phrase containing 'Parish'), meaning it's much faster to code what needs to be removed/replaced, it's not as simple or clean as a regular expression to remove all spaces, all characters after a specific word or character, all special characters, etc. And it must include leading spaces.
I have looked at the replace() gsub() and other similar commands in R, but they all involve creating vectors, or it seems like they do. What I'd like is syntax that looks for characters I specify, which can include leading or trailing spaces, and replaces them with something I specify, which can include nothing at all, and if it does not find the specific characters, the case is unchanged.
Yes, I will end up repeating the same syntax many times, it's probably easier to create a vector but if possible I'd like to get the syntax I described, as there are other similar operations I need to do as well.
Thank you for looking.
> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"
Legend:
^ Start of the string.
() Capturing group.
\w* Zero or more word characters.
.* Zero or more occurrences of any character.
$ End of the string.
\1 Backreference to the first captured group, used in the replacement.
Maybe I'm missing something but I don't see why you can't simply use conditionals in your regex expression, then trim out the annoying white space.
string <- c("Acadia Parish", "Fifth District")
bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|")
trimws( sub(bad_regex, "", string) )
# [1] "Acadia" "Fifth"
dataframename$varname <- gsub(" Parish","", dataframename$varname)
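For a closer analogue of the SPSS REPLACE call, which treats its arguments as literal text, gsub can be run with fixed = TRUE on a single column. The data frame df and its county column below are made up for this sketch:

```r
# Hypothetical data frame standing in for the real dataset
df <- data.frame(county = c("Acadia Parish", "Fifth District", "Orleans"),
                 stringsAsFactors = FALSE)

# fixed = TRUE matches the phrase literally, leading space included;
# rows that do not contain the phrase are left unchanged
df$county <- gsub(" Parish", "", df$county, fixed = TRUE)
df$county <- gsub(" District", "", df$county, fixed = TRUE)

df$county
# [1] "Acadia"  "Fifth"   "Orleans"
```

Each line mirrors one COMPUTE ... REPLACE statement, so the same line can be repeated for every phrase that needs removing.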
