search and replace a character - r

I have a string like this
x<-c("This is a test (120)")
I need to replace the empty space between test and ( so that text will be like this
x<-c("This is a test(120)")
I tried this
s<-gsub("\t\v\(", "", x)
not working, any input would be appreciated.

Using a lookahead:
gsub("\\s+(?=\\()", "", x, perl=TRUE)
[1] "This is a test(120)"
Answer depends on more specifications you require though. Do you want to remove all spaces in front of opening brackets? Or just one? Or only in front of brackets containing numbers?

One simple approach is to used fixed = TRUE as in:
gsub(" (", "(", x, fixed = TRUE)
or:
gsub(" \\(", "\\(", x)

You have to "double escape" things in R. One for R and one for the regex:
s <- gsub('\\s\\(', '(', x)
That said, depending on your specific use case, you might want this to be more robust:
s <- gsub('(.+) \\((.+)\\)', '\\1(\\2)', x)

Related

Within a column, I'd like to gsub each row of string values and remove any value that matches a list of values I created

Context
I am working with a messy datafile right now. I have a list of comments that I'd like to sort out and grab the most common combination of phrases. An example phrase would be "Did not qualify because of X and Y" and "Did not qualify because of Y and X". I am trying to go through and remove Stop Words so I can match X and Y as a common phrase. I was able to easily do this for common single words, but phrases are a little difficult. Below is my code for context
Create Datafile
dat1 <- dat %>% filter(Action != Exclude)
Remove problem characters
dat1$Comments <- stri_trans_general(dat1$Comments, "latin-ascii")
dat1$Comments <- gsub(pattern='<[^<>]*>', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern='\n', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern="[[:punct:]]", replacement=" ", x=dat1$Comments)
Remove stop words (Where my problem is)
sw <- paste0("\\b(", paste0(stop_words$word, collapse="|"), ")\\b")
dat1$Comments <- lapply(dat1$Comments, function(x) (gsub(pattern=sw, replacement=" ", x)))
Remove extra spaces between words
dat1$Comments <- trimws(gsub("\\s+", " ", dat1$Comments))
dat1$Comments <- gsub("(^[[:space:]]*)|([[:space:]]*$)", "", dat1$Comments)
Sweet Data
top_phrases <- data.frame(text = dat1$Comments) %>%
unnest_tokens(bigram, text, 'ngrams', n = Length, to_lower = TRUE) %>%
count(bigram, sort = TRUE)
Issue
This is what pops up and is traced back to the gsub code
Error in gsub(pattern = sw, replacement = " ", x) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
If anyone is curious, here is what is stored in "sw"
"\\b(a|a's|able|about|above|according|accordingly|across|actually|after|afterwards|again|against|ain't|all|allow|allows|almost|alone|along|already|also|although|always|am|among|amongst|an|and|another|any|anybody|anyhow|anyone|anything|anyway|anyways|anywhere|apart|appear|appreciate|appropriate|are|aren't|around|as|aside|ask|asking|associated|at|available|away|awfully|b|be|became|because|become|becomes|becoming|been|before|beforehand|behind|being|believe|below|beside|besides|best|better|between|beyond|both|brief|but|by|c|c'mon|c's|came|can|can't|cannot|cant|cause|causes|certain|certainly|changes|clearly|co|com|come|comes|concerning|consequently|consider|considering|contain|containing|contains|corresponding|could|couldn't|course|currently|d|definitely|described|despite|did|didn't|different|do|does|doesn't|doing|don't|done|down|downwards|during|e|each|edu|eg|eight|either|else|elsewhere|enough|entirely|especially|et|etc|even|ever|every|everybody|everyone|everything|everywhere|ex|exactly|example|except|f|far|few|fifth|first|five|followed|following|follows|for|former|formerly|forth|four|from|further|furthermore|g|get|gets|getting|given|gives|go|goes|going|gone|got|gotten|greetings|h|had|hadn't|happens|hardly|has|hasn't|have|haven't|having|he|he's|hello|help|hence|her|here|here's|hereafter|hereby|herein|hereupon|hers|herself|hi|him|himself|his|hither|hopefully|how|howbeit|however|i|i'd|i'll|i'm|i've|ie|if|ignored|immediate|in|inasmuch|inc|indeed|indicate|indicated|indicates|inner|insofar|instead|into|inward|is|isn't|it|it'd|it'll|it's|its|itself|j|just|k|keep|keeps|kept|know|knows|known|l|last|lately|later|latter|latterly|least|less|lest|let|let's|like|liked|likely|little|look|looking|looks|ltd|m|mainly|many|may|maybe|me|mean|meanwhile|merely|might|more|moreover|most|mostly|much|must|my|myself|n|name|namely|nd|near|nearly|necessary|need|needs|neither|never|nevertheless|new|next|nine|no|nobody|non|none|noone|nor|normally|not|nothing|novel|now|nowhere|o|obviously|of|off|often|oh|ok|okay|old|on|once|one|ones|only|onto|or|other|others|otherwise|ought|our|ours|ourselves|out|outside|over|overall|own|p|particular|particularly|per|perhaps|placed|please|plus|possible|presumably|probably|provides|q|que|quite|qv|r|rather|rd|re|really|reasonably|regarding|regardless|regards|relatively|respectively|right|s|said|same|saw|say|saying|says|second|secondly|see|seeing|seem|seemed|seeming|seems|seen|self|selves|sensible|sent|serious|seriously|seven|several|shall|she|should|shouldn't|since|six|so|some|somebody|somehow|someone|something|sometime|sometimes|somewhat|somewhere|soon|sorry|specified|specify|specifying|still|sub|such|sup|sure|t|t's|take|taken|tell|tends|th|than|thank|thanks|thanx|that|that's|thats|the|their|theirs|them|themselves|then|thence|there|there's|thereafter|thereby|therefore|therein|theres|thereupon|these|they|they'd|they'll|they're|they've|think|third|this|thorough|thoroughly|those|though|three|through|throughout|thru|thus|to|together|too|took|toward|towards|tried|tries|truly|try|trying|twice|two|u|un|under|unfortunately|unless|unlikely|until|unto|up|upon|us|use|used|useful|uses|using|usually|uucp|v|value|various|very|via|viz|vs|w|want|wants|was|wasn't|way|we|we'd|we'll|we're|we've|welcome|well|went|were|weren't|what|what's|whatever|when|whence|whenever|where|where's|whereafter|whereas|whereby|wherein|whereupon|wherever|whether|which|while|whither|who|who's|whoever|whole|whom|whose|why|will|willing|wish|with|within|without|won't|wonder|would|would|wouldn't|x|y|yes|yet|you|you'd|you'll|you're|you've|your|yours|yourself|yourselves|z|zero|i|me|my|myself|we|our|ours|ourselves|you|your|yours|yourself|yourselves|he|him|his|himself|she|her|hers|herself|it|its|itself|they|them|their|theirs|themselves|what|which|who|whom|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|would|should|could|ought|i'm|you're|he's|she's|it's|we're|they're|i've|you've|we've|they've|i'd|you'd|he'd|she'd|we'd|they'd|i'll|you'll|he'll|she'll|we'll|they'll|isn't|aren't|wasn't|weren't|hasn't|haven't|hadn't|doesn't|don't|didn't|won't|wouldn't|shan't|shouldn't|can't|cannot|couldn't|mustn't|let's|that's|who's|what's|here's|there's|when's|where's|why's|how's|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|around|as|ask|asked|asking|asks|at|away|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|best|better|between|big|both|but|by|came|can|cannot|case|cases|certain|certainly|clear|clearly|come|could|did|differ|different|differently|do|does|done|down|down|downed|downing|downs|during|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|had|has|have|having|he|her|here|herself|high|high|high|higher|highest|him|himself|his|how|however|i|if|important|in|interest|interested|interesting|interests|into|is|it|its|itself|just|keep|keeps|kind|knew|know|known|knows|large|largely|last|later|latest|least|less|let|lets|like|likely|long|longer|longest|made|make|making|man|many|may|me|member|members|men|might|more|most|mostly|mr|mrs|much|must|my|myself|necessary|need|needed|needing|needs|never|new|new|newer|newest|next|no|nobody|non|noone|not|nothing|now|nowhere|number|numbers|of|off|often|old|older|oldest|on|once|one|only|open|opened|opening|opens|or|order|ordered|ordering|orders|other|others|our|out|over|part|parted|parting|parts|per|perhaps|place|places|point|pointed|pointing|points|possible|present|presented|presenting|presents|problem|problems|put|puts|quite|rather|really|right|right|room|rooms|said|same|saw|say|says|second|seconds|see|seem|seemed|seeming|seems|sees|several|shall|she|should|show|showed|showing|shows|side|sides|since|small|smaller|smallest|some|somebody|someone|something|somewhere|state|states|still|still|such|sure|take|taken|than|that|the|their|them|then|there|therefore|these|they|thing|things|think|thinks|this|those|though|thought|thoughts|three|through|thus|to|today|together|too|took|toward|turn|turned|turning|turns|two|under|until|up|upon|us|use|used|uses|very|want|wanted|wanting|wants|was|way|ways|we|well|wells|went|were|what|when|where|whether|which|while|who|whole|whose|why|will|with|within|without|work|worked|working|works|would|year|years|yet|you|young|younger|youngest|your|yours)\\b"
Both TRE (the default regex engine used in base R regex functions) and PCRE (the regex engine used in base R regex functions with perl=TRUE) have quite hard limits for the pattern length.
In your case, stringr regex functions will work better as they are using ICU regex engine that supports much longer regex patterns.
So, you may replace
gsub(pattern=sw, replacement=" ", x)
with
stringr::str_replace_all(x, sw, " ")

Replace multiple strings comprising of a different number of characters with one gsubfn()

Here Replace multiple strings in one gsub() or chartr() statement in R? it is explained to replace multiple strings of one character at in one statement with gsubfn(). E.g.:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess it is because of the fact that the string 'doremi' contains multiple characters and me using the metacharacter . in gsubfn. I have no idea what to replace it with - I must confess I find the use of metacharacters sometimes a bit difficult to udnerstand. Thus, is there a way for me to replace '-' and 'doremi' at once?
You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"
Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is this search: "doremi|-" which tells to search for either "doremi" or "-". Use "|" as the or operator.
Just a more generic solution to #RLave's solution -
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"

How to do a replace with backreferences, when the number of occurences is unknown?

In order to make a few corrections to a .tex file generated by Bookdown, I need to replace occurrences of }{ with , when it is used in a citation, i.e.
s <- "Text.\\autocites{REF1}{REF2}{REF3}. More text \\autocites{REF4}{REF5} and \\begin{tabular}{ll}"
Should become
"Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}
Because I need to keep the references I tried to look into backreferences, but I cannot seem to get it right, because the number of groups to match is unknown beforehand. Also, I cannot do stringr::str_replace_all(s, "\\}\\{", ","), because }{ occurs in other places in the document as well.
My best approach so far, is to use a look-behind to only do the replace when the occurence is after \\autocites, but then I cannot get the backreferences and grouping right:
stringr::str_replace_all(s, "(?<=\\\\autocites\\{)([:alnum:]+)(\\}\\{)", "\\1,")
[1] "Text.\\autocites{REF1,REF2}{REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
stringr::str_replace_all(s, "(?<=\\\\autocites\\{)([:alnum:]+)((\\}\\{)([:alnum:]+))*", "\\1,\\4")
[1] "Text.\\autocites{REF1,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
I might be missing some completely obvious approach, so I hope someone can help.
pat matches
autocites followed by
the shortest string that ends in } and is
followed by end of string or a non-{
It then uses gsubfn to replace each occurrence of }{ in that with a comma. It uses formula notation to express the replacement function -- the body of the function is on the RHS of the ~ and because the body contains ..1 the arguments are taken to be ... . It does not use zero width lookahead or lookbehind.
library(gsubfn)
pat <- "(autocites.*?\\}($|[^{]))"
gsubfn(pat, ~ gsub("}{", ",", ..1, fixed = TRUE), s)
giving:
[1] "Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
Variation
One minor simplificaiton of the regular expression shown above is to remove the outer parentheses from pat and instead specify backref = 0 in gsubfn. That tells it to pass the entire match to the function. We could use ..1 to specify the argument as above but since we know that there is necessarily only one argument passed we can specify it as x in the body of the function. Any variable name would do as it assumes that any free variable is an argument. The output would be the same as above.
pat2 <- "autocites.*?\\}($|[^{])"
gsubfn(pat2, ~ gsub("}{", ",", x, fixed = TRUE), s, backref = 0)
Cool problem - I got to learn a new trick with str_replace. You can make the return value a function, and it applies the function to the strings you've picked out.
replace_brakets <- function(str) {
str_replace_all(str, "\\}\\{", ",")
}
s %>% str_replace_all("(?<=\\\\autocites\\{)([:alnum:]+\\}\\{)+", replace_brakets)
# [1] "Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"

String: extract wanted character instead of removing unwanted

I was wandering if in R their is a function like KeepChar("abcde....xyz", some_text) that you feed with all the desired character that you want to keep, and returns the strings with only the desired character left in it. Here the function would only keep the letters of the alphabet in lower case. I would like something that looks like this:
some_text <- "Hel-_l0o W#oRr^ld"
some_text <- KeepChar("abcdefghijklmnopqrstuvwxyz ", some_text)
some_text
> "hello world"
I feel that the removing method that I am currently using gsub("#\\w+", "", some_text), tm_map(some_text, stripWhitespace) or str_replace_all(some_text,"[^[:graph:]]", " ") takes a lot of time and coding line with a constant risk of forgetting to remove a specific character, especially when you already know exactly what you want to keep.
Why I ask this question is because I am coding a plateform to process sentiment analysis on texts from various sources like twitter and I want to make sure not to forget to remove any unwanted character.
To handle a pattern without using regex I will try this:
string <- "Hel-_l0o W#oRr^ld"
pattern <- "abcdefghijklmnopqrstuvwxyz"
KeepChar = function(pattern, string){
splitted_string <- unlist(strsplit(string, ""))
splitted_pattern <- unlist(strsplit(pattern, ""))
ids_string <- splitted_string %in% splitted_pattern
return(paste(splitted_string[ids_string], sep = "", collapse = ""))
}
some_text <- KeepChar(pattern = pattern, string = string)
You can try this:
some_text <- "Hel-_l0o W#oRr^ld"
gsub("[^[:alpha:] ]", "", some_text)#will return all characters
gsub("[^[:lower:] ]", "", some_text)#will return only lower characters alongwith space
gsub("[^[:upper:] ]", "", some_text)#will return higher case characters alongwith space
You can also look at the page https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html to see the matches available in R

regex out a text from a line in R

I have line like this:
x<-c("System configuration: lcpu=96 mem=196608MB ent=16.00")
I need to the the value equal to ent and store it in val object in R
I am doing this not not seem to be working. Any ideas?
val<-x[grep("[0-9]$", x)]
use sub:
val <- sub('^.* ent=([[:digit:]]+)', '\\1', x)
If ent is always at the end then:
sub(".*ent=", "", x)
If not try strapplyc in the gsubfn package which returns only the portion of the regular expression within parentheses:
library(gsubfn)
strapplyc(x, "ent=([.0-9]+)", simplify = TRUE)
Also it could be converted to numeric at the same time using strapply :
strapply(x, "ent=([.0-9]+)", as.numeric, simplify = TRUE)
Using rex may make this type of task a little simpler.
Note this solution correctly includes . in the capture, as does G. Grothendieck's answer.
x <- c("System configuration: lcpu=96 mem=196608MB ent=16.00")
library(rex)
val <- as.numeric(
re_matches(x,
rex("ent=",
capture(name = "ent", some_of(digit, "."))
)
)$ent
)
#>[1] 16

Resources