Replacing text in gsub by evaluating a backreference - r

Let's say I have some text:
myF <- "lag.variable.1+1"
For all similar expressions I would like to get the following result: lag.variable.2 (that is, replacing 1+1 with the actual sum).
The following doesn't seem to work; it appears that the backreference doesn't carry through into the eval(parse()) bit:
myF<-gsub("(\\.\\w+)\\.([0-9]+\\+[0-9]+)",
paste0( "\\1." ,eval(parse(text ="\\2"))) ,
myF )
Any tips on how to achieve the desired result ?
Thanks!

Here is how you can use your current pattern with gsubfn:
library(gsubfn)
x <- " lag.variable0.3 * lag.variable1.1+1 + 9892"
p <- "(\\.\\w+)\\.([0-9]+\\+[0-9]+)"
gsubfn(p, function(n,m) paste0(n, ".", eval(parse(text = m))), x)
# => [1] " lag.variable0.3 * lag.variable1.2 + 9892"
Note that the match is passed to the callable: Group 1 is assigned to the n variable and Group 2 to m. The return value is the concatenation of Group 1, a dot, and the evaluated Group 2 contents.
You may simplify the callable by using a PCRE regex (add the perl=TRUE argument) with \K, the match reset operator that discards all text matched so far:
p <- "\\.\\w+\\.\\K(\\d+\\+\\d+)"
gsubfn(p, ~ eval(parse(text = z)), x, perl=TRUE)
# => [1] " lag.variable0.3 * lag.variable1.2 + 9892"
You may further enhance the pattern to support other operators by replacing \\+ with [-+/*] and, if you need to support numbers with fractional parts, by replacing [0-9]+ with \\d*\\.?\\d+:
p <- "(\\.\\w+)\\.(\\d*\\.?\\d+[-+/*]\\d*\\.?\\d+)"
## or a PCRE regex:
p <- "\\.\\w+\\.\\K(\\d*\\.?\\d+[-+/*]\\d*\\.?\\d+)"
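For comparison, here is a rough base R sketch of the extended PCRE pattern above, without gsubfn: the matched expressions are located with gregexpr, evaluated, and written back with regmatches<-. The helper name eval_tail and the sample string are made up for illustration, and eval(parse()) should only be run on text you trust.

```r
# Hypothetical helper: evaluate the arithmetic that follows ".name." in each string.
eval_tail <- function(s) {
  # \K drops the ".name." prefix from the reported match (PCRE, hence perl = TRUE)
  p <- "\\.\\w+\\.\\K\\d*\\.?\\d+[-+/*]\\d*\\.?\\d+"
  m <- gregexpr(p, s, perl = TRUE)
  # Replace every matched expression with its evaluated value, as text
  regmatches(s, m) <- lapply(regmatches(s, m), function(exprs) {
    vapply(exprs, function(e) as.character(eval(parse(text = e))),
           character(1), USE.NAMES = FALSE)
  })
  s
}

res <- eval_tail("lag.a.2*3 + lag.b.10/4")
print(res)
```

Because the replacement values are computed after matching, this sidesteps the original problem that gsub's replacement string is built before any backreference is filled in.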

We can use gsubfn
library(gsubfn)
gsubfn("(\\d+\\+\\d+)", ~ eval(parse(text = x)), myF)
#[1] "lag.variable.2"
gsubfn("\\.([0-9]+\\+[0-9]+)", ~ paste0(".", eval(parse(text = x))), myF2)
#[1] "lag.variable0.3 * lag.variable1.2 + 9892"
Or with str_replace
library(stringr)
str_replace(myF, "(\\d+\\+\\d+)", function(x) eval(parse(text = x)))
#[1] "lag.variable.2"
Or an option with strsplit and paste
v1 <- strsplit(myF, "\\.(?=\\d)", perl = TRUE)[[1]]
paste(v1[1], eval(parse(text = v1[2])), sep=".")
#[1] "lag.variable.2"
data
myF <- "lag.variable.1+1"
myF2 <- "lag.variable0.3 * lag.variable1.1+1 + 9892"

Related

A way to strsplit and replace all of one character with several variations of alternate strings?

I am sure there is a simple solution and I am just getting too frustrated to work through it but here is the issue, simplified:
I have a string, ex: AB^AB^AB^^BAAA^^BABA^
I want to replace the ^s (so, 7 characters in the string), but iterate through many variants and be able to retain them all as strings
for example:
replacement 1: CCDCDCD to get: ABCABCABDCBAAADCBABAD
replacement 2: DDDCCCD to get: ABDABDABDCBAAACCBABAD
I imagine strsplit is the way, and I would like to do it in a for loop, any help would be appreciated!
The positions of the "^" can be found using gregexpr, see tmp
x <- "AB^AB^AB^^BAAA^^BABA^"
y <- c("CCDCDCD", "DDDCCCD")
tmp <- gregexpr(pattern = "^", text = x, fixed = TRUE)
You can then split the 'replacements' character by character using strsplit, this gives a list. Finally, iterate over that list and replace the "^" with the characters from your replacements one after the other.
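sapply(strsplit(y, split = ""), function(i) {
# value must be a list holding one character vector per input string;
# a bare vector would recycle only its first element over all matches
`regmatches<-`(x, m = tmp, value = list(i))
})
Result
# [1] "ABCABCABDCBAAADCBABAD" "ABDABDABDCBAAACCBABAD"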
You don't really need a for loop. You can strsplit your string and your replacement, and then replace the "^" entries with the replacement vector.
str <- unlist(strsplit("AB^AB^AB^^BAAA^^BABA^", ""))
pat <- unlist(strsplit("CCDCDCD", ""))
str[str == "^"] <- pat
paste(str, collapse = "")
# [1] "ABCABCABDCBAAADCBABAD"
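To run the same idea over several replacement strings at once, a minimal sketch could wrap it in a function and call it with vapply. The name fill_carets is made up here, and the length check is an added assumption (each replacement must supply exactly one character per "^").

```r
x <- "AB^AB^AB^^BAAA^^BABA^"
y <- c("CCDCDCD", "DDDCCCD")

# Hypothetical helper: fill the "^" slots of `template` with the characters of `fill`.
fill_carets <- function(fill, template) {
  chars <- strsplit(template, "")[[1]]
  repl  <- strsplit(fill, "")[[1]]
  slots <- chars == "^"
  # Assumption: one replacement character per "^"
  stopifnot(sum(slots) == length(repl))
  chars[slots] <- repl
  paste(chars, collapse = "")
}

res <- vapply(y, fill_carets, character(1), template = x, USE.NAMES = FALSE)
print(res)
# [1] "ABCABCABDCBAAADCBABAD" "ABDABDABDCBAAACCBABAD"
```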
Another option uses gsubfn
library(gsubfn) # also attaches proto, which provides proto()
f1 <- Vectorize(function(str1, str2) {
p <- proto(fun = function(this, x) substr(str2, count, count))
gsubfn::gsubfn("\\^", p, str1)
})
Testing
> unname(f1(x, y))
[1] "ABCABCABDCBAAADCBABAD" "ABDABDABDCBAAACCBABAD"
data
x <- "AB^AB^AB^^BAAA^^BABA^"
y <- c("CCDCDCD", "DDDCCCD")
Given x <- "AB^AB^AB^^BAAA^^BABA^" and y <- c("CCDCDCD", "DDDCCCD"), we can try utf8ToInt + intToUtf8 + replace like below
sapply(
y,
function(s) {
intToUtf8(
replace(
u <- utf8ToInt(x),
u == utf8ToInt("^"),
utf8ToInt(s)
)
)
}
)
which gives
CCDCDCD DDDCCCD
"ABCABCABDCBAAADCBABAD" "ABDABDABDCBAAACCBABAD"

Remove line return in a string from the second "\n" and the rest (do not remove the first one)

I have sequences that I want to reformat in R.
But R shows the sequences with line breaks (as can be seen from the "\n"):
seqs <- ">PRTRE213-13 Volkameria aculeatum matK \n------------------------------------------------------------------CCAAC\nCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAA\nAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCT\nTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTT\nTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC---------\n-AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAA\nAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCAC\nCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAAT\nTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTAC\nCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTAT\nTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGT\nTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATC\nCTTTCTC-------------"
But I want to remove all the \n except the first one, i.e.:
[1] ">PRTRE213-13 Volkameria aculeatum matK \n------------------------------------------------------------------CCAACCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAAAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCTTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTTTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC----------AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAAAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCACCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAATTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTACCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTATTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGTTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATCCTTTCTC-------------"
If I do this, it removes all of the returns.
gsub(pattern = "\n",replacement = "", x = seqs)
This is not working:
sub("^(.*? \n .*?) \n .*", "\\1", seqs)
This gives me an error:
gsub(pattern = "${'\n'[*]:0:2}",replacement = "", x = seqs)
Error in gsub(pattern = "${'\n'[*]:0:2}", replacement = "", x = seqs) :
invalid regular expression '${'
'[*]:0:2}', reason 'Invalid contents of {}'
My sequences are variable:
">Whatever here before \n the sequence start \n the rest \n..."
The end result would be
">Whatever here before \n the sequence start the rest..."
Interestingly, the code below partially works for the test sentence, but not the sequence above:
seqss = ">Whatever here before \n the sequence start \n the rest \n..."
sub("^(.*? \n .*?) \n .*", "\\1", seqss)
[1] ">Whatever here before \n the sequence start"
Try it like this:
seqs <- ">PRTRE213-13 Volkameria aculeatum matK \n------------------------------------------------------------------CCAAC\nCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAA\nAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCT\nTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTT\nTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC---------\n-AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAA\nAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCAC\nCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAAT\nTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTAC\nCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTAT\nTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGT\nTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATC\nCTTTCTC-------------"
gsub(pattern = "(^.*?\\n)|\\n",replacement = "\\1", x = seqs, perl = TRUE)
The idea of the regex
(^.*?\\n)|\\n
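is to capture everything up to and including the first newline in a group and put it back via the replacement. Every later newline matches the second alternative, where the group is empty, so it is replaced with nothing.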
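A small sketch on a made-up toy string shows the effect of the alternation, next to a plain base R equivalent that splits at the first newline with regexpr and substr. The variable names are illustrative only.

```r
toy <- ">header \nAAA\nCCC\nGGG"

# Alternation trick: the first line (with its newline) is captured and restored,
# all other newlines are replaced by the empty group 1.
out_regex <- gsub("(^.*?\\n)|\\n", "\\1", toy, perl = TRUE)

# Equivalent without a regex: find the first newline, clean only what follows.
pos <- regexpr("\n", toy, fixed = TRUE)
out_split <- paste0(
  substr(toy, 1, pos),
  gsub("\n", "", substr(toy, pos + 1, nchar(toy)), fixed = TRUE)
)

print(out_regex)
# [1] ">header \nAAACCCGGG"
print(identical(out_regex, out_split))
# [1] TRUE
```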
A stringr approach that simply splits the string and combines the two parts back together.
seqs <- ">PRTRE213-13 Volkameria aculeatum matK \n------------------------------------------------------------------CCAAC\nCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAA\nAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCT\nTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTT\nTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC---------\n-AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAA\nAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCAC\nCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAAT\nTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTAC\nCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTAT\nTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGT\nTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATC\nCTTTCTC-------------"
library(stringr)
keep_first_newline <- function(string){
  # position of the first newline
  first_newline <- str_locate(string, "\\n")[1]
  head <- str_sub(string, end = first_newline)
  tail <- string %>%
    str_sub(start = first_newline + 1) %>%
    str_remove_all("\\n")
  # return the result visibly
  str_c(head, tail)
}
seqs %>%
keep_first_newline %>%
writeLines
#> >PRTRE213-13 Volkameria aculeatum matK
#> ------------------------------------------------------------------CCAACCGAGAGCCAGCTCC------TCTTTTTCAAAA---------CGAAAT---------------------CAAAAGACTATTCTTATTCTTATAT------------AATTCTCATGTATGTGAATATGAATCCGTTTTCGTCTTTTCTACGTAACCAATCTTTT---CATTTACGATCAACATCTTTTGAAGTTCTTCTTGAACGAATCTATTTTCTATGTA---------AAAGTAGAACGTCTT------GTGAACGTCTTTGTTAAGATTAAC----------AATTTTCGGGCGAACCCGTGGTTGGTCAAG------GAACCTTTCATGCATTATATTAGGTATCAAAGAAAGATCCATTCTGGCTTCA------AAGGGAACATCTTTTTTCATGAAAAAATGGCAATTTTATCTTGTCACCTTTTTGGCAATGGCATTTTTCGCTGTGGTTTCATCCAAGAAGGATTTATCTAAAC---CAATTATCCAATTTATTCCCTTGAA------TTTTTGGGCTATCTTTCA------AGCGTGCGAATGAACCCCTCTGTGGTACCGGAGTCAAATTCTAGAAAATGCATTTCTAATCAATAATGCTATT------AAGAAGTTTGATACCCTTATTTCCAATTATTCCAATGATTGCGTCATTGGCTAAAGCGAAATTTTGTAACGTATTTGGGCATCCTGTTAGTTAAGCCGATTTGGGCTGATTTATCAGATTCTAATATTATTGACCGATTTGGTCGTATA---TGCAGAAATCCTTTCTC-------------
Created on 2018-06-29 by the reprex package (v0.2.0).
Using gsubfn we can do the following. It is no better than the regex in this case, but it extends more easily if you want to keep the first n occurrences with n > 1.
library(gsubfn)
p <- proto(fun = function(this, x) if(count > 1) '' else x)
out <- gsubfn('\n', p, seqs)
Same as accepted answer
out == gsub(pattern = "(^.*?\\n)|\\n",replacement = "\\1", x = seqs, perl = TRUE)
#[1] TRUE
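The "keep the first n occurrences" idea can also be sketched in base R, assuming a fixed (non-regex) separator. The function name keep_first_n is hypothetical; it blanks every match whose running index exceeds n.

```r
# Hypothetical helper: keep the first n occurrences of `sep`, drop the rest.
keep_first_n <- function(s, sep, n) {
  m <- gregexpr(sep, s, fixed = TRUE)
  regmatches(s, m) <- lapply(regmatches(s, m), function(hits) {
    # hits holds the matched separators in order of appearance
    hits[seq_along(hits) > n] <- ""
    hits
  })
  s
}

res <- keep_first_n("a\nb\nc\nd", "\n", 2)
print(res)
# [1] "a\nb\ncd"
```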

Replacing the nth number in a string

I have a set of files which I had named incorrectly. The file name is as follows.
Generation_Flux_0_Model_200.txt
Generation_Flux_101_Model_43.txt
Generation_Flux_11_Model_3.txt
I need to replace the second number (the model number) by adding 1 to the existing number. So the correct names would be
Generation_Flux_0_Model_201.txt
Generation_Flux_101_Model_44.txt
Generation_Flux_11_Model_4.txt
This is the code I wrote. I would like to know how to specify the position of the number (replace second number in the string with the new number)?
reNameModelNumber <- function(modelName){
#get the current model number
modelNumber = as.numeric(unlist(str_extract_all(modelName, "\\d+"))[2])
#increment it by 1
newModelNumber = modelNumber + 1
#building the new name with gsub
newModelName = gsub(" regex ", newModelNumber, modelName)
#rename
file.rename(modelName, newModelName)
}
reactionFiles = list.files(pattern = "^Generation_Flux_\\d+_Model_\\d+\\.txt$")
sapply(reactionFiles, function(x) reNameModelNumber(x))
We can use gsubfn to increment by 1. Capture the digits ((\\d+)) followed by a . and 'txt' at the end ($) of the string, and replace the whole match with the number plus 1, followed by the .txt extension
library(gsubfn)
gsubfn("(\\d+)\\.txt$", ~ paste0(as.numeric(x) + 1, ".txt"), str1)
#[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
#[3] "Generation_Flux_11_Model_4.txt"
data
str1 <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt")
Answering the question, if you want to increment a certain number inside a string, you may use
> library(gsubfn)
> nth = 2
> reactionFiles <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt", "Generation_Flux_11_Model_3.txt")
> gsubfn(paste0("^((?:\\D*\\d+){", nth-1, "}\\D*)(\\d+)"), function(x,y,z) paste0(x, as.numeric(y) + 1), reactionFiles)
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt" "Generation_Flux_11_Model_4.txt"
nth here is the number of the digit chunk to increment.
Pattern details
^((?:\\D*\\d+){nth-1}\\D*) - Capturing group 1 (the value is accessed in the gsubfn function via x):
  (?:\\D*\\d+){nth-1} - nth-1 occurrences of
    \\D* - 0 or more chars other than digits
    \\d+ - 1 or more digits
  \\D* - 0 or more non-digits
(\\d+) - Capturing group 2 (the value is accessed in the gsubfn function via y): one or more digits
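The same "increment the nth digit chunk" idea can be sketched in base R by collecting all digit runs with gregexpr and rewriting only the nth one. This swaps the single anchored pattern for regmatches<-; the function name increment_nth_number is made up, and it assumes every string contains at least nth numbers.

```r
# Hypothetical helper: add 1 to the nth run of digits in each string.
increment_nth_number <- function(s, nth) {
  m <- gregexpr("\\d+", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(nums) {
    # Assumption: length(nums) >= nth
    nums[nth] <- as.character(as.numeric(nums[nth]) + 1)
    nums
  })
  s
}

files <- c("Generation_Flux_0_Model_200.txt",
           "Generation_Flux_101_Model_43.txt",
           "Generation_Flux_11_Model_3.txt")
res <- increment_nth_number(files, 2)
print(res)
```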
Using base-R.
data <- c( # Just an example
"Generation_Flux_0_Model_200.txt",
"Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt"
)
fixNameModel <- function(data){
n <- length(data)
# get the current model number and increment it by 1
newn = as.integer(sub(".+_(\\d+)\\.txt", "\\1", data)) + 1L
#building the new name with gsub
newModelName <- vector(mode = "character", length = n)
for (i in 1:n) {
newModelName[i] <- gsub("\\d+\\.txt$", paste0(newn[i], ".txt"), data[i])
}
newModelName
}
fixNameModel(data)
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
[3] "Generation_Flux_11_Model_4.txt"
You can now do something like file.rename(modelName, fixNameModel(modelName))
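Before renaming real files, it may be worth trying the rename in a scratch directory. The sketch below is only an illustration: the directory name and the two file names are made up, and the descending order is a precaution so that renaming Model_3 to Model_4 cannot overwrite an existing Model_4 file.

```r
# Dry run in a scratch directory; the directory name is made up for this demo.
work_dir <- file.path(tempdir(), "rename_demo")
dir.create(work_dir, showWarnings = FALSE)

old_names <- c("Generation_Flux_0_Model_3.txt", "Generation_Flux_0_Model_4.txt")
invisible(file.create(file.path(work_dir, old_names)))

# New model number = old model number + 1
model_no  <- as.integer(sub(".*_(\\d+)\\.txt$", "\\1", old_names)) + 1L
new_names <- paste0(sub("\\d+\\.txt$", "", old_names), model_no, ".txt")

# Rename the highest model number first to avoid clobbering existing files.
ord <- order(model_no, decreasing = TRUE)
ok  <- file.rename(file.path(work_dir, old_names[ord]),
                   file.path(work_dir, new_names[ord]))
print(sort(list.files(work_dir)))
```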
EDIT:
Here is a slightly neater version, but it makes stronger assumptions:
fixNameModel2 <- function(data) {
sapply(
strsplit(data, "_|\\."),
function(x) {
x[5] <- as.integer(x[5]) + 1L
x <- paste0(x, collapse = "_")
gsub("_txt", ".txt", x, fixed = TRUE)
}
)
}
Assuming that the digit always occurs before the extension, as is mentioned in the comments, here is another base R solution that is a little bit simpler.
sapply(regmatches(tmp, regexec("\\d+(?=\\.)", tmp, perl=TRUE), invert=NA),
function(x) paste0(c(x[1], as.integer(x[2]) + 1L, x[3]), collapse=""))
This returns
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
[3] "Generation_Flux_11_Model_4.txt"
regexec returns, for each string, the position of the match. regmatches with invert=NA then splits each string into a character vector holding the text before the match, the match itself, and the text after it, so the matched digits are the second element. Feed this list to sapply, convert the second element to integer and increment it, then paste the pieces back together so that sapply returns an atomic vector.
The regex "\d+(?=\.)" uses a Perl lookahead, "(?=\.)", which checks for the dot without including it in the match, while "\d+" matches the digits.
data
tmp <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt")

Regular Expression: replace the n-th occurrence

Does someone know how to find the n-th occurrence of a string within an expression and how to replace it using a regular expression?
For example, I have the following string
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
and I want to replace the 5th occurrence of '-' by '|'
and the 7th occurrence of '-' by "||", like
[1] aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa
How do I do this?
Thanks,
Florian
(1) sub It can be done in a single regular expression with sub:
> sub("(^(.*?-){4}.*?)-(.*?-.*?)-", "\\1|\\3||", txt, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(2) sub twice or this variation which calls sub twice (the 7th occurrence is replaced first, so the position of the 5th is unaffected):
> txt2 <- sub("(^(.*?-){6}.*?)-", "\\1||", txt, perl = TRUE)
> sub("(^(.*?-){4}.*?)-", "\\1|", txt2, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(3) sub.fun or this variation, which creates a function sub.fun that performs one substitution. It makes use of fn$ from the gsubfn package to substitute n-1, pat, and value into the sub arguments. First define the function and then call it twice.
library(gsubfn)
sub.fun <- function(x, pat, n, value) {
fn$sub( "(^(.*?-){`n-1`}.*?)$pat", "\\1$value", x, perl = TRUE)
}
> sub.fun(sub.fun(txt, "-", 7, "||"), "-", 5, "|")
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(We could have modified the arguments to sub in the body of sub.fun using paste or sprintf to give a base R solution but at the expense of some additional verbosity.)
This can be reformulated as a replacement function giving this pleasing sequence:
"sub.fun<-" <- sub.fun
tt <- txt # make a copy so that we preserve the input txt
sub.fun(tt, "-", 7) <- "||"
sub.fun(tt, "-", 5) <- "|"
> tt
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
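The base R variant mentioned in (3) could look roughly like the sketch below, which builds the pattern with sprintf instead of fn$. The name sub_nth is made up, and pat is assumed to be a valid regular expression fragment.

```r
# Hypothetical base R counterpart of sub.fun: replace the nth match of `pat`.
sub_nth <- function(x, pat, n, value) {
  # Skip n-1 earlier occurrences lazily, keep them in group 1, then match the nth
  pattern <- sprintf("^((?:.*?%s){%d}.*?)%s", pat, n - 1L, pat)
  sub(pattern, paste0("\\1", value), x, perl = TRUE)
}

txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
# Replace the later occurrence first so the earlier position does not shift
res <- sub_nth(sub_nth(txt, "-", 7, "||"), "-", 5, "|")
print(res)
# [1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
```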
(4) gsubfn Using gsubfn from the gsubfn package we can use a particularly simple regular expression (it's just "-") and the code has quite a straightforward structure. We perform the substitution via a proto method. The proto object containing the method is passed in place of a replacement string. The simplicity of this approach derives from the fact that gsubfn automatically makes a count variable available to such methods:
library(gsubfn) # gsubfn also pulls in proto
p <- proto(fun = function(this, x) {
if (count == 5) return("|")
if (count == 7) return("||")
x
})
> gsubfn("-", p, txt)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
UPDATE: Some corrections.
UPDATE 2: Added a replacement function approach to (3).
UPDATE 3: Added pat argument to sub.fun.
An alternative is Hadley's stringr package, which forms the basis for the function I wrote:
library(stringr)
replace.nth <- function(string, pattern, replacement, n) {
locations <- str_locate_all(string, pattern)
str_sub(string, locations[[1]][n, 1], locations[[1]][n, 2]) <- replacement
string
}
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
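# Replace the 7th occurrence first: replacing the 5th removes a "-", which would shift the later positions
txt.new <- replace.nth(txt, "-", "||", 7)
txt.new <- replace.nth(txt.new, "-", "|", 5)
txt.new
# [1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"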
One way to do this is to use gregexpr to find the positions of the -:
posns <- gregexpr("-",txt)[[1]]
And then pasting together the relevant pieces and separators:
paste0(substr(txt,1,posns[5]-1),"|",substr(txt,posns[5]+1,posns[7]-1),"||",substr(txt,posns[7]+1,nchar(txt)))
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
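A slightly more general sketch of the same position-based idea avoids the substr arithmetic: collect all separators with gregexpr, overwrite the ones you want by index, and write them back with regmatches<-. The variable names are illustrative.

```r
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"

m    <- gregexpr("-", txt, fixed = TRUE)
seps <- regmatches(txt, m)[[1]]   # one "-" per occurrence, in order
seps[5] <- "|"                    # indices refer to the original occurrences,
seps[7] <- "||"                   # so the order of these two lines does not matter

out <- txt
regmatches(out, m) <- list(seps)
print(out)
# [1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
```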

gsub return an empty string when no match is found

I'm using the gsub function in R to return occurrences of my pattern (reference numbers) on a list of text. This works great unless no match is found, in which case I get the entire string back, instead of an empty string. Consider the example:
data <- list("a sentence with citation (Ref. 12)",
"another sentence without reference")
sapply(data, function(x) gsub(".*(Ref. (\\d+)).*", "\\1", x))
Returns:
[1] "Ref. 12" "another sentence without reference"
But I'd like to get
[1] "Ref. 12" ""
Thanks!
I'd probably go a different route, since the sapply doesn't seem necessary to me as these functions are vectorized already:
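fun <- function(x){
  # record which elements match before rewriting them; a logical index also
  # works when nothing matches (x[-integer(0)] would blank nothing)
  matched <- grepl(".*(Ref. (\\d+)).*", x)
  x <- gsub(".*(Ref. (\\d+)).*", "\\1", x)
  x[!matched] <- ""
  x
}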
fun(data)
According to the documentation, this is a feature of gsub: it returns the input string unchanged if there are no matches to the supplied pattern.
Here, I use the function grepl first to return a logical vector of the presence/absence of the pattern in each string:
ifelse(grepl(".*(Ref. (\\d+)).*", data),
gsub(".*(Ref. (\\d+)).*", "\\1", data),
"")
embedding this in a function:
mygsub <- function(x){
ans <- ifelse(grepl(".*(Ref. (\\d+)).*", x),
gsub(".*(Ref. (\\d+)).*", "\\1", x),
"")
return(ans)
}
mygsub(data)
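An extraction-based sketch gets the same result without substituting at all: regexpr locates the first match in each string, and regmatches returns only the matched pieces, which are then placed into a vector pre-filled with empty strings. The pattern here escapes the dot, which is a small tightening of the original.

```r
data <- list("a sentence with citation (Ref. 12)",
             "another sentence without reference")
x <- unlist(data)

m   <- regexpr("Ref\\. \\d+", x)   # -1 where there is no match
hit <- m != -1L

out <- character(length(x))        # starts as all ""
out[hit] <- regmatches(x, m)       # regmatches returns matched elements only
print(out)
# [1] "Ref. 12" ""
```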
xs <- sapply(data, function(x) gsub(".*(Ref. (\\d+)).*", "\\1", x))
xs[xs==data] <- ""
xs
#[1] "Ref. 12" ""
Try strapplyc in the gsubfn package:
library(gsubfn)
L <- fn$sapply(unlist(data), ~ strapplyc(x, "Ref. \\d+"))
unlist(fn$sapply(L, ~ ifelse(length(x), x, "")))
which gives this:
a sentence with citation (Ref. 12) another sentence without reference
"Ref. 12" ""
If you don't mind list output then you could just use L and forget about the last line of code. Note that the fn$ prefix turns the formula arguments of the function it's applied to into functions, so the first line of code could be written without fn as sapply(unlist(data), function(x) strapplyc(x, "Ref. \\d+")).
You might try embedding grep( ..., value = T) in that function.
data <- list("a sentence with citation (Ref. 12)",
"another sentence without reference")
unlist( sapply(data, function(x) {
x <- gsub(".*(Ref. (\\d+)).*", "\\1", x)
grep( "Ref\\.", x, value = T )
} ) )
It's kind of bulky, but it works. Note that it drops the non-matching second element instead of returning an empty string for it.
Based on @joran's answer
function:
extract_matches <- function(x,pattern,replacement,replacement_nomatch=""){
  # record matches before substituting; a logical index is safe even when nothing matches
  matched <- grepl(pattern, x)
  x <- gsub(pattern,replacement,x)
  x[!matched] <- replacement_nomatch
  x
}
usage:
data <- list("with citation (Ref. 12)", "without reference", "")
extract_matches(data, ".*(Ref. (\\d+)).*", "\\1")
Another simple way is to use gsub but specify you want '' in a new function
noFalsePositives <- function(a,b,x) {
return(ifelse(gsub(a,b,x)==x,'',gsub(a,b,x)))
}
# usage
noFalsePositives(".*(Ref. (\\d+)).*", "\\1", data)
# [1] "Ref. 12" ""

Resources