Regular Expression: replace the n-th occurrence

Does someone know how to find the n-th occurrence of a string within an expression and how to replace it using a regular expression?
for example I have the following string
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
and I want to replace the 5th occurrence of '-' by '|'
and the 7th occurrence of '-' by "||" like
[1] aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa
How do I do this?
Thanks,
Florian

(1) sub It can be done in a single regular expression with sub:
> sub("(^(.*?-){4}.*?)-(.*?-.*?)-", "\\1|\\3||", txt, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(2) sub twice or this variation which calls sub twice. The 7th occurrence is replaced first, so the 5th can still be found by counting from the start:
> txt2 <- sub("(^(.*?-){6}.*?)-", "\\1||", txt, perl = TRUE)
> sub("(^(.*?-){4}.*?)-", "\\1|", txt2, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(3) sub.fun or this variation which creates a function sub.fun that performs one substitution. It makes use of fn$ from the gsubfn package to substitute n-1, pat, and value into the sub arguments. First define the indicated function and then call it twice.
library(gsubfn)
sub.fun <- function(x, pat, n, value) {
fn$sub( "(^(.*?$pat){`n-1`}.*?)$pat", "\\1$value", x, perl = TRUE)
}
> sub.fun(sub.fun(txt, "-", 7, "||"), "-", 5, "|")
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(We could have modified the arguments to sub in the body of sub.fun using paste or sprintf to give a base R solution but at the expense of some additional verbosity.)
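The base R variant mentioned in the parenthetical could look roughly like the sketch below. The name sub_nth is hypothetical, and the sketch assumes that pat is a valid regular expression and that value contains no backslashes.

```r
# sub_nth is a hypothetical name; it mirrors sub.fun but needs no packages.
sub_nth <- function(x, pat, n, value) {
  # Lazily skip n - 1 occurrences of pat, then match the n-th one.
  regex <- sprintf("(^(.*?%s){%d}.*?)%s", pat, as.integer(n) - 1L, pat)
  sub(regex, paste0("\\1", value), x, perl = TRUE)
}

txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
# Replace the later occurrence first so the earlier count is unaffected.
result <- sub_nth(sub_nth(txt, "-", 7, "||"), "-", 5, "|")
print(result)
```

The regular expression is the same as in sub.fun; only the string building changes from fn$ interpolation to sprintf and paste0.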
This can be reformulated as a replacement function giving this pleasing sequence:
"sub.fun<-" <- sub.fun
tt <- txt # make a copy so that we preserve the input txt
sub.fun(tt, "-", 7) <- "||"
sub.fun(tt, "-", 5) <- "|"
> tt
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(4) gsubfn Using gsubfn from the gsubfn package we can use a particularly simple regular expression (it's just "-") and the code has quite a straightforward structure. We perform the substitution via a proto method. The proto object containing the method is passed in place of a replacement string. The simplicity of this approach derives from the fact that gsubfn automatically makes a count variable available to such methods:
library(gsubfn) # gsubfn also pulls in proto
p <- proto(fun = function(this, x) {
if (count == 5) return("|")
if (count == 7) return("||")
x
})
> gsubfn("-", p, txt)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
UPDATE: Some corrections.
UPDATE 2: Added a replacement function approach to (3).
UPDATE 3: Added pat argument to sub.fun.

An alternative is Hadley's stringr package, on which the following function is built:
require(stringr)
replace.nth <- function(string, pattern, replacement, n) {
locations <- str_locate_all(string, pattern)
str_sub(string, locations[[1]][n, 1], locations[[1]][n, 2]) <- replacement
string
}
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
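# replace the later occurrence first: once the 5th "-" is gone,
# the original 7th "-" would be counted as the 6th
txt.new <- replace.nth(txt, "-", "||", 7)
txt.new <- replace.nth(txt.new, "-", "|", 5)
txt.new
# [1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"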

One way to do this is to use gregexpr to find the positions of the -:
posns <- gregexpr("-",txt)[[1]]
And then pasting together the relevant pieces and separators:
paste0(substr(txt,1,posns[5]-1),"|",substr(txt,posns[5]+1,posns[7]-1),"||",substr(txt,posns[7]+1,nchar(txt)))
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
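The position-based idea can be generalised so that the substrings do not have to be pasted together by hand. The sketch below is one possible way, using gregexpr together with the replacement form of regmatches; replace_occurrences is a hypothetical helper, and it assumes a single input string and a fixed (non-regex) pattern.

```r
# replace_occurrences is a hypothetical helper for a single string x.
# replacements is a named character vector: names are occurrence numbers.
replace_occurrences <- function(x, pattern, replacements) {
  m <- gregexpr(pattern, x, fixed = TRUE)   # positions of every match
  hits <- regmatches(x, m)[[1]]             # the matched pieces, in order
  idx <- as.integer(names(replacements))
  stopifnot(all(idx >= 1L), all(idx <= length(hits)))
  hits[idx] <- unname(replacements)         # swap in the chosen occurrences
  regmatches(x, m) <- list(hits)            # write all pieces back at once
  x
}

txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
out <- replace_occurrences(txt, "-", c("5" = "|", "7" = "||"))
print(out)
```

Because all positions are taken from the original string before anything is replaced, the order of the replacements does not matter here.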

Related

Replacing text in gsub by evaluating a backreference

Let's say I have some text:
myF <- "lag.variable.1+1"
I would like to get the following result for all similar expressions: lag.variable.2 (that is, replacing 1+1 by the actual sum).
The following doesn't seem to work; it appears that the backreference doesn't carry through into the eval(parse()) call:
myF<-gsub("(\\.\\w+)\\.([0-9]+\\+[0-9]+)",
paste0( "\\1." ,eval(parse(text ="\\2"))) ,
myF )
Any tips on how to achieve the desired result ?
Thanks!

Here is how you can use your current pattern with gsubfn:
library(gsubfn)
x <- " lag.variable0.3 * lag.variable1.1+1 + 9892"
p <- "(\\.\\w+)\\.([0-9]+\\+[0-9]+)"
gsubfn(p, function(n,m) paste0(n, ".", eval(parse(text = m))), x)
# => [1] " lag.variable0.3 * lag.variable1.2 + 9892"
Note that the capture groups, not the whole match, are passed to the callable in this case: Group 1 is assigned to the n variable and Group 2 to m. The return value is the concatenation of Group 1, a dot, and the evaluated Group 2 contents.
Note you may simplify the callable by using a PCRE regex (add the perl=TRUE argument) with \K, the match reset operator that discards all text matched so far:
p <- "\\.\\w+\\.\\K(\\d+\\+\\d+)"
gsubfn(p, ~ eval(parse(text = z)), x, perl=TRUE)
[1] " lag.variable0.3 * lag.variable1.2 + 9892"
You may further enhance the pattern to support other operators by replacing \\+ with [-+/*], and if you need to support numbers with fractional parts, replace [0-9]+ with \\d*\\.?\\d+:
p <- "(\\.\\w+)\\.(\\d*\\.?\\d+[-+/*]\\d*\\.?\\d+)"
## or a PCRE regex:
p <- "\\.\\w+\\.\\K(\\d*\\.?\\d+[-+/*]\\d*\\.?\\d+)"
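If installing gsubfn is not an option, the same "evaluate the matched text" idea can be sketched in base R with gregexpr and the replacement form of regmatches. The helper name eval_after_dot is hypothetical, and the sketch assumes the simple digits-plus-digits form that follows a dot.

```r
# eval_after_dot is a hypothetical helper. It finds "<digits>+<digits>"
# directly after a dot and replaces it with the evaluated sum.
eval_after_dot <- function(x) {
  m <- gregexpr("(?<=\\.)\\d+\\+\\d+", x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    # The pattern only admits digits and "+", so eval() sees simple sums.
    vapply(hits, function(h) as.character(eval(parse(text = h))),
           character(1), USE.NAMES = FALSE)
  })
  x
}

myF <- "lag.variable.1+1"
myF2 <- " lag.variable0.3 * lag.variable1.1+1 + 9892"
res1 <- eval_after_dot(myF)
res2 <- eval_after_dot(myF2)
print(res1)
print(res2)
```

The lookbehind keeps the dot out of the match, so only the arithmetic part is rewritten.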

We can use gsubfn
library(gsubfn)
gsubfn("(\\d+\\+\\d+)", ~ eval(parse(text = x)), myF)
#[1] "lag.variable.2"
gsubfn("\\.([0-9]+\\+[0-9]+)", ~ paste0(".", eval(parse(text = x))), myF2)
#[1] "lag.variable0.3 * lag.variable1.2 + 9892"
Or with str_replace
library(stringr)
str_replace(myF, "(\\d+\\+\\d+)", function(x) eval(parse(text = x)))
#[1] "lag.variable.2"
Or an option with strsplit and paste
v1 <- strsplit(myF, "\\.(?=\\d)", perl = TRUE)[[1]]
paste(v1[1], eval(parse(text = v1[2])), sep=".")
#[1] "lag.variable.2"
data
myF <- "lag.variable.1+1"
myF2 <- "lag.variable0.3 * lag.variable1.1+1 + 9892"

Replace multiple strings comprising of a different number of characters with one gsubfn()

The question "Replace multiple strings in one gsub() or chartr() statement in R?" explains how to replace multiple single-character strings in one statement with gsubfn(). E.g.:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess this is because the string 'doremi' contains multiple characters while I am using the metacharacter . in gsubfn. I have no idea what to replace it with; I must confess I sometimes find metacharacters a bit difficult to understand. Is there a way for me to replace '-' and 'doremi' at once?

You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"

Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is the pattern "doremi|-", which tells gsubfn to search for either "doremi" or "-". "|" is the alternation (or) operator.

Just a more generic version of @RLave's solution:
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"
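For comparison, a base R sketch of the same lookup-table idea is shown below. replace_many is a hypothetical helper; it treats every key as literal text (fixed = TRUE), so keys such as "." need no escaping, but it applies the replacements one after another rather than in a single pass.

```r
# replace_many is a hypothetical helper: map is a named character vector
# (or list) whose names are literal search strings.
replace_many <- function(x, map) {
  for (key in names(map)) {
    # fixed = TRUE: the key is matched literally, not as a regex
    x <- gsub(key, map[[key]], x, fixed = TRUE)
  }
  x
}

x <- "doremi g-k"
res <- replace_many(x, c("-" = "_", "doremi" = ""))
print(res)
```

Since the loop is sequential, a later key can match text produced by an earlier replacement; the single-pass gsubfn call above does not have that property.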

Replacing the nth number in a string

I have a set of files which I had named incorrectly. The file name is as follows.
Generation_Flux_0_Model_200.txt
Generation_Flux_101_Model_43.txt
Generation_Flux_11_Model_3.txt
I need to replace the second number (the model number) by adding 1 to the existing number. So the correct names would be
Generation_Flux_0_Model_201.txt
Generation_Flux_101_Model_44.txt
Generation_Flux_11_Model_4.txt
This is the code I wrote. I would like to know how to specify the position of the number (replace second number in the string with the new number)?
reNameModelNumber <- function(modelName){
#get the current model number
modelNumber = as.numeric(unlist(str_extract_all(modelName, "\\d+"))[2])
#increment it by 1
newModelNumber = modelNumber + 1
#building the new name with gsub
newModelName = gsub(" regex ", newModelNumber, modelName)
#rename
file.rename(modelName, newModelName)
}
reactionFiles = list.files(pattern = "^Generation_Flux_\\d+_Model_\\d+\\.txt$")
sapply(reactionFiles, function(x) reNameModelNumber(x))

We can use gsubfn to increment by 1. Capture the digits ((\\d+))
followed by a . and 'txt' at the end ($) of the string, and replace the match with the incremented number followed by the extension
library(gsubfn)
gsubfn("(\\d+)\\.txt$", ~ paste0(as.numeric(x) + 1, ".txt"), str1)
#[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
#[3] "Generation_Flux_11_Model_4.txt"
data
str1 <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt")

Answering the question, if you want to increment a certain number inside a string, you may use
> library(gsubfn)
> nth = 2
> reactionFiles <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt", "Generation_Flux_11_Model_3.txt")
> gsubfn(paste0("^((?:\\D*\\d+){", nth-1, "}\\D*)(\\d+)"), function(x,y,z) paste0(x, as.numeric(y) + 1), reactionFiles)
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt" "Generation_Flux_11_Model_4.txt"
nth here is the number of the digit chunk to increment.
Pattern details
^((?:\\D*\\d+){nth-1}\\D*) - Capturing group 1 (the value is accessed in the gsubfn method via x):
(?:\\D*\\d+){nth-1} - nth-1 occurrences of
\\D* - 0 or more chars other than digits
\\d+ - 1+ digits
\\D* - 0+ non-digits
(\\d+) - Capturing group 2 (the value is accessed in the gsubfn method via y): one or more digits
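The same "n-th digit chunk" idea can also be sketched without gsubfn by collecting all digit runs and rewriting only the chosen one. increment_nth_number is a hypothetical helper; it assumes the numbers fit in an integer and have no leading zeros that need to be preserved.

```r
# increment_nth_number is a hypothetical helper working on a character vector.
increment_nth_number <- function(x, nth, by = 1L) {
  m <- gregexpr("[0-9]+", x)          # every run of digits in each string
  nums <- regmatches(x, m)
  regmatches(x, m) <- lapply(nums, function(v) {
    # Only touch strings that actually contain an nth number.
    if (length(v) >= nth) v[nth] <- as.character(as.integer(v[nth]) + by)
    v
  })
  x
}

reactionFiles <- c("Generation_Flux_0_Model_200.txt",
                   "Generation_Flux_101_Model_43.txt",
                   "Generation_Flux_11_Model_3.txt")
fixed <- increment_nth_number(reactionFiles, 2)
print(fixed)
```

Strings with fewer than nth numbers are returned unchanged.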

Using base-R.
data <- c( # Just an example
"Generation_Flux_0_Model_200.txt",
"Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt"
)
fixNameModel <- function(data){
n <- length(data)
# get the current model number and increment it by 1
newn = as.integer(sub(".+_(\\d+)\\.txt", "\\1", data)) + 1L
#building the new name with gsub
newModelName <- vector(mode = "character", length = n)
for (i in 1:n) {
newModelName[i] <- gsub("\\d+\\.txt$", paste0(newn[i], ".txt"), data[i])
}
newModelName
}
fixNameModel(data)
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
[3] "Generation_Flux_11_Model_4.txt"
You can now do something like file.rename(modelName, fixNameModel(modelName))
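A cautious way to try the rename step is to run it in a scratch directory first. The sketch below is self-contained and only illustrative: it creates empty files under tempfile(), computes the new names inline instead of calling fixNameModel, and stops if a target name already exists, since renaming onto an existing file may replace it.

```r
# Illustrative only: work in a throwaway directory, not on real data.
old <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_11_Model_3.txt")
work_dir <- tempfile("models_")
dir.create(work_dir)
invisible(file.create(file.path(work_dir, old)))

# Extract the trailing model number and build the incremented names.
model <- as.integer(sub(".*_(\\d+)\\.txt$", "\\1", old))
new <- paste0(sub("\\d+\\.txt$", "", old), model + 1L, ".txt")

# Refuse to continue if a target name is already taken.
stopifnot(!any(file.exists(file.path(work_dir, new))))
renamed <- file.rename(file.path(work_dir, old), file.path(work_dir, new))
print(list.files(work_dir))
```

file.rename returns one logical value per file, so all(renamed) is a quick check that every rename succeeded.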
EDIT:
Here is a slightly neater version that makes stronger assumptions about the file name layout:
fixNameModel2 <- function(data) {
sapply(
strsplit(data, "_|\\."),
function(x) {
x[5] <- as.integer(x[5]) + 1L
x <- paste0(x, collapse = "_")
gsub("_txt", ".txt", x, fixed = TRUE)
}
)
}

Assuming that the digit always occurs before the extension, as is mentioned in the comments, here is another base R solution that is a little bit simpler.
sapply(regmatches(tmp, regexec("\\d+(?=\\.)", tmp, perl=TRUE), invert=NA),
function(x) paste0(c(x[1], as.integer(x[2]) + 1L, x[3]), collapse=""))
This returns
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
[3] "Generation_Flux_11_Model_4.txt"
regexec returns a list with the match position for each string. regmatches with invert=NA takes this information and returns a list of character vectors that breaks up each original string along the match: the text before the match is the first element, the matched text is the second, and the text after the match is the third. Feed this list to sapply, convert the second element to integer and increment. Then paste the pieces back together to return an atomic vector.
The regex "\d+(?=\.)" uses a perl lookahead, "(?=\.)", which checks for the dot without making it part of the match, while "\d+" matches the digits.
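The same lookahead can also be combined with regexpr and the replacement form of regmatches, which avoids splitting and re-pasting the strings. This is a sketch under the assumption that every file name ends in digits followed by ".txt"; the vector name files is only for illustration.

```r
files <- c("Generation_Flux_0_Model_200.txt",
           "Generation_Flux_101_Model_43.txt",
           "Generation_Flux_11_Model_3.txt")

# Match the digit run that is directly followed by ".txt" at the end;
# the lookahead keeps the extension out of the match.
m <- regexpr("\\d+(?=\\.txt$)", files, perl = TRUE)

# Overwrite just the matched digits with the incremented value.
regmatches(files, m) <- as.character(as.integer(regmatches(files, m)) + 1L)
print(files)
```

Only the matched digits are replaced, so the rest of each name, including the extension, is left untouched.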
data
tmp <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt")

How to convert CamelCase to not.camel.case in R

In R, I'd like to convert
c("ThisText", "NextText")
to
c("this.text", "next.text")
This is the reverse of this SO question, and the same as this one but with dots in R rather than underscores in PHP.

It is not clear what the entire set of rules is here, but we have assumed that
we should lower case any upper case character that follows a lower case one and insert a dot between them, and also
lower case the first character of the string if it is followed by a lower case character.
To do this we can use perl regular expressions with sub and gsub:
# test data
camelCase <- c("ThisText", "NextText", "DON'T_CHANGE")
s <- gsub("([a-z])([A-Z])", "\\1.\\L\\2", camelCase, perl = TRUE)
sub("^(.[a-z])", "\\L\\1", s, perl = TRUE) # make 1st char lower case
giving:
[1] "this.text" "next.text" "DON'T_CHANGE"

You could do this also via the snakecase package:
install.packages("snakecase")
library(snakecase)
to_snake_case(c("ThisText", "NextText"), sep_out = ".")
# [1] "this.text" "next.text"
Github link to package: https://github.com/Tazinho/snakecase

You can replace all capitals with themselves and a preceding dot with gsub, change everything with tolower, and then substr out the initial dot:
x <- c("ThisText", "NextText", "LongerCamelCaseText")
substr(tolower(gsub("([A-Z])","\\.\\1",x)),2,.Machine$integer.max)
[1] "this.text" "next.text" "longer.camel.case.text"

Using stringr
library(stringr)
x <- c("ThisText", "NextText")
str_replace_all(string = x,
pattern = "(?<=[a-z0-9])(?=[A-Z])",
replacement = ".") %>%
str_to_lower()
OR
x <- c("ThisText", "NextText")
str_to_lower(
str_replace_all(string = x,
pattern = "(?<=[a-z0-9])(?=[A-Z])",
replacement = ".")
)
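The lookaround pattern used above also works with base R's gsub when perl = TRUE is set, so the conversion can be sketched without stringr. to_dot_case is a hypothetical name.

```r
# to_dot_case is a hypothetical helper: insert a dot at every position
# that has a lower case letter or digit before it and a capital after it,
# then lower-case the whole string.
to_dot_case <- function(x) {
  tolower(gsub("(?<=[a-z0-9])(?=[A-Z])", ".", x, perl = TRUE))
}

x <- c("ThisText", "NextText", "LongerCamelCaseText")
res <- to_dot_case(x)
print(res)
```

The pattern matches an empty position rather than a character, so nothing is consumed and only the dot is inserted.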

regex out a text from a line in R

I have line like this:
x<-c("System configuration: lcpu=96 mem=196608MB ent=16.00")
I need to get the value of ent and store it in the val object in R.
I am doing this, but it does not seem to be working. Any ideas?
val<-x[grep("[0-9]$", x)]

use sub:
val <- sub('^.* ent=([[:digit:]]+)', '\\1', x)

If ent is always at the end then:
sub(".*ent=", "", x)
If not try strapplyc in the gsubfn package which returns only the portion of the regular expression within parentheses:
library(gsubfn)
strapplyc(x, "ent=([.0-9]+)", simplify = TRUE)
Also it could be converted to numeric at the same time using strapply :
strapply(x, "ent=([.0-9]+)", as.numeric, simplify = TRUE)
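A base R sketch of the same extraction, for cases where gsubfn is not available, could use regexpr with a lookbehind and regmatches. The second part, which splits every key=value pair into a named vector (kv is a hypothetical name), assumes that keys are lower case letters and that values contain no spaces.

```r
x <- "System configuration: lcpu=96 mem=196608MB ent=16.00"

# Match the digits and dots that directly follow "ent=".
m <- regexpr("(?<=ent=)[0-9.]+", x, perl = TRUE)
val <- as.numeric(regmatches(x, m))
print(val)

# Optionally collect every key=value pair into a named character vector.
pairs <- regmatches(x, gregexpr("[a-z]+=[^ ]+", x))[[1]]
kv <- setNames(sub("^[^=]+=", "", pairs), sub("=.*$", "", pairs))
print(kv)
```

The values in kv stay as character strings, because entries such as mem carry a unit suffix; convert individual entries with as.numeric where that makes sense.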

Using rex may make this type of task a little simpler.
Note this solution correctly includes . in the capture, as does G. Grothendieck's answer.
x <- c("System configuration: lcpu=96 mem=196608MB ent=16.00")
library(rex)
val <- as.numeric(
re_matches(x,
rex("ent=",
capture(name = "ent", some_of(digit, "."))
)
)$ent
)
#>[1] 16
