Replacing the nth number in a string - r

I have a set of files which I had named incorrectly. The file name is as follows.
Generation_Flux_0_Model_200.txt
Generation_Flux_101_Model_43.txt
Generation_Flux_11_Model_3.txt
I need to replace the second number (the model number) by adding 1 to the existing number. So the correct names would be
Generation_Flux_0_Model_201.txt
Generation_Flux_101_Model_44.txt
Generation_Flux_11_Model_4.txt
This is the code I wrote. I would like to know how to specify the position of the number (replace second number in the string with the new number)?
reNameModelNumber <- function(modelName){
#get the current model number
modelNumber = as.numeric(unlist(str_extract_all(modelName, "\\d+"))[2])
#increment it by 1
newModelNumber = modelNumber + 1
#building the new name with gsub
newModelName = gsub(" regex ", newModelNumber, modelName)
#rename
file.rename(modelName, newModelName)
}
reactionFiles = list.files(pattern = "^Generation_Flux_\\d+_Model_\\d+\\.txt$")
sapply(reactionFiles, function(x) reNameModelNumber(x))

We can use gsubfn to increment by 1. Capture the digits ((\\d+))
followed by a . and 'txt' at the end ($) of the string, and replace the match with the incremented number followed by '.txt'
library(gsubfn)
gsubfn("(\\d+)\\.txt$", ~ paste0(as.numeric(x) + 1, ".txt"), str1)
#[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
#[3] "Generation_Flux_11_Model_4.txt"
data
str1 <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt")

Answering the question, if you want to increment a certain number inside a string, you may use
> library(gsubfn)
> nth = 2
> reactionFiles <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt", "Generation_Flux_11_Model_3.txt")
> gsubfn(paste0("^((?:\\D*\\d+){", nth-1, "}\\D*)(\\d+)"), function(x,y,z) paste0(x, as.numeric(y) + 1), reactionFiles)
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt" "Generation_Flux_11_Model_4.txt"
nth here is the number of the digit chunk to increment.
Pattern details
^((?:\\D*\\d+){n}\\D*) - Capturing group 1, where n is nth-1 (the value is accessed in the gsubfn function via x):
(?:\\D*\\d+){n} - n occurrences of
\\D* - 0 or more chars other than digits
\\d+ - 1+ digits
\\D* - 0+ non-digits
(\\d+) - Capturing group 2 (the value is accessed in the gsubfn function via y): one or more digits
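For comparison, the same idea can be sketched in base R without building a counting pattern: locate every run of digits with gregexpr, increment the nth one, and write it back with the replacement form of regmatches. The helper name increment_nth_number and its by argument are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper (not from the answers above): add `by` to the nth run
# of digits in each element of x; strings with fewer runs are left untouched.
increment_nth_number <- function(x, nth, by = 1) {
  m <- gregexpr("\\d+", x)      # match data for every digit run
  runs <- regmatches(x, m)      # list: one character vector per string
  runs <- lapply(runs, function(r) {
    if (length(r) >= nth) r[nth] <- as.character(as.numeric(r[nth]) + by)
    r
  })
  regmatches(x, m) <- runs      # write the modified runs back in place
  x
}

reactionFiles <- c("Generation_Flux_0_Model_200.txt",
                   "Generation_Flux_101_Model_43.txt",
                   "Generation_Flux_11_Model_3.txt")
result <- increment_nth_number(reactionFiles, 2)
result
```

Calling it with nth = 1 would increment the flux number instead, which is the main reason to prefer a position-based helper over a pattern tied to the file extension.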

Using base-R.
data <- c( # Just an example
"Generation_Flux_0_Model_200.txt",
"Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt"
)
fixNameModel <- function(data){
n <- length(data)
# get the current model number and increment it by 1
newn = as.integer(sub(".+_(\\d+)\\.txt", "\\1", data)) + 1L
#building the new name with gsub
newModelName <- vector(mode = "character", length = n)
for (i in 1:n) {
newModelName[i] <- gsub("\\d+\\.txt$", paste0(newn[i], ".txt"), data[i])
}
newModelName
}
fixNameModel(data)
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
[3] "Generation_Flux_11_Model_4.txt"
You can now do something like file.rename(modelName, fixNameModel(modelName))
EDIT:
Here is a slightly neater version, though it makes stronger assumptions about the structure of the file names:
fixNameModel2 <- function(data) {
sapply(
strsplit(data, "_|\\."),
function(x) {
x[5] <- as.integer(x[5]) + 1L
x <- paste0(x, collapse = "_")
gsub("_txt", ".txt", x, fixed = TRUE)
}
)
}

Assuming that the digit always occurs before the extension, as is mentioned in the comments, here is another base R solution that is a little bit simpler.
sapply(regmatches(tmp, regexec("\\d+(?=\\.)", tmp, perl=TRUE), invert=NA),
function(x) paste0(c(x[1], as.integer(x[2]) + 1L, x[3]), collapse=""))
This returns
[1] "Generation_Flux_0_Model_201.txt" "Generation_Flux_101_Model_44.txt"
[3] "Generation_Flux_11_Model_4.txt"
regexec returns a list of match positions, one element per input string. regmatches, called with invert=NA, takes this information and returns a list of character vectors that break up each original string around the match, with the matched text as the second element. Feed this list to sapply, convert the second element to integer and increment it. Then paste the pieces back together to return an atomic vector.
The regex "\d+(?=\.)" uses a perl lookahead, "(?=\.)", which checks for the dot without making it part of the match, while "\d+" matches the digits.
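To see what the lookahead does in isolation, here is a minimal sketch using regexpr together with regmatches and its replacement form. It assumes, as above, that the number to change is the one directly before the dot; the variable names are only illustrative.

```r
tmp <- c("Generation_Flux_0_Model_200.txt",
         "Generation_Flux_101_Model_43.txt",
         "Generation_Flux_11_Model_3.txt")

# \\d+(?=\\.) matches digits only when a dot follows. The dot is not part of
# the match, so the extension is left untouched by any replacement.
m <- regexpr("\\d+(?=\\.)", tmp, perl = TRUE)
matched <- regmatches(tmp, m)

# Replace each match with the incremented number.
incremented <- tmp
regmatches(incremented, m) <- as.character(as.integer(matched) + 1L)
incremented
```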
data
tmp <- c("Generation_Flux_0_Model_200.txt", "Generation_Flux_101_Model_43.txt",
"Generation_Flux_11_Model_3.txt")

Related

extract the shortest and first encounter match between two strings in R

I want the function to return the string that follows the below condition.
after "def"
in the parentheses right before the first %ile after "def"
So the desirable output is "4", not "5". So far, I was able to extract "2)(3)(4". If I change the function to str_extract_all, the output became "2)(3)(4" and "5" . I couldn't figure out how to fix this problem. Thanks!
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile"
string.after.match <- str_match(string = x,
pattern = "(?<=def)(.*)")[1, 1]
parentheses.value <- str_extract(string.after.match, # get value in ()
"(?<=\\()(.*?)(?=\\)\\%ile)")
parentheses.value
Here is a one liner that will do the trick using gsub()
gsub(".*def.*(\\d+)\\)%ile.*%ile", "\\1", x, perl = TRUE)
Here's an approach that will work with any number of "%ile"s. Based on str_split()
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile(9)%ile"
x %>%
str_split("def", simplify = TRUE) %>%
subset(TRUE, 2) %>%
str_split("%ile", simplify = TRUE) %>%
subset(TRUE, 1) %>%
str_replace(".*(\\d+)\\)$", "\\1")
sub(".*?def.*?(\\d)\\)%ile.*", "\\1", x)
[1] "4"
You can use
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile"
library(stringr)
result <- str_match(x, "\\bdef(?:\\((\\d+)\\))+%ile")
result[,2]
See the R demo online and the regex demo.
Details:
\b - word boundary
def - a def string
(?:\((\d+)\))+ - one or more occurrences of ( + one or more digits (captured into Group 1) + ); only the last occurrence is kept in Group 1
%ile - an %ile string.
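The behaviour described for Group 1 (a repeated capturing group keeps only its last iteration) can also be checked in base R. This is a sketch using regexec with perl = TRUE, which assumes a reasonably recent version of R; it is not part of the answer above.

```r
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile"

# regexec returns the whole match followed by each capturing group.
m <- regmatches(x, regexec("\\bdef(?:\\((\\d+)\\))+%ile", x, perl = TRUE))

whole_match <- m[[1]][1]   # everything from "def" to the first "%ile" after it
last_group  <- m[[1]][2]   # Group 1: the value in the last pair of parentheses
last_group
```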

argument 'replacement' has length > 1 and only the first element will be used

I would like to replace the first 3 letters of txt.files with a sequence.
x <- list.files()
n <- seq(length(list.files()))
x2 <- gsub('^.{3}', n, x)
file.rename(x, x2)
the 4 files in the folder
2eEMORT.txt
3h4MORT.txt
4F1MORT.txt
841MORT.txt
were replaced by one file
1MORT.txt
In the OP's code, gsub (or sub) is not vectorized over replacement, i.e. it takes a vector of length 1. Hence, we get the warning message, and every file is given the same new name. One option is to make use of substring (fast and efficient) along with paste
x2 <- paste0(seq_along(x), substring(x, 4))
x2
#[1] "1MORT.txt" "2MORT.txt" "3MORT.txt" "4MORT.txt"
Or with paste and sub. Here, we match first 3 characters as in the OP's code and replace it with blank ("") and then paste
x2 <- paste0(seq_along(x), sub("^.{3}", "", x))
Also, if we need to do this using regex, a vectorized option is str_replace
library(stringr)
x2 <- str_replace(x, "^.{3}", as.character(n))
x2
#[1] "1MORT.txt" "2MORT.txt" "3MORT.txt" "4MORT.txt"
NOTE: None of the solutions use any loop
Now, we simply do
file.rename(x, x2)
data
x <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
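A small sketch of the difference, using the sample names from the data line instead of real files. The mapply call is just one more way to pair each name with its own replacement; it is shown as an illustration and is not something the answer relies on.

```r
x <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
n <- seq_along(x)

# gsub() uses only the first replacement, so all names collapse to one value.
collapsed <- suppressWarnings(gsub("^.{3}", n, x))
unique(collapsed)

# Pairing each name with its own number keeps the results distinct.
x2 <- mapply(function(name, i) sub("^.{3}", i, name), x, n, USE.NAMES = FALSE)
x2
```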
The reason you're getting the warning "argument 'replacement' has length >1 and only the first element will be used" is because you're supplying n -- a vector of the form c(1, 2, ...) -- as a string to replace the substring matching your regex ^.{3}.
If what you want to do is replace the first three characters of each filename with a number you can sort by, here is one way to do it (comments explain each step):
# the files to be renamed
fnames <- list.files()
# new prefixes to add: '001', '002', '003', etc.
# (note usage of sprintf() to get left-padding for nice sorting)
fname_prefixes <- sprintf("%03d", seq_along(fnames))
# sub the i-th prefix for the first three characters of the i-th filename
new_fnames <- Map(function(fname, idx) gsub("^.{3}", idx, fname),
fnames, fname_prefixes)
Then you can rename each file by iterating over the named list new_fnames:
for (idx in seq_along(new_fnames)){
# can show a message so you can track what's going on
message('renaming ', names(new_fnames)[idx], ' to: ', new_fnames[[idx]])
file.rename(from=names(new_fnames)[idx], to=new_fnames[[idx]])
}
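As the question shows, renaming several files to the same name leaves only one of them, so it may be worth checking the new names before calling file.rename. The following sketch runs in a throwaway temporary directory with made-up empty files, so nothing real is touched; the checks are a suggestion and not part of the answer above.

```r
# Work in a temporary directory with empty placeholder files.
demo_dir <- tempfile("rename_demo_")
dir.create(demo_dir)
old_names <- c("2eEMORT.txt", "3h4MORT.txt", "4F1MORT.txt", "841MORT.txt")
file.create(file.path(demo_dir, old_names))

# Zero-padded prefix followed by everything after the first three characters.
new_names <- paste0(sprintf("%03d", seq_along(old_names)),
                    substring(old_names, 4))

# Stop if two files would end up with the same name or a target already exists.
stopifnot(anyDuplicated(new_names) == 0,
          !any(file.exists(file.path(demo_dir, new_names))))

renamed <- file.rename(file.path(demo_dir, old_names),
                       file.path(demo_dir, new_names))
list.files(demo_dir)
```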

Mathematical operation on captured number that comes after a certain character in R

I have a vector that has elements like this:
x <- c("3434/1233", "3434.332/232.2", "220.23/932.89", "908.11111/9")
I want to replace the numbers after slash with their value multiplied by 60.
So the resulting vector will be:
x.times.sixty <- c("3434/73980", "3434.332/13932", "220.23/55973.4", "908.11111/540")
How can I do this?
I have tried the following which does not work:
gsub(x = x, pattern = "/(.*)", replacement = as.numeric('\\1' * 60))
Also this one:
gsub(x = x, pattern = "/(.*)", replacement = '\\1 * 60')
We can do the multiplication using gsubfn. Capture the numbers including the decimals at the end of the string (([0-9.]+$)), convert it to numeric and multiply by 60
library(gsubfn)
gsubfn("([0-9.]+$)", ~ as.numeric(x)*60, x)
#[1] "3434/73980" "3434.332/13932" "220.23/55973.4" "908.11111/540"
Or to follow the conditions correctly
gsubfn("\\/([0-9.]+$)", ~ paste0("/", as.numeric(x)*60), x)
#[1] "3434/73980" "3434.332/13932" "220.23/55973.4" "908.11111/540"
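If installing gsubfn is not an option, one possible base R sketch is to split each element at the slash, multiply the second piece, and paste the pieces back together. It assumes every element contains exactly one slash followed by a valid number.

```r
x <- c("3434/1233", "3434.332/232.2", "220.23/932.89", "908.11111/9")

# Split on the literal "/" so each element becomes c(before, after).
parts <- strsplit(x, "/", fixed = TRUE)

# Keep the first piece as text and multiply the second piece by 60.
x.times.sixty <- vapply(
  parts,
  function(p) paste0(p[1], "/", as.numeric(p[2]) * 60),
  character(1)
)
x.times.sixty
```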

How to remove common parts of strings in a character vector in R?

Assume a character vector like the following
file1_p1_analysed_samples.txt
file1_p1_raw_samples.txt
f2_file2_p1_analysed_samples.txt
f3_file3_p1_raw_samples.txt
Desired output:
file1_p1_analysed
file1_p1_raw
file2_p1_analysed
file3_p1_raw
I would like to compare the elements and remove parts of the string from start and end as much as possible but keep them unique.
The above one is just an example. The parts to be removed are not common to all elements. I need a general solution independent of the strings in the above example.
So far I have been able to chuck off parts that are common to all elements, provided the separator and the resulting split parts are of same length. Here is the function,
mf <- function(x,sep){
xsplit = strsplit(x,split = sep)
xdfm <- as.data.frame(do.call(rbind,xsplit))
res <- list()
for (i in 1:ncol(xdfm)){
if (!all(xdfm[,i] == xdfm[1,i])){
res[[length(res)+1]] <- as.character(xdfm[,i])
}
}
res <- as.data.frame(do.call(rbind,res))
res <- apply(res,2,function(x) paste(x,collapse="_"))
return(res)
}
Applying the above function:
1.
> a = c("a_samples.txt","b_samples.txt")
> mf(a,"_")
V1 V2
"a" "b"
2.
> b = c("apple.fruit.txt","orange.fruit.txt")
> mf(b,sep = "\\.")
V1 V2
"apple" "orange"
If the resulting split parts are not same length, this doesn't work.
What about
files <- c("file1_p1_analysed_samples.txt", "file1_p1_raw_samples.txt", "f2_file2_p1_analysed_samples.txt", "f3_file3_p1_raw_samples.txt")
new_files <- gsub('_samples\\.txt', '', files)
new_files
... which yields
[1] "file1_p1_analysed" "file1_p1_raw" "f2_file2_p1_analysed" "f3_file3_p1_raw"
This removes the _samples.txt part from your strings.
Why not:
strings <- c("file1_p1_analysed_samples.txt",
"file1_p1_raw_samples.txt",
"f2_file2_p1_analysed_samples.txt",
"f3_file3_p1_raw_samples.txt")
sapply(strings, function(x) {
pattern <- ".*(file[0-9].*)_samples\\.txt"
gsub(x, pattern = pattern, replacement = "\\1")
})
Whatever matches between ( and ) can be reused as a group in the replacement by means of a backreference such as \\1. You can even specify multiple groups!
Having seen your comment on Jan's answer: why not define your static bits, paste them together into a pattern, and always surround the part you want to keep with parentheses? Then you can always refer to it as \\i in the replacement of gsub.
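A minimal sketch of that suggestion, assuming the part to keep always starts with "file" followed by a digit and that the suffix to drop is always _samples.txt. The variable names leading, keep and trailing are made up for this example.

```r
strings <- c("file1_p1_analysed_samples.txt",
             "file1_p1_raw_samples.txt",
             "f2_file2_p1_analysed_samples.txt",
             "f3_file3_p1_raw_samples.txt")

leading  <- ".*"              # anything before the part to keep
keep     <- "(file[0-9].*)"   # the only parenthesised group, so it is \\1
trailing <- "_samples\\.txt"  # static suffix to drop

# Assemble the pattern from its pieces and keep only the captured group.
pattern <- paste0("^", leading, keep, trailing, "$")
cleaned <- sub(pattern, "\\1", strings)
cleaned
```

Changing the static pieces then only requires editing the three variables, not the call to sub.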

Regular Expression: replace the n-th occurrence

Does someone know how to find the n-th occurrence of a string within an expression and how to replace it using a regular expression?
for example I have the following string
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
and I want to replace the 5th occurrence of '-' by '|'
and the 7th occurrence of '-' by "||" like
[1] aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa
How do I do this?
Thanks,
Florian
(1) sub It can be done in a single regular expression with sub:
> sub("(^(.*?-){4}.*?)-(.*?-.*?)-", "\\1|\\3||", txt, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(2) sub twice or this variation which calls sub twice:
> txt2 <- sub("(^(.*?-){6}.*?)-", "\\1|", txt, perl = TRUE)
> sub("(^(.*?-){4}.*?)-", "\\1||", txt2, perl = TRUE)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(3) sub.fun or this variation, which creates a function sub.fun that performs one substitution. It makes use of fn$ from the gsubfn package to substitute n-1, pat, and value into the sub arguments. First define the indicated function and then call it twice.
library(gsubfn)
sub.fun <- function(x, pat, n, value) {
fn$sub( "(^(.*?-){`n-1`}.*?)$pat", "\\1$value", x, perl = TRUE)
}
> sub.fun(sub.fun(txt, "-", 7, "||"), "-", 5, "|")
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
(We could have modified the arguments to sub in the body of sub.fun using paste or sprintf to give a base R solution but at the expense of some additional verbosity.)
This can be reformulated as a replacement function giving this pleasing sequence:
"sub.fun<-" <- sub.fun
tt <- txt # make a copy so that we preserve the input txt
sub.fun(tt, "-", 7) <- "||"
sub.fun(tt, "-", 5) <- "|"
> tt
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
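The base R variant mentioned in the parenthetical remark above could look roughly like this. It is a sketch that builds the pattern with sprintf; the name sub_nth is hypothetical, and both pat and value are assumed to contain no characters that are special in a regular expression or in a replacement string.

```r
# Hypothetical base R counterpart of sub.fun: replace the nth match of `pat`.
sub_nth <- function(x, pat, n, value) {
  # Skip n-1 occurrences lazily, capture everything before the nth one.
  pattern <- sprintf("^((?:.*?%s){%d}.*?)%s", pat, as.integer(n - 1), pat)
  sub(pattern, paste0("\\1", value), x, perl = TRUE)
}

txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"

# Replace the 7th occurrence first so the count for the 5th is unaffected.
result <- sub_nth(sub_nth(txt, "-", 7, "||"), "-", 5, "|")
result
```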
(4) gsubfn Using gsubfn from the gsubfn package we can use a particularly simple regular expression (it's just "-") and the code has quite a straightforward structure. We perform the substitution via a proto method. The proto object containing the method is passed in place of a replacement string. The simplicity of this approach derives from the fact that gsubfn automatically makes a count variable available to such methods:
library(gsubfn) # gsubfn also pulls in proto
p <- proto(fun = function(this, x) {
if (count == 5) return("|")
if (count == 7) return("||")
x
})
> gsubfn("-", p, txt)
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
UPDATE: Some corrections.
UPDATE 2: Added a replacement function approach to (3).
UPDATE 3: Added pat argument to sub.fun.
An alternative possibility is using Hadley's stringr package which builds the basis for the function I wrote:
require(stringr)
replace.nth <- function(string, pattern, replacement, n) {
locations <- str_locate_all(string, pattern)
str_sub(string, locations[[1]][n, 1], locations[[1]][n, 2]) <- replacement
string
}
txt <- "aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa-aaa"
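# Replace the later (7th) occurrence first: replacing the 5th first would
# shift the count, so the original 8th "-" would become the new 7th.
txt.new <- replace.nth(txt, "-", "||", 7)
txt.new <- replace.nth(txt.new, "-", "|", 5)
txt.new
# [1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"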
One way to do this is to use gregexpr to find the positions of the -:
posns <- gregexpr("-",txt)[[1]]
And then pasting together the relevant pieces and separators:
paste0(substr(txt,1,posns[5]-1),"|",substr(txt,posns[5]+1,posns[7]-1),"||",substr(txt,posns[7]+1,nchar(txt)))
[1] "aaa-aaa-aaa-aaa-aaa|aaa-aaa||aaa-aaa-aaa"
