gsub / sub to extract between certain characters - r

How can I extract the numbers / ID from the following string in R?
link <- "D:/temp/sample_data/0000098618-13-000011.htm"
I want to just extract 0000098618-13-000011
That is discard the .htm and the D:/temp/sample_data/.
I have tried grep and gsub without much luck.

1) basename Use basename followed by sub:
sub("\\..*", "", basename(link))
## [1] "0000098618-13-000011"
2) file_path_sans_ext
library(tools)
file_path_sans_ext(link)
## [1] "0000098618-13-000011"
3) sub
sub(".*/(.*)\\..*", "\\1", link)
## [1] "0000098618-13-000011"
4) gsub
gsub(".*/|\\.[^.]*$", "", link)
## [1] "0000098618-13-000011"
5) strsplit
sapply(strsplit(link, "[/.]"), function(x) tail(x, 2)[1])
## [1] "0000098618-13-000011"
6) read.table. If link is a vector this will only work if all elements have the same number of /-separated components. Also this assumes that the only dot is the one separting the extension.
DF <- read.table(text = link, sep = "/", comment = ".", as.is = TRUE)
DF[[ncol(DF)]]
## [1] "0000098618-13-000011"

Using stringr:
library(stringr)
str_extract(link , "[0-9-]+")
# "0000098618-13-000011"

Related

How to extract text between two separators in R?

I have a vector of strings like so:
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
I only want the part between the two forward slashes, i.e., "10g" and "6g".
You could sub() here with a capture group:
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
sub(".*/([^/]+)/.*", "\\1", mystr)
[1] "10g" "6g"
similar to Tim Biegeleisen, but with a lookbehind and lookahead, using srt_extract from stringr:
library(stringr)
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")
str_extract(mystr,"(?<=/)[^/]+(?=/)")
[1] "10g" "6g"
More simply you can capitalize on the fact that the desired substring is one or more digits followed by literal g:
library(stringr)
str_extract(mystr, "\\d+g")
[1] "10g" "6g"
Here are a few alternatives. They use no packages and the first two do not use any regular expressions.
basename(dirname(mystr))
## [1] "10g" "6g"
read.table(text = mystr, sep = "/")[[2]]
## [1] "10g" "6g"
trimws(trimws(mystr,, "[^/]"),, "/")
## [1] "10g" "6g"
We could also reformulate these using pipes
mystr |> dirname() |> basename()
## [1] "10g" "6g"
read.table(text = mystr, sep = "/") |> (`[[`)(2)
## [1] "10g" "6g"
mystr |> trimws(, "[^/]") |> trimws(, "/")
## [1] "10g" "6g"
Note
From the question the input is
mystr <- c("./10g/13.9264.csv", "./6g/62.0544.csv")

Using regex to remove the filepath and the trailing _# of a filename in a sub command [duplicate]

I have a file:
ABCD.csv
The length before the .csv is not fixed and vary in any length.
How can I extract the portion before the .csv?
There's a built in file_path_sans_ext from the standard install tools package that grabs the file without the extension.
tools::file_path_sans_ext("ABCD.csv")
## [1] "ABCD"
basename will also remove the path leading to the file. And with this regex, any extension will be removed.
filepath <- "d:/Some Dir/ABCD.csv"
sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(filepath))
# [1] "ABCD"
Or, using file_path_sans_ext as Tyler Rinker suggested:
file_path_sans_ext(basename(filepath))
# [1] "ABCD"
You can use sub or substr
sub('\\.csv$', '', str1)
#[1] "ABCD"
or
substr(str1, 1, nchar(str1)-4)
#[1] "ABCD"
Using the 'file_path' from #JasonV's post
sub('\\..*$', '', basename(filepath))
#[1] "ABCD"
Or
library(stringr)
str_extract(filepath, perl('(?<=[/])([^/]+)(?=\\.[^.]+)'))
#[1] "ABCD"
data
str1 <- 'ABCD.csv'
fs::path_ext_remove() "removes the last extension and returns the rest of the path".
fs::path_ext_remove(c("ABCD.csv", "foo.bar.baz.txt", "d:/Some Dir/ABCD.csv"))
# Produces: [1] "ABCD" "foo.bar.baz" "D:/Some Dir/ABCD"
You can try this also:
data <- "ABCD.csv"
gsub(pattern = "\\.csv$", "", data)
#[1] "ABCD"
This will be helpful in case of list of files as well, say
data <- list.files(pattern="\\.csv$") , using the code will remove extension of all the files in the list.
If you have filenames with multiple (possible extensions) and you want to strip off only the last extension, you can try the following.
Consider the filename foo.bar.baz.txt this
sub('\\..[^\\.]*$', '', "foo.bar.baz.txt")
will leave you with foo.bar.baz.
Here is an implementation that works for compression and multiple files:
remove.file_ext <- function(path, basename = FALSE) {
out <- c()
for (p in path) {
fext <- file_ext(path)
compressions <- c("gzip", "gz", "bgz", "zip")
areCompressed <- fext %in% compressions
if (areCompressed) {
ext <- file_ext(file_path_sans_ext(path, compression = FALSE))
regex <- paste0("*\\.",ext,"\\.", fext,"$")
} else {
regex <- paste0("*\\.",fext,"$")
}
new <- gsub(pattern = regex, "", path)
out <- c(out, new)
}
return(ifelse(basename, basename(out), out))
}
The above answers are great, but I was interested in which was fastest for dealing with millions of paths at once. It seems that using sub via this SO question is the fastest for getting the filename out of the path. and then comparing three of the methods above, using tools::file_path_sans_ext is faster.
library(fs)
library(stringr)
library(microbenchmark)
files<-paste0("http://some/ppath/to/som/cool/file/",1:1000,".flac")
microbenchmark(
fs::path_ext_remove(sub(".*/", "", files)),
tools::file_path_sans_ext(sub(".*/", "", files)),
str_extract(files, '(?<=[/])([^/]+)(?=\\.[^.]+)')
)
Unit: milliseconds
expr min lq mean median uq max neval
fs::path_ext_remove(sub(".*/", "", files)) 10.6273 10.98940 11.323063 11.20500 11.4992 14.5834 100
tools::file_path_sans_ext(sub(".*/", "", files)) 1.3717 1.44260 1.532092 1.48560 1.5588 2.4806 100
str_extract(files, "(?<=[/])([^/]+)(?=\\\\.[^.]+)") 7.4197 7.62875 7.985206 7.88835 8.2311 9.4107 100
Loading the library needed :
> library(stringr)
Extracting all the matches from the regex:
> str_match("ABCD.csv", "(.*)\\..*$")
[,1] [,2]
[1,] "ABCD.csv" "ABCD"
Returning only the second part of the result, which corresponds to the group matching the file name:
> str_match("ABCD.csv", "(.*)\\..*$")[,2]
[1] "ABCD"
EDIT for #U-10-Forward:
It is basically the same principle as the other answer. Just that I found this solution more robust.
Regex wise it means:
() = group
.* = any single character except the newline character any number of time
// is escape notation, thus //. means literally "."
.* = any characters any number of time again
$ means should be at the end of the input string
The logic is then that it will return the group preceding a "." followed by a group of characters at the end of the string (which equals the file extension in this case).

How to get rid of the file extensions in the output of list.files in R [duplicate]

I have a file:
ABCD.csv
The length before the .csv is not fixed and vary in any length.
How can I extract the portion before the .csv?
There's a built in file_path_sans_ext from the standard install tools package that grabs the file without the extension.
tools::file_path_sans_ext("ABCD.csv")
## [1] "ABCD"
basename will also remove the path leading to the file. And with this regex, any extension will be removed.
filepath <- "d:/Some Dir/ABCD.csv"
sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(filepath))
# [1] "ABCD"
Or, using file_path_sans_ext as Tyler Rinker suggested:
file_path_sans_ext(basename(filepath))
# [1] "ABCD"
You can use sub or substr
sub('\\.csv$', '', str1)
#[1] "ABCD"
or
substr(str1, 1, nchar(str1)-4)
#[1] "ABCD"
Using the 'file_path' from #JasonV's post
sub('\\..*$', '', basename(filepath))
#[1] "ABCD"
Or
library(stringr)
str_extract(filepath, perl('(?<=[/])([^/]+)(?=\\.[^.]+)'))
#[1] "ABCD"
data
str1 <- 'ABCD.csv'
fs::path_ext_remove() "removes the last extension and returns the rest of the path".
fs::path_ext_remove(c("ABCD.csv", "foo.bar.baz.txt", "d:/Some Dir/ABCD.csv"))
# Produces: [1] "ABCD" "foo.bar.baz" "D:/Some Dir/ABCD"
You can try this also:
data <- "ABCD.csv"
gsub(pattern = "\\.csv$", "", data)
#[1] "ABCD"
This will be helpful in case of list of files as well, say
data <- list.files(pattern="\\.csv$") , using the code will remove extension of all the files in the list.
If you have filenames with multiple (possible extensions) and you want to strip off only the last extension, you can try the following.
Consider the filename foo.bar.baz.txt this
sub('\\..[^\\.]*$', '', "foo.bar.baz.txt")
will leave you with foo.bar.baz.
Here is an implementation that works for compression and multiple files:
remove.file_ext <- function(path, basename = FALSE) {
out <- c()
for (p in path) {
fext <- file_ext(path)
compressions <- c("gzip", "gz", "bgz", "zip")
areCompressed <- fext %in% compressions
if (areCompressed) {
ext <- file_ext(file_path_sans_ext(path, compression = FALSE))
regex <- paste0("*\\.",ext,"\\.", fext,"$")
} else {
regex <- paste0("*\\.",fext,"$")
}
new <- gsub(pattern = regex, "", path)
out <- c(out, new)
}
return(ifelse(basename, basename(out), out))
}
The above answers are great, but I was interested in which was fastest for dealing with millions of paths at once. It seems that using sub via this SO question is the fastest for getting the filename out of the path. and then comparing three of the methods above, using tools::file_path_sans_ext is faster.
library(fs)
library(stringr)
library(microbenchmark)
files<-paste0("http://some/ppath/to/som/cool/file/",1:1000,".flac")
microbenchmark(
fs::path_ext_remove(sub(".*/", "", files)),
tools::file_path_sans_ext(sub(".*/", "", files)),
str_extract(files, '(?<=[/])([^/]+)(?=\\.[^.]+)')
)
Unit: milliseconds
expr min lq mean median uq max neval
fs::path_ext_remove(sub(".*/", "", files)) 10.6273 10.98940 11.323063 11.20500 11.4992 14.5834 100
tools::file_path_sans_ext(sub(".*/", "", files)) 1.3717 1.44260 1.532092 1.48560 1.5588 2.4806 100
str_extract(files, "(?<=[/])([^/]+)(?=\\\\.[^.]+)") 7.4197 7.62875 7.985206 7.88835 8.2311 9.4107 100
Loading the library needed :
> library(stringr)
Extracting all the matches from the regex:
> str_match("ABCD.csv", "(.*)\\..*$")
[,1] [,2]
[1,] "ABCD.csv" "ABCD"
Returning only the second part of the result, which corresponds to the group matching the file name:
> str_match("ABCD.csv", "(.*)\\..*$")[,2]
[1] "ABCD"
EDIT for #U-10-Forward:
It is basically the same principle as the other answer. Just that I found this solution more robust.
Regex wise it means:
() = group
.* = any single character except the newline character any number of time
// is escape notation, thus //. means literally "."
.* = any characters any number of time again
$ means should be at the end of the input string
The logic is then that it will return the group preceding a "." followed by a group of characters at the end of the string (which equals the file extension in this case).

strsplit and keep part before first underscore

I would like to keep the part after the FIRST undescore. Please see example code.
colnames(df)
"EGAR00001341740_P32_1" "EGAR00001341741_PN32"
My try, but does not give P32_1 but only P32 which is wrong.
sapply(strsplit(colnames(df), split='_', fixed=TRUE), function(x) (x[2]))
desired output: P32_1, PN32
It could be done with a regex by matching zero or more characters that are not an underscore ([^_]*) from the start (^) of the string, followed by an underscore (_) and replace it with blanks ("")
colnames(df) <- sub("^[^_]*_", "", colnames(df))
colnames(df)
#[1] "P32_1" "PN32"
With strsplit, it will split whereever the split character occurs. One option is str_split from stringr where there is an option to specify the 'n' i.e. number of split parts. If we choose n = 2, we get 2 substrings as it will only split at the first _
library(stringr)
sapply(str_split(colnames(df), "_", n = 2), `[`, 2)
#[1] "P32_1" "PN32"
Here are a few ways. The first fixes the code in the question and the remaining ones are alternatives. All use only base except (6). (4) and (7) assume that the first field is fixed length, which is the case in the question.
x <- c("EGAR00001341740_P32_1", "EGAR00001341741_PN32")
# 1 - using strsplit
sapply(strsplit(x, "_"), function(x) paste(x[-1], collapse = "-"))
## [1] "P32_1" "PN32"
# 2 - a bit easier using sub. *? is a non-greedy match
sub(".*?_", "", x)
## [1] "P32_1" "PN32"
# 3 - locate the first underscore and extract all after that
substring(x, regexpr("_", x) + 1)
## [1] "P32_1" "PN32"
# 4 - if the first field is fixed length as in the example
substring(x, 17)
## [1] "P32_1" "PN32"
# 5 - replace first _ with character that does not appear and remove all until it
sub(".*;", "", sub("_", ";", x))
## [1] "P32_1" "PN32"
# 6 - extract everything after first _
library(gsubfn)
strapplyc(x, "_(.*)", simplify = TRUE)
## [1] "P32_1" "PN32"
# 7 - like (4) assumes fixed length first field
read.fwf(textConnection(x), widths = c(16, 99), as.is = TRUE)$V2
## [1] "P32_1" "PN32"

Get filename without extension in R

I have a file:
ABCD.csv
The length before the .csv is not fixed and vary in any length.
How can I extract the portion before the .csv?
There's a built in file_path_sans_ext from the standard install tools package that grabs the file without the extension.
tools::file_path_sans_ext("ABCD.csv")
## [1] "ABCD"
basename will also remove the path leading to the file. And with this regex, any extension will be removed.
filepath <- "d:/Some Dir/ABCD.csv"
sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(filepath))
# [1] "ABCD"
Or, using file_path_sans_ext as Tyler Rinker suggested:
file_path_sans_ext(basename(filepath))
# [1] "ABCD"
You can use sub or substr
sub('\\.csv$', '', str1)
#[1] "ABCD"
or
substr(str1, 1, nchar(str1)-4)
#[1] "ABCD"
Using the 'file_path' from #JasonV's post
sub('\\..*$', '', basename(filepath))
#[1] "ABCD"
Or
library(stringr)
str_extract(filepath, perl('(?<=[/])([^/]+)(?=\\.[^.]+)'))
#[1] "ABCD"
data
str1 <- 'ABCD.csv'
fs::path_ext_remove() "removes the last extension and returns the rest of the path".
fs::path_ext_remove(c("ABCD.csv", "foo.bar.baz.txt", "d:/Some Dir/ABCD.csv"))
# Produces: [1] "ABCD" "foo.bar.baz" "D:/Some Dir/ABCD"
You can try this also:
data <- "ABCD.csv"
gsub(pattern = "\\.csv$", "", data)
#[1] "ABCD"
This will be helpful in case of list of files as well, say
data <- list.files(pattern="\\.csv$") , using the code will remove extension of all the files in the list.
If you have filenames with multiple (possible extensions) and you want to strip off only the last extension, you can try the following.
Consider the filename foo.bar.baz.txt this
sub('\\..[^\\.]*$', '', "foo.bar.baz.txt")
will leave you with foo.bar.baz.
Here is an implementation that works for compression and multiple files:
remove.file_ext <- function(path, basename = FALSE) {
out <- c()
for (p in path) {
fext <- file_ext(path)
compressions <- c("gzip", "gz", "bgz", "zip")
areCompressed <- fext %in% compressions
if (areCompressed) {
ext <- file_ext(file_path_sans_ext(path, compression = FALSE))
regex <- paste0("*\\.",ext,"\\.", fext,"$")
} else {
regex <- paste0("*\\.",fext,"$")
}
new <- gsub(pattern = regex, "", path)
out <- c(out, new)
}
return(ifelse(basename, basename(out), out))
}
The above answers are great, but I was interested in which was fastest for dealing with millions of paths at once. It seems that using sub via this SO question is the fastest for getting the filename out of the path. and then comparing three of the methods above, using tools::file_path_sans_ext is faster.
library(fs)
library(stringr)
library(microbenchmark)
files<-paste0("http://some/ppath/to/som/cool/file/",1:1000,".flac")
microbenchmark(
fs::path_ext_remove(sub(".*/", "", files)),
tools::file_path_sans_ext(sub(".*/", "", files)),
str_extract(files, '(?<=[/])([^/]+)(?=\\.[^.]+)')
)
Unit: milliseconds
expr min lq mean median uq max neval
fs::path_ext_remove(sub(".*/", "", files)) 10.6273 10.98940 11.323063 11.20500 11.4992 14.5834 100
tools::file_path_sans_ext(sub(".*/", "", files)) 1.3717 1.44260 1.532092 1.48560 1.5588 2.4806 100
str_extract(files, "(?<=[/])([^/]+)(?=\\\\.[^.]+)") 7.4197 7.62875 7.985206 7.88835 8.2311 9.4107 100
Loading the library needed :
> library(stringr)
Extracting all the matches from the regex:
> str_match("ABCD.csv", "(.*)\\..*$")
[,1] [,2]
[1,] "ABCD.csv" "ABCD"
Returning only the second part of the result, which corresponds to the group matching the file name:
> str_match("ABCD.csv", "(.*)\\..*$")[,2]
[1] "ABCD"
EDIT for #U-10-Forward:
It is basically the same principle as the other answer. Just that I found this solution more robust.
Regex wise it means:
() = group
.* = any single character except the newline character any number of time
// is escape notation, thus //. means literally "."
.* = any characters any number of time again
$ means should be at the end of the input string
The logic is then that it will return the group preceding a "." followed by a group of characters at the end of the string (which equals the file extension in this case).

Resources