Remove numbers at the beginning and end of a string - r

I've got the following vector:
words <- c("5lang","kasverschil2","b2b")
I want to remove "5" in "5lang" and "2" in "kasverschil2". But I do NOT want to remove "2" in "b2b".

gsub("^\\d+|\\d+$", "", words)
#[1] "lang" "kasverschil" "b2b"
Another option would be to use stringi
library(stringi)
stri_replace_all_regex(words, "^\\d+|\\d+$", "")
#[1] "lang" "kasverschil" "b2b"
Using a variant of the data set provided by the OP, here are benchmarks for the three main solutions (note that these strings are very short and contrived; results may differ on a larger, real data set):
words <- rep(c("5lang","kasverschil2","b2b"), 100000)
library(stringi)
library(microbenchmark)
GSUB <- function() gsub("^\\d+|\\d+$", "", words)
STRINGI <- function() stri_replace_all_regex(words, "^\\d+|\\d+$", "")
GREGEXPR <- function() {
    mm <- gregexpr(pattern = '(^[0-9]+|[0-9]+$)', text = words)
    sapply(regmatches(words, mm, invert = TRUE), paste, collapse = "")
}
microbenchmark(
    GSUB(),
    STRINGI(),
    GREGEXPR(),
    times = 100L
)
## Unit: milliseconds
##        expr       min        lq    median        uq       max neval
##      GSUB()  301.0988  349.9952  396.3647  431.6493  632.7568   100
##   STRINGI()  465.9099  513.1570  569.1972  629.4176  738.4414   100
##  GREGEXPR() 5073.1960 5706.8160 6194.1070 6742.1552 7647.8904   100

You can use gsub which uses regular expressions:
gsub("^[0-9]|[0-9]$", "", words)
# [1] "lang" "kasverschil" "b2b"
Explanation:
The pattern ^[0-9] matches a digit at the beginning of the string, while [0-9]$ matches a digit at the end of the string. Separating these two patterns with | matches either of them. The matched pattern is then replaced with an empty string.
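Note that [0-9] without a quantifier matches only a single digit. For a hypothetical input with several leading digits (e.g. "55lang", which is not part of the question), the + quantifier from the first answer is needed:
gsub("^[0-9]|[0-9]$", "", "55lang")
# [1] "5lang"   (only one digit stripped)
gsub("^[0-9]+|[0-9]+$", "", "55lang")
# [1] "lang"    (the + quantifier removes the whole run of digits)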

Get the positions where numbers appear at the beginning or end of a word and match everything else. The results need to be collapsed because of possible multiple matches:
mm <- gregexpr(pattern = '(^[0-9]+|[0-9]+$)', text = words)
sapply(regmatches(words, mm, invert = TRUE), paste, collapse = "")
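To see how this works, here is a minimal sketch (using the same toy vector) of what the inverted matches look like before they are collapsed:
words <- c("5lang", "kasverschil2", "b2b")
mm <- gregexpr(pattern = '(^[0-9]+|[0-9]+$)', text = words)
# regmatches(..., invert = TRUE) returns the pieces *around* the digit matches
regmatches(words, mm, invert = TRUE)
# [[1]]
# [1] ""     "lang"
# [[2]]
# [1] "kasverschil" ""
# [[3]]
# [1] "b2b"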

Related

Using regex to remove the filepath and the trailing _# of a filename in a sub command [duplicate]


How to get rid of the file extensions in the output of list.files in R [duplicate]


fastest way to split strings into fixed-length elements in R

How to split a string into elements of fixed length in R is a commonly asked question, to which typical answers rely either on substring(x) or on strsplit(x, split = "") followed by paste(y, collapse = "").
For instance, one would split the string "azertyuiop" into "aze", "rty", "uio", "p" by specifying a fixed length of 3 characters.
I'm looking for the fastest way possible.
After some testing with long strings (> 1000 chars), I have found that substring() is way too slow. The strategy is hence to split the string into individual characters, and then paste them back into groups of the desired length, by applying some cleverness.
Here is the fastest function I could come up with. The idea is to split the string into individual chars, then have a separator interspersed in the character vector at the right positions, collapse the characters (and separators) back into a string, then split the string again, but this time specifying the separator.
splitInParts <- function(string, size) { # can process a vector of strings; "size" is the length of the desired substrings
    chars <- strsplit(string, "", TRUE)
    lengths <- nchar(string)
    nFullGroups <- floor(lengths / size) # the number of complete substrings of the desired size
    # Prepare a vector of separators (commas), which will be replaced by the characters,
    # except at the positions that separate substring groups of length "size".
    # Assumes that the string doesn't contain any commas.
    seps <- Map(rep, ",", lengths + nFullGroups) # the seps vector is longer than the chars vector, because of the interspersed separators (as many as there are groups)
    indices <- Map(seq, 1, lengths + nFullGroups) # the positions at which separators will be replaced by the characters
    indices <- lapply(indices, function(x) which(x %% (size + 1) != 0)) # exclude the positions at which we want to retain the separators (I haven't found a better way to generate such a vector of indices)
    temp <- function(x, y, z) { # a function describing the replacement, because we call it in the Map() call below
        x[y] <- z
        x
    }
    res <- Map(temp, seps, indices, chars) # now we have a vector of chars with separators interspersed
    res <- sapply(res, paste, collapse = "", USE.NAMES = FALSE) # collapse the characters and separators
    res <- strsplit(res, ",", TRUE) # and at last, we can split the strings into elements of the desired length
}
This looks quite tedious, but I have tried to simply put the chars vector into a matrix with the adequate number of rows, then collapse the matrix columns with apply(mat, 2, paste, collapse=""). This is MUCH slower. And splitting the character vector with split() into a list of vectors of the right length, so as to collapse elements, is even slower.
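For reference, a minimal sketch of that matrix approach (the helper name matrixSplit is mine; it assumes a single string whose length is a multiple of size, so padding would be needed otherwise):
matrixSplit <- function(string, size) {
    chars <- strsplit(string, "", fixed = TRUE)[[1]]
    mat <- matrix(chars, nrow = size) # each column holds one group of `size` characters
    apply(mat, 2, paste, collapse = "")
}
matrixSplit("azertyuiopqr", 3)
# [1] "aze" "rty" "uio" "pqr"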
So if you can find something faster, let me know. If not, well my function may be of some use. :)
Was fun reading the updates, so I benchmarked:
> nchar(mystring)
[1] 260000
My idea was nearly the same as @akrun's, as str_extract_all uses the same function under the hood (IIRC).
library(stringr)
tensiSplit <- function(string, size) {
    str_extract_all(string, paste0('.{1,', size, '}'))
}
And the results on my machine:
> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
                       expr        min         lq       mean     median         uq        max neval
  splitInParts(mystring, 3)   64.80683   64.83033   64.92800   64.85384   64.98858   65.12332     3
    akrunSplit(mystring, 3) 4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983     3
 splitInParts2(mystring, 3)   21.73150   21.73829   21.90200   21.74507   21.98725   22.22942     3
    tensiSplit(mystring, 3)   21.80367   21.85201   21.93754   21.90035   22.00447   22.10859     3
     gsubSplit(mystring, 3)   53.90416   54.28191   54.55416   54.65966   54.87915   55.09865     3
We can split by specifying a regex lookbehind that matches the position preceded by 'size' characters. For example, if we are splitting into groups of 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).
splitInParts <- function(string, size){
    pat <- paste0('(?<=.{', size, '})')
    strsplit(string, pat, perl = TRUE)
}
splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"
splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"
Another approach is to use stri_extract_all_regex from the stringi package.
library(stringi)
splitInParts2 <- function(string, size){
    pat <- paste0('.{1,', size, '}')
    stri_extract_all_regex(string, pat)
}
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
stri_extract_all_regex(str1, '.{1,3}')
data
str1 <- "azertyuiop"
Alright, there was a faster solution published here (d'oh!)
Simply (interpolating the desired size into the pattern, e.g. with sprintf):
strsplit(gsub(sprintf("([[:alnum:]]{%d})", size), "\\1 ", string), " ", TRUE)
Here a space is used as the separator.
(I didn't think of [[:alnum:]]{}.)
How can I mark my own question as a duplicate? :(

Get filename without extension in R

I have a file:
ABCD.csv
The length before the .csv is not fixed and can vary.
How can I extract the portion before the .csv?
There's a built-in file_path_sans_ext in the standard tools package that grabs the file name without the extension.
tools::file_path_sans_ext("ABCD.csv")
## [1] "ABCD"
basename will also remove the path leading to the file. And with this regex, any extension will be removed.
filepath <- "d:/Some Dir/ABCD.csv"
sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(filepath))
# [1] "ABCD"
Or, using file_path_sans_ext as Tyler Rinker suggested:
file_path_sans_ext(basename(filepath))
# [1] "ABCD"
You can use sub or substr
sub('\\.csv$', '', str1)
#[1] "ABCD"
or
substr(str1, 1, nchar(str1)-4)
#[1] "ABCD"
Using the 'filepath' from @JasonV's post:
sub('\\..*$', '', basename(filepath))
#[1] "ABCD"
Or
library(stringr)
str_extract(filepath, '(?<=[/])([^/]+)(?=\\.[^.]+)')
#[1] "ABCD"
data
str1 <- 'ABCD.csv'
fs::path_ext_remove() "removes the last extension and returns the rest of the path".
fs::path_ext_remove(c("ABCD.csv", "foo.bar.baz.txt", "d:/Some Dir/ABCD.csv"))
# Produces: [1] "ABCD" "foo.bar.baz" "D:/Some Dir/ABCD"
You can try this also:
data <- "ABCD.csv"
gsub(pattern = "\\.csv$", "", data)
#[1] "ABCD"
This is also helpful for a list of files, e.g.
data <- list.files(pattern = "\\.csv$"); applying the same gsub call removes the extension from every file name in the list.
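A minimal sketch of that idea (the directory contents here are hypothetical):
data <- list.files(pattern = "\\.csv$") # suppose the directory contains a.csv and b.csv
gsub(pattern = "\\.csv$", "", data)
# [1] "a" "b"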
If you have filenames with multiple (possible) extensions and you want to strip off only the last extension, you can try the following.
For the filename foo.bar.baz.txt, this
sub('\\..[^\\.]*$', '', "foo.bar.baz.txt")
will leave you with foo.bar.baz.
Here is an implementation that works for compression and multiple files:
remove.file_ext <- function(path, basename = FALSE) {
    out <- c()
    for (p in path) {
        fext <- tools::file_ext(p)
        compressions <- c("gzip", "gz", "bgz", "zip")
        isCompressed <- fext %in% compressions
        if (isCompressed) {
            # for compressed files, also strip the extension hidden under the compression suffix
            ext <- tools::file_ext(tools::file_path_sans_ext(p, compression = FALSE))
            regex <- paste0("\\.", ext, "\\.", fext, "$")
        } else {
            regex <- paste0("\\.", fext, "$")
        }
        out <- c(out, gsub(pattern = regex, "", p))
    }
    if (basename) basename(out) else out
}
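A quick usage sketch with hypothetical file names (not from the original post), showing the compression handling:
remove.file_ext(c("d:/Some Dir/ABCD.csv", "variants.vcf.gz"), basename = TRUE)
# [1] "ABCD"     "variants"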
The above answers are great, but I was interested in which was fastest for dealing with millions of paths at once. It seems that using sub, as in this SO question, is the fastest way to get the filename out of the path. Then, comparing three of the methods above, tools::file_path_sans_ext is the fastest.
library(fs)
library(stringr)
library(microbenchmark)
files <- paste0("http://some/ppath/to/som/cool/file/", 1:1000, ".flac")
microbenchmark(
    fs::path_ext_remove(sub(".*/", "", files)),
    tools::file_path_sans_ext(sub(".*/", "", files)),
    str_extract(files, '(?<=[/])([^/]+)(?=\\.[^.]+)')
)
Unit: milliseconds
                                               expr     min       lq      mean   median      uq     max neval
         fs::path_ext_remove(sub(".*/", "", files)) 10.6273 10.98940 11.323063 11.20500 11.4992 14.5834   100
   tools::file_path_sans_ext(sub(".*/", "", files))  1.3717  1.44260  1.532092  1.48560  1.5588  2.4806   100
  str_extract(files, "(?<=[/])([^/]+)(?=\\.[^.]+)")  7.4197  7.62875  7.985206  7.88835  8.2311  9.4107   100
Loading the needed library:
> library(stringr)
Extracting all the matches from the regex:
> str_match("ABCD.csv", "(.*)\\..*$")
[,1] [,2]
[1,] "ABCD.csv" "ABCD"
Returning only the second part of the result, which corresponds to the group matching the file name:
> str_match("ABCD.csv", "(.*)\\..*$")[,2]
[1] "ABCD"
EDIT for @U-10-Forward:
It is basically the same principle as the other answer; I just found this solution more robust.
Regex-wise it means:
() = group
.* = any character except the newline character, any number of times
\\ is escape notation, so \\. matches a literal "."
.* = any characters, any number of times again
$ anchors the match at the end of the input string
The logic is then that the regex captures the group preceding a "." that is followed by a run of characters at the end of the string (the file extension in this case).

Matching multiple patterns

I want to see, if "001" or "100" or "000" occurs in a string of 4 characters of 0 and 1. For example, a 4 character string could be like "1100" or "0010" or "1001" or "1111". How do I match many strings in a string with a single command?
I know grep can be used for pattern matching, but with grep I can check for only one pattern at a time. I want to know whether multiple patterns can be matched in a single call, either with grep itself or with some other command.
Yes, you can. The | in a regex pattern means or, so you can test for all of them by using "001|100|000" as your pattern. At the same time, grep is vectorised, so all of this can be done in one step:
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
grep(pattern, x)
[1] 1 2 3
This returns the indices of the strings that contained the matching pattern (in this case the first three).
Sometimes it is more convenient to have a logical vector that tells you which of the elements in your vector were matched. Then you can use grepl:
grepl(pattern, x)
[1] TRUE TRUE TRUE FALSE
See ?regex for help about regular expressions in R.
Edit:
To avoid creating the pattern manually, we can use paste:
myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|")
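A quick sketch, reusing x from above, to show the assembled pattern in action:
x <- c("1100", "0010", "1001", "1111")
myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|") # "001|100|000"
grepl(pattern, x)
# [1]  TRUE  TRUE  TRUE FALSE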
Here is one solution using the stringr package:
require(stringr)
mylist = c("1100", "0010", "1001", "1111")
str_locate(mylist, "000|001|100")
With command-line grep (outside R), use the -e argument to add additional patterns:
echo '1100' | grep -e '001' -e '110' -e '101'
If you want a logical vector, then you should check the stri_detect function from the stringi package. In your case the pattern is a regex, so use this one:
stri_detect_regex(x, pattern)
## [1] TRUE TRUE TRUE FALSE
And some benchmarks:
require(microbenchmark)
test <- stri_paste(stri_rand_strings(100000, 4, "[0-1]"))
head(test)
## [1] "0001" "1111" "1101" "1101" "1110" "0110"
microbenchmark(stri_detect_regex(test, pattern), grepl(pattern, test))
Unit: milliseconds
                             expr      min       lq     mean   median       uq      max neval
 stri_detect_regex(test, pattern) 29.67405 30.30656 31.61175 30.93748 33.14948 35.90658   100
             grepl(pattern, test) 36.72723 37.71329 40.08595 40.01104 41.57586 48.63421   100
Sorry for making this an additional answer, but it is too many lines for a comment.
I just want to point out that the number of items that can be pasted together via paste(..., collapse = "|") and used as a single matching pattern is limited - see below. Maybe somebody can tell where exactly the limit is? Admittedly the number might not be realistic, but depending on the task to be performed it should not entirely be excluded from our considerations.
For a really large number of items, a loop over chunks of the pattern would be required; a sketch of such a loop follows the code below.
set.seed(0)
samplefun <- function(n, x, collapse){
paste(sample(x, n, replace=TRUE), collapse=collapse)
}
words <- sapply(rpois(10000000, 8) + 1, samplefun, letters, '')
text <- sapply(rpois(1000, 5) + 1, samplefun, words, ' ')
#since execution takes a while, I have commented out the following lines
#result <- grepl(paste(words, collapse = "|"), text)
# Error in grepl(pattern, text) :
# invalid regular expression
# 'wljtpgjqtnw|twiv|jphmer|mcemahvlsjxr|grehqfgldkgfu|
# ...
#result <- stringi::stri_detect_regex(text, paste(words, collapse = "|"))
# Error in stringi::stri_detect_regex(text, paste(words, collapse = "|")) :
# Pattern exceeds limits on size or complexity. (U_REGEX_PATTERN_TOO_BIG)
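One way around that limit (a sketch of the loop idea above, not part of the original answer; detect_any and chunk_size are my own names) is to split the values into chunks, build one pattern per chunk, and OR the logical results together:
detect_any <- function(text, values, chunk_size = 1000) {
    hits <- rep(FALSE, length(text))
    chunks <- split(values, ceiling(seq_along(values) / chunk_size))
    for (chunk in chunks) {
        pattern <- paste(chunk, collapse = "|")
        hits <- hits | grepl(pattern, text)
    }
    hits
}
# result <- detect_any(text, words)  # still slow, but avoids the regex size limit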
You can also use the %like% operator from the data.table package.
library(data.table)
# input
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
# check for pattern
x %like% pattern
# [1]  TRUE  TRUE  TRUE FALSE
