I have to split a string by the delimiter "-", and take out the part on the far right.
SKU <- c("PPM-UA-L", "RVK-JI-XL", "KMN-WO-XS", "YYL-S")
However, in the code below, [, 3] will not work for all cases, as some SKUs have only one "-". For example, the last value "YYL-S" returns nothing (an empty string).
size <- str_split(SKU, "-", simplify = T)[ ,3]
I also tried the following to index from the back, but got an error message. I also tried [, -1], but a negative index in R does not mean counting backwards (it drops that column instead).
size <- str_split(SKU, "-", simplify = T)(rev[ ,3])
Vectorised string operations are faster than creating and destroying objects in memory (see the benchmarks below).
Solutions which create lists of vectors that you do not need tend to be relatively slow. You can use regular expressions here to replace everything up to and including the final -.
sub(pattern = "^.+-", replacement = "", SKU)
# [1] "L" "XL" "XS" "S"
The caret (^) is a regex metacharacter which matches the beginning of the string. The . matches any character except a newline. The + means "match the preceding character one or more times". The .+ combination is greedy, meaning it will find the longest possible match. All together this means: match from the beginning of the string up to and including the final -.
The sub() function replaces the first occurrence of the pattern in x (which in this case is SKU) with the replacement (which in this case is a blank string).
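As a quick illustration of the greediness (a small example, not part of the original answer), compare the greedy .+ with its lazy counterpart .+?:
sub("^.+-", "", "PPM-UA-L")   # greedy: removes up to and including the final "-"
# [1] "L"
sub("^.+?-", "", "PPM-UA-L")  # lazy: removes only up to the first "-"
# [1] "UA-L"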
You can read more here about the syntax used in regular expressions.
Benchmarking
I benchmarked seven approaches:
Base R sub().
Base R strsplit() |> sapply().
Base R strsplit() |> vapply().
stringr::str_split_i().
stringr::str_split() |> vapply(\(x) tail(x, 1), character(1)).
Base R lookbehind: regmatches() with gregexpr().
stringr::str_extract() with a lookbehind.
I repeated the vector from 10 to 1e5 times. sub() is consistently the fastest approach with the least garbage collection (gc), i.e. fewest memory allocations.
There is not much difference between base::strsplit() and stringr::str_split(). sapply() does not appear different from vapply(). stringr::str_split_i() is faster than the other approaches which split the vector, and has less garbage collection, but is not as fast as sub().
stringr::str_extract() with a lookbehind is almost as fast as sub(). Using the same pattern in base R with regmatches(gregexpr()) is much slower (presumably because it returns a list).
Code to generate the plot
# Setup for the benchmark. Note: rep_nums is not shown in the original post;
# 10^(1:5) is assumed here from "repeated the vector from 10 to 1e5 times".
library(stringr)
rep_nums <- 10^(1:5)

results <- bench::press(
  rep_num = rep_nums,
  {
    x <- rep(SKU, rep_num)
    bench::mark(
      min_iterations = 10,
      sub = {
        sub("^.+-", "", x)
      },
      strsplit_base_sapply = {
        strsplit(x, "-") |>
          sapply(tail, 1)
      },
      strsplit_base_vapply = {
        strsplit(x, "-") |>
          vapply(\(x) tail(x, 1), character(1))
      },
      str_split_i = {
        str_split_i(x, "-", -1)
      },
      str_split_vapply = {
        str_split(x, "-") |>
          vapply(\(x) tail(x, 1), character(1))
      },
      base_r_lookbehind = {
        regmatches(
          x,
          gregexpr("(?<=-)[^-]+$", x, perl = TRUE)
        ) |> unlist()
      },
      stringr_lookbehind = {
        str_extract(x, "(?<=-)[^-]+$")
      }
    )
  }
)
library(ggplot2)
autoplot(results) +
  theme_bw() +
  facet_wrap(vars(rep_num), scales = "free_x")
You can use str_split_i with i = -1 to get the last part:
library(stringr) #1.5.0
str_split_i(SKU, "-", -1)
# [1] "L" "XL" "XS" "S"
Why not use str_extract with a lookbehind (?<=-), a negated character class [^-]+ disallowing the - character and, finally, the end-of-string anchor $:
library(stringr)
str_extract(SKU, "(?<=-)[^-]+$")
[1] "L" "XL" "XS" "S"
To simplify (and perhaps to speed up) things, we can drop the lookbehind entirely and rely solely on the negated character class in combination with the string-end anchor $:
str_extract(SKU, "[^-]+$")
[1] "L" "XL" "XS" "S"
Here, then, str_extract extracts the substring that does not include a - and that ends where the whole string ends.
Related
I need to remove the text before the leading period (as well as the leading period) and the text following the last period from a string.
Given this string for example:
"ABCD.EF.GH.IJKL.MN"
I'd like to get the output:
[1] "IJKL"
I have tried the following:
split_string <- sub("^.*?\\.","", string)
split_string <- sub("^\\.+|\\.[^.]*$", "", string)
I believe I have the removal of the final period and the text after it working for the output I want. However, the first line needs to be executed multiple times to remove all of the text before the period in question, e.g. '.I'.
One option in base R is to capture as a group ((...)) the word that is followed by a dot (\\.) and another word (\\w+) at the end ($) of the string. In the replacement, use the backreference (\\1) to the captured word.
sub(".*\\.(\\w+)\\.\\w+$", "\\1", str1)
#[1] "IJKL"
Here, we match characters (.*) until the . (\\., escaped to get the literal value, because . is a metacharacter that would otherwise match any character), followed by the captured word ((\\w+)), followed by a dot and another word at the end ($) of the string. The replacement is the backreference mentioned above.
Or another option is regmatches/regexpr from base R
regmatches(str1, regexpr("\\w+(?=\\.\\w+$)", str1, perl = TRUE))
#[1] "IJKL"
Or another option is word from stringr
library(stringr)
word(str1, -2, sep="[.]")
#[1] "IJKL"
data
str1 <- "ABCD.EF.GH.IJKL.MN"
Here is a janky dplyr/tidyr version in case the other values are of importance and you want to select them later on; just include them in the select().
library(dplyr)
library(tidyr)

df <- data.frame(x = c("ABCD.EF.GH.IJKL.MN"))
df2 <- df %>%
  separate(x, into = c("var1", "var2", "var3", "var4", "var5")) %>%
  select("var4")
Split into groups at period and take the second one from last.
sapply(strsplit(str1, "\\."), function(x) x[length(x) - 1])
#[1] "IJKL"
Get indices of the periods and use substr to extract the relevant portion
sapply(str1, function(x){
  ind = gregexpr("\\.", x)[[1]]
  substr(x, ind[length(ind) - 1] + 1, ind[length(ind)] - 1)
}, USE.NAMES = FALSE)
#[1] "IJKL"
These alternatives all use no packages or regular expressions.
1) basename/dirname. Assuming the test input s shown in the Note at the end, convert the dots to slashes and then use dirname and basename.
basename(dirname(chartr(".", "/", s)))
## [1] "IJKL" "IJKL"
2) strsplit. Using strsplit, split the strings at the dots, creating a list of character vectors (one vector per input string); then for each such vector take the last 2 elements using tail and the first of those using indexing.
sapply(strsplit(s, ".", fixed = TRUE), function(x) tail(x, 2)[1])
## [1] "IJKL" "IJKL"
3) read.table. It is not clear from the question what the general case is, but if all the components of s have the same number of dot-separated fields then we can use read.table to create a data.frame with one row per input string and one column per dot-separated component. Then take the column just before the last.
dd <- read.table(text = s, sep = ".", as.is = TRUE)
dd[[ncol(dd)-1]]
## [1] "IJKL" "IJKL"
4) substr. Again, the general case is not clear, but if the string of interest is always at character positions 12-15 then a simple solution is:
substr(s, 12, 15)
## [1] "IJKL" "IJKL"
Note
s <- c("ABCD.EF.GH.IJKL.MN", "ABCD.EF.GH.IJKL.MN")
Say I have a file of characters that I would like to split at a character and then select the left side of the split to a new field. Is there a way to do this one step? For example:
x <- strsplit('dark_room',"_")
x[[1]][2]
[1] "room"
This is a two-step operation to access "room". How can I subset "room" in one operation?
An easier option is to use word from stringr and specify the sep and other parameters
library(stringr)
word('dark_room', -1, sep="_")
#[1] "room"
Or with str_extract to match characters other than _ ([^_]+) at the end ($) of the string
str_extract('dark_room', "[^_]+$")
#[1] "room"
Or with stri_extract from stringi where we can specify the first and last
library(stringi)
stri_extract_first('dark_room', regex = '[^_]+')
#[1] "dark"
stri_extract_last('dark_room', regex = '[^_]+')
#[1] "room"
strsplit always returns a list. So a general approach (applicable when there is more than one element) would be to loop through the list with sapply and subset with either tail
sapply( strsplit('dark_room',"_"), tail, 1)
#[1] "room"
or head to get the first n elements
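For instance (a small example not in the original answer), head with n = 1 returns the part before the first delimiter:
sapply(strsplit('dark_room', "_"), head, 1)
#[1] "dark"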
Or using scan
tail(scan(text = 'dark_room', what = "", sep="_"), 1)
Here's a way using sub from base R. The greedy .*_ removes everything up to and including the last underscore (here there is only one) -
# for the right side of '_'
sub(".*_", "", x)
[1] "room"
# for the left side of '_'
sub("(.*)_.*", "\\1", x)
[1] "dark"
Define your own functions (using the solution in your question or any of the other solutions from the answers)
left = function(x, sep){
  substring(x, 1, regexpr(sep, x) - 1)
}
right = function(x, sep){
  substring(x, regexpr(sep, x) + 1, nchar(x))
}
s = c("dark_room")
left(s, "_")
#[1] "dark"
right(s, "_")
#[1] "room"
I am trying to get the host of an IP address from a list of strings.
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
I want to get the first two parts of each ip. Desired output:
ips <- c('140.112', '132.212', '31.2', '7.112')
This is the code that I wrote to convert them:
cat(unlist(strsplit(ips, "\\.", fixed = FALSE))[1:2], sep = ".")
When I check the type of individual ips in the end I get something like this:
140.112 NULL
Not sure what I am doing wrong. If you have some other ideas completely different from this that is completely fine too.
With sub:
ips <- c('140.112.204.42', '132.212.14.139', '31.2.47.93', '7.112.221.238')
sub('\\.\\d+\\.\\d+$', '', ips)
# [1] "140.112" "132.212" "31.2" "7.112"
With str_extract from stringr:
library(stringr)
str_extract(ips, '^\\d+\\.\\d+')
# [1] "140.112" "132.212" "31.2" "7.112"
With strsplit + sapply:
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.'))
# [1] "140.112" "132.212" "31.2" "7.112"
With read.table + apply:
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.')
#[1] "140.112" "132.212" "31.2" "7.112"
Notes:
sub('\\.\\d+\\.\\d+$', '', ips):
i. \\.\\d+\\.\\d+$ matches a literal dot, a digit one or more times, a literal dot again, and a digit one or more times at the end of the string
ii. sub removes the above match from the string
str_extract(ips, '^\\d+\\.\\d+'):
i. ^\\d+\\.\\d+ matches a digit one or more times, a literal dot and a digit one or more times in the beginning of the string
ii. str_extract extracts the above match from the string
sapply(strsplit(ips, '\\.'), function(x) paste(x[1:2], collapse = '.')):
i. strsplit(ips, '\\.') splits each ip using a literal dot as the delimiter. This returns a list of vectors after the split
ii. With sapply, paste(x[1:2], collapse = '.') is applied to every element of the list, thus taking only the first two numbers from each vector, and collapsing them with a dot as the separator. sapply then coerces the list to a vector, thus returning a vector of the desired ips.
apply(read.table(textConnection(ips), sep='.')[1:2], 1, paste, collapse = '.'):
i. read.table(textConnection(ips), sep='.')[1:2] treats ips as text input and reads it in with dot as a delimiter. Only taking the first two columns.
ii. apply enables paste to be operated on each row, and collapses with a dot.
Could you please try the following:
gsub("([0-9]+.[0-9]+)(.*)","\\1",ips)
Explanation: With gsub we capture digits, then a dot, then digits into the first capture group, and keep everything after it (.*) in the second group. The whole match is then substituted with \\1, the first group's value, which is the first two fields. (Note that the dot in the pattern is not escaped and so matches any character; escaping it as \\. would be stricter.)
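For comparison (not part of the original answer), the same idea with the dot escaped so that it only matches a literal dot:
gsub("([0-9]+\\.[0-9]+).*", "\\1", ips)
# [1] "140.112" "132.212" "31.2" "7.112"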
One solution is the following:
vapply(strsplit(ips, ".", fixed = TRUE),
       function(x) paste(x[1:2], collapse = "."),
       character(1L))
vapply applies function(x) to each element of the output of strsplit
strsplit produces a list where each element is the components of one IP address split at "."; setting fixed = TRUE requests splitting on the exact value of the splitting string (i.e., "."), not treating it as a regex
function(x) takes the first two elements (x[1:2]) of each item coming out of strsplit and pastes them together, separated by "."
character(1L) tells vapply that each element of the output (i.e., the value returned from function(x)) should be a character string of length 1.
Edit: #useR posted this solution right before me (using sapply).
substr is vectorised on the stop argument, so you can use this with a vector of positions before the second dot. regexpr gives the position of the first match, so if you sub out the first dot you can match on the second, which will conveniently be one before its true position, as needed (since you removed the first one).
substr(ips,1,regexpr("\\.",sub("\\.","",ips)))
[1] "140.112" "132.212" "31.2" "7.112"
We can convert the ip addresses to numeric_version class and then format using this base R one-liner that employs no regular expressions:
format(numeric_version(ips)[, 1:2])
[1] "140.112" "132.212" "31.2" "7.112"
How to split a string into elements of fixed length in R is a commonly asked question; typical answers rely either on substring(x) or on strsplit(x, sep="") followed by paste(y, collapse = "").
For instance, one would split the string "azertyuiop" into "aze", "rty", "uio", "p" by specifying a fixed length of 3 characters.
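For reference, the substring() approach mentioned above can be sketched like this (an illustration, not from the original post; starts is just the sequence of group start positions):
x <- "azertyuiop"
size <- 3
starts <- seq(1, nchar(x), by = size)
substring(x, starts, starts + size - 1)
# [1] "aze" "rty" "uio" "p"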
I'm looking for the fastest way possible.
After some testing with long strings (> 1000 chars), I have found that substring() is way too slow. The strategy is hence to split the string into individual characters, and then paste them back into groups of the desired length, by applying some cleverness.
Here is the fastest function I could come up with. The idea is to split the string into individual chars, then have a separator interspersed in the character vector at the right positions, collapse the characters (and separators) back into a string, then split the string again, but this time specifying the separator.
splitInParts <- function(string, size) { # can process a vector of strings; "size" is the desired substring length
  chars <- strsplit(string, "", TRUE)
  lengths <- nchar(string)
  nFullGroups <- floor(lengths / size) # the number of complete substrings of the desired size
  # Here we prepare a vector of separators (commas), which we will replace by the characters,
  # except at the positions that have to separate substring groups of length "size".
  # Assumes that the string doesn't contain any commas.
  seps <- Map(rep, ",", lengths + nFullGroups) # the seps vector is longer than the chars vector (as many extra positions as there are full groups)
  indices <- Map(seq, 1, lengths + nFullGroups) # the positions at which separators will be replaced by the characters
  indices <- lapply(indices, function(x) which(x %% (size + 1) != 0)) # exclude the positions at which we want to retain the separators (I haven't found a better way to generate such a vector of indices)
  temp <- function(x, y, z) { # a function performing the replacement, because we call it in the Map() call below
    x[y] <- z
    x
  }
  res <- Map(temp, seps, indices, chars) # now we have vectors of chars with separators interspersed
  res <- sapply(res, paste, collapse = "", USE.NAMES = FALSE) # collapse the characters and separators
  strsplit(res, ",", TRUE) # and at last, split the strings into elements of the desired length
}
This looks quite tedious, but I also tried simply putting the chars vector into a matrix with the adequate number of rows, then collapsing the matrix columns with apply(mat, 2, paste, collapse="") (a sketch of that approach is shown below). This is MUCH slower. And splitting the character vector with split() into a list of vectors of the right length, so as to collapse their elements, is even slower.
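For illustration, a sketch of that matrix approach (reconstructed from the description above, not the original code; the NA padding is an assumption to make the matrix fill cleanly):
x <- "azertyuiop"
size <- 3
chars <- strsplit(x, "", fixed = TRUE)[[1]]
length(chars) <- size * ceiling(length(chars) / size)   # pad with NA to a multiple of size
mat <- matrix(chars, nrow = size)                       # one column per group
apply(mat, 2, function(col) paste(na.omit(col), collapse = ""))
# [1] "aze" "rty" "uio" "p"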
So if you can find something faster, let me know. If not, well my function may be of some use. :)
Was fun reading the updates, so I benchmarked:
> nchar(mystring)
[1] 260000
My idea was nearly the same as @akrun's, as str_extract_all uses the same function under the hood (IIRC).
library(stringr)
tensiSplit <- function(string, size) {
  str_extract_all(string, paste0('.{1,', size, '}'))
}
And the results on my machine:
> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
                       expr        min         lq       mean     median         uq        max neval
  splitInParts(mystring, 3)   64.80683   64.83033   64.92800   64.85384   64.98858   65.12332     3
    akrunSplit(mystring, 3) 4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983     3
 splitInParts2(mystring, 3)   21.73150   21.73829   21.90200   21.74507   21.98725   22.22942     3
    tensiSplit(mystring, 3)   21.80367   21.85201   21.93754   21.90035   22.00447   22.10859     3
     gsubSplit(mystring, 3)   53.90416   54.28191   54.55416   54.65966   54.87915   55.09865     3
We can split by specifying a regex lookbehind to match the position preceded by 'n' characters. For example, if we are splitting every 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).
splitInParts <- function(string, size){
  pat <- paste0('(?<=.{', size, '})')
  strsplit(string, pat, perl = TRUE)
}
splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"
splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"
Or another approach is using stri_extract_all from library(stringi).
library(stringi)
splitInParts2 <- function(string, size){
  pat <- paste0('.{1,', size, '}')
  stri_extract_all_regex(string, pat)
}
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
stri_extract_all_regex(str1, '.{1,3}')
data
str1 <- "azertyuiop"
Alright, there was a faster solution published here (d'oh!)
Simply
strsplit(gsub(paste0("([[:alnum:]]{", size, "})"), "\\1 ", string), " ", TRUE)
Here using a space as separator.
(I didn't think of [[:alnum:]]{}.)
How can I mark my own question as a duplicate? :(
Suppose a vector:
xx.1 <- c("zz_ZZ_uu_d", "II_OO_d")
I want to get a new vector split at the rightmost delimiter, splitting only once. The expected result would be:
c("zz_ZZ_uu", "d", "II_OO", "d").
It would be like Python's rsplit() function. My current idea is to reverse the string, and then split it with str_split() from stringr.
Any better solutions?
update
Here is my solution returning n splits, depending on stringr, stringi and purrr. It would be nice if someone provided a version with base functions.
rsplit <- function (x, s, n) {
  cc1 <- unlist(stringr::str_split(stringi::stri_reverse(x), s, n))
  cc2 <- rev(purrr::map_chr(cc1, stringi::stri_reverse))
  return(cc2)
}
Negative lookahead:
unlist(strsplit(xx.1, "_(?!.*_)", perl = TRUE))
# [1] "zz_ZZ_uu" "d" "II_OO" "d"
Here a(?!b) says to find an a which is not followed by b. In this case .*_ means that, no matter how far ahead we look (.*), there should not be any more _'s.
However, it is not that easy to generalise this idea. First, note that it can be rewritten as a positive lookahead, _(?=[^_]*$) (find a _ followed by anything but _; here $ signifies the end of the string). Then a not very elegant generalisation would be
rsplit <- function(x, s, n) {
  p <- paste0("[^", s, "]*")
  rx <- paste0(s, "(?=", paste(rep(paste0(p, s), n - 1), collapse = ""), p, "$)")
  unlist(strsplit(x, rx, perl = TRUE))
}
rsplit(vec, "_", 1)
# [1] "a_b_c_d_e_f" "g" "a" "b"
rsplit(vec, "_", 3)
# [1] "a_b_c_d" "e_f_g" "a_b"
where e.g. in case n=3 this function uses _(?=[^_]*_[^_]*_[^_]*$).
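As a quick check (not in the original answer), the positive-lookahead form mentioned above gives the same single split on xx.1:
unlist(strsplit(xx.1, "_(?=[^_]*$)", perl = TRUE))
# [1] "zz_ZZ_uu" "d" "II_OO" "d"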
Another two. In both I use "(.*)_(.*)" as the pattern to capture both parts of the string. Remember that * is greedy so the first (.*) will match as many characters as it can.
Here I use regexec to capture where your substrings start and end, and regmatches to reconstruct them:
unlist(lapply(regmatches(xx.1, regexec("(.*)_(.*)", xx.1)),
              tail, -1))
And this one is a little less academic but easy to understand:
unlist(strsplit(sub("(.*)_(.*)", "\\1###\\2", xx.1), "###"))
What about just pasting it back together after it's split?
rsplit <- function(x, s) {
  spl <- strsplit(x, s, fixed = TRUE)[[1]]
  res <- paste(spl[-length(spl)], collapse = s)
  c(res, spl[length(spl)])
}
> rsplit("zz_ZZ_uu_d", "_")
[1] "zz_ZZ_uu" "d"
I also thought about a very similar approach to that of Ari
> res <- lapply(strsplit(xx.1, "_"), function(x) {
+   c(paste0(x[-length(x)], collapse = "_"), x[length(x)])
+ })
> unlist(res)
[1] "zz_ZZ_uu" "d" "II_OO" "d"
This gives exactly what you want and is the simplest approach:
require(stringr)
as.vector(t(str_match(xx.1, '(.*)_(.*)')[, -1]))
[1] "zz_ZZ_uu" "d" "II_OO" "d"
Explanation:
str_split() is not the droid you're looking for, because it only splits left to right, and splitting and then re-pasting all the (n-1) leftmost pieces is a waste of time. So use str_match() with a regex containing two capture groups. Note the first (.*)_ will greedily match everything up to the last occurrence of _, which is what you want. (This will fail and return NAs if there isn't at least one _.)
str_match() returns a matrix where the first column is the entire string, and subsequent columns are individual capture groups. We don't want the first column, so drop it with [,-1]
as.vector() will unroll that matrix column-wise, which is not what you want, so we use t() to transpose it to unroll row-wise
str_match(string, pattern) is vectorized over both string and pattern, which is neat
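For example (an illustration, not part of the original answer), the pattern can vary along with the string:
str_match(c("a_b", "c-d"), c("(.*)_(.*)", "(.*)-(.*)"))
     [,1]  [,2] [,3]
[1,] "a_b" "a"  "b"
[2,] "c-d" "c"  "d"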