Split a string every 5 characters - r

Suppose I have a long string:
"XOVEWVJIEWNIGOIWENVOIWEWVWEW"
How do I split this to get every 5 characters followed by a space?
"XOVEW VJIEW NIGOI WENVO IWEWV WEW"
Note that the last one is shorter.
I can do a loop where I constantly count and build a new string character by character but surely there must be something better no?

Using regular expressions:
gsub("(.{5})", "\\1 ", "XOVEWVJIEWNIGOIWENVOIWEWVWEW")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
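One caveat with the pattern above: when the string length is an exact multiple of 5, the last group is also followed by a space. Below is a minimal sketch of one way around that, assuming a PCRE lookahead is acceptable. The helper name add_spaces is made up for illustration.

```r
# Hypothetical helper: insert a space after every `n` characters,
# but only when at least one more character follows the group.
add_spaces <- function(x, n = 5) {
  pattern <- sprintf("(.{%d})(?=.)", n)
  gsub(pattern, "\\1 ", x, perl = TRUE)
}

add_spaces("XOVEWVJIEWNIGOIWENVOIWEWVWEW")
add_spaces("ABCDEFGHIJ")  # length is a multiple of 5: no trailing space
```

The lookahead `(?=.)` only checks that another character exists; it does not consume it, so the next group still starts at the right position.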

Using sapply
> string <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
> sapply(seq(from=1, to=nchar(string), by=5), function(i) substr(string, i, i+4))
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"

You can try something like the following:
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW" # Original string
l <- seq(from=5, to=nchar(s), by=5) # Calculate the location where to chop
# Add sentinels 0 (beginning of string) and nchar(s) (end of string)
# and take substrings. (Thanks to @flodel for the condensed expression)
mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))
Output:
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
Now you can paste the resulting vector (with collapse=' ') to obtain a single string with spaces.
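The paste step could look roughly like this. It is a sketch that reuses the variables from the answer; the names chunks and spaced are only illustrative, and the nzchar() filter is an extra assumption to drop the empty last piece that appears when the length is a multiple of 5.

```r
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
l <- seq(from = 5, to = nchar(s), by = 5)

# Same substring extraction as in the answer
chunks <- mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))

# Drop an empty trailing chunk, if any, then join with single spaces
chunks <- chunks[nzchar(chunks)]
spaced <- paste(chunks, collapse = " ")
spaced
```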

No *apply stringi solution:
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
stri_sub(x, seq(1, stri_length(x),by=5), length=5)
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
This extracts substrings just like in @Jilber's answer, but the stri_sub function is vectorized, so we don't need to use *apply here.

You can also use substring without a loop. substring is the vectorized substr
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
n <- seq(1, nc <- nchar(x), by = 5)
paste(substring(x, n, c(n[-1]-1, nc)), collapse = " ")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"

Related

Extract last digit [duplicate]

How can I get the last n characters from a string in R?
Is there a function like SQL's RIGHT?
I'm not aware of anything in base R, but it's straight-forward to make a function to do this using substr and nchar:
x <- "some text in a string"
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
substrRight(x, 6)
[1] "string"
substrRight(x, 8)
[1] "a string"
This is vectorised, as @mdsumner points out. Consider:
x <- c("some text in a string", "I really need to learn how to count")
substrRight(x, 6)
[1] "string" " count"
If you don't mind using the stringr package, str_sub is handy because you can use negatives to count backward:
x <- "some text in a string"
str_sub(x,-6,-1)
[1] "string"
Or, as Max points out in a comment to this answer,
str_sub(x, start= -6)
[1] "string"
Use stri_sub function from stringi package.
To get substring from the end, use negative numbers.
Look below for the examples:
stri_sub("abcde",1,3)
[1] "abc"
stri_sub("abcde",1,1)
[1] "a"
stri_sub("abcde",-3,-1)
[1] "cde"
You can install this package from github: https://github.com/Rexamine/stringi
It is available on CRAN now, simply type
install.packages("stringi")
to install this package.
str = 'This is an example'
n = 7
result = substr(str,(nchar(str)+1)-n,nchar(str))
print(result)
[1] "example"
Another reasonably straightforward way is to use regular expressions and sub:
sub('.*(?=.$)', '', string, perl=T)
So, "get rid of everything followed by one character". To grab more characters off the end, add however many dots in the lookahead assertion:
sub('.*(?=.{2}$)', '', string, perl=T)
where .{2} means .., or "any two characters", so meaning "get rid of everything followed by two characters".
sub('.*(?=.{3}$)', '', string, perl=T)
for three characters, etc. You can set the number of characters to grab with a variable, but you'll have to paste the variable value into the regular expression string:
n = 3
sub(paste('.+(?=.{', n, '})', sep=''), '', string, perl=T)
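As a sketch, the lookahead approach can be wrapped in a small function so the pattern is built in one place. The name last_n_regex is hypothetical, and sprintf() is used here instead of paste() purely as a matter of taste.

```r
# Hypothetical helper: keep only the last `n` characters of each string
# by deleting everything that is followed by exactly `n` characters
# and the end of the string.
last_n_regex <- function(x, n) {
  pattern <- sprintf(".*(?=.{%d}$)", n)
  sub(pattern, "", x, perl = TRUE)
}

last_n_regex("some text in a string", 6)
last_n_regex(c("12345", "ABCDE"), 2)
```

Strings shorter than n characters do not match the pattern and are returned unchanged.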
UPDATE: as noted by mdsumner, the original code is already vectorised because substr is. Should have been more careful.
And if you want a vectorised version (based on Andrie's code)
substrRight <- function(x, n){
sapply(x, function(xx)
substr(xx, (nchar(xx)-n+1), nchar(xx))
)
}
> substrRight(c("12345","ABCDE"),2)
12345 ABCDE
"45" "DE"
Note that I have changed (nchar(x)-n) to (nchar(x)-n+1) to get n characters.
A simple base R solution using the substring() function (who knew this function even existed?):
RIGHT = function(x,n){
substring(x,nchar(x)-n+1)
}
This takes advantage of substring() basically being substr() underneath, but with a default end value of 1,000,000.
Examples:
> RIGHT('Hello World!',2)
[1] "d!"
> RIGHT('Hello World!',8)
[1] "o World!"
Try this:
x <- "some text in a string"
n <- 5
substr(x, nchar(x)-n, nchar(x))
It should give:
[1] "string"
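Note that with start = nchar(x) - n the call returns n + 1 characters, which is why n = 5 yields the six-letter word. A short comparison, as a sketch (the variable names six and five are only for illustration):

```r
x <- "some text in a string"
n <- 5

# start = nchar(x) - n spans n + 1 positions
six <- substr(x, nchar(x) - n, nchar(x))

# start = nchar(x) - n + 1 spans exactly n positions
five <- substr(x, nchar(x) - n + 1, nchar(x))

nchar(c(six, five))
```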
An alternative to substr is to split the string into a list of single characters and process that:
N <- 2
sapply(strsplit(x, ""), function(x, n) paste(tail(x, n), collapse = ""), N)
I use substr too, but in a different way. I want to extract the last 6 characters of "Give me your food." Here are the steps:
(1) Split the characters
splits <- strsplit("Give me your food.", split = "")
(2) Extract the last 6 characters
tail(splits[[1]], n=6)
Output:
[1] " " "f" "o" "o" "d" "."
Each of the characters can be accessed by indexing the result of tail(), e.g. tail(splits[[1]], n=6)[x], where x is 1 to 6.
Someone above uses a similar solution to mine, but I find it easier to think of it as below:
> text<-"some text in a string" # we want to have only the last word "string" with 6 letter
> n<-5 #as the last character will be counted with nchar(), here we discount 1
> substr(x=text,start=nchar(text)-n,stop=nchar(text))
This will return the last characters as desired.
For those coming from Microsoft Excel or Google Sheets, you would have seen functions like LEFT(), RIGHT(), and MID(). I have created a package known as forstringr and its development version is currently on Github.
if(!require("devtools")){
install.packages("devtools")
}
devtools::install_github("gbganalyst/forstringr")
library(forstringr)
str_left(): counts from the left and then extracts n characters
str_right(): counts from the right and then extracts n characters
str_mid(): extracts characters from the middle
Examples:
x <- "some text in a string"
str_left(x, 4)
[1] "some"
str_right(x, 6)
[1] "string"
str_mid(x, 6, 4)
[1] "text"
I used the following code to get the last character of a string.
substr(stringOfInterest, nchar(stringOfInterest), nchar(stringOfInterest))
You can play with the nchar(stringOfInterest) to figure out how to get last few characters.
A small modification of @Andrie's solution also gives the complement:
substrR <- function(x, n) {
if(n > 0) substr(x, (nchar(x)-n+1), nchar(x)) else substr(x, 1, (nchar(x)+n))
}
x <- "moSvmC20F.5.rda"
substrR(x,-4)
[1] "moSvmC20F.5"
That was what I was looking for. The same idea carries over to the left side:
substrL <- function(x, n){
if(n > 0) substr(x, 1, n) else substr(x, -n+1, nchar(x))
}
substrL(substrR(x,-4),-2)
[1] "SvmC20F.5"
Just in case if a range of characters need to be picked:
# For example, to get the date part from the string
substrRightRange <- function(x, m, n){substr(x, nchar(x)-m+1, nchar(x)-m+n)}
value <- "REGNDATE:20170526RN"
substrRightRange(value, 10, 8)
[1] "20170526"

Regex expression to match every nth occurrence of a pattern

Consider this string,
str = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
I'd like to separate the string at every nth occurrence of a pattern, here -:
f(str, n = 2)
[1] "abc-de" "fghi-j" "k-lm" "n-o"...
f(str, n = 3)
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw"...
I know I could do it like this:
spl <- str_split(str, "-")[[1]]
unname(sapply(split(spl, ceiling(seq(spl) / 2)), paste, collapse = "-"))
[1] "abc-de" "fghi-j" "k-lm" "n-o" "p-qrst" "u-vw" "x-yz"
But I'm looking for a shorter and cleaner solution
What are the possibilities?
What about the following (where 'n-1' is a placeholder for a number):
(?:[^-]*(?:-[^-]*){n-1})\K-
(?: - Open 1st non-capture group;
[^-]* - Match 0+ characters other than a hyphen;
(?: - Open a nested 2nd non-capture group;
-[^-]* - Match a hyphen and 0+ characters other than a hyphen;
){n-1} - Close nested non-capture group and match it n-1 times;
) - Close 1st non-capture group;
\K- - Forget what we just matched and match the trailing hyphen.
Note: The use of \K means we must use PCRE (perl=TRUE)
To create the 'n-1' we can use sprintf() functionality to use a variable:
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
for (n in 1:10) {
print(strsplit(str, sprintf("(?:[^-]*(?:-[^-]*){%s})\\K-", n-1), perl=TRUE)[[1]])
}
Prints:
You could use str_extract_all with the pattern \w+(?:-\w+){0,2}, for instance to find terms with 3 words and 2 hyphens:
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
n <- 2
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
n <- 3
regex <- paste0("\\w+(?:-\\w+){0,", n, "}")
str_extract_all(str, regex)[[1]]
[1] "abc-de-fghi-j" "k-lm-n-o" "p-qrst-u-vw" "x-yz"
1) gsubfn gsubfn in the package of the same name is like gsub except that the replacement can be a function, list or proto object. In the case of a proto object one can supply a fun method which has a built in count variable that can be used to distinguish the occurrences. For each match the match is passed to fun and replaced with the output of fun.
We use the input shown in the Note at the end and also n to specify the number of components to use in each element of the result and sep to specify a character that does not appear in the input.
gsubfn replaces every n-th minus with sep and the strsplit splits on that.
No complex regular expressions are needed.
library(gsubfn)
n <- 3
sep <- " "
p <- proto(fun = function(., x) if (count %% n) "-" else sep)
strsplit(gsubfn("-", p, STR), sep)
## [[1]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
##
## [[2]]
## [1] "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
2) rollapply Another approach is to split on every - and then paste it together again using rollapply, giving the same result as in (1).
library(zoo)
roll <- function(x) rollapply(x, n, by = n, paste, collapse = "-",
partial = TRUE, align = "left")
lapply(strsplit(STR, "-"), roll)
Note
# input
STR = "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
STR <- c(STR, STR)
Another approach: first split on every split pattern found, then paste/collapse into groups of length n, using the split pattern variable as the collapse character.
str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
n <- 3
pattern <- "-"
ans <- unlist(strsplit(str, pattern))
sapply(split(ans,
ceiling(seq_along(ans)/n)),
paste0, collapse = pattern)
# "abc-de-fghi" "j-k-lm" "n-o-p" "qrst-u-vw" "x-yz"
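The split-then-regroup idea can be wrapped into a function with the f(str, n) shape the question asks for. This is only a sketch in base R: the name split_every_nth is made up, and it assumes the separator is a literal string rather than a regular expression.

```r
# Hypothetical helper: split `x` at every n-th occurrence of `sep`.
# `sep` is treated as a fixed string (assumption), and `x` is a single string.
split_every_nth <- function(x, n, sep = "-") {
  parts <- strsplit(x, sep, fixed = TRUE)[[1]]
  # Pieces 1..n go to group 1, n+1..2n to group 2, and so on
  groups <- ceiling(seq_along(parts) / n)
  unname(vapply(split(parts, groups), paste, character(1), collapse = sep))
}

str <- "abc-de-fghi-j-k-lm-n-o-p-qrst-u-vw-x-yz"
split_every_nth(str, 2)
split_every_nth(str, 3)
```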

Subset string by counting specific characters

I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off the string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.
You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parentheses and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
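One difference between the variants: sub returns the input unchanged when the pattern does not match, whereas str_extract gives NA. A sketch of a base R wrapper that parameterises the characters and the count and returns NA on no match, assuming that behaviour is wanted (the name cut_after_n is hypothetical, and chars is assumed to contain no regex-special characters):

```r
# Hypothetical helper: keep each string up to the n-th occurrence of any
# character in `chars`; NA when there are fewer than n occurrences.
cut_after_n <- function(x, chars = "AGN", n = 3) {
  pattern <- sprintf("^((?:[^%s]*[%s]){%d}).*$", chars, chars, n)
  matched <- grepl(pattern, x, perl = TRUE)
  ifelse(matched, sub(pattern, "\\1", x, perl = TRUE), NA_character_)
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_after_n(strings, n = 3)
cut_after_n(strings, n = 4)
```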
Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
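A point to watch with which.max: when no position reaches the target count, the comparison is all FALSE and which.max returns 1, so the string would be cut to its first character. A sketch of a guarded base R variant, assuming NA is the desired result in that case (cut_at_count is a made-up name):

```r
# Hypothetical helper: cut each string at the position where the running
# count of `chars` first reaches `n`; NA if it never does.
cut_at_count <- function(x, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(x, ""), function(ch) {
    hits <- which(cumsum(ch %in% chars) == n)
    if (length(hits) == 0) return(NA_character_)
    paste(ch[seq_len(hits[1])], collapse = "")
  }, character(1))
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_at_count(strings, n = 3)
cut_at_count(strings, n = 4)
```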
Identify positions of the pattern using gregexpr, then extract the n-th position (3) and take the substring from 1 to this n-th position using substr.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If there's a string that doesn't have 3 matches it will generate NA, so you just need to use na.omit on the final result.
This is just a version of Maurits Evers' neat solution without strsplit.
sapply(strings,
function(x) {
raw <- rawToChar(charToRaw(x), multiple = TRUE)
idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
paste(raw[1:idx], collapse = "")
})
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
function(x) {
raw <- charToRaw(x)
idx <- which.max(cumsum(raw %in% test) == 3)
rawToChar(raw[1:idx])
})
Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
# Replacing chars in str with "!"
chars = paste0(chars, collapse = "")
replacement = paste0(rep("!", nchar(chars)), collapse = "")
str_alias = chartr(chars, replacement, str)
# Obtain indices with ! for each string
idx = stringr::str_locate_all(pattern = '!', str_alias)
# Reduce each string in str
reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
result = vapply(seq_along(str), reduce, "character")
return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

fastest way to split strings into fixed-length elements in R

How to split a string into elements of fixed length in R is a commonly asked question to which typical answers either rely on substring(x) or strsplit(x, split="") followed by paste(y, collapse = "").
For instance, one would split the string "azertyuiop" into "aze", "rty","uio", "p" by specifying a fixed length of 3 characters.
I'm looking for the fastest way possible.
After some testing with long strings (> 1000 chars), I have found that substring() is way too slow. The strategy is hence to split the string into individual characters, and then paste them back into groups of the desired length, by applying some cleverness.
Here is the fastest function I could come up with. The idea is to split the string into individual chars, then have a separator interspersed in the character vector at the right positions, collapse the characters (and separators) back into a string, then split the string again, but this time specifying the separator.
splitInParts <- function(string, size) { #can process a vector of strings. "size" is the length of desired substrings
chars <- strsplit(string,"",T)
lengths <- nchar(string)
nFullGroups <- floor(lengths/size) #the number of complete substrings of the desired size
#here we prepare a vector of separators (commas), which we will replace by the characters, except at the positions that will have to separate substring groups of length "size". Assumes that the string doesn't contain any commas.
seps <- Map(rep, ",", lengths + nFullGroups) #so the seps vector is longer than the chars vector, because there are separators (as many as there are groups)
indices <- Map(seq, 1, lengths + nFullGroups) #the positions at which separators will be replaced by the characters
indices <- lapply(indices, function(x) which(x %% (size+1) != 0)) #those exclude the positions at which we want to retain the separators (I haven't found a better way to generate such vector of indices)
temp <- function(x,y,z) { #a function describing the replacement, because we call it in the Map() call below
x[y] <- z
x
}
res <- Map(temp, seps, indices, chars) #so now we have a vector of chars with separators interspersed
res <- sapply(res, paste, collapse="", USE.NAMES=F) #collapses the characters and separators
res <- strsplit(res, ",", T) #and at last, we can split the strings into elements of the desired length
}
This looks quite tedious, but I have tried to simply put the chars vector into a matrix with the adequate number of rows, then collapse the matrix columns with apply(mat, 2, paste, collapse=""). This is MUCH slower. And splitting the character vector with split() into a list of vectors of the right length, so as to collapse elements, is even slower.
So if you can find something faster, let me know. If not, well my function may be of some use. :)
Was fun reading the updates, so I benchmarked:
> nchar(mystring)
[1] 260000
My idea was nearly the same as @akrun's (str_extract_all uses the same function under the hood, IIRC):
library(stringr)
tensiSplit <- function(string,size) {
str_extract_all(string, paste0('.{1,',size,'}'))
}
And the results on my machine:
> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
                       expr        min         lq       mean     median         uq        max neval
  splitInParts(mystring, 3)   64.80683   64.83033   64.92800   64.85384   64.98858   65.12332     3
    akrunSplit(mystring, 3) 4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983     3
 splitInParts2(mystring, 3)   21.73150   21.73829   21.90200   21.74507   21.98725   22.22942     3
    tensiSplit(mystring, 3)   21.80367   21.85201   21.93754   21.90035   22.00447   22.10859     3
     gsubSplit(mystring, 3)   53.90416   54.28191   54.55416   54.65966   54.87915   55.09865     3
We can split by specifying a regex lookbehind to match the position preceded by 'n' characters. For example, if we are splitting by 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).
splitInParts <- function(string, size){
pat <- paste0('(?<=.{',size,'})')
strsplit(string, pat, perl=TRUE)
}
splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"
splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"
Another approach is to use stri_extract_all_regex from library(stringi).
library(stringi)
splitInParts2 <- function(string, size){
pat <- paste0('.{1,', size, '}')
stri_extract_all_regex(string, pat)
}
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
stri_extract_all_regex(str1, '.{1,3}')
data
str1 <- "azertyuiop"
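For readers without stringi or stringr, the same `.{1,size}` extraction idea can be sketched in base R with gregexpr and regmatches. The name split_fixed is hypothetical, and no claim is made here about how its speed compares with the functions benchmarked above.

```r
# Hypothetical helper: extract consecutive runs of up to `size` characters.
# Returns a list with one character vector per input string.
split_fixed <- function(x, size) {
  pattern <- sprintf(".{1,%d}", size)
  regmatches(x, gregexpr(pattern, x))
}

str1 <- "azertyuiop"
split_fixed(str1, 3)[[1]]
split_fixed(str1, 4)[[1]]
```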
Alright, there was a faster solution published here (d'oh!)
Simply
strsplit(gsub(paste0("([[:alnum:]]{", size, "})"), "\\1 ", string), " ", fixed=TRUE)
Here using a space as separator.
(didn't think about [[:alnum:]]{}).
How can I mark my own question as a duplicate? :(

Use paste to combine letters instead of loops. R

I'm a newbie to R, but I'm trying to make a sliding window in R.
Using loops I can do it like this, but this gets very inefficient.
results=c(1:7)
letters=c("A","B","C","D","E","F","G","H","I","J")
for(i in 1:7){
results[i]=paste(letters[i:(i+3)],collapse="")
}
How can I use an apply function to get the same output?
A little different to Ramnath's answer:
lets <- LETTERS[1:10]
substring(paste(lets,collapse=""),1:7,4:10)
#[1] "ABCD" "BCDE" "CDEF" "DEFG" "EFGH" "FGHI" "GHIJ"
Here is one way to do this
sapply(1:7, function(i) {
paste(letters[i:(i+3)], collapse = '')
})
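The sapply version hard-codes 7 windows of width 4. A sketch of a generalised helper that derives the number of windows from the input, so an index never runs past the end of the vector (sliding_paste is a made-up name, and vapply is used in place of sapply so the return type is fixed):

```r
# Hypothetical helper: paste together every run of `width` consecutive
# elements of `x`; returns character(0) if `x` is shorter than `width`.
sliding_paste <- function(x, width) {
  n_windows <- length(x) - width + 1
  if (n_windows < 1) return(character(0))
  vapply(seq_len(n_windows),
         function(i) paste(x[i:(i + width - 1)], collapse = ""),
         character(1))
}

sliding_paste(LETTERS[1:10], 4)
```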
With the zoo time series package:
apply(rollapply(letters,4,c), 1, paste, collapse="")
[1] "ABCD" "BCDE" "CDEF" "DEFG" "EFGH" "FGHI" "GHIJ"
A "roll your own" way just for fun:
## n letters
nl <- 10
## length of string
len <- 4
## note I use the inbuilt LETTERS
apply(matrix(LETTERS[seq_len(nl)], nl + 1, len), 1, paste, collapse = "")[seq_len(nl - len + 1)]
(Leaves you with a warning based on incomplete recycling, but I like the trick of using a matrix to provide the offset for rolling windows).

Resources