Splitting a string into fixed-size chunks - r

This seems like a really simple task, but I can't find a good solution in base R. I have a character string with 2N characters. How do I split this into a character vector of length N, with each element being a 2-character string?
I could use something like substr with Vectorize:
vss <- Vectorize(substr, c("start", "stop"))
ch <- paste(rep("a", 1e6), collapse="")
vss(ch, seq(1, nchar(ch), by=2), seq(2, nchar(ch), by=2))
but this is really slow for long strings (O(N^2) I believe).

If you want speed, Rcpp is always a good choice:
library(Rcpp);
cppFunction('
List strsplitN(std::vector<std::string> v, int N ) {
if (N < 1) throw std::invalid_argument("N must be >= 1.");
List res(v.size());
for (int i = 0; i < v.size(); ++i) {
int num = v[i].size()/N + (v[i].size()%N == 0 ? 0 : 1);
std::vector<std::string> resCur(num,std::string(N,0));
for (int j = 0; j < num; ++j) resCur[j].assign(v[i].substr(j*N,N));
res[i] = resCur;
}
return res;
}
');
ch <- paste(rep('a',1e6),collapse='');
system.time({ res <- strsplitN(ch,2L); });
## user system elapsed
## 0.109 0.015 0.121
head(res[[1L]]); tail(res[[1L]]);
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
length(res[[1L]]);
## [1] 500000
Useful reference: http://gallery.rcpp.org/articles/strings_with_rcpp/.
More demos:
strsplitN(c('abcd','efgh'),2L);
## [[1]]
## [1] "ab" "cd"
##
## [[2]]
## [1] "ef" "gh"
##
strsplitN(c('abcd','efgh'),3L);
## [[1]]
## [1] "abc" "d"
##
## [[2]]
## [1] "efg" "h"
##
strsplitN(c('abcd','efgh'),1L);
## [[1]]
## [1] "a" "b" "c" "d"
##
## [[2]]
## [1] "e" "f" "g" "h"
##
strsplitN(c('abcd','efgh'),5L);
## [[1]]
## [1] "abcd"
##
## [[2]]
## [1] "efgh"
##
strsplitN(character(),5L);
## list()
strsplitN(c('abcd','efgh'),0L);
## Error: N must be >= 1.
There are two important caveats with the above implementation:
1: It doesn't handle NAs correctly. Rcpp seems to stringify to 'NA' when it's forced to come up with a std::string. You can easily solve this in Rland with a wrapper that replaces the offending list components with a true NA.
x <- c('a',NA); strsplitN(x,1L);
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "N" "A"
##
x <- c('a',NA); ifelse(is.na(x),NA,strsplitN(x,1L));
## [[1]]
## [1] "a"
##
## [[2]]
## [1] NA
##
2: It doesn't handle multibyte characters correctly. This is a tougher problem, and would require a rewrite of the core function implementation to use a Unicode-aware traversal. Fixing this problem would also incur a significant performance penalty, since you wouldn't be able to preallocate each vector in one shot prior to the assignment loop.
strsplitN('aΩ',1L);
## [[1]]
## [1] "a" "\xce" "\xa9"
##
strsplit('aΩ','');
## [[1]]
## [1] "a" "Ω"
##
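Both caveats can be avoided in base R, at the cost of speed, because nchar() and substring() count characters rather than bytes. The following is only a sketch under that assumption; chunk_chars is a hypothetical name, not part of the answer above, and it will likely be much slower than the Rcpp version on long strings.

```r
# Hypothetical helper (not from the answer above): split each string into
# chunks of n characters, counting characters rather than bytes.
chunk_chars <- function(x, n) {
  stopifnot(is.character(x), length(n) == 1L, n >= 1L)
  lapply(x, function(s) {
    if (is.na(s)) return(NA_character_)       # keep a true NA
    len <- nchar(s, type = "chars")
    if (len == 0L) return(character(0))       # nothing to split
    starts <- seq(1L, len, by = n)
    # the last chunk is allowed to be shorter than n
    substring(s, starts, pmin(starts + n - 1L, len))
  })
}

chunk_chars(c("abcd", "a\u03a9", NA), 1L)
```

The result is a list with one element per input string, in the same shape that strsplitN returns.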

Related

How to vectorize the output created by a while loop in R?

I would like to get the results from a while loop as a vector. My code looks like this and nld is just some numeric data and lk represents yearly rate of a country:
i<-1
while (i<=length(nld)) {
lk<-((nld[i+1]-nld[i])/nld[i])*100
i <- i+1
print(lk) }
But the output looks like this:
[1] 2.34391
[1] 4.421947
[1] 0.6444809
[1] 11.29308
[1] 4.282817
[1] 1.773046
[1] 5.443044
[1] 6.332272
[1] 9.207917
[1] 6.173719
[1] 5.449088
[1] 3.977678
[1] 7.697896
[1] 6.313985
[1] 1.449447
[1] 5.149968
[1] 1.840442
[1] 2.628424
[1] 2.269874
[1] 4.195588
[1] -2.868499
[1] -2.764851
[1] 0.216549
[1] 1.907869
[1] -2.13202
[1] 4.637701
[1] 1.051423
[1] 3.946669
[1] 4.332345
[1] 6.260946
[1] 3.113528
[1] 1.537622
[1] 3.075729
[1] 2.925915
[1] 5.146445
[1] 6.129935
[1] 5.185049
[1] 3.45909
[1] 7.835161
[1] 9.649116
[1] 1.311721
[1] 0.3325002
... etc.
and I can't collect these results from the loop in order to plot them. I would appreciate it if someone could enlighten me.
Thanks in advance.
i <- 1
result <- c()
while (i<=length(nld)) {
lk<-((nld[i+1]-nld[i])/nld[i])*100
i <- i+1
result <- c(result, lk) } # this collects `lk` in the vector `result`.
But what you are doing is very C-ish (or C++-ish).
Whenever you see indexes and index incrementation in R or Python,
in 99% of cases there is a better way to express it.
In this case, you are walking through nld with a while loop, which is not idiomatic.
In R you would use Map(), which can iterate through several vectors/lists in parallel.
nld <- 1:10
result <- Map(f=function(x, y) (x - y)/y * 100,
nld[2:length(nld)],
nld)
But there is a mistake in your original code.
You loop from i=1 to i=length(nld) but access nld[i+1].
On the last iteration, i+1 points past the end of nld, so nld[i+1] is NA.
So it should be while (i < length(nld)) { ...
and
result <- Map(f=function(x, y) (x - y)/y * 100,
nld[2:length(nld)],
nld[1:(length(nld)-1)])
Or even more R-ish: use vectorization:
f <- function(x, y) (x-y)/y*100
f(nld[2:length(nld)], nld[1:(length(nld)-1)])
## [1] 100.00000 50.00000 33.33333 25.00000 20.00000 16.66667 14.28571
## [8] 12.50000 11.11111
Or:
f <- function(vec) {
vec1 <- vec[2:length(vec)]
vec2 <- vec[1:(length(vec)-1)]
(vec1 - vec2)/vec2 * 100 # this uses vectorization! divide by the earlier value, as in the original loop
}
f(nld)
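The same vectorized calculation can also be written with diff(), which returns the differences between consecutive elements. This is only a sketch: pct_change is a hypothetical helper name, and nld is stand-in data as in the answer above.

```r
# Hypothetical helper: percentage change between consecutive elements.
pct_change <- function(vec) {
  # diff(vec) is vec[i+1] - vec[i]; head(vec, -1) drops the last element,
  # so each difference is divided by the earlier value.
  diff(vec) / head(vec, -1) * 100
}

nld <- 1:10             # stand-in data
result <- pct_change(nld)
length(result)          # one element fewer than nld
```

Because result is an ordinary numeric vector, it can be passed straight to plot(result, type = "l").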

My gsub is not working for my string, why? [duplicate]

I have a string such as:
"aabbccccdd"
I want to break this string into a vector of substrings of length 2 :
"aa" "bb" "cc" "cc" "dd"
Here is one way
substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"
or more generally
text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"
Edit: This is much, much faster
sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
It first splits the string into single characters. Then it pastes each odd-positioned character together with the even-positioned character that follows it.
Timings
text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
sst <- strsplit(text, "")[[1]]
paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
# test replications elapsed relative user.self sys.self user.child sys.child
#1 g1 100 95.451 79.87531 95.438 0 0 0
#2 g2 100 1.195 1.00000 1.196 0 0 0
There are two easy possibilities:
s <- "aabbccccdd"
gregexpr and regmatches:
regmatches(s, gregexpr(".{2}", s))[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"
strsplit:
strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]
# [1] "aa" "bb" "cc" "cc" "dd"
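One thing to watch with the regular-expression approach: the pattern .{2} only matches complete pairs, so the trailing character of an odd-length string is dropped. A small sketch, assuming you want to keep the remainder, is to allow a shorter final match with .{1,2} (the input string here is made up for illustration):

```r
s_odd <- "aabbc"   # odd length, made up for illustration

# Fixed width of 2: the final "c" is silently dropped.
pairs_only <- regmatches(s_odd, gregexpr(".{2}", s_odd))[[1]]

# Width of 1 to 2: greedy matching still takes full pairs first,
# then the leftover single character.
with_rest <- regmatches(s_odd, gregexpr(".{1,2}", s_odd))[[1]]

pairs_only
with_rest
```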
string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)
# the indices where each substr will start
starts <- seq(1,num.chars, by=2)
# chop it up
sapply(starts, function(ii) {
substr(string, ii, ii+1)
})
Which gives
[1] "aa" "bb" "cc" "cc" "dd"
One can use a matrix to group the characters:
s2 <- function(x) {
m <- matrix(strsplit(x, '')[[1]], nrow=2)
apply(m, 2, paste, collapse='')
}
s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"
Unfortunately, this breaks for an input of odd string length, giving a warning:
s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
## data length [3] is not a sub-multiple or multiple of the number of rows [2]
More unfortunate is that g1 and g2 from @GSee silently return incorrect results for an input of odd string length:
g1('abc')
## [1] "ab"
g2('abc')
## [1] "ab" "cb"
Here is a function in the spirit of s2 that takes the number of characters per group as a parameter and leaves the last entry short if necessary:
s <- function(x, n) {
sst <- strsplit(x, '')[[1]]
m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
m[seq_along(sst)] <- sst
apply(m, 2, paste, collapse='')
}
s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d"
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld"
(It is indeed slower than g2, but faster than g1 by about a factor of 7)
Ugly but works
sequenceString <- "ATGAATAAAG"
J <- 3  # maximum sequence length in file
sequenceSmallVecStart <-
substring(sequenceString, seq(1, nchar(sequenceString)-J+1, J),
seq(J,nchar(sequenceString), J))
sequenceSmallVecEnd <-
substring(sequenceString, max(seq(J, nchar(sequenceString), J))+1)
sequenceSmallVec <-
c(sequenceSmallVecStart,sequenceSmallVecEnd)
cat(sequenceSmallVec,sep = "\n")
Gives
ATG
AAT
AAA
G

How to extract elements and their indices from a list in R

I would like to extract list elements and their indices in R while removing items with 0 length. Let's say I have the following list in R:
l1 <- character(0)
l2 <- c("a","b")
l3 <- c("c","d","e")
list1 <- list(l1, l1, l2, l1, l3)
Then list1 returns the following:
[[1]]
character(0)
[[2]]
character(0)
[[3]]
[1] "a" "b"
[[4]]
character(0)
[[5]]
[1] "c" "d" "e"
I would like to somehow extract an object that displays the index/position for each non-empty element, as well as the contents of that element. So something that looks like this:
[[3]]
[1] "a" "b"
[[5]]
[1] "c" "d" "e"
The closest I've come to doing this is by removing the empty elements, but then I lose the original index/position of the remaining elements:
list2 <- list1[lapply(list1, length) > 0]
list2
[[1]]
[1] "a" "b"
[[2]]
[1] "c" "d" "e"
keep() keeps the elements that match a predicate. negate(is_empty) creates a function that returns TRUE if a vector is not empty.
library("purrr")
names(list1) <- seq_along(list1)
keep(list1, negate(is_empty))
#> $`3`
#> [1] "a" "b"
#>
#> $`5`
#> [1] "c" "d" "e"
Overview
Keeping the indices required me to name each element in the list. This answer uses which() to set the condition that I apply to list1 to keep non-zero length elements.
# load data
l1 <- character(0)
l2 <- c("a","b")
l3 <- c("c","d","e")
list1 <- list( l1, l1, l2, l1, l3)
# name each element in the list
names( list1 ) <- as.character( 1:length( list1 ) )
# create a condition that
# keeps only non zero length elements
# from list1
non.zero.length.elements <-
which( lapply( X = list1, FUN = length ) != 0 )
# apply the condition to list1
# to view the non zero length elements
list1[ non.zero.length.elements ]
# $`3`
# [1] "a" "b"
#
# $`5`
# [1] "c" "d" "e"
# end of script #
I'm not sure exactly what 'extract an object that displays' means, but if you just want to print you can use this modified print.
I just slightly edited print.listof (it's not recursive! zero length subelements will be displayed):
print2 <- function (x, ...)
{
nn <- names(x)
ll <- length(x)
if (length(nn) != ll)
nn <- paste0("[[", seq.int(ll),"]]")
for (i in seq_len(ll)[lengths(x)>0]) {
cat(nn[i], "\n")
print(x[[i]], ...)
cat("\n")
}
invisible(x)
}
print2(list1)
[[3]]
[1] "a" "b"
[[5]]
[1] "c" "d" "e"
A very simple solution is to provide names to the elements of your list and then run your function again. There are several ways to name your elements.
l1 <- character(0)
l2 <- c("a","b")
l3 <- c("c","d","e")
list1 <- list(e1=l1, e2=l1, e3=l2, e4=l1, e5=l3)
list1
names(list1)<-paste0("element",seq(length(list1)))
list1[lapply(list1, length) > 0]
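The naming step and the filter used in the answers above can be combined in base R. A minimal sketch (keep_nonempty is a hypothetical name), assuming the original positions should end up as the names of the result:

```r
# Hypothetical helper: drop empty elements but remember where the rest came from.
keep_nonempty <- function(x) {
  idx <- which(lengths(x) > 0)   # positions of the non-empty elements
  setNames(x[idx], idx)          # carry the original positions as names
}

list1 <- list(character(0), character(0), c("a", "b"),
              character(0), c("c", "d", "e"))
res <- keep_nonempty(list1)
names(res)   # original positions, stored as character strings
```

If the positions are needed as numbers later, as.integer(names(res)) converts them back.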

How to convert a nested list to a single list?

Consider the following nested list:
vars <- c("A", "B")
lapply(1:2, function(x) combn(vars, x, simplify=FALSE))
[[1]]
[[1]][[1]]
[1] "A"
[[1]][[2]]
[1] "B"
[[2]]
[[2]][[1]]
[1] "A" "B"
I want to convert the results (maybe by using a different function other than lapply) to the following:
[[1]]
[1] "A"
[[2]]
[1] "B"
[[3]]
[1] "A" "B"
Seems like you just want to remove one level of nesting. You can do that with unlist(..., recursive=FALSE)
vars <- c("A", "B")
x <- lapply(1:2, function(x) combn(vars, x, simplify=FALSE))
unlist(x, recursive=FALSE)
# [[1]]
# [1] "A"
#
# [[2]]
# [1] "B"
#
# [[3]]
# [1] "A" "B"
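To see why recursive=FALSE matters here, compare it with the default. A short sketch using the same data; do.call(c, x) is shown as an alternative that also removes exactly one level of nesting (the variable names are made up for illustration):

```r
vars <- c("A", "B")
x <- lapply(1:2, function(k) combn(vars, k, simplify = FALSE))

flat_one <- unlist(x, recursive = FALSE)  # list of 3 character vectors
flat_all <- unlist(x)                     # default: one flat character vector
flat_c   <- do.call(c, x)                 # concatenating the sublists gives the same list

length(flat_one)
flat_all
```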

Strange bracket assignment call ('[<-') with matrix argument

Recently I've stumbled upon this bit of code:
y <- NULL
y[cbind(1:2, 1:2)] <- list( list(1,2), list(2,3))
From the second answer here.
But it doesn't seem to differ from y <- list(...), as the comparisons below show:
> identical(y, list( list(1,2), list(2,3)))
[1] TRUE
> identical(y, y[cbind(1:2, 1:2)])
[1] FALSE
What is going on in the bracket assignment here? Why doesn't it throw an error? And why is it different from the non-assignment version in the last line of code?
Matrix indexing only applies when y has dim. Combine this with standard R recycling and the fact that all matrices are actually vectors, and this behavior makes sense.
When you initialize y to NULL, you ensure it has no dim. Therefore, when you index y by a matrix, say ind, you get the same results as having called y[as.vector(ind)]
identical(y[ind], y[as.vector(ind)])
# [1] TRUE
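A small sketch of the difference, using made-up data: the same index matrix selects by (row, column) pairs when the object has a matching dim, and is flattened to a plain index vector when it does not.

```r
ind <- cbind(1:2, 1:2)

m <- matrix(letters[1:4], nrow = 2)  # has dim, and ncol(ind) == length(dim(m))
v <- as.list(letters[1:4])           # plain list, no dim

m[ind]   # matrix indexing: picks m[1, 1] and m[2, 2]
v[ind]   # ind is treated as c(1, 2, 1, 2), so four elements come back

identical(v[ind], v[as.vector(ind)])
```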
If there are repeated values in ind and you are also assigning, then for each index only the last value assigned to it will remain. For example, let's assume we are executing
y <- NULL; y[cbind(1:2, 2:1)] <- list( list(1,2), list(3,4) )
# y has no dimension, so `y[cbind(1:2, 2:1)]`
# is the equivalent of `y[c(1:2, 2:1)]`
When you assign y[c(1, 2, 2, 1)] <- list("A", "B"), the right-hand side is recycled to "A", "B", "A", "B", so in effect what happens is analogous to:
y[[1]] <- "A"
y[[2]] <- "B"
y[[2]] <- "A" # <~~ 'Overwriting' previous value
y[[1]] <- "B" # <~~ 'Overwriting' previous value
Here is a further look at the indexing that occurs: (Notice how the first two letters are being repeated)
ind <- cbind(1:2, 1:2)
L <- as.list(LETTERS)
L[ind]
# [[1]]
# [1] "A"
#
# [[2]]
# [1] "B"
#
# [[3]]
# [1] "A"
#
# [[4]]
# [1] "B"
Here is the same thing, now with assignment. Notice how only the 3rd and 4th values being assigned have been kept.
L[ind] <- c("FirstWord", "SecondWord", "ThirdWord", "FourthWord")
L[ind]
# [[1]]
# [1] "ThirdWord"
#
# [[2]]
# [1] "FourthWord"
#
# [[3]]
# [1] "ThirdWord"
#
# [[4]]
# [1] "FourthWord"
Try a different index for further clarity:
ind <- cbind(c(3, 2), c(1, 3)) ## will be treated as c(3, 2, 1, 3)
L <- as.list(LETTERS)
L[ind] <- c("FirstWord", "SecondWord", "ThirdWord", "FourthWord")
L[1:5]
# [[1]]
# [1] "ThirdWord"
#
# [[2]]
# [1] "SecondWord"
#
# [[3]]
# [1] "FourthWord"
#
# [[4]]
# [1] "D"
#
# [[5]]
# [1] "E"
L[ind]
# [[1]]
# [1] "FourthWord"
#
# [[2]]
# [1] "SecondWord"
#
# [[3]]
# [1] "ThirdWord"
#
# [[4]]
# [1] "FourthWord"
Edit regarding @agstudy's questions:
Looking at the src for [ we have the following comments:
The special [ subscripting where dim(x) == ncol(subscript matrix)
is handled inside VectorSubset. The subscript matrix is turned
into a subscript vector of the appropriate size and then
VectorSubset continues.
Looking at the function static SEXP VectorSubset(SEXP x, SEXP s, SEXP call) the relevant check is the following:
/* lines omitted */
attrib = getAttrib(x, R_DimSymbol);
/* lines omitted */
if (isMatrix(s) && isArray(x) && ncols(s) == length(attrib)) {
/* lines omitted */
...
