Consider the following vector x and list s:
x <- c("apples and pears", "one banana", "pears, oranges, and pizza")
s <- strsplit(x, "(,?)\\s+")
The desired result will be the following, but please keep reading.
> t(sapply(s, `length<-`, 4))
# [,1] [,2] [,3] [,4]
#[1,] "apples" "and" "pears" NA
#[2,] "one" "banana" NA NA
#[3,] "pears" "oranges" "and" "pizza"
That's fine; it's a good way to do it. But R's vectorization is one of its best features, and I'd like to see if I can do this with recursive indexing, that is, using only [ subscript indexing.
I want to start with the following, and use the row and column indices to turn the list s into a 3x4 matrix. So I'm calling cbind on the list s and starting from there.
(cb <- cbind(s))
# s
# [1,] Character,3
# [2,] Character,2
# [3,] Character,4
class(cb[1])
#[1] "list"
is.recursive(cb)
#[1] TRUE
I've gotten this far, but now I'm struggling with the higher dimensions. Here's the first row; from here I aim to unlist the rest of the matrix using [ and [[ indexing.
w <- character(nrow(cb)+nrow(cb)^2)
dim(w) <- c(3,4)
w[cbind(1, 1:3)] <- cb[[1]]
# [,1] [,2] [,3] [,4]
#[1,] "apples" "and" "pears" ""
#[2,] "" "" "" ""
#[3,] "" "" "" ""
At level 2 it gets more difficult. I've been doing things like this
> cb[[c(1,2,1), exact = TRUE]]
# Error in cb[[c(1, 2, 1), exact = TRUE]] :
# recursive indexing failed at level 2
> cb[[cbind(1,2,1)]]
# Error in cb[[cbind(1, 2, 1)]] : recursive indexing failed at level 2
Here's an example of how the indexing proceeds. I've tried all kinds of combinations of w[[cbind(1, 1:2)]] and the like.
w[cbind(1, 1:3)] <- cb[[1]]
w[cbind(2, 1:2)] <- cb[[2]]
w[cbind(3, 1:4)] <- cb[[3]]
From the empty matrix w, this produces the result
# [,1] [,2] [,3] [,4]
#[1,] "apples" "and" "pears" ""
#[2,] "one" "banana" "" ""
#[3,] "pears" "oranges" "and" "pizza"
Is it possible to use recursive indexing at all levels, so that I can unlist cb into the empty matrix directly from when it was a list, i.e., collapse the three w[] <- cb[[]] lines into one?
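For comparison, plain matrix indexing can do the whole fill in one assignment (a sketch; lengths() needs R >= 3.2), but the right-hand side still leans on unlist() rather than recursive [[ indexing:
len <- lengths(cb)
w[cbind(rep(seq_along(cb), len), sequence(len))] <- unlist(cb, use.names = FALSE)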
I'm asking this because it gets to the heart of matrix structures in R. It's about learning the indexing, not about finding an alternative solution to my problem.
You can use the rbind.fill.matrix function from the plyr package.
library(plyr)
rbind.fill.matrix(lapply(s, rbind))
This returns
1 2 3 4
[1,] "apples" "and" "pears" NA
[2,] "one" "banana" NA NA
[3,] "pears" "oranges" "and" "pizza"
Note that this does use as.matrix internally: rbind.fill.matrix calls matrices[] <- lapply(matrices, as.matrix)
If you wanted to bypass the intermediate steps, you can just use my cSplit function, like this:
cSplit(as.data.table(x), "x", "(,?)\\s+", fixed = FALSE)
# x_1 x_2 x_3 x_4
# 1: apples and pears NA
# 2: one banana NA NA
# 3: pears oranges and pizza
as.matrix(.Last.value)
# x_1 x_2 x_3 x_4
# [1,] "apples" "and" "pears" NA
# [2,] "one" "banana" NA NA
# [3,] "pears" "oranges" "and" "pizza"
Under the hood, however, that still does require creating a matrix and filling it in. It uses matrix indexing to fill in the values, so it is quite fast.
A manual approach would look something like:
myFun <- function(invec, split, fixed = TRUE) {
  s <- strsplit(invec, split, fixed)
  ## number of pieces in each input string
  Ncol <- vapply(s, length, 1L)
  ## pre-allocate an NA character matrix of the final shape
  M <- matrix(NA_character_, ncol = max(Ncol),
              nrow = length(invec))
  ## fill via matrix indexing: row i is repeated once per piece
  M[cbind(rep(sequence(length(invec)), times = Ncol),
          sequence(Ncol))] <- unlist(s, use.names = FALSE)
  M
}
myFun(x, "(,?)\\s+", FALSE)
# [,1] [,2] [,3] [,4]
# [1,] "apples" "and" "pears" NA
# [2,] "one" "banana" NA NA
# [3,] "pears" "oranges" "and" "pizza"
Speed is not everything, but it certainly should be a consideration for this type of transformation.
Here are some tests of what has been suggested so far:
## The manual approach
fun1 <- function(x) myFun(x, "(,?)\\s+", FALSE)
## The cSplit approach
fun2 <- function(x) cSplit(as.data.table(x), "x", "(,?)\\s+", fixed = FALSE)
## The OP's approach
fun3 <- function(x) {
  s <- strsplit(x, "(,?)\\s+")
  mx <- max(sapply(s, length))
  do.call(rbind, lapply(s, function(x) { length(x) <- mx; x }))
}
## The plyr approach
fun4 <- function(x) {
  s <- strsplit(x, "(,?)\\s+")
  rbind.fill.matrix(lapply(s, rbind))
}
And, for fun, here's another approach, this one using dcast.data.table:
fun5 <- function(x) {
  dcast.data.table(
    data.table(strsplit(x, "(,?)\\s+"))[
      , list(unlist(V1)), by = sequence(length(x))][
      , N := sequence(.N), by = sequence],
    sequence ~ N, value.var = "V1")
}
Testing is on slightly bigger data. Not very big, just 12k values:
x <- unlist(replicate(4000, x, FALSE))
length(x)
# [1] 12000
## I expect `rbind.fill.matrix` to be slow:
system.time(fun4(x))
# user system elapsed
# 3.38 0.00 3.42
library(microbenchmark)
microbenchmark(fun1(x), fun2(x), fun3(x), fun5(x))
# Unit: milliseconds
# expr min lq median uq max neval
# fun1(x) 97.22076 100.8013 102.5754 107.8349 166.6632 100
# fun2(x) 115.01466 120.6389 125.0622 138.0614 222.7428 100
# fun3(x) 146.33339 155.9599 158.8394 170.3917 228.5523 100
# fun5(x) 257.53868 266.5994 273.3830 296.8003 346.3850 100
A bit bigger data, but still not what others might consider big: 1.2M values.
X <- unlist(replicate(100, x, FALSE))
length(X)
# [1] 1200000
## Dropping fun3 and fun5 now, though they are very close...
## I wonder how fun5 scales further (but don't have the patience to wait)
system.time(fun5(X))
# user system elapsed
# 31.28 0.43 31.76
system.time(fun3(X))
# user system elapsed
# 31.62 0.33 31.99
microbenchmark(fun1(X), fun2(X), times = 10)
# Unit: seconds
# expr min lq median uq max neval
# fun1(X) 11.65622 11.76424 12.31091 13.38226 13.46488 10
# fun2(X) 12.71771 13.40967 14.58484 14.95430 16.15747 10
The penalty for the cSplit approach comes from having to convert to a data.table and check various conditions, but as your data grows, those penalties become less noticeable.
Related
Say I have an even-length vector such as this:
v <- c(1,1,1,1,2,2,2,3,3,3,4,5,6,7)
It is 14 elements long. I wish to randomly sample 7 pairs of elements without replacement, with the rule that no pair may contain two copies of the same item.
So the following result would be acceptable:
1-2, 1-2, 1-2, 1-3, 3-4, 3-5, 6-7
I am not sure how to do this systematically. Clearly brute force would work, e.g.
set.seed(1)
v <- c(1,1,1,1,2,2,2,3,3,3,4,5,6,7)
length(v)
v1 <- sample(v)
pairs <- split(v1, ceiling(seq_along(v1)/2))
sapply(pairs, diff)
# 1  2  3  4  5  6  7
# 1  1  2  3 -6 -3  3
This shows that no pair has duplicate elements, as the difference is never 0. In my case I need to do this thousands of times, and it's not so easy to avoid duplicates. Is there a more effective way?
v0 <- table(v)
set.seed(2)
out <- replicate(7, sample(names(v0), size=2, prob=v0))
out
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "1" "2" "4" "1" "3" "2" "6"
# [2,] "5" "1" "7" "7" "2" "1" "1"
I use table(v) and names(v0) so that I'm guaranteed the names and the probs are in the same order. (I didn't want to assume that your actual data is structured identically.) If you need integers, it's easy enough to use as.integer.
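For example (a quick sketch; apply keeps the 2-row shape here):
## coerce the character matrix to integer, column by column
out_int <- apply(out, 2, as.integer)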
If you literally need 1-2, then
apply(out, 2, paste, collapse="-")
# [1] "1-5" "2-1" "4-7" "1-7" "3-2" "2-1" "6-1"
I'm confident that this will produce no dupes (because names(v0) is unique and the default replace=FALSE), but here's an empirical test:
set.seed(3)
l <- replicate(1e5, sample(unique(v), size=2, prob=table(v)))
any(l[1,] == l[2,])
# [1] FALSE
Here is a variation of your "brute-force" approach (better known as "hit-or-miss"):
rand.pairs <- function(v, time.out = 1000) {
  n <- length(v)
  for (i in 1:time.out) {
    v <- sample(v)                    # shuffle the whole vector
    first  <- v[1:(n/2)]              # first halves of the pairs
    second <- v[(n/2 + 1):n]          # second halves of the pairs
    if (all(first != second)) return(unname(rbind(first, second)))
  }
  NULL                                # give up: no valid pairing found
}
The point of time.out is to avoid infinite loops. For some input vectors a solution might be either impossible or too hard to hit upon by chance.
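For instance, an input whose elements are all identical has no valid pairing, so the call exhausts its time.out shuffles and returns NULL:
rand.pairs(c(1, 1, 1, 1))
# NULL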
Example run:
> v <- c(1,1,1,1,2,2,2,3,3,3,4,5,6,7)
> set.seed(1234)
> rand.pairs(v)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 6 3 3 7 2 2 5
[2,] 1 4 1 1 3 1 2
It is fast enough to run thousands of times:
> library(microbenchmark)
> microbenchmark(rand.pairs(v))
Unit: microseconds
expr min lq mean median uq max neval
rand.pairs(v) 6.7 7.758 16.17517 12.166 19.747 70.877 100
Your mileage may vary, but if your machine is at all comparable, you should be able to call this function over 50,000 times per second. replicate(10000,rand.pairs(v)) takes much less than a second to run. On the other hand, if you have an input for which the constraints are harder to satisfy, a solution might require more time.
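As a rough check on your own machine (output omitted, since timings vary):
system.time(replicate(10000, rand.pairs(v)))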
I want to extract the element-wise minimum across several matrices stored in a list. I'm using pmin:
do.call(pmin, mylist)
The problem is that some elements of those matrices are NA, and pmin yields NA where I want it to yield the minimum after excluding the NAs. I tried to solve my problem using
do.call(pmin(na.rm=T), mylist)
but I get an error. I also tried this answer: data.table and pmin with na.rm=TRUE argument, but I get an error because .SD is not in the environment.
Simple code for a similar problem would be:
mymat1 <- matrix(rnorm(10), ncol=2)
mymat2 <- matrix(rnorm(10), ncol=2)
mymat2[2,2] <- NA
mymat3 <- matrix(rnorm(10), ncol=2)
mylist <- list(mymat1, mymat2, mymat3)
do.call(pmin, mylist)
I get NA at position [2,2] of the resulting matrix, and I want to get the minimum values ignoring NAs.
Any suggestions?
Thank you.
Concatenate na.rm = TRUE as a named list element and then call pmin with do.call so that the na.rm argument is found:
do.call(pmin, c(mylist, list(na.rm = TRUE)))
# [,1] [,2]
#[1,] -1.0830716 -0.1237099
#[2,] -0.5949517 -3.7873790
#[3,] -2.1003236 -1.2565663
#[4,] -0.4500171 -1.0588205
#[5,] -1.0937602 -1.0537657
If you use purrr / the tidyverse, you can use purrr::invoke.
library(purrr)
invoke(pmin,mylist,na.rm=TRUE)
# [,1] [,2]
# [1,] -0.3053884 -1.3770596
# [2,] 0.9189774 -0.4149946
# [3,] -0.1027877 -0.3942900
# [4,] -0.6212406 -1.4707524
# [5,] -2.2146999 -0.4781501
It is basically do.call with a ... argument, and its source code is more or less akrun's answer:
function (.f, .x = NULL, ..., .env = NULL)
{
  .env <- .env %||% parent.frame()
  args <- c(as.list(.x), list(...))
  do.call(.f, args, envir = .env)
}
purrr::partial is also interesting:
pmin2 <- partial(pmin,na.rm=TRUE)
do.call(pmin2,mylist)
# [,1] [,2]
# [1,] -0.3053884 -1.3770596
# [2,] 0.9189774 -0.4149946
# [3,] -0.1027877 -0.3942900
# [4,] -0.6212406 -1.4707524
# [5,] -2.2146999 -0.4781501
Basically, the idea is, for each value n = 10, 20, 30, ..., 100, to take the mean of 10,000 random samples of size n, saving the 10,000 means for later use.
In a language I'm more accustomed to, I would create a hashmap using each n as a key, and a list of means as the value.
In JavaScript, for example:
var mydata;
var map = {};
for (var i = 10; i <= 100; i += 10) {
  map[i] = []; // create list
  for (var j = 0; j < 10000; j++) {
    // mean() and sample() stand in for the R functions
    map[i][j] = mean(sample(mydata, i));
  }
}
Now I attempted to do this in R (this is my first time using it), and I ended up with:
hashmap <- new.env()
sunspots <- read.table("sunspots.txt")
for (i in seq(10, 100, by=10)) {
  hashmap[[i]] <- c()
  for (j in 1:10000) {
    hashmap[[i]][j] <- mean(sample(sunspots$x, i))
  }
}
However, this throws an error:
wrong args for environment subassignment
Even if it didn't throw this error, I'm not entirely sure if I'm approaching it the right way.
Could someone help me understand the proper way to go about this?
The issue here is that i is a numeric, and environments must be keyed by character strings. Thus your immediate problem can be solved with a simple as.character() coercion on the i variable when it's used to index hashmap.
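A minimal illustration of both behaviours (reproducing your error):
env <- new.env()
env[[10]] <- 1
# Error in env[[10]] <- 1 : wrong args for environment subassignment
env[[as.character(10)]] <- 1   # fine: the key is the string "10"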
I would also recommend you refactor the inner loop into a vectorized function call, such as replicate(). Here's how I would do this:
set.seed(1L);
test.data <- 1:200;
N <- 3L;
e <- new.env();
for (i in seq(10L,100L,10L)) e[[as.character(i)]] <- replicate(N,mean(sample(test.data,i)));
Result:
ls(e);
## [1] "10" "100" "20" "30" "40" "50" "60" "70" "80" "90"
for (i in seq(10L,100L,10L)) print(e[[as.character(i)]]);
## [1] 108.3 109.4 82.4
## [1] 108.50 93.65 106.20
## [1] 103.3333 96.0000 101.2333
## [1] 98.075 95.250 83.275
## [1] 106.68 97.48 107.34
## [1] 97.48333 105.95000 98.76667
## [1] 101.8857 102.4857 114.6000
## [1] 99.5875 107.0875 96.0750
## [1] 92.9000 103.0889 100.7889
## [1] 91.19 99.80 101.57
You can change N to 10000 and test.data to sunspots for your real data.
Also, here's an alternative that produces a matrix output, built around the convenient feature of sapply() that it returns a matrix for multi-element return values from FUN():
set.seed(1L);
sapply(seq(10L,100L,10L),function(i) replicate(N,mean(sample(test.data,i))));
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 108.3 108.50 103.3333 98.075 106.68 97.48333 101.8857 99.5875 92.9000 91.19
## [2,] 109.4 93.65 96.0000 95.250 97.48 105.95000 102.4857 107.0875 103.0889 99.80
## [3,] 82.4 106.20 101.2333 83.275 107.34 98.76667 114.6000 96.0750 100.7889 101.57
Wouldn't this be the same, but simpler and more readable?
set.seed(123)
N <- 10000
sunspots <- rnorm(N, 10, 2)
sim <- lapply(seq(10, 100, by=10), function(i){
  sapply(1:N, function(j){
    mean(sample(sunspots, i))
  })
})
lapply(sim, head)
That makes sense, as replicate is just an sapply call:
> replicate
function (n, expr, simplify = "array")
sapply(integer(n), eval.parent(substitute(function(...) expr)),
simplify = simplify)
<bytecode: 0x19b0b7108>
<environment: namespace:base>
EDIT
As mentioned in the comments.
simulation <- function(data, i){
  sapply(1:N, function(j) mean(sample(data, i)))
}
sim <- lapply(seq(10, 100, by=10), function(i) simulation(sunspots, i))
# This would give the same output.
do.call(cbind, lapply(sim, head))
# You could potentially use sapply on the first level also.
sim <- sapply(seq(10, 100, by=10), function(i) simulation(sunspots, i))
str(sim)
Okay, here's the situation:
I have the following list of arrays:
N <- c('A', 'B', 'C')
ll <- sapply(N, function(x) NULL)
ll <- lapply(ll, function(x) assign("x", array(0, dim = c(2,2))))
Now I want to replace, say, the element at position [2,1] in those arrays with a given quantity, say 10, following this question here. That is, I'm doing the following:
x <- lapply(ll, function(x) {x[2,1] <- 10})
which should make x a list of three 2x2 arrays with the [2,1] element equal to 10 and all others equal to 0. Instead, I'm seeing this:
> x <- lapply(ll, function(x) {x[2,1] <- 10})
> x
$A
[1] 10
$B
[1] 10
$C
[1] 10
Any ideas of what's going on here?
You're not returning the modified object, so the value of the last evaluated expression is returned instead. That is, when you do
x <- lapply(ll, function(x) {x[2,1] <- 10})
You intend to say:
x <- lapply(ll, function(x) {x[2,1] <- 10; return(x)})
If you don't specify a return value, the value of the last expression is returned, and an assignment evaluates to the assigned value, which here is 10. Instead, use return(x) or, equivalently, just x as the last expression:
x <- lapply(ll, function(x) {x[2,1] <- 10; x})
# $A
# [,1] [,2]
# [1,] 0 0
# [2,] 10 0
#
# $B
# [,1] [,2]
# [1,] 0 0
# [2,] 10 0
#
# $C
# [,1] [,2]
# [1,] 0 0
# [2,] 10 0
Although the apply family would generally be preferred, here is a for loop alternative, just for the sake of having one:
for (i in seq_along(ll)) ll[[i]][2,1] <- 10
Consider the following matrix,
m <- matrix(letters[c(1,2,NA,3,NA,4,5,6,7,8)], 2, byrow=TRUE)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "a" "b" NA "c" NA
## [2,] "d" "e" "f" "g" "h"
I wish to obtain the column indices corresponding to all non-NA elements, merged with the NA elements immediately following:
result <- c(list(1), list(2:3), list(4:5),
            list(1), list(2), list(3), list(4), list(5))
Any ideas?
The column (and row) indices of non-NA elements can be obtained with
which(!is.na(m), TRUE)
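For the example m, that gives:
#      row col
# [1,]   1   1
# [2,]   2   1
# [3,]   1   2
# [4,]   2   2
# [5,]   2   3
# [6,]   1   4
# [7,]   2   4
# [8,]   2   5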
A full answer:
Since you want to work row-wise but R stores matrices column-wise, it is easier to work on the transpose of m.
t_m <- t(m)
n_cols <- ncol(m)
We get the array indices as mentioned above, which give the starting point of each run.
ind_non_na <- which(!is.na(t_m), TRUE)
Since we are working on the transpose, we want the row indices, and we need to deal with each column separately.
start_points <- split(ind_non_na[, 1], ind_non_na[, 2])
The length of each run is given by the difference between consecutive starting points, or, for the last one, by the distance from the last point to the end of the row (+1). Then we just call seq to build each sequence.
unlist(
  lapply(
    start_points,
    function(x) {
      len <- c(diff(x), n_cols - x[length(x)] + 1L)
      mapply(seq, x, length.out = len, SIMPLIFY = FALSE)
    }
  ),
  recursive = FALSE
)
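Applied to the example m, this returns a list of eight vectors: 1, 2:3, 4:5 from the first row and 1, 2, 3, 4, 5 from the second, matching result above.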
This will get you close:
cols <- col(m)
cbind(cols[which(is.na(m)) - 1], cols[is.na(m)])
[,1] [,2]
[1,] 2 3
[2,] 4 5