generate nested number sequences R-style

I need to generate number sequences as follows:
1
1,2
1,2,3
...
1,2,3...,n
2
2,3
2,3,4
...
2,3,4,...,n
...
...
n-1
n-1,n
n
I come from other programming languages where loops are perfectly fine, but I understand the R community favors so-called vectorized operations over loops (they are more efficient, although I haven't read all the details on why).
So the first thing that came to mind for what I need to do was loops, and I wrote this code, which certainly does the job (R gurus say ugh in 3, 2, 1...):
n <- 30
accum <- list()
for (x in 1:n) {
  for (y in x:n) {
    accum[[paste(x, y)]] <- x:y
  }
}
But this is ugly code (and I guess inefficient).
So, what is the clever R-style code for my problem?
I certainly haven't mastered vectorized operations or the apply family of functions, but my best shot at this was:
n <- 30
accum <- lapply(1:n, FUN = function(x){lapply(x:n, FUN = seq, from = x)})
No idea if this is good R-style coding, but it almost gets the job done. The problem with this solution is that it produces a list with n elements, which are themselves lists containing the sequences. What I wanted was a list with 465 elements (for n = 30), i.e. one element per sequence, without all the nesting of lists that this solution produces.
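For reference, the expected number of sequences is n*(n+1)/2, which is where the 465 comes from:
n <- 30
n * (n + 1) / 2
# [1] 465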
I would really appreciate solutions that are clever and elegant in the R world.

To get a single vector:
n <- 4
u <- sequence(n:1)
(v <- sequence(u) + rep(1:n, rev(cumsum(1:n))) - 1)
# [1] 1 1 2 1 2 3 1 2 3 4 2 2 3 2 3 4 3 3 4 4
and a list of vectors:
split(v, rep(cumsum(u), u))
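A quick look at the shape of that result for this small n (a minimal sketch, reusing u and v from above):
res <- split(v, rep(cumsum(u), u))
length(res)   # 10 sequences for n = 4, i.e. n*(n+1)/2
res[[2]]      # 1 2
res[[10]]     # 4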
or something very similar to your solution:
Reduce('c', lapply(1:n, function(x) lapply(x:n, seq, from = x)))

Your second solution is good. All you have to do is unlist one layer:
unlist(lapply(1:n, FUN = function(x) lapply(x:n, FUN = seq, from = x)), rec=FALSE)
What you have here is the list monad in disguise. To make that clearer, consider the following, which is equivalent:
mapcat <- function(x,f,...) unlist(lapply(x,f,...),rec=FALSE)
mapcat(1:n,function(a) mapcat(a:n, function(b) list(seq(a,b))))
Here mapcat is the bind operation, and list is the unit/return.
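A quick check of that equivalence (a minimal sketch, assuming n is defined as before):
a_flat <- mapcat(1:n, function(a) mapcat(a:n, function(b) list(seq(a, b))))
b_flat <- unlist(lapply(1:n, FUN = function(x) lapply(x:n, FUN = seq, from = x)), recursive = FALSE)
identical(a_flat, b_flat)
# should be TRUE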
In languages with the do-notation for list monads, this could be written, for example in Haskell, as
do
  a <- [1..n]
  b <- [a..n]
  return [a..b]
I don't know of any R package that implements such sugar, but using the foreach package we can get closer:
library(foreach)
foreach(a=1:n, .combine='c') %:% foreach(b=a:n) %do% seq(a,b)

Related

optimizing code in R for vector comparisons in data.table

As part of my program in R, I have to compare a huge number of pairs of sentences with some functions (the one I'm showing here checks whether two sentences with the same number of words differ in exactly one word).
To make things faster, I have already converted all words into integers, so I am dealing with integer vectors, and the example function is very simple:
is_sub_num <- function(a,b){sum(!(a==b))==1}
where a and b are integer vectors such as
a = c(1,2,3); b=c(1,4,3)
is_sub_num(a,b)
# [1] TRUE
my data will be stored in a data.table
Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
$ ID: int 1 2 3 4 5 6 7 8 9 10 ...
$ V2:List of 100
..$ : int 4 4 3 4
..$ : int 1 2 3 1
the length of each entry may be different (in the example below, the entries are all of size 4)
I have a table of candidate pair IDs, and I test the corresponding entries of DT with the function above as follows:
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
here is a simplification of what I'm trying to do:
set.seed(234)
z = lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
DT <- as.data.table(1:100)
DT$V2 <- z
colnames(DT) <- c("ID","V2")
print(system.time(tmp <-apply(pair_list,1,is_pair_ok)))
This takes around 22 seconds on my laptop, even though there are only 10,000 entries and the functions are very basic.
Do you have any advice on how to speed up the code?
I have delved further into this issue myself, and here is my answer; I think it is an important one.
The code for the answer is below. I have added some new parameters to make the problem a bit more general.
The key point is to use the unlist function.
Whenever we use apply over a list object, we get very poor performance in R.
It is a bit of a pain to flatten objects and do manual indexing into a vector, but the speedup is phenomenal.
set.seed(234)
N=100
nobs=10000
z = lapply(1:N, function(x) sample(1:4, size=sample(3:5, 1), replace=TRUE))
is_sub_num <- function(a,b){sum(!(a==b))==1}
is_pair_ok <- function(pair){
  is_sub_num(DT[ID==pair[1],V2][[1]], DT[ID==pair[2],V2][[1]])
}
is_pair_ok1 <- function(pair){
  is_sub_num(zzz[pos_table[pair[1]]:(pos_table[pair[1]] + length_table[pair[1]] - 1)],
             zzz[pos_table[pair[2]]:(pos_table[pair[2]] + length_table[pair[2]] - 1)])
}
pair_list <- as.data.table(cbind(sample(1:N,nobs,replace=TRUE),sample(1:N,nobs,replace=TRUE)))
DT <- as.data.table(1:N)
DT$V2 <- z
setnames(DT, c("ID","V2"))
setkey(DT, ID)
length_table <- sapply(z,length)
myfun <- function(i){sum(length_table[1:i])}
pos_table <- c(0,sapply(1:N,myfun))+1
zzz=unlist(z)
print(system.time(tmp_ref <- apply(pair_list,1,is_pair_ok)))
print(system.time(tmp <- apply(pair_list,1,is_pair_ok1)))
identical(tmp,tmp_ref)
Here is the output:
   user  system elapsed
  20.96    0.00   20.96
   user  system elapsed
   0.70    0.00    0.71
There were 50 or more warnings (use warnings() to see the first 50)
[1] TRUE
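As an aside, the offset table built with myfun and sapply above can be written more directly with cumsum(); a minimal sketch using the same length_table (equivalent result, not a further speedup of the pair comparisons):
pos_table2 <- c(1, cumsum(length_table) + 1)
identical(pos_table, pos_table2)
# should be TRUE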
EDIT
It would be a bit too long to post here, but I tried to draw conclusions from the above and modified the source code of my program, trying to speed it up with unlist and manual indexing.
The new implementation is actually slower, which came as a surprise to me, and I fail to understand why...
This cuts the run time by about 60%:
library(data.table)
set.seed(234)
is_sub_num <- function(a,b) sum(!(a==b))==1
is_pair_ok2 <- function(p1, p2) is_sub_num(DT[p1,V2][[1]],DT[p2,V2][[1]])
DT <- as.data.table(1:100)
DT$V2 <- lapply(1:100, function(x) sample(1:4,size=4,replace=TRUE))
setnames(DT, c("ID","V2"))
setkey(DT, ID)
pair_list <- as.data.table(cbind(sample(1:100,10000,replace=TRUE),sample(1:100,10000,replace=TRUE)))
print(system.time(tmp <- mapply(FUN=is_pair_ok2, pair_list$V1, pair_list$V2)))
The biggest effect comes from setting the key on DT and using fast indexing in is_pair_ok2().
A little more can be gained by dropping the is_sub_num() wrapper:
is_pair_ok3 <- function(p1, p2) sum(DT[p1,V2][[1]]!=DT[p2,V2][[1]])==1
print(system.time(tmp <- mapply(FUN=is_pair_ok3, pair_list$V1, pair_list$V2)))

adding values to the vector inside for loop in R

I have just started learning R and I wrote this code to learn on functions and loops.
squared <- function(x){
  m <- c()
  for(i in 1:x){
    y <- i*i
    c(m,y)
  }
  return(m)
}
squared(5)
NULL
Why does this return NULL? I want the i*i values to be appended to the end of m and a vector to be returned. Can someone please point out what's wrong with this code?
You haven't put anything into m in your loop, since you did not use an assignment, so you are left with the following:
m <- c()
m
# NULL
You can change the function to return the desired values by assigning m in the loop.
squared <- function(x) {
  m <- c()
  for(i in 1:x) {
    y <- i * i
    m <- c(m, y)
  }
  return(m)
}
squared(5)
# [1] 1 4 9 16 25
But this is inefficient because we know the length of the resulting vector will be 5 (or x). So we want to allocate the memory first before looping. This will be the better way to use the for() loop.
squared <- function(x) {
  m <- vector("integer", x)
  for(i in seq_len(x)) {
    m[i] <- i * i
  }
  m
}
squared(5)
# [1] 1 4 9 16 25
Also notice that I have removed return() from the second function; it is not necessary there. Whether to keep it is a matter of personal preference, although sometimes it is needed, for example to return early from inside an if() statement.
I know the question is about looping, but I also must mention that this can be done more efficiently with seven characters using the primitive ^, like this
(1:5)^2
# [1] 1 4 9 16 25
^ is a primitive function, which means the code is written entirely in C and will be the most efficient of these three methods
`^`
# function (e1, e2) .Primitive("^")
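A rough timing comparison of the three approaches (a minimal sketch, assuming the microbenchmark package is installed; squared_grow and squared_prealloc are just illustrative names for the two loop versions above):
library(microbenchmark)
# growing the vector with c() on every iteration
squared_grow <- function(x) { m <- c(); for (i in 1:x) m <- c(m, i * i); m }
# preallocating the result vector
squared_prealloc <- function(x) { m <- vector("integer", x); for (i in seq_len(x)) m[i] <- i * i; m }
microbenchmark(squared_grow(1e4), squared_prealloc(1e4), (1:1e4)^2, times = 10)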
Here's a general approach:
# Create an empty vector
vec <- c()
for(i in 1:10){
  # Inside the loop, make one or more elements to add to the vector
  new_elements <- i * 3
  # Use 'c' to combine the existing vector with the new elements
  vec <- c(vec, new_elements)
}
vec
# [1] 3 6 9 12 15 18 21 24 27 30
If you run into speed or memory problems (e.g. your loop has a lot of iterations or the vectors are large), you can preallocate the vector, which is more efficient. That's not usually necessary unless your vectors are particularly large, though.
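A preallocated version of the same loop, for comparison (a minimal sketch; the result is identical, it just avoids growing vec on every iteration):
vec <- numeric(10)   # preallocate to the known final length
for(i in 1:10){
  vec[i] <- i * 3
}
vec
# [1]  3  6  9 12 15 18 21 24 27 30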

How to use with() function in R instead of apply()

I am trying to optimise code that I have written using apply() and similar functions (e.g. lapply()). Unfortunately I do not see much improvement, so while searching I came across the post "apply() is slow - how to make it faster or what are my alternatives?", where one suggestion is to use with() instead of apply(), which is certainly much faster.
What I want to do is to apply a user defined function to every row of a matrix. This function takes as input the data from the row, makes some calculations and returns a vector with the results.
A toy example where I use the apply() function, the with() and a vectorized version:
#Generate a matrix 10x3
prbl1=matrix(runif(30),nrow=10)
prbl2=data.frame(prbl1)
prbl3=prbl2
#function for the apply()
fn1=function(row){
  x=row[1]
  y=row[2]
  z=row[3]
  k1=2*x+3*y+4*z
  k2=2*x*3*y*4*z
  k3=2*x*y+3*x*z
  return(c(k1,k2,k3))
}
#function for the with()
fn2=function(x,y,z){
  k1=2*x+3*y+4*z
  k2=2*x*3*y*4*z
  k3=2*x*y+3*x*z
  return(c(k1,k2,k3))
}
#Vectorise fn2
fn3=Vectorize(fn2)
#apply the functions:
rslt1=t(apply(prbl1,1,fn1))
rslt2=t(with(prbl2,fn2(X1,X2,X3)))
rslt2=cbind(rslt2[1:10],rslt2[11:20],rslt2[21:30])
rslt3=t(with(prbl3,fn3(X1,X2,X3)))
All three produce the same output, a 10x3 matrix, which is what I want. Nevertheless, notice that for rslt2 I need to bind the results, because the output of with() is a vector of length 30. I suspect this is because the function is not vectorised (if I understood this correctly). For rslt3 I use a vectorised version of fn2, which generates the output in the expected shape.
When I compare the performance of the three, I get:
library(rbenchmark)
benchmark(rslt1=t(apply(prbl1,1,fn1)),
          rslt2=with(prbl2,fn2(X1,X2,X3)),
          rslt3=with(prbl3,fn3(X1,X2,X3)),
          replications=1000000)
   test replications elapsed relative user.self sys.self user.child sys.child
1 rslt1      1000000  103.51    7.129    102.63     0.02         NA        NA
2 rslt2      1000000   14.52    1.000     14.41     0.01         NA        NA
3 rslt3      1000000  123.44    8.501    122.41     0.05         NA        NA
where with() without vectorisation is definitely faster.
My question: Since rslt2 is the most efficient approach, is there a way that I can use this correctly without the need to bind the results afterwards? It does the job but I feel is not efficient coding.
The first and third functions you give are applied one row at a time, so each is called 10 times in your example. The second function takes advantage of the fact that multiplication and addition in R are already vectorised, so no loop or apply-style function is necessary, and the function is only called once. If you want to keep your current code, all you need to do is change the c to cbind in fn2:
fn2=function(x,y,z){
  k1=2*x+3*y+4*z
  k2=2*x*3*y*4*z
  k3=2*x*y+3*x*z
  return(cbind(k1,k2,k3))
}
All that with() does is evaluate the expression it is given within the list, data.frame, or environment supplied. So with(prbl2, fn2(X1,X2,X3)) is entirely equivalent to fn2(prbl2$X1, prbl2$X2, prbl2$X3).
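That equivalence is easy to verify directly (a minimal sketch, using the cbind version of fn2 above):
identical(with(prbl2, fn2(X1, X2, X3)), fn2(prbl2$X1, prbl2$X2, prbl2$X3))
# should be TRUE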
Is this your real function? If it is, then problem solved. If not, then it depends on whether your real function consists entirely of operations and functions that are already vectorised or can be replaced with vectorised equivalents.
For the amended function per the comments:
Single row:
fn1 <- function(row){
  x <- row[1]
  y <- row[2]
  z <- row[3]
  k1 <- 2*x+3*y+4*z
  k2 <- 2*x*3*y*4*z
  k3 <- 2*x*y+3*x*z
  if (k1>0 & k2>0 & k3>0){
    return(cbind(k1,k2,k3))
  } else {
    k1 <- 5*x+3*y+4*z
    k2 <- 5*x*3*y*4*z
    k3 <- 5*x*y+3*x*z
    if (k1<0 || k2<0 || k3<0) {
      return(cbind(0,0,0))
    } else {
      return(cbind(k1,k2,k3))
    }
  }
}
Whole matrix:
fn2 <- function(mat) {
  x <- mat[, 1]
  y <- mat[, 2]
  z <- mat[, 3]
  k1 <- 2*x+3*y+4*z
  k2 <- 2*x*3*y*4*z
  k3 <- 2*x*y+3*x*z
  l1 <- 5*x+3*y+4*z
  l2 <- 5*x*3*y*4*z
  l3 <- 5*x*y+3*x*z
  out <- array(0, dim = dim(mat))
  useK <- k1 > 0 & k2 > 0 & k3 > 0
  useL <- !useK & l1 >= 0 & l2 >= 0 & l3 >= 0
  out[useK, ] <- cbind(k1, k2, k3)[useK, ]
  out[useL, ] <- cbind(l1, l2, l3)[useL, ]
  out
}
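A quick way to check that the whole-matrix version agrees with the row-wise one on the example data (a minimal sketch, reusing prbl1 from the question):
rowwise <- t(apply(prbl1, 1, fn1))   # row-at-a-time
whole <- fn2(prbl1)                  # whole matrix at once
all.equal(unname(rowwise), unname(whole))
# should be TRUE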

Calculate a geometric progression

I'm using brute force right now..
x <- 1.03
Value <- c((1/x)^20,(1/x)^19,(1/x)^18,(1/x)^17,(1/x)^16,(1/x)^15,(1/x)^14,(1/x)^13,(1/x)^12,(1/x)^11,(1/x)^10,(1/x)^9,(1/x)^8,(1/x)^7,(1/x)^6,(1/x)^5,(1/x)^4,(1/x)^3,(1/x)^2,(1/x),1,x,x^2,x^3,x^4,x^5,x^6,x^7,x^8,x^9,x^10,x^11,x^12,x^13,x^14,x^15,x^16,x^17,x^18,x^19,x^20)
Value
but I would like to use an increment loop, just like the for loop in Java:
for (int i = 1; i <= 20; i++)
^ is a vectorized function in R. That means you can simply use x^(-20:20).
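A quick sanity check against the brute-force Value vector from the question (a minimal sketch; all.equal is used because the two computations may differ by tiny floating-point amounts):
x <- 1.03
all.equal(Value, x^(-20:20))
# should be TRUE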
Edit because this gets so many upvotes:
More precisely, both the base parameter and the exponent parameter are vectorized.
You can do this:
x <- 1:3
x^2
#[1] 1 4 9
and this:
2^x
#[1] 2 4 8
and even this:
x^x
#[1] 1 4 27
In the first two examples the length-one parameter gets recycled to match the length of the longer parameter. That's why the following results in a warning:
y <- 1:2
x^y
#[1] 1 4 3
#Warning message:
# In x^y : longer object length is not a multiple of shorter object length
If you try something like that, you probably want what outer can give you:
outer(x, y, "^")
# [,1] [,2]
#[1,] 1 1
#[2,] 2 4
#[3,] 3 9
Roland already addressed the fact that you can do this vectorised, so I will focus on the loop part, for cases where you are doing something more complex that is not vectorised.
A Java (and C, C++, etc.) style loop like you show is really just a while loop. Something that you would like to do as:
for(I=1; I<=20; I++) { ... }
is really just a different way to write:
I = 1   # or better: I <- 1
while (I <= 20) {
  ...
  I <- I + 1
}
So you already have the tools to do that type of loop. However, if you want to assign the results into a vector, matrix, array, list, etc., and each iteration is independent (does not rely on the previous computation), then it is usually easier, clearer, and overall better to use lapply or sapply.
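For the powers example from the question, that style would look something like this (a minimal sketch; the vectorised x^(-20:20) from the other answer is still preferable here):
x <- 1.03
Value <- sapply(-20:20, function(i) x^i)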

How to apply a function to each element of a vector in R

Let's say I want to multiply each even element of a vector by 2 and each odd element of a vector by 3. Here is some code that can do this:
v <- 0:10
idx <- v %% 2 == 0
v[idx] <- v[idx] * 2
v[!idx] <- v[!idx] * 3
This would get difficult if I had more than two cases. It seems like the apply family of functions never deals with vectors so I don't know a better way to do this problem. Maybe using an apply function would work if I made transformations on the data, but it seems like that shouldn't be something that I would need to do to solve this simple problem.
Any ideas?
Edit: Sorry for the confusion. I am not specifically interested in the "%%" operator. I wanted to put some concrete code in my question but, based on the responses, it was too specific. What I wanted to figure out was how to apply some arbitrary function to each element of the vector. This did not seem possible with apply(), and I thought sapply() only worked with lists.
You can do:
v <- v * c(2, 3)[v %% 2 + 1]
It is generalizable to any v %% n, e.g.:
v <- v * c(2, 3, 9, 1)[v %% 4 + 1]
Also it does not require that length(v) be a multiple of n.
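Spelled out for the original v (a minimal sketch): v %% 2 is 0 for even values and 1 for odd values, so v %% 2 + 1 selects position 1 or 2 of the multiplier vector c(2, 3).
v <- 0:10
v %% 2 + 1              # 1 2 1 2 1 2 1 2 1 2 1
c(2, 3)[v %% 2 + 1]     # 2 3 2 3 2 3 2 3 2 3 2
v * c(2, 3)[v %% 2 + 1]
# [1]  0  3  4  9  8 15 12 21 16 27 20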
You can use vector multiplication to do what you want:
tmp <- 1:10
tmp * rep(c(3,2), length(tmp)/2)
This is easy to extend to three or more cases:
tmp * rep(c(3,2,4), length(tmp)/3)
Easiest would be:
v*c(2,3) # as suggested by flodel in a comment.
The term to search for in the documentation is "argument recycling" ... a feature of the R language. It only works for dyadic infix functions (see ?Ops). For non-dyadic vectorized functions that would not error out with some of the arguments, and where you couldn't depend on the structure of v being quite so regular, you could use ifelse:
ifelse( (1:length(v)) %% 2 == 0, func1(v), func2(v) )
This constructs two vectors and then chooses elements from the first or the second based on the truth value of the first argument. If you were trying to answer the question in the title of your post, then you should look at:
?sapply
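For example, an arbitrary per-element function can be applied with sapply (a minimal sketch; slower than the vectorised approaches above, but fully general):
v <- 0:10
sapply(v, function(el) if (el %% 2 == 0) el * 2 else el * 3)
# [1]  0  3  4  9  8 15 12 21 16 27 20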
Here is an answer allowing any set of arbitrary functions to be applied to defined groups within a vector.
# source data
test <- 1:9
# categorisations of source data
cattest <- rep(1:3,each=3)
#[1] 1 1 1 2 2 2 3 3 3
Build the function that applies a different operation depending on the category:
categ <- function(x, catg) {
  mapply(
    function(a, b) {
      switch(b,
        a * 2,
        a * 3,
        a / 2
      )
    },
    x,
    catg
  )
}
# where cattest = 1, multiply by 2
# where cattest = 2, multiply by 3
# where cattest = 3, divide by 2
The result:
categ(test,cattest)
#[1] 2.0 4.0 6.0 12.0 15.0 18.0 3.5 4.0 4.5
