Find vector overlap from the start - r

I am looking for an efficient way to get the first k elements that are the same between two vectors in R.
For example:
orderedIntersect(c(1,2,3,4), c(1,2,5,4))
# [1] 1 2
orderedIntersect(c(1,2,3), c(1,2,3,4))
# [1] 1 2 3
This is the same as the intersect behavior, but any values after the first mismatch should be dropped.
I also want this to work for strings.
So far, the solution that I have is this:
orderedIntersect <- function(a, b) {
  a <- as.vector(a)
  NAs <- is.na(match(a, as.vector(b)))
  last <- ifelse(any(NAs), min(which(NAs)) - 1, length(a))
  a[1:last]
}
I am troubled by the fact that I have to iterate over n input elements 6 times: match, is.na, any, which, min, and the subset [].
Clearly, it would be faster to write an external C function (with a for loop and a break), but I am wondering if there is any clever R trick I can use here.

You can compare the values of your vectors and drop elements when the first FALSE is reached:
orderedIntersect <- function(a, b) {
  # check the lengths are equal and if not, "cut" the vectors so they are (to avoid warnings)
  l_a <- length(a); l_b <- length(b)
  if (l_a != l_b) { m_l <- min(l_a, l_b); a <- a[1:m_l]; b <- b[1:m_l] }
  # compare the elements: they are equal if both are not NA and have the same value, or if both are NA
  comp <- (!is.na(a) & !is.na(b) & a == b) | (is.na(a) & is.na(b))
  # return the right vector: nothing if the first elements do not match,
  # everything if all elements match, or just the leading part that matches
  if (!comp[1]) return(c()) else if (all(comp)) return(a) else return(a[1:(which(!comp)[1] - 1)])
}
orderedIntersect(c(1,2,3,4), c(1,2,5,4))
#[1] 1 2
orderedIntersect(c(1,2,3), c(1,2,3,4))
#[1] 1 2 3
orderedIntersect(c(1,2,3), c(2,3,4))
#NULL
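For what it's worth, a cumulative-product trick can express the same "stop at the first mismatch" idea in a single vectorized comparison (a sketch, not from the answer above; note it treats any NA as a mismatch rather than matching NA against NA, and it also works for character vectors):
orderedIntersect2 <- function(a, b) {
  m <- seq_len(min(length(a), length(b)))
  # cumprod() stays 1 only while every comparison so far has been TRUE
  keep <- cumprod(!is.na(a[m]) & !is.na(b[m]) & a[m] == b[m]) == 1
  a[m][keep]
}
orderedIntersect2(c(1, 2, 3, 4), c(1, 2, 5, 4))
# [1] 1 2
orderedIntersect2(c("a", "b", "c"), c("a", "b", "x"))
# [1] "a" "b"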

The simple C solution (for integers) isn't really any longer than the R version, but it would be a little more work to extend to all the other classes.
library(inline)
orderedIntersect <- cfunction(
  signature(x = 'integer', y = 'integer'),
  body = '
    int i, l = length(x) > length(y) ? length(y) : length(x),
        *xx = INTEGER(x), *yy = INTEGER(y);
    SEXP res;
    for (i = 0; i < l; i++) if (xx[i] != yy[i]) break;
    PROTECT(res = allocVector(INTSXP, i));
    for (l = 0; l < i; l++) INTEGER(res)[l] = xx[l];
    UNPROTECT(1);
    return res;'
)
## Tests
a <- c(1L,2L,3L,4L)
b <- c(1L,2L,5L,4L)
c <- c(1L,2L,8L,9L,9L,9L,9L,3L)
d <- c(9L,0L,0L,8L)
orderedIntersect(a,b)
# [1] 1 2
orderedIntersect(a,c)
# [1] 1 2
orderedIntersect(a,d)
# integer(0)
orderedIntersect(a, integer())
# integer(0)

This might work:
#test data
a <- c(1,2,3,4)
b <- c(1,2,5,4)
c <- c(1,2,8,9,9,9,9,3)
d <- c(9,0,0,8)
empty <- c()
string1 <- c("abc", "def", "ad","k")
string2 <- c("abc", "def", "c", "lds")
#function
orderedIntersect <- function(a, b) {
  l <- min(length(a), length(b))
  if (l == 0) return(numeric(0))
  a1 <- a[1:l]
  comp <- a1 != b[1:l]
  if (all(!comp)) return(a1)
  a1[0:(min(which(comp)) - 1)]
}
#testing
orderedIntersect(a,b)
# [1] 1 2
orderedIntersect(a,c)
# [1] 1 2
orderedIntersect(a,d)
# numeric(0)
orderedIntersect(a, empty)
# numeric(0)
orderedIntersect(string1,string2)
# [1] "abc" "def"

Related

Limit 2 bounds of a vector in IF condition

I want to limit the 2 bounds of a vector in an IF condition. However, I get the warning "the condition has length > 1 and only the first element will be used" when I try to run the following code:
rho <- c(0.8,0,-0.5)
sigma.S <- 0.4
sigma.M <- 0.1
mu.S <- 0.06
T <- 1
N <- 365
dt <- T/N
m <- c(100,102,100,99,101)
z <- rnorm(N)
P <- matrix(0, N, 1)
P[1] <- m[1]
for (i in 2:N) {
  P[i] <- P[i-1]*( 1 + sigma.M*sqrt(dt)*z[i] )
}
tPts <- c(0,91,182,273,364)
yPts <- c(m[1]-P[1],m[2]-P[92],m[3]-P[183],m[4]-P[274],m[5]-P[365])
a <- tPts[1]
A <- yPts[1]
for (i in 2:5) {
  t <- seq(0, 364, 1)
  b <- tPts[i]
  B <- yPts[i]
  if (a <= t & t <= b) {
    y <- ( B*(t-a) - A*(t-b) )/(b-a)
    return(y)
  }
  a <- b
  A <- B
}
Can anyone see what the problem is here? Thanks in advance!
We could change the if condition inside the loop by wrapping the comparisons with all():
if(all(a<=t) & all(t<=b))
assuming that we need the condition to hold along the whole length of 't'.
a <= t or t <= b returns a logical vector of the same length as 't', since 't' is created as a sequence from 0 to 364; even when one of the operands has length 1 ('a' or 'b'), the comparison operator recycles that element to carry out the comparison across the longer vector:
5 < (1:6)
#[1] FALSE FALSE FALSE FALSE FALSE TRUE
and if/else requires its condition to be of length 1.
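If the intent of the original loop was to evaluate y only at the time points falling inside [a, b], another option (a sketch based on my reading of the code, not something the poster stated) is to drop the if() entirely and use logical indexing:
idx <- a <= t & t <= b                             # one TRUE/FALSE per element of t
y <- numeric(length(t))
y[idx] <- (B * (t[idx] - a) - A * (t[idx] - b)) / (b - a)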

Matching patterns in a matrix

My data looks like this:
S
0101001010000000000000000100111100000000000011101100010101010
1001010000000001100000000100000000000100000010101110101010010
1101010101010010000000000100000000100101010010110101010101011
0000000000000000001000000111000110000000000000000000000000000
The S marks the column I am talking about: column 26. All four rows share a 1 at that position.
I would need to be able to count for each row from 2 to 4:
How many columns to the left and right are the same as row 1?
For row 2 it would be 3 to the right (as it reaches 1/0) and 8 to the left (as it reaches 0/1).
The result for every row should be entered into a matrix like this:
row2 8 3
row3 11 9
Is there a fast and efficient way to do that? The matrix I am dealing with is very large.
If you need something fast, you could use Rcpp:
mat <- as.matrix(read.fwf(textConnection("0101001010000000000000000100111100000000000011101100010101010
1001010000000001100000000100000000000100000010101110101010010
1101010101010010000000000100000000100101010010110101010101011
0000000000000000001000000111000110000000000000000000000000000"), widths = rep(1, 61)))
library(Rcpp)
cppFunction('
IntegerMatrix countLR(const LogicalMatrix& mat, const int S) {
  const int nr(mat.nrow()), nc(mat.ncol());
  IntegerMatrix res(nr - 1, 2);
  for (int i = 1; i < nr; i++) {
    for (int j = S - 2; j >= 0; j--) {
      if (mat(0, j) != mat(i, j)) break;
      else res(i - 1, 0)++;
    }
    for (int j = S; j < nc; j++) {
      if (mat(0, j) != mat(i, j)) break;
      else res(i - 1, 1)++;
    }
  }
  return(res);
}')
countLR(mat, 26)
# [,1] [,2]
#[1,] 8 2
#[2,] 10 2
#[3,] 6 0
I assumed that column 26 itself doesn't count for the result. I also assumed that the matrix can only contain 0/1 (i.e., boolean) values. Adjust as needed.
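As a small aside (not part of the original answer): read.fwf() returns integer columns, so if you want the 0/1 assumption to be explicit before calling the function, you can coerce the matrix to logical first, so that Rcpp receives exactly the LogicalMatrix the signature declares:
storage.mode(mat) <- "logical"   # 0 -> FALSE, 1 -> TRUE
countLR(mat, 26)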
It's pretty easy with strsplit and rle to pull apart and assemble this data:
> S <- scan(what="") #input of character mode
1: 0101001010000000000000000100111100000000000011101100010101010
2: 1001010000000001100000000100000000000100000010101110101010010
3: 1101010101010010000000000100000000100101010010110101010101011
4: 0000000000000000001000000111000110000000000000000000000000000
5:
s2 <- strsplit(S, split="")
sapply(s2, "[[", 26) # verify the 26th position is all ones
#[1] "1" "1" "1" "1"
# length of strings from the 26th position to the right
rtlen <- length(s2[[1]])-(26-1)
# Pick from the `rle` $values where values TRUE
rle( tail( s2[[1]] == s2[[2]], rtlen) )
Run Length Encoding
lengths: int [1:11] 3 4 5 1 7 1 4 1 1 6 ...
values : logi [1:11] TRUE FALSE TRUE FALSE TRUE FALSE ...
Now that you have an algorithm for a single instance, you can iterate over the rest of the items in s2. To do the backwards look, I just did the same operation on a rev()-ersed section of the strings.
m <- matrix(NA, 3, 2)
for (i in 2:4) {
  m[i-1, 2] <- rle(tail(s2[[1]] == s2[[i]], rtlen))$lengths[1]
  m[i-1, 1] <- rle(rev(head(s2[[1]] == s2[[i]], 26)))$lengths[1]
}
m
[,1] [,2]
[1,] 9 3 # I think you counted wrong
[2,] 11 3
[3,] 7 1
Notice that I was comparing each row to the first row, while your results suggest you were doing something else... perhaps comparing to the row above. That could easily be done instead with only a very small change to the indices that choose the comparison vector:
m <- matrix(NA, 3, 2)
for (i in 2:4) {
  m[i-1, 2] <- rle(tail(s2[[i-1]] == s2[[i]], rtlen))$lengths[1]
  m[i-1, 1] <- rle(rev(head(s2[[i-1]] == s2[[i]], 26)))$lengths[1]
}
m
[,1] [,2]
[1,] 9 3
[2,] 9 9 #Again I think you may have miscounted. Easy to do, eh?
[3,] 7 1
This problem intrigued me. Since the matrix is binary, it's far more efficient to pack the matrix into a raw matrix than it is to use sparse matrices. It means that the storage for a 1,000 x 21,000,000 pattern matrix is approx. 2.4 GiB (print(object.size(raw(1000 * 21000000 / 8)), units = "GB")).
The following should be a relatively efficient way to tackle the problem. The Rcpp code takes a raw matrix which indicates the differences between the first row of the original matrix and the other rows. For efficiency in the R code, it's actually arranged with the patterns in columns rather than rows. The other functions help to convert existing sparse or regular matrices into packed ones and to read a matrix directly from a file.
library("Rcpp")
library("Matrix")
writeLines("0101001010000000000000000100111100000000000011101100010101010
1001010000000001100000000100000000000100000010101110101010010
1101010101010010000000000100000000100101010010110101010101011
0000000000000000001000000111000110000000000000000000000000000", "example.txt")
cppFunction('
IntegerMatrix countLRPacked(IntegerMatrix mat, long S) {
  long l = S - 2;
  long r = S;
  long i, cl, cr;
  int nr(mat.nrow()), nc(mat.ncol());
  IntegerMatrix res(nc, 2);
  for (int i = 0; i < nc; i++) {
    // First the left side
    // Work out which byte is the first to have a 1 in it
    long j = l >> 3;
    int x = mat(j, i) & ((1 << ((l & 7) + 1)) - 1);
    long cl = l & 7;
    while (j > 0 && !x) {
      j--;
      x = mat(j, i);
      cl += 8;
    }
    // Then work out where the 1 is in the byte
    while (x >>= 1) --cl;
    // Now the right side
    j = r >> 3;
    x = mat(j, i) & ~((1 << ((r & 7))) - 1);
    cr = 8 - (r & 7);
    while (j < (nr-1) && !x) {
      j++;
      x = mat(j, i);
      cr += 8;
    }
    cr--;
    while (x = (x << 1) & 0xff) --cr;
    res(i, 0) = cl;
    res(i, 1) = cr;
  }
  return(res);
}')
# Reads a binary matrix from file or character vector
# Borrows the first bit of code from read.table
readBinaryMatrix <- function(file = NULL, text = NULL) {
  if (missing(file) && !missing(text)) {
    file <- textConnection(text)
    on.exit(close(file))
  }
  if (is.character(file)) {
    file <- file(file, "rt")
    on.exit(close(file))
  }
  if (!inherits(file, "connection"))
    stop("'file' must be a character string or connection")
  if (!isOpen(file, "rt")) {
    open(file, "rt")
    on.exit(close(file))
  }
  lst <- list()
  i <- 1
  while (length(line <- readLines(file, n = 1)) > 0) {
    lst[[i]] <- packRow(as.integer(strsplit(line, "", fixed = TRUE)[[1]]))
    i <- i + 1
  }
  do.call("cbind", lst)
}
# Converts a binary integer vector into a packed raw vector,
# padding out at the end to make the input length a multiple of 8
packRow <- function(row) {
  packBits(as.raw(c(row, rep(0, (8 - length(row)) %% 8))))
}
# Converts a binary integer matrix to a packed raw matrix
# Note the matrix is transposed (makes the subsequent xor more efficient)
packMatrix <- function(mat) {
  stopifnot(class(mat) %in% c("matrix", "dgCMatrix"))
  apply(mat, 1, packRow)
}
# Takes either a packed raw matrix or a binary integer matrix, uses xor to compare all the first row
# with the others and then hands it over to the Rcpp code for processing
countLR <- function(mat, S) {
  stopifnot(class(mat) %in% c("matrix", "dgCMatrix"))
  if (storage.mode(mat) != "raw") {
    mat <- packMatrix(mat)
  }
  stopifnot(8 * nrow(mat) > S)
  y <- xor(mat[, -1, drop = FALSE], mat[, 1, drop = TRUE])
  countLRPacked(y, S)
}
sMat <- Matrix(as.matrix(read.fwf("example.txt", widths = rep(1, 61))))
pMat <- readBinaryMatrix("example.txt")
countLR(sMat, 26)
countLR(pMat, 26)
You should note that the width of the pattern matrix is right-padded to a multiple of 8, so if the patterns match all the way to the right hand side this will result in the right hand count being possibly a bit high. This could be corrected if need be.
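A small illustration of that padding, using the packRow() helper above (a sketch; the printed byte assumes packBits() stores the first element in the least significant bit):
length(packRow(rep(0, 61)))   # 8 raw bytes, i.e. 64 bits for a 61-column pattern
packRow(c(1, 0, 1))           # 3 pattern bits plus 5 padding zeros -> one byte
# [1] 05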
Slow R version to do this (moved from duplicate):
countLR <- function(mat, S) {
  mat2 <- mat[1, ] != t(mat[-1, , drop = FALSE])
  l <- apply(mat2[(S - 1):1, ], 2, function(x) which(x)[1] - 1)
  l[is.na(l)] <- S - 1
  r <- apply(mat2[(S + 1):nrow(mat2), ], 2, function(x) which(x)[1] - 1)
  r[is.na(r)] <- ncol(mat) - S   # index r (not l) when there is no mismatch to the right
  cbind(l, r)
}

Matching numbers by their order when in two different vectors

The title does not really do this question justice, but I could not think of any other way to phrase the question. I can best explain the problem with an example.
Let's say we have two vectors of numbers (each of which are always going to be ascending and unique):
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
What I am trying to do is create a function that will take these two vectors and match them in the following way:
1) Start with the first element of vector1 (in this case, 1)
2) Go to vector2 and match the element from #1 with the first element in vector 2 that is bigger than it (in this case, 5)
3) Go back to vector1 and skip all elements less than the value in #2 we found (in this case, we skip 3, and grab 10)
4) Go back to vector2 and skip all elements less than the value in #3 we found (in this case, we skip 9 and grab 15)
5) repeat until we are done with all elements.
The resulting two vectors we should have are:
result1 = c(1,10,24,30)
result2 = c(5,15,28,35)
My current solution goes something like this, but I believe it might be highly inefficient:
# establishes where we start from the vector2 numbers
# just in case we have vector1 <- c(5,8,10)
# and vector2 <- c(1,2,3,4,6,7). We would want to skip the 1,2,3,4 values
i <- 1
while (vector2[i] < vector1[1]) {
  i <- i + 1
}
# starts the result1 vector with the first value from the vector1
result1 <- vector1[1]
# starts the result2 vector empty and will add as we loop through
result2 <- c()
# super complicated and probably hugely inefficient loop within a loop within a loop
# i really want to avoid doing this, but I cannot think of any other way to accomplish this
for (j in 1:length(vector1)) {
  while (vector1[j] > vector2[i] && (i+1) <= length(vector2)) {
    result1 <- c(result1, vector1[j])
    result2 <- c(result2, vector2[i])
    while (vector1[j] > vector2[i+1] && (i+2) <= length(vector2)) {
      i <- i + 1
    }
    i <- i + 1
  }
}
## have to add on the last vector2 value cause while loop skips it
## if it doesn't exist (there are no more vector2 values bigger) we put in an NA
if (result1[length(result1)] < vector2[i]) {
  result2 <- c(result2, vector2[i])
} else {
  ### we ran out of vector2 values that are bigger
  result2 <- c(result2, NA)
}
This is really difficult to explain. Just call it magic :)
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
## another case
# vector2 <- c(0,9,15,19,21,23,28,35)
## handling the case where vector2 min value(s) are < vector1 min value
if (any(idx <- which(min(vector1) >= vector2)))
  vector2 <- vector2[-idx]
## interleave the two vectors
tmp <- c(vector1,vector2)[order(c(order(vector1), order(vector2)))]
## if we sort the vectors, which pairwise elements are from the same vector
r <- rle(sort(tmp) %in% vector1)$lengths
## I want to "remove" all the pairwise elements which are from the same vector
## so I again interleave two vectors:
## the first will be all TRUEs because I want the first instance of each *new* vector
## the second will be all FALSEs identifying the elements I want to throw out because
## there is a sequence of elements from the same vector
l <- rep(1, length(r))
ord <- c(l, r - 1)[order(c(order(r), order(l)))]
## create some dummy TRUE/FALSE to identify the ones I want
res <- sort(tmp)[unlist(Map(rep, c(TRUE, FALSE), ord))]
setNames(split(res, res %in% vector2), c('result1', 'result2'))
# $result1
# [1] 1 10 24 30
#
# $result2
# [1] 5 15 28 35
Obviously this will only work if both vectors are ascending and unique, which you said they are.
EDIT:
works with duplicates:
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
vector2 <- c(0,9,15,19,21,23,28,35)
vector2 <- c(1,3,3,5,7,9,28,35)
f <- function(v1, v2) {
  if (any(idx <- which(min(vector1) >= vector2)))
    vector2 <- vector2[-idx]
  vector1 <- paste0(vector1, '.0')
  vector2 <- paste0(vector2, '.00')
  n <- function(x) as.numeric(x)
  tmp <- c(vector1, vector2)[order(n(c(vector1, vector2)))]
  m <- tmp[1]
  idx <- c(TRUE, sapply(1:(length(tmp) - 1), function(x) {
    if (n(tmp[x + 1]) > n(m)) {
      if (gsub('^.*\\.', '', tmp[x + 1]) == gsub('^.*\\.', '', m))
        FALSE
      else {
        m <<- tmp[x + 1]
        TRUE
      }
    } else FALSE
  }))
  setNames(split(n(tmp[idx]), grepl('\\.00$', tmp[idx])), c('result1', 'result2'))
}
f(vector1, vector2)
# $result1
# [1] 1 10 30
#
# $result2
# [1] 3 28 35
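For comparison, here is a plain two-pointer loop written directly from the rules in the question (a sketch of my own, not taken from the answer above); it assumes both vectors are ascending, as stated:
orderedMatch <- function(v1, v2) {
  res1 <- numeric(0); res2 <- numeric(0)
  i <- 1; j <- 1
  while (i <= length(v1) && j <= length(v2)) {
    # find the first v2 element strictly greater than the current v1 element
    while (j <= length(v2) && v2[j] <= v1[i]) j <- j + 1
    if (j > length(v2)) break
    res1 <- c(res1, v1[i]); res2 <- c(res2, v2[j])
    # skip v1 elements not larger than the v2 element just taken
    while (i <= length(v1) && v1[i] <= v2[j]) i <- i + 1
  }
  list(result1 = res1, result2 = res2)
}
orderedMatch(c(1,3,10,11,24,26,30,31), c(5,9,15,19,21,23,28,35))
# $result1: 1 10 24 30    $result2: 5 15 28 35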

List comprehension in R

Is there a way to implement list comprehension in R?
Like python:
sum([x for x in range(1000) if x % 3== 0 or x % 5== 0])
same in Haskell:
sum [x| x<-[1..1000-1], x`mod` 3 ==0 || x `mod` 5 ==0 ]
What's the practical way to apply this in R?
Nick
Something like this?
l <- 1:1000
sum(l[l %% 3 == 0 | l %% 5 == 0])
Yes, list comprehension is possible in R:
sum((1:1000)[(1:1000 %% 3) == 0 | (1:1000 %% 5) == 0])
And (kind of) the for-comprehension of Scala:
for(i in {x <- 1:100;x[x%%2 == 0]})print(i)
This is many years later but there are three list comprehension packages now on CRAN. Each has slightly different syntax. In alphabetical order:
library(comprehenr)
sum(to_vec(for(x in 1:1000) if (x %% 3 == 0 | x %% 5 == 0) x))
## [1] 234168
library(eList)
Sum(for(x in 1:1000) if (x %% 3 == 0 | x %% 5 == 0) x else 0)
## [1] 234168
library(listcompr)
sum(gen.vector(x, x = 1:1000, x %% 3 == 0 | x %% 5 == 0))
## [1] 234168
In addition the following is on github only.
# devtools::install_github("mailund/lc")
library(lc)
sum(unlist(lc(x, x = seq(1000), x %% 3 == 0 | x %% 5 == 0)))
## [1] 234168
The foreach package by Revolution Analytics gives us a handy interface to list comprehensions in R. https://www.r-bloggers.com/list-comprehensions-in-r/
Example
Return the pairs of numbers from the two lists that are not equal, as tuples:
Python
list_a = [1, 2, 3]
list_b = [2, 7]
different_num = [(a, b) for a in list_a for b in list_b if a != b]
print(different_num)
# Output:
[(1, 2), (1, 7), (2, 7), (3, 2), (3, 7)]
R
require(foreach)
list_a = c(1, 2, 3)
list_b = c(2, 7)
different_num <- foreach(a=list_a ,.combine = c ) %:% foreach(b=list_b) %:% when(a!=b) %do% c(a,b)
print(different_num)
# Output:
[[1]]
[1] 1 2
[[2]]
[1] 1 7
[[3]]
[1] 2 7
[[4]]
[1] 3 2
[[5]]
[1] 3 7
EDIT:
The foreach package is very slow for certain tasks.
A faster list comprehension implementation is given at List comprehensions for R
. <<- structure(NA, class="comprehension")
comprehend <- function(expr, vars, seqs, guard, comprehension=list()){
  if (length(vars) == 0) { # base case of recursion
    if (eval(guard)) comprehension[[length(comprehension)+1]] <- eval(expr)
  } else {
    for (elt in eval(seqs[[1]])) {
      assign(vars[1], elt, inherits=TRUE)
      comprehension <- comprehend(expr, vars[-1], seqs[-1], guard,
                                  comprehension)
    }
  }
  comprehension
}
## List comprehensions specified by close approximation to set-builder notation:
##
## { x+y | 0<x<9, 0<y<x, x*y<30 } ---> .[ x+y ~ {x<-0:9; y<-0:x} | x*y<30 ]
##
"[.comprehension" <- function(x, f,rectangularizing=T){
f <- substitute(f)
## First, we pluck out the optional guard, if it is present:
if(is.call(f) && is.call(f[[3]]) && f[[3]][[1]]=='|'){
guard <- f[[3]][[3]]
f[[3]] <- f[[3]][[2]]
} else {
guard <- TRUE
}
## To allow omission of braces around a lone comprehension generator,
## as in 'expr ~ var <- seq' we make allowances for two shapes of f:
##
## (1) (`<-` (`~` expr
## var)
## seq)
## and
##
## (2) (`~` expr
## (`{` (`<-` var1 seq1)
## (`<-` var2 seq2)
## ...
## (`<-` varN <- seqN)))
##
## In the former case, we set gens <- list(var <- seq), unifying the
## treatment of both shapes under the latter, more general one.
syntax.error <- "Comprehension expects 'expr ~ {x1 <- seq1; ... ; xN <- seqN}'."
if(!is.call(f) || (f[[1]]!='<-' && f[[1]]!='~'))
stop(syntax.error)
if(is(f,'<-')){ # (1)
lhs <- f[[2]]
if(!is.call(lhs) || lhs[[1]] != '~')
stop(syntax.error)
expr <- lhs[[2]]
var <- as.character(lhs[[3]])
seq <- f[[3]]
gens <- list(call('<-', var, seq))
} else { # (2)
expr <- f[[2]]
gens <- as.list(f[[3]])[-1]
if(any(lapply(gens, class) != '<-'))
stop(syntax.error)
}
## Fill list comprehension .LC
vars <- as.character(lapply(gens, function(g) g[[2]]))
seqs <- lapply(gens, function(g) g[[3]])
.LC <- comprehend(expr, vars, seqs, guard)
## Provided the result is rectangular, convert it to a vector or array
if(!rectangularizing) return(.LC)
tryCatch({
if(!length(.LC))
return(.LC)
dim1 <- dim(.LC[[1]])
if(is.null(dim1)){
lengths <- sapply(.LC, length)
if(all(lengths == lengths[1])){ # rectangular
.LC <- unlist(.LC)
if(lengths[1] > 1) # matrix
dim(.LC) <- c(lengths[1], length(lengths))
} else { # ragged
# leave .LC as a list
}
} else { # elements of .LC have dimension
dim <- c(dim1, length(.LC))
.LC <- unlist(.LC)
dim(.LC) <- dim
}
return(.LC)
}, error = function(err) {
return(.LC)
})
}
This implementation is faster than foreach; it allows nested comprehensions, multiple parameters, and parameter scoping.
N <- list(10,20)
.[.[c(x,y,z)~{x <- 2:n;y <- x:n;z <- y:n} | {x^2+y^2==z^2 & z<15}]~{n <- N}]
[[1]]
[[1]][[1]]
[1] 3 4 5
[[1]][[2]]
[1] 6 8 10
[[2]]
[[2]][[1]]
[1] 3 4 5
[[2]][[2]]
[1] 5 12 13
[[2]][[3]]
[1] 6 8 10
Another way
sum((l <- 1:1000)[l %% 3 == 0 | l %% 5 == 0])
I hope it's okay to self-promote my package listcompr which implements a list comprehension syntax for R.
The example from the question can be solved in the following way:
library(listcompr)
sum(gen.vector(x, x = 1:1000, x %% 3 == 0 || x %% 5 == 0))
## Returns: 234168
As listcompr does a row-wise (and not a vector-wise) evaluation of the conditions, it makes no difference whether || or | is used as the logical operator. It accepts arbitrarily many arguments: first, a base expression which is transformed into the list or vector entries; next, arbitrarily many arguments which specify the variable ranges and the conditions.
More examples can be found on the readme page on the github repository of listcompr: https://github.com/patrickroocks/listcompr
For a strict mapping from Python to R, this might be the most direct equivalence:
Python:
sum([x for x in range(1000) if x % 3== 0 or x % 5== 0])
R:
sum((x <- 0:999)[x %% 3 == 0 | x %% 5 == 0])
One important difference: the R version works like Python 2 where the x variable is globally scoped outside of the expression. (I call it an "expression" here since R does not have the notion of "list comprehension".) In Python 3, the iterator is restricted to the local scope of the list comprehension. In other words:
In R (as in Python 2), the x variable persists after the expression. If it existed before the expression, then its value is changed to the final value of the expression.
In Python 3, the x variable exists only within the list comprehension. If there was an x variable created before the list comprehension, the list comprehension does not change it at all.
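A quick way to see the R behaviour described above (a small sketch, not part of the original answer):
x <- "before"
sum((x <- 0:999)[x %% 3 == 0 | x %% 5 == 0])
# [1] 233168
head(x)
# [1] 0 1 2 3 4 5    # x has been overwritten by the expression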
This list comprehension of the form:
[item for item in list if test]
is pretty straightforward with boolean indexing in R. But for more complex expressions, like implementing vector rescaling (I know this can be done with the scales package too), in Python it's easy:
x = [1, 3, 5, 7, 9, 11] # -> [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
[(xi - min(x))/(max(x) - min(x)) for xi in x]
But in R this is the best I could come up with. Would love to know if there's something better:
sapply(x, function(xi, mn, mx) {(xi-mn)/(mx-mn)}, mn = min(x), mx = max(x))
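For the record, plain R arithmetic is already vectorized, so the comprehension-style sapply() above isn't strictly needed for this particular rescaling (a sketch of the direct form):
x <- c(1, 3, 5, 7, 9, 11)
(x - min(x)) / (max(x) - min(x))
# [1] 0.0 0.2 0.4 0.6 0.8 1.0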
You could convert a sequence of random numbers to a binary sequence as follows:
x=runif(1000)
y=NULL
for (i in x) { if (i > .5) { y <- c(y, 1) } else { y <- c(y, -1) } }
This could be generalized to map any list to another list in the spirit of:
x = [item for item in x if test == True]
where failing the test simply means the else branch does not append to the list y.
For the problem at hand:
x <- 0:999
y <- NULL
for (i in x){ if (i %% 3 == 0 | i %% 5 == 0){ y <- c(y, i) }}
sum( y )
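The same result can be had without growing y inside the loop, echoing the vectorized answers earlier in this thread (a sketch):
x <- 0:999
sum(x[x %% 3 == 0 | x %% 5 == 0])
# [1] 233168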

How could I make this R snippet faster and more R-ish?

Coming from various other languages, I find R powerful and intuitive, but I am not thrilled with its performance. So I decided to try to improve some snippet I wrote and learn how to code better in R.
Here's a function I wrote, trying to determine if a vector is binary-valued (two distinct values or just one value) or not:
isBinaryVector <- function(v) {
  if (length(v) == 0) {
    return (c(0, 1))
  }
  a <- v[1]
  b <- a
  lapply(v, function(x) { if (x != a && x != b) { if (a != b) { return (c()) } else { b = x } } })
  if (a < b) {
    return (c(a, b))
  } else {
    return (c(b, a))
  }
}
EDIT: This function is expected to look through a vector and return c() if it is not binary-valued, or c(a, b) if it is, with a being the smaller value and b the larger one (if a == b then just c(a, a)). E.g., for
A B C
1 1 1 0
2 2 2 0
3 3 1 0
I will lapply this isBinaryVector and get:
$A
[1] 1 1
$B
[1] 1 1
$C
[1] 0 0
The time it took on a moderate-sized dataset (about 1800 * 3500, 2/3 of them binary-valued) is about 15 seconds. The set contains only floating-point numbers.
Is there any way I could do this faster?
Thanks for any inputs!
You are essentially trying to write a function that returns TRUE if a vector has exactly two unique values, and FALSE otherwise.
Try this:
> dat <- data.frame(
+ A = 1:3,
+ B = c(1, 2, 1),
+ C = 0
+ )
>
> sapply(dat, function(x)length(unique(x))==2)
A B C
FALSE TRUE FALSE
Next, you want to get the min and max value. The function range does this. So:
> sapply(dat, range)
A B C
[1,] 1 1 0
[2,] 3 2 0
And there you have all the ingredients to make a small function that is easy to understand and should be extremely quick, even on large amounts of data:
isBinary <- function(x) length(unique(x)) == 2
binaryValues <- function(x) {
  if (isBinary(x)) range(x) else NA
}
sapply(dat, binaryValues)
$A
[1] NA
$B
[1] 1 2
$C
[1] NA
This function returns true or false for vectors (or columns of a data frame):
is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L
}
Also take a look at this post
I'd use something like that to get column indices:
bivalued <- apply(my.data.frame, 2, is.binary)
nominal <- my.data.frame[,!bivalued]
binary <- my.data.frame[,bivalued]
Sample data:
my.data.frame <- data.frame(c(0,1), rnorm(100), c(5, 19), letters[1:5], c('a', 'b'))
> apply(my.data.frame, 2, is.binary)
c.0..1. rnorm.100. c.5..19. letters.1.5. c..a....b..
TRUE FALSE TRUE FALSE TRUE
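One caveat worth adding (an aside, not from the answer above): apply() first coerces the data frame to a matrix, which in this mixed example turns every column into character. Iterating over the columns directly avoids that coercion:
sapply(my.data.frame, is.binary)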
