Matching numbers by their order when in two different vectors - r

The title does not really do this question justice, but I could not think of any other way to phrase the question. I can best explain the problem with an example.
Let's say we have two vectors of numbers (each of which is always ascending and unique):
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
What I am trying to do is create a function that will take these two vectors and match them in the following way:
1) Start with the first element of vector1 (in this case, 1)
2) Go to vector2 and match the element from #1 with the first element in vector2 that is bigger than it (in this case, 5)
3) Go back to vector1 and skip all elements less than the value we found in #2 (in this case, we skip 3 and grab 10)
4) Go back to vector2 and skip all elements less than the value we found in #3 (in this case, we skip 9 and grab 15)
5) Repeat until we are done with all elements.
The resulting two vectors we should have are:
result1 = c(1,10,24,30)
result2 = c(5,15,28,35)
My current solution goes something like this, but I believe it might be highly inefficient:
# establishes where we start from the vector2 numbers
# just in case we have vector1 <- c(5,8,10)
# and vector2 <- c(1,2,3,4,6,7). We would want to skip the 1,2,3,4 values
i <- 1
while(vector2[i]<vector1[1]){
i <- i+1
}
# starts the result1 vector with the first value from the vector1
result1 <- vector1[1]
# starts the result2 vector empty and will add as we loop through
result2 <- c()
# super complicated and probably hugely inefficient loop within a loop within a loop
# i really want to avoid doing this, but I cannot think of any other way to accomplish this
for(j in 1:length(vector1)){
while(vector1[j] > vector2[i] && (i+1) <= length(vector2)){
result1 <- c(result1,vector1[j])
result2 <- c(result2,vector2[i])
while(vector1[j] > vector2[i+1] && (i+2) <= length(vector2)){
i <- i+1
}
i <- i+1
}
}
## have to add on the last vector2 value cause while loop skips it
## if it doesn't exist (there are no more vector2 values bigger) we put in an NA
if(result1[length(result1)] < vector2[i]){
result2 <- c(result2,vector2[i])
}
else{
### we ran out of vector2 values that are bigger
result2 <- c(result2,NA)
}

This is really difficult to explain. Just call it magic :)
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
## another case
# vector2 <- c(0,9,15,19,21,23,28,35)
## handling the case where vector2 min value(s) are < vector1 min value
if (any(idx <- which(min(vector1) >= vector2)))
vector2 <- vector2[-idx]
## interleave the two vectors
tmp <- c(vector1,vector2)[order(c(order(vector1), order(vector2)))]
## if we sort the vectors, which pairwise elements are from the same vector
r <- rle(sort(tmp) %in% vector1)$lengths
## I want to "remove" all the pairwise elements which are from the same vector
## so I again interleave two vectors:
## the first will be all TRUEs because I want the first instance of each *new* vector
## the second will be all FALSEs identifying the elements I want to throw out because
## there is a sequence of elements from the same vector
l <- rep(1, length(r))
ord <- c(l, r - 1)[order(c(order(r), order(l)))]
## create some dummy TRUE/FALSE to identify the ones I want
res <- sort(tmp)[unlist(Map(rep, c(TRUE, FALSE), ord))]
setNames(split(res, res %in% vector2), c('result1', 'result2'))
# $result1
# [1] 1 10 24 30
#
# $result2
# [1] 5 15 28 35
Obviously, this will only work if both vectors are ascending and unique, which you said they are.
EDIT:
works with duplicates:
vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
vector2 <- c(0,9,15,19,21,23,28,35)
vector2 <- c(1,3,3,5,7,9,28,35)
f <- function(v1, v2) {
## use the function arguments v1 and v2 rather than the global vectors
if (any(idx <- which(min(v1) >= v2)))
v2 <- v2[-idx]
v1 <- paste0(v1, '.0')
v2 <- paste0(v2, '.00')
n <- function(x) as.numeric(x)
tmp <- c(v1, v2)[order(n(c(v1, v2)))]
m <- tmp[1]
idx <- c(TRUE, sapply(1:(length(tmp) - 1), function(x) {
if (n(tmp[x + 1]) > n(m)) {
if (gsub('^.*\\.','', tmp[x + 1]) == gsub('^.*\\.','', m))
FALSE
else {
m <<- tmp[x + 1]
TRUE
}
} else FALSE
}))
setNames(split(n(tmp[idx]), grepl('\\.00$', tmp[idx])), c('result1','result2'))
}
f(vector1, vector2)
# $result1
# [1] 1 10 30
#
# $result2
# [1] 3 28 35
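For comparison, here is a minimal sketch of the alternating scan described in the question, written as a single pass over both vectors. It assumes both inputs are sorted ascending. The name alternate_match is hypothetical, and unlike the question's code it does not pad result2 with NA when vector2 runs out of larger values.

```r
alternate_match <- function(v1, v2) {
  src   <- list(v1, v2)
  out   <- list(numeric(length(v1)), numeric(length(v2)))  # preallocated
  pos   <- c(1L, 1L)   # next unread position in each vector
  n_out <- c(0L, 0L)   # how many values have been kept from each vector
  k     <- 1L          # which vector is being read (1 or 2)
  last  <- -Inf        # most recently kept value
  repeat {
    v <- src[[k]]
    p <- pos[k]
    # skip everything that is not strictly bigger than the last kept value
    while (p <= length(v) && v[p] <= last) p <- p + 1L
    if (p > length(v)) break
    last <- v[p]
    n_out[k] <- n_out[k] + 1L
    out[[k]][n_out[k]] <- last
    pos[k] <- p + 1L
    k <- 3L - k        # switch to the other vector
  }
  list(result1 = out[[1]][seq_len(n_out[1])],
       result2 = out[[2]][seq_len(n_out[2])])
}

vector1 <- c(1,3,10,11,24,26,30,31)
vector2 <- c(5,9,15,19,21,23,28,35)
alternate_match(vector1, vector2)
# $result1
# [1]  1 10 24 30
#
# $result2
# [1]  5 15 28 35
```

Each element of either vector is visited at most once, so the work grows linearly with length(vector1) + length(vector2).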

How to modify non-zero elements of a large sparse matrix based on a second sparse matrix in R

I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses too much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
The syntax mat[rn, cn] (whether mat is a dense or sparse matrix) selects all rows in rn and all columns in cn, so you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is select only (rn[1], cn[1]), (rn[2], cn[2]), and so on. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
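The same distinction applies to assignment, which is where the cost comes from in the question. A small sketch on the dense matrix A from above (the copies B and C are introduced here only for illustration):

```r
A <- matrix(1:9, 3, 3)
rn <- 1:2
cn <- 2:3

B <- A
B[rn, cn] <- 0L          # overwrites the whole length(rn) x length(cn) block

C <- A
C[cbind(rn, cn)] <- 0L   # overwrites only the cells (1, 2) and (2, 3)

sum(B == 0)
# [1] 4
sum(C == 0)
# [1] 2
```

With millions of row and column indices, the block form touches length(rn) * length(cn) cells instead of length(rn) cells.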
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame giving the (i, j, x) triplets, where both i and j are indices starting from 1. You could have worked with this nice output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
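To see how the summary() output relates to the slots the question accessed directly, here is a small sketch on a 3 x 3 matrix with three made-up nonzero entries. It assumes the default compressed column storage that sparseMatrix() returns.

```r
library(Matrix)

# a tiny example with nonzeros at (1,1), (3,1) and (2,3)
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(0.5, 0.01, 0.2))

trip <- summary(m)   # data frame with columns i, j, x

# the i slot stores 0-based row indices; summary() reports them 1-based
all(trip$i == m@i + 1L)
# [1] TRUE

# the values come out in the same order as the x slot
all(trip$x == m@x)
# [1] TRUE
```

This is why the question needed the "+ 1" correction for the rows taken from the slot, while the summary() columns can be used as indices as they are.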
Is there an even better solution? Yes! Note that the nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2@x according to m1@x.
ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0
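Working on the value slots directly is only safe when both matrices really store their nonzeros in the same order. A minimal sketch of a check for that, assuming both are column-compressed matrices as built by sparseMatrix(); the helper name same_pattern and the tiny matrices are made up for illustration:

```r
library(Matrix)

# hypothetical helper: TRUE when a and b share dimensions and sparsity pattern
same_pattern <- function(a, b) {
  identical(dim(a), dim(b)) &&
    identical(a@i, b@i) &&   # row indices of the stored entries
    identical(a@p, b@p)      # column pointers
}

i <- c(1, 3, 2)
j <- c(1, 1, 3)
s1 <- sparseMatrix(i = i, j = j, x = c(0.5, 0.01, 0.2))
s2 <- sparseMatrix(i = i, j = j, x = c(3, 2, 1))

stopifnot(same_pattern(s1, s2))

ind <- s1@x > 0 & s1@x < 0.05
s2@x[ind] <- 1
s1@x[ind] <- 0
```

Note that the zeros written into the value slot stay stored as explicit entries; drop0() can remove them afterwards if that matters.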
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1@x > 0 & m1@x < 0.05
m2b@x[ind] <- 1
m1b@x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.

If...else within a for loop

I am writing a function to perform bit inversion for each row of a binary matrix which depends on a predefined n value. The n value will determine the number of 1 bits for each row of the matrix.
set.seed(123)
## generate a random 5 by 10 binary matrix
init <- t(replicate(5, {i <- sample(3:6, 1); sample(c(rep(1, i), rep(0, 10 - i)))}))
n <- 3
## init_1 is used to explain my problem (single-row matrix)
init_1 <- t(replicate(1, {i <- sample(3:6, 1); sample(c(rep(1, i), rep(0, 10 - i)))}))
The bit_inversion function does the following:
If the selected row has fewer 1's than n, it randomly selects as many 0 positions as the difference and inverts them (0 to 1).
Else, if the selected row has more 1's than n, it randomly selects as many 1 positions as the difference and inverts them (1 to 0).
Else, it does nothing (the row already has exactly n 1's).
Below is the function I implemented:
bit_inversion<- function(pop){
for(i in 1:nrow(pop)){
difference <- abs(sum(pop[i,]) - n)
## checking condition where there are more bits being turned on than n
if(sum(pop[i,]) > n){
## determine position of 1's
bit_position_1 <- sample(which(pop[i,]==1), difference)
## bit inversion
for(j in 1:length(bit_position_1)){
pop[bit_position_1[j]] <- abs(pop[i,][bit_position_1[j]] - 1)
}
}
else if (sum(pop[i,]) < n){
## determine position of 0's
bit_position_0 <- sample(which(pop[i,]==0), difference)
## bit inversion
for(j in 1:length(bit_position_0)){
pop[bit_position_0[j]] <- abs(pop[bit_position_0[j]] - 1)
}
}
}
return(pop)
}
Outcome:
call <- bit_inversion(init)
> rowSums(call) ## supposed to be all 3
[1] 3 4 5 4 3
But when using init_1 (a single row matrix), the function seems to work fine.
Outcome:
call_1 <- bit_inversion(init_1)
> rowSums(call_1)
[1] 3
Is there a mistake in my for and if...else loop?
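Change this line in the 'j' for loop
pop[bit_position_1[j]] <- abs(pop[i,][bit_position_1[j]] - 1)
into
pop[i,bit_position_1[j]] <- abs(pop[i,][bit_position_1[j]] - 1)
You forgot the row index. The line in the bit_position_0 branch needs the same row index on both sides of the assignment.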
And, here is a more compact version of your for loop:
for(i in 1:nrow(pop)){
difference <- abs(sum(pop[i,]) - n)
logi <- sum(pop[i,]) > n
pop[i,sample(which(pop[i,]==logi), difference)] <- !logi
}
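One thing to watch with any version that calls sample() on the result of which(): when only a single position qualifies, sample(x, k) treats the single number x as the range 1:x. Below is a sketch of the whole function that avoids this by sampling positions within the candidate vector. Passing n as an argument instead of reading it from the global environment is a change made here for illustration.

```r
bit_inversion <- function(pop, n) {
  for (i in seq_len(nrow(pop))) {
    ones <- sum(pop[i, ])
    difference <- abs(ones - n)
    if (difference == 0) next          # row already has exactly n ones
    from <- as.numeric(ones > n)       # flip 1s if too many, 0s if too few
    candidates <- which(pop[i, ] == from)
    # sample positions inside 'candidates' so a single candidate
    # is never expanded to 1:candidate by sample()
    flip <- candidates[sample.int(length(candidates), difference)]
    pop[i, flip] <- 1 - from
  }
  pop
}

set.seed(123)
init <- t(replicate(5, {i <- sample(3:6, 1); sample(c(rep(1, i), rep(0, 10 - i)))}))
all(rowSums(bit_inversion(init, 3)) == 3)
# [1] TRUE
```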

How to find a string in a vector in r?

I have created a function that essentially creates a vector of 1000 binary values. I have been able to count the longest streak of consecutive 1s by using rle.
I was wondering how to find a specific vector (say c(1,0,0,1)) in this larger vector? I would want it to return the number of occurrences of that vector. So c(1,0,0,1,1,0,0,1) should return 2, while c(1,0,0,0,1) should return 0.
Most solutions that I have found just find whether a sequence occurs at all and return TRUE or FALSE, or they give results for the individual values, not the specific vector that is specified.
Here's my code so far:
# creates a function where 1000 people each choose either up or down.
updown <- function(){
n = 1000
X = rep(0,n)
Y = rbinom(n, 1, 1 / 2)
X[Y == 1] = "up"
X[Y == 0] = "down"
#calculate the length of the longest streak of ups:
Y1 <- rle(Y)
streaks <- Y1$lengths[Y1$values == c(1)]
max(streaks, na.rm=TRUE)
}
# repeat this process n times to find the average outcome.
longeststring <- replicate(1000, updown())
# longeststring is a numeric vector, not a function, so average it directly
mean(longeststring)
This will also work:
library(stringr)
x <- c(1,0,0,1)
y <- c(1,0,0,1,1,0,0,1)
length(unlist(str_match_all(paste(y, collapse=''), '1001')))
[1] 2
y <- c(1,0,0,0,1)
length(unlist(str_match_all(paste(y, collapse=''), '1001')))
[1] 0
If you want to match overlapped patterns,
y <- c(1,0,0,1,0,0,1) # overlapped
length(unlist(gregexpr("(?=1001)",paste(y, collapse=''),perl=TRUE)))
[1] 2
Since Y is only 0s and 1s, we can paste it into a string and use regex, specifically gregexpr. Simplified a bit:
set.seed(47) # for reproducibility
Y <- rbinom(1000, 1, 1 / 2)
count_pattern <- function(pattern, x){
sum(gregexpr(paste(pattern, collapse = ''),
paste(x, collapse = ''))[[1]] > 0)
}
count_pattern(c(1, 0, 0, 1), Y)
## [1] 59
paste reduces the pattern and Y down to strings, e.g. "1001" for the pattern here, and a 1000-character string for Y. gregexpr searches for all occurrences of the pattern in Y and returns the indices of the matches (together with a little more information so they can be extracted, if one wanted). Because gregexpr returns -1 for no match, testing for numbers greater than 0 lets us simply sum the TRUE values to get the number of matches; in this case, 59.
The other sample cases mentioned:
count_pattern(c(1,0,0,1), c(1,0,0,1,1,0,0,1))
## [1] 2
count_pattern(c(1,0,0,1), c(1,0,0,0,1))
## [1] 0
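If the values are not single characters, or overlapping occurrences should be counted, a version without regular expressions may be easier to reason about. Here is a minimal sketch that compares every window of the right length against the pattern; the name count_windows is hypothetical, and it assumes a non-empty pattern and no NA values.

```r
count_windows <- function(pattern, x) {
  k <- length(pattern)
  if (length(x) < k) return(0L)
  starts <- seq_len(length(x) - k + 1L)   # every possible start position
  hits <- rep(TRUE, length(starts))
  for (off in seq_len(k)) {
    # keep only the starts whose off-th element matches the pattern
    hits <- hits & (x[starts + off - 1L] == pattern[off])
  }
  sum(hits)
}

count_windows(c(1,0,0,1), c(1,0,0,1,1,0,0,1))
## [1] 2
count_windows(c(1,0,0,1), c(1,0,0,0,1))
## [1] 0
count_windows(c(1,0,0,1), c(1,0,0,1,0,0,1))  # overlapping occurrences
## [1] 2
```

Unlike a plain gregexpr call, this counts overlapping occurrences, so the two approaches can disagree on inputs such as the last one.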

Find vector overlap from the start

I am looking for an efficient way to get the first k elements that are the same between two vectors in R.
For example:
orderedIntersect(c(1,2,3,4), c(1,2,5,4))
# [1] 1 2
orderedIntersect(c(1,2,3), c(1,2,3,4))
# [1] 1 2 3
This is the same as the intersect behavior, but any values after the first mismatch should be dropped.
I also want this to work for strings.
So far, the solution that I have is this:
orderedIntersect <- function(a,b) {
a <- as.vector(a)
NAs <- is.na(match(a, as.vector(b)))
last <- ifelse(any(NAs), min(which(NAs)) - 1, length(a))
a[1:last]
}
I am troubled by the fact that I have to iterate over n input elements 6 times: match, is.na, any, which, min, and the subset [].
Clearly, it would be faster to write an external C function (with a for loop and a break), but I am wondering if there is any clever R trick I can use here.
You can compare the values of your vectors and drop elements when the first FALSE is reached:
orderedIntersect <- function(a,b) {
# check the lengths are equal and if not, "cut" the vectors so they are (to avoid warnings)
l_a <- length(a) ; l_b <- length(b)
if(l_a != l_b) {m_l <- min(l_a, l_b) ; a <- a[1:m_l] ; b <- b[1:m_l]}
# compare the elements : they are equal if both are not NA and have the same value or if both are NA
comp <- (!is.na(a) & !is.na(b) & a==b) | (is.na(a) & is.na(b))
# return the right vector : nothing if the first elements do not match, everything if all elements match or just the part that match
if(!comp[1]) return(c()) else if (all(comp)) return(a) else return(a[1:(which(!comp)[1]-1)])
}
orderedIntersect(c(1,2,3,4), c(1,2,5,4))
#[1] 1 2
orderedIntersect(c(1,2,3), c(1,2,3,4))
#[1] 1 2 3
orderedIntersect(c(1,2,3), c(2,3,4))
#NULL
The simple C solution (for integers) isn't really any longer than the R version, but it would be a little more work to extend to all the other classes.
library(inline)
orderedIntersect <- cfunction(
signature(x='integer', y='integer'),
body='
int i, l = length(x) > length(y) ? length(y) : length(x),
*xx = INTEGER(x), *yy = INTEGER(y);
SEXP res;
for (i = 0; i < l; i++) if (xx[i] != yy[i]) break;
PROTECT(res = allocVector(INTSXP, i));
for (l = 0; l < i; l++) INTEGER(res)[l] = xx[l];
UNPROTECT(1);
return res;'
)
## Tests
a <- c(1L,2L,3L,4L)
b <- c(1L,2L,5L,4L)
c <- c(1L,2L,8L,9L,9L,9L,9L,3L)
d <- c(9L,0L,0L,8L)
orderedIntersect(a,b)
# [1] 1 2
orderedIntersect(a,c)
# [1] 1 2
orderedIntersect(a,d)
# integer(0)
orderedIntersect(a, integer())
# integer(0)
This might work:
#test data
a <- c(1,2,3,4)
b <- c(1,2,5,4)
c <- c(1,2,8,9,9,9,9,3)
d <- c(9,0,0,8)
empty <- c()
string1 <- c("abc", "def", "ad","k")
string2 <- c("abc", "def", "c", "lds")
#function
orderedIntersect <- function(a, b) {
l <- min(length(a), length(b))
if (l == 0) return(numeric(0))
a1 <- a[1:l]
comp <- a1 != b[1:l]
if (all(!comp)) return(a1)
a1[ 0:(min(which(comp)) - 1) ]
}
#testing
orderedIntersect(a,b)
# [1] 1 2
orderedIntersect(a,c)
# [1] 1 2
orderedIntersect(a,d)
# numeric(0)
orderedIntersect(a, empty)
# numeric(0)
orderedIntersect(string1,string2)
# [1] "abc" "def"
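A slightly more compact variant of the same idea, as a sketch: match(FALSE, ...) returns the position of the first mismatch, and seq_len() avoids the 1:0 trap, where a[1:0] returns the first element instead of nothing. The name common_prefix is hypothetical, and treating NA comparisons as mismatches is an assumption made here.

```r
common_prefix <- function(a, b) {
  l <- min(length(a), length(b))
  same <- a[seq_len(l)] == b[seq_len(l)]
  same[is.na(same)] <- FALSE              # assumption: NA counts as a mismatch
  first_miss <- match(FALSE, same, nomatch = l + 1L)
  a[seq_len(first_miss - 1L)]
}

common_prefix(c(1,2,3,4), c(1,2,5,4))
# [1] 1 2
common_prefix(c(1,2,3), c(1,2,3,4))
# [1] 1 2 3
common_prefix(c(1,2,3,4), c(9,0,0,8))
# numeric(0)
common_prefix(c("abc", "def", "ad", "k"), c("abc", "def", "c", "lds"))
# [1] "abc" "def"
```

It still compares all elements before looking for the first mismatch, so it does not replace the early exit of the C version.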

The number of data points in matrix and vector forms

Suppose that X contains 1000 rows and m columns, where m equals 3, as follows:
set.seed(5)
X <- cbind(rnorm(1000,0,0.5), rnorm(1000,0,0.5), rnorm(1000,0,0.5))
Variable selection is performed, and then a condition is checked before performing the next operation, as follows:
if(nrow(X) < 1000){print(a+b)}
Here a is 5 and b is 15, so if nrow(X) < 1000 is TRUE, 20 is printed.
However, X may end up as a vector because only one column is selected.
How can I check the number of data points when X can be either a matrix or a vector?
What I can think of is:
if(is.matrix(X)){
n <- nrow(X)
} else {
n <- length(X)}
if(n < 1000){print(a+b)}
Does anyone have a better idea?
Thank you.
You can use NROW for both cases. From ?NROW
nrow and ncol return the number of rows or columns present in x. NCOL and NROW do the same treating a vector as 1-column matrix.
So that means that even if the subset is dropped down to a vector, as long as x is an array, vector, or data frame NROW will treat it as a one-column matrix.
sub1 <- X[,2:3]
is.matrix(sub1)
# [1] TRUE
NROW(sub1)
# [1] 1000
sub2 <- X[,1]
is.matrix(sub2)
# [1] FALSE
NROW(sub2)
# [1] 1000
So if(NROW(X) < 1000L) a + b should work regardless of whether X is a matrix or a vector. I use <= below, since X has exactly 1000 rows in your example.
a <- 5; b <- 15
if(NROW(sub1) <= 1000L) a + b
# [1] 20
if(NROW(sub2) <= 1000L) a + b
# [1] 20
A second option would be to use drop=FALSE when you make the variable selection. This will make the subset remain a matrix when the subset is only one column. This way you can use nrow with no worry. An example of this is
X[, 1, drop = FALSE]
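To make the difference concrete, here is a short sketch using the X from the question. The names one_col_vec and one_col_mat are introduced only for this illustration.

```r
set.seed(5)
X <- cbind(rnorm(1000, 0, 0.5), rnorm(1000, 0, 0.5), rnorm(1000, 0, 0.5))

one_col_vec <- X[, 1]                 # dimensions dropped: a plain vector
one_col_mat <- X[, 1, drop = FALSE]   # still a 1000 x 1 matrix

is.null(nrow(one_col_vec))   # nrow() of a plain vector is NULL
# [1] TRUE
NROW(one_col_vec)
# [1] 1000
nrow(one_col_mat)
# [1] 1000
```

Because nrow() returns NULL for a plain vector, a test such as if(nrow(X) < 1000) stops with an error in that case, which is what NROW or drop = FALSE avoids.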
