Vectorization of findInterval() - r

I have the following problem with the R function findInterval().
Given a vector X and a matrix Y, I want to find the interval in which each element of X lies, where the breakpoints for X[i] are taken from row i of Y. In other words, for X = c(2,3) and Y = matrix(c(3,1,4,2,5,4),2,3), the output should be c(0,2). I wrote the following code:
X <- c(2,3)
Y <- matrix(c(3,1,4,2,5,4),2,3)
output <- diag(apply(Y,1,function(z)findInterval(X,z)))
and it works. However, I think it can be optimised, since the apply call returns a 2 x 2 matrix (which is why I had to take its diagonal). Is there a way to do the same with a function that takes my vector X and matrix Y as arguments and returns a vector directly? I perform this operation on high-dimensional vectors, so creating unnecessary 10000 x 10000 matrices is not a good idea. To maximize efficiency, I don't want to use loops.
Thanks in advance for any feedback.

You can do
rowSums(X > Y)
# [1] 0 2
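A quick check of this against the findInterval() version from the question may be useful. The sketch below assumes, as findInterval() itself requires, that each row of Y is sorted in non-decreasing order; the names via_diag, via_gt, via_ge and the tie example are illustrative only. Since findInterval() counts the breakpoints that are less than or equal to each value, X >= Y rather than X > Y should reproduce it when a value coincides exactly with a breakpoint.

```r
X <- c(2, 3)
Y <- matrix(c(3, 1, 4, 2, 5, 4), 2, 3)   # rows: (3, 4, 5) and (1, 2, 4)

# Original approach: builds a length(X) x length(X) matrix, keeps the diagonal
via_diag <- diag(apply(Y, 1, function(z) findInterval(X, z)))

# X is recycled down the columns of Y, so X[i] is compared with row i
via_gt <- rowSums(X > Y)
via_ge <- rowSums(X >= Y)

# A tie case (made up): the value 3 equals a breakpoint in its row
X_tie <- c(3)
Y_tie <- matrix(c(1, 3, 4), 1, 3)
tie_fi <- findInterval(X_tie, Y_tie[1, ])   # counts breakpoints <= 3
tie_gt <- rowSums(X_tie > Y_tie)            # counts breakpoints <  3
tie_ge <- rowSums(X_tie >= Y_tie)           # counts breakpoints <= 3
```

On the question's data all three approaches agree; they only differ in the tie case, where the >= form matches findInterval().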

Related

Replace values in a matrix in R subsetting from vectors

I want to replace values in a matrix based on matrix indexes stored in two vectors (one for x, another for y). I did it some time ago but forgot the syntax for subsetting based on vectors.
Let's say I have this matrix and these two vectors:
m <- matrix(0,10,10)
x <- c(1,3,5)
y <- c(2,4,6)
I need to replace m[1,2], m[3,4], and m[5,6] with another value. What would the syntax be in this case? I tried m[x,y], but it doesn't work.
Without sparse matrix support:
If we include z <- c(4.5,5.6,6.7) for the values, then
for (i in seq_along(z)) m[x[i], y[i]] <- z[i]
If you want an apply solution, this is all I could think of:
apply(data.frame(x=x,y=y,z=z),1,function(row) .GlobalEnv$m[row[1],row[2]] <- row[3])
I remembered how it was done. To subset a matrix by index vectors, the syntax is:
m[cbind(x,y)]
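As a minimal sketch of how this matrix-index form can be used for the replacement asked about, the example below reuses m, x, y from the question and the z values from the loop answer; the name idx is only illustrative.

```r
m <- matrix(0, 10, 10)
x <- c(1, 3, 5)          # row indices
y <- c(2, 4, 6)          # column indices
z <- c(4.5, 5.6, 6.7)    # replacement values

idx <- cbind(x, y)       # each row of idx is one (row, column) pair
m[idx] <- z              # assigns m[1,2], m[3,4], m[5,6] in one step

m[idx]                   # reads the same three cells back

# By contrast, m[x, y] selects the whole block of rows x by columns y
dim(m[x, y])
```

This also shows why m[x,y] did not do what was intended: it addresses nine cells rather than three.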

Coding zero values into a data frame

I am working in R with a series of data values that have an x position (distance along a transect) and a z position (distance from the ground for a given x position). There is not a data value at every x, z coordinate; to do the analysis I need to perform, I need to code a 0 wherever a value is missing. Here is a short code example; real data is usually 14,000-20,000 rows. In Matlab we solve this issue by creating an empty matrix and filling it. I need an x, z matrix normalized to max(z). In the sample below, max z is 8 and max x is 4, so I need a 4 x 8 matrix in which 0 is entered wherever no value is present. I am just not sure of the best, most efficient way to do this in R.
x <- c(1,1,1,1,1,2,2,3,3,4,4,4)
z <- c(1,4,5,6,7,1,4,2,8,1,2,5)
value <- c(9,9,9,9,9,9,9,9,9,9,9,9)
data.frame(x,z, value)
Thanks ahead of time!
In R you would do it much the same way as you describe in Matlab. First, create a matrix with all zeroes:
df <- data.frame(x, z, value)
mat <- matrix(0, 4, 8)
And then the tricky part, where you have to build an index that identifies the selected elements:
mat[cbind(df$x, df$z)] <- df$value
What the cbind is essentially doing is creating a 2-column matrix that is used to identify a set of elements in the matrix, and then assigning the corresponding value.
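For data where the extents are not known in advance, the same idea might be written with the matrix sized from the data. This is only a sketch, and it assumes that x and z are positive whole numbers usable directly as row and column indices, and that no (x, z) pair occurs twice.

```r
x <- c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4)
z <- c(1, 4, 5, 6, 7, 1, 4, 2, 8, 1, 2, 5)
value <- rep(9, 12)
df <- data.frame(x, z, value)

# Size the matrix from the data instead of hard-coding 4 and 8
mat <- matrix(0, nrow = max(df$x), ncol = max(df$z))

# Fill only the cells that have a measurement; the rest stay 0
mat[cbind(df$x, df$z)] <- df$value

dim(mat)
sum(mat == 0)   # number of cells without a measurement
```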

Bound the values of a vector to a limit in R

Suppose X is a vector of length 100 holding the X position of 100 individuals. All agents start at position 0
X <- rep(0,100)
but they are embedded in a world with boundaries. I have a function that randomly changes the X position of all the agents at a given time.
Store <- X
X <- X + runif(100)
Eventually, one agent will reach the boundary and, at that point, it should stay within the limits. The simplest way to do this is to loop through the vector and check with if (in pseudo code):
for (i in 1:length(X)) {
  if (between the boundaries) {keep the new X[i]} else {assign X[i] the value in Store[i]}
}
This works for 100 individuals, but the for-loop adds too much computational time if the number of individuals (and the length of the vector) increases, for example, to 1000000.
Is there a more straightforward way to do it? I was thinking that maybe I could skip the reassignment of values that exceed the threshold during:
X <- X + runif(100)
EDIT: Also, imagine that X is not a vector but a matrix.
I realize this question is relatively old, but I just had the same question so I didn't want to leave it unanswered.
Limiting a vector or matrix to values within a certain range can be done compactly by combining an apply statement with the min and max functions, as shown in the example below.
# Create sample vector
X <- c(1:100); print(X)
# Create sample matrix
M <- matrix(c(1:100),nrow=10); print(M)
# Set limits
minV <- 15; maxV <- 85;
# Limit vector
sapply(X, function(y) min(max(y,minV),maxV))
# Limit matrix
apply(M, c(1, 2), function(x) min(max(x,minV),maxV))
For further information on the apply functionality I would refer to the R documentation and this article on R-Bloggers:
https://www.r-bloggers.com/using-apply-sapply-lapply-in-r/
When I first came across apply statements I found it a difficult concept to wrap my head around, but would now consider it one of R's most powerful features.
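The element-wise clamping shown above can also be sketched without apply, using base R's pmin() and pmax(), which operate on whole vectors and matrices at once. The second half is a variant closer to what the question describes (keeping the previous position when the new one leaves the limits); lower, upper and the starting positions there are hypothetical values chosen only for illustration.

```r
minV <- 15; maxV <- 85

X <- 1:100
M <- matrix(1:100, nrow = 10)

# Clamp to [minV, maxV]; pmin/pmax work element-wise and keep matrix dims
X_clamped <- pmin(pmax(X, minV), maxV)
M_clamped <- pmin(pmax(M, minV), maxV)

# Variant: keep the stored value wherever the new one is out of bounds
# (lower, upper and the starting positions are made-up values)
set.seed(1)
lower <- 0; upper <- 5
Store <- rep(4.5, 100)
X_new <- Store + runif(100)
inside <- X_new >= lower & X_new <= upper
X_next <- ifelse(inside, X_new, Store)
```

Both forms avoid an explicit loop, so they should scale to vectors of length 1000000 far better than sapply with a per-element function.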

crossproduct matrix unexpectedly full of NAs

So I took some information from a CSV, stored it as a matrix, and tried to compute the following operations on the result, but it gave me a 2x2 array of NA. Not seeing the problem here.
data <- read.csv('qog.csv', sep=';')
X <- matrix( log( data$wdi_gnipc ) )
X <- cbind(X, data$ciri_empinx_new)
t(X) %*% X
When I look at X and t(X) they look like how I expect them to, so I am matrix-multiplying a 2xn matrix with an nx2 matrix (n is some large number like 193) and so the matrix product should be well-defined and give a meaningful 2x2 answer.
Any ideas what could be going wrong?
Note: When I try
a <- rbind(c(1,2), c(3,4))
t(a) %*% a
it gives the desired result. Not sure what the important difference is between that and what I'm doing with the data.
Let's make that an answer. For the cross product to be filled with NA, you must have at least one NA per column inside X. You can find the number of NAs per column by running:
colSums(is.na(X))
and in all likelihood you will find that
all(colSums(is.na(X)) > 0)
# [1] TRUE
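If the NAs are the cause, one possible way forward is to drop the incomplete rows before forming the product. The sketch below uses a small made-up matrix in place of the CSV data, and whether discarding rows is appropriate depends on the analysis; X_ok and XtX are illustrative names.

```r
# Small stand-in for the matrix built from the CSV (values are made up)
X <- cbind(log(c(10, 20, NA, 40)), c(1, NA, 3, 4))

colSums(is.na(X))     # at least one NA in each column
t(X) %*% X            # every entry is therefore NA

# Keep only rows with no missing values, then form the cross product
X_ok <- X[complete.cases(X), , drop = FALSE]
XtX <- crossprod(X_ok)    # equivalent to t(X_ok) %*% X_ok
```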

lapply function with 2 count variables

I am very green in R, so there is probably a very easy solution to this:
I want to calculate the average correlation between the column vectors in a square matrix:
x<-matrix(rnorm(10000),ncol=100)
aux<-matrix(seq(1,10000))
loop<-sapply(aux,function(i,j) cov(x[,i],x[,j]))
cor_x<-mean(loop)
When evaluating the sapply line I get the error 'subscript out of bounds'. I know I can do this via a script but is there any way to achieve this in one line of code?
No need for any loops. Just use mean(cov(x)), which does this very efficiently.
The problem is due to aux. The variable aux has to range from 1 to 100, since you have 100 columns. But your aux is a sequence over all elements of x and hence ranges from 1 to 10000. It will work with the following code:
aux <- seq(1, 100)
loop <- sapply(aux, function(i, j) cov(x[, i], x[, j]))
Afterwards, you can calculate mean covariance with:
cor_x <- mean(loop)
If you want to exclude duplicate fields (e.g., cov(X,Y) is inherently identical to cov(Y,X)), you can use:
cor_x <- mean(loop[upper.tri(loop, diag = TRUE)])
If you also want to exclude cov(X,X), i.e., variance, you can use:
cor_x <- mean(loop[upper.tri(loop)])
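The same three averages can be sketched directly from the covariance matrix returned by cov(x), without sapply. The names C, R and the mean_* variables are illustrative. Since the question mentions correlation while the code uses cov(), the last lines show the analogous computation with cor(), in case that is what is actually wanted.

```r
set.seed(42)
x <- matrix(rnorm(10000), ncol = 100)

C <- cov(x)    # 100 x 100 covariance matrix of the columns of x

mean_all   <- mean(C)                              # all entries
mean_upper <- mean(C[upper.tri(C, diag = TRUE)])   # each pair once, plus variances
mean_off   <- mean(C[upper.tri(C)])                # each distinct pair once

# Analogous average of pairwise correlations
R <- cor(x)
mean_cor_off <- mean(R[upper.tri(R)])
```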
