How to vectorize indexing? - r

I average coordinates stored in a data frame as follows:
sapply(coords[N:M,],mean) # mean of coordinates N to M
I need the average of several sets of coordinates, so I made this loop, which finds the mean of coordinates 1-4, 5-11 and 20-30.
N <- c(1, 5,20)
M <- c(4,11,30)
for ( i in 1:length(N) ) {
sapply(coords[N(i):M(i),],mean)
}
How can I vectorize that loop? I've tried to pass a matrix to coords (coords[NM,]), but that doesn't give me what I want.

You may replace your sapply(x, mean) by colMeans(x) in the sake of simplicity and efficiency.
Perhaps by a vector thinking you prefer to convert several variables (N and M) to a single vector - here array - when possible and simple.
N <- data.frame(from=c(1,5,20), to=c(4,11,30))
apply(N, 1, function(x) colMeans(coords[x[1]:x[2],]))

Related

How to do a cumulative sum inverse square weighting based upon how many items you want to calculate off a vector in R?

Suppose I have a vector of length 10
vec <- c(10,9,8,7,6,5,4,3,2,1)
and I wanted to create a function that takes in a subset length value (say 3) and computes the squared inverse up to that length. I would like to compute:
10+(9/(2^2))+(8/(3^2))
which would be
vec[1]+(vec[2]/(2^2))+(vec[3]/(3^2))
but with a function that can take input of the subset length.
The only solution I can think of is a for loop, is there a faster more elegant solution in R?
Yes, you can use the fact that most operations in R are vectorised to do this without a loop:
vec <- c(10,9,8,7,6,5,4,3,2,1)
cum_inverse_square <- function(vec, n) {
sum(vec[1:n] / (1:n)^2)
}
cum_inverse_square(vec, 3) == 10+(9/(2^2))+(8/(3^2)) # TRUE

Coding in R - how to do rolling window without a for-loop? For-loop too slow

I understand for-loops are slow in R, and the suite of apply() functions are designed to be used instead (in many cases).
However, I can't figure out how to use those functions in my situation, and advice would be greatly appreciated.
I have a list/vector of values (let's say length=10,000) and at every point, starting at the 21st value, I need to take the standard deviation of the trailing 20 values. So at 21st, I take SD of 1st-21st . At 22nd value, I take SD(2:22) and so on.
So you see I have a rolling window where I need to take the SD() of the previous 20 indices. Is there any way to accomplish this faster, without a for-loop?
I found a solution to my question.
The zoo package has a function called "rollapply" which does exactly that: uses apply() on a rolling window basis.
library(microbenchmark)
library(ggplot2)
# dummy vector
c <- 50
x <- sample(1:100, c, replace=T)
# parameter
y <- 20 # length of each vector
z <- length(x) - y # final starting index
# benchmark
xx <-
microbenchmark(lapply = {a <- lapply( 1:z, \(i) sd(x[i:(i+y)]) )}
, loop = {
b <- vector("list", z)
for (i in 1:z)
{
b[[i]] <- sd(x[i:(i+y)])
}
}
, times = 30
)
# plot
autoplot(xx) +
ggtitle(paste('vector of size', c))
It would appear while lapply has the speed advantage of a smaller vector, a loop should be used with longer vectors.
I would maintain, however, loops are not slow per se as long as they are not applied incorrectly (iterating over rows).

Calculating difference between points in vector

I'm trying to calculate the difference between all points in a vector of length 10605 in R. For example, I am trying to do this:
for (i in 1:10605){
for (j in 1:10605){
differences[i] = housedata$Mean_household_income[i] - housedata$Mean_household_income[j]
}
}
It is taking so long to compute, and I'm thinking there's a more timely way to calculate the difference between all the points with each other in this vector. Does anyone have any suggestions?
Thanks!
Seems like the dist function should do that. Distance matrices are only lower triangular because distance(x,y) == distance(y,x):
my.distances <- dist(housedata$Mean_household_income,
housedata$Mean_household_income)
It's going to be faster since it's done in C code. Just type:
dist
You could loop through an incrementally shifted/wrapped copy of the vector and subtract the two vectors. You still have to loop through the length of the data once and shift and subtract the vector each time, but it will probably save some time.
Here is an example:
# make a shift/wrap function
shift <- function(df,offset){
df[((1:length(df))-1-offset)%%length(df)+1]
}
# make some data
data <- seq(1,4)
# make an empty vector to hold the data
difs = vector()
# loop through the data
for(i in 1:length(data)){
shifted <- shift(data,i)
result <- data - shifted
difs <- c(difs, result)
}
print(difs)
What about using outer? It uses a vectorized function (here -) on all combinations of two vectors and stores the results in a matrix.
For example,
x <- runif(10605)
system.time(
differences <- outer(x, x, '-')
)
takes one second on my computer.

Bound the values of a vector to a limit in R

Suppose X is vector of length 100 with X position for 100 individuals. All agents start with position 0
X <- rep(0,100)
but they are embedded in a word with boundaries. I have a function that randomly changes the X position of all the agents at a given time.
Store <- X
X <- X + runif(100)
Eventually, one agent will reach the boundary and, at that point, it stay within the limits. The most simple way to do it using a looping through the vector and checking with if (in pseudo code):
for (i in 1:length(X)) {
if (between the boundaries) {keep the new X[i]} else {assign X[i] the value in Store[i]}
}
This is useful for 100 individual, but the for-loop adds too much computational time if the number of individual (and the length of the vector) increases, for example, to 1000000.
Is there a more straightforward way to do it? I was thinking that maybe I could skip specific re assignation of values that exceed the threshold during:
X <- X + runif(100)
EDIT: Also, imagine that X is not a vector but a matrix.
I realize this question is relatively old, but I just had the same question so I didn't want to leave it unanswered.
Limiting a vector or matrix to values within a certain range, can be done in a comprehensive way by combining an apply statement with min and max functions, as shown in the example below.
# Create sample vector
X <- c(1:100); print(X)
# Create sample matrix
M <- matrix(c(1:100),nrow=10); print(M)
# Set limits
minV <- 15; maxV <- 85;
# Limit vector
sapply(X, function(y) min(max(y,minV),maxV))
# Limit matrix
apply(M, c(1, 2), function(x) min(max(x,minV),maxV))
For further information on the apply functionality I would refer to the R documentation and this article on R-Bloggers:
https://www.r-bloggers.com/using-apply-sapply-lapply-in-r/
When I first came across apply statements I found it a difficult concept to wrap my head around, but would now consider it one of R's most powerful features.

R: apply() type function for two 2-d arrays

I'm trying to find an apply() type function that can run a function that operates on two arrays instead of one.
Sort of like:
apply(X1 = doy_stack, X2 = snow_stack, MARGIN = 2, FUN = r_part(a, b))
The data is a stack of band arrays from Landsat tiles that are stacked together using rbind. Each row contains the data from a single tile, and in the end, I need to apply a function on each column (pixel) of data in this stack. One such stack contains whether each pixel has snow on it or not, and the other stack contains the day of year for that row. I want to run a classifier (rpart) on each pixel and have it identify the snow free day of year for each pixel.
What I'm doing now is pretty silly: mapply(paste, doy, snow_free) concatenates the day of year and the snow status together for each pixel as a string, apply(strstack, 2, FUN) runs the classifer on each pixel, and inside the apply function, I'm exploding each string using strsplit. As you might imagine, this is pretty inefficient, especially on 1 million pixels x 300 tiles.
Thanks!
I wouldn't try to get too fancy. A for loop might be all you need.
out <- numeric(n)
for(i in 1:n) {
out[i] <- snow_free(doy_stack[,i], snow_stack[,i])
}
Or, if you don't want to do the bookkeeping yourself,
sapply(1:n, function(i) snow_free(doy_stack[,i], snow_stack[,i]))
I've just encountered the same problem and, if I clearly understood the question, I may have solved it using mapply.
We'll use two 10x10 matrices populated with uniform random values.
set.seed(1)
X <- matrix(runif(100), 10, 10)
set.seed(2)
Y <- matrix(runif(100), 10, 10)
Next, determine how operations between the matrices will be performed. If it is row-wise, you need to transpose X and Y then cast to data.frame. This is because a data.frame is a list with columns as list elements. mapply() assumes that you are passing a list. In this example I'll perform correlation row-wise.
res.row <- mapply(function(x, y){cor(x, y)}, as.data.frame(t(X)), as.data.frame(t(Y)))
res.row[1]
V1
0.36788
should be the same as
cor(X[1,], Y[1,])
[1] 0.36788
For column-wise operations exclude the t():
res.col <- mapply(function(x, y){cor(x, y)}, as.data.frame(X), as.data.frame(Y))
This obviously assumes that X and Y have dimensions consistent with the operation of interest (i.e. they don't have to be exactly the same dimensions). For instance, one could require a statistical test row-wise but having differing numbers of columns in each matrix.
Wouldn't it be more natural to implement this as a raster stack? With the raster package you can use entire rasters in functions (eg ras3 <- ras1^2 + ras2), as well as extract a single cell value from XY coordinates, or many cell values using a block or polygon mask.
apply can work on higher dimensions (i.e. list elements). Not sure how your data is set up, but something like this might be what you are looking for:
apply(list(doy_stack, snow_stack), c(1,2), function(x) r_part(x[1], x[2]))

Resources