Calculating difference between points in vector - r

I'm trying to calculate the difference between all points in a vector of length 10605 in R. For example, I am trying to do this:
for (i in 1:10605) {
  for (j in 1:10605) {
    differences[i] = housedata$Mean_household_income[i] - housedata$Mean_household_income[j]
  }
}
It is taking very long to compute, and I'm thinking there must be a faster way to calculate the differences between all pairs of points in this vector. Does anyone have any suggestions?
Thanks!

Seems like the dist function should do that. Distance matrices are only lower triangular because distance(x,y) == distance(y,x):
my.distances <- dist(housedata$Mean_household_income)
Note that for a one-dimensional input this gives the absolute differences |x[i] - x[j]|. It's going to be faster since it's done in C code. Just type dist at the console to see the implementation.
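If you need the full square matrix rather than the lower triangle, as.matrix() will expand the dist object (a minimal sketch using the questioner's column):
d <- dist(housedata$Mean_household_income)
d_mat <- as.matrix(d)  # full symmetric matrix of absolute pairwise differences
d_mat[2, 5]            # |income[2] - income[5]|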

You could loop through an incrementally shifted/wrapped copy of the vector and subtract the two vectors. You still have to loop through the length of the data once and shift and subtract the vector each time, but it will probably save some time.
Here is an example:
# make a shift/wrap function
shift <- function(df, offset) {
  df[((1:length(df)) - 1 - offset) %% length(df) + 1]
}
# make some data
data <- seq(1, 4)
# make an empty vector to hold the results
difs <- vector()
# loop through the data, shifting and subtracting each time
for (i in 1:length(data)) {
  shifted <- shift(data, i)
  result <- data - shifted
  difs <- c(difs, result)
}
print(difs)
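A small added tweak, assuming you want the same output order: preallocating difs avoids growing the vector with c() on every iteration, which is itself slow:
n <- length(data)
difs <- numeric(n^2)
for (i in 1:n) {
  difs[((i - 1) * n + 1):(i * n)] <- data - shift(data, i)
}
print(difs)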

What about using outer? It uses a vectorized function (here -) on all combinations of two vectors and stores the results in a matrix.
For example,
x <- runif(10605)
system.time(
  differences <- outer(x, x, '-')
)
takes one second on my computer.
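One thing to keep in mind (a rough added estimate): outer materialises the full 10605 x 10605 matrix of doubles, so it needs on the order of a gigabyte of memory:
10605^2 * 8 / 2^30  # ~0.84 GB for the dense difference matrix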

Related

Custom rounding to reference values (nDimensional)

I would like to extract, for each test coordinate, the reference coordinate that is closest to it.
The task is very similar to a previously posted question (Find the approximate value in the vector), but adapted to n-dimensional cases and with multiple inputs.
In other words, given:
test = t(data.frame(
  c(0.9, 1.1, 1),
  c(7.5, 7.4, 7.3),
  c(11, 11, 11.2)
))
reference = t(data.frame(
  c(1, 0, 0.5),
  c(2, 2, 2),
  c(3.3, 3.3, 3.3),
  c(9, 9, 9),
  c(10, 11, 12)
))
result <- approximate(test, reference)
# expected result:
#  1  0  0.5
#  9  9  9
# 10 11 12
I programmed a function using Euclidean distances and old-school loops, but when the input data frames are big it results in very long execution times.
Can anyone figure out a more efficient way of doing it?
Thank you in advance.
PS: This is the function I created. It works but takes a while (in case someone finds it useful):
approximate_function <- function(approximate, reference) {
  # Function that returns, for each row of approximate, the closest point of reference.
  # It uses a Euclidean distance.
  # Each point must be a row in the data frame;
  # the number of columns of the df indicates the dimension of the points.

  # Sub-function to calculate Euclidean distance
  distance_function <- function(a, b) {
    squaresum <- 0
    for (id in 1:length(a)) {
      squaresum <- squaresum + (a[id] - b[id])^2
    }
    result <- sqrt(squaresum)
    return(result)
  }

  result <- data.frame()
  # Choose 1 item from approximate at a time
  for (id_approximate in 1:nrow(approximate)) {
    distance <- c()
    # Compare the point to approximate with the reference points and choose the one with the least distance
    for (id_reference in 1:nrow(reference)) {
      distance[id_reference] <- distance_function(approximate[id_approximate, ], reference[id_reference, ])
    }
    result <- rbind(
      result,
      reference[which.min(distance), ]
    )
  }
  return(result)
}
This way the calculation is done instantly:
approximate_function <- function(approximate, reference) {
  # Function that returns, for each row of approximate, the closest point of reference.
  # It uses a Euclidean distance.
  # Each point must be a row in the data frame;
  # the number of columns of the df indicates the dimension of the points.
  results <- data.frame()
  # Choose 1 item from approximate at a time
  for (id in 1:nrow(approximate)) {
    # Calculate Euclidean distances regardless of the dimension
    sumsquares <- rep(0, nrow(reference))
    for (dim in 1:ncol(approximate)) {
      sumsquares <- sumsquares + (approximate[id, dim] - reference[, dim])^2
    }
    distances <- sqrt(sumsquares)
    results <- rbind(
      results,
      reference[which.min(distances), ]
    )
  }
  return(results)
}
You've got a few calculations that will be slow.
First:
test = t(data.frame(
  c(0.9, 1.1, 1),
  c(7.5, 7.4, 7.3),
  c(11, 11, 11.2)
))
This one probably doesn't matter, but it would be better as
test = rbind(
  c(0.9, 1.1, 1),
  c(7.5, 7.4, 7.3),
  c(11, 11, 11.2)
)
Same for setting up reference.
Second and third: You set up result as a dataframe, then add rows to it one at a time. Dataframes are much slower for row operations than matrices, and gradually growing structures in R is slow. So set it up as a matrix from the beginning at the right size, and assign results into specific rows.
EDITED to add:
Fourth: there's no need for the inner loop. You can calculate all the squared differences in one big matrix, then use rowSums or colSums to get the squared distances. This is easiest if you're working with matrix columns instead of rows, because vectors will be properly replicated automatically.
Fifth: There's no need to take the square root; if the squared distance is minimized, so is the distance.
Here's the result:
approximate <- function(test, reference) {
  # transpose the reference
  reference <- t(reference)
  # set up the result, not transposed
  result <- test * NA
  # choose 1 item from test to approximate at a time
  for (id in seq_len(nrow(test))) {
    squareddist <- colSums((test[id, ] - reference)^2)
    result[id, ] <- reference[, which.min(squareddist)]
  }
  return(result)
}
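For example, with the test and reference matrices from the question rebuilt via rbind as suggested above, the function reproduces the expected output:
test <- rbind(c(0.9, 1.1, 1), c(7.5, 7.4, 7.3), c(11, 11, 11.2))
reference <- rbind(c(1, 0, 0.5), c(2, 2, 2), c(3.3, 3.3, 3.3), c(9, 9, 9), c(10, 11, 12))
approximate(test, reference)
#      [,1] [,2] [,3]
# [1,]    1    0  0.5
# [2,]    9    9  9.0
# [3,]   10   11 12.0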

How to do a cumulative sum inverse square weighting based upon how many items you want to calculate off a vector in R?

Suppose I have a vector of length 10
vec <- c(10,9,8,7,6,5,4,3,2,1)
and I wanted to create a function that takes a subset length (say 3) and computes the inverse-square weighted sum up to that length. I would like to compute:
10+(9/(2^2))+(8/(3^2))
which would be
vec[1]+(vec[2]/(2^2))+(vec[3]/(3^2))
but with a function that can take input of the subset length.
The only solution I can think of is a for loop; is there a faster, more elegant solution in R?
Yes, you can use the fact that most operations in R are vectorised to do this without a loop:
vec <- c(10,9,8,7,6,5,4,3,2,1)
cum_inverse_square <- function(vec, n) {
  sum(vec[1:n] / (1:n)^2)
}
cum_inverse_square(vec, 3) == 10 + (9/(2^2)) + (8/(3^2)) # TRUE
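If you need the weighted sum for every prefix length at once (an added note beyond the original question), cumsum gives all the running totals in one vectorised call:
running <- cumsum(vec / seq_along(vec)^2)
all.equal(running[3], cum_inverse_square(vec, 3))  # TRUE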

Coding in R - how to do rolling window without a for-loop? For-loop too slow

I understand for-loops are slow in R, and that the suite of apply() functions is designed to be used instead (in many cases).
However, I can't figure out how to use those functions in my situation, and advice would be greatly appreciated.
I have a list/vector of values (let's say length = 10,000) and at every point, starting at the 21st value, I need to take the standard deviation of a trailing window (the previous 20 values plus the current one). So at the 21st value I take sd(x[1:21]), at the 22nd I take sd(x[2:22]), and so on.
So you see I have a rolling window where I need to take the SD() of the trailing indices. Is there any way to accomplish this faster, without a for-loop?
I found a solution to my question.
The zoo package has a function called "rollapply" which does exactly that: uses apply() on a rolling window basis.
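For reference, here is a minimal sketch of the rollapply call (assuming a 21-value trailing window, to match sd(x[i:(i+y)]) with y = 20 in the benchmark below):
library(zoo)
roll_sd <- rollapply(x, width = 21, FUN = sd, align = "right")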
library(microbenchmark)
library(ggplot2)
# dummy vector
c <- 50
x <- sample(1:100, c, replace = TRUE)
# parameters
y <- 20               # window offset: each SD uses y + 1 values
z <- length(x) - y    # final starting index
# benchmark
xx <- microbenchmark(
  lapply = {a <- lapply(1:z, \(i) sd(x[i:(i+y)]))},
  loop = {
    b <- vector("list", z)
    for (i in 1:z) {
      b[[i]] <- sd(x[i:(i+y)])
    }
  },
  times = 30
)
# plot
autoplot(xx) +
  ggtitle(paste('vector of size', c))
It would appear that while lapply has a speed advantage on a smaller vector, a loop should be used with longer vectors.
I would maintain, however, that loops are not slow per se as long as they are not applied incorrectly (such as iterating over the rows of a data frame).

Vectorizing a column-by-column comparison to separate values

I'm working with data gathered from multi-channel electrode systems, and am trying to make this run faster than it currently is, but I can't find any good way of doing it without loops.
The gist of it is; I have modified averages for each column (which is a channel), and need to compare each value in a column to the average for that column. If the value is above the adjusted mean, then I need to put that value in another data frame so it can be easily read.
Here is some sample code for the problematic bit:
readout <- data.frame(dimnames <- c("Values"))
# need to clear the data frame in order to run it multiple times without errors
# timeFrame is just a subsection of the original data, 60 channels with upwards of a few million rows
readout <- readout[0, ]
for (i in 1:ncol(timeFrame)) {
  for (g in 1:nrow(timeFrame)) {
    if (timeFrame[g, i] >= posCompValues[i, 1])
      append(spikes, timeFrame[g, i])
  }
}
The data ranges from 500 thousand to upwards of 130 million readings, so if anyone could point me in the right direction I'd appreciate it.
Something like this should work:
Return values of x greater than y:
cmpfun <- function(x,y) return(x[x>y])
For each element (column) of timeFrame, compare with the corresponding value of the first column of posCompValues
vals1 <- Map(cmpfun,timeFrame,posCompValues[,1])
Collapse the list into a single vector:
spikes <- unlist(vals1)
If you want to save both the value and the corresponding column it may be worth unpacking this a bit into a for loop:
resList <- list()
for (i in seq(ncol(timeFrame))) {
  tt <- timeFrame[, i]
  spikes <- tt[tt > posCompValues[i, 1]]
  if (length(spikes) > 0) {
    resList[[i]] <- data.frame(value = spikes, orig_col = i)
  }
}
res <- do.call(rbind, resList)
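A small self-contained check of the Map approach (toy data; the names timeFrame and posCompValues mirror the question, the values are made up):
set.seed(1)
timeFrame <- data.frame(ch1 = rnorm(6), ch2 = rnorm(6))
posCompValues <- data.frame(threshold = c(0, 0.5))  # one row per channel
cmpfun <- function(x, y) x[x > y]
spikes <- unlist(Map(cmpfun, timeFrame, posCompValues[, 1]))
spikes  # named vector of above-threshold readings; names indicate source channel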

R: apply() type function for two 2-d arrays

I'm trying to find an apply() type function that can run a function that operates on two arrays instead of one.
Sort of like:
apply(X1 = doy_stack, X2 = snow_stack, MARGIN = 2, FUN = r_part(a, b))
The data is a stack of band arrays from Landsat tiles that are stacked together using rbind. Each row contains the data from a single tile, and in the end, I need to apply a function on each column (pixel) of data in this stack. One such stack contains whether each pixel has snow on it or not, and the other stack contains the day of year for that row. I want to run a classifier (rpart) on each pixel and have it identify the snow free day of year for each pixel.
What I'm doing now is pretty silly: mapply(paste, doy, snow_free) concatenates the day of year and the snow status together for each pixel as a string, apply(strstack, 2, FUN) runs the classifier on each pixel, and inside the apply function, I'm exploding each string using strsplit. As you might imagine, this is pretty inefficient, especially on 1 million pixels x 300 tiles.
Thanks!
I wouldn't try to get too fancy. A for loop might be all you need.
# assuming n is the number of pixels (columns), e.g. n <- ncol(doy_stack)
out <- numeric(n)
for (i in 1:n) {
  out[i] <- snow_free(doy_stack[, i], snow_stack[, i])
}
Or, if you don't want to do the bookkeeping yourself,
sapply(1:n, function(i) snow_free(doy_stack[,i], snow_stack[,i]))
I've just encountered the same problem and, if I understood the question correctly, I may have solved it using mapply.
We'll use two 10x10 matrices populated with uniform random values.
set.seed(1)
X <- matrix(runif(100), 10, 10)
set.seed(2)
Y <- matrix(runif(100), 10, 10)
Next, determine how operations between the matrices will be performed. If it is row-wise, you need to transpose X and Y then cast to data.frame. This is because a data.frame is a list with columns as list elements. mapply() assumes that you are passing a list. In this example I'll perform correlation row-wise.
res.row <- mapply(function(x, y){cor(x, y)}, as.data.frame(t(X)), as.data.frame(t(Y)))
res.row[1]
V1
0.36788
should be the same as
cor(X[1,], Y[1,])
[1] 0.36788
For column-wise operations exclude the t():
res.col <- mapply(function(x, y){cor(x, y)}, as.data.frame(X), as.data.frame(Y))
This obviously assumes that X and Y have dimensions consistent with the operation of interest (they don't have to be exactly the same dimensions). For instance, one could run a row-wise statistical test even when the two matrices have differing numbers of columns.
Wouldn't it be more natural to implement this as a raster stack? With the raster package you can use entire rasters in functions (e.g. ras3 <- ras1^2 + ras2), as well as extract a single cell value from XY coordinates, or many cell values using a block or polygon mask.
apply can work on higher-dimensional arrays, so another option is to bind the two stacks into a 3-d array and apply over the first two margins. Not sure how your data is set up, but something like this might be what you are looking for:
arr <- simplify2array(list(doy_stack, snow_stack))  # dims: nrow x ncol x 2
apply(arr, c(1, 2), function(x) r_part(x[1], x[2]))
