Calculate sum of array cells within a given radius - r

This question comes after a calculation in GIS (ArcMap 10.1) has been running for over a month (and still hasn't finished). Now I am trying to find a faster solution in R.
I have a matrix of ~30,000 x 80,000 cells, where each cell represents a 5x5 meters square. I need to calculate the sum of values in cells that fall within a given radius (3000 meters) from each cell.
For the cells on the edge of the matrix I assume a value of 0 outside the matrix.
The question is how to define the cells that fall within the radius.
There must be a library that has this functionality, but I couldn't find any.
Any suggestions?

A quick method you can test would be to use extract, set buffer to 3000 m, and pass sum as the fun argument, extracting each cell number of your raster in turn. But I still think this will take an inordinate amount of time. Let's assume your raster is called r...
library(raster)
# in the first instance I would set y to be smallish, say 1:100, and see how long it takes
extract(r, y = 1:ncell(r), buffer = 3000, fun = sum)
Now, the raster package does have some parallelism built in which, with access to a large, large, large multi-core machine, could speed up your operation a bit by running...
beginCluster()
extract(r, y = 1:ncell(r), buffer = 3000, fun = sum)
endCluster()
Don't forget to assign the output of extract to a variable.
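If extract proves too slow even in parallel, a moving-window alternative worth testing is focal() with a circular weight matrix; this is only a sketch (not benchmarked), and note that a 3000 m radius on 5 m cells means a 1201 x 1201 window, so it is still a heavy computation.
library(raster)
# circular weight matrix with a 3000 m radius; focalWeight() fills it with 1/n,
# so reset the non-zero weights to 1 to get a plain sum rather than a mean
w <- focalWeight(r, d = 3000, type = "circle")
w[w > 0] <- 1
# pad the edges with zeros, matching the assumption of 0 outside the matrix
r_sum <- focal(r, w = w, fun = sum, pad = TRUE, padValue = 0)
Aggregating the raster to coarser cells first (with aggregate()) and accepting the approximation is another way to cut the work down.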

Related

Use %dopar% for nested for loops to find smallest distance in R

I'm currently trying to match all the zip codes in the US to some zip codes I have, by smallest distance. The code currently looks like this:
for (i in 1:nrow(Haversine_Zip_Match)) {
  # Reset the nearest distance for every row
  BestDist <- Inf
  for (j in 1:nrow(merged)) {
    # Calculate distance
    currDist <- dist(merged$LAT[j], Haversine_Zip_Match$LAT[i],
                     merged$LONG[j], Haversine_Zip_Match$LONG[i])
    # There are some NA values for long/lat
    if (is.na(currDist)) {
      currDist <- Inf
    }
    # Update best matching result
    if (currDist < BestDist) {
      BestDist <- currDist
      Haversine_Zip_Match$haversineMatch[i] <- merged$ZIP_CD[j]
    }
  }
}
dist is a function I defined to calculate the distance. But Haversine_Zip_Match has 40,000 rows and merged has 30,000 rows, so in total there are over a billion calculations. Is there a way to make this faster? I'm currently thinking of using %dopar% to expedite the process. Any idea would help, thanks!
Instead of trying to parallelize, you could try to reduce the number of calculations.
Usually, zipcode databases define the min/max latitude and longitude around a zip code.
If you don't have this information, you can define a box around each zip code, large enough that the boxes of neighboring zip codes overlap; one way to build such boxes is sketched below.
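For instance, such boxes could be approximated directly from the coordinates. The sketch below assumes a 30 km margin and a hypothetical boxes table; the column names match the code that follows.
library(data.table)
# rough conversion: 1 degree of latitude is about 111 km
margin_km <- 30
boxes <- as.data.table(zipcode)[!is.na(latitude) & !is.na(longitude)]
boxes[, `:=`(latmin = latitude - margin_km / 111,
             latmax = latitude + margin_km / 111,
             lonmin = longitude - margin_km / (111 * cos(latitude * pi / 180)),
             lonmax = longitude + margin_km / (111 * cos(latitude * pi / 180)))]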
In the example below, I used this zipcode .rda with 43689 codes.
library(data.table)
library(geosphere)
points <- setDT(zipcode)[,.(zip,latitude,longitude)][!is.na(latitude)&!is.na(longitude)]
zipDB <- setDT(zipcode)[,.(zip,latitude,longitude,latmin, latmax,lonmin,lonmax)][!is.na(latitude)&!is.na(longitude)]
# full cross product :
nrow(points) * nrow(zipDB)
#[1] 1908728721
# Area limited cross product
cross <- zipDB[points,
               .(i.zip, i.latitude, i.longitude, zip, latitude, longitude),
               on = .(latmin <= latitude, lonmin <= longitude,
                      latmax >= latitude, lonmax >= longitude)]
nrow(cross)
#[1] 18501135
# Find zip codes nearest to a point
cross[, .(i.zip, zip,
          dist = distHaversine(cbind(i.longitude, i.latitude),
                               cbind(longitude, latitude)))][
  dist == min(dist), .(dist), by = .(i.zip, zip)]
Since we compared the zip code database to itself, we might expect to get back exactly the same number of points, but this is not the case: some zip codes (for example 00210, 00211, ...) share the same coordinates, so we get every combination of them.
This takes ~20s on my tablet.

Adding values to an R vector based on a for loop

I have a data frame that contains wifi download bandwidth and GPS data (latitude and longitude) on a transportation system. I want to determine from the data what the average bandwidth is when the vehicle is moving north, and what it is when it is moving south.
# bandwidth and latitude values from df (both have 2800 entries)
bandwidth <- df$bandwidth
latitude <- df$latitude
# create empty vectors to fill with bandwidth values depending on whether
# the vehicle is moving north or south
movingnorth <- vector('numeric')
movingsouth <- vector('numeric')
# if the train is moving north, fill movingnorth with data from the bandwidth vector
for (y in latitude) {
  if (latitude[y] >= latitude[y + 1]) {
    movingnorth <- c(movingnorth, received[y])
  }
}
Here, I am basically saying if the latitude value is going up, then the vehicle is moving north, and therefore enter the bandwidth value from that location into the movingnorth vector. I would expect only a portion of the values from bandwidth vector to be added to the movingnorth vector, but all 2800 values are added. What am I doing wrong here?
Take advantage of R's vectorized operations. First we use diff to find the change between successive elements of latitude
latitude_change <- diff(df$latitude)
Now we have a vector whose length is 1 less than the length of latitude. Direction happens between the measurements, so that makes sense. Let's say we won't determine direction for the first measurement. So that means if latitude_change[i] > 0, then the train's northbound at time i - 1.
df$movingnorth <- c(FALSE, latitude_change > 0)
I'm keeping this part of df because it's related information, so a table's the perfect place for it.
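From there, the per-direction averages the question asks for are just grouped means; a short sketch, assuming bandwidth is the column name from the question:
# average bandwidth while moving north vs. the rest (the first row was
# arbitrarily marked FALSE above, so it counts as "not north")
mean(df$bandwidth[df$movingnorth], na.rm = TRUE)
mean(df$bandwidth[!df$movingnorth], na.rm = TRUE)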
As lmo said, you want to use seq_along(latitude) or 1:length(latitude), which return the index instead of the actual element in latitude.
Also, you may want to double check that latitude[y+1] is correct. The current syntax assumes that the order of the latitude values in the data goes from the latest to the oldest. It is not possible to know if this is correct from the information you provide, but it may be the reverse.
As pointed out by Frank, you are growing your vector in a loop and that is bad practice (since it does not scale well and becomes very slow for large objects). Nathan Werth's answer suggests a vectorized implementation.
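For completeness, a corrected index-based version of the original loop might look like the sketch below; it uses bandwidth in place of the undefined received, and it still grows the vector, which is slow for large data.
movingnorth <- vector('numeric')
# loop over positions, not values, and stop one short of the end
for (y in seq_along(latitude)[-length(latitude)]) {
  # northbound if the next latitude reading is larger than the current one
  if (latitude[y + 1] > latitude[y]) {
    movingnorth <- c(movingnorth, bandwidth[y])
  }
}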

Bourdet Derivative in R with Smoothing Window

I am calculating pressure derivatives using algorithms from this PDF:
Derivative Algorithms
I have been able to implement the "two-points" and "three-consecutive-points" methods relatively easily using dplyr's lag/lead functions to offset the original columns forward and back one row.
The issue with those two methods is that there can be a ton of noise in the high-resolution data we use. This is why there is the third method, "three-smoothed-points", which is significantly more difficult to implement. There is a user-defined "window width", W, typically between 0 and 0.5. The algorithm chooses point_L and point_R as the first points to the left and right of the current point that are separated from it by more than W in log time, i.e. ln(T/T_L) > W and ln(T_R/T) > W. Here is what I have so far:
# If necessary, install dplyr
# install.packages("dplyr")
library(dplyr)
# Create initial data frame
elapsedTime <- c(0.09583, 0.10833, 0.12083, 0.13333, 0.14583, 0.16800,
                 0.18383, 0.25583)
deltaP <- c(71.95, 80.68, 88.39, 97.12, 104.24, 108.34, 110.67, 122.29)
df <- data.frame(elapsedTime,deltaP)
#Shift the elapsedTime and deltaP columns forward and back one row
df$lagTime <- lag(df$elapsedTime,1)
df$leadTime <- lead(df$elapsedTime,1)
df$lagP <- lag(df$deltaP,1)
df$leadP <- lead(df$deltaP,1)
#Calculate the 2 and 3 point derivatives using nearest neighbors
df$TwoPtDer <- (df$leadP - df$lagP) / log(df$leadTime/df$lagTime)
df$ThreeConsDer <- ((df$deltaP - df$lagP) / (log(df$elapsedTime / df$lagTime))) *
  ((log(df$leadTime / df$elapsedTime)) / (log(df$leadTime / df$lagTime))) +
  ((df$leadP - df$deltaP) / (log(df$leadTime / df$elapsedTime))) *
  ((log(df$elapsedTime / df$lagTime)) / (log(df$leadTime / df$lagTime)))
#Calculate the window value for the current 1 row shift
df$lnDeltaT_left <- abs(log(df$elapsedTime/df$lagTime))
df$lnDeltaT_right <- abs(log(df$elapsedTime/df$leadTime))
Resulting Data Table
If you look at the picture linked above, you will see that, based on a W of 0.1, only row 2 meets this criterion for both the left and right point. Just FYI, this data set is an extension of the data used in example 2.5 of the referenced PDF.
So, my ultimate question is this:
How can I choose the correct point_L and point_R so that they meet the above criterion? My initial thought is some kind of while loop, but being an inexperienced programmer, I am having trouble writing a loop that gets anywhere close to what I am shooting for.
Thank you for any suggestions you may have!
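One way to pick point_L and point_R without a while loop is to search, for each row, for the nearest point on each side whose log-time separation exceeds W. This is only a sketch, assuming W = 0.1 and the df built above, and it has not been checked against the PDF's worked example.
W <- 0.1
n <- nrow(df)
# for each row, the nearest earlier point at least W away in log time (NA if none)
left_idx <- sapply(seq_len(n), function(i) {
  cand <- which(log(df$elapsedTime[i] / df$elapsedTime[seq_len(i - 1)]) > W)
  if (length(cand)) max(cand) else NA_integer_
})
# ... and the nearest later point at least W away in log time (NA if none)
right_idx <- sapply(seq_len(n), function(i) {
  cand <- which(log(df$elapsedTime[i:n] / df$elapsedTime[i]) > W)
  if (length(cand)) i - 1 + min(cand) else NA_integer_
})
With left_idx and right_idx in hand, the three-smoothed-points derivative can be computed with the same weighted formula used for ThreeConsDer, indexing elapsedTime and deltaP with these vectors instead of lag/lead.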

How to use pointDistance with a very large vector

I've got a big problem.
I've got a large raster (rows = 180, columns = 480, number of cells = 86,400).
At first I binarized it (so that there are only 1s and 0s) and then I labelled the clusters (cells that are 1 and connected to each other got the same label).
Now I need to calculate all the distances between the cells that are NOT 0.
There are quite a lot of them, and that's my big problem.
I did this to get the coordinates of the cells I'm interested in (i.e. the positions/cell numbers of the cells that are not 0):
V <- getValues(label)        # all cell values of the labelled raster
Vu <- 1:max(V)               # the labels of interest (everything except 0)
pos <- which(V %in% Vu)      # cell numbers of the non-zero cells
XY <- xyFromCell(label, pos) # their x/y coordinates
This works very well. So XY is a matrix that contains all the coordinates of cells that are not 0. But now I'm struggling: I need to calculate the distances between ALL of these coordinates and then put each one of them into one of 43 distance bins. It's kind of like this (just an example):
0 < x < 0.2    bin 1
0.2 < x < 0.4  bin 2
When I use this:
pD=pointDistance(XY,lonlat=FALSE)
R says it's not possible to allocate a vector of this size; it's getting too large.
Then I thought I could do this (create an empty data frame df or something like that and let the function pointDistance run over every single value of XY):
for (i in 1:nrow(XY)) {
  pD <- pointDistance(XY, XY[i, ], lonlat = FALSE)
  pDbin <- as.matrix(table(cut(pD, breaks = seq(0, 8.6, by = 0.2), labels = 1:43)))
  df <- cbind(df, pDbin)
  df <- apply(df, 1, FUN = function(x) sum(x))
}
It is working when I try this with e.g. the first 50 values of XY.
But when I use it for the whole XY matrix it takes too much time. (Sometimes this XY matrix contains 10,000 xy-coordinates.)
Does anyone have an idea how to do it faster?
I don't know whether this will be fast or not, but I recommend you try the following.
Let's say you have a data frame with a value of 0 or 1 in each cell. To find the coordinates, all you have to do is write the code below:
cord_matrix <- which(dataframe == 1, arr.ind = TRUE)
Now you get the coordinate matrix with row and column indices.
To find the Euclidean distances, use the dist() function. It will look like this:
dist_vector <- dist(cord_matrix)
It returns the lower triangle of the distance matrix (a 'dist' object), which can be transformed into a vector or a symmetric matrix. Now all you have to do is calculate the bins according to your requirements.
Let me know if this fits within your memory limits.
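If dist() also runs out of memory, another option is to keep only the running bin counts and process one point at a time, reusing the question's XY matrix and pointDistance() call. This is a sketch only; note that each unordered pair ends up counted twice, hence the final division.
library(raster)
breaks <- seq(0, 8.6, by = 0.2)                    # 43 bins of width 0.2
bin_counts <- integer(length(breaks) - 1)
for (i in seq_len(nrow(XY))) {
  d <- pointDistance(XY, XY[i, ], lonlat = FALSE)  # distances from point i to all points
  b <- cut(d, breaks = breaks, labels = FALSE)     # bin index per distance, NA if out of range
  bin_counts <- bin_counts + tabulate(b, nbins = length(breaks) - 1)
}
bin_counts <- bin_counts / 2                       # each pair was counted from both ends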

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix with 100 rows and 10 columns. I want to compare the dissimilarity between rows and sort them, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial idea is to calculate the similarity between pairs of rows (e.g. the Tanimoto coefficient or another measure: http://en.wikipedia.org/wiki/Jaccard_index), take dissimilarity = 1 - similarity, then sort all the dissimilarity values and select the 10 largest. But the result would be a 100 * 100 matrix, so a more efficient method may be needed when there are a large number of rows. However, this is just my thought and may not be right, so I need help.
[update]
After looking through some literature, I found a definition of the maximum dissimilarity method:
Maximum dissimilarity method: it begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center, then the record maximally distant from both current points, and so on. The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center) ** 2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
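If the greedy procedure described in the update is what is actually wanted, it can be sketched directly on the example matrix. This is an illustration only: Euclidean distance and a random starting row are assumptions, and the caret package's maxDissim() implements a related search.
# greedy farthest-point selection of 10 rows (the "maximum dissimilarity method")
d <- as.matrix(dist(mat))           # 100 x 100 Euclidean distance matrix
selected <- sample(nrow(mat), 1)    # random first cluster center
while (length(selected) < 10) {
  # each row's distance to its nearest already-selected row
  nearest <- apply(d[, selected, drop = FALSE], 1, min)
  nearest[selected] <- -Inf         # never reselect a chosen row
  selected <- c(selected, which.max(nearest))
}
centers <- mat[selected, ]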
