Fast geospatial sampling in R - r

I have a large set of polygons (about 20k) that I want to sample points from. I use the st_sample function from the sf package in R, but it's pretty slow. It takes about 5 minutes to sample from all polygons, and I need to repeat this task a large number of times (N >= 1000) so it's not practical.
Is there a way to do faster sampling?

Related

FactoMineR PCA takes a really long time

I'm trying to run a PCA on a really large dataset (160 000 x 20 000 variables, approx 6.3G in the file but much more when loaded in R) on a cluster. However, it is taking a high amount of time (my job was killed after 90 hours) while it was usually done in a few hours on datasets half the size.
I'm using the most basic R code possible :
data=read.table("dataset.csv", header=T, sep=',',row.names=1, fill=TRUE)
y=PCA(data, ncp=100, graph=FALSE)
Is there something wrong with what I'm doing or should I try a PCA from another package?

What is the most efficient way to find the closest geographic location?

I have two data frames in which observations are geographic locations defined by a latitude/longitude combination. For each point in df1 I would like to get the closest point in df2 and the associated value. I know how to do that by computing all the possible distances (using e.g. the gdist function from the Imap package) and getting the index for the smallest one. But the fact is that it is at best excessively long as df1 has 1,000 rows and df2 some 15 millions.
Do you have an idea of how I could reach my goal without computing all the distances? Maybe there is a way to limit the necessary calculations (for instance using the difference in latitude/longitude values)?
Thanks for helping,
Val
Here's what df1looks like:
Latitude Longitude
1 56.76342 8.320824
2 54.93165 9.115982
3 55.80685 9.102455
4 57.27000 9.760000
5 56.76342 8.320824
6 56.89333 9.684435
7 56.62804 8.571573
8 56.64850 8.501947
9 55.40596 8.884374
10 54.89786 11.880828
then df2:
Latitude Longitude Value
1 41.91000 -4.780000 40500
2 41.61063 14.750832 13500
3 41.91000 -4.780000 4500
4 38.70000 -2.350000 28500
5 52.55172 0.088622 1500
6 39.06000 -1.830000 51000
7 41.91000 -4.780000 49500
8 48.00623 -4.389639 12000
9 56.24889 -3.666940 27000
10 42.72000 -3.750000 49500
Split the second frame into chunks of equal size
Then search only the chunks within the reasonable distance of your point. You will be basically drawing a checkerboard on a map. Your point will be within one of these squares - so you will search only that one and few neighboring ones to be safe.
Naive brute force search is rows(df1) * rows(df2). In our case 1000 * 15M, making for 15G operations times the computation time per operation.
So how do we split the data into chunks?
sort by latitude
sort by longitude
take equaly spaced chunks
Sort will take some Nlog(N) operations. N is 15M in our case so these two sorts will take
~2415M2 operations. Splitting in the chunks is then linear ~15M operations, maybe few times.
when you have this separation done, in each chunk you have total_points/(chunk_side ^ 2) points, assuming that your points are distributed equally.
The number of the chunks is proportional to the size of the chunk in the beginning:
total_area/(chunk_side ^ 2).
Ideally you want to balance the number of chunks with the number of points in each chunk so that both are ~ sqrt(points_total).
Each of the thousand searches will now take only chunk_count + points_in_chunk * 9 (if we want to be super safe and search the chunk our point lands in and all the surrounding ones.) So instead of 1000 * 15M you now have `1000 * (sqrt(15M) *18) ~ 1000 * 16K, an improvement by a factor of 50.
Note that this improvement will grow if the second set gets larger. Also the improvement will be smaller, if you choose the chunk size poorly.
For further improvement, you can iterate this once or twice more, making chunks in chunks. The logic is similar.
The distm function of geosphere package will help you:
# Make sure to put longitude first and then latitude:
df <- df %>% select(Longitude,Latitude)
library(geosphere)
distm(as.matrix(df), as.matrix(df), fun=distGeo)
Remenber, the distm function accepts matrix class objects. You will obtain a 10x10 matrix of distances.

Calculate Cosine Similarity between two documents in TermDocumentMatrix of tm Package in R

My task is to compare documents in a corpus by the cosine similarity. I use tm package and obtain the TermDocumentMatrix (in td-idf form) tdm. The following task should as simple as stated in here
d <- dist(tdm, method="cosine")
or
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
However, the number of terms in my tdm is quite large, more than 120,000 (with around 50,000 documents). It is beyond the capability of R to handle such matrix.
My RStudio crashed several times.
My questions are 1) how can I handle such a large matrix and get the pair-wise (120,000*120,000) cosine similarity? 2) if impossible, how can I just get the cosine similarity of only two documents at one time? Suppose I want the similarity between document 10 and 21, then something like
sim10_21<-cosine_similarity(tdm, d1=10,d2=21)
If tdm is a simple matrix, I can do the calculate on tdm[,c(10,21)]. However, to convert tdm to a matrix is exactly what I cannot handle. My questions ultimately boils down to how to do matrix-like calculate on tdm.
120,000 x 120,000 matrix * 8 bytes (dbl float) = 115.2 gigabytes. This isn't necessarily beyond the capability of R, but you do need at least that much memory, regardless of what language you use. Realistically, you'll probably want to write to the disk, either using some database such as Sql (e.g. RSQLite package) or if you plan to only use R in your analysis, it might be better to use the "ff" package for storing/accessing large matrices on disk.
You could do this iteratively and multithread it to improve the speed of calculation.
To find the distance between two docs, you can do something like this:
dist(t(tdm[,1]), t(tdm[,2]), method='cosine')

Make For Loop and Spacial Computing Faster?

I am playing with a large dataset (~1.5m rows x 21 columns). Which includes a long, lat information of a transaction. I am computing the distance of this transaction from couple of target locations and appending this as new column to main dataset:
TargetLocation1<-data.frame(Long=XX.XXX,Lat=XX.XXX, Name="TargetLocation1", Size=ZZZZ)
TargetLocation2<-data.frame(Long=XX.XXX,Lat=XX.XXX, Name="TargetLocation2", Size=YYYY)
## MainData[6:7] are long and lat columns
MainData$DistanceFromTarget1<-distVincentyEllipsoid(MainData[6:7], TargetLocation1[1:2])
MainData$DistanceFromTarget2<-distVincentyEllipsoid(MainData[6:7], TargetLocation2[1:2])
I am using geosphere() package's distVincentyEllipsoid function to compute the distances. As you can imaging, distVincentyEllipsoid function is a computing intensive but it is more accurate (compared to other functions of the same package distHaversine(); distMeeus(); distRhumb(); distVincentySphere())
Q1) It takes me about 5-10 mins to compute distances for each target location [I have 16 GB RAM and i7 6600U 2.81Ghz Intel CPU ], and I have multiple target locations. Is there any faster way to do this?
Q2) Then I am creating a new column for a categorical variable to mark each transaction if it belongs to market definition of target locations. A for loop with 2 if statements. Is there any other way to make this computation faster?
MainData$TransactionOrigin<-"Other"
for (x in 1:nrow(MainData)){
if (MainData$DistanceFromTarget1[x]<=7000)
MainData$TransactionOrigin[x]="Target1"
if (MainData$DistanceFromTarget2[x]<=4000)
MainData$TransactionOrigin[x]="Target2"
}
Thanks
Regarding Q2
This will run much faster if you lose the loop.
MainData$TransactionOrigin <- "Other"
MainData$TransactionOrigin[which(MainData$DistanceFromTarget1[x]<=7000)] <- "Target1"
MainData$TransactionOrigin[which(MainData$DistanceFromTarget2[x]<=4000)] <- "Target2"

Thinking in Vectors with R

I know that R works most efficiently with vectors and looping should be avoided. I am having a hard time teaching myself to actually write code this way. I would like some ideas on how to 'vectorize' my code. Here's an example of creating 10 years of sample data for 10,000 non unique combinations of state (st), plan1 (p1) and plan2 (p2):
st<-NULL
p1<-NULL
p2<-NULL
year<-NULL
i<-0
starttime <- Sys.time()
while (i<10000) {
for (years in seq(1991,2000)) {
st<-c(st,sample(c(12,17,24),1,prob=c(20,30,50)))
p1<-c(p1,sample(c(12,17,24),1,prob=c(20,30,50)))
p2<-c(p2,sample(c(12,17,24),1,prob=c(20,30,50)))
year <-c(year,years)
}
i<-i+1
}
Sys.time() - starttime
This takes about 8 minutes to run on my laptop. I end up with 4 vectors, each with 100,000 values, as expected. How can I do this faster using vector functions?
As a side note, if I limit the above code to 1000 loops on i it only takes 2 seconds, but 10,000 takes 8 minutes. Any idea why?
Clearly I should have worked on this for another hour before I posted my question. It's so obvious in retrospect. :)
To use R's vector logic I took out the loop and replaced it with this:
st <- sample(c(12,17,24),10000,prob=c(20,30,50),replace=TRUE)
p1 <- sample(c(12,17,24),10000,prob=c(20,30,50),replace=TRUE)
p2 <- sample(c(12,17,24),10000,prob=c(20,30,50),replace=TRUE)
year <- rep(1991:2000,1000)
I can now do 100,000 samples almost instantaneous. I knew that vectors were faster, but dang. I presume 100,000 loops would have taken over an hour using a loop and the vector approach takes <1 second. Just for kicks I made the vectors a million. It took ~2 seconds to complete. Since I must test to failure, I tried 10mm but ran out of memory on my 2GB laptop. I switched over to my Vista 64 desktop with 6GB ram and created vectors of length 10mm in 17 seconds. 100mm made things fall apart as one of the vectors was over 763mb which resulted in an allocation issue with R.
Vectors in R are amazingly fast to me. I guess that's why I am an economist and not a computer scientist.
To answer your question about why the loop of 10000 took much longer than your loop of 1000:
I think the primary suspect is the concatenations that are happening every loop. As the data gets longer R is probably copying every element of the vector into a new vector that is one longer. Copying a small (500 elements on average) data set 1000 times is fast. Copying a larger (5000 elements on average) data set 10000 times is slower.

Resources