I have a rather large dataset that I've read into R and formatted the way I would like. It has 3 columns: latitude, longitude, and an indicator for whether or not there is snow at that location.
I also have another (larger) dataset with 3 columns: latitude, longitude, and elevation. I would like to sort through these latitudes and longitudes, match them up with the lats and longs in my other dataset, and add in the elevation. Essentially I would like 1 dataset with 4 columns: lat, long, indicator, and elevation.
I have written a program that does this, but for these very large datasets it is not efficient enough to let run, and I do not want to kill my personal PC. Is there any way to speed up my code?
Thanks so much.
## Looping through
for (i in 1:len) {
  for (j in 1:78355) {
    if (OrigDat1[i, 2] == Elev04[j, 2] && OrigDat1[i, 1] == Elev04[j, 1]) {
      Elev.vec[i] <- as.numeric(Elev04[j, 3])
    }
  }
}
In the above code the 'len' term varies depending on the size of the shapefile I strip the data down to. It has a length anywhere from 1000 - 50000 locations.
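For scale, the nested loop does len × 78,355 comparisons. A minimal sketch of an exact-match join that would avoid the loop entirely, assuming both objects are data frames and the coordinate columns are named lat and long (the names here are illustrative, since my real data frames index columns by position):
# Sketch: merge the two data frames on the coordinate columns.
# Assumes OrigDat1 has columns lat, long, indicator and Elev04 has lat, long, elevation.
merged <- merge(OrigDat1, Elev04, by = c("lat", "long"), all.x = TRUE)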
I can provide you with more information if needed.
Thanks again.
Related
I have a seemingly simple question that I can't seem to figure out. I have a large dataset of millions of data points. Each data point represents a single fish with its biological information as well as when and where it was caught. I am running some statistics on these data and have been having issues, which I have finally tracked down to some data points having latitude and longitude values that fall exactly on the corners of the grid cells I am using to bin my data. When these fish with lats and longs that fall exactly on grid cell corners are grouped into their appropriate grid cell, they end up being duplicated 4 times (once for each cell that touches the grid cell corner their lats and longs identify).
Needless to say this is bad, and I need to force those animals to have lats and longs that don't put them exactly on a grid cell corner. I realize there are probably lots of ways to correct something like this, but what I really need is a simple way to identify latitudes and longitudes that have integer values, and then to modify them by a very small amount (randomly adding or subtracting) so as to shift them into a specific cell without creating a bias by shifting them all the same way.
I hope this explanation makes sense. I have included a very simple example in order to provide a workable problem.
fish <- data.frame(fish = 1:10,
                   lat  = c(25, 25, 25, 25.01, 25.2, 25.1, 25.5, 25.7, 25, 25),
                   long = c(140, 140, 140, 140.23, 140.01, 140.44, 140.2, 140.05, 140, 140))
In this fish data frame there are 10 fish, each with an associated latitude and longitude. Fish 1, 2, 3, 9, and 10 have integer lat and long values that will place them exactly on the corners of my grid cells. I need some way of shifting just these values by something like plus or minus 0.01.
I can identify which lats or longs are whole numbers easily enough with something like:
fish %>%
  mutate(lat_on_corner = near(lat, round(lat)))
But I am struggling to find a way to then modify all the integer values by some small amount.
To answer my own question: I was able to work this out this morning with some pretty basic code, see below. All it takes is making a function that actually tests for whole-number values, which is.integer() does not (it only checks the storage type).
# is.integer() only checks the storage type, so define a check that tests for whole-number values
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
# Use ifelse to change only whole-number values of lat and long
fish$jitter_lat <- ifelse(is.wholenumber(fish$lat),
                          fish$lat + rnorm(fish$lat, mean = 0, sd = 0.01),
                          fish$lat)
fish$jitter_long <- ifelse(is.wholenumber(fish$long),
                           fish$long + rnorm(fish$long, mean = 0, sd = 0.01),
                           fish$long)
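A quick sanity check (not part of the original answer) that the jitter really did move every corner point off a whole-number coordinate:
# Should be FALSE: rnorm() essentially never produces an offset that lands
# back on an exact whole number within the tolerance used by is.wholenumber()
any(is.wholenumber(fish$jitter_lat) | is.wholenumber(fish$jitter_long))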
I'm currently trying to match all the zip codes in the US with some zip codes I have, by the smallest distance. The code currently looks like this:
for (i in 1:nrow(Haversine_Zip_Match)) {
  # Reset the nearest distance for every row
  BestDist <- Inf
  for (j in 1:nrow(merged)) {
    # Calculate the distance
    currDist <- dist(merged$LAT[j], Haversine_Zip_Match$LAT[i],
                     merged$LONG[j], Haversine_Zip_Match$LONG[i])
    # There are some NA values for long/lat
    if (is.na(currDist)) {
      currDist <- Inf
    }
    # Update the best matching result
    if (currDist < BestDist) {
      BestDist <- currDist
      Haversine_Zip_Match$haversineMatch[i] <- merged$ZIP_CD[j]
    }
  }
}
dist() is the function I defined to calculate the distance. But "Haversine_Zip_Match" has 40,000 rows and "merged" has 30,000 rows, so in total there are over 1 billion distance calculations. Is there a way to make it faster? I'm currently thinking of using %dopar% to expedite the process. Any idea would help, thanks!
Instead of trying to parallelize, you could try to reduce the number of calculations.
Usually, zipcode databases define the min/max latitude and longitude around a zip code.
If you don't have this information, you can define a box around each zip code, large enough that the boxes of neighbouring zip codes overlap.
In the example below, I used this zipcode .rda with 43689 codes.
library(data.table)
library(geosphere)

points <- setDT(zipcode)[, .(zip, latitude, longitude)][!is.na(latitude) & !is.na(longitude)]
zipDB  <- setDT(zipcode)[, .(zip, latitude, longitude, latmin, latmax, lonmin, lonmax)][!is.na(latitude) & !is.na(longitude)]

# Full cross product:
nrow(points) * nrow(zipDB)
#[1] 1908728721

# Area-limited cross product
cross <- zipDB[points, .(i.zip, i.latitude, i.longitude, zip, latitude, longitude),
               on = .(latmin <= latitude, lonmin <= longitude, latmax >= latitude, lonmax >= longitude)]
nrow(cross)
#[1] 18501135

# Find zip codes nearest to a point
cross[, .(i.zip, zip, dist = distHaversine(cbind(i.longitude, i.latitude), cbind(longitude, latitude)))
      ][dist == min(dist), .(dist), by = .(i.zip, zip)]
As we compared the zip codes database to itself, we could expect to get exactly the same number of points, but this is not the case because some zip codes, for example 00210, 00211, ... have the same coordinates, so we get all the combinations of them.
This takes ~20s on my tablet.
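If you only need the single closest candidate for each input point (rather than every pair tied at the overall minimum), a per-group variant of the last step could look like the sketch below, reusing the cross table built above:
# Compute the distance once as a column, then keep, for each input zip (i.zip),
# only the candidate row with the smallest Haversine distance
cross[, dist := distHaversine(cbind(i.longitude, i.latitude), cbind(longitude, latitude))]
nearest <- cross[, .SD[which.min(dist)], by = i.zip]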
I have a big dataset with a lot of geolocation data (long / lat), which I want to map depending on the frequency. I just want to show the frequencies of the cities and areas, not of each exact location. Since the geo data might vary a little bit within each city, the data has to be aggregated / clustered.
Unfortunately, just rounding the number does not work. I have already tried to create a matrix to measure the distance of each point, but my vector memory is not sufficient. Is there a simpler way?
This is what the original data looks like:
$long $lat
12.40495 52.52001
13.40233 52.50141
13.37698 52.51607
13.38886 52.51704
13.42927 52.48457
9.993682 53.55108
9.992470 53.55334
10.000654 53.55034
11.58198 48.13513
11.51450 48.13910
... ...
The result should look like this:
$long $lat $count
13.40495 52.52001 5
9.993682 53.55108 3
11.58198 48.13513 2
... ... ...
EDIT 1:
To cluster the points into one single point, a range of 25-50 km is fine.
EDIT 2:
This is what the map looks like if I don't aggregate the points. I want to prevent the circles from overlapping.
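One possible sketch (not from the original post) for the small sample above: hierarchical clustering with a 50 km cutoff. Note that distm() still builds the full pairwise distance matrix, so this only helps while that matrix fits in memory; geo stands for the data frame of coordinates:
library(geosphere)

# geo is assumed to be a data frame with columns long and lat
d  <- distm(geo[, c("long", "lat")], fun = distHaversine)  # pairwise distances in metres
cl <- hclust(as.dist(d), method = "complete")

# Points within roughly 50 km end up in the same cluster
geo$cluster <- cutree(cl, h = 50000)

# One representative coordinate per cluster plus its frequency
out <- aggregate(cbind(long, lat) ~ cluster, data = geo, FUN = mean)
out$count <- as.vector(table(geo$cluster))
out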
I have a data frame that contains wifi download bandwidth and GPS data (latitude and longitude) on a transportation system. I want to determine from the data what the average bandwidth is when the vehicle is moving north, and what it is when it is moving south.
# bandwidth and latitude values from df
bandwidth <- df$bandwidth
latitude <- df$latitude
# These both have 2800 entries

# Create empty vectors to fill with bandwidth values depending on
# whether the vehicle is moving north or south
movingnorth <- vector('numeric')
movingsouth <- vector('numeric')

# If the train is moving north, fill the movingnorth vector with data from the bandwidth vector
for (y in latitude) {
  if (latitude[y] >= latitude[y + 1]) {
    movingnorth <- c(movingnorth, received[y])
  }
}
Here, I am basically saying if the latitude value is going up, then the vehicle is moving north, and therefore enter the bandwidth value from that location into the movingnorth vector. I would expect only a portion of the values from bandwidth vector to be added to the movingnorth vector, but all 2800 values are added. What am I doing wrong here?
Take advantage of R's vectorized operations. First we use diff to find the change between successive elements of latitude
latitude_change <- diff(df$latitude)
Now we have a vector whose length is 1 less than the length of latitude. Direction happens between the measurements, so that makes sense. Let's say we won't determine direction for the first measurement. So that means if latitude_change[i] > 0, then the train is northbound at time i + 1.
df$movingnorth <- c(FALSE, latitude_change > 0)
I'm keeping this as a column of df because it's related information, so the table is the perfect place for it.
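From there, the average bandwidth in each direction can be computed with a grouped mean; a small sketch, assuming the bandwidth measurements are in df$bandwidth:
# Mean bandwidth for southbound/undetermined (FALSE) and northbound (TRUE) rows
tapply(df$bandwidth, df$movingnorth, mean)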
As lmo said, you want to use seq_along(latitude) or 1:length(latitude), which return the index instead of the actual element in latitude.
Also, you may want to double check that latitude[y+1] is correct. The current syntax assumes that the order of the latitude values in the data goes from the latest to the oldest. It is not possible to know if this is correct from the information you provide, but it may be the reverse.
As pointed out by Frank, you are growing your vector in a loop and that is bad practice (since it does not scale well and becomes very slow for large objects). Nathan Werth's answer suggests a vectorized implementation.
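For completeness, a sketch of the original loop with the index fixed via seq_along() and without growing a vector inside the loop (this assumes "moving north" means the latitude increased since the previous measurement):
# Logical flag per measurement: TRUE when latitude increased since the previous row
is_north <- rep(FALSE, length(latitude))
for (y in seq_along(latitude)[-1]) {
  is_north[y] <- latitude[y] > latitude[y - 1]
}
movingnorth <- bandwidth[is_north]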
I've got a big problem.
I've got a large raster (rows=180, columns=480, number of cells=86400)
At first I binarized it (so that there are only 1's and 0's) and then I labelled the clusters. (Cells that are 1 and connected to each other got the same label.)
Now I need to calculate all the distances between the cells that are NOT 0.
There are quite a lot, and that's my big problem.
I did this to get the coordinates of the cells I'm interested in (i.e. the positions / cell numbers of the cells that are not 0):
V <- getValues(label)
Vu <- c(1:max(V))
pos <- which(V %in% Vu)
XY <- xyFromCell(label, pos)
This works very well. So XY is a matrix, which contains all the coordinates (of cells that are not 0). But now I'm struggling. I need to calculate the distances between ALL of these coordinates. Then I have to put each one of them in one of 43 bins of distances. It's kind of like this (just an example):
0 < x < 0.2    bin 1
0.2 < x < 0.4  bin 2
When I use this:
pD=pointDistance(XY,lonlat=FALSE)
R says it's not possible to allocate vector of this size. It's getting too large.
Then I thought I could do this (create an empty data frame df or something like that and let the function pointDistance run over every single value of XY):
for (i in 1:nrow(XY)) {
  pD <- pointDistance(XY, XY[i, ], lonlat = FALSE)
  pDbin <- as.matrix(table(cut(pD, breaks = seq(0, 8.6, by = 0.2), labels = 1:43)))
  df <- cbind(df, pDbin)
  df <- apply(df, 1, FUN = function(x) sum(x))
}
It is working when I try this with e.g. the first 50 values of XY.
But when I use that for the whole XY matrix it's taking too much time. (Sometimes this XY matrix contains 10,000 xy-coordinates.)
Does anyone have an idea how to do it faster?
I don't know if this will work fast or not, but I recommend you try this:
Let's say you have a data frame with a value of 0 or 1 in each cell. To find the coordinates, all you have to do is write the code below:
cord_matrix <- which(dataframe == 1, arr.ind = TRUE)
Now you have the coordinate matrix with row and column indices.
To find the Euclidean distances, use the dist() function (see its help page). It will look like this:
dist_vector <- dist(cord_matrix)
It returns a lower-triangular distance object, which can be transformed into a vector or a symmetric matrix. Now all you have to do is calculate the bins according to your requirements.
Let me know if this works within your memory limits.
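To connect this to the 43 distance bins from the question, the dist() output can be binned with cut() and table(); a sketch, with breaks that should be adjusted to the units of your coordinates:
# Bin all pairwise distances into 43 bins of width 0.2 and count them
bin_counts <- table(cut(as.vector(dist_vector),
                        breaks = seq(0, 8.6, by = 0.2),
                        labels = 1:43))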