R: Matching closest coordinates for a large data set

I have two sets of data: the first is a list of lat/long coordinates for 2,500 sites where trees have been measured, and the second is a list of lat/long coordinates for 88 temperature monitoring sites.
I want to match each of the 2,500 sites to its nearest temperature monitoring site.
What I have so far is
distance <- geodists(lat.coord.A, long.coord.A, lat.coord.B, long.coord.B, K)
to calculate the distance between a site in data.set.A and a site in data.set.B, and I am looking into using the apply functions to get R to perform this for each of the 88 temperature sites at once.
I'm then playing with using min() to give the smallest distance from the site in data.set.A to any of those in data.set.B, but I'd rather get the coordinates of that specific site in data.set.B directly than have to work them out myself.
I'm sure this can be done relatively simply but I can't seem to get it right.
I'm pretty new to R, so any help is very much appreciated!

You are looking for something like this (using data.table).
Do a Cartesian join:
CartesianJoin <- function(X, Y)
  setkey(X[, c(k = 1, .SD)], k)[Y[, c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL]
LatLonWide <- CartesianJoin(data.set.A, data.set.B)
Then calculate the distance for each row:
LatLonWide$dist <- sapply(1:nrow(LatLonWide), function(i)
  geodists(LatLonWide$lat.coord.A[i], LatLonWide$long.coord.A[i],
           LatLonWide$lat.coord.B[i], LatLonWide$long.coord.B[i]))
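To then pull out, for each tree site, the coordinates of the closest temperature site, take the row with the minimum distance within each group. A rough sketch, assuming data.set.A carries a site identifier column (called site.id here purely for illustration) and using geosphere::distGeo() as a stand-in for geodists(), which I can't verify:
library(data.table)
library(geosphere)

## vectorised distance over the whole cross join (no sapply needed);
## distGeo() takes lon/lat matrices and returns metres
LatLonWide[, dist := distGeo(cbind(long.coord.A, lat.coord.A),
                             cbind(long.coord.B, lat.coord.B))]

## for every tree site, keep the single row with the smallest distance;
## that row carries the lat/long of the nearest temperature site
nearest <- LatLonWide[, .SD[which.min(dist)], by = site.id]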

Related

How to analyse spatial data using grid codes from a map

I would like to analyse movement data from a semi-captive animal population. We record their location every 5 minutes using a location code which corresponds to a map of the reserve we have made ourselves. Each grid square represents 100 square meters and has a letter and a number, e.g. H5 or L6 (letters correspond to columns, numbers to rows).

I would like to analyse differences in space use between three different periods of time, to answer questions such as: do the animals move around more in certain periods, or is their space use more restricted in other periods?

Can someone please give me any indication of how to go about this? I have looked into spatial analysis in RStudio but haven't come across anything that doesn't use official maps or location coordinates. I've not done this type of analysis before, so any help would be greatly appreciated! Thanks so much.
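One preparatory step a sketch can illustrate: turning the letter/number grid codes into numeric coordinates so that standard spatial tools can be used. This is only a guess at the layout (letters as columns, numbers as rows, 10 m x 10 m cells, i.e. 100 square meters each); adjust to your map:
codes <- c("H5", "L6")                                    # example grid codes
col <- match(toupper(sub("[0-9]+$", "", codes)), LETTERS) # letter part -> column index
row <- as.numeric(sub("^[A-Za-z]+", "", codes))           # number part -> row index
cell <- 10                                                # assumed cell width in metres
xy <- data.frame(code = codes,
                 x = (col - 0.5) * cell,                  # cell-centre coordinates
                 y = (row - 0.5) * cell)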

A large amount of points to create separate polygons (ArcGIS/QGIS)

[Visual example of the data: image]
I used a drone to create a DOF of a small area. During the flight, it takes a photo every 20-ish seconds (roughly every 40 metres of flight). I created a CSV file, which I converted to a point shapefile. In total, I flew 10 so-called "missions" with the drone, each with 100-200 points which are "shaped" as squares on the map. What I want now is to create a polygon shapefile from the point shapefile.
Because those points sometimes overlap, I cannot use the "Aggregate Points" task, as it's only distance-based. I want to create the polygons automatically, using some kind of script. What could help is the fact that the maximum time between two points (i.e. photos taken) within a mission is 10-20 seconds, so if the time gap is over 3 minutes, it's another "mission". Can you help with such a script, one that would quickly and automatically create as many polygons as there are missions?
Okay, I think I understand what you are trying to accomplish. Since no one replied, I am going to give it a quick shot so you have something to try.
I think the best strategy would be to:
1. Clustering algorithm: run a clustering algorithm such as DBSCAN on the timestamp dimension to classify the points into time-based groups instead of distance-based ones (since, as you said, distance-based separation is not enough to properly identify and separate the points). After that, you should have all the points classified into groups, with a group-id column. The maximum-distance (eps) parameter of the algorithm should be around 20 seconds, or even a minute (since you said the missions are separated by at least about 3 minutes).
2. Feature-based points to polygon: then run a generic Polygon_from_points(...) style tool that turns the clustered points into polygon shapes based on a specific discriminating feature (which in your case is the group id).
How does this work? This separates the groups first (time-based), and then you should be able to find a generic points-to-polygon tool that works on a feature (ArcGIS should have some).
I don't have an example dataset or any code written, but based on what you described I think it would work; hope it helps.
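For illustration only (the answer above deliberately has no code, and the question is about ArcGIS/QGIS), here is a rough sketch of the same idea in R, assuming a data frame pts with lon, lat and a POSIXct column time; all names are made up:
library(dbscan)
library(sf)
library(dplyr)

## cluster on time only: fixes less than 60 s apart end up in the same mission
cl <- dbscan(matrix(as.numeric(pts$time)), eps = 60, minPts = 3)
pts$mission <- cl$cluster            # 0 = noise, 1..k = missions

## one convex-hull polygon per mission
polys <- st_as_sf(pts, coords = c("lon", "lat"), crs = 4326) |>
  filter(mission > 0) |>
  group_by(mission) |>
  summarise(geometry = st_combine(geometry)) |>
  st_convex_hull()

st_write(polys, "missions.shp")      # hypothetical output name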

Sampling points on raster layer with specific patterns

I'm new to using R with spatial data and I don't understand how to solve my issue.
My goal is to test different sampling patterns for quantifying soil organic carbon. I have a raster layer which represents the carbon stock on a 1 m x 1 m grid.
On this raster I want to randomly choose 20 points along the diagonal of the plot (which is rectangular), with each point separated by 20 metres.
Then I would like to repeat this operation many times, and each time I would like each point to move a little bit within a certain range around the diagonal.
I'm trying the raster::select function but I don't understand how it works.
If you have any help to give me, or even just a good R package to suggest for this, I would appreciate it a lot!
Thank you,
Antoine
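A minimal sketch of the sampling scheme described above (20 points spaced 20 m apart along the plot diagonal, jittered perpendicular to it, repeated many times), using the terra package; the raster file name and the jitter range are placeholders:
library(terra)

r  <- rast("carbon_stock.tif")            # hypothetical carbon-stock raster
e  <- ext(r)
p0 <- c(xmin(e), ymin(e))                 # one corner of the plot
p1 <- c(xmax(e), ymax(e))                 # the opposite corner
u  <- (p1 - p0) / sqrt(sum((p1 - p0)^2))  # unit vector along the diagonal
v  <- c(-u[2], u[1])                      # unit vector perpendicular to it

one_draw <- function(spacing = 20, npts = 20, jitter_range = 5) {
  d   <- seq(0, by = spacing, length.out = npts)    # distances along the diagonal
  off <- runif(npts, -jitter_range, jitter_range)   # perpendicular jitter (metres)
  xy  <- cbind(p0[1] + d * u[1] + off * v[1],
               p0[2] + d * u[2] + off * v[2])
  extract(r, xy)                                    # carbon values at the sample points
}

samples <- replicate(100, one_draw(), simplify = FALSE)  # repeat the scheme many times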

Finding the nearest zipcode from a list of zipcodes

I have a list of locations with zip codes. I have another list of distribution centers (DCs) that serve these locations. Is there any way to map the nearest DC to each of these locations? I am an extremely green coder, but I have some experience with R.
I'd need more information to give you working code, but here is one approach to solving your problem:
1. Convert your zip codes to longitudes and latitudes.
2. Not sure what location data you have on your distribution centers, but you should be able to find a way to retrieve the long/lat of each of these as well.
3. For each zip code, compute the distance to each DC (using their respective longs/lats) with the haversine formula, and find the minimum of these distances. That is your solution (a sketch follows below).
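A hedged sketch of those steps, assuming the zip codes have already been looked up into two data frames, locations and dcs, each with lon/lat columns (all names here are made up for illustration):
library(geosphere)

## pairwise distances in metres: rows = locations, columns = distribution centres
dmat <- distm(as.matrix(locations[, c("lon", "lat")]),
              as.matrix(dcs[, c("lon", "lat")]),
              fun = distHaversine)

locations$nearest_dc <- dcs$name[apply(dmat, 1, which.min)]  # closest DC per zip code
locations$dist_km    <- apply(dmat, 1, min) / 1000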

Finding a density peak / cluster centrum in 2D grid / point process

I have a dataset with minute-by-minute GPS coordinates recorded by a person's cellphone, i.e. the dataset has 1440 rows of lon/lat values. Based on the data I would like a point estimate (a lon/lat value) of where the participant's home is. Let's assume that home is the single location where they spend most of their time in a given 24 h interval. Furthermore, the GPS sensor usually has quite high accuracy; however, sometimes it is completely off, resulting in gigantic outliers.
I think the best way to go about this is to treat it as a point process and use 2D density estimation to find the peak. Is there a native way to do this in R? I looked into kde2d (MASS) but it didn't really seem to do the trick. kde2d creates a 25x25 grid over the data range with density values; however, in my data the person can easily travel 100 miles or more per day, so these cells are generally far too coarse an estimate. I could use a much finer grid, but I am sure there must be a better way to get a point estimate.
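For reference, a minimal sketch of the kde2d route the question mentions, just with a much finer grid and the peak read off the density matrix (lon and lat are hypothetical coordinate vectors):
library(MASS)

dens <- kde2d(lon, lat, n = 500)                         # 500 x 500 grid instead of 25 x 25
i    <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ] # indices of the highest-density cell
home <- c(lon = dens$x[i[1]], lat = dens$y[i[2]])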
There are "time spent" functions in the trip package (I'm the author). You can create objects from the track data that understand the underlying track process over time, and simply process the points assuming straight line segments between fixes. If "home" is where the largest value pixel is, i.e. when you break up all the segments based on the time duration and sum them into cells, then it's easy to find it. A "time spent" grid from the tripGrid function is a SpatialGridDataFrame with the standard sp package classes, and a trip object can be composed of one or many tracks.
Using rgdal you can easily transform coordinates to an appropriate map projection if lon/lat is not appropriate for your extent, but it makes no difference to the grid/time-spent calculation of line segments.
There is a simple speedfilter to remove fixes that imply movement that is too fast, but that is very simplistic and can introduce new problems, in general updating or filtering tracks for unlikely movement can be very complicated. (In my experience a basic time spent gridding gets you as good an estimate as many sophisticated models that just open up new complications). The filter works with Cartesian or long/lat coordinates, using tools in sp to calculate distances (long/lat is reliable, whereas a poor map projection choice can introduce problems - over short distances like humans on land it's probably no big deal).
(The function tripGrid calculates the exact components of the straight line segments using pixellate.psp, but that detail is hidden in the implementation).
In terms of data preparation, trip is strict about a sensible sequence of times and will prevent you from creating an object if the data have duplicates, are out of order, etc. There is an example of reading data from a text file in ?trip, and a very simple example with (really) dummy data is:
library(trip)
## (really) dummy data: ten fixes one second apart, all belonging to a single id
d <- data.frame(x = 1:10, y = rnorm(10), tms = Sys.time() + 1:10, id = gl(1, 5))
coordinates(d) <- ~x + y                # promote to a SpatialPointsDataFrame (sp)
tr <- trip(d, c("tms", "id"))           # declare the date-time and id columns
g <- tripGrid(tr)                       # SpatialGridDataFrame of time spent per cell
pt <- coordinates(g)[which.max(g$z), ]  # centre of the cell with the most time spent
image(g, col = c("transparent", heat.colors(16)))
lines(tr, col = "black")
points(pt[1], pt[2], pch = "+", cex = 2)
That dummy track has no overlapping regions, but it shows that finding the max point in "time spent" is simple enough.
How about using the location that minimises the summed squared distance to all the events? This might be close to the supremum of any kernel smoothing, if my brain is working right.
If your data comprise two clusters (home and work), then I think the location will be in the biggest cluster rather than between them. It's not the same as the simple mean of the x and y coordinates.
For an uncertainty on that, jitter your data by whatever your positional uncertainty is (would be great if you had that value from the GPS, otherwise guess - 50 metres?) and recompute. Do that 100 times, do a kernel smoothing of those locations and find the 95% contour.
Not rigorous, and I need to experiment with this minimum distance/kernel supremum thing...
In response to Spacedman - I am pretty sure least squares won't work. Least squares is best known for bowing to the demands of outliers, without much weighting for things that are "nearby". This is the opposite of what is desired.
The bisquare estimator would probably work better, in my opinion, but I have never used it. I think it also requires some tuning.
It's more or less like a least-squares estimator up to a certain distance from zero, and then the weighting is constant beyond that. So once a point becomes an outlier, its penalty is constant. We don't want outliers to weigh more and more as we move away from them; we would rather weight them constantly and let the optimisation focus on better fitting the points in the vicinity of the cluster.
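To make the discussion concrete, a rough sketch of the minimiser idea using optim(), treating lon/lat as planar over a day's travel for simplicity (lon and lat are hypothetical vectors of fixes). With squared distances the optimum reduces to the coordinate-wise mean, which is the outlier sensitivity raised above; the unsquared version is the more robust geometric median:
## objective: total (optionally squared) distance from candidate point p to all fixes
cost <- function(p, xy, squared = FALSE) {
  d <- sqrt((xy[, 1] - p[1])^2 + (xy[, 2] - p[2])^2)
  if (squared) sum(d^2) else sum(d)
}

xy   <- cbind(lon, lat)                          # hypothetical matrix of GPS fixes
home <- optim(colMeans(xy), cost, xy = xy)$par   # start the search from the centroid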
