Cleaning geocode data with R

I am cleaning my dataset and I don't know how to clean the GPS data.
When I use the table function I find that the values are entered in different formats:
"547140",
"35.6997",
"251825.7959",
"251470.43",
"54/4077070001",
and "54/305495"
I don't know how to clean this variable given these great differences.
I would be thankful if you could help me or suggest a website for training.

Your main issue is standardizing the GPS data by projecting it to a coordinate system of choice. Say we have the GPS coordinates of Amsterdam in two different coordinate systems, one in Amersfoort / RD New (EPSG 28992) and one in WGS84 (EPSG 4326):
   x            y            location  espg
1: 1.207330e+05 486632.35593 amsterdam 28992
2: 4.884088e+00 52.36651     amsterdam  4326
As a reproducible dput():
dt <- structure(list(x = c(120733.012428048, 4.88408811380055),
                     y = c(486632.355933105, 52.3665054922233),
                     location = c("amsterdam", "amsterdam"),
                     espg = c(28992, 4326)),
                row.names = c(NA, -2L), class = "data.frame")
What we want to do is reproject our coordinates to one geographic coordinate system of choice. In this case I used WGS84 (EPSG 4326).
library(sf)
# here I convert the table to a spatial object, telling sf which columns contain the coordinates
dt <- st_as_sf(dt, coords = c("x", "y"))
# here I split by the different EPSG codes present
dt <- split(dt, dt$espg)
# here I loop through every individual EPSG code present in the dataset
for (i in seq_along(dt)) {
  # here I declare which coordinate system (EPSG code) the GPS data is in
  st_crs(dt[[i]]) <- unique(dt[[i]]$espg)
  # here I transform the coordinates to another projection (in this case WGS84, EPSG 4326)
  dt[[i]] <- st_transform(dt[[i]], 4326)
}
# here I bind the items of the list together
dt <- do.call(rbind, dt)
head(dt)
Simple feature collection with 2 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 4.884088 ymin: 52.36651 xmax: 4.884088 ymax: 52.36651
Geodetic CRS: WGS 84
location espg geometry
4326 amsterdam 4326 POINT (4.884088 52.36651)
28992 amsterdam 28992 POINT (4.884088 52.36651)
In the geometry column you now see that the coordinates are equal to one another.
Bottom line is that you need to know the geographic coordinate system the GPS data is in. Then you can convert your data from a table to a spatial object and transform the GPS data to a projection of choice.
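If all of your cleaned GPS data turns out to be in one known system, the pipeline collapses to a single conversion and transform. A minimal sketch, assuming a plain data.frame gps_table (hypothetical name) with x/y columns all in EPSG 28992:
gps_sf <- st_as_sf(gps_table, coords = c("x", "y"), crs = 28992) # hypothetical input table
gps_sf <- st_transform(gps_sf, 4326)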
In addition, it is always a good idea to check whether your assumption about the original EPSG code is correct, for example by plotting the data.
library(ggplot2)
library(ggspatial)
ggplot(dt) +
  annotation_map_tile() +
  geom_sf(size = 4) +
  theme(text = element_text(size = 15)) +
  facet_wrap(~espg)
In the figure below we see that the projection went well for both EPSG codes.

Related

Nearest locations to multiple routes

I'd like to get the nearest location from each of a list of routes and the distance from the route to the place. I think I can do this via sf but am not sure how. In the sample data there are 19 separate routes.
install.packages("sf")
install.packages("sfheaders")
library(sf)
routeData <- read.csv("https://www.dropbox.com/s/vtj8wvcqxj52pbl/SpainActivityRoutes.csv?dl=1")
# Convert routes to sf
sfheaders::sf_multipolygon(
  obj = routeData
  , multipolygon_id = "id"
  , x = "lon"
  , y = "lat"
)
# Read in locations
locations <- data.frame(id = c(1,2,3),
place = c('Alcudia', 'Puerto de Pollensa', 'Alaro'),
latitude = c(39.85327712, 39.9024565, 39.704459175469395),
longitude = c(3.123974802, 3.080426926, 2.7919874776545694))
Starting with the data:
routeData <- read.csv("https://www.dropbox.com/s/vtj8wvcqxj52pbl/SpainActivityRoutes.csv?dl=1")
Split on id, apply a function to create linestring objects, and join the list of linestrings using st_sfc to make a spatial vector. Assume these are "GPS" coordinates with EPSG code 4326:
routes = do.call(st_sfc, lapply(split(routeData, routeData$id),
                                function(d) st_linestring(cbind(d$lon, d$lat))))
st_crs(routes) = 4326
Convert points data frame to spatial points data frame with same coordinate system:
pts = st_as_sf(locations, coords=c("longitude","latitude"), crs=4326)
Now we can get the nearest route to each point:
> nearf = st_nearest_feature(pts, routes)
> nearf
[1] 1 15 19
So the first point is nearest to route 1, the second point to route 15, and the third point to route 19. Now we get the distances by computing the distance from each point to its nearest route line using st_distance with by_element=TRUE (otherwise it computes the distances from all points to all three routes as a matrix):
> st_distance(pts, routes[st_nearest_feature(pts, routes)], by_element=TRUE)
Units: [m]
[1] 7.888465 27.046029 44.175458
If you want the point on the route nearest to the point data then use st_nearest_points with pairwise=TRUE:
> st_nearest_points(pts, routes[st_nearest_feature(pts, routes)], pairwise=TRUE)
Geometry set for 3 features
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 2.791987 ymin: 39.70412 xmax: 3.124058 ymax: 39.90256
Geodetic CRS: WGS 84
LINESTRING (3.123975 39.85328, 3.124058 39.85331)
LINESTRING (3.080427 39.90246, 3.080143 39.90256)
LINESTRING (2.791987 39.70446, 2.792247 39.70412)
which returns 2-point lines from each test point to its nearest route. You can use functions like st_cast(..., "POINT") to split those into points and get the locations as points.
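For instance, a minimal sketch of extracting the on-route points, reusing pts, routes, and nearf from above (after casting, the vertices come out two per line and in order, so the second of each pair is the one lying on the route):
near_lines = st_nearest_points(pts, routes[nearf], pairwise = TRUE)
near_pts = st_cast(near_lines, "POINT")            # two vertices per line, interleaved
on_route = near_pts[seq(2, length(near_pts), 2)]   # keep the vertex that lies on each route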

Make a vector of coordinates to filter data within a certain area

Rookie R user here and I would greatly appreciate any help someone could give me.
My project requires me to create a vector boundary box around a city of my choice and then filter a lot of data so I only have the data relative to that area. However, it has been several years since I used RStudio and it's fair to say I remember little to nothing about the language.
I have initially used
geocode("Hereford, UK")
bbox <-c(Longitude=-2.72,Latitude=52.1)
myMap <- get_map(location = "Hereford, UK",source="google",maptype="roadmap")
I then need to create a new tibble which filters the data down to only what is relevant to that area.
I am unsure how to proceed with this, and I then need to overlay the data onto the map which I have created.
As I only have a centre point of coordinates, is it possible to create a circle with a radius of say 3 miles around the centre of my location so I can then filter this area?
Thank you all for taking the time to read my post. Cheers!
Most spatial work can now be done pretty easily using the sf package.
Example code for a similar problem is below. The comments explain most of what it does.
The difficult part may be understanding map projections (the crs). Some use units (meters, feet, etc.) and others use latitude/longitude. Which one you choose depends on what area of the globe you're working with and what you're trying to accomplish. Most web mapping uses CRS 4326, but it does not come with an easily usable distance measurement.
The map below shows points outside ~3 miles from Hereford in red, and those inside in black. The blue point is used as the center for Hereford & the buffer zone.
library(tidyverse)
library(sf)
#> Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 4.9.3
library(mapview)
set.seed(4)
#hereford approx location, ggmap requires api key
hereford <- data.frame(place = 'hereford', lat = -2.7160, lon = 52.0564) %>%
st_as_sf(coords = c('lat', 'lon')) %>% st_set_crs(4326)
#simulation of data points near-ish hereford
random_points <- data.frame(point_num = 1:20,
lat = runif(20, min = -2.8, max = -2.6),
lon = runif(20, min = 52, max = 52.1)) %>%
st_as_sf(coords = c('lat', 'lon')) %>% st_set_crs(4326) %>%st_transform(27700)
#make a buffer of ~3miles (4800m) around hereford
h_buffer <- hereford %>% st_transform(27700) %>% #change crs to one measured in meters
st_buffer(4800)
#only points inside ~3mi buffer
points_within <- random_points[st_within( random_points, h_buffer, sparse = F), ]
head(points_within)
#> Simple feature collection with 6 features and 1 field
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: 346243.2 ymin: 239070.3 xmax: 355169.8 ymax: 243011.4
#> CRS: EPSG:27700
#> point_num geometry
#> 1 1 POINT (353293.1 241673.9)
#> 3 3 POINT (349265.8 239397)
#> 4 4 POINT (349039.5 239217.7)
#> 6 6 POINT (348846.1 243011.4)
#> 7 7 POINT (355169.8 239070.3)
#> 10 10 POINT (346243.2 239690.3)
#shown in mapview
mapview(hereford, color = 'blue') +
mapview(random_points, color = 'red', legend = F, col.regions = 'red') +
mapview(h_buffer, legend = F) +
mapview(points_within, color = 'black', legend = F, col.regions = 'black')
Created on 2020-04-12 by the reprex package (v0.3.0)
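If you want the buffer to be exactly 3 miles rather than the ~4800 m approximation, you can convert the distance yourself; a small variation on the buffer step above (1 mile = 1609.344 m):
h_buffer <- hereford %>%
  st_transform(27700) %>%   #crs measured in meters
  st_buffer(3 * 1609.344)   #exactly 3 miles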

How do I plot my shape with a list of polygons with ggplot2?

It’s my first time working with spatial data. The goal of my project is to visualise the price evolution of the housing market in the city of Mechelen (Belgium). I want to visualise this colour coded over a geographic map with different neighborhoods in the city.
I received a shape file (.shp) from the city which would visualise all different neighborhoods and I’m able to import it using the sf package, but I fail to plot it using the ggplot2 package.
Please find my current code below:
library(sf)
library(ggplot2)
#WORKING PART - reading the shape file
shapefile_df <- "/filepath.shp" %>%
  st_read()
#NOT WORKING PART - plotting the shapefile
map <- ggplot() +
geom_polygon(data = shapefile_df,
aes(x = long, y = lat, group = group),
color = 'gray', fill = 'white', size = .2)
print(map)
When reading the shape file I get a 4 column dataframe with the 4th column being a list of polygons called geometry.
My question: how do I get the long and lat from this list of polygons?
Or am I completely looking from a wrong perspective?
For your reference when I enter shapefile_df$geometry RStudio responds with:
Geometry set for 12 features
geometry type: POLYGON
dimension: XY
bbox: xmin: 4.370086 ymin: 50.99116 xmax: 4.549005 ymax: 51.07861
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
First 5 geometries:
POLYGON ((4.471225 51.03026, 4.471367 51.03, 4....
POLYGON ((4.496646 51.02285, 4.496969 51.02247,...
POLYGON ((4.484093 51.01383, 4.484615 51.01353,...
POLYGON ((4.450368 51.0356, 4.450477 51.03558, ...
POLYGON ((4.439164 51.0608, 4.439563 51.06049, ...
Can anyone help out? I figured it might be useful I shared the shape file. Is there a best practice for sharing files here?
Please bear in mind that this is my first post and I’ve read a lot about asking questions correctly here, please give feedback if some of the explanation/code isn’t minimalistic enough.
You're looking at it wrong, but don't worry. It is easier than you think.
Shapefiles & sf objects are a little different from usual data.frames. geom_sf knows how to plot the points, lines, and polygons without you having to tell it exactly what to do.
To get your plot to work:
#start with a basic plot:
ggplot() +
geom_sf(data = shapefile_df)
From there you can add color, fill, size, etc. arguments.
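For example, to colour each neighborhood by a value column (avg_price is a hypothetical column name here; substitute one from your shapefile):
ggplot() +
  geom_sf(data = shapefile_df, aes(fill = avg_price),  # avg_price is hypothetical
          color = 'gray', size = .2) +
  scale_fill_viridis_c()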
It looks like your data is made up of polygons, so expect to see something like this, but with the neighborhood borders from your area.
Plot from example data from the sf package.

auto-detect coordinate reference system, based on coordinates in GPX file?

I'm building a Shiny app that color-codes GPX track logs based on the local slope at each point.
It's based extensively on https://rpubs.com/chrisbrunsdon/hiking
To calculate the "run" part of slope = rise/run, I'm converting from latitude/longitude data into an X,Y grid (in meters) with sf::st_transform. One of the arguments for that function is crs, or "coordinate reference system".
Up to now, I've been testing with GPX files I gathered near my home in southeastern Pennsylvania, so I've been using EPSG:2272 as my CRS.
To make this useful to anyone with logs from anywhere in the world, I'd like to auto-detect the most appropriate CRS based on the centroid of the points in the plotted track. Is there some canned function for doing that?
You can use the UTM projection.
Basically, retrieve the appropriate zone number and letter for the centroid, convert the track to that zone and perform the calculations.
# pseudocode (the call shown is from Python's utm package)
utm.from_latlon(51.2, 7.5)
# EASTING, NORTHING, ZONE NUMBER, ZONE LETTER
395201.3103811303, 5673135.241182375, 32, "U"
The zone letters are actually latitude bands and won't be needed if you're working with EPSG codes.
To "manually" calculate them use:
zone_num <- floor((longitude + 180) / 6) + 1 # each zone is 6 degrees wide
hemisphere <- if (latitude >= 0) "northern" else "southern"
epsg <- 32600 + zone_num # northern hemisphere series
if (hemisphere == "southern") {
  epsg <- epsg + 100 # southern series starts at 32700
}
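Wrapped into a small helper for convenience (an illustrative sketch, not a built-in sf function):
# EPSG code of the UTM zone containing a lon/lat point
lonlat_to_utm_epsg <- function(lon, lat) {
  zone_num <- floor((lon + 180) / 6) + 1
  if (lat >= 0) 32600 + zone_num else 32700 + zone_num
}
lonlat_to_utm_epsg(7.5, 51.2) # 32632, i.e. UTM zone 32 north, matching the example above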
I found how-to-get-appropriate-crs-for-a-position-specified-in-lat-lon-coordinates (for the related EPSG CRS, not the UTM zone per se) and rewrote it in R, assuming the user wants to analyze the first track segment in file gpx.file.
library(sf)
library(plotKML) # provides readGPX()
rg.result <- readGPX(gpx.file)
outer.track.list <- rg.result$tracks
inner.track.list <- outer.track.list[[1]]
track.frame <- inner.track.list[[1]]
tf.avg.lat <- mean(track.frame$lat)
tf.avg.lon <- mean(track.frame$lon)
# UTM EPSG code: 326xx in the northern hemisphere, 327xx in the southern
EPSG <- 32700 - round((45 + tf.avg.lat) / 90, 0) * 100 + round((183 + tf.avg.lon) / 6, 0)
# make a spatial frame, based on GPX's use of WGS84 (EPSG 4326)
coords <- st_as_sf(track.frame,
                   coords = c("lon", "lat"),
                   crs = 4326)
# project that according to the EPSG CRS determined above
st_transformed <- st_transform(coords$geometry, crs = EPSG)
# `xy` will be a matrix of positions on a grid, in meters
xy <- st_coordinates(st_transformed)
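From there, the "run" between successive track points mentioned in the question is just the Euclidean distance on the projected grid; a one-line sketch:
run <- sqrt(diff(xy[, "X"])^2 + diff(xy[, "Y"])^2) # meters between consecutive points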

how to merge a shapefile with a dataframe with latitude/longitude data

I am struggling with the following issue.
I have downloaded the PLUTO NYC Manhattan shapefile for the NYC tax lots from https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page
I am able to read it into sf with a simple st_read:
> mydf
Simple feature collection with 42638 features and 90 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 971045.3 ymin: 188447.4 xmax: 1010027 ymax: 259571.5
epsg (SRID): NA
proj4string: +proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0 +datum=NAD83 +units=us-ft +no_defs
First 10 features:
Borough Block Lot CD CT2010 CB2010 SchoolDist Council ZipCode FireComp PolicePrct HealthCent HealthArea
1 MN 1545 52 108 138 4000 02 5 10028 E022 19 13 3700
My problem is the following: I have a dataframe as follows
> data_frame('lat' = c(40.785091,40.785091), 'lon' = c(-73.968285, -73.968285))
# A tibble: 2 x 2
lat lon
<dbl> <dbl>
1 40.785091 -73.968285
2 40.785091 -73.968285
I would like to merge this data with the mydf dataframe above, so that I can count how many latitude/longitude observations I have within each tax lot (remember, mydf is at the tax-lot granularity) and plot the corresponding map. I need to do so using sf.
In essence something similar to
pol <- mydf %>% select(SchoolDist)
plot(pol)
but where the counts for each tax lot come from counting how many points in my latitude/longitude dataframe fall into them.
Of course, in my small example I just have 2 points in the same tax lot, so that would just highlight one single tax lot in the whole area. My real data contains a lot more points.
I think there is an easy way to do it, but I was not able to find it.
Thanks!
This is how I would do it with arbitrary polygon and point data. I wouldn't merge the two and instead just use a geometry predicate to get the counts that you want. Here we:
Use the built in nc dataset and transform to 3857 crs, which is projected rather than lat-long (avoids a warning in st_contains)
Create 1000 random points within the bounding box of nc, using st_bbox and runif. Note that st_as_sf can turn a data.frame with lat long columns into sf points.
Use lengths(st_contains(polygons, points)) to get the counts of points per polygon. sgbp objects created by a geometry predicate are basically "for each geometry in sf x, which indices of geometries in sf y satisfy the predicate". So lengths() effectively gives the number of points that satisfy the predicate for each geometry, in this case the number of points contained within each polygon.
Once the counts are in the sf object as a column, we can just select and plot them with the plot.sf method.
For your data, simply replace nc with mydf and leave out the call to tibble, instead use your data.frame with the right lat long pairs.
library(tidyverse)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, proj.4 4.9.3
nc <- system.file("shape/nc.shp", package="sf") %>%
read_sf() %>%
st_transform(3857)
set.seed(1000)
points <- tibble(
x = runif(1000, min = st_bbox(nc)[1], max = st_bbox(nc)[3]),
y = runif(1000, min = st_bbox(nc)[2], max = st_bbox(nc)[4])
) %>%
st_as_sf(coords = c("x", "y"), crs = 3857)
plot(nc$geometry)
plot(points$geometry, add = TRUE)
nc %>%
mutate(pt_count = lengths(st_contains(nc, points))) %>%
select(pt_count) %>%
plot()
Created on 2018-05-02 by the reprex package (v0.2.0).
I tried this on your data, but the intersection is empty for both sets of points you provided. However, the code should work.
EDIT: Simplified group_by + mutate with add_count:
mydf = st_read("MN_Dcp_Mappinglot.shp")
xydf = data.frame(lat=c(40.758896,40.758896), lon=c(-73.985130, -73.985130))
xysf = st_as_sf(xydf, coords=c('lon', 'lat'), crs=st_crs(mydf))
## NB: make sure to st_transform both to common CRS, as Calum You suggests
xysf %>%
sf::st_intersection(mydf) %>%
dplyr::add_count(LOT)
Reproducible example:
nc = sf::st_read(system.file("shape/nc.shp", package="sf"))
ncxy = sf::st_as_sf(data.frame(lon=c(-80, -80.1, -82), lat=c(35.5, 35.5, 35.5)),
coords=c('lon', 'lat'), crs=st_crs(nc))
ncxy = ncxy %>%
sf::st_intersection(nc) %>%
dplyr::add_count(FIPS)
## a better approach
ncxy = ncxy %>%
sf::st_join(nc, join=st_intersects) %>%
dplyr::add_count(FIPS)
The new column n includes the total number of points per FIPS code.
ncxy %>% dplyr::group_by(FIPS) %>% dplyr::distinct(n)
Warning message:
although coordinates are longitude/latitude, st_intersects assumes that they are planar
# A tibble: 2 x 2
# Groups:   FIPS [2]
  FIPS       n
  <fctr> <int>
1 37123      2
2 37161      1
I'm not sure why your data results in an empty intersection, but since the code works on the example above, there must be a separate issue.
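One likely culprit is a CRS mismatch: the points are longitude/latitude, while the PLUTO shapefile is in a projected CRS (NAD83 state plane, US feet), so stamping st_crs(mydf) onto raw lon/lat values places them nowhere near the polygons. A sketch of the fix is to declare the points as EPSG 4326 first and then transform:
xysf = st_as_sf(xydf, coords = c('lon', 'lat'), crs = 4326) %>%
  sf::st_transform(st_crs(mydf))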
HT: st_join approach from this answer.
