Nearest weather station to each zip code in large dataset? (R)

I'm looking for an efficient way to link each record in a large dataset to its nearest NOAA weather station. The dataset contains 9-digit zip codes, and the NOAA weather stations have latitude/longitude coordinates. Does anyone have tips on the best way to do this? Thanks!
EDIT: updating with the code that worked, in case anyone else is looking to find the nearest NOAA weather station to a set of zip codes, and in case there are suggestions for better ways to do this.
The code is based on that provided in this question: Finding nearest neighbour (log, lat), then the next closest neighbor, and so on for all points between two datasets in R.
temp_stations is downloaded from https://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/station-inventories/temp-inventory.txt (the weather stations used in developing the temperature dataset).
zipcode is a package that contains a dataset with latitude/longitude for each zip code in the US.
install.packages("zipcode")
library(zipcode)
data(zipcode)

# prime.zips is a subset of "zipcode" containing just the zip codes in my
# original dataset; running the code below on the whole zipcode dataset
# crashed R on my computer.

install.packages("geosphere")
library(geosphere)

# Distance matrix: one row per zip code, one column per weather station
# (columns 3 and 2 of temp_stations hold longitude and latitude)
mat <- distm(prime.zips[, c('longitude', 'latitude')],
             temp_stations[, c(3, 2)], fun = distGeo)

# Assign each zip code the station id with the shortest distance in its row
prime.zips$nearest.station <- temp_stations$station.id[apply(mat, 1, which.min)]
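If the full distance matrix is too big (as with the whole zipcode dataset mentioned above), one workaround is to process the zip codes in chunks so only a slice of the matrix exists at any time. A minimal sketch, assuming the same zipcode/temp_stations layout as above; the chunk size is an arbitrary choice:

chunk_size <- 1000
nearest <- integer(nrow(zipcode))
for (start in seq(1, nrow(zipcode), by = chunk_size)) {
  end <- min(start + chunk_size - 1, nrow(zipcode))
  # Distance matrix for this chunk only: at most chunk_size rows
  mat_chunk <- distm(zipcode[start:end, c('longitude', 'latitude')],
                     temp_stations[, c(3, 2)], fun = distGeo)
  nearest[start:end] <- apply(mat_chunk, 1, which.min)
}
zipcode$nearest.station <- temp_stations$station.id[nearest]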

Related

How to create a column by setting a condition

I am currently working on a dataset of different firms. I have each firm's longitude and latitude, and I want to find each firm's city location using R.
For example, I found that Shanghai's longitude and latitude ranges are 120.852326~122.118227 and 30.691701~31.874634 respectively.
I first want to create a column named "city", then check whether each firm's longitude and latitude fall within Shanghai's ranges. If yes, R should put "Shanghai" in the "city" column; if not, it should remain NA.
In my data frame the longitude and latitude variables are named "longitude" and "latitude".
I am not sure how to write this code; I am really struggling at the beginning, so any help would be highly appreciated!
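A minimal sketch of one way to do this, assuming the data frame is called firms (a hypothetical name) with the numeric columns longitude and latitude named in the question:

# Tag firms inside Shanghai's bounding box; everything else stays NA
firms$city <- ifelse(
  firms$longitude >= 120.852326 & firms$longitude <= 122.118227 &
    firms$latitude >= 30.691701 & firms$latitude <= 31.874634,
  "Shanghai",
  NA_character_
)

Further cities can be added the same way, e.g. with dplyr::case_when(), one condition per bounding box.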

Perform join using latitude and longitude in R

I have an Excel spreadsheet with the latitude and longitude of bike docking stations.
I have a shapefile in R (cb_2018_17_bg_500k.shp) that has a GEOID (12-digit FIPS code) column and a column labelled geometry. The values in this column are POLYGON((longitude, latitude)).
I am trying to add a column titled FIPS to the spreadsheet, so I need to somehow join the latitude and longitude to the GEOID column in the shapefile.
I am a novice when it comes to R.
Any advice will be much appreciated.
Rich
So far, I have only managed to load the shapefile into R.
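A hedged sketch using the sf package: read both files, turn the station coordinates into points, and spatially join them to the polygons. The spreadsheet file name stations.xlsx and its column names latitude/longitude are assumptions:

library(sf)
library(readxl)

stations <- read_excel("stations.xlsx")  # hypothetical file name
blocks <- st_read("cb_2018_17_bg_500k.shp")

# Convert station coordinates to points (WGS84), then match the shapefile's CRS
pts <- st_as_sf(stations, coords = c("longitude", "latitude"), crs = 4326)
pts <- st_transform(pts, st_crs(blocks))

# Spatial join: attach the GEOID of the polygon each point falls in
# (assumes each station falls in at most one block group)
stations$FIPS <- st_join(pts, blocks["GEOID"])$GEOID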

Match coordinates in dataset with nearest neighbor more efficiently - and retain names?

I'm trying to match each business with the nearest business within the same dataset. The dataset consists of the business name, coordinates, and star rating (out of 5). There is a row for every business, per quarter, per year (about 5 years). While I've figured this out for simply the nearest business, I need to do this ~100 more times based on different criteria (splitting into different quarters for each year, matching neighbors with the same ratings, etc.).
Right now all I can think of doing is splitting the dataset to include only the businesses I need based on the criteria and then matching like that, but doing it 100 times and then rejoining the data sounds like a terrible idea. I'm not too experienced with spatial work in R, so any ideas or help would be amazing.
Similarly, the existing code I have (below) gives me the ID # of the nearest neighbor, but not the name. Is there a way to match neighbors while retaining columns other than just the ID #?
The code I already have is below:
library(sp)
library(sf)

# Convert the data frame to a spatial object, then to sf
sp.data <- data
coordinates(sp.data) <- ~lon+lat
sf.data <- st_as_sf(sp.data)

# Full pairwise distance matrix; order each row's neighbours by distance
dist_m <- st_distance(sf.data, sf.data)
index <- apply(dist_m, 1, order)

# Drop the first entry (each point is its own nearest neighbour at distance 0),
# then keep the nearest true neighbour for each point
index <- index[2:nrow(index), ]
index <- index[1, ]
index2 <- data.frame(t(data.frame(index)))
This then provides me with the ID# of the nearest neighbor. As you can see, running this over and over again with different data (split from the original data based on rating criteria and quarter, etc) isn't very efficient. Any help is appreciated. Thank you!
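One way to retain the name (or any other column) is to keep the nearest-neighbour index as a plain vector and use it to subscript the original data frame, and rather than splitting manually 100 times, the same logic can run once per group. A hedged sketch, assuming columns name, rating, year, and quarter (hypothetical names) alongside the lon/lat columns used above:

library(sf)
library(dplyr)

nearest_in_group <- function(df) {
  pts <- st_as_sf(df, coords = c("lon", "lat"), crs = 4326)
  d <- st_distance(pts, pts)
  # order()[2] skips the point itself (distance 0); NA for single-row groups
  idx <- apply(d, 1, function(row) order(as.numeric(row))[2])
  df$nearest_name <- df$name[idx]
  df$nearest_rating <- df$rating[idx]
  df
}

# Run the matching separately within every year/quarter/rating combination
result <- data %>%
  group_split(year, quarter, rating) %>%
  lapply(nearest_in_group) %>%
  bind_rows()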

Any way to exclude certain periods of weather data from a multi-layer raster grid in R?

I have 16 years of daily mean temperature gridded data in netCDF format; the file is quite big (about 3 GB). Initially, I used the raster package to load the original gridded data into a RasterStack object.
I am wondering how I can exclude the weather data whose dates do not fall in the range I am interested in. More specifically, I want to use only 5 years of weather data out of the 16 years I have. For example, the time span of my original gridded data runs from 1980-01-01 to 1995-12-31, and I want to keep only the temperature data from 1980-01-01 to 1984-12-31. How can I filter the layers I want out of a multi-layer raster in R? Any possible idea to make this happen?
reproducible example:
library(raster)
library(lubridate)  # for ymd() and days()

# One 0.25-degree grid filled with random temperatures, stacked into
# 5844 daily layers (roughly 16 years from 1980-01-01)
r <- raster(xmn = 5.75, xmx = 15, ymn = 47.25, ymx = 55, res = c(0.25, 0.25))
tempDat <- do.call(stack, lapply(1:5844, function(i)
  setValues(r, round(runif(n = ncell(r), min = -4, max = 23)))))
names(tempDat) <- paste0('X', gsub('-', '.', ymd('1980.01.01') + days(1:5844)))
Update:
If there are other handy tools that can chunk a netCDF file easily, I would like to know about them. Any fast way to filter the daily mean temperature data I want out of the multi-layer raster grid will work for me. Thanks.
desired output:
I only want to keep the daily mean temperature data from 1980-01-01 to 1984-12-31.
I figured out my own way to answer this question:
I used raster::getZ() to list all the layer dates and grep to find the indices of the period I am interested in. Here is the solution:
library(raster)
library(ncdf4)

# Load the netCDF file as a RasterBrick
(tg <- brick("C:\\tn_0.25deg_reg_2018.nc"))

# List each layer's date and locate the indices of the start and end dates
tg_date <- getZ(tg)
grep("2018-01-01", tg_date)
grep("2018-05-31", tg_date)

# Subset the first 150 layers (2018-01-01 to 2018-05-31) and carry the dates over
tg_5months <- subset(tg, 1:150)
tg_5months@z$Date <- tg@z$Date[1:150]
and it is done nicely.
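The same idea works for the reproducible stack above, where the dates live in the layer names rather than in a z slot. A small sketch under that assumption:

# Recover dates from names like "X1980.01.02", then keep only 1980-1984 layers
dates <- as.Date(gsub('^X', '', gsub('\\.', '-', names(tempDat))))
keep <- which(dates >= as.Date("1980-01-01") & dates <= as.Date("1984-12-31"))
tempDat_5yr <- raster::subset(tempDat, keep)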

Given a vector of coordinates, identify the polygon from a shapefile it falls into

I have my polygons stored in a SpatialPolygonsDataFrame and my coordinates in a data frame.
The output I want is to just have an additional column on my data frame that tags the OBJECTID (id of the polygon from the shapefile) that the coordinates fall into.
My problem is kind of the same as this one, but its output is a little different. It's also slow: tagging just 4 coordinates took more than 5 minutes, and I'm going to be tagging 16k coordinates, so is it possible to do this faster?
The current methods I know about wouldn't do exactly that (i.e., produce one polygon id per coordinate), because they're generalized for the case where one point is contained in multiple (overlapping) polygons.
See sp::over(), which used to be called overlay().
Example (sr is a SpatialPolygons object; meuse is a dataset bundled with sp):
over(sr, geometry(meuse), returnList = TRUE)
over(sr, meuse, returnList = TRUE)
Possible duplicates (it's hard to tell without seeing your example data):
Extracting points with polygon in R
Intersecting Points and Polygons in R
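For the asker's setup specifically, a hedged sketch using sp::over(); polys (the SpatialPolygonsDataFrame) and coords_df with its x/y columns are hypothetical names:

library(sp)

# Build points in the same CRS as the polygons
pts <- SpatialPoints(coords_df[, c("x", "y")],
                     proj4string = CRS(proj4string(polys)))

# over() returns one row of polygon attributes per point (NA if no polygon;
# if polygons overlap, only the first match is kept)
coords_df$OBJECTID <- over(pts, polys)$OBJECTID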
