Join data frame coordinates by shapefile regions (aka Join Attributes by Location) in R

I have a large data set, loaded in R as a data.frame. It contains observations associated with coordinate points (lat/lon).
I also have a shape file of North America.
In the empty (NA-filled) column in my data frame, labelled BCR, I want to insert the name of the region that each coordinate falls into according to the shapefile.
I know how to do this in QGIS using Vector > Data Management Tools > Join Attributes by Location.
The shapefile can be downloaded by clicking HERE.
My data, right now, looks like this (a sample):
LATITUDE LONGITUDE Year EFF n St PJ day BCR
50.406752 -104.613 2009 1 0 SK 90 2 NA
50.40678 -104.61256 2009 2 0 SK 120 3 NA
50.40678 -104.61256 2009 2 1 SK 136 2 NA
50.40678 -104.61256 2009 3 2 SK 149 4 NA
43.0026385 -79.2900467 2009 2 0 ON 112 3 NA
43.0026385 -79.2900467 2009 2 1 ON 122 3 NA
But I want it to look like this:
LATITUDE LONGITUDE Year EFF n St PJ day BCR
50.406752 -104.613 2009 1 0 SK 90 2 Prairie Potholes
50.40678 -104.61256 2009 2 0 SK 120 3 Prairie Potholes
50.40678 -104.61256 2009 2 1 SK 136 2 Prairie Potholes
50.40678 -104.61256 2009 3 2 SK 149 4 Prairie Potholes
43.0026385 -79.2900467 2009 2 0 ON 112 3 Lower Great Lakes/St.Lawrence Plain
43.0026385 -79.2900467 2009 2 1 ON 122 3 Lower Great Lakes/St.Lawrence Plain
Notice the BCR column is now filled with the appropriate BCR region name.
My code so far is just importing and formatting the data and shapefile:
library(rgdal)
library(proj4)
library(sp)
library(raster)
# PFW data, full 2.5m observations
df = read.csv("MyData.csv")
# Cleaning out empty coordinate data
pfw = df[(df$LATITUDE != 0) & (df$LONGITUDE != 0) & (!is.na(df$LATITUDE)) & (!is.na(df$LONGITUDE)),]
# Creating a new column to be filled with associated Bird Conservation Regions
pfw["BCR"] = NA
# Making a duplicate data frame to conserve data
toSPDF = pfw
# Ensuring spatial formatting
#coordinates(toSPDF) = ~LATITUDE + LONGITUDE
SPDF <- SpatialPointsDataFrame(toSPDF[, c("LONGITUDE", "LATITUDE")],
                               data = toSPDF,
                               proj4string = CRS("+init=epsg:4326"))
# BCR shape file, no state borders
shp = shapefile("C:/Users/User1/Desktop/BCR/BCR_Terrestrial_master_International.shx")
spPoly = spTransform(shp, CRS("+init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"))
# Check
isTRUE(proj4string(spPoly) == proj4string(SPDF))
# Trying to join attributes by location
#try1 = point.in.polygon(spPoly, SPDF) # Sounds good doesn't work
#a.data <- over(SPDF, spPoly[,"BCRNAME"]) # Error: cannot allocate vector of size 204.7 Mb

I think you want to do a spatial query with points and polygons, that is, to assign polygon attributes to the corresponding points. You can do that like this:
Example data
library(terra)
f <- system.file("ex/lux.shp", package="terra")
polygons <- vect(f)
points <- spatSample(polygons, 10)
Solution
e <- extract(polygons, points)
e
# id.y ID_1 NAME_1 ID_2 NAME_2 AREA POP
#1 1 3 Luxembourg 9 Esch-sur-Alzette 251 176820
#2 2 3 Luxembourg 9 Esch-sur-Alzette 251 176820
#3 3 2 Grevenmacher 6 Echternach 188 18899
#4 4 1 Diekirch 2 Diekirch 218 32543
#5 5 3 Luxembourg 9 Esch-sur-Alzette 251 176820
#6 6 1 Diekirch 4 Vianden 76 5163
#7 7 3 Luxembourg 11 Mersch 233 32112
#8 8 2 Grevenmacher 7 Remich 129 22366
#9 9 1 Diekirch 3 Redange 259 18664
#10 10 3 Luxembourg 9 Esch-sur-Alzette 251 176820
With the older spatial packages you can use raster::extract or sp::over.
Example data:
library(raster)
pols <- shapefile(system.file("external/lux.shp", package="raster"))
set.seed(20180121)
pts <- data.frame(coordinates(spsample(pols, 5, 'random')), name=letters[1:5])
plot(pols); points(pts)
Solution:
e <- extract(pols, pts[, c('x', 'y')])
pts$BCR <- e$NAME_2
pts
# x y name BCR
#1 6.009390 49.98333 a Wiltz
#2 5.766407 49.85188 b Redange
#3 6.268405 49.62585 c Luxembourg
#4 6.123015 49.56486 d Luxembourg
#5 5.911638 49.53957 e Esch-sur-Alzette
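Applied to the data in the question, the terra workflow might look like the sketch below. The file paths and the BCRNAME attribute are taken from the question itself, so treat this as an untested outline:
# Sketch: the terra workflow applied to the question's data
# (paths and the BCRNAME column come from the question; adjust as needed)
library(terra)
pfw <- read.csv("MyData.csv")
pfw <- pfw[pfw$LATITUDE != 0 & pfw$LONGITUDE != 0 &
             !is.na(pfw$LATITUDE) & !is.na(pfw$LONGITUDE), ]
pts <- vect(pfw, geom = c("LONGITUDE", "LATITUDE"), crs = "EPSG:4326")
bcr <- vect("C:/Users/User1/Desktop/BCR/BCR_Terrestrial_master_International.shp")
bcr <- project(bcr, "EPSG:4326")
pfw$BCR <- extract(bcr, pts)$BCRNAME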

Related

Chinese province coordinates in R

Does anyone know how to access the coordinates by region for China? The code below shows the same thing for US states; I am looking for the equivalent for China. Many thanks in advance.
require(maps)
states_map <- map_data("state")
You may want to check the following approach:
# load packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(osmextract)
# get polygons in china
poly_china <- openstreetmap_fr_zones[which(openstreetmap_fr_zones$parent == "china"), ]
# extract the coords and save them in a data.frame
# (you may want to keep the data in matrix format for better performance)
poly_china_coords <- as.data.frame(st_coordinates(poly_china))
# extract the region name
my_times <- vapply(st_geometry(poly_china), function(x) nrow(st_coordinates(x)), numeric(1))
poly_china_coords$region_name <- rep(poly_china$name, times = my_times)
# result
head(poly_china_coords)
#> X Y L1 L2 L3 region_name
#> 1 114.875 32.960 1 1 1 Anhui
#> 2 114.860 32.970 1 1 1 Anhui
#> 3 114.870 33.025 1 1 1 Anhui
#> 4 114.880 33.035 1 1 1 Anhui
#> 5 114.895 33.035 1 1 1 Anhui
#> 6 114.890 33.060 1 1 1 Anhui
Created on 2020-11-06 by the reprex package (v0.3.0)
You can install sf from CRAN and you can install osmextract as follows:
install.packages("remotes")
remotes::install_github("ITSLeeds/osmextract")
The data are stored using EPSG:4326, so X = longitude and Y = latitude.
Is this what you are looking for?
map_data("world",region="China")
long lat group order region subregion
1 110.8888 19.99194 1 1 China 1
2 110.9383 19.94756 1 2 China 1
3 110.9707 19.88330 1 3 China 1
4 110.9977 19.76470 1 4 China 1
5 111.0137 19.65547 1 5 China 1
6 110.9127 19.58608 1 6 China 1
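If you need province-level polygons rather than the national outline, GADM level-1 data is another option. This is my own suggestion (mirroring the getData() usage in the Guinea question further down, and the coordinate extraction above), not part of the original answers:
# Sketch: province coordinates from GADM level-1 polygons (my suggestion)
library(raster)
library(sf)
china <- getData(name = "GADM", country = "CHN", level = 1)
china_sf <- st_as_sf(china)
china_coords <- as.data.frame(st_coordinates(china_sf))
n_per_region <- vapply(st_geometry(china_sf),
                       function(x) nrow(st_coordinates(x)), numeric(1))
china_coords$region_name <- rep(china_sf$NAME_1, times = n_per_region)
head(china_coords)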

Calculating distance between two variables and generating new variable

I would like to create a variable called spill: for each firm, the sum over the other firms of the distance between their investment vectors, each multiplied by the other firm's stock value. For example, consider
firm us euro asia africa stock year
A 1 4 3 5 46 2001
A 2 0 1 3 889 2002
B 2 3 1 1 343 2001
B 0 2 1 3 43 2002
C 1 3 4 2 345 2001
I would like to create a vector that takes the distance between two firms at time t and generates the spill variable. For example, take firm A in 2001: the cosine distance between A (1,4,3,5) and B (2,3,1,1), i.e. the similarity between their investments in us, euro, asia and africa, is 0.2045883, which is multiplied by B's stock of 343. The distance between A and C in 2001 is 0.1052075, multiplied by C's stock of 345. Hence the spill variable for firm A in 2001 is 0.2045883 * 343 + 0.1052075 * 345 = 106.4704.
I want to get a table including spill like this
firm us euro asia africa stock year spill
A 1 4 3 5 46 2001 106.4704
A 2 0 1 3 889 2002
B 2 3 1 1 343 2001
B 0 2 1 3 43 2002
C 1 3 4 2 345 2001
Can anyone please advise?
Here is the Stata code [https://www.statalist.org/forums/forum/general-stata-discussion/general/1409182-calculating-distance-between-two-variables-and-generating-new-variable]. I have about 3,000 firms and 30 years; it works, but runs very slowly.
dt <- data.frame(id=c("A","A","B","B","C"),us=c(1,2,2,0,1),euro=c(4,0,3,2,3),asia=c(3,1,1,1,4),africa=c(5,3,1,3,2),stock=c(46,889,343,43,345),year=c(2001,2002,2001,2002,2001))
Given the minimal info on how to calculate the similarity distance, I've used the cosine-distance formula (1 - x.y / (|x||y|)) from Find cosine similarity between two arrays; it may return different numbers than yours but should give the same resulting info.
I split the data by year so we can compare the unique ids, then take those individual lists and use lapply to run a for loop comparing all possibilities.
dt <- data.frame(id=c("A","A","B","B","C"), us=c(1,2,2,0,1),euro=c(4,0,3,2,3),asia=c(3,1,1,1,4),africa=c(5,3,1,3,2),stock=c(46,889,343,43,345),year=c(2001,2002,2001,2002,2001))
geo <- c("us","euro","asia","africa")
s <- lapply(split(dt, dt$year), function(a) {
  n <- nrow(a)
  for (i in 1:n) {
    csim <- rep(0, n)  # reset the cosine-distance * stock results
    for (j in 1:n) {
      x <- unlist(a[i, geo])
      y <- unlist(a[j, geo])
      csim[j] <- (1 - (x %*% y / sqrt(x %*% x * y %*% y))) * a[j, "stock"]
    }
    a$spill[i] <- sum(csim)
  }
  a
})
do.call(rbind, s)
# id us euro asia africa stock year spill
#2001.1 A 1 4 3 5 46 2001 106.47039
#2001.3 B 2 3 1 1 343 2001 77.93231
#2001.5 C 1 3 4 2 345 2001 72.96357
#2002.2 A 2 0 1 3 889 2002 12.28571
#2002.4 B 0 2 1 3 43 2002 254.00000
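With 3,000 firms the double loop may be slow; a vectorized variant of my own (not from the original answer, and untested at scale) computes each year's pairwise cosine distances in one matrix product:
# Sketch: vectorized per-year spill (my own variant of the loop above)
geo <- c("us", "euro", "asia", "africa")
spill_year <- function(a) {
  m <- as.matrix(a[, geo])
  m <- m / sqrt(rowSums(m^2))                   # row-normalize the vectors
  sim <- tcrossprod(m)                          # pairwise cosine similarities
  a$spill <- as.vector((1 - sim) %*% a$stock)   # distance-weighted stock sums
  a
}
do.call(rbind, lapply(split(dt, dt$year), spill_year))
The self-comparison drops out on its own because the cosine distance of a row to itself is zero, matching the inner loop above.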

Iteration for time series data, using purrr

I have a bunch of time series data stacked on top of one another in a data frame; one series for each region in a country. I'd like to apply the seas() function (from the seasonal package) to each series, iteratively, to make the series seasonally adjusted. To do this, I first have to convert the series to a ts class. I'm struggling to do all this using purrr.
Here's a minimum worked example:
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
                 quarter = rep(1:20, 10),
                 var = sample(5:200, 200, replace = TRUE))
For each region (indexed by a number) I'd like to perform the following operations. Here's the first region as an example:
tem1 <- df %>% filter(region==1)
tem2 <- ts(data = tem1$var, frequency = 4, start=c(1990,1))
tem3 <- seas(tem2)
tem4 <- as.data.frame(tem3$data)
I'd then like to stack the output (ie. the multiple tem4 data frames, one for each region), along with the region and quarter identifiers.
So, the start of the output for region 1 would be this:
final seasonaladj trend irregular region quarter
1 27 27 96.95 -67.97279 1 1
2 126 126 96.95 27.87381 1 2
3 124 124 96.95 27.10823 1 3
4 127 127 96.95 30.55075 1 4
5 173 173 96.95 75.01355 1 5
6 130 130 96.95 32.10672 1 6
The data for region 2 would be below this etc.
I started with the following but without luck so far. Basically, I'm struggling to get the time series into the tibble:
seas.adjusted <- df %>%
  group_by(region) %>%
  mutate(data.ts = map(.x = data$var,
                       .f = as.ts,
                       start = 1990,
                       freq = 4))
I don't know much about the seasonal adjustment part, so there may be things I missed, but I can help with moving your calculations into a map-friendly function.
After grouping by region, you can nest the data so there's a nested data frame for each region. Then you can run essentially the same code as you had, but inside a function in map. Unnesting the resulting column gives you a long-shaped data frame of adjustments.
Like I said, I don't have the expertise to know whether those last two columns having NAs is expected or not.
Edit: Based on @wibeasley's question about retaining the quarter column, I'm adding a mutate that adds a column of the quarters listed in the nested data frame.
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
                 quarter = rep(1:20, 10),
                 var = sample(5:200, 200, replace = TRUE))
df %>%
  group_by(region) %>%
  nest() %>%
  mutate(data.ts = map(data, function(x) {
    tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
    tem3 <- seas(tem2)
    as.data.frame(tem3$data) %>%
      mutate(quarter = x$quarter)
  })) %>%
  unnest(data.ts)
#> # A tibble: 200 x 8
#> region final seasonaladj trend irregular quarter seasonal adjustfac
#> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 27 27 97.0 -68.0 1 NA NA
#> 2 1 126 126 97.0 27.9 2 NA NA
#> 3 1 124 124 97.0 27.1 3 NA NA
#> 4 1 127 127 97.0 30.6 4 NA NA
#> 5 1 173 173 97.0 75.0 5 NA NA
#> 6 1 130 130 97.0 32.1 6 NA NA
#> 7 1 6 6 97.0 -89.0 7 NA NA
#> 8 1 50 50 97.0 -46.5 8 NA NA
#> 9 1 135 135 97.0 36.7 9 NA NA
#> 10 1 105 105 97.0 8.81 10 NA NA
#> # ... with 190 more rows
I also gave a bit more thought to doing this without nesting, and instead tried doing it with a split. Passing that list of data frames into imap_dfr lets me take each split piece of the data frame and its name (in this case, the value of region), then return everything rbinded back together into one data frame. I sometimes shy away from nested data just because I have trouble seeing what's going on, so this is an alternative that is maybe more transparent.
df %>%
  split(.$region) %>%
  imap_dfr(function(x, reg) {
    tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
    tem3 <- seas(tem2)
    as.data.frame(tem3$data) %>%
      mutate(region = reg, quarter = x$quarter)
  }) %>%
  select(region, quarter, everything()) %>%
  head()
#> region quarter final seasonaladj trend irregular seasonal adjustfac
#> 1 1 1 27 27 96.95 -67.97274 NA NA
#> 2 1 2 126 126 96.95 27.87378 NA NA
#> 3 1 3 124 124 96.95 27.10823 NA NA
#> 4 1 4 127 127 96.95 30.55077 NA NA
#> 5 1 5 173 173 96.95 75.01353 NA NA
#> 6 1 6 130 130 96.95 32.10669 NA NA
Created on 2018-08-12 by the reprex package (v0.2.0).
I put all the action inside of f(), and then called it with purrr::map_df(). The re-inclusion of quarter is a hack.
f <- function(.region) {
  d <- df %>%
    dplyr::filter(region == .region)
  y <- d %>%
    dplyr::pull(var) %>%
    ts(frequency = 4, start = c(1990, 1)) %>%
    seas()
  y$data %>%
    as.data.frame() %>%
    # dplyr::select(-seasonal, -adjustfac) %>%
    dplyr::mutate(
      quarter = d$quarter
    )
}
purrr::map_df(1:10, f, .id = "region")
results:
region final seasonaladj trend irregular quarter seasonal adjustfac
1 1 27.00000 27.00000 96.95000 -6.797279e+01 1 NA NA
2 1 126.00000 126.00000 96.95000 2.787381e+01 2 NA NA
3 1 124.00000 124.00000 96.95000 2.710823e+01 3 NA NA
4 1 127.00000 127.00000 96.95000 3.055075e+01 4 NA NA
5 1 173.00000 173.00000 96.95000 7.501355e+01 5 NA NA
6 1 130.00000 130.00000 96.95000 3.210672e+01 6 NA NA
7 1 6.00000 6.00000 96.95000 -8.899356e+01 7 NA NA
8 1 50.00000 50.00000 96.95000 -4.647254e+01 8 NA NA
9 1 135.00000 135.00000 96.95000 3.671077e+01 9 NA NA
10 1 105.00000 105.00000 96.95000 8.806955e+00 10 NA NA
...
96 5 55.01724 55.01724 60.25848 9.130207e-01 16 1.9084928 1.9084928
97 5 60.21549 60.21549 59.43828 1.013076e+00 17 1.0462424 1.0462424
98 5 58.30626 58.30626 58.87065 9.904130e-01 18 0.1715082 0.1715082
99 5 61.68175 61.68175 58.07827 1.062045e+00 19 1.0537962 1.0537962
100 5 59.30138 59.30138 56.70798 1.045733e+00 20 2.5294523 2.5294523
...
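For completeness, the same idea can be written with dplyr's group_modify; this is my own sketch rather than anything from the answers above, and assumes a reasonably recent dplyr:
# Sketch: per-region seasonal adjustment via group_modify (my own variant)
df %>%
  group_by(region) %>%
  group_modify(~ {
    adj <- seas(ts(.x$var, frequency = 4, start = c(1990, 1)))
    as.data.frame(adj$data) %>%
      mutate(quarter = .x$quarter)
  }) %>%
  ungroup()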

Finding coastal and international borders from a shapefile in R

I want to automatically create two variables from a shapefile: 1. a dummy indicator for whether a region has an international border, and 2. a dummy indicator for whether a region has a coastal border.
For example, for Guinea, variable 1 would be the regions with red dots in the map below, and variable 2 those with blue dots (I marked these by eye).
library(raster)
sd0 <- getData(name = "GADM", country = "GIN", level = 2)
plot(sd0)
There does not seem to be any information in the @data slot for these types of characteristics:
head(sd0@data)
# OBJECTID ID_0 ISO NAME_0 ID_1 NAME_1 ID_2 NAME_2 HASC_2 CCN_2 CCA_2 TYPE_2 ENGTYPE_2 NL_NAME_2
# 1 1 97 GIN Guinea 1 Boké 1 Boffa GN.BF NA Préfecture Prefecture
# 2 2 97 GIN Guinea 1 Boké 2 Boké GN.BK NA Préfecture Prefecture
# 3 3 97 GIN Guinea 1 Boké 3 Fria GN.FR NA Préfecture Prefecture
# 4 4 97 GIN Guinea 1 Boké 4 Gaoual GN.GA NA Préfecture Prefecture
# 5 5 97 GIN Guinea 1 Boké 5 Koundara GN.KD NA Préfecture Prefecture
# 6 6 97 GIN Guinea 2 Conakry 6 Conakry GN.CK NA Préfecture Prefecture
Perhaps they are elsewhere (I have little to no experience with shapefiles)? Is there a function somewhere that could at least allow me to create a variable indicating whether a region has no outer boundaries (i.e. all those with no dots in the map above)?
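No answer is recorded here, but one way to get at the outer-boundary indicator is to test each region's boundary against the boundary of the dissolved country polygon. The sketch below is my own (using sf, not anything from the thread); telling the coastline apart from international borders would additionally need a coastline or neighbouring-country layer:
# Sketch (my own, untested): flag regions that touch the country's outline
library(sf)
sd0_sf <- st_as_sf(sd0)                               # from getData() above
outline <- st_boundary(st_union(st_geometry(sd0_sf))) # dissolved outer edge
touches <- st_intersects(st_boundary(sd0_sf), outline, sparse = FALSE)
sd0_sf$has_outer_border <- touches[, 1]
table(sd0_sf$has_outer_border)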

rworldmap package - Warning if the number of quantiles was reduced

I am using this R code:
library(rworldmap)
Data <- read.table("D:/Bla/Maps/Test.txt", header = TRUE, sep = "\t")
sPDF <- joinCountryData2Map(Data, joinCode = "ISO3",nameJoinColumn = "ISO3CountryCode")
mapCountryData(sPDF, nameColumnToPlot = "Data")
This produces a map but I get:
You asked for 7 quantiles, only 1 could be created in quantiles classification
I googled and it pointed me to this code, but I'm not sure whether it is relevant.
This is the data I have used:
ISO3CountryCode Data
JPN 7
AUS 6
IND 6
CHN 5
GBR 5
CHE 4
IRN 4
DEU 3
EGY 3
ESP 3
LBY 3
TUN 3
USA 3
ARG 2
AUT 2
BRA 2
EST 2
GRC 2
ITA 2
TUR 2
URY 2
CHL 1
ETH 1
FRA 1
JOR 1
KEN 1
KOR 1
LTU 1
MEX 1
NLD 1
NZL 1
PER 1
POL 1
SAU 1
SRB 1
SVK 1
SVN 1
TZA 1
ZAF 1
It looks like by default mapCountryData() tries to fit data to quantiles for binning. You'll need to help it along a little by tweaking the catMethod parameter.
I'm not sure what your values 1 through 7 mean. If they are categories (and you want them all explicitly displayed in the legend), try:
mapCountryData(sPDF, nameColumnToPlot = "Data", catMethod="categorical")
If you want to treat all values equally on a continuous scale, try:
mapCountryData(sPDF, nameColumnToPlot = "Data", catMethod="fixedWidth")
If neither of these does what you want, you might try altering numCats and/or catMethod; see ?mapCountryData for the possible values and their meanings.
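For example (the settings here are my guess, not from the original answer), a fixed-width classification with an explicit number of bins:
# e.g. fixed-width bins with an explicit bin count (settings are a guess)
mapCountryData(sPDF, nameColumnToPlot = "Data",
               catMethod = "fixedWidth", numCats = 7)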
