Match two datasets by minimum geospatial distance (R)

I have the two following datasets:
houses <- data.table(house_number = c(1:3),
lat_decimal = seq(1.1, 1.3, by = 0.1),
lon_decimal = seq(1.4, 1.6, by = 0.1))
stations <- data.table(station_numer = c(1:11),
lat_decimal = seq(1, 2, by = 0.1),
lon_decimal = seq(2, 3, by = 0.1))
I want to merge houses and stations together such that the resulting station_number is the station that's closest to the corresponding house_number.
This question is very similar, but I'm not sure whether they're working with latitude and longitude. I also don't know how to calculate distances from longitude and latitude myself, which is why I'd prefer to simply use distm from the geosphere package.
I have never worked with the outer function. If the answer to the aforementioned question would work, how can I adapt it to use the distm function instead of the sqrt function?

Use match_nrst_haversine from hutilscpp:
library(hutilscpp)
houses[, c("station_number", "dist") := match_nrst_haversine(lat_decimal,
lon_decimal,
addresses_lat = stations$lat_decimal,
addresses_lon = stations$lon_decimal,
Index = stations$station_numer,
close_enough = 0,
cartesian_R = 5)]
houses
#> house_number lat_decimal lon_decimal station_number dist
#> 1: 1 1.1 1.4 1 67.62617
#> 2: 2 1.2 1.5 1 59.87076
#> 3: 3 1.3 1.6 1 55.59026
For performance, you may want to adjust close_enough and cartesian_R if you have many points to match (say, over a million).
`cartesian_R`
The maximum radius of any address from the points to be geocoded. Used to accelerate the detection of minimum distances. Note, as the argument name suggests, the distance is in cartesian coordinates, so a small number is likely.
`close_enough`
The distance, in metres, below which a match will be considered to have occurred. (The distance that is considered "close enough" to be a match.)
For example, close_enough = 10 means the first location within ten metres will be matched, even if a closer match occurs later.
May be provided as a string to emphasize the units, e.g. close_enough = "0.25km". Only km and m are permitted.
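To make the close_enough trade-off concrete, here is a toy sketch in base R. It is not hutilscpp's implementation; the helper name match_first_close_enough and the distance values are made up for illustration. It only shows how accepting the first candidate within a threshold can return a different index than an exhaustive nearest search.

```r
# Hypothetical helper: return the first index within `close_enough`,
# falling back to the true minimum when nothing is close enough.
match_first_close_enough <- function(dists, close_enough = 0) {
  hit <- which(dists <= close_enough)
  if (length(hit) > 0) hit[1] else which.min(dists)
}

# Made-up distances (metres) from one point to four candidate addresses.
dists <- c(120, 8, 3, 50)

exact <- match_first_close_enough(dists, close_enough = 0)   # exhaustive: index 3
early <- match_first_close_enough(dists, close_enough = 10)  # first within 10 m: index 2
```

With close_enough = 0 the search never stops early, so the result is the true nearest neighbour; a larger threshold trades some accuracy for the chance to stop sooner.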

Your question is a bit more complicated than a simple merge, and outer is somewhat ill-suited for the purpose. To be as thorough as possible, we want to calculate the distance between all combinations of houses and stations, then keep only the closest station per house. We'll need two packages:
library(tidyverse)
library(geosphere)
First, a bit of prep. distm expects coordinates to be ordered as longitude first, latitude second (you have the opposite), so let's fix that, name the columns better, and correct a typo while we're at it:
houses <- data.frame(house_number = c(1:3),
lon_house = seq(1.4, 1.6, by = 0.1),
lat_house = seq(1.1, 1.3, by = 0.1)
)
stations <- data.frame(station_number = c(1:11),
lon_station = seq(2, 3, by = 0.1),
lat_station = seq(1, 2, by = 0.1)
)
We'll create "nested" data frames so that it's easier to keep coordinates together:
# tidyr >= 1.0 syntax; older versions used nest(houses, -house_number, .key = 'house_coords')
house_nest <- nest(houses, house_coords = -house_number)
station_nest <- nest(stations, station_coords = -station_number)
house_number house_coords
<int> <list>
1 1 <data.frame [1 × 2]>
2 2 <data.frame [1 × 2]>
3 3 <data.frame [1 × 2]>
station_number station_coords
<int> <list>
1 1 <data.frame [1 × 2]>
2 2 <data.frame [1 × 2]>
3 3 <data.frame [1 × 2]>
4 4 <data.frame [1 × 2]>
5 5 <data.frame [1 × 2]>
6 6 <data.frame [1 × 2]>
7 7 <data.frame [1 × 2]>
8 8 <data.frame [1 × 2]>
9 9 <data.frame [1 × 2]>
10 10 <data.frame [1 × 2]>
11 11 <data.frame [1 × 2]>
Use tidyr::crossing to combine every row from both data frames:
data.master <- crossing(house_nest, station_nest)
house_number house_coords station_number station_coords
<int> <list> <int> <list>
1 1 <data.frame [1 × 2]> 1 <data.frame [1 × 2]>
2 1 <data.frame [1 × 2]> 2 <data.frame [1 × 2]>
3 1 <data.frame [1 × 2]> 3 <data.frame [1 × 2]>
4 1 <data.frame [1 × 2]> 4 <data.frame [1 × 2]>
5 1 <data.frame [1 × 2]> 5 <data.frame [1 × 2]>
6 1 <data.frame [1 × 2]> 6 <data.frame [1 × 2]>
7 1 <data.frame [1 × 2]> 7 <data.frame [1 × 2]>
8 1 <data.frame [1 × 2]> 8 <data.frame [1 × 2]>
9 1 <data.frame [1 × 2]> 9 <data.frame [1 × 2]>
10 1 <data.frame [1 × 2]> 10 <data.frame [1 × 2]>
# ... with 23 more rows
With all this now in place, we can use distm on each row to calculate a distance, and keep the shortest distance per house:
data.dist <- data.master %>%
mutate(dist = map2_dbl(house_coords, station_coords, distm)) %>%
group_by(house_number) %>%
filter(dist == min(dist))
house_number house_coords station_number station_coords dist
<int> <list> <int> <list> <dbl>
1 1 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 67690.
2 2 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 59883.
3 3 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 55519.
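For comparison, the same "all pairs, then keep the minimum" idea can be sketched with outer in base R, which is what the question originally asked about. This is only an illustration under stated assumptions: haversine_m is a hand-written helper (not part of geosphere), it assumes a spherical Earth with a mean radius of 6,371,000 m, and its distances will therefore differ slightly from what distm returns.

```r
# Hypothetical helper: great-circle distance in metres on a sphere.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

houses <- data.frame(house_number = 1:3,
                     lat_decimal = seq(1.1, 1.3, by = 0.1),
                     lon_decimal = seq(1.4, 1.6, by = 0.1))
stations <- data.frame(station_number = 1:11,
                       lat_decimal = seq(1, 2, by = 0.1),
                       lon_decimal = seq(2, 3, by = 0.1))

# outer() over row indices: d[i, j] is the distance from house i to station j.
d <- outer(seq_len(nrow(houses)), seq_len(nrow(stations)),
           function(i, j) haversine_m(houses$lat_decimal[i], houses$lon_decimal[i],
                                      stations$lat_decimal[j], stations$lon_decimal[j]))

# For each house (row), pick the column holding the smallest distance.
nearest <- apply(d, 1, which.min)
houses$station_number <- stations$station_number[nearest]
houses$dist_m <- d[cbind(seq_len(nrow(houses)), nearest)]
```

Passing row indices to outer, rather than the coordinates themselves, is what lets a four-argument distance function be used. The full distance matrix grows as houses × stations, so this only suits modest data sizes.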

Related

How to refer to data frames dynamically in an R for loop?

R novice, so apologies in advance. I want to write a for loop that does sequential operations on a series of dataframes and then binds them (by sequence number).
Ideally, I'd think it would be something like this (where sc2 is the base dataframe I'm working from and week3 is the selection variable). The dataframes I'm trying to create would be t1, t2, t3, ... and w1, w2, w3, ... etc. In other words, the 'i' in the dataframe name would be read from the for statement.
for(i in 1:16) {
ti= tail((subset(sc2, sc2$week3<i)), n=200)
wi= subset(sc2, sc2$week3==i)
mi=rbind(ti, wi)
}
Which I'm sure you know doesn't work. I've gotten this far -
for(i in 1:16) {
txi= tail((subset(sc2, sc2$week3<i)), n=200)
assign(paste0("trst",i), txi, envir = .GlobalEnv)
wxi= subset(sc2, sc2$week3==i)
assign(paste0("w",i), wxi, envir = .GlobalEnv)
}
This creates dummy dataframes (*xi) that are then assigned to the global environment for each i. But how do I rbind them now? Is there a more elegant way to do all of this, or am I missing something about how to refer to the dataframes dynamically?
Don't do it in a loop!
This is much easier if you hold the data frames inside a data frame, or rather tibbles inside a tibble. See the example below.
library(tidyverse)
sc2 = tibble(
week3 = sample(1:20, 100, replace = TRUE),
x = rnorm(100)
)
ftxi = function(i) sc2 %>% filter(week3<i)
fwxi = function(i) sc2 %>% filter(week3==i)
df = tibble(id = 1:16) %>%
group_by(id) %>%
mutate(txi = map(id, ~ftxi(.x)),
wxi = map(id, ~fwxi(.x)))
Let's see what df looks like.
# A tibble: 16 x 3
# Groups: id [16]
id txi wxi
<int> <list> <list>
1 1 <tibble [0 x 2]> <tibble [4 x 2]>
2 2 <tibble [4 x 2]> <tibble [6 x 2]>
3 3 <tibble [10 x 2]> <tibble [6 x 2]>
4 4 <tibble [16 x 2]> <tibble [6 x 2]>
5 5 <tibble [22 x 2]> <tibble [4 x 2]>
6 6 <tibble [26 x 2]> <tibble [4 x 2]>
7 7 <tibble [30 x 2]> <tibble [6 x 2]>
8 8 <tibble [36 x 2]> <tibble [4 x 2]>
9 9 <tibble [40 x 2]> <tibble [3 x 2]>
10 10 <tibble [43 x 2]> <tibble [6 x 2]>
11 11 <tibble [49 x 2]> <tibble [3 x 2]>
12 12 <tibble [52 x 2]> <tibble [4 x 2]>
13 13 <tibble [56 x 2]> <tibble [6 x 2]>
14 14 <tibble [62 x 2]> <tibble [5 x 2]>
15 15 <tibble [67 x 2]> <tibble [5 x 2]>
16 16 <tibble [72 x 2]> <tibble [7 x 2]>
As you can see, it is a tibble that holds other tibbles.
Let's check that everything is correct by looking at the second row.
First, the txi variable, df$txi[[2]]:
# A tibble: 4 x 2
week3 x
<int> <dbl>
1 1 -0.0829
2 1 -2.15
3 1 -0.949
4 1 -0.0583
Now the wxi variable, df$wxi[[2]]:
# A tibble: 6 x 2
week3 x
<int> <dbl>
1 2 -0.0643
2 2 -0.228
3 2 -0.620
4 2 -1.21
5 2 0.186
6 2 1.19
Bingo, you get what you expected!
It is also a very quick method. See my other answer on this forum:
What is faster/better: Loop over each row..
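The question also asked for the last 200 earlier rows (tail) and for each pair to be bound together with rbind. A minimal base R sketch of that full step, keeping the results in a named list instead of separate global variables, might look like this. The sample sc2 and the names m_list and m1 ... m16 are assumptions made for illustration.

```r
set.seed(42)
# Stand-in for the asker's data: week3 is the selection variable.
sc2 <- data.frame(week3 = sample(1:20, 100, replace = TRUE),
                  x = rnorm(100))

# One list element per i: the last (up to) 200 rows from earlier weeks,
# followed by the rows for week i itself.
m_list <- lapply(1:16, function(i) {
  t_i <- tail(subset(sc2, week3 < i), n = 200)
  w_i <- subset(sc2, week3 == i)
  rbind(t_i, w_i)
})
names(m_list) <- paste0("m", 1:16)

# Refer to any result dynamically by position or by name.
third <- m_list[[3]]
same_third <- m_list[["m3"]]
```

Keeping the pieces in a list avoids assign() and paste0() name juggling: the index i is simply the list position.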

How to convert a list of tibbles/dataframes into a nested tibble/dataframe

Sample Data
ex_list <- list(a = tibble(x = 1:4, y = 5:8),
b = mtcars)
How do I convert this list of tibbles/dataframes into a nested tibble as shown below:
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
Tidy solutions appreciated!
We may use enframe
library(tibble)
enframe(ex_list)
# A tibble: 2 x 2
name value
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
If we need to change the column names, use the name and value arguments:
> enframe(ex_list, name = 'data_name', value = 'data')
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
Is this what you want?
library(tidyverse)
lapply(ex_list, nest) %>%
dplyr::bind_rows(., .id = "data_name")
# # A tibble: 2 x 2
# data_name data
# <chr> <list>
# 1 a <tibble [4 x 2]>
# 2 b <tibble [32 x 11]>
#OR map
#map(ex_list, nest) %>%
# bind_rows(., .id = "data_name")

Predict nested model with nested newdata

Imagine a high resolution temperature and light time series taken at many locations (stations) over many days. Except, at each station temp and light are taken by different sensors, resulting in a slightly different set of timestamps.
To merge these into one data.frame, I've been trying to make a model of light for each day at each station in df.light. Then, I want to predict light values at the exact timestamps of temp readings, which are nested the same way in df.temp (the temperature dataset).
station <- rep(1:5, each=36500)
dayofyear <- rep(1:365, 5, each=100)
hourofday.light <- runif(182500, min=0, max=24)
light <- runif(182500, min=0, max=40)
hourofday.temp <- runif(182500, min=0, max=24)
temp <- runif(182500, min=0, max=40)
df.light <- data.frame(station, dayofyear, hourofday.light, light)
df.temp <- data.frame(station, dayofyear, hourofday.temp, temp)
> head(df.light)
station dayofyear hourofday.light light
1 1 1 10.217349 0.120381
2 1 1 12.179213 12.423694
3 1 1 16.515400 7.277784
4 1 1 3.775723 31.793782
5 1 1 7.719266 30.578220
6 1 1 9.269916 16.937042
> tail(df.light)
station dayofyear hourofday.light light
182495 5 365 4.712285 19.2047471
182496 5 365 11.190919 39.5921675
182497 5 365 18.710969 11.8182347
182498 5 365 20.288101 11.6874453
182499 5 365 15.466373 0.3264828
182500 5 365 12.969125 29.4429034
> head(df.temp)
station dayofyear hourofday.temp temp
1 1 1 12.1298554 30.862308
2 1 1 23.6226076 9.328942
3 1 1 9.3699831 28.970397
4 1 1 0.1814767 1.405557
5 1 1 23.6300014 39.875743
6 1 1 7.6999984 39.786182
I can make the light models, e.g. GAMs, for each day at each station in df.light using dplyr. But I am stuck not knowing how to feed the nested newdata from df.temp to the models to generate the per-station-per-day predictions.
library("mgcv")
library("tidyverse")
data <- as_tibble(df.light) %>%
group_by(station, dayofyear) %>%
nest()
models <- data %>%
mutate(
model = map(data, ~ gam(light ~ s(hourofday.light), data = .x)),
predicted = map(model, ~ predict.gam(.x, newdata = hourofday.temp)) # newdata doesn't look nested
)
The last line starting with predicted does not work because newdata is not nested...I think. Please help. I'm guessing this could be a common issue in merging time series generated by multiple sources.
You can first prepare the data.
names(df.temp)[3:4] <- names(df.light)[3:4]
data1 <- df.light %>% group_by(station, dayofyear) %>% nest() %>% ungroup()
data2 <- df.temp %>% group_by(station, dayofyear) %>% nest() %>% ungroup()
Then apply the model and get the predicted values.
result <- data1 %>%
mutate(data2 = data2$data,
model = map(data, ~ gam(light ~ s(hourofday.light),data = .x)),
predicted = map2(model, data2, predict.gam))
result
# A tibble: 1,825 x 6
# station dayofyear data data2 model predicted
# <int> <int> <list> <list> <list> <list>
# 1 1 1 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 2 1 2 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 3 1 3 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 4 1 4 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 5 1 5 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 6 1 6 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 7 1 7 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 8 1 8 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# 9 1 9 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
#10 1 10 <tibble [100 × 2]> <tibble [100 × 2]> <gam> <dbl [100]>
# … with 1,815 more rows
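Note that assigning data2$data by position relies on both nested tables listing the station/day groups in the same order. That holds for the sample data, but a sketch that matches groups by key instead may be safer for real sensor data. The version below uses base R split() and Map() with mgcv; the group sizes are shrunk to keep it quick, and the names light_groups, temp_groups, and light_pred are illustrative assumptions.

```r
library(mgcv)

set.seed(1)
n <- 50  # readings per station-day (reduced from the question's 100)
grid <- expand.grid(rep = seq_len(n), dayofyear = 1:2, station = 1:2)
df.light <- data.frame(station = grid$station, dayofyear = grid$dayofyear,
                       hourofday.light = runif(nrow(grid), 0, 24),
                       light = runif(nrow(grid), 0, 40))
df.temp <- data.frame(station = grid$station, dayofyear = grid$dayofyear,
                      hourofday.temp = runif(nrow(grid), 0, 24),
                      temp = runif(nrow(grid), 0, 40))

# Split both tables on the same station/day key; list names carry the key.
light_groups <- split(df.light, interaction(df.light$station, df.light$dayofyear, drop = TRUE))
temp_groups <- split(df.temp, interaction(df.temp$station, df.temp$dayofyear, drop = TRUE))

# One GAM of light against time of day per station-day.
models <- lapply(light_groups, function(d) gam(light ~ s(hourofday.light), data = d))

# Pair each model with its temperature group by key name, not by position.
common <- intersect(names(models), names(temp_groups))
predicted <- Map(function(m, d) {
  newdata <- data.frame(hourofday.light = d$hourofday.temp)
  d$light_pred <- as.numeric(predict(m, newdata = newdata))
  d
}, models[common], temp_groups[common])

merged <- do.call(rbind, predicted)
```

Groups present in only one of the two tables are dropped by the intersect() step rather than silently misaligned.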

How to write the Mesh from PubMed into a data frame

I recently installed the RISmed package to download a little under 500,000 article abstracts from PubMed, and I'm trying to put this data into a data frame. My problem comes when I try to include the Mesh data: I get the error message
'Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows:'
Is there a way to nest this information into a data frame for analysis?
RCT_topic <- 'randomized clinical trial'
RCT_query <- EUtilsSummary(RCT_topic, mindate=2016, maxdate=2017, retmax=100)
summary(RCT_query)
RCT_records <- EUtilsGet(RCT_query)
RCT_data <- data.frame('PMID'=PMID(RCT_records),
'Title'=ArticleTitle(RCT_records),
'Abstract'=AbstractText(RCT_records),
'YearPublished'=YearPubmed(RCT_records),
'Month.Published'=MonthPubmed(RCT_records),
'Country'=Country(RCT_records),
'Grant'=GrantID(RCT_records),
'Acronym'=Acronym(RCT_records),
'Agency'=Agency(RCT_records),
'Mesh'=Mesh(RCT_records))
Consider using a tibble:
library(RISmed)
library(dplyr) # tibble and other functions
RCT_topic <- 'randomized clinical trial'
RCT_query <- EUtilsSummary(RCT_topic, mindate=2016, maxdate=2017, retmax=100)
summary(RCT_query)
RCT_records <- EUtilsGet(RCT_query)
RCT_data <- tibble('PMID'=PMID(RCT_records), # tibble() replaces the deprecated data_frame()
'Title'=ArticleTitle(RCT_records),
'Abstract'=AbstractText(RCT_records),
'YearPublished'=YearPubmed(RCT_records),
'Month.Published'=MonthPubmed(RCT_records),
'Country'= Country(RCT_records),
'Grant' =GrantID(RCT_records),
'Acronym' =Acronym(RCT_records),
'Agency' =Agency(RCT_records),
'Mesh'=Mesh(RCT_records))
The Mesh column is now a list of dataframes:
select(RCT_data, PMID, Mesh)
# # A tibble: 100 x 2
# PMID Mesh
# <chr> <list>
# 1 29288775 <data.frame [21 × 2]>
# 2 29288545 <data.frame [19 × 2]>
# 3 29288510 <data.frame [15 × 2]>
# 4 29288507 <data.frame [19 × 2]>
# 5 29288478 <data.frame [16 × 2]>
# 6 29288309 <data.frame [19 × 2]>
# 7 29288191 <data.frame [11 × 2]>
# 8 29288190 <data.frame [23 × 2]>
# 9 29288184 <data.frame [21 × 2]>
# 10 29288175 <data.frame [12 × 2]>
# # ... with 90 more rows
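The reason data.frame() fails while a tibble works can be seen with toy data in base R. The sketch below does not call RISmed; the pmid values and the Heading/Type columns are made-up stand-ins for what PMID() and Mesh() return, so treat the shapes as an assumption.

```r
# Toy stand-ins: two articles with 3 and 2 MeSH rows respectively.
pmid <- c("id_1", "id_2")
mesh <- list(
  data.frame(Heading = c("Humans", "Adult", "Female"), Type = "Descriptor"),
  data.frame(Heading = c("Humans", "Male"), Type = "Descriptor")
)

# data.frame() tries to spread the list elements out as columns, so their
# differing row counts trigger the "differing number of rows" error.
failed <- inherits(try(data.frame(PMID = pmid, Mesh = mesh), silent = TRUE),
                   "try-error")

# I() keeps the list intact as one list column, one element per article.
nested <- data.frame(PMID = pmid, Mesh = I(mesh))

# For analysis, flatten to one row per article/heading pair.
long <- do.call(rbind, Map(function(id, m) cbind(PMID = id, m),
                           nested$PMID, nested$Mesh))
```

The long form is usually easier to filter or count by heading, while the nested form keeps one row per article.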

apply function on certain columns in nested variable in r

I want to apply a vectorised operation to certain columns in the nested variable. The function I want to apply finds the number of missing values in the numeric features, i.e. weight and calories. The data frame I have is as follows:
df <- data.frame(country = c("US", "US", "UK", "PAK"),name = c("David",
"James", "Junaid", "Ali"), fruit = c("Apple", "banana", "orange", "melon"),
weight = c(90,110,120,NA), calories = c(NA,20, NA,NA))
country name fruit weight calories
1 US David Apple 90 NA
2 US James banana 110 20
3 UK Junaid orange 120 NA
4 PAK Ali melon NA NA
When I nest the data frame
nested_df <- df %>% group_by(country) %>% nest()
# A tibble: 3 × 2
country data
<fctr> <list>
1 US <tibble [2 × 4]>
2 UK <tibble [1 × 4]>
3 PAK <tibble [1 × 4]>
I have tried the following syntax, but to no avail.
nested_df %>% mutate(missings = map(data, c("weight", "calories")) %>%
map_lgl(function(x) sum(!is.na(x))/length(x) ==1))
The result I expect is as follows:
# A tibble: 3 × 3
country data missings
<fctr> <list> <lgl>
1 US <tibble [2 × 4]> FALSE
2 UK <tibble [1 × 4]> FALSE
3 PAK <tibble [1 × 4]> TRUE
However, what I am getting is:
# A tibble: 3 × 3
country data missings
<fctr> <list> <lgl>
1 US <tibble [2 × 4]> NA
2 UK <tibble [1 × 4]> NA
3 PAK <tibble [1 × 4]> NA
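This will check whether more than 50% of the values in the chosen columns are NA:
colstocheck <- c("weight", "calories")
# mean(is.na(...)) is the fraction of NA cells; length() on a tibble would count its columns, not its cells
nested_df %>% mutate(missings = map_lgl(data,
function(x) mean(is.na(x[, colstocheck])) > 0.5))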
# A tibble: 3 x 3
country data missings
<fctr> <list> <lgl>
1 US <tibble [2 x 4]> FALSE
2 UK <tibble [1 x 4]> FALSE
3 PAK <tibble [1 x 4]> TRUE
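The fraction can be sanity-checked outside the nested tibble with a few lines of base R. One detail worth knowing when computing it by hand: length() of a data frame is its number of columns, not its number of cells, whereas mean(is.na(...)) averages over every cell. The names cols and frac_na below are illustrative.

```r
df <- data.frame(country = c("US", "US", "UK", "PAK"),
                 weight = c(90, 110, 120, NA),
                 calories = c(NA, 20, NA, NA))
cols <- c("weight", "calories")

# length() counts columns here, regardless of how many rows there are.
n_cols <- length(df[, cols])

# Fraction of NA cells per country: is.na() on a data frame gives a logical
# matrix, and mean() of that matrix is the share of TRUE cells.
frac_na <- sapply(split(df[, cols], df$country), function(d) mean(is.na(d)))

# Flag countries where more than half of the checked cells are missing.
mostly_missing <- frac_na > 0.5
```

Here US has 1 of 4 cells missing, UK 1 of 2, and PAK 2 of 2, so only PAK exceeds the 50% threshold.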

Resources