Finding closest distance between points in 3 data sets using sp/rgeos?

I have 3 data sets: one of people, one of stores, and one of farmers markets, each with lat/long points.
library(dplyr)

PeopleEx <- data.frame(replicate(2, sample(0:10, 10, rep = TRUE))) %>%
  rename(lat = X1, long = X2) %>%
  mutate(ID = 1:10, type = "people")
StoresEx <- data.frame(replicate(2, sample(0:10, 5, rep = TRUE))) %>%
  rename(lat = X1, long = X2) %>%
  mutate(ID = 1:5, type = "store")
MarketsEx <- data.frame(replicate(2, sample(0:10, 1, rep = TRUE))) %>%
  rename(lat = X1, long = X2) %>%
  mutate(ID = 1, type = "market")
Each has columns "lat", "long", "ID", and "type", where type can be "people", "market", or "store". I am trying to find the distance between each person and the closest store as well as the closest market. To do so I made a data frame with all 3 data sets.
DfExample <- rbind(PeopleEx, StoresEx, MarketsEx)
I've been using the sp and rgeos packages to calculate distance between points.
library(sp)
library(rgeos)

sp.DfExample <- DfExample
coordinates(sp.DfExample) <- ~long+lat
The first 10 entries are the people and the last 6 are the stores/markets, so I've used this code (people are rows 1:10 of the distance matrix, stores/markets are columns 11:16):
Distances <- gDistance(sp.DfExample, byid = TRUE)
DistanceStore <- Distances[1:10, 11:16]  # people in rows, stores/markets in columns
apply(DistanceStore, 1, FUN = min)
This works in that it reports the minimum distance from each person to a store/market; however, it drops the type. Any ideas on how to keep the type, or on different packages/approaches to use instead?
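One way to keep the type, sketched against the same objects as above: recover which column holds the minimum with which.min(), then look the type and ID up in DfExample instead of reporting only the distance (column j of the people-to-destination block corresponds to row 10 + j of DfExample):

# people are rows 1:10 of DfExample; stores/markets are rows 11:16
DistPeople <- Distances[1:10, 11:16]
nearest <- apply(DistPeople, 1, which.min)  # column index of the closest destination
data.frame(person_ID    = DfExample$ID[1:10],
           min_dist     = apply(DistPeople, 1, min),
           nearest_type = DfExample$type[10 + nearest],
           nearest_ID   = DfExample$ID[10 + nearest])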

Related

Finding the nearest neighbour for all data points using sparklyr

I would like to use sparklyr to find the nearest neighbour for each point in a dataset.
I've found sparklyr::ml_approx_nearest_neighbors() uses a key argument (a single feature vector) to find the nearest neighbour, so I guess I'd iterate over that for each point. Should I use this with lapply(), or is this inefficient?
Here's an example (modified from here) where I take the titanic dataset and attempt to find the nearest 2 neighbours of each of the first 700 data points, from the same dataset. It returns the point itself and the next closest, as expected, but I suspect the entire pipeline reruns for each data point, making this inefficient.
Is there a better way, please?
library(sparklyr)
library(titanic)
library(dplyr)
library(magrittr)

sc <- spark_connect(method = "databricks") # create a spark connection object

# clean dataset
df_titanic <- titanic::titanic_train %>%
  dplyr::select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
  dplyr::rename_all(tolower) %>% # make the col names lower case
  dplyr::mutate(sex = ifelse(sex == 'male', 1, 0), id = 1:nrow(.)) %>% # turn sex to an integer
  dplyr::filter_all(dplyr::all_vars(!is.na(.))) # remove NAs

sdf_titanic <- sparklyr::copy_to(sc, df_titanic, overwrite = TRUE) # copy to spark

input_cols <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare') # features list

## append a vectorised list of the features we're interested in
sdf_titanic_va <- ft_vector_assembler(sdf_titanic,
                                      input_cols = input_cols,
                                      output_col = 'features')

brp_lsh <- sparklyr::ft_bucketed_random_projection_lsh(
  sc,
  input_col = 'features',
  output_col = 'hash',
  bucket_length = 2,
  num_hash_tables = 3
)

brp_fit <- ml_fit(brp_lsh, sdf_titanic_va) ## fit the LSH to our data to get the hashes

id1_input <- sdf_titanic_va %>%
  dplyr::filter(id %in% 1:700) %>%
  dplyr::pull(features)

lapply(id1_input, function(x) ml_approx_nearest_neighbors(
  brp_fit,
  sdf_titanic_va,
  key = x,
  dist_col = 'dist_col',
  num_nearest_neighbors = 2
))
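One possible way around the per-key loop (a sketch, not tested here): the fitted LSH model also supports an approximate similarity self-join, which returns all pairs within a distance threshold in a single distributed pass; self-matches (distance 0) can then be filtered out and the smallest dist_col kept per id. The threshold of 10 is a placeholder to tune for your feature scale.

all_pairs <- sparklyr::ml_approx_similarity_join(
  brp_fit,
  sdf_titanic_va,
  sdf_titanic_va,
  threshold = 10,     # placeholder: pairs further apart than this are dropped
  dist_col = 'dist_col'
)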

Find two points farthest apart using group_by() in sf

Haven't been able to find an exact Q/A to match this problem, though there are several related ones. I'm trying to calculate a distance matrix for all point groups defined by an ID column, then select the two points that are farthest apart from each group, retaining the original group id. The number of points in each group varies: 2, 4, or 6.
My sf df:
library(sf)
library(dplyr)

df <- data.frame(x = runif(12), y = runif(12), id = rep(1:3, each = 4)) %>%
  st_as_sf(coords = c("x", "y"), crs = 27700)
I've tried code such as:
a <- df %>%
  group_by(id) %>%
  st_distance(.)
Though this just returns a distance matrix of all points.
The below gives me what I want, though I fear it would be slow on large datasets:
maxMin <- do.call(rbind, lapply(unique(df$id), function(x) {
  grp <- df %>% filter(id == x)
  d <- st_distance(grp)
  grp %>% slice(unique(as.vector(which(d == max(d), arr.ind = TRUE))))
}))
You can use dplyr::group_split to split your data frame into a list per group. You can then apply whatever function you want to that list using map/lapply.
Script below keeps the 2 points which are furthest apart in each group.
library(sf)
library(tidyverse)

# dummy data
data <- data.frame(x = runif(12), y = runif(12), id = rep(1:3, each = 4)) %>%
  st_as_sf(coords = c("x", "y"), crs = 27700)

# split it into a list per ID
data_group <- data %>%
  group_by(id) %>%
  group_split()

# apply a function to each list element
distance_per_group <- map(data_group, function(x){
  distance_matrix <- st_distance(x)
  biggest_distance <- as.numeric(which(distance_matrix == max(distance_matrix), arr.ind = TRUE)[1, ])
  farthest_apart <- x[biggest_distance, ]
})
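If a single sf object is wanted back rather than a list, the per-group results can be recombined; rbind() on sf objects keeps the geometry column (a small follow-up using the objects above):

# recombine the list into one sf object, two rows per id
farthest_apart_all <- do.call(rbind, distance_per_group)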

How to sample without replacement within groups in R

I have a data frame which contains a 'year' variable with values between 1 and 100000, repeating multiple times. I have another data frame with 1000 'loss amounts', each with an associated probability. I'd like to merge loss amounts onto the year data frame by sampling from the loss amounts table. I want to sample without replacement within each level of the year variable, i.e. within each level of the year variable the loss amounts should be unique.
Reproducible example below, where I can only get it to sample without replacement across the full 'year' dataset and not within the different levels of the year variable as required. Is there a way of doing this (ideally without loops, as the code needs to run quickly)?
library(dplyr)
library(purrr)

# mean frequency
freq <- 100
years <- 100000

# create data frame with the number of losses in each year
num_losses <- rpois(years, freq)
year <- tibble(index = 1:length(num_losses), num = num_losses)
year <- map2(year$index, year$num, function(x, y) rep(x, y)) %>%
  unlist() %>%
  tibble(year = .)

# lookup table with loss amounts
lookup <- tibble(prob = runif(1000, 0, 1), amount = rgamma(1000, shape = 1.688, scale = 700000)) %>%
  mutate(total_prob = cumsum(prob) / sum(prob),
         pdf = total_prob - lag(total_prob),
         pdf = ifelse(is.na(pdf), total_prob, pdf))

# add amounts to the year table by sampling from the lookup table
sample_from_lookup <- function(number){
  sample(lookup$amount, number, replace = FALSE, prob = lookup$pdf)
}
amounts <- sample_from_lookup(nrow(year))
year <- tibble(year = year$year, amount = amounts)
According to your description, maybe you can try replicate within your sample_from_lookup, i.e.,
sample_from_lookup <- function(number){
  amount <- replicate(number,
                      sample(lookup$amount,
                             1,
                             replace = FALSE,
                             prob = lookup$pdf))
}
In this case, you need to set the sample size to 1 in your sample() call.
I ended up using split to break the 'year' data into groups within a list, then running the (slightly amended) sample_from_lookup function on each element of the list using map. Amended code below.
# mean frequency
freq <- 5
years <- 100

# create data frame with the number of losses in each year
num_losses <- rpois(years, freq)
year <- tibble(index = 1:length(num_losses), num = num_losses)
year <- map2(year$index, year$num, function(x, y) rep(x, y)) %>%
  unlist() %>%
  tibble(year = .)
year_split <- split(year, year$year)

# lookup table
lookup <- tibble(prob = runif(1000, 0, 1), amount = rgamma(1000, shape = 1.688, scale = 700000)) %>%
  mutate(total_prob = cumsum(prob) / sum(prob),
         pdf = total_prob - lag(total_prob),
         pdf = ifelse(is.na(pdf), total_prob, pdf))

# add amounts to the year table by sampling from the lookup table
sample_from_lookup <- function(x){
  number <- NROW(x)
  sample(lookup$amount, number, replace = FALSE, prob = lookup$pdf)
}
amounts <- map(year_split, sample_from_lookup) %>% unlist() %>% tibble(amount = .)
year <- tibble(year = year$year, amount = amounts$amount)
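For what it's worth, a grouped mutate may do the same job without the explicit split()/map()/unlist() round trip, since sample() is then evaluated once per year group. A sketch built on the same year and lookup tables (untested at the full 100000-year scale):

year <- year %>%
  group_by(year) %>%
  mutate(amount = sample(lookup$amount, n(), replace = FALSE, prob = lookup$pdf)) %>%
  ungroup()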

Creating classes on vertices data frame from components$membership

I am looking to add a 'description' variable to the vertices data frame which describes the cluster in which a node is found. My network represents family relationships, so clusters could be a family of two adults and two children, a single parent with three children, a couple, etc.
My data looks like
Vertices data frame
ID Date.Of.B Nationality
X1 02/05/1995 Ugandan
X2 10/10/2010 Ugandan
X3 15/12/1975 Irish
: : :
Edgelist
ID1 ID2
X1 X2
X1 X3
X2 X3
X3 X1
: :
I plan to create factor levels to describe clusters, i.e.:
2 adults = 2A
2 adults 2 children = 2A2C
5 adults 0 children = 5A
After creating the graph using graph_from_data_frame() I can extract the components using components(), with components$membership giving each cluster a membership number and the IDs as the names of components$membership. I can apply a label to each vertex to determine its status as an adult or child.
Basically I am looking to add another variable which classes each ID given the cluster it is in:
New vertices data frame
ID Date.Of.B Nationality Class
X1 02/05/1995 Ugandan 2A1C
X2 10/10/2010 Ugandan 2A1C
X3 15/12/1975 Irish 2A1C
: : :
I am thinking I will have to use some sort of loop to go through each cluster and apply a level to each vertex by components$membership.
This is one option I thought of and am currently working on.
Please let me know if you have any other ideas or better ways to do it.
Thanks
Maybe this helps:
library(igraph)
library(dplyr)
library(tidyr)
Generate example data:
set.seed(1)
vertices <- data.frame(ID = 1:20,
                       date = as.character(rnorm(20, -5000, 3000) + Sys.Date()),
                       Nationality = letters[1:20])
edgelist <- data.frame(from = sample(1:20, 15, replace = TRUE),
                       to = sample(1:20, 15, replace = TRUE))
g <- graph_from_data_frame(edgelist,
                           directed = FALSE,
                           vertices = vertices)
cp <- components(g)
Save the component membership as a new vertex attribute:
V(g)$components <- cp$membership
Extract vertices plus additional attributes:
df <- get.data.frame(g, "vertices")
Work with the dataframe:
First generate a new coding variable based on age (in days), count the occurrences, and paste the result into a new variable.
df <- df %>%
  mutate(coding = ifelse(Sys.Date() - as.Date(df$date) > 6570, "A", "C")) %>%
  group_by(components, coding) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  mutate(new = paste(n, coding, sep = "")) %>%
  select(-coding, -n)
Then nest the data frame by components into a new data frame, after removing duplicates.
df2 <- df %>%
  select(new, components) %>%
  distinct(.keep_all = TRUE) %>%
  nest(-components)
After that you can merge the two data frames and loop through (sapply) to unlist your new class variable (in this case called data), which gives the final result.
df3 <- left_join(df, df2) %>%
  select(-new)
df3$data <- sapply(df3$data, function(x) paste(unname(unlist(x)), collapse = ""))
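A possibly shorter variant of the same idea (a sketch; label_component() is a helper introduced here, not part of the answer above): build the class label inside one grouped mutate and skip the nest/join round trip.

# counts per component, e.g. A = 2, C = 1, pasted as "2A1C"
label_component <- function(coding) {
  counts <- table(coding)  # table() sorts names, so "A" comes before "C"
  paste0(counts, names(counts), collapse = "")
}
df_alt <- get.data.frame(g, "vertices") %>%
  mutate(coding = ifelse(Sys.Date() - as.Date(date) > 6570, "A", "C")) %>%
  group_by(components) %>%
  mutate(Class = label_component(coding)) %>%
  ungroup() %>%
  select(-coding)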

Multiply a grouped data frame by a matrix with dplyr

My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made a dummy version of my problem, which I think is clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution. In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes rather data hungry.
library(dplyr)
library(tidyr)

# create industry data frame
set.seed(12345)
ind_df <- data.frame(State = c(rep("a", len = 6), rep("b", len = 6), rep("c", len = 6)),
                     industry = rep(c("Ind1", "Ind2", "Ind3", "Ind4", "Ind5", "Ind6"), len = 18),
                     emp = rnorm(18, 20, 2))

# create occupation data frame
Occ_df <- data.frame(State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len = 5)),
                     occupation = rep(c("Occ1", "Occ2", "Occ3", "Occ4", "Occ5"), len = 15),
                     emp = rnorm(15, 10, 1))

# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6 * 5, 1, 0.5), 6, 5) %>% as.data.frame()

# name the rows and columns of the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)
# solution
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
  group_by(State) %>%
  mutate(Occ1 = emp * Occ1,
         Occ2 = emp * Occ2,
         Occ3 = emp * Occ3,
         Occ4 = emp * Occ4,
         Occ5 = emp * Occ5)
Ind_combined <- Ind_combined %>%
  gather(key = "occupation",
         value = "emp2",
         -State,
         -industry,
         -emp)
Ind_combined <- Ind_combined %>%
  group_by(State, occupation) %>%
  summarise(emp2 = sum(emp2))
Occ_df <- left_join(Occ_df, Ind_combined)
My solution seems pretty inefficient, is there a better / faster way to do this?
Also, I am not quite sure how to get to this, but the expected outcome would be another column added to Occ_df called emp2, derived from the ind_df emp column and Ind_Occ_Conc. I have tried to step this out for Occupation 1; essentially Ind_Occ_Conc contains weights and the result is a weighted average.
I'm not sure what you want to do with the sum(Ind$emp*Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a, get expected outcomes
# for all states (with dplyr):
Ind <- ind_df %>%
  group_by(State) %>%
  summarize(rez = sum(emp))
# Then do some computations on Ind, which is an N-element vector (one per state)
# ...
# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
The final output would then have the Ind values in a new column: one value for all of a, one for b, and one for c.
Hope it will help ;)
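Since the concordance is a fixed industry-by-occupation weight matrix, the grouped multiply/gather/summarise in the question is equivalent to one matrix product per state, which should scale much better at 7 states * 200 industries * 350 occupations. A sketch under the assumption that industries appear in the same order within every state, as they do in the dummy data (multi-row summarise needs dplyr >= 1.0):

conc <- as.matrix(Ind_Occ_Conc)  # 6 industries x 5 occupations

emp2_by_state <- ind_df %>%
  group_by(State) %>%
  summarise(occupation = colnames(conc),
            emp2 = as.vector(t(conc) %*% emp))  # weighted sums per occupation

Occ_df <- left_join(Occ_df, emp2_by_state, by = c("State", "occupation"))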
