Find two points farthest apart using group_by() in sf - R

I haven't been able to find an exact Q/A matching this problem, though there are several related ones. I'm trying to calculate a distance matrix for the points within each group defined by an ID column, and then select the two points that are farthest apart in each group, retaining the original group id. The number of points in each group is 2, 4 or 6.
My sf df:
df <- data.frame(x = runif(12), y = runif(12), id = rep(1:3,each = 4)) %>%
st_as_sf(coords = c("x","y"), crs = 27700)
I've tried code such as:
a <- df %>%
group_by(id) %>%
st_distance(.)
However, this just returns a distance matrix of all points, ignoring the grouping.
The code below gives me what I want, though I fear it would be slow on large datasets:
maxMin <- do.call(rbind, lapply(unique(df$id), function(x) {
grp <- df %>% filter(id == x)
d <- st_distance(grp)
grp %>% slice(unique(as.vector(which(d == max(d), arr.ind = TRUE))))
}))

You can use dplyr::group_split to split your data frame into a list with one element per group. You can then apply whatever function you want to each element of that list using map/lapply.
The script below keeps the 2 points which are farthest apart in each group.
library(sf)
library(tidyverse)
# dummy data
data <- data.frame(x = runif(12), y = runif(12), id = rep(1:3,each = 4)) %>%
st_as_sf(coords = c("x","y"), crs = 27700)
# split it into a list per ID
data_group <- data %>%
group_by(id) %>%
group_split()
# apply a function to each element of the list
distance_per_group <- map(data_group, function(x){
distance_matrix <- st_distance(x)
# row/column indices of the first pair attaining the maximum distance
biggest_distance <- as.numeric(which(distance_matrix == max(distance_matrix), arr.ind = TRUE)[1,])
# keep only those two points
x[biggest_distance,]
})
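The result above is a list with one sf object per group. If a single sf object that still carries the id column is wanted, the pieces can be bound back together. The following is only a sketch under that assumption; farthest_pair is a hypothetical helper name, and ties for the maximum distance are resolved by taking the first pair found.

```r
library(sf)
library(dplyr)

set.seed(1)
# dummy data in the same shape as the question
pts <- data.frame(x = runif(12), y = runif(12), id = rep(1:3, each = 4)) %>%
  st_as_sf(coords = c("x", "y"), crs = 27700)

# hypothetical helper: keep the two points of one group that are farthest apart
farthest_pair <- function(g) {
  d <- st_distance(g)
  # first (row, column) pair attaining the maximum distance
  idx <- which(d == max(d), arr.ind = TRUE)[1, ]
  g[as.integer(idx), ]
}

# split per id, reduce each group to two points, then bind back into one sf object
pieces <- pts %>%
  group_by(id) %>%
  group_split()
result <- do.call(rbind, lapply(pieces, farthest_pair))
```

Here result should contain two rows per id, with the id column retained alongside the geometry.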

Related

Finding all nearest neighbour to all data points using sparklyr

I would like to use sparklyr find the nearest neighbour for each point in a dataset.
I've found sparklyr::ml_approx_nearest_neighbors() uses a key argument (a single feature vector) to find the nearest neighbour, so I guess I'd iterate over that for each point. Should I use this with lapply(), or is this inefficient?
Here's an example (modified from an existing one) where I take the titanic dataset and attempt to find the nearest 2 neighbours from the same dataset for the first 700 data points. It returns the point itself and the next closest, as expected, but I suspect the entire pipeline reruns for each data point, which makes this inefficient.
Is there a better way, please?
library(sparklyr)
library(titanic)
library(dplyr)
library(magrittr)
sc <- spark_connect(method = "databricks") # create a spark connection object
# clean dataset
df_titanic <- titanic::titanic_train %>%
dplyr::select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
dplyr::rename_all(tolower) %>% # make the col names lower case
dplyr::mutate(sex = ifelse(sex == 'male', 1, 0), id = 1:nrow(.)) %>% # turn sex to an integer
dplyr::filter_all(dplyr::all_vars(!is.na(.))) # remove NAs
sdf_titanic <- sparklyr::copy_to(sc, df_titanic, overwrite = T) # copy to spark
input_cols <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare') # features list
## append a vectorised list of the features we're interested in
sdf_titanic_va <- ft_vector_assembler(sdf_titanic,
input_cols = input_cols,
output_col = 'features')
brp_lsh <- sparklyr::ft_bucketed_random_projection_lsh(
sc,
input_col = 'features',
output_col = 'hash',
bucket_length = 2,
num_hash_tables = 3
)
brp_fit <- ml_fit(brp_lsh, sdf_titanic_va) ## fit the LSH to our data to get the hashes
id1_input <- sdf_titanic_va %>%
dplyr::filter(id %in% 1:700) %>%
dplyr::pull(features)
lapply(id1_input, function(x) ml_approx_nearest_neighbors(
brp_fit,
sdf_titanic_va,
key = x,
dist_col = 'dist_col',
num_nearest_neighbors = 2
))

Finding closest distance between points in 3 data sets using sp/rgeos?

I have 3 data sets: one of people, one of stores, and one of farmers markets, each with lat/long points.
PeopleEx <- data.frame(replicate(2,sample(0:10,10,rep=TRUE))) %>%
rename(lat = X1, long = X2) %>%
mutate(ID = 1:10, type = "people")
StoresEx <- data.frame(replicate(2,sample(0:10,5,rep=TRUE))) %>%
rename(lat = X1, long = X2) %>%
mutate(ID = 1:5, type = "store")
MarketsEx <- data.frame(t(replicate(2,sample(0:10,1,rep=TRUE)))) %>% # t() keeps the single market as one row with columns X1, X2
rename(lat = X1, long = X2) %>%
mutate(ID = 1, type = "market")
Each has the columns lat, long, ID and type, where type can be "people", "market" or "store". I am trying to find the distance between each person and the closest store, as well as the closest market. To do so I made a data frame with all 3 data sets.
DfExample <- rbind(PeopleEx, StoresEx, MarketsEx)
I've been using the sp and rgeos packages to calculate distance between points.
sp.DfExample <- DfExample
coordinates(sp.DfExample) <- ~long+lat
The first 10 entries are the people and the last 6 are the stores/markets, so I've used this code:
Distances <- gDistance(sp.DfExample, byid=TRUE)
# rows: people (1 to 10); columns: stores and market (11 to 16)
DistanceStore <- Distances[1:10, 11:16]
MaxDist <- DistanceStore
apply(MaxDist, 1, FUN=min)
This works in that it reports the minimum distance from each person to a store/market, however it drops the type. Any ideas on how to keep the type from dropping or different packages/approaches to use instead?
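One way to keep the type is to compute the distances per target data set and look up the type of the closest feature explicitly. The sketch below swaps in the sf package instead of sp/rgeos and treats the coordinates as planar (no CRS), which is an assumption; make_points and nearest are hypothetical helper names used only for illustration.

```r
library(sf)

set.seed(42)
# dummy data in the same shape as the question (lat, long, ID, type)
make_points <- function(n, type) {
  data.frame(lat = sample(0:10, n, replace = TRUE),
             long = sample(0:10, n, replace = TRUE),
             ID = seq_len(n),
             type = type)
}
people  <- make_points(10, "people")
stores  <- make_points(5, "store")
markets <- make_points(1, "market")

# hypothetical helper: for every row of `from`, find the closest row of `to`
nearest <- function(from, to) {
  from_sf <- st_as_sf(from, coords = c("long", "lat"))
  to_sf   <- st_as_sf(to, coords = c("long", "lat"))
  d <- st_distance(from_sf, to_sf)               # rows: from, columns: to
  d <- matrix(as.numeric(d), nrow = nrow(from))  # plain numeric matrix
  j <- apply(d, 1, which.min)                    # column index of the closest feature
  data.frame(person_ID    = from$ID,
             nearest_type = to$type[j],
             nearest_ID   = to$ID[j],
             distance     = d[cbind(seq_len(nrow(d)), j)])
}

# one row per person and target type, with the type kept as a column
result <- rbind(nearest(people, stores), nearest(people, markets))
```

Each person then appears twice in result: once with the closest store and once with the closest market.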

How to Calculate A Daily Max for Different Locations in R?

So currently I am able to calculate a daily max for one site using the following code:
library('dplyr')
library('data.table')
library('tidyverse')
library('tidyr')
library('lubridate')
# count objects on site over time: +1 at each entry, -1 at each exit
funcVolume <- function(enter, exit)
{
  vecOnes <- rep(1, length(enter))
  vecTime <- c(enter, exit)
  vecCount <- c(vecOnes, -vecOnes)
  df_test <- data.frame(T = vecTime, Count = vecCount)
  df_test <- df_test %>%
    arrange(T) %>%
    mutate(Volume = cumsum(Count))
  df_test
}
df_test <- funcVolume(max_data$enter_yard, max_data$exit_yard)
df_test2 <- df_test
df_test2$date <- as.Date(format(df_test$T, "%Y-%m-%d"))
df_test3 <- df_test2
df_test3 <- tibble(x = df_test2$Volume, y = df_test2$date) %>%
arrange(y)
dataset <- df_test3 %>%
group_by(y) %>%
dplyr::filter(x == max(x)) %>%
distinct(x,.keep_all = T) %>%
ungroup()
However, I would like to do this for multiple locations. In my original dataframe, I have a column that lists the name of the site, and two columns for when an object enters or leaves a site. The name is just a general text column, and the other two columns are datetime columns. Ideally, I would want an output that looks like the following:
Date | Max Count | Site
x | y | z
x | a | b
I also have a couple million rows of data, so I need something that can run in a reasonable time frame.
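A grouped version of the same idea could look like the sketch below. It assumes a site column next to enter_yard and exit_yard (the column name site and the sample data are made up for illustration), stacks entries and exits as +1/-1 events, and takes the running count within each site. Note that an object staying past midnight only influences a later day through events that occur on that day.

```r
library(dplyr)

# hypothetical sample data: one row per object visit
max_data <- data.frame(
  site = c("A", "A", "A", "B", "B"),
  enter_yard = as.POSIXct(c("2021-01-01 08:00:00", "2021-01-01 09:00:00",
                            "2021-01-02 10:00:00", "2021-01-01 08:30:00",
                            "2021-01-01 12:00:00"), tz = "UTC"),
  exit_yard = as.POSIXct(c("2021-01-01 10:00:00", "2021-01-01 11:00:00",
                           "2021-01-02 12:00:00", "2021-01-01 09:30:00",
                           "2021-01-01 13:00:00"), tz = "UTC")
)

# stack entries (+1) and exits (-1) into one event table
events <- bind_rows(
  transmute(max_data, site, time = enter_yard, change = 1),
  transmute(max_data, site, time = exit_yard, change = -1)
)

daily_max <- events %>%
  arrange(site, time, change) %>%          # exits sort before entries at equal times
  group_by(site) %>%
  mutate(volume = cumsum(change)) %>%      # running count per site
  group_by(site, date = as.Date(time)) %>%
  summarise(max_count = max(volume), .groups = "drop")
```

For a couple of million rows the same steps translate fairly directly to data.table if dplyr turns out to be too slow.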

Multiply a grouped data frame by a matrix dplyr

My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made a dummy version of my problem, which I think is clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution. In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes rather data hungry.
# create industry data frame
set.seed(12345)
ind_df <- data.frame(State = c(rep("a", len =6),rep("b", len =6),rep("c", len =6)),
industry = rep(c("Ind1","Ind2","Ind3","Ind4","Ind5","Ind6"), len = 18),
emp = rnorm(18,20,2))
# create occupation data frame
Occ_df <- data.frame(State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len =5)),
occupation = rep(c("Occ1","Occ2","Occ3","Occ4","Occ5"), len = 15),
emp = rnorm(15,10,1))
# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6*5,1,0.5),6,5) %>% as.data.frame()
# name cols in the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)
# solution
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
group_by(State) %>%
mutate(Occ1 = emp*Occ1,
Occ2 = emp*Occ2,
Occ3 = emp*Occ3,
Occ4 = emp*Occ4,
Occ5 = emp*Occ5
)
Ind_combined <- Ind_combined %>%
gather(key = "occupation",
value = "emp2",
-State,
-industry,
-emp
)
Ind_combined <- Ind_combined %>%
group_by(State, occupation) %>%
summarise(emp2 = sum(emp2))
Occ_df <- left_join(Occ_df,Ind_combined)
My solution seems pretty inefficient, is there a better / faster way to do this?
Also - I am not quite sure how to get to this - but the expected outcome would be another column added to the Occ_df called emp2, this would be derived from Ind_df emp column and the Ind_Occ_Conc. I have tried to step this out for Occupation 1, essentially the Ind_Occ_Conc contains weights and the result is a weighted average.
I'm not sure what you want to do with the sum(Ind$emp*Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a, get expected outcomes for all states (with dplyr):
Ind <- ind_df %>% group_by(State) %>%
summarize(rez = sum(emp))
# Then do some computations on Ind, which is a N element vector (one for each state)
# ...
# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
Final output would then have Ind values in a new column: one value for all a, one value for b and one value for c.
Hope it will help ;)
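The weighted sum described in the question (industry employment times the concordance weights, summed per state and occupation) can also be written as a single matrix product. This is only a sketch on dummy data similar to the question's; the object names emp_mat, conc and emp2_long are made up for illustration.

```r
set.seed(12345)
industries  <- paste0("Ind", 1:6)
occupations <- paste0("Occ", 1:5)

# dummy industry employment by state
ind_df <- data.frame(State = rep(c("a", "b", "c"), each = 6),
                     industry = rep(industries, 3),
                     emp = rnorm(18, 20, 2))

# dummy concordance matrix: industries in rows, occupations in columns
conc <- matrix(rnorm(6 * 5, 1, 0.5), 6, 5,
               dimnames = list(industries, occupations))

# states x industries employment matrix
emp_mat <- tapply(ind_df$emp, list(ind_df$State, ind_df$industry), sum)

# states x occupations: each entry is sum over industries of emp * weight
emp2 <- emp_mat[, rownames(conc)] %*% conc
names(dimnames(emp2)) <- c("State", "occupation")

# long format, ready to be joined onto Occ_df by State and occupation
emp2_long <- as.data.frame(as.table(emp2), responseName = "emp2",
                           stringsAsFactors = FALSE)
```

With 7 states, 200 industries and 350 occupations this avoids creating the long intermediate table with one row per state, industry and occupation.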

Find the variance over a sliding window in dplyr

I want to find the variance of the previous three values in a group.
# make some data with categories a and b
library(dplyr)
df = expand.grid(
a = LETTERS[1:3],
index = 1:10
)
# add a variable that changes within each group
set.seed(9999)
df$x = runif(nrow(df))
# get the variance of a subset of x
varSubset = function(x, index, subsetSize) {
subset = (index-subsetSize+1):index
ifelse(subset[1]<1, -1, var(x[subset]))
}
df %>%
# group the data
group_by(a) %>%
# get the variance of the 3 most recent values
mutate(var3 = varSubset(x, index, 3))
It's calling the varSubset with both x and index as vectors.
I can't figure out how to treat x as a vector (of only the group) and index as a single value. I've tried rowwise(), but then I effectively lose grouping.
Why not use rollapply from zoo?:
library(dplyr)
library(zoo)
df %>% group_by(a) %>%
mutate(var = rollapply(x, 3, var, fill = NA, align = "right"))
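If adding zoo is not an option, the same idea can be sketched with dplyr and base R only: inside mutate(), x is already the vector for the current group, so the window can be built from the position within that vector rather than from the index column. roll_var is a hypothetical helper name, and the window here includes the current row, like align = "right" above.

```r
library(dplyr)

# same dummy data as in the question
df <- expand.grid(a = LETTERS[1:3], index = 1:10)
set.seed(9999)
df$x <- runif(nrow(df))

# hypothetical helper: variance of the `width` most recent values, NA until enough values exist
roll_var <- function(x, width) {
  vapply(seq_along(x), function(i) {
    if (i < width) NA_real_ else var(x[(i - width + 1):i])
  }, numeric(1))
}

out <- df %>%
  group_by(a) %>%
  mutate(var3 = roll_var(x, 3)) %>%
  ungroup()
```

This relies on the rows being ordered by index within each group, which holds for the expand.grid() data above.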
