Finding the nearest neighbours of all data points using sparklyr - r

I would like to use sparklyr to find the nearest neighbour for each point in a dataset.
I've found that sparklyr::ml_approx_nearest_neighbors() takes a key argument (a single feature vector) to find the nearest neighbour, so I guess I'd iterate over that for each point. Should I use it with lapply(), or is that inefficient?
Here's an example (modified from here) where I take the titanic dataset and attempt to find the nearest 2 neighbours of each of the first 700 data points within the same dataset. It returns the point itself and the next closest, as expected, but I suspect the entire pipeline reruns for each data point, making this inefficient.
Is there a better way, please?
library(sparklyr)
library(titanic)
library(dplyr)
library(magrittr)
sc <- spark_connect(method = "databricks") # create a spark connection object
# clean dataset
df_titanic <- titanic::titanic_train %>%
  dplyr::select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
  dplyr::rename_all(tolower) %>% # make the col names lower case
  dplyr::mutate(sex = ifelse(sex == 'male', 1, 0), id = 1:nrow(.)) %>% # turn sex into an integer
  dplyr::filter_all(dplyr::all_vars(!is.na(.))) # remove NAs
sdf_titanic <- sparklyr::copy_to(sc, df_titanic, overwrite = TRUE) # copy to spark
input_cols <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare') # features list
## append a vectorised list of the features we're interested in
sdf_titanic_va <- ft_vector_assembler(sdf_titanic,
                                      input_cols = input_cols,
                                      output_col = 'features')
brp_lsh <- sparklyr::ft_bucketed_random_projection_lsh(
  sc,
  input_col = 'features',
  output_col = 'hash',
  bucket_length = 2,
  num_hash_tables = 3
)
brp_fit <- ml_fit(brp_lsh, sdf_titanic_va) ## fit the LSH to our data to get the hashes
id1_input <- sdf_titanic_va %>%
  dplyr::filter(id %in% 1:700) %>%
  dplyr::pull(features)
lapply(id1_input, function(x) ml_approx_nearest_neighbors(
  brp_fit,
  sdf_titanic_va,
  key = x,
  dist_col = 'dist_col',
  num_nearest_neighbors = 2
))
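A possibly more efficient route (a sketch, not from the original post): sparklyr::ml_approx_similarity_join() can join the dataset against itself in a single pass, instead of re-running the pipeline once per key. This assumes the joined output exposes id_a, id_b and the distance column, as in the sparklyr examples; the threshold here is a hypothetical value that would need tuning to the data.
self_joined <- sparklyr::ml_approx_similarity_join(
  brp_fit,
  sdf_titanic_va,
  sdf_titanic_va,
  threshold = 10, # hypothetical; only pairs closer than this are returned
  dist_col = 'dist_col'
)
nearest <- self_joined %>%
  dplyr::filter(id_a != id_b) %>% # drop each point's match with itself
  dplyr::group_by(id_a) %>%
  dplyr::filter(dist_col == min(dist_col)) # keep the closest neighbour per point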

Related

Find two points farthest apart using group_by() in sf

I haven't been able to find an exact Q/A to match this problem, though there are several related ones. I'm trying to calculate a distance matrix for all groups of points defined by an ID column, then select the two points that are farthest apart in each group, retaining the original group id. The number of points in each group varies between 2, 4 and 6.
My sf df:
df <- data.frame(x = runif(12), y = runif(12), id = rep(1:3, each = 4)) %>%
  st_as_sf(coords = c("x", "y"), crs = 27700)
I've tried code such as:
a <- df %>%
  group_by(id) %>%
  st_distance(.)
Though this just returns a distance matrix of all points.
The below gives me what I want, though I fear it would be slow on large datasets:
maxMin <- do.call(rbind, lapply(unique(df$id), function(x) {
  df_sub <- df %>% filter(id == x)
  d <- st_distance(df_sub)
  df_sub %>% slice(unique(as.vector(which(d == max(d), arr.ind = TRUE))))
}))
You can use dplyr::group_split() to split your data frame into a list with one element per group. You can then apply whatever function you want to that list using map()/lapply().
The script below keeps the 2 points which are farthest apart in each group.
library(sf)
library(tidyverse)
# dummy data
data <- data.frame(x = runif(12), y = runif(12), id = rep(1:3, each = 4)) %>%
  st_as_sf(coords = c("x", "y"), crs = 27700)
# split it into a list per ID
data_group <- data %>%
  group_by(id) %>%
  group_split()
# apply a function to each list element
distance_per_group <- map(data_group, function(x){
  distance_matrix <- st_distance(x)
  biggest_distance <- as.numeric(which(distance_matrix == max(distance_matrix), arr.ind = TRUE)[1, ])
  farthest_apart <- x[biggest_distance, ]
})
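If you want a single sf object back rather than a list, the per-group results can be recombined (a small addition, not in the original answer):
farthest_pairs <- do.call(rbind, distance_per_group)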

Finding closest distance between points in 3 data sets using sp/rgeos?

I have 3 data sets: one of people, one of stores, and one of farmers markets, each with lat/long points.
PeopleEx <- data.frame(replicate(2, sample(0:10, 10, rep = TRUE))) %>%
  rename(lat = X1, long = X2) %>%
  mutate(ID = 1:10, type = "people")
StoresEx <- data.frame(replicate(2, sample(0:10, 5, rep = TRUE))) %>%
  rename(lat = X1, long = X2) %>%
  mutate(ID = 1:5, type = "store")
MarketsEx <- data.frame(replicate(2, sample(0:10, 1, rep = TRUE))) %>%
  rename(lat = X1, long = X2) %>%
  mutate(ID = 1, type = "market")
Each has columns "Lat", "Long", "ID", and "type", where type can be "people", "market", or "store". I am trying to find the distance between each person and the closest store, as well as the closest market. To do so I made a data frame combining all 3 data sets.
DfExample <- rbind(PeopleEx, StoresEx, MarketsEx)
I've been using the sp and rgeos packages to calculate distance between points.
sp.DfExample <- DfExample
coordinates(sp.DfExample) <- ~long+lat
The first 10 entries are the people and the last 6 are the stores/markets, so I've used this code:
Distances <- gDistance(sp.DfExample, byid = TRUE)
DistanceStore <- Distances[1:10, 11:16]
MaxDist <- DistanceStore
apply(MaxDist, 1, FUN = min)
This works in that it reports the minimum distance from each person to a store/market, but it drops the type. Any ideas on how to keep the type from dropping, or different packages/approaches to use instead?
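One possible way to keep the type (a sketch, not from the original post): use which.min() on each row to recover the index of the nearest store/market, then look its type up in the combined data frame. The column names used below are illustrative.
nearest_idx <- apply(DistanceStore, 1, which.min) # column of the nearest store/market per person
data.frame(
  person_id    = DfExample$ID[1:10],
  min_dist     = apply(DistanceStore, 1, min),
  nearest_type = DfExample$type[10 + nearest_idx], # shift column index back to a row of DfExample
  nearest_ID   = DfExample$ID[10 + nearest_idx]
)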

Log Transform many variables in R with loop

I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n = 237. Let's call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables, so I want to end up with a data frame that has 331 columns: column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log-transformed nutrient variables. I would like these variables to take the names of the old variables but with "_log" at the end.
I have tried using a for loop and the mutate command but, as I'm not very well versed in R, I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[, 2:166])){
  nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr))
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let R know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
For anyone encountering this page more recently: dplyr::across() was introduced in 2020, and it is built for exactly this task, applying the same transformation to many columns at once.
A simple solution is below.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for the example
data_values <- purrr::rerun(.n = 165,
                            sample(x = 100,
                                   size = 237,
                                   replace = TRUE))
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
  set_names(variable_names) %>%
  as_tibble() %>%
  mutate(diagnosis = 1:237) %>%
  relocate(diagnosis, .before = everything())
# use across to perform the same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
  mutate(across(
    .cols = contains('nutrient_variable'),
    .fns = list(log10 = log10),
    .names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame, I would prefer to make a new data frame that contains only the transformed values, and change the column names:
logs_only <- nutr_all %>%
  mutate(across(
    .cols = contains('nutrient_variable'),
    .fns = log10)) %>%
  rename_with(.cols = contains('nutrient_variable'),
              .fn = ~ paste0(., '_log10'))
logs_only[1:5, 1:3]
We can use mutate_at:
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
  mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame:
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply:
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
  diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for(i in 2:166){
  nutr_all[i] <- runif(n = 237, 1, 10)
  names(nutr_all)[i] <- paste0("nutrient_", i - 1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
The function below, written with dplyr, will do your task. It can be used to get a log transformation of all numeric variables in the dataset, and it also checks whether a column has negative values; as currently written, it will not calculate the log for those columns.
logTransformation <- function(ds)
{
  # this function creates a log transformation of a data frame,
  # for only those variables which are positive in nature
  # args:
  #   ds : Dataset
  require(dplyr)
  if(!is.data.frame(ds)) { stop("ds must be a data frame") }
  ds <- ds %>%
    dplyr::select_if(is.numeric)
  # keep only the positive variables
  varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
  ds <- ds %>%
    dplyr::select(all_of(varList)) %>%
    dplyr::mutate_at(
      setNames(varList, paste0(varList, "_log")), log)
  return(ds)
}
You can use it for your case as:
# assuming your binary variable is named binaryVar
nutr_allTransformed <- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include negative variables too, replace varList as below:
varList <- names(ds)

Multiply a grouped data frame by a matrix with dplyr

My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made dummy version of my problem - which I think is clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution? In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes rather data hungry.
# create industry data frame
set.seed(12345)
ind_df <- data.frame(State = c(rep("a", len = 6), rep("b", len = 6), rep("c", len = 6)),
                     industry = rep(c("Ind1", "Ind2", "Ind3", "Ind4", "Ind5", "Ind6"), len = 18),
                     emp = rnorm(18, 20, 2))
# create occupation data frame
Occ_df <- data.frame(State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len = 5)),
                     occupation = rep(c("Occ1", "Occ2", "Occ3", "Occ4", "Occ5"), len = 15),
                     emp = rnorm(15, 10, 1))
# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6 * 5, 1, 0.5), 6, 5) %>% as.data.frame()
# name rows and cols of the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)
# solution
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
  group_by(State) %>%
  mutate(Occ1 = emp * Occ1,
         Occ2 = emp * Occ2,
         Occ3 = emp * Occ3,
         Occ4 = emp * Occ4,
         Occ5 = emp * Occ5
  )
Ind_combined <- Ind_combined %>%
  gather(key = "occupation",
         value = "emp2",
         -State,
         -industry,
         -emp
  )
Ind_combined <- Ind_combined %>%
  group_by(State, occupation) %>%
  summarise(emp2 = sum(emp2))
Occ_df <- left_join(Occ_df, Ind_combined)
My solution seems pretty inefficient; is there a better/faster way to do this?
Also, I am not quite sure how to get to this, but the expected outcome would be another column added to Occ_df called emp2, derived from the ind_df emp column and Ind_Occ_Conc. I have tried to step this out for Occupation 1; essentially Ind_Occ_Conc contains weights and the result is a weighted average.
I'm not sure about what you want to do with the sum(Ind$emp*Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a, get expected outcomes
# for all states (with dplyr):
Ind <- ind_df %>% group_by(State) %>%
  summarize(rez = sum(emp))
# Then do some computations on Ind, which is an N-element vector (one per state)
# ...
# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
The final output will then have the Ind values in a new column: one value for all of a, one for b and one for c.
Hope it will help ;)
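For what it's worth, here is a sketch of a more direct alternative (not from the original answers): since Ind_Occ_Conc holds industry-by-occupation weights, emp2 for each state is just the matrix product t(conc) %*% emp, which avoids writing out one mutate() term per occupation. The names conc, Ind_mm and Occ_df2 are illustrative.
library(dplyr)
library(purrr)
conc <- as.matrix(Ind_Occ_Conc) # industries x occupations
Ind_mm <- ind_df %>%
  group_split(State) %>%
  map_dfr(function(s) {
    emp_vec <- s$emp[match(rownames(conc), s$industry)] # align industries to conc rows
    data.frame(State = s$State[1],
               occupation = colnames(conc),
               emp2 = as.vector(t(conc) %*% emp_vec))
  })
Occ_df2 <- left_join(Occ_df, Ind_mm, by = c("State", "occupation"))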

How to work around an error while reshaping a data frame with spread()

I am trying to transform a long data frame into a wide one and flag cases. I pivot it and use a temporary vector that serves as a flag. It works perfectly on small data sets: see the example (copy and paste it into your RStudio session). But when I try to do it on real data it reports an error:
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
Error: Duplicate identifiers for rows (169, 249), (57, 109), (11, 226)
The wide data structure is relevant for further processing.
Is there any workaround for this problem? I bet a lot of people try to clean data and run into the same problem.
Please help me.
Here is the code:
The first chunk, "example", makes a small data set to show how the result is supposed to look.
The second chunk, "real data", is a sliced portion of the data set from the churn library.
library(caret)
library(tidyr)
#example
#============
df <- data.frame(var1 = (1:6),
                 var2 = (7:12),
                 factors = c("facto1", "facto2", "facto3", "facto3", "facto5", "facto1"),
                 flags = c(1, 1, 1, 1, 1, 1))
df
df2 <- spread(data = df, key = "factors", value = flags, fill = " ")
df2
#=============
# real data
#============
data(churn)
str(churnTrain)
churnTrain <- churnTrain[1:250, 1:4]
churnTrain$temporary <- 1
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
str(churnTrain)
head(churnTrain3)
str(churnTrain3)
#============
spread() can only put one unique value in the 'cell' that intersects the spread 'key' and the rest of the data (in the churn example, account_length, area_code and international_plan). So the real question is how to manage these duplicate entries; the answer to that depends on what you are trying to do. I provide one possible solution below. Instead of making a dummy 'temporary' variable, I count the number of episodes and use that as the dummy variable. This can be done very easily with dplyr:
library(tidyr)
library(dplyr)
library(C50) # this is one source for the churn data
data(churn)
churnTrain <- churnTrain[1:250,1:4]
churnTrain2 <- churnTrain %>%
  group_by(state, account_length, area_code, international_plan) %>%
  tally() %>%
  dplyr::rename(temporary = n)
churnTrain3 <- spread(churnTrain2, key = "state", value = "temporary", fill = 0)
Spread now works.
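As a newer alternative (a sketch, not part of the original answers), tidyr::pivot_wider() can absorb the duplicates directly through its values_fn argument, counting the duplicated rows much like the tally() step above:
churnTrain4 <- churnTrain %>%
  dplyr::mutate(temporary = 1) %>%
  tidyr::pivot_wider(names_from = state,
                     values_from = temporary,
                     values_fn = sum, # count duplicate rows per cell
                     values_fill = 0)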
As others point out, you need to input a unique vector into spread. My solution uses base R:
library(C50)
f <- function(df, key){
  if (sum(names(df) == key) == 0) stop("No such key")
  u <- unique(df[[key]])
  id <- matrix(0, dim(df)[1], length(u))
  uu <- lapply(df[[key]], function(x) which(u == x)) ## check 43697442 for details
  for(i in 1:dim(df)[1]) id[i, uu[[i]]] <- 1
  colnames(id) <- as.character(u)
  return(cbind(df, id))
}
df <- data.frame(var1 = (1:6),
                 var2 = (7:12),
                 factors = c("facto1", "facto2", "facto3", "facto3", "facto5", "facto1"))
f(df, key = 'fact') # stops with "No such key"
f(df, key = 'factors')
data(churn)
churnTrain <- churnTrain[1:250, 1:4]
f(churnTrain, key = 'state')
Although you may see a for-loop and other temporary variables inside the f function, it is not slow in practice.
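A compact base R alternative worth knowing (a sketch, not from the original answer): model.matrix() one-hot encodes a factor in a single vectorised call, though the dummy columns come back prefixed with the variable name.
id <- model.matrix(~ factors - 1, data = df) # columns named factorsfacto1, factorsfacto2, ...
cbind(df, id)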
