Function to calculate Euclidean distance in R

I am trying to implement a KNN classifier in R from scratch on the iris data set, and as part of this I have written a function to calculate the Euclidean distance. Here is my code.
known_data <- iris[1:15,c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16,c("Sepal.Length", "Petal.Length")]
# euclidean distance
euclidean_dist <- function(k,unk) {
distance <- 0
for(i in 1:nrow(k))
distance[i] <- sqrt((k[,1][i] - unk[,1][i])^2 + (k[,2][i] - unk[,2][i])^2)
return(distance)
}
euclidean_dist(known_data, unknown_data)
However, when I call the function it returns the first value correctly and the rest as NA.
Could anyone show me where I have gone wrong with the code?
Thanks in advance.

The aim is to calculate the distance between the ith row of known_data, and the single unknown_data point.
How to fix your code
When you calculate distance[i], you're trying to access the ith row of the unknown data point, which doesn't exist (unknown_data has only one row), and is hence NA. I believe your code should run fine if you make the following edits:
known_data <- iris[1:15,c("Sepal.Length", "Petal.Length", "Class")]
unknown_data <- iris[16,c("Sepal.Length", "Petal.Length")]
# euclidean distance
euclidean_dist <- function(k,unk) {
  # Make distance a vector [although not technically required]
  distance <- rep(0, nrow(k))
  for(i in 1:nrow(k)) {
    # Change unk[,1][i] to unk[1,1] and similarly for unk[,2][i]
    distance[i] <- sqrt((k[,1][i] - unk[1,1])^2 + (k[,2][i] - unk[1,2])^2)
  }
  return(distance)
}
euclidean_dist(known_data, unknown_data)
One final note: in the version of R I'm using, the iris dataset has a Species column rather than a Class column, so you may need to change that column name when building known_data.
An alternative method
As suggested by @Roman Luštrik, the entire aim of getting the Euclidean distances can be achieved with a simple one-liner:
sqrt((known_data[, 1] - unknown_data[, 1])^2 + (known_data[, 2] - unknown_data[, 2])^2)
This is very similar to the function you wrote, but does it in vectorised form, rather than through a loop, which is often a preferable way of doing things in R.
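If you later want more than two features, the same vectorised idea can be written for any number of columns. This is only a sketch: the function name euclidean_to_point is made up for illustration, and it assumes the Species version of iris with purely numeric feature columns.

```r
# Distance from every row of `known` to the single point `unknown`.
# `euclidean_to_point` is a hypothetical helper name, not from any package.
euclidean_to_point <- function(known, unknown) {
  known <- as.matrix(known)
  unknown <- as.numeric(unlist(unknown))
  stopifnot(ncol(known) == length(unknown))
  # sweep() subtracts the point's coordinates from each column of `known`
  sqrt(rowSums(sweep(known, 2, unknown)^2))
}

features <- c("Sepal.Length", "Petal.Length")
known_data <- iris[1:15, features]
unknown_data <- iris[16, features]
d <- euclidean_to_point(known_data, unknown_data)
```

From there, order(d)[1:k] gives the row indices of the k nearest known points, which is the next step of a kNN classifier.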

The best and fastest way is to use the h2o package:
#load library
library(h2o)
#initialize the node
h2o.init()
#transform the numeric feature columns of the df to h2o type
known_data<-as.h2o(known_data[,1:2])
unknown_data<-as.h2o(unknown_data)
#create a matrix in which the distances are going to be recorded
matrix1<-matrix(NA_real_,nrow=nrow(known_data),ncol=nrow(unknown_data))
#loop over the unknown rows and compute the distance to every known row
for(i in 1:nrow(unknown_data)){
matrix1[,i]<-as.data.frame(h2o.distance(known_data, unknown_data[i,],"l2"))[,1]
}

Related

Optimize lapply for distance matrix function R

I am trying to find the cluster number from HDBSCAN analysis of cell coordinates grouped by an image ID in a dataframe.
My approach so far is to split the dataframe containing the ID, X, and Y columns by the ID and then use lapply to run a function on each element as such:
dlist <- split(d[, -c(1)], d$ID) #subgroup dataframe "d" as list and remove the ID column
cls <- function(x) {
dm <- dist(x, method = "euclidean", p = 2) %>% as.matrix() #run distance matrix for each imageID's X,Y coordinates
cl <- hdbscan(dm, minPts = 3) #run unsupervised cluster analysis on matrix
lv <- length(cl$cluster_scores)
return(lv) #return the cluster number for each image ID
}
ClusterNumbers <- lapply(dlist, FUN = cls) %>% bind_rows()
I know the cluster analysis methodology may not be the most robust, but it is just a proof of concept at present. My issue is that this method is painfully slow, so I am looking for a way (short of submitting this to the uni HPCC) to make the process more efficient. I have tried generating the matrices prior to the cluster analysis, but the amount of data prohibits this, as I cannot allocate vectors that large.
Any help would be awesome.

Calculate Errors using loop function in R

I have two data matrices with the same dimensions. I want to extract the same series of column vectors from each, treat both series as vectors, and then calculate different errors, for example the mean absolute error (mae), mean absolute percentage error (mape) and root mean square error (rmse). My data matrices are quite large, so I will explain with an example and calculate these errors manually as:
mat1<- matrix(6:75,ncol=10,byrow=T)
mat2<- matrix(30:99,ncol=10,byrow=T)
mat1_seri1 <- as.vector(mat1[,c(1+(0:4)*2)])
mat1_seri2<- as.vector(mat1[,c(2+(0:4)*2)])
mat2_seri1 <- as.vector(mat2[,c(1+(0:4)*2)])
mat2_seri2<- as.vector(mat2[,c(2+(0:4)*2)])
mae1<-mean(abs(mat1_seri1-mat2_seri1))
mae2<-mean(abs(mat1_seri2-mat2_seri2))
For mape
mape1<- mean(abs(mat1_seri1-mat2_seri1)/mat1_seri1)*100
mape2<- mean(abs(mat1_seri2-mat2_seri2)/mat1_seri2)*100
Similarly, I calculate the rmse from its formula. As I have large data matrices, doing this manually is quite time-consuming. Is it possible to do this with a loop that outputs the error terms (mae, mape, rmse) for each series separately?
I'm not sure if this is what you are looking for, but here is a function that could automate the process, maybe there is also a better way:
fn <- function(m1, m2) {
stopifnot(dim(m1) == dim(m2))
mat1_seri1 <- as.vector(m1[, (1:ncol(m1))[(1:ncol(m1))%%2 != 0]])
mat1_seri2 <- as.vector(m1[, (1:ncol(m1))[!(1:ncol(m1))%%2]])
mat2_seri1 <- as.vector(m2[, (1:ncol(m2))[(1:ncol(m2))%%2 != 0]])
mat2_seri2 <- as.vector(m2[, (1:ncol(m2))[!(1:ncol(m2))%%2]])
mae1 <- mean(abs(mat1_seri1-mat2_seri1))
mae2 <- mean(abs(mat1_seri2-mat2_seri2))
mape1 <- mean(abs(mat1_seri1-mat2_seri1)/mat1_seri1)*100
mape2 <- mean(abs(mat1_seri2-mat2_seri2)/mat1_seri2)*100
setNames(as.data.frame(matrix(c(mae1, mae2, mape1, mape2), ncol = 4)),
c("mae1", "mae2", "mape1", "mape2"))
}
fn(mat1, mat2)
  mae1 mae2    mape1    mape2
1   24   24 92.62581 86.89572
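The function above leaves out the rmse that the question also asks for. A possible extension is sketched below; the names error_metrics and series_errors are invented for illustration, and it assumes, as in the example, that series 1 is the odd-numbered columns and series 2 the even-numbered ones, with the first matrix used as the denominator for mape.

```r
# mae, mape and rmse for one pair of vectors (hypothetical helper).
error_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(mae  = mean(abs(err)),
    mape = mean(abs(err) / actual) * 100,  # same formula as in the question
    rmse = sqrt(mean(err^2)))
}

# One row of metrics per series: odd columns, then even columns.
series_errors <- function(m1, m2) {
  stopifnot(identical(dim(m1), dim(m2)), ncol(m1) >= 2)
  odd  <- seq(1, ncol(m1), by = 2)
  even <- seq(2, ncol(m1), by = 2)
  rbind(series1 = error_metrics(as.vector(m1[, odd]),  as.vector(m2[, odd])),
        series2 = error_metrics(as.vector(m1[, even]), as.vector(m2[, even])))
}

mat1 <- matrix(6:75, ncol = 10, byrow = TRUE)
mat2 <- matrix(30:99, ncol = 10, byrow = TRUE)
res <- series_errors(mat1, mat2)
```

Because every element of mat2 is exactly 24 larger than the matching element of mat1, mae and rmse both come out as 24 for this toy data, and the mape values agree with the output shown above.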

Monte Carlo Simulation for DCF Model in R

I am trying to create a function where a Monte Carlo simulation is applied to two of the variables in a DCF model in RStudio. It is supposed to take an initial value FCF_0 and apply a specific growth rate to it, FCF_0*(1 + growth), where growth is the first input variable, until period 6; each period takes the previous FCF and keeps growing it. After that I would like to discount each cash flow to get its present value, which would be FCFn*(1/((1+WACC)^n)), where WACC is the second variable to simulate.
So far I have a function that calculates the FCF, but only from a vector of specific growth values; it is the first function below (revfunc).
What I have tried so far for the simulation is the second function below (pvffcf_function), but I think it is bad.
Could you please help me understand how to create both simulations, and whether I need two functions or can do everything in one? I would expect the function to return the sum of all present values, with each sum being one element in a vector of 10,000 simulations. I am new to this, and even though I have been reading for almost two weeks, I don't understand how to create these simulations.
Thank you very much!
revfunc <- function(hist, growth){
rval <- c()
help <- c(hist)
for(i in growth){
help <- help*(1+i)
rval <- c(rval, help)
}
return(rval)
}
Monte Carlo Simulations
pvffcf_function <- function(fcf0, growth, wacc){
rval1 <- c()
help <- c(fcf0)
pvs <- rval1*(1/((1+wacc)^n))
random_growth <- rnorm(n=10000, mean(fcfgrowth), sd(fcfgrowth))
wacc <- rnorm(n=10000, 0.03804, 0.007711)
pvffcf <- sum(freecashflows)
for(i in growth){
help <- help*(1+i)
rval1 <- c(rval1, help)
}
return(freecashflows)
}
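One way the description above could be put into a single function is sketched here. It is only a sketch under assumptions: the name simulate_dcf, the starting cash flow of 100 and the growth mean and sd are invented for illustration (only the WACC mean and sd come from the code above), and it assumes a fresh growth rate is drawn for every period while one WACC is drawn per simulation.

```r
# Each simulation: grow FCF_0 over n_periods, discount, and sum the present values.
# `simulate_dcf` and its argument names are hypothetical.
simulate_dcf <- function(fcf0, growth_mean, growth_sd, wacc_mean, wacc_sd,
                         n_periods = 6, n_sims = 10000) {
  periods <- seq_len(n_periods)
  vapply(seq_len(n_sims), function(s) {
    growth <- rnorm(n_periods, growth_mean, growth_sd)  # assumed: one draw per period
    wacc   <- rnorm(1, wacc_mean, wacc_sd)              # assumed: one draw per simulation
    fcf    <- fcf0 * cumprod(1 + growth)                # FCF_n = FCF_(n-1) * (1 + growth_n)
    sum(fcf / (1 + wacc)^periods)                       # sum of FCF_n / (1 + WACC)^n
  }, numeric(1))
}

set.seed(42)
pv <- simulate_dcf(fcf0 = 100, growth_mean = 0.05, growth_sd = 0.02,
                   wacc_mean = 0.03804, wacc_sd = 0.007711)
```

pv is then a vector of 10,000 summed present values, so summary(pv) or hist(pv) shows the simulated distribution. cumprod() replaces the loop in revfunc, so a second function is not strictly needed.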

R - Different approach to speed up 3 dimension array/matrix creation

My question is one of approach. Using SO I iterated through methods to create a 3 dimension array in R (this is my first question; R is a constraint). The use case is that this final array needs to be updated often but the two input arrays are updated at different periods. The goal is to minimize the final array creation time, but also intermediary steps if possible.
I know I can reach out with Rcpp, and I assign more than I need to for readability, but what I am wondering is:
Is there a better approach to completing this operation?
if (!require("geosphere")) install.packages("geosphere")
#simulate real data
dimLength <- 418
latLong <- cbind(rep(40,418),rep(2,418))
potentialChurn <- as.matrix(rep(500,418))
#create 2D matrix
valueMat <- matrix(0,dimLength,dimLength)
value <- potentialChurn
valueTranspose <- t(value)
for (s in 1:dimLength){valueMat[s,] <- value + valueTranspose[s]}
diag(valueMat) <- 0
#create 3D matrix from copying 2D matrix
bigValMat <- array(0,dim=c(dimLength,dimLength,dimLength))
for (d in 1:dimLength){bigValMat[,d,] <- valueMat}
#get crow fly distance between locations, create 2D matrix
distMat <- as.matrix(outer(seq(dimLength), seq(dimLength), Vectorize(function(i, j) distCosine(latLong[i,], latLong [j,]))))
###create 3D matrix by calculating distance between any two locations;
# create 2D matrix from each column in original 2D matrix
# add this column-replicated 2D matrix to the original
bigDistMat <- array(0,dim=c(dimLength,dimLength,dimLength))
for (p in 1:dimLength){
addCol <- distMat[,p]
addMatrix <- as.matrix(addCol)
for (y in 2:dimLength) {addMatrix <- cbind(addMatrix,addCol)}
bigDistMat[,p,] <- data.matrix(distMat) + data.matrix(addMatrix)}
#Final matrix calculation
bigValDistMat <- bigValMat / bigDistMat
For context, this is part of a two-step-ahead forecast policy developed for a class using Barcelona bike-sharing (Bicing) data. The project is over and I am interested in how I could have done better.
In general, if you want to speed up your code, you need to identify the bottlenecks and fix them, as explained here. Putting all your code in a function beforehand would be a good idea.
In your specific case, you use far too many for loops for R code. You need to vectorize your code much more.
Edit
Now for the long answer:
library(geosphere) # provides distCosine
#simulate real data, you want them to be random
dimLength <- 418
latLong <- cbind(rnorm(dimLength,40,0.5),rnorm(dimLength,2,0.5))
potentialChurn <- as.matrix(rnorm(dimLength,500,10))
#create 2D matrix, outer is designed for this operation
value <- potentialChurn
valueMat <- outer(value,t(value),FUN="+")[,1,1,]
diag(valueMat) <- 0
# create 3D matrix from copying 2D matrix, again, avoid for loop
bigValMat <- array(rep(valueMat,dimLength),dim=c(dimLength,dimLength,dimLength))
# and use aperm to permute the dimensions
bigValMat <- aperm(bigValMat,c(1,3,2))
#get crow fly distance between locations, create 2D matrix
# other packages are available to compute that kind of distance matrix
# but let's stay in plain R
# wordy but so much faster (and easier to read)
longs1 <- rep(latLong[,1],dimLength)
lats1 <- rep(latLong[,2],dimLength)
latLong1 <- cbind(longs1,lats1)
longs2 <- rep(latLong[,1],each=dimLength)
lats2 <- rep(latLong[,2],each=dimLength)
latLong2 <- cbind(longs2,lats2)
distMat <- matrix(distCosine(latLong1,latLong2),ncol=dimLength)
###create 3D matrix by calculating distance between any two locations;
# same logic as for bigValMat
addMatrix <- array(rep(distMat,dimLength),dim=rep(dimLength,3))
distMat3D <- aperm(addMatrix,c(1,3,2))
bigDistMat <- addMatrix + distMat3D
#Final matrix calculation
bigValDistMat <- bigValMat / bigDistMat
Here it is about 25x faster than your initial code (76s -> 3s). It could still be improved a lot, but you get the idea: avoid for, cbind and co at all costs.
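To convince yourself that the rep()/aperm() construction gives the same array as the original loop, here is a small check on a toy matrix. The size n = 4 and the names m, loop_arr, vec_arr and pair_sum are made up for illustration.

```r
n <- 4
m <- matrix(seq_len(n * n), n, n)

# Loop version from the question: every slice [, d, ] is a copy of m
loop_arr <- array(0, dim = c(n, n, n))
for (d in seq_len(n)) loop_arr[, d, ] <- m

# Vectorised version: stack n copies of m along the third dimension,
# then swap dimensions 2 and 3 with aperm()
stacked <- array(rep(m, n), dim = c(n, n, n))
vec_arr <- aperm(stacked, c(1, 3, 2))

stopifnot(all(loop_arr == vec_arr))

# Same trick as bigDistMat: element [i, j, k] is m[i, j] + m[i, k]
pair_sum <- stacked + vec_arr
```

The key point is that array(rep(m, n), ...) puts m[i, j] at position [i, j, k], while the aperm() call moves it to position [i, k, j], so no explicit loop over slices is needed.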

using k-NN in R with categorical values

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numerical features) doesn't fit.
I'm looking for a kNN implementation for [R] where it is possible to select different distance methods, like Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.
# Generate dummy data
y <- rep(1:2, each=50) # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5 # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
# apply() returns each row's ordering as a column, so transpose to get one row per test point
NN <- t(apply(d[test.set, design.set], 1, order))
# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
  tab <- table(y[design.set][nn])
  as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
})
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to preserve the subset of NN as matrix in the case k=1. If not it will be converted to a vector and apply will throw an error.
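If you would rather not depend on e1071, the same idea can be sketched in base R for genuinely categorical (character or factor) columns. The function names hamming_matrix and knn_hamming and the toy data are invented for illustration, and ties in the vote are simply broken by which.max(), which may not be what you want in practice.

```r
# Number of mismatching columns between every row of `a` and every row of `b`.
# `hamming_matrix` is a hypothetical helper, not part of any package.
hamming_matrix <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  stopifnot(ncol(a) == ncol(b))
  d <- matrix(0L, nrow(a), nrow(b))
  for (j in seq_len(ncol(a))) {
    d <- d + outer(a[, j], b[, j], FUN = "!=")  # adds 1 where column j differs
  }
  d
}

# Majority vote among the k training rows closest to each test row.
knn_hamming <- function(train, labels, test, k = 3) {
  d <- hamming_matrix(test, train)  # rows: test points, columns: training points
  unname(apply(d, 1, function(dist_row) {
    nn <- order(dist_row)[seq_len(k)]
    votes <- table(labels[nn])
    names(votes)[which.max(votes)]
  }))
}

train <- data.frame(
  colour = c("red", "red", "blue", "blue", "red", "blue"),
  shape  = c("round", "round", "square", "square", "round", "round"),
  size   = c("small", "small", "large", "large", "large", "large"),
  stringsAsFactors = FALSE
)
labels <- c("A", "A", "B", "B", "A", "B")
test <- data.frame(colour = c("red", "blue"),
                   shape  = c("round", "square"),
                   size   = c("small", "large"),
                   stringsAsFactors = FALSE)
pred <- knn_hamming(train, labels, test, k = 3)
```

Any other dissimilarity can be swapped in by replacing hamming_matrix(), as long as it returns one row per test point and one column per training point.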
