Calculate euclidean distance in a faster way - r

I want to calculate the euclidean distances between rows of a dataframe with 30.000 observations. A simple way to do this is the dist function (e.g., dist(data)). However, since my dataframe is large, this takes too much time.
Some of the rows contain missing values. I do not need the distances between two rows that both contain missing values, or between two rows that are both complete. I only need the distances between an incomplete row and a complete row.
In a for-loop, I tried to exclude the combinations that I do not need. Unfortunately, my solution takes even more time:
# Some example data
data <- data.frame(
x1 = c(1, 22, NA, NA, 15, 7, 10, 8, NA, 5),
x2 = c(11, 2, 7, 15, 1, 17, 11, 18, 5, 5),
x3 = c(21, 5, 6, NA, 10, 22, 12, 2, 12, 3),
x4 = c(13, NA, NA, 20, 12, 5, 1, 8, 7, 14)
)
# Measure speed of dist() function
start_time_dist <- Sys.time()
# Calculate euclidean distance with dist() function for complete dataset
dist_results <- dist(data)
end_time_dist <- Sys.time()
time_taken_dist <- end_time_dist - start_time_dist
# Measure speed of my own loop
start_time_own <- Sys.time()
# Calculate euclidean distance with my own loop only for specific cases
# # #
# The following code should be faster!
# # #
data_cc <- data[complete.cases(data), ]
data_miss <- data[complete.cases(data) == FALSE, ]
distance_list <- list()
for(i in 1:nrow(data_miss)) {
  distances <- numeric()
  for(j in 1:nrow(data_cc)) {
    distances <- c(distances, dist(rbind(data_miss[i, ], data_cc[j, ]), method = "euclidean"))
  }
  distance_list[[i]] <- distances
}
end_time_own <- Sys.time()
time_taken_own <- end_time_own - start_time_own
# Compare speed of both calculations
time_taken_dist # 0.002001047 secs
time_taken_own # 0.01562881 secs
Is there a faster way to calculate the euclidean distances that I need?

I recommend using parallel computation: put all your code in one function and run it in parallel.
By default, R does all calculations in a single thread, so you have to add parallel workers manually. Starting a cluster takes time, but for a large data frame the main job can run up to roughly (number_of_cores - 1) times faster.
These links may also help: How-to go parallel in R – basics + tips and A gentle introduction to parallel computing in R.
A good approach is to divide the job into smaller packets and process them separately on each worker. Create the workers only once, because starting them is time-consuming in R.
# I am not sure that all libraries are listed here;
# try ??your_function to find out which library you need
library(parallel)
library(doParallel) # also attaches foreach
# Determine how many cores your computer has
no_cores <- detectCores() - 1 # always leave one core free for the system
start_time <- Sys.time() # record the start time
# Start the parallel workers (create them only once)
cl <- makeCluster(no_cores, outfile = "mycalculation_debug.txt")
registerDoParallel(cl)
# 'packes' is assumed to hold the size of each work packet; define it yourself
# Results will be in out.df (a data frame)
out.df <- foreach(p = seq_along(packes)
  , .combine = rbind # data from different workers will be in one table
  , .packages = c() # all packages that your function uses must be listed here
  , .inorder = TRUE) %dopar% { # don't forget this directive
  tryCatch({
    # enter your function here and do what you want in parallel
    # progress message: x and istart come from your own code, percent() from the scales package
    print(paste(date(), 'packet', p, percent((x - istart) / packes[p]), 'done'))
  }, error = function(e) return(paste0("The variable '", p, "'",
    " caused the error: '", e, "'")))
}
stopCluster(cl)
gc() # force R to free the memory of the stopped workers
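Separately from going parallel, it may help to replace the per-pair dist() calls with matrix arithmetic. This is a different technique from the one above, offered as a minimal sketch: the helper name dist_to_complete is made up for illustration, and the rescaling by the number of usable columns is there to mimic how dist() treats NA (it scales the sum up proportionally to the number of columns used).

```r
# Example data from the question
data <- data.frame(
  x1 = c(1, 22, NA, NA, 15, 7, 10, 8, NA, 5),
  x2 = c(11, 2, 7, 15, 1, 17, 11, 18, 5, 5),
  x3 = c(21, 5, 6, NA, 10, 22, 12, 2, 12, 3),
  x4 = c(13, NA, NA, 20, 12, 5, 1, 8, 7, 14)
)

m      <- as.matrix(data)
is_cc  <- complete.cases(m)
m_cc   <- m[is_cc, , drop = FALSE]   # complete rows
m_miss <- m[!is_cc, , drop = FALSE]  # rows with at least one NA

# Hypothetical helper: distances from one incomplete row to all complete rows
dist_to_complete <- function(row, m_cc) {
  used  <- !is.na(row)                                      # columns available in this row
  diffs <- sweep(m_cc[, used, drop = FALSE], 2, row[used])  # subtract the row column-wise
  # Rescale by (total columns / used columns), as dist() does for missing values
  sqrt(rowSums(diffs^2) * length(row) / sum(used))
}

distance_list <- lapply(seq_len(nrow(m_miss)),
                        function(i) dist_to_complete(m_miss[i, ], m_cc))
```

Each list element holds the distances from one incomplete row to every complete row, so only the outer loop over incomplete rows remains. If that is still too slow for 30,000 rows, the lapply call is the natural piece to hand to the parallel workers.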


filter out rolling mean results with limited data

I am trying to calculate the rolling mean of a time series. I have no problems with the calculation itself. However, looking at the results, there are locations along the time series where the rolling mean is based on only one or two values surrounded by a long series of missing values. I would like the rolling average to be computed only when more than 50% of the data within the window width is available. If less than 50% of the data is available, the result for that index should be NaN.
I wrote some example code to hopefully demonstrate what I am trying to accomplish.
#Create example data
ind=which(dat2 %in% sample(dat2,5))
#in this case ind=c(4, 7, 8, 13, 16)
ind2=which(dat3 %in% sample(dat3,12))
#in this case ind2=c(2, 5, 7, 8, 9, 10, 11, 12, 13, 14, 17, 18)
#create a time series
now <- Sys.time()
tseq <- seq(from = now, length.out = 20, by = "mins")
#data in zoo format
#rolling mean using roll apply
dat1rollmean=rollapply(dat1,width=5,align='center',FUN=function(x) mean(x,na.rm=T))
dat2rollmean=rollapply(dat2,width=5,align='center',FUN=function(x) mean(x,na.rm=T))
dat3rollmean=rollapply(dat3,width=5,align='center',FUN=function(x) mean(x,na.rm=T))
#doesn't work
#desired rolling mean result
In this example, dat1 is a complete dataset for which my rollapply (width of 5) function works well, dat2 and dat3 have different levels of missing data. I would want my result in this case to replace any points in which the rollapply is performed on less than 2 points of data with NaN. That would be index 4 for dat2rollmean and indexes 4-14 for dat3rollmean. How would I write a function to find these instances of insufficient data and replace the resulting rolling mean result with NaN?
Use Mean defined below:
Mean <- function(x) if (sum(is.na(x)) < length(x) / 2) mean(x, na.rm = TRUE) else NaN
res1 <- rollapply(dat1, 5, Mean)
identical(res1, dat1rollmean)
## [1] TRUE
res2 <- rollapply(dat2, 5, Mean)
identical(res2, dat2des)
## [1] TRUE
res3 <- rollapply(dat3, 5, Mean)
identical(res3, dat3des)
## [1] TRUE
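To see what Mean does on each window without needing the zoo objects from the question, here is a small base-R sketch. The vector x is made-up data, and embed() is used only as a stand-in for rollapply: it builds one row per window of width 5, so the values correspond to the interior (centered) positions.

```r
Mean <- function(x) if (sum(is.na(x)) < length(x) / 2) mean(x, na.rm = TRUE) else NaN

# Hypothetical series with a run of missing values
x <- c(1, 2, NA, NA, NA, 6, 7, 8, NA, 10)
width <- 5

# One row per window; the order within a row is reversed, which does not affect the mean
windows <- embed(x, width)
roll <- apply(windows, 1, Mean)
roll
# NaN NaN NaN 7.00 7.00 7.75
```

The first three windows each contain three NAs out of five values, so they come back as NaN; the remaining windows have at most two NAs and get an ordinary mean.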

Shuffle under constraints

I have a question from a book on Monte Carlo methods that I am working through, and I cannot figure it out. The question is as follows:
Obtain random shuffles of the cards: club 2, 3, 4, 5, 6; diamond 2, 3, 4, 5, 6; heart 2, 3, 4, 5, 6; and spade 2, 3, 4; in such a way that no clubs or spades appear in positions 1, 4, 7, . . ., no hearts appear in positions 2, 5, 8, . . ., and no diamonds or spades appear in positions 3, 6, 9, . . . .
My current best solution is to construct a matrix of possible cards to draw, where each row is a turn and each column a card, and to iterate down the rows. However, I am running into a problem: in some of the later draws I run out of cards that meet the constraints of the question.
# 1-5 club, 6-10 diamond, 11-15 heart, 16-18 spade
#no spade, club
no_s_c <- matrix(1,nrow = 18, ncol = 18)
no_s_c [,1:5] <- 0
no_s_c[,16:18] <- 0
#no spade no diamond
no_d_s<- matrix(1,nrow = 18, ncol = 18)
no_d_s [,6:10] <- 0
no_d_s[,16:18] <- 0
#no hearts
no_h <- matrix(1,nrow = 18, ncol = 18)
no_h[,11:15] <- 0
turn_no_s_c <- c(1,4,7,10,13,16)
turn_no_d_s <- c(3,6,9,12,15,18)
turn_no_h <- c(2,5,8,11,14,17)
#pseudo-transition matrix
M <- matrix(0, nrow = 18, ncol = 18) # base R equivalent of zeros(18), which is not a base function
for(i in turn_no_s_c){M[i,] <- no_s_c[i,]}
for(i in turn_no_d_s){M[i,] <- no_d_s[i,]}
for(i in turn_no_h){M[i,] <- no_h[i,]}
random_w_contraint <- function(){ # there are problems with the dimension of this problem
  card_order <- rep(0,dim(M)[1])
  for(i in 1:dim(M)[1]){
    x <- sample(which(M[i,] !=0),1)
    card_order[i] <- x
    M[,x] <- 0
  }
  card_order
}
Thanks for your help!
I'd recommend a two-step approach: writing helper functions for drawing cards from a deck, and then calling these functions in an order that meets your constraints.
Heads-up as you read: I'm naming the cards differently than you do (I call the two-of-clubs "2C" instead of 1), but the general advice still stands.
Helper Functions for Card Decks
You can deal with card-based problems by creating a list or data.frame to represent the deck of cards you're working with.
make_deck <- function(){
  list(club = paste0('C', 2:6),
       diamond = paste0('D', 2:6),
       heart = paste0('H', 2:6),
       spade = paste0('S', 2:4)) # the question only has spades 2, 3, 4
}
Then, you can write functions to draw a random card from particular suits in a deck:
draw_from_suits <- function(deck, suits){
  cards <- unlist(deck[suits], use.names = FALSE)
  # If there are no cards in the requested suits, return NA
  if (length(cards) == 0) { return(NA) }
  # Otherwise, grab a random card
  sample(cards, 1)
}
Once you know what card you've picked, you can remove it from the deck with another helper function:
get_suit <- function(card){
  switch(substr(card, 1, 1),
         C = 'club',
         D = 'diamond',
         H = 'heart',
         S = 'spade')
}
remove_from_deck <- function(deck, card){
  suit <- get_suit(card)
  deck[[suit]] <- setdiff(deck[[suit]], card)
  # Return the updated deck
  deck
}
Now, if we want to sample a card from the hearts suit, we'd have this three-step process:
deck <- make_deck()
card <- draw_from_suits(deck, 'heart')
deck <- remove_from_deck(deck, card)
Sampling With Constraints
The second challenge in this problem that you identify is that you can run into dead ends partway through. You could write the sampling function so that it resets itself and starts from scratch every time it hits a dead end.
You can do this many ways. One is to use a while loop to keep trying until you succeed:
sample_with_constraint <- function(){
  # The suits we're allowed to draw from at each step
  suit_sequence <- list(c('heart', 'diamond'),
                        c('club', 'diamond', 'spade'),
                        c('heart', 'club'))
  # We'll use this variable to track whether we're done dealing cards
  dealt <- FALSE
  while (dealt == FALSE) {
    deck <- make_deck()
    hand <- rep(NA, length(unlist(deck)))
    # Step through the hand and build it card-by-card
    for (ii in seq_along(hand)) {
      # Use the modulo operator to identify the step of the sequence
      # (positions 1, 4, 7, ... map to step 1, and so on)
      which_suits <- suit_sequence[[((ii - 1) %% 3) + 1]]
      card <- draw_from_suits(deck, which_suits)
      # If we failed to draw a card, this is a dead end
      # So break out of this for-loop
      if (is.na(card)) { break }
      hand[ii] <- card
      deck <- remove_from_deck(deck, card)
    }
    # If there are no more cards in the deck, we've successfully dealt a hand
    # In this case, flip 'dealt' to TRUE. Otherwise it stays FALSE and we try again.
    dealt <- length(unlist(deck)) == 0
  }
  hand
}
You could also adapt the for loop at the end of your random_w_contraint function to do something similar.
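Whichever version you use, it is worth checking the result against the constraints. Below is a compact, self-contained sketch of the same restart-on-dead-end idea with a checker. The names forbidden, is_valid_hand, shuffle_with_restart and the max_tries cap are all made up for illustration, and the deck is a flat character vector rather than the list used above.

```r
# Deck from the question (spades 2-4 only); the first letter encodes the suit
deck <- c(paste0("C", 2:6), paste0("D", 2:6), paste0("H", 2:6), paste0("S", 2:4))

# Suits forbidden at positions 1,4,7,... / 2,5,8,... / 3,6,9,...
forbidden <- list(c("C", "S"), "H", c("D", "S"))

# Hypothetical checker: TRUE if no card sits in a position where its suit is forbidden
is_valid_hand <- function(hand) {
  suits <- substr(hand, 1, 1)
  rule  <- ((seq_along(hand) - 1) %% 3) + 1
  !any(mapply(function(s, r) s %in% forbidden[[r]], suits, rule))
}

# Deal card by card; on a dead end, throw the hand away and start over
shuffle_with_restart <- function(max_tries = 10000) {
  for (attempt in seq_len(max_tries)) {
    remaining <- deck
    hand <- character(length(deck))
    ok <- TRUE
    for (ii in seq_along(hand)) {
      rule <- ((ii - 1) %% 3) + 1
      allowed <- remaining[!(substr(remaining, 1, 1) %in% forbidden[[rule]])]
      if (length(allowed) == 0) { ok <- FALSE; break }
      card <- allowed[sample.int(length(allowed), 1)]
      hand[ii] <- card
      remaining <- setdiff(remaining, card)
    }
    if (ok) return(hand)
  }
  stop("no valid shuffle found within max_tries")
}

set.seed(1)
hand <- shuffle_with_restart()
is_valid_hand(hand)
```

One caveat: dealing sequentially and restarting on dead ends always returns a valid shuffle, but it does not by itself make every valid shuffle equally likely, so check what the book exercise expects on that point.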

Simulate 5000 samples of size 5 from a normal distribution with mean 5 and standard deviation 3

I am trying to simulate 5000 samples of size 5 from a normal distribution with mean 5 and standard deviation 3. I then want to compute the mean of each sample and make a histogram of the sample means.
My current code is not giving me an error but I don't think it's right:
nrSamples = 5000
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
  e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000,1)
for (i in 1:5000){
  sample_means[i] <- mean(e[[i]])
}
Any idea on how to tackle this? I am very very new to R!
You don't need a list in this case. It is a common mistake of new R users to use lists excessively.
observations <- matrix(rnorm(25000, mean=5, sd=3), 5000, 5)
means <- rowMeans(observations)
Now means is a vector of 5000 elements.
You can actually do this without for loops. replicate can be used to create the 5000 samples. Then use sapply to return the mean of each sample. Wrap the sapply call in hist() to get the histogram of means.
dat = replicate(5000, rnorm(5,5,3), simplify=FALSE)
hist(sapply(dat, mean))
Or, if you want to save the means:
sample.means = sapply(dat,mean)
I think your code is giving valid results. list(mode="vector",length=nrSamples) isn't doing what I think you intended (run it in the console and see what happens), but it works out because the first two list elements get overwritten in the loop.
Although there's no need to use loops here, just for illustration here are two modified versions of your code using loops:
# 1. Store random samples in a list
e <- vector("list", nrSamples)
for (i in 1:nrSamples) {
  e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
  sample_means[i] <- mean(e[[i]])
}
# 2. Store random samples in a matrix
e <- matrix(rep(NA, 5000*5), nrow=5)
for (i in 1:nrSamples) {
  e[,i] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means = rep(NA, nrSamples)
for (i in 1:nrSamples){
  sample_means[i] <- mean(e[, i])
}
Your code is fine (see below), but I would suggest you try the following:
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
Here, for each element of the sequence 1, 2, 3, ... nrSamples that I supply as the first argument, lapply executes a function with the given element of the sequence as argument (i.e. x). The function that I have supplied does not depend on x, however, so it is just replicated 5000 times, and the output is stored in a list (this is what lapply does). It is an easy way to avoid loops in situations like these. Needless to say, you could also just run
yourmeans <- sapply(1:nrSamples, function(x) mean(rnorm(n=5, mean = 5, sd = 3)))
Apart from the means, the latter does not store your results though, which may not be what you want. Also note that I call sapply to return a vector, which you can then use to plot your histogram, using e.g. hist(yourmeans).
To show that your code is fine, consider the following:
nrSamples = 5000
set.seed(1) # same seed before both versions so the random numbers match
e <- list(mode="vector",length=nrSamples)
for (i in 1:nrSamples) {
  e[[i]] <- rnorm(n = 5, mean = 5, sd = 3)
}
sample_means <- matrix(NA, 5000,1)
for (i in 1:5000){
  sample_means[i] <- mean(e[[i]])
}
set.seed(1)
yourlist <- lapply(1:nrSamples, function(x) rnorm(n=5, mean = 5, sd = 3 ))
yourmeans <- sapply(yourlist, mean)
all.equal(as.vector(sample_means), yourmeans)
[1] TRUE
Here, I set the seed to the random number generator to make sure that the random numbers are the same. As you see, your code works fine, though as others have pointed out, loops can easily be avoided.
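If you want a quick sanity check on the result itself, the mean of n independent draws with standard deviation sd has standard deviation sd/sqrt(n), so the sample means here should center near 5 with a spread near 3/sqrt(5), about 1.34. A minimal sketch using the matrix approach (the seed value is arbitrary):

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# 5000 samples of size 5, one sample per row
observations <- matrix(rnorm(5000 * 5, mean = 5, sd = 3), nrow = 5000, ncol = 5)
means <- rowMeans(observations)

# Empirical mean and spread of the sample means
c(mean = mean(means), sd = sd(means))

# Theoretical standard deviation of the mean of 5 draws
3 / sqrt(5)

# hist(means) would then draw the histogram of the sample means
```

The empirical values will not match exactly, but with 5000 samples they should land close to the theoretical ones.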

Why is the actual number of generation not as specified for genetic algorithms in R

I am working with the genalg library for R and am trying to save all the generations when I run a binary genetic algorithm. There does not seem to be a built-in method for that in the library, so my attempt was to save each chromosome, x, that passes through the evaluation function.
To test this method I inserted print(x) in the evaluation function so that I could see all the evaluated chromosomes. However, the number of printed chromosomes does not always match what I am expecting.
I thought that the number of printed chromosomes would be equal to the number of iterations times the population size, but that does not seem to be true all the time.
The problem is that I want to know which generation (or iteration) each chromosome belongs to, which I can't tell if the number of chromosomes differs from iter times popSize.
What is the reason for this, and how can I "fix" it? Or is there another way of saving each chromosome together with the iteration it belongs to?
Below is an example where I thought that the evaluation function would print 2x5 chromosomes, but it only prints 8.
dataset <- data.frame(
item = c("pocketknife", "beans", "potatoes", "unions", "sleeping bag", "rope", "compass"),
survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
weight = c(1, 5, 10, 1, 7, 5, 1))
weightlimit <- 20
evalFunc <- function(x) {
  current_solution_survivalpoints <- x %*% dataset$survivalpoints
  current_solution_weight <- x %*% dataset$weight
  if (current_solution_weight > weightlimit)
    return(0) else return(-current_solution_survivalpoints)
}
iter = 2
popSize = 5
GAmodel <- rbga.bin(size = 7, popSize = popSize, iters = iter, mutationChance = 0.1,elitism = T, evalFunc = evalFunc)
Looking at the function code, it seems that at each iteration (generation) a subset of chromosomes is chosen from the population (5 chromosomes in your example) with a certain probability (0.1 in your case) and mutated. The evaluation function is called only for the mutated chromosomes at each generation (and, of course, for all the chromosomes in the first iteration, to know their initial values).
Note that this subset does not include the elitist group, which in your example has size 1 (you have erroneously passed elitism=TRUE, and TRUE is implicitly converted to 1).
Anyway, to know the population at each generation, you can pass a monitor function through the monitorFun parameter e.g. :
# obj contains a lot of information; try printing it
monitor <- function(obj) {
  print(paste(" GENERATION :", obj$iter))
  print(obj$population) # the chromosomes of the current generation
}
iter = 2
popSize = 5
GAmodel <- rbga.bin(size = 7, popSize = popSize,
iters = iter, mutationChance = 0.1,
elitism = 1, evalFunc = evalFunc, monitorFunc = monitor)
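If you still want to record every chromosome that reaches the evaluation function (rather than printing it), a closure can collect them. This is only a sketch in base R: make_logged_eval and toy_eval are made-up names, and toy_eval is a stand-in for your real evalFunc.

```r
# Hypothetical helper: wraps an evaluation function so that every
# chromosome it is called with is stored in a list
make_logged_eval <- function(eval_fun) {
  seen <- list()
  list(
    eval = function(x) {
      seen[[length(seen) + 1]] <<- x  # append the chromosome to the log
      eval_fun(x)                     # then evaluate it as usual
    },
    get_log = function() seen
  )
}

# Stand-in fitness function, for illustration only
toy_eval <- function(x) -sum(x)

logged <- make_logged_eval(toy_eval)
logged$eval(c(1, 0, 1))
logged$eval(c(0, 0, 1))
length(logged$get_log())
```

In a real run you would pass logged$eval as evalFunc. As explained above, the number of logged chromosomes need not equal iters times popSize, so the monitor function remains the more direct way to tie a population to its generation.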

Using coefficient of variation in aggregate

I have a data frame with 50000 rows and 200 columns. There are duplicate rows in the data, and I want to aggregate them by keeping, among each set of duplicates, the row with the maximum coefficient of variation, using the aggregate function in R. With aggregate I can use "mean" or "sum" out of the box, but not the coefficient of variation.
For example
aggregate(data, as.columnname, FUN=mean)
Works fine.
I have a custom function for calculating coefficient of variation but not sure how to use it with aggregate.
co.var <- function(x) ( 100*sd(x)/mean(x) )
I have tried
aggregate(data, as.columnname, function (x) max (co.var (x, data[index (x),])
but it is giving an error as object x is not found.
Assuming that I understand your problem, I would suggest using tapply() instead of aggregate() (see ?tapply for more info). However, a minimal working example would be very helpful.
co.var <- function(x) ( 100*sd(x)/mean(x) )
## Data with multiple repeated measurements.
## There are three things (ID 1, 2, 3) that
## are measured two times, twice each (val1 and val2)
## Calculate coefficient of variation for each measurement set
## Use tapply() instead of aggregate
## The mySel vector is then the vector of rows that correspond to the
## maximum coefficient of variation for each ID
There are faster ways, one of which is below. However, with a 40000 by 100 dataset, the above code only took between 16 and 20 seconds on my machine.
# Create a big dataset
myDF <- data.frame(val1 = c(20, 10, 5, 25, 7, 2),
val2 = c(19, 9, 4, 24, 4, 1))
myDF <- myDF[sample(seq_len(nrow(myDF)), 40000, replace = TRUE), ]
myDF <- cbind(myDF, rep(myDF, 49))
myDF$ID <-, nrow(myDF), replace = TRUE)
# Define a new function to work (slightly) better with large datasets
co.var.df <- function(x) ( 100*apply(x,1,sd)/rowMeans(x) )
# Create two datasets to benchmark the two methods
# (A second method proved slower than the third, hence the naming)
myDF.firstMethod <- myDF
myDF.thirdMethod <- myDF
Time the original method
startTime <- Sys.time()
myDF.firstMethod$coVar <- apply(myDF.firstMethod[,
grep("val", names(myDF.firstMethod))], 1, co.var)
mySel <- tapply(seq_len(nrow(myDF.firstMethod)),
myDF.firstMethod$ID, function(x) {
curSub <- myDF.firstMethod[x, ]
return(x[which(curSub$coVar == max(curSub$coVar))])
}, simplify = FALSE)
endTime <- Sys.time()
R> endTime-startTime
Time difference of 17.87806 secs
Time the faster method (the one named "third" above)
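startTime3 <- Sys.time()
# Row-wise coefficient of variation, computed once for the whole data set
coVar3 <- co.var.df(myDF.thirdMethod[, grep("val", names(myDF.thirdMethod))])
mySel3 <- tapply(seq_along(coVar3),
  myDF[, "ID"], function(x) {
    return(x[which(coVar3[x] == max(coVar3[x]))])
  }, simplify = FALSE)
endTime3 <- Sys.time()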
R> endTime3-startTime3
Time difference of 2.024207 secs
And check to see that we get the same results:
R> all.equal(mySel,mySel3)
[1] TRUE
There is an additional change from the original post, in that the edited code considers that there may be more than one row with the highest CV for a given ID. Therefore, to get the results from the edited code, you must unlist the mySel or mySel3 objects:
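For reference, here is the whole idea on a tiny data set, as a minimal sketch. The toy data frame and the name sel are made up for illustration; the pattern is the tapply() approach described above, followed by unlist() to get the selected row numbers.

```r
co.var <- function(x) 100 * sd(x) / mean(x)

# Hypothetical toy data: three IDs, two rows each
toy <- data.frame(ID   = c(1, 1, 2, 2, 3, 3),
                  val1 = c(20, 10, 5, 25, 7, 2),
                  val2 = c(19, 9, 4, 24, 4, 1))

# Coefficient of variation of each row across the value columns
toy$coVar <- apply(toy[, c("val1", "val2")], 1, co.var)

# For each ID, keep the row number(s) with the largest coefficient of variation
sel <- tapply(seq_len(nrow(toy)), toy$ID, function(rows) {
  rows[toy$coVar[rows] == max(toy$coVar[rows])]
}, simplify = FALSE)

# unlist() turns the per-ID list into a plain vector of row numbers
toy[unlist(sel), ]
```

Here rows 2, 3 and 6 are selected, one per ID. If two rows of the same ID tie for the maximum, both are kept.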
