Calculating Hamming distance for two vectors in R?

I'm trying just to calculate the Hamming distance between two vectors in R. I'm currently attempting to use the "e1071" package, and the hamming.distance function, as follows:
library(e1071)
H <- hamming.distance(X)
Where X is a data.frame with 2 rows and (in my particular data) 667 columns, and every observation is 0 or 1.
Initially I got the error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
After some research, it appeared that one fix might be increasing the expressions option in R. This I did via options(expressions=5000), and then tried various values in place of the 5000. But this only produced the error:
Error: C stack usage is too close to the limit
I'm not much of a programmer, and the fixes for this most recent error appear to have to do with something inside the package e1071 possibly not being called correctly (or at the right time).
Any ideas on what I'm doing wrong? I eventually want the Hamming distances between a large number of vectors, and this was just a starting point. If this has to do with memory allocation, any suggestions for how to deal with it?

I don't know how hamming.distance works internally, but a simple way to calculate the distance for 2 vectors is just
sum(x1 != x2)
or, in this case,
sum(X[1,] != X[2,])
If the total number of vectors is not too large (up to, say, a few thousand), you could implement this in a nested loop:
n <- nrow(X)
m <- matrix(0, nrow=n, ncol=n)
for (i in seq_len(n - 1))
    for (j in seq(i + 1, n))
        m[j, i] <- m[i, j] <- sum(X[i,] != X[j,])
Caveat: untested.
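As a quick sanity check, here is the same loop run on a small made-up 0/1 matrix (toy data, not the 667-column data from the question):
X <- matrix(c(0, 1, 1, 0,
              1, 1, 0, 0,
              0, 1, 1, 1), nrow = 3, byrow = TRUE)
n <- nrow(X)
m <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n - 1))
    for (j in seq(i + 1, n))
        m[j, i] <- m[i, j] <- sum(X[i,] != X[j,])
m
# row 1 vs row 2 differ in 2 positions, row 1 vs row 3 in 1, row 2 vs row 3 in 3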

WARNING ABOUT USING HAMMING.DISTANCE FROM PACKAGE e1071!
This package's implementation coerces the objects being compared to booleans with as.logical, so values of 0 become FALSE and any non-zero values become TRUE. As a result, for the sequence 0 1 2 compared to 0 1 1, the Hamming distance is reported as 0 instead of the correct value of 1, because the package treats 1 and 2 as equal since as.logical(1) == as.logical(2).
Here is the faulty (in my view) implementation:
> library("e1071", lib.loc="C:/Program Files/R/R-2.15.3/library")
Loading required package: class
> hamming.distance
function (x, y)
{
z <- NULL
if (is.vector(x) && is.vector(y)) {
z <- sum(as.logical(x) != as.logical(y))
}
else {
z <- matrix(0, nrow = nrow(x), ncol = nrow(x))
for (k in 1:(nrow(x) - 1)) {
for (l in (k + 1):nrow(x)) {
z[k, l] <- hamming.distance(x[k, ], x[l, ])
z[l, k] <- z[k, l]
}
}
dimnames(z) <- list(dimnames(x)[[1]], dimnames(x)[[1]])
}
z
}
<environment: namespace:e1071>
My recommendation: DO NOT USE. Hamming distance is trivial to implement as noted several times above.

hamming.distance takes two vectors or a matrix, but not a data frame, so what you want is probably either
m = as.matrix(X)
hamming.distance(m[1,], m[2,])
or
hamming.distance(as.matrix(X))
but, as was pointed out, in your particular case this is the same as
sum(m[1,] != m[2,])
(In general, avoid data.frames if what you have is not a heterogeneous structure, since they are much, much slower than matrices.)

As an addition to all that was mentioned above: Although the Hamming distance is trivial to implement as an ordinary nested loop, in terms of execution time things can quickly get out of hand for larger matrices. In R, it is far more efficient to instead use matrix multiplication for computing the Hamming distance between all columns of large matrices. This is extremely fast compared to an R-level nested loop. An example implementation can be found here.
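Since the linked implementation may not be accessible, here is a minimal sketch of the matrix-multiplication idea for a 0/1 matrix with one vector per column (an illustration of the trick, not the linked code; names and data are made up):
hamming_binary <- function(X) {
    # counts positions where one column has 1 and the other has 0, in both directions
    t(X) %*% (1 - X) + t(1 - X) %*% X
}
set.seed(1)
X <- matrix(rbinom(667 * 100, 1, 0.5), nrow = 667)  # 100 random 0/1 vectors of length 667
D <- hamming_binary(X)
D[1, 2] == sum(X[, 1] != X[, 2])  # TRUE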

sum(xor(x[1,],x[2,]))
I don't know the relative efficiency of 'xor' to '!='
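One way to check, assuming the microbenchmark package is installed (toy data for illustration):
library(microbenchmark)
x1 <- rbinom(1e5, 1, 0.5)
x2 <- rbinom(1e5, 1, 0.5)
microbenchmark(sum(xor(x1, x2)), sum(x1 != x2))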

Just adding to @HongOoi's answer: I want to point out that in R, != and == return NA when one of the values is missing, which can give misleading results:
> c(1, NA) == 1:2
[1] TRUE NA
However, %in% returns FALSE for the comparison 1 %in% NA. Because of that, if you want to count missing values as "different" when comparing vectors, you have to use sum(!((x != y) %in% FALSE)):
> x <- c(1, 8, 5, NA, 5)
> y <- 1:5
> sum(!((x != y) %in% FALSE))
[1] 3
Notice also that x and y could have different lengths; indexing both up to the longer length introduces missing values at the end of the shorter vector, so you can do one of two things: truncate the longer vector, or treat positions absent from the shorter vector as "different". This can be turned into a standalone function with a familiar R-style argument:
hamming <- function(x, y, na.rm = TRUE) {
    size <- seq_len(max(length(x), length(y)))
    x <- x[size]
    y <- y[size]
    if (na.rm) {
        del <- is.na(x) | is.na(y)
        x <- x[!del]
        y <- y[!del]
    }
    sum(!((x != y) %in% FALSE))
}
This function lets you choose whether to count missing values as "different" (na.rm = FALSE) or to ignore them. With na.rm = TRUE, positions containing NA (including the padded tail of the shorter vector) are dropped, so if the vectors differ in length the longer one is effectively truncated.
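For example, with the vectors used above:
x <- c(1, 8, 5, NA, 5)
y <- 1:5
hamming(x, y)                 # 2: the NA position is ignored
hamming(x, y, na.rm = FALSE)  # 3: the NA position counts as different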

Related

Find the first minimum value in R?

How can I find the first minimum value in a vector? E.g., in y I'd like to return 3 because it is the first time the values on either side of y[3] are greater than y[3]. I wrote a function to do this but wondering if there is an easier way? I also need to account for the case when the first value is less than the second value. E.g., in z.
y <- c(2448,2442,2438,2440,2438,2444,2431,2433,2434)
plot(y)
getFirstMin <- function(x){
    if (x[1] < x[2]) res <- 1
    else res <- min(which(diff(x) > 0))
    return(res)
}
getFirstMin(y)
z <- c(2408,2442,2438,2440,2438,2444,2431,2433,2434)
plot(z)
getFirstMin(z)
We can use which() together with first(), lag() and lead() from dplyr:
getFirstMin <- function(x) {
    dplyr::first(which(dplyr::lag(x, default = Inf) > x &
                       dplyr::lead(x, default = Inf) > x))
}
Easier than @onyambu's approach: a local minimum simply requires that the first-order difference become positive at some point, so with a nicely behaved sequence:
y <- c(2448,2442,2438,2440,2438,2444,2431,2433,2434)
which(diff(y)>0)[1]
gives 3. We could, of course, build on this to handle more complicated cases, but your particular example demands nothing else.
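For example, the same one-liner also handles the z case, where the first value is already the minimum:
z <- c(2408,2442,2438,2440,2438,2444,2431,2433,2434)
which(diff(z) > 0)[1]
# [1] 1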

R - Number of cases in 4-dimensional pairwise array

I'm making a "pairwise" array in R. Given the vector combo, I'm finding every permutation of 4 elements. Thus, a 4-dimensional "pairwise" array. My current approach is making it as a simple list, using nested sapply functions, like so:
fourList <- sapply(X = combo, FUN = function(h) {
    hi <- which(combo == h) # get index of h
    sapply(X = combo[hi:n], FUN = function(i) {
        ii <- which(combo == i) # get index of i
        sapply(X = combo[ii:n], FUN = function(j) {
            ji <- which(combo == j) # get index of j
            sapply(X = combo[ji:n], FUN = function(k) {
                list(c(h, i, j, k))
            })
        })
    })
})
I'd like to make some sort of progress indicator, so I can report to the user what percentage of the array has been built. Ideally, I'd just take numberCasesCompleted and divide that by totalCases = length(combo)^4 to get the fraction that is done. However, I can't seem to figure out an algorithm that takes in hi, ji, and ii, and outputs the value numberCasesCompleted. How can I calculate this?
In the 2D (x by y) case (e.g. sapply(X, function(x) { sapply(X[xi:n], function(y) { list(c(x, y)) }) })), this could be calculated by sum(n - (x-2:x), y-(x-1)), but generalizing that to 4 dimensions sounds rather difficult.
I'm stupid. Just add the proportion complete of the first level to the proportion complete of the second level (scaled down to a single iteration at the first level), and so forth.
In my case: completion <- hi/(n+1) + (ii/(n+1))*(1/n) + (ji/n)*(1/n)*(1/n)
(The n+1 denominators are there because there's effectively another loop after hi is equal to n, as ii still has a full set of iterations to complete. Otherwise it would end at ~101%. But for a rough/quick estimation of progress, this is fine.)
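A rough sketch of how that estimate could be reported from inside the loops (hi, ii, ji and n as defined in the question; the variable name and message are just illustrative):
completion <- hi/(n + 1) + (ii/(n + 1)) * (1/n) + (ji/n) * (1/n) * (1/n)
cat(sprintf("\rapprox. %.0f%% complete", 100 * completion))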
However, it is worth noting that (according to #Gregor in the comments) there are much better ways of making combinations in R, so my original use case may be moot (just don't use nested sapply in the first place).

add exact proportion of random missing values to data.frame

I would like to add random NA to a data.frame in R. So far I've looked into these questions:
R: Randomly insert NAs into dataframe proportionaly
How do I add random NAs into a data frame
add random missing values to a complete data frame (in R)
Many solutions were provided here, but I couldn't find one that comply with these 5 conditions:
Add really random NA, and not the same amount by row or by column
Work with every class of variable that one can encounter in a data.frame (numeric, character, factor, logical, ts..), so the output must have the same format as the input data.frame or matrix.
Guarantee an exact number or proportion [note] of NA in the output (many solutions result in a smaller number of NA since several are generated at the same place)
Is computationally efficient for big datasets.
Add the proportion/number of NA independently of already present NA in the input.
Anyone has an idea?
I have already tried to write a function to do this (in an answer of the first link) but it doesn't comply with points N°3&4.
Thanks.
[note] the exact proportion, rounded at +/- 1NA of course.
This is the way that I do it for my paper on library(imputeMulti), which is currently in review at JSS. This inserts NA's into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact number because of the case where n * p * pctNA %% 1 != 0.
createNAs <- function(x, pctNA = 0.1) {
    n <- nrow(x)
    p <- ncol(x)
    NAloc <- rep(FALSE, n * p)
    NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
    x[matrix(NAloc, nrow = n, ncol = p)] <- NA
    return(x)
}
Obviously you should use a random seed for reproducibility, which can be specified before the function call.
This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.
Edit: I do assume that x is complete. So, I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p)
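A quick illustration on a complete built-in data set (iris is just a convenient example, not the questioner's data):
set.seed(123)
iris_NA <- createNAs(iris, pctNA = 0.1)
mean(is.na(iris_NA))
# [1] 0.1   (75 of 750 cells)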
Some users reported that Alex's answer did not address condition N°5 of my question. Indeed, when adding random NA on a dataframe that already contains missing values, the new ones will sometimes fall on the initial ones, and the final proportion will be somewhere between initial proportion and desired proportion... So I expand on Alex's function to comply with all 5 conditions:
I modify his createNAs function so that it enables one of 3 options:
option complement: complement with NA up to the desired %
option add : add % of NA in addition to those already present
option none : add a % of NA regardless of those already present
For the complement and add options, the function works recursively until the desired proportion of NA is reached:
createNAs <- function (x, pctNA = 0.0, option = "add"){
prop.NA = function(x) sum(is.na(x))/prod(dim(x))
initial.pctNA = prop.NA(x)
if ( (option =="complement") & (initial.pctNA > pctNA) ){
message("The data already had more NA than the target percentage. Returning original data")
return(x)
}
if ( (option == "none") || (initial.pctNA == 0) ){
n <- nrow(x)
p <- ncol(x)
NAloc <- rep(FALSE, n * p)
NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
x[matrix(NAloc, nrow = n, ncol = p)] <- NA
return(x)
} else { # if another option than none:
target = ifelse(option=="complement", pctNA, pctNA + initial.pctNA)
while (prop.NA(x) < target) {
prop.remaining.to.add = target - prop.NA(x)
x = createNAs(x, prop.remaining.to.add, option = "none")
}
return(x)
}
}
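A rough check of the three options, using a made-up numeric data frame sized so the proportions come out exactly (prop.NA below is just a small helper for the check, and the data are purely illustrative):
prop.NA <- function(x) sum(is.na(x)) / prod(dim(x))
set.seed(1)
df <- data.frame(matrix(rnorm(1024), ncol = 8))       # 128 x 8 = 1024 cells, no NA yet
df <- createNAs(df, 0.125, option = "none")            # start from exactly 12.5% NA
prop.NA(createNAs(df, 0.25, option = "complement"))    # 0.25
prop.NA(createNAs(df, 0.25, option = "add"))           # 0.375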

Easiest way in R to get vector of frequencies of elements in vector

Suppose I have a vector of values v. What is the easiest way to get a vector f of length equal to v, where the ith element of f is the frequency of the ith element of v in v?
The only way I know to do it seems unnecessarily complicated:
v = sample(1:10,100,replace=TRUE)
D = data.frame( idx=1:length(v), v=v )
E = merge( D, data.frame(table(v)) )
E = E[ with(E,order(idx)), ]
f = E$Freq
Surely there's a simpler way to do this, along the lines of "frequencies(v)"?
For a vector of small positive integers v, as in the question, the expression
tabulate(v)[v]
is particularly simple as well as speedy.
For more general numerical vectors v you can persuade ecdf to help you out, as in
w <- sapply(v, ecdf(v)) * length(v)
tabulate(w)[w]
It's probably better to code the underlying algorithm yourself, though; that also avoids the floating-point rounding error implicit in the preceding solution:
frequencies <- function(x) {
    i <- order(x)
    v <- x[i]
    w <- cumsum(c(TRUE, v[-1] != v[-length(x)]))
    f <- tabulate(w)[w]
    return(f[order(i)])
}
This algorithm sorts the data, assigns sequential identifiers 1, 2, 3, ... to the values as it encounters them (by summing a binary indicator of when the values change), uses the preceding tabulate()[] trick to obtain the frequencies efficiently, and then unsorts the results to make the output match the input, component by component.
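A quick consistency check against the simpler tabulate() approach for small positive integers (toy data):
set.seed(1)
v <- sample(1:10, 100, replace = TRUE)
all(frequencies(v) == tabulate(v)[v])
# [1] TRUE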
I think the best solution here is:
ave(v,v,FUN=length)
This works because ave() is designed to replicate and map the return value of FUN() back to every index of the input vector whose element belonged to the group for which that particular invocation of FUN() was performed.
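A small illustration (note that ave() returns a numeric vector):
v <- c(3, 1, 3, 2, 3)
ave(v, v, FUN = length)
# [1] 3 1 3 1 3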
Something like this works for me:
sapply(v, function(elmt, vec) sum(vec == elmt), vec=v)
I would suggest you use table and as.vector:
as.vector(table(dataInVector))

Logical comparison of two vectors with binary (0/1) result

For an assignment I had to create a random vector theta, a vector p containing for each element of theta the associated probability, and another random vector u. No problems thus far, but I'm stuck with the next instruction which I report below:
Generate a vector r1 that has a 1 in position i if p_i ≥ u_i and 0 if p_i < u_i. The
vector r1 is a Rasch item given the latent variable theta.
theta=rnorm(1000,0,1)
p=(exp(theta-1))/(1+exp(theta-1))
u=runif(1000,0,1)
I tried the following code, but it doesn't work.
r1 <- for (i in 1:1000) {
    if (p[i] < u[i]) {
        return("0")
    } else {
        return("1")
    }
}
You can use the ifelse function:
r1 <- ifelse(p >= u, 1, 0)
Or you can simply convert the logical comparison into a numeric vector, which turns TRUE into 1 and FALSE into 0:
r1 <- as.numeric(p >= u)
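A quick check on a small toy version of the same setup, confirming both give the same 0/1 vector:
set.seed(1)
theta <- rnorm(10)
p <- exp(theta - 1) / (1 + exp(theta - 1))
u <- runif(10)
identical(ifelse(p >= u, 1, 0), as.numeric(p >= u))
# [1] TRUE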
@DavidRobinson gave a nice working solution, but let's look at why your attempt didn't work:
r1 <- for (i in 1:1000) {
    if (p[i] < u[i]) {
        return("0")
    } else {
        return("1")
    }
}
We've got a few problems, the biggest of which is that you're confusing for loops with functions, both by assigning the loop's result and by using return(). return() is used when you are writing your own function, e.g. f <- function(...) { ... }. Inside a for loop it isn't needed. A for loop just runs the code inside it a certain number of times; it can't return something the way a function can.
You do need a way to store your results. This is best done by pre-allocating a results vector, and then filling it inside the for loop.
r1 <- rep(NA, length(p)) # create a vector as long as p
for (i in 1:1000) {
    if (p[i] < u[i]) { # compare the ith element of p and u
        r1[i] <- 0     # put the answer in the ith element of r1
    } else {
        r1[i] <- 1
    }
}
We could simplify this a bit. Rather than bothering with the if and the else, you could start r1 as all 0's, and then only change an element to 1 if p[i] >= u[i]. Just to be safe, I think it's better to make the for statement something like for (i in 1:length(p)), or better yet for (i in seq_along(p)); but the beauty of R is how few for loops are necessary, and @DavidRobinson's vectorized suggestions are far cleaner.
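A sketch of that "start with zeros" variant:
r1 <- rep(0, length(p))
for (i in seq_along(p)) {
    if (p[i] >= u[i]) r1[i] <- 1
}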
