I have a data frame df with four columns. For each pair of rows, I would like to count the number of columns in which the two rows differ.
I tried a for loop and it gives the correct result. However, it takes a very long time to run. Please see my code below:
dist_mat <- matrix(0, nrow(df), nrow(df))
for(i in 1:nrow(df))
{
for(j in 1:nrow(df))
{
dist_mat[i,j] <- sum(df[,1:4][i,]!=df[,1:4][j,])
}
}
I suspect there is a faster way of doing this. Any suggestions are appreciated.
P.S. The data is numeric.
Given that the matrix is symmetric, and the diagonal will be zero, you don't need to loop twice over each row so you can cut the looping down by over half:
for(i in 1:(nrow(df)-1))
{
for(j in (i+1):nrow(df))
{
dist_mat[i,j] <- sum(df[i,1:4]!=df[j,1:4])
}
}
# mirror the upper triangle into the lower one; transposing keeps each (i,j) paired with (j,i)
dist_mat[lower.tri(dist_mat)] <- t(dist_mat)[lower.tri(dist_mat)]
This is a job for combn:
DF <- data.frame(x=rep(1,6), y=rep(1:2,3))
combn(seq_len(nrow(DF)), 2, FUN=function(ind, df) {
c(ind[1], ind[2], sum(df[ind[1],]!=df[ind[2],]))
}, df=as.matrix(DF))
Note that I convert the data.frame into a matrix, since matrix subsetting is faster than data.frame subsetting. Depending on your data types this could become a problem.
If your distance measure wasn't so unusual, dist would be helpful (and fast).
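If the full n-by-n result fits in memory, the count can also be built one column at a time with outer. The following is a minimal sketch rather than anything taken from the answers above; the example data and the name m are made up for illustration.

```r
# Made-up example data: 6 rows, 4 numeric columns
set.seed(1)
df <- data.frame(matrix(sample(1:3, 24, replace = TRUE), ncol = 4))

m <- as.matrix(df[, 1:4])   # matrix subsetting is cheaper than data.frame subsetting
n <- nrow(m)
dist_mat <- matrix(0L, n, n)

# For each column, outer() marks every pair of rows whose values differ;
# adding up these indicator matrices gives the number of unequal columns per pair.
for (k in seq_len(ncol(m))) {
  dist_mat <- dist_mat + outer(m[, k], m[, k], "!=")
}
```

The loop now runs over the four columns instead of over all pairs of rows. Each step allocates an n-by-n matrix, so with very many rows memory becomes the limit.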
I'm trying to calculate the difference between all points in a vector of length 10605 in R. For example, I am trying to do this:
for (i in 1:10605){
for (j in 1:10605){
differences[i] = housedata$Mean_household_income[i] - housedata$Mean_household_income[j]
}
}
It is taking so long to compute, and I'm thinking there's a more timely way to calculate the difference between all the points with each other in this vector. Does anyone have any suggestions?
Thanks!
Seems like the dist function should do that. Distance matrices are only lower triangular because distance(x,y) == distance(y,x):
# dist() takes the data as its first argument; the second argument is the distance method
my.distances <- dist(housedata$Mean_household_income)
It's going to be faster since it's done in C code. To look at the function, just type:
dist
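A small sketch of what dist returns for a single numeric vector; the vector x is made up. Note that the result holds absolute differences, so the sign of x[i] - x[j] is lost.

```r
x <- c(10, 13, 7)       # made-up values standing in for the income column
d <- dist(x)            # stores the lower triangle only
full <- as.matrix(d)    # expand to the full symmetric matrix
full[2, 1]              # |13 - 10|
```

If the signed differences are needed, dist is not the right tool.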
You could loop through an incrementally shifted/wrapped copy of the vector and subtract the two vectors. You still have to loop through the length of the data once and shift and subtract the vector each time, but it will probably save some time.
Here is an example:
# make a shift/wrap function
shift <- function(df,offset){
df[((1:length(df))-1-offset)%%length(df)+1]
}
# make some data
data <- seq(1,4)
# make an empty vector to hold the data
difs = vector()
# loop through the data
for(i in 1:length(data)){
shifted <- shift(data,i)
result <- data - shifted
difs <- c(difs, result)
}
print(difs)
What about using outer? It uses a vectorized function (here -) on all combinations of two vectors and stores the results in a matrix.
For example,
x <- runif(10605)
system.time(
differences <- outer(x, x, '-')
)
takes one second on my computer.
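To make the layout of the result concrete, here is a tiny sketch with made-up numbers: element [i, j] of the matrix is x[i] - x[j], so signs are preserved.

```r
x <- c(5, 2, 9)                    # made-up values
differences <- outer(x, x, "-")
differences[1, 2]                  # x[1] - x[2]
differences[2, 1]                  # x[2] - x[1], the same size with the opposite sign
```

For a vector of length 10605 the result has about 112 million entries, which is roughly 900 MB when stored as doubles.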
I need to iteratively evaluate the variance of a dataset while I reduce the data.frame row by row in each step. As an example:
data <- matrix(runif(100),10,10)
perc <- list("vector")
sums <- sum(data)
for (i in 1:nrow(data)) {
data <- data[-1,]
perc[[i]] <- sum(data)/sums # in reality, here are ~8 additional lines of code
}
I don't like that data is re-initialized in every step, and that the loop breaks with an error when data is emptied.
So the questions are:
1. How to express data <- data[-1,] in an incrementing way (something like tmp <- data[-c(1:i),], which doesn't work)?
2. Is there a way to stop the loop, before the last row is removed from data?
You could try
set.seed(123)
data <- matrix(runif(100),10,10)
sums <- sum(data)
perc <- lapply(2:nrow(data),function(x) sum(data[x:nrow(data),])/sums)
The above code yields the same result as your original code, but without an error message and without modifying data.
perc1 <- list()
for (i in 1:nrow(data)) {
data <- data[-1,]
perc1[[i]] <- sum(data)/sums
}
identical(perc,perc1)
#[1] TRUE
If you wish to preserve the for loop in order to perform other calculations within the loop, you could try:
for (i in 2:nrow(data)) {
perc[[i-1]] <- sum(data[i:nrow(data),])/sums
# do more stuff here
}
identical(perc,perc1)
#[1] TRUE
If you are using the loop index i for other calculations within the loop, you will most probably need to replace it with i-1. It may depend on what is calculated.
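A sketch of the incrementing form asked about in question 1, assuming the loop should stop while at least one row is left. Here drop = FALSE keeps tmp a matrix even when a single row remains, and i stays equal to the number of removed rows, so no i-1 adjustment is needed. The name tmp follows the question; everything else is illustrative.

```r
set.seed(123)
data <- matrix(runif(100), 10, 10)
sums <- sum(data)
perc <- vector("list", nrow(data) - 1)

# stop one step early so that the last row is never removed
for (i in seq_len(nrow(data) - 1)) {
  tmp <- data[-seq_len(i), , drop = FALSE]   # data without its first i rows
  perc[[i]] <- sum(tmp) / sums
  # further per-step calculations on tmp would go here
}
```

Because data itself is never overwritten, the loop can be rerun without rebuilding the matrix.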
You can use lapply
res <- lapply(2:nrow(data), function(i)sum(data[i:nrow(data),])/sums)
You can write the loop part like this:
for (i in 2:nrow(data)) {
perc[[i - 1]] <- sum(data[i:nrow(data),])/sums # in reality, here are ~8 additional lines of code
}
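If the per-step quantity really is just a share of the total, the loop can be avoided altogether with rowSums and cumsum. This is a sketch under that assumption only (it does not cover the additional lines mentioned in the question); row_totals, removed and perc_vec are illustrative names.

```r
set.seed(123)
data <- matrix(runif(100), 10, 10)
sums <- sum(data)

row_totals <- rowSums(data)
# total removed after dropping the first 1, 2, ..., nrow - 1 rows
removed <- cumsum(row_totals)[-nrow(data)]
perc_vec <- (sums - removed) / sums   # share of the total that is still left
```

The result is a plain numeric vector rather than a list; as.list(perc_vec) converts it if a list is required.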
I have found a way to do this using reshape2, but it is quite slow and doesn't give me exactly what I want. I have a data.frame that looks like this:
df<-data.frame(expand.grid(1:10,1:10))
colnames(df) <- c("x","y")
for(i in 3:10){
df[i] <- runif(100,10,100)
}
I run:
require(reshape2)
matrices<-lapply(colnames(df)[-c(1:2)],function(x){
mat<-acast(df, y~x, value.var=x, fill= 0,fun.aggregate = mean)
return(mat)
})
This gives me a list of matrices, one for each value vector in my data, which I can transform into an array of 1:10,1:10,1:10 dimension. However, I am looking for a faster way to do this: my datasets can contain many value columns, so this process can take a long time, and I can't seem to find a more efficient way of doing it.
Thanks for any help.
If your data.frame is stored regularly as you say, you could accomplish this in a for loop, which may actually be faster than casting:
# preallocate array
myArray <- array(0, dim=c(10,10,10))
# loop through:
for(i in 1:10) {
myArray[,,i] <- as.matrix(df[df$y==i,])
}
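If the rows are in the order that expand.grid produces (x varying fastest within each y) and every x/y pair occurs exactly once, the value columns can also be reshaped without any loop. This is a minimal sketch under those assumptions; vals, val_array and val_array_yx are illustrative names.

```r
df <- data.frame(expand.grid(1:10, 1:10))
colnames(df) <- c("x", "y")
for (i in 3:10) {
  df[i] <- runif(100, 10, 100)
}

vals <- as.matrix(df[, -(1:2)])                         # 100 rows, one column per value variable
val_array <- array(vals, dim = c(10, 10, ncol(vals)))   # indexed as [x, y, variable]

# acast(df, y ~ x, ...) puts y in rows and x in columns; swap the first two dims to match
val_array_yx <- aperm(val_array, c(2, 1, 3))
```

If the row order is not guaranteed, sorting first with df[order(df$y, df$x), ] restores the layout this sketch relies on.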
I'm sorry for repeating a question about the *apply functions, but I cannot get my code to work with the material that I found so far. I have a matrix (stored in a large data frame) and I want to shift the rows of this matrix by a certain amount (to the left). The amount by which I want to shift is different for each row and is stored in another column of the same data frame. The following code should illustrate what I am aiming for
mat <- matrix(rnorm(15),ncol=5,nrow=3);
sv <- c(1,4,2);
mat;
shift <- function(x,shift){c(x[(1+max(0,shift)):length(x)],rep(0,max(0,shift)))}
for(i in 1:nrow(mat)){mat[i,] <- shift(mat[i,],sv[i])}
mat;
But this runs incredibly slowly on my 300000x201 matrix, so how could I vectorize this (using some of the *apply commands)?
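Working on larger chunks will speed things up:
n.col <- ncol(mat)
# a shift of 0 leaves the row unchanged; every shift is assumed to be smaller than n.col
for(i in unique(sv[sv > 0])){
selection <- which(sv == i)
# move columns (i+1)..n.col to the left by i positions
mat[selection, 1:(n.col - i)] <- mat[selection, (i + 1):n.col]
# fill the i vacated columns on the right with zeros
mat[selection, (n.col - i + 1):n.col] <- 0
}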
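A fully vectorized variant is possible with matrix indexing. This is a sketch, not part of the answer above: src, valid and shifted are illustrative names, and the index matrices are as large as mat itself, so for a 300000x201 matrix memory use should be checked first.

```r
set.seed(42)
mat <- matrix(rnorm(15), ncol = 5, nrow = 3)
sv <- c(1, 4, 2)

n.col <- ncol(mat)
src <- outer(sv, seq_len(n.col), "+")   # src[r, c]: column of row r that moves to column c
valid <- src <= n.col                   # targets without a source column stay 0

shifted <- matrix(0, nrow(mat), n.col)
# a two-column index matrix picks one (row, column) cell of mat per valid target
shifted[valid] <- mat[cbind(row(mat)[valid], src[valid])]
```

The result is written to a new matrix, so mat is left unchanged.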
I'm trying to create a data.frame that takes different values depending on the value of a reference data.frame. I only know how to do this with a "for loop", but have been advised to avoid for loops in R... and my actual data have ~500,000 rows x ~200 columns.
a <- as.data.frame(matrix(rbinom(10,1,0.5),5,2,dimnames=list(c(1:5),c("a","b"))))
b <- data.frame(v1=c(2,10,12,5,11,3,4,14,2,13),v2=c("a","b","b","a","b","a","a","b","a","b"))
c <- as.data.frame(matrix(0,5,2))
for (i in 1:5){
for(j in 1:2){
if(a[i,j]==1){
c[i,j] <- mean(b$v1[b$v2==colnames(a)[j]])
} else {
c[i,j]= mean(b$v1)
}}}
c
I create data.frame "c" based on the value in each cell, and the corresponding column name, of data.frame "a".
Is there another way to do this? Indexing? Using data.table? Maybe apply functions?
Any and all help is greatly appreciated!
(a == 0) * mean(b$v1) + t(t(a) * c(tapply(b$v1, b$v2, mean)))
Run in pieces to understand what's happening. Also, note that this assumes ordered names in a (and 0's and 1's as entries in it, as per OP).
An alternative to a bunch of t's as above is using mapply (this assumes a is a data.frame or data.table and not a matrix, while the above doesn't care):
(a == 0) * mean(b$v1) + mapply(`*`, a, tapply(b$v1, b$v2, mean))
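If the column order of a cannot be relied on, the group means can be looked up by column name instead. The following sketch uses made-up 0/1 data with the columns deliberately out of alphabetical order; m, overall, group and res are illustrative names.

```r
a <- data.frame(b = c(1, 0, 1, 0, 0), a = c(0, 0, 1, 1, 0))   # made-up 0/1 entries
b <- data.frame(v1 = c(2, 10, 12, 5, 11, 3, 4, 14, 2, 13),
                v2 = c("a", "b", "b", "a", "b", "a", "a", "b", "a", "b"))

m <- as.matrix(a)
overall <- mean(b$v1)
# one group mean per column of a, matched by name rather than by position
group <- sapply(colnames(m), function(nm) mean(b$v1[b$v2 == nm]))

# col(m) gives the column index of every cell, so each 1 picks its own column's mean
res <- ifelse(m == 1, group[as.vector(col(m))], overall)
```

The result is a matrix; wrap it in as.data.frame if a data.frame is needed.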
#subsetting a matrix is faster
res <- as.matrix(a)
#calculate fill-in values outside the loop
in1 <- mean(b$v1)
in2 <- sapply(colnames(a),function(i) mean(b$v1[b$v2==i]))
#loop over columns and use a vectorized approach
for (i in seq_len(ncol(res))) {
res[,i] <- ifelse(res[,i]==0, in1, in2[i])
}