I'm sorry for repeating a question about the *apply functions, but I cannot get my code to work with the material that I have found so far. I have a matrix (stored in a large data frame) and I want to shift the rows of this matrix to the left by a certain amount. The amount by which I want to shift is different for each row and is stored in another column of the same data frame. The following code illustrates what I am aiming for:
mat <- matrix(rnorm(15), ncol = 5, nrow = 3)
sv <- c(1, 4, 2)
mat
shift <- function(x, shift){ c(x[(1 + max(0, shift)):length(x)], rep(0, max(0, shift))) }
for(i in 1:nrow(mat)){ mat[i, ] <- shift(mat[i, ], sv[i]) }
mat
But this runs incredibly slowly on my 300000x201 matrix, so how could I vectorize it (using some of the *apply commands)?
Working on larger chunks (all rows with the same shift value at once) will speed things up:
n.col <- ncol(mat)
for(i in unique(sv)){
  if(i == 0) next   # rows with no shift can be skipped
  selection <- which(sv == i)
  mat[selection, 1:(n.col - i)] <- mat[selection, (i + 1):n.col]
  mat[selection, (n.col - i + 1):n.col] <- 0
}
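If you want to avoid even the loop over the unique shift values, another option is to build a single index matrix and subset the matrix once. This is only a sketch; it assumes every value in sv is non-negative and smaller than ncol(mat), and that mat itself contains no NAs:
# for each cell of the result, the column to read from: new column j takes old column j + shift
idx <- outer(sv, seq_len(ncol(mat)), `+`)
idx[idx > ncol(mat)] <- NA                       # positions shifted past the right edge
shifted <- matrix(mat[cbind(rep(seq_len(nrow(mat)), ncol(mat)), as.vector(idx))],
                  nrow = nrow(mat))
shifted[is.na(shifted)] <- 0                     # zero-fill the vacated positions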
I'm trying to calculate the difference between all points in a vector of length 10605 in R. For example, I am trying to do this:
for (i in 1:10605){
  for (j in 1:10605){
    differences[i] = housedata$Mean_household_income[i] - housedata$Mean_household_income[j]
  }
}
It is taking so long to compute, and I'm thinking there's a more timely way to calculate the difference between all the points with each other in this vector. Does anyone have any suggestions?
Thanks!
Seems like the dist function should do that. Distance matrices are only lower triangular because distance(x,y) == distance(y,x):
my.distances <- dist(housedata$Mean_household_income)
It's going to be faster since it's done in C code. Just type:
dist
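For illustration, here is what that looks like on a small made-up vector. dist returns a compact lower-triangular object, and as.matrix expands it to a full matrix of absolute differences if that is easier to work with (note these are absolute, not signed, differences):
income <- c(52000, 61000, 47500, 58000)   # toy stand-in for Mean_household_income
d <- dist(income)                          # |x_i - x_j| for every pair, stored once
as.matrix(d)                               # full symmetric matrix, if needed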
You could loop through an incrementally shifted/wrapped copy of the vector and subtract the two vectors. You still have to loop through the length of the data once and shift and subtract the vector each time, but it will probably save some time.
Here is an example:
# make a shift/wrap function
shift <- function(df, offset){
  df[((1:length(df)) - 1 - offset) %% length(df) + 1]
}
# make some data
data <- seq(1, 4)
# make an empty vector to hold the data
difs <- vector()
# loop through the data
for(i in 1:length(data)){
  shifted <- shift(data, i)
  result <- data - shifted
  difs <- c(difs, result)
}
print(difs)
What about using outer? It uses a vectorized function (here -) on all combinations of two vectors and stores the results in a matrix.
For example,
x <- runif(10605)
system.time(
differences <- outer(x, x, '-')
)
takes one second on my computer.
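To see what outer produces here, a tiny made-up example; entry [i, j] of the result is x[i] - x[j]. Keep in mind that for 10605 values the full result is a 10605 x 10605 double matrix, which needs roughly 0.9 GB of RAM.
v <- c(1, 4, 9)
outer(v, v, `-`)
#      [,1] [,2] [,3]
# [1,]    0   -3   -8
# [2,]    3    0   -5
# [3,]    8    5    0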
I'm working with data gathered from multi-channel electrode systems, and am trying to make this run faster than it currently is, but I can't find any good way of doing it without loops.
The gist of it is: I have modified averages for each column (each column is a channel), and I need to compare each value in a column to the average for that column. If the value is above the adjusted mean, then I need to put that value in another data frame so it can be read easily.
Here is some sample code for the problematic bit:
readout <- data.frame(dimnmames <- c("Values"))
#need to clear the dataframe in order to run it multiple times without errors
#timeFrame is just a subsection of the original data, 60 channels with upwards of a few million rows
readout <- readout[0,]
for (i in 1:ncol(timeFrame)){
  for (g in 1:nrow(timeFrame)){
    if (timeFrame[g,i] >= posCompValues[i,1])
      append(spikes, timeFrame[g,i])
  }
}
The data ranges from 500 thousand to upwards of 130 million readings, so if anyone could point me in the right direction I'd appreciate it.
Something like this should work:
Return values of x greater than y:
cmpfun <- function(x,y) return(x[x>y])
For each element (column) of timeFrame, compare with the corresponding value of the first column of posCompValues
vals1 <- Map(cmpfun,timeFrame,posCompValues[,1])
Collapse the list into a single vector:
spikes <- unlist(vals1)
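For illustration, a toy example of that approach (the data below is made up, just to show the shape of the result). Map applies cmpfun column by column, and unlist flattens the per-channel results into one vector whose names record the originating channel:
timeFrame     <- data.frame(ch1 = c(1, 5, 2, 8, 3),
                            ch2 = c(4, 1, 9, 2, 6),
                            ch3 = c(7, 3, 1, 5, 2))
posCompValues <- data.frame(threshold = c(4, 5, 3))   # one adjusted mean per channel

cmpfun <- function(x, y) x[x > y]
spikes <- unlist(Map(cmpfun, timeFrame, posCompValues[, 1]))
spikes   # 5 8 9 6 7 5, with names ch11, ch12, ch21, ... marking the source channel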
If you want to save both the value and the corresponding column it may be worth unpacking this a bit into a for loop:
resList <- list()
for (i in seq(ncol(timeFrame))) {
  tt <- timeFrame[,i]
  spikes <- tt[tt > posCompValues[i,1]]
  if (length(spikes) > 0) {
    resList[[i]] <- data.frame(value = spikes, orig_col = i)
  }
}
res <- do.call(rbind, resList)
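If you would rather avoid the per-column loop as well, the same result can be assembled in one shot from a logical comparison matrix and which(..., arr.ind = TRUE). This is just a sketch; it assumes timeFrame can be coerced to a numeric matrix and that posCompValues[, 1] holds one threshold per column, and it trades memory for speed because the logical matrix is as large as the data:
m <- as.matrix(timeFrame)
hits <- which(sweep(m, 2, posCompValues[, 1], `>`), arr.ind = TRUE)   # row/col index of every spike
res <- data.frame(value = m[hits], orig_col = hits[, "col"])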
I have a data frame df with four columns. I would like to find the number of unequal values for each pair of rows.
I have tried to do it using a for loop and it works out perfectly. However, it takes a very long time to run. Please see my code below:
dist_mat <- matrix(0, nrow(df), nrow(df))
for(i in 1:nrow(df)) {
  for(j in 1:nrow(df)) {
    dist_mat[i,j] <- sum(df[,1:4][i,] != df[,1:4][j,])
  }
}
I thought there would be other way of doing this fast. Any suggestion is appreciated.
P.S. The data is numeric.
Given that the matrix is symmetric and the diagonal will be zero, you don't need to visit each pair of rows twice, so you can cut the looping down by more than half:
for(i in 1:(nrow(df) - 1)) {
  for(j in (i + 1):nrow(df)) {
    dist_mat[i,j] <- sum(df[i,1:4] != df[j,1:4])
  }
}
dist_mat[lower.tri(dist_mat)] <- t(dist_mat)[lower.tri(dist_mat)]   # mirror the upper triangle into the lower
This is a job for combn:
DF <- data.frame(x=rep(1,6), y=rep(1:2,3))
combn(seq_len(nrow(DF)), 2, FUN = function(ind, df) {
  c(ind[1], ind[2], sum(df[ind[1],] != df[ind[2],]))
}, df = as.matrix(DF))
Note that I convert the data.frame into a matrix, since matrix subsetting is faster than data.frame subsetting. Depending on your data types this conversion could become a problem, because as.matrix coerces all columns to a common type.
If your distance measure wasn't so unusual, dist would be helpful (and fast).
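If memory allows (the result is an nrow-by-nrow matrix), the full matrix of mismatch counts can also be built without looping over pairs at all, by summing one comparison matrix per column. A sketch, assuming df has at most a few thousand rows:
m <- as.matrix(df[, 1:4])
dist_mat <- Reduce(`+`, lapply(seq_len(ncol(m)),
                               function(j) outer(m[, j], m[, j], `!=`)))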
I have a big data frame with almost 1M rows (transactions) and 2600 columns (items). The values in the data set are 1's and NA's, and the data type of all the values is factor. I want to add a new column at the end of the data frame that shows the sum of all 1's in each row.
Here is the R code that I wrote:
for(i in 1:nrow(dataset)){
  counter <- 0
  for(j in 1:ncol(dataset)){
    if(!is.na(dataset[i,j])){
      counter <- counter + 1
    }
  }
  dataset[i,ncol(dataset)+1]<-counter
}
But it has been running in RStudio for a very long time, because the running time is O(n^2). I am wondering if there is any other way to do this, or a way to improve this algorithm? (The machine has 80 GB of memory.)
Using a matrix (of numbers, not factors), as @joran suggested, would be better for this, and simply do:
rowSums(your_matrix, na.rm = TRUE)
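If converting 2600 factor columns to a numeric matrix is awkward, the count of non-NA cells can also be computed directly on the data frame, since every non-NA entry is a 1. A one-line sketch (it still builds a logical matrix the size of the data internally, so it needs memory, but no conversion step):
dataset$no_of_1s <- rowSums(!is.na(dataset))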
While eddi's answer is the best fit for your case, a more general solution is to vectorize the code (that is, operate on all rows at once):
counter <- rep(0, nrow(dataset))
for(j in 1:ncol(dataset)) {
  counter <- counter + !is.na(dataset[[j]])
}
dataset$no_of_1s <- counter
One note about this line in your code:
dataset[i,ncol(dataset)+1]<-counter
It creates a new column on each iteration (because at every step there is one more column), so the final data.frame would end up with 1M rows and roughly 1M columns, and it won't fit in your memory.
Another option is to use Reduce
dataset$no_of_1s <- Reduce(function(a,b) a+!is.na(b), dataset, init=integer(nrow(dataset)))
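As a quick sanity check on a small made-up data set (the column names here are hypothetical), the Reduce one-liner gives the same per-row counts the explicit loops would produce:
dataset <- data.frame(itemA = factor(c(1, NA, 1, 1)),
                      itemB = factor(c(NA, NA, 1, 1)),
                      itemC = factor(c(1, 1, NA, 1)))
Reduce(function(a, b) a + !is.na(b), dataset, init = integer(nrow(dataset)))
# [1] 2 1 2 3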
I am a beginner in R programming and am trying to add one extra column to a matrix that has 50 columns. This new column would be the average of the first 10 values in each row.
randomMatrix <- generateMatrix(1,5000,100,50)
randomMatrix51 <- matrix(nrow=100, ncol=1)
for(ctr in 1:ncol(randomMatrix)){
  randomMatrix51.mat[1,ctr] <- sum(randomMatrix [ctr, 1:10])/10
}
This gives the error below:
Error in randomMatrix51.mat[1, ctr] <- sum(randomMatrix[ctr, 1:10])/10 : incorrect number of subscripts on matrix
I tried this
cbind(randomMatrix,sum(randomMatrix [ctr, 1:10])/10)
But it only works for one row; if I use this cbind in the loop, all the old values are overwritten.
How do I add the average of the first 10 values in the new column? Is there a better way to do this than looping over the rows?
Bam!
a <- matrix(1:5000, nrow=100)
a <- cbind(a,apply(a[,1:10],1,mean))
On big datasets it is however faster (and arguably simpler) to use:
cbind(a, rowMeans(a[, 1:10]))
Methinks you are overthinking this.
a <- matrix(1:5000, nrow=100)
a <- transform(a, first10ave = rowMeans(a[, 1:10]))   # transform() returns a data.frame with the extra column