I have a large-ish dataframe (40000 observations of 800 variables) and wish to operate on a range of columns of every observation with something akin to dot product. This is how I implemented it:
matrixattempt <- as.matrix(dframe)
takerow <- function(k) {as.vector(matrixattempt[k,])}
takedot0 <- function(k) {sqrt(sum(data0averrow * takerow(k)[2:785]))}
for (k in 1:40000){
print(k)
dframe$dot0aver[k]<-takedot0(k)
}
The print is just to keep track of what's going on. data0averrow is a numeric vector, same size as takerow(k)[2:785], that has been pre-defined.
This is running, and from a few tests running correctly, but it is very slow.
I searched for dot product for a subset of columns, and found this question, but could not figure out how to apply it to my setup. ddply sounds like it should work faster (although I do not want to do splitting and would have to use the same define-id trick that the referenced questioner did). Any insight/hints?
Try this:
sqrt(colSums(t(matrixattempt[, 2:785]) * data0averrow))
or equivalently:
sqrt(matrixattempt[, 2:785] %*% data0averrow)
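A quick way to convince yourself that both one-liners match the original loop is to run all three on a small made-up matrix. The sketch below assumes toy dimensions (6 rows, 5 columns, with columns 2:5 standing in for 2:785) and non-negative values so the square root is defined. The names m and w and the sizes are illustrative, not the asker's data.

```r
set.seed(1)
# toy stand-ins: column 1 plays the role of an id column, columns 2:5 the data
m <- matrix(runif(30), nrow = 6, ncol = 5)
w <- runif(4)  # plays the role of data0averrow

# original approach: one row at a time
by_loop <- numeric(nrow(m))
for (k in seq_len(nrow(m))) {
  by_loop[k] <- sqrt(sum(w * m[k, 2:5]))
}

# vectorised: elementwise product on the transposed block, then column sums
by_colsums <- sqrt(colSums(t(m[, 2:5]) * w))

# vectorised: matrix product; drop() turns the n x 1 result into a plain vector
by_matmul <- drop(sqrt(m[, 2:5] %*% w))

all.equal(by_loop, by_colsums)
all.equal(by_loop, by_matmul)
```

Note that the %*% form returns a one-column matrix, so drop() (or as.vector()) is useful before assigning it to a data frame column.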
Use matrix multiplication on the selected columns. The product already yields one dot product per row, so no rowSums is needed:
dframe$dot0aver <- sqrt(drop(matrixattempt[, 2:785] %*% data0averrow))
It's the sqrt of the dot product of data0averrow with columns 2:785 of each row.
Related
I understand for-loops are slow in R, and the suite of apply() functions are designed to be used instead (in many cases).
However, I can't figure out how to use those functions in my situation, and advice would be greatly appreciated.
I have a list/vector of values (let's say length=10,000) and at every point, starting at the 21st value, I need to take the standard deviation of the trailing 20 values. So at the 21st value, I take the SD of the 1st through 21st. At the 22nd value, I take SD(2:22), and so on.
So you see I have a rolling window where I need to take the SD() of the previous 20 indices. Is there any way to accomplish this faster, without a for-loop?
I found a solution to my question.
The zoo package has a function called "rollapply" which does exactly that: uses apply() on a rolling window basis.
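For reference, here is a minimal sketch of the rolling SD, assuming a window of exactly 20 values that ends at the current index (the question is ambiguous between 20 and 21 values). The base R part uses embed(); the zoo call is only run if the package is installed. The names x, width and roll_sd are made up for the example.

```r
set.seed(42)
x <- rnorm(100)
width <- 20

# embed() lays out every run of `width` consecutive values as one row
# (most recent value first), so row i covers x[i:(i + width - 1)]
windows <- embed(x, width)

# pad the front with NA so the result lines up with x
roll_sd <- c(rep(NA_real_, width - 1), apply(windows, 1, sd))

# the same computation with zoo, if it is available
if (requireNamespace("zoo", quietly = TRUE)) {
  roll_sd_zoo <- zoo::rollapply(x, width = width, FUN = sd,
                                fill = NA, align = "right")
  print(all.equal(roll_sd, roll_sd_zoo))
}
```

Both versions still call sd() once per window, so this removes the explicit loop rather than the per-window work.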
library(microbenchmark)
library(ggplot2)
# dummy vector
c <- 50
x <- sample(1:100, c, replace = TRUE)
# parameters
y <- 20             # offset from the first to the last index of each window
z <- length(x) - y  # final starting index
# benchmark
xx <- microbenchmark(
  lapply = {
    a <- lapply(1:z, \(i) sd(x[i:(i + y)]))
  },
  loop = {
    b <- vector("list", z)
    for (i in 1:z) {
      b[[i]] <- sd(x[i:(i + y)])
    }
  },
  times = 30
)
# plot
autoplot(xx) +
  ggtitle(paste('vector of size', c))
It would appear that while lapply has the speed advantage on a smaller vector, a loop should be preferred for longer vectors.
I would maintain, however, that loops are not slow per se, as long as they are not misapplied (for example, iterating over the rows of a data frame).
I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (of which you are seeing many here). That said, you can still use rowSums if you find it useful, like so. Note that I use sapply, which here returns a matrix.
# random custom function
myfun <- function(x){
return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest to convert the data set into the 'long' format, group it by sample, and then calculate the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
#    sample  V1
# 1:      1  65
# 2:      2  89
# 3:      3 117
You can easily replace value^2 by any function you want.
You can use the apply function, selecting the columns you need with c(i1, i2, ...):
apply(( x[ , c(2, 3) ])^2, 1 ,sum )
If you want to apply a function named somefunction to some of the columns, whose indices or column names are in the vector col_indices, and then sum the results, you can do:
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
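As a rough illustration, here is the vectorised case worked through on the question's data. somefunction and col_indices below are hypothetical placeholders (any vectorised function and any column selection would do), and the rowSums line is shown only for comparison.

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)
col_indices <- c("a", "b")           # hypothetical choice of columns
somefunction <- function(v) v^2 + 3  # hypothetical vectorised function

# row by row: each row of the selected columns is passed as a vector
by_apply <- unname(apply(x[, col_indices], 1, function(r) sum(somefunction(r))))

# whole block at once: works because somefunction is vectorised
by_rowsums <- unname(rowSums(somefunction(x[, col_indices])))

all.equal(by_apply, by_rowsums)
```

When the function is vectorised, the rowSums form avoids calling an R function once per row, which tends to matter on large data.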
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define the function as the square, as you have above, but of course this can be any function of any complexity (so long as it takes a vector as input and returns a vector of the same length; if it doesn't, the result won't fit into the original data.frame).
The steps below are extra pedantic to show each little bit, but obviously they can be compressed into one or two steps. Note that I only retain the row sums of the squared columns, not the squared columns themselves, given that you might want to save space in memory if you are working with lots and lots of data.
1. create data; define the function
2. grab the columns you want as a separate (temporary) data.frame
3. apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately make it a temporary data.frame. This is not necessary.
5. calculate the sums of the rows of the temporary data.frame and append it as a new column in x.
6. remove the temp data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
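As mentioned, the steps can be compressed. One possible base R sketch, which is not the only way to do it, uses Reduce to add up the squared columns so that no temporary data.frame is created:

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)
square <- function(v) v^2

# lapply() squares each selected column; Reduce() adds the resulting
# vectors elementwise, so no temporary data.frame is kept around
x$squareRowSums <- Reduce(`+`, lapply(x[2:3], square))
x
```

This relies on the function returning a vector of the same length as its input, the same condition stated above.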
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))
I am very green in R, so there is probably a very easy solution to this:
I want to calculate the average correlation between the column vectors in a square matrix:
x<-matrix(rnorm(10000),ncol=100)
aux<-matrix(seq(1,10000))
loop<-sapply(aux,function(i,j) cov(x[,i],x[,j]))
cor_x<-mean(loop)
When evaluating the sapply line I get the error 'subscript out of bounds'. I know I can do this via a script but is there any way to achieve this in one line of code?
No need for any loops. Just use mean(cov(x)), which does this very efficiently.
The problem is due to aux. The variable aux has to range from 1 to 100, since x has 100 columns. But your aux runs over every element of x and hence ranges from 1 to 10000, so x[, i] fails as soon as i exceeds 100. It will work with the following code:
aux <- seq(1, 100)
loop <- sapply(aux, function(i, j) cov(x[, i], x[, j]))
Afterwards, you can calculate mean covariance with:
cor_x <- mean(loop)
If you want to exclude duplicate fields (e.g., cov(X,Y) is inherently identical to cov(Y,X)), you can use:
cor_x <- mean(loop[upper.tri(loop, diag = TRUE)])
If you also want to exclude cov(X,X), i.e., variance, you can use:
cor_x <- mean(loop[upper.tri(loop)])
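For comparison, the three averages can also be taken straight from cov(x), without sapply. This is a sketch on simulated data; cv is just a name chosen here, and the last line assumes you may actually want correlation, as the question's wording suggests, rather than covariance.

```r
set.seed(1)
x <- matrix(rnorm(10000), ncol = 100)

cv <- cov(x)  # 100 x 100 matrix, cv[i, j] is cov(x[, i], x[, j])

mean(cv)                              # every pair in both orders, plus variances
mean(cv[upper.tri(cv, diag = TRUE)])  # each pair once, variances included
mean(cv[upper.tri(cv)])               # each pair once, variances excluded

# if correlation rather than covariance is wanted, cor() has the same shape
mean(cor(x)[upper.tri(cv)])
```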
I have a data frame (150000 obs, 15 variables) in R and need to correct a subset of values of one variable (simply by multiplying by a constant) based on the value of another. What's an easy way to do this?
I thought apply would work, but I'm not sure how to write the function (obviously I can't multiply in the function like this) and the qualifier:
df$RESULT <- df[apply(df$RESULT, 1, function(x * 18.01420678) where(SITE==1)), ]
Do you mean this?
dat <- data.frame(x=1:10,y=sample(20,10))
constant <- 100
dat$y <- ifelse(dat$x > dat$y, dat$y*constant, dat$y)
You could use the capacity of "[" to do subsetting but for "correction" of a subset you need to use the logical expression that defines the subset on both sides of the assignment. Since you will then be working with only the values that need correction you do not use any further conditional function.
df[ df$SITE==1, "RESULT" ] <- df[ df$SITE==1, "RESULT"] * 18.01420678
In cases where the operation is to be done on a large number (millions) of cases, or done repeatedly in simulations, this approach may be much faster than the ifelse approach.
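A small sketch of the two approaches side by side, on made-up data. The column names follow the question, but the values and the multiplier are illustrative only. Computing the logical index once and reusing it keeps both sides of the assignment in sync.

```r
df <- data.frame(SITE = c(1, 2, 1, 3), RESULT = c(10, 20, 30, 40))
multiplier <- 2  # illustrative constant, stands in for 18.01420678

# compute the logical index once and reuse it on both sides
sel <- df$SITE == 1
by_index <- df
by_index$RESULT[sel] <- by_index$RESULT[sel] * multiplier

# same result with ifelse(), which evaluates both branches for every row
by_ifelse <- df
by_ifelse$RESULT <- ifelse(df$SITE == 1, df$RESULT * multiplier, df$RESULT)

identical(by_index, by_ifelse)
```

If SITE can contain NA, the logical index will contain NA as well and the subscripted assignment will fail; wrapping it as which(sel) is one way around that.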
I am a beginner to R programming and am trying to add one extra column to a matrix having 50 columns. This new column would be the avg of first 10 values in that row.
randomMatrix <- generateMatrix(1,5000,100,50)
randomMatrix51 <- matrix(nrow=100, ncol=1)
for(ctr in 1:ncol(randomMatrix)){
randomMatrix51.mat[1,ctr] <- sum(randomMatrix [ctr, 1:10])/10
}
This gives the below error
Error in randomMatrix51.mat[1, ctr] <- sum(randomMatrix[ctr, 1:10])/10 : incorrect number of subscripts on matrix
I tried this
cbind(randomMatrix,sum(randomMatrix [ctr, 1:10])/10)
But it only works for one row, if I use this cbind in the loop all the old values are over written.
How do I add the average of the first 10 values as the new column? Is there a better way to do this than looping over rows?
Bam!
a <- matrix(1:5000, nrow=100)
a <- cbind(a,apply(a[,1:10],1,mean))
On big datasets it is however faster (and arguably simpler) to use:
cbind(a, rowMeans(a[,1:10]) )
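A quick check, on the same toy matrix, that the apply and rowMeans versions agree and that the result has the expected shape. The names by_apply, by_rowmeans and a51 are made up for this sketch.

```r
a <- matrix(1:5000, nrow = 100)

# mean of the first 10 columns, computed per row in two ways
by_apply <- apply(a[, 1:10], 1, mean)
by_rowmeans <- rowMeans(a[, 1:10])
all.equal(by_apply, by_rowmeans)

# append the row means as column 51
a51 <- cbind(a, by_rowmeans)
dim(a51)
```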
Methinks you are over thinking this.
a <- matrix(1:5000, nrow=100)
# average the first 10 columns of each row; note that transform() returns a data.frame
a <- transform(a, first10ave = rowMeans(a[, 1:10]))