skip NA's when computing dot product - r

I am adjusting the measurements in a data matrix by subtracting their projections onto the first 1-2 principal components. The problem is that if there is even a single NA in the data matrix (almost inevitable for thousands of measurements), the inner product x %*% y (I also tried sum(x*y)) returns NA for vectors x, y. Is there a simple way (i.e. avoiding conditional statements and loops) of computing the inner product over only the non-NA values, so that the operation actually returns something?
Incidentally, I would like to avoid just replacing NAs with 0s, since then I would have to renormalize the vectors at each stage.

You can try this command:
sum(x*y, na.rm = TRUE)
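This works because x * y is NA wherever either vector has an NA, so na.rm = TRUE drops exactly the incomplete pairs. A quick illustration with made-up vectors:
x <- c(1, 2, NA, 4)
y <- c(2, NA, 3, 1)
x %*% y                   # NA: a single missing pair poisons the whole product
sum(x * y, na.rm = TRUE)  # 6, i.e. 1*2 + 4*1, using only the complete pairs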

Related

Generate all permutations of a large vector and perform calculations efficiently

I have a vector "perm" of length 445, which includes 260 zeros and 185 ones. I need to generate the matrix of all its possible permutations. Then I need to perform a couple of calculations efficiently. I have kind of managed to do this but not completely and not efficiently (I used a loop). I would need help in improving my code, as it would be very instructive.
First, since R does not let me create this huge matrix of permutations, I use the 'ri' package to randomly sample 100,000 permutations.
library(ri)
lalonde <- read.csv('lalonde.csv')
perms <- genperms(lalonde$treat, maxiter=100000)
Instead, I would like to be able to generate the full matrix of permutations (or a list of lists, if that works better).
Then I merge my original dataset, lalonde, with the permutation dataset.
lalonde1 <- data.frame(perms, lalonde)
I create an empty list in which to store the output of the loop that follows:
diff_vec <- vector(mode='list', length=100000)
I create a loop that calculates an absolute conditional difference in means for each permuted vector and store the results in the empty list. This is far from efficient, and I would very much appreciate some advice on how to do it better.
for (i in 1:100000) {
  diff_vec[i] <- abs(mean(lalonde1[lalonde1[[i]] == 1, "re78"]) -
                     mean(lalonde1[lalonde1[[i]] == 0, "re78"]))
}
Finally, for each absolute difference in means, I check whether it is greater than or equal to a value I stored (tau_hat), assigning 1 if it is and 0 otherwise.
p_val<-ifelse(diff_vec>=tau_hat, 1, 0)
For those who care, this is to calculate an exact p-value in a completely randomised experiment.
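As a rough sketch of how the loop above might be vectorised (assuming, as in the code shown, that each column of perms is one permuted 0/1 assignment, re78 is the outcome column in lalonde, and tau_hat is the observed statistic), the merge into lalonde1 and the explicit loop can be avoided by applying the difference over the columns of perms directly:
# absolute difference in mean re78 between the two groups defined by one assignment w
diff_fun <- function(w) {
  abs(mean(lalonde$re78[w == 1]) - mean(lalonde$re78[w == 0]))
}
diff_vec <- apply(perms, 2, diff_fun)  # one value per permutation
p_val <- mean(diff_vec >= tau_hat)     # proportion at least as extreme as tau_hat
The last line takes the mean of the logical comparison directly, which is the proportion that the 0/1 vector in the question was building toward.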

Replace unknown number of rows in matrix

I am running some big simulations in which each iteration creates a matrix with an unknown number of rows. I want to combine all these matrices into one big matrix. Intuitively, the easiest way to do this is to set up an empty matrix before the loop and then append each iteration's result with rbind. However, this is very inefficient because of memory allocation, and I need a fast script!
Therefore, I decided to create a sparse results matrix filled with 0s prior to the loop and then replace its rows iteratively with the new matrices. The size of this results matrix is very exaggerated to make sure that I have "enough" rows. The "unused" rows get removed after the loop. An example of my current solution is as follows (this is just dummy code, the real code is more complex):
library(Matrix)
#Create empty results matrix
matrix_to_fill<-sparseMatrix(i=integer(0),j=integer(0),x=0,
dims=c(100,10))
#Run 10 iterations where a new matrix of unknown size is created
for (i in 1:10) {
  # Find the first row whose first column is still 0
  first0 <- (1:100)[matrix_to_fill[, 1] == 0][1]
  # Create a new matrix with a random number of rows
  new_matrix <- matrix(rep(1, 10 * (rpois(1, 2) + 1)), ncol = 10)
  # Replace the section of matrix_to_fill starting at the first 0 row with the new matrix
  matrix_to_fill[first0:(first0 + nrow(new_matrix) - 1), ] <- new_matrix
}
# Drop the unused all-zero rows after the loop
matrix_to_fill <- matrix_to_fill[matrix_to_fill[, 1] != 0, ]
However, again, I am running into a similar problem with memory allocation. Since I do not know the number of rows of each of my matrices, I have to store them first, calculate their row counts and then replace the respective rows in my results matrix.
Is there a way I can replace an unknown number of rows in my results matrix with a new matrix (I do know the starting row and I do know that my table is big enough to "fit" the new matrix)? Or do I have to solve this by creating a "results list" of known size prior to the loop instead of a "results matrix" and then filling each new matrix into the list? This would of course be possible, but I'm afraid it would be less efficient...
Thanks!
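For what it's worth, a minimal sketch of the "results list" alternative mentioned above, using the same dummy dimensions as the example: each iteration's matrix goes into a pre-allocated list, and everything is bound together once after the loop, so neither the per-iteration row count nor the total size needs to be known in advance.
# pre-allocate a list with one slot per iteration
pieces <- vector("list", 10)
for (i in 1:10) {
  # each iteration produces a matrix with a random number of rows
  pieces[[i]] <- matrix(rep(1, 10 * (rpois(1, 2) + 1)), ncol = 10)
}
# a single rbind at the end avoids repeated reallocation inside the loop
matrix_to_fill <- do.call(rbind, pieces)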

What is the best way to perform operations for pairs of complete observations in R?

Consider two vectors x and y of equal length; each of them has both NA and non-NA values. The objective is to efficiently find pairs of non-NA values between the two vectors in order to perform some downstream operations on these pairs of complete observations. A complete pair is defined as a non-NA value at index i in both x and y.
What I have tried is the following:
complete.pairs <- !is.na(x) & !is.na(y)
complete.x <- x[complete.pairs]
complete.y <- y[complete.pairs]
However, I'm unsure if this is the fastest way to do it. For some context, the idea is to process all combinations of rows in a matrix this way (so x and y would be a single combination). I'm looking for something like what cor(x, use = "pairwise.complete.obs") does, as it seems highly efficient at finding complete pairs per combination of vectors and then computing a correlation. Here I don't necessarily want to compute a correlation afterwards, but rather other things such as statistical tests. From my tests, my approach seems really inefficient for large matrices.
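For reference, base R's complete.cases() computes the same mask as the !is.na(x) & !is.na(y) line above. A sketch of how that logic might be wrapped for all row pairs of a matrix (the names m, f and pairwise_stat are illustrative, not from the question):
# apply a two-sample function f to the complete pairs of every pair of rows of m
pairwise_stat <- function(m, f) {
  idx <- combn(nrow(m), 2)          # all row-pair index combinations
  apply(idx, 2, function(p) {
    x <- m[p[1], ]
    y <- m[p[2], ]
    ok <- complete.cases(x, y)      # TRUE where both x and y are non-NA
    f(x[ok], y[ok])
  })
}
# e.g. pairwise_stat(m, function(a, b) t.test(a, b)$p.value)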

issue summing columns

I have a very large dataset and I'm trying to get the sums of values. The variables are binary with 0s and 1s.
Somehow, when I run a for loop
for (i in 7:39) {
  agegroup1[53640, i] <- sum(agegroup1[, i])
}
The loop runs, but every column except the first ends up containing nothing but NA. I tried printing the values and saw 0s and 1s, and checked the class (it returns "integer"). But when adding them all up, R does not work.
Any advice?
cs <- colSums(agegroup1[, 7:39])
will give you the vector of column sums without looping (at the R level).
If you have any missing values (NAs) in agegroup1[, 7:39] then you may want to add na.rm = TRUE to the colSums() call (or even your sum() call).
You don't say what agegroup1 is or how many rows it has etc, but to finalise what your loop is doing, you then need
agegroup1[53640, 7:39] <- cs
What was in agegroup1[53640, ] before you started adding the column sums? NA? If so, that would explain some of the behaviour.
We do really need more detail though...
@Gavin Simpson provided a workable solution, but alternatively you could use apply. This function allows you to apply a function to the row or column margin.
x <- cbind(x1=1, x2=c(1:8), y=runif(8))
# If you wanted to sum the rows of columns 2 and 3
apply(x[,2:3], 1, sum, na.rm=TRUE)
# If you want to sum the columns of columns 2 and 3
apply(x[,2:3], 2, sum, na.rm=TRUE)
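For plain sums, the specialised rowSums() and colSums() functions do the same job as these apply() calls (and also take na.rm):
rowSums(x[, 2:3], na.rm = TRUE)   # row sums of columns 2 and 3
colSums(x[, 2:3], na.rm = TRUE)   # column sums of columns 2 and 3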

Matrix turned to class(character) when removing NA values

Reproducible example below. I have a simulation loop, within which I occasionally have rows I need to remove from a matrix. I have done this by entering an NA value at a specific position in the row I need to remove, and then I have a line of code that removes any row containing an NA. This has worked great so far. My issue is that I am now running simulations in a way that occasionally whittles my matrix down to a single row. When this occurs, the matrix gets transformed into a 'character', and crashes the simulation.
Example:
mat <- matrix(1:10, 5, 2)          # setting up a simplified example matrix
mat[3:5, 1] <- NA                  # giving 3 rows NA values, for removal of these rows
mat <- mat[!is.na(mat[, 1]), ]     # an example where my procedure works just fine
class(mat)
mat[2, 1] <- NA                    # setting 1 of the remaining 2 rows as NA
mat <- mat[!is.na(mat[, 1]), ]     # removing one of the final two rows
class(mat)                         # no longer a matrix
Is there some way I can do this, where I don't lose my formatting as a matrix at the end? I am assuming this issue is coming from my use of the "is.na" command, but I haven't found a good way around using this.
To give a bit more insight into the issue, in case there is a MUCH better way to do this that I am too naive to have found yet... In my real-life simulation, I have a column in the matrix that holds a '1' when the individual in the given row is alive, and a '0' when dead. When an individual (a single row) dies and the value goes from '1' to '0', I need to remove the row. The only way I knew how to do this was to change the '0' to an NA and then remove all rows with an NA. If there is a way to just remove the rows with a '0' in a specific column that avoids this issue, that would be great!
By default, the [ function drops the result to the lowest possible dimension. In your example, you have a two-dimensional array (a matrix): when a single row is extracted, it is coerced down to a plain vector.
To avoid that, have a look at the drop option to the [ function. You should be doing:
mat <- mat[!is.na(mat[,1]),, drop = FALSE]
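The same drop = FALSE idea also covers the alive/dead bookkeeping described at the end of the question, without going through NA at all. A sketch, assuming the status column is column 1 (the real column index may differ):
# keep only the rows whose status column is still 1 ("alive"),
# and keep the result a matrix even if only one row survives
mat <- mat[mat[, 1] != 0, , drop = FALSE]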
