Calculation of block quantities in a matrix in R

I have a very general question on data manipulations in R, and I am seeking a convenient and fast way. Suppose I have a matrix of dimension (R)-by-(nxm), i.e. R rows and n times m columns.
set.seed(999)
n = 5; m = 10; R = 100
ncol = m*n
mat = matrix(rnorm(n*m*R), nrow=R, ncol=ncol)
Now I want to have a new matrix (call it new.mat) of dimension (R)-by-(m), i.e. given a certain row of mat, I want to calculate a number (say sum) for the first n elements, then a number for the next n elements, and so on. In this way, the first row of mat ends up with m numbers. The same thing is done for every other row of mat.
For the given example above, the 1st element of the 1st row of the new matrix new.mat should be sum(mat[1,1:5]), the 2nd element is sum(mat[1,6:10]), and the last element is sum(mat[1,46:50]). The 2nd row of new.mat is (sum(mat[2,1:5]), sum(mat[2,6:10]), ...).
If possible, avoiding for loops is preferred. Thank you!

rowsum is a useful function here. You will have to do a bit of transposing to get what you want.
You need to create a grouping vector with one label per column, something like c(1,1,1,1,1,2,2,2,2,2,....,10,10,10,10,10):
grp <- rep(seq_len(ceiling(ncol(mat)/5)), each = 5, length.out = ncol(mat))
# this will also work, but may be less clear why.
# grp <- (seq_len(ncol(mat))-1) %/%5
rowsum computes column sums across rows of a numeric matrix-like object for each level of a grouping variable.
You are looking for row sums across columns, so you will have to transpose both your input and your result:
t(rowsum(t(mat),grp))
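As a quick sanity check, here is a minimal sketch that rebuilds the example from the question and compares the rowsum result with a second route. The array-based alternative (the name alt) is not from the answer above; it is only an illustration and assumes every block has exactly n columns.

```r
set.seed(999)
n <- 5; m <- 10; R <- 100
mat <- matrix(rnorm(n * m * R), nrow = R, ncol = n * m)

# one group label per column: 1,1,1,1,1,2,2,...,10
grp <- rep(seq_len(m), each = n)

# rowsum() sums rows by group, so transpose on the way in and out
new.mat <- t(rowsum(t(mat), grp))

# illustrative alternative: view mat as an R x n x m array
# (valid only because all blocks have the same width n) and sum over n
alt <- apply(array(mat, dim = c(R, n, m)), c(1, 3), sum)

dim(new.mat)                             # R rows, m columns
isTRUE(all.equal(unname(new.mat), alt))  # both routes should agree
```

rowsum is probably the better choice when the blocks have unequal widths, since only the grouping vector has to change.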

Related

Remove list element where any of the row sums in the element are less than a certain value (in R)

Apologies in advance for what I know is simple. I just haven't been able to find the solution despite the 1000 search attempts and my rudimentary skills are not up to the challenge.
I have a list of matrices consisting of rows of integers. I can find row totals etc. with the (l)apply functions. What I am stuck on, however, is removing an entire element if any of its rows fails a certain criterion, say a total of <500.
So in the below example:
x1 <- rnorm(50,5,0.32)
dim(x1) <- c(5,10)
x2 =rnorm(50,25,3.2)
dim(x2) <- c(5,10)
x3 =rnorm(50,25,3.2)
dim(x3) <- c(5,10)
x4=rnorm(50,0.8,0.1)
dim(x4) <- c(5,10)
x5=rep(NaN,50)
dim(x5) <- c(5,10)
list1<-list(x1,x2,x3,x4,x5)
If I sum each row in each element for a total:
goodbit <- lapply(list1, function (x) apply(x, 1, function(c) sum(c)))
I know I can filter out the elements with NAs:
list1nonas <- Filter(Negate(anyNA),list1)
But I am having a hard time extending that to criteria based on the row totals. For example, how can I remove any element where any row total in that element is < 8?
(Element [[4]] in this example).
You can use rowSums. If we want to test whether there are any rowSums less than 8 in a given matrix x, we can do any(rowSums(x) < 8). Therefore the logical negation of this will return TRUE if none of the row sums are less than 8.
We can therefore put this inside an sapply to run the test on each matrix in our list, and return a logical vector.
Subsetting our original list by this vector returns a filtered list with only those matrices that have no row sums below 8.
list1[sapply(list1, function(x) !any(rowSums(x) < 8))]
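One caveat if the list still contains the all-NaN matrix from the question: for that element any(rowSums(x) < 8) is NA, and subsetting a list with NA yields a NULL entry rather than dropping it. A minimal sketch that treats such elements as failing the test, using small constant matrices as stand-in data (they are assumptions, not the question's random draws):

```r
# stand-in data: row sums are 50, 250, 5 and NaN respectively
list1 <- list(
  matrix(5,   nrow = 5, ncol = 10),
  matrix(25,  nrow = 5, ncol = 10),
  matrix(0.5, nrow = 5, ncol = 10),
  matrix(NaN, nrow = 5, ncol = 10)
)

# isTRUE() turns an NA result into FALSE, so NaN/NA matrices are dropped too
keep <- vapply(list1, function(x) isTRUE(all(rowSums(x) >= 8)), logical(1))
filtered <- list1[keep]

length(filtered)
```

Alternatively, apply the original one-liner to list1nonas, which has already had the NA-containing elements filtered out.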

Count number of rows in each column in a dataframe that specify a specific condition

New to R btw so I am sorry if it seems like a stupid question.
So basically I have a dataframe with 100 rows and 3 different columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values of each column that are superior to the value of each threshold.
Edit: Sry for the incomplete question.
So essentially what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies every threshold to its respective column of the dataframe (so there is one threshold for every column of the dataframe). The number of elements of each column that “respect” their threshold should then be put in a vector. So for example:
Column 1: values = 1,2,3. Threshold = (only values lower than 3)
Column 2: values = 4,5,6. Threshold = (only values lower than 6)
Output: A vector (2,2) since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!!
Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option to resolve your question is to use sapply(), which applies a function over a list or vector. In this case, I create a vector for the columns in df with 1:ncol(df). Inside the function, you can count the number of values less than a given threshold by summing the number of TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
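Since the question asked for a function taking a dataframe and a vector of thresholds, here is one possible sketch. The name count_below is made up for illustration, and it uses mapply to walk over columns and thresholds in parallel instead of indexing by column number.

```r
# count_below is a hypothetical helper name, not an existing R function
count_below <- function(df, threshold) {
  # one threshold per column is assumed
  stopifnot(ncol(df) == length(threshold))
  # a data frame is a list of columns, so mapply pairs column i with threshold i
  unname(mapply(function(col, th) sum(col < th), df, threshold))
}

df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
res <- count_below(df, threshold)
res
```

For the example data this gives the vector (2,2) described in the question.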

Faster method of counting specified values from rows in large matrix in R

MC is a very large matrix, 1E6 rows (or more) and 500 columns. I am trying to get the number of occurrences of the values 1 through 13 for each of the columns. Sometimes the number of occurrences for one of these values will be zero. I would like my final output to be a 500x13 matrix (or data frame) with these count values. I am wondering if anyone can suggest a more efficient manner than what I currently have, which is the following:
MCct <- matrix(0, 500, 13)
for (j in 1:500) {
  for (i in 1:13) {
    MCct[j, i] <- length(which(MC[, j] == i))
  }
}
I don't think that table works, because I also need to know when zero occurrences occurred... I couldn't figure out how to do that, if it is possible. And I am only somewhat familiar with apply, so maybe there is a method to use that... I haven't been successful in figuring that out yet.
Thanks for the help,
Vivien
You could do this with sapply (to iterate over the values 1 to 13) and colSums (to count, for each column of MC, how many entries equal the current value i):
MCct <- sapply(1:13, function(i) {
colSums(MC == i)
})
Suppose you have a set of values you're interested in
set <- 1:4
n = length(set)
and you have a matrix that includes those values, and others
m <- matrix(sample(10, 120, TRUE), 12, 10)
Create a vector indicating the index in the set of each matching value
idx <- match(m, set)
then make the index unique to each column
idx <- idx + (col(m) - 1) * n
idx ranges from 1 (occurrences of the first set element in the first column) to n * ncol(m) (occurrence of the nth set element in the last column of m). Tabulate the unique values of idx
v <- tabulate(idx, nbins = n * ncol(m))
The first n elements of v summarize the number of times set elements 1..n appear in the first column of m. The second n elements of v summarize the number of times set elements 1..n appear in the second column of m, etc. Reshape as the desired matrix, where each row represents the corresponding member of the set.
matrix(v, ncol=ncol(m))
table can count zero occurrences, you just need to create a factor that has the whole range of levels, e.g.
apply(MC, 2, function(x) table(factor(x, levels=1:13)))
This is not as efficient as @Patronus' solution though.
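To see that the three approaches above agree, including on zero counts, here is a small check on made-up data. The matrix size and the choice to leave out the value 6 are assumptions for illustration only. Note that the sapply version returns columns-by-values, while the other two return values-by-columns.

```r
set.seed(42)
vals <- 1:13
# small stand-in for MC; the value 6 never occurs, so its counts must be 0
MC <- matrix(sample(c(1:5, 7:13), 200 * 6, replace = TRUE), nrow = 200, ncol = 6)

# sapply + colSums: 6 x 13 (one row per column of MC)
by_sapply <- sapply(vals, function(i) colSums(MC == i))

# match + tabulate: 13 x 6 (one row per value)
idx <- match(MC, vals) + (col(MC) - 1L) * length(vals)
by_tabulate <- matrix(tabulate(idx, nbins = length(vals) * ncol(MC)),
                      ncol = ncol(MC))

# table on a factor with all 13 levels: 13 x 6
by_table <- apply(MC, 2, function(x) table(factor(x, levels = vals)))

all(t(by_sapply) == by_tabulate)
all(by_table == by_tabulate)
```

Transpose the 13 x 6 results with t() if you need one row per column of MC, as in the question.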

selecting columns specified by a random vector in R

I have a large matrix from which I would like to randomly extract a smaller matrix. (I want to do this 1000 times, so ultimately it will be in a for loop.) Say for example that I have this 9x9 matrix:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
From this matrix, I would like a random 3x3 subset. The trick is that I do not want any of the row or column sums in the final matrix to be 0. Another important thing is that I need to know the original number of the rows and columns in the final matrix. So, if I end up randomly selecting rows 4, 5, and 7 and columns 1, 3, and 8, I want to have those identifiers easily accessible in the final matrix.
Here is what I've done so far.
First, I create a vector of row numbers and column numbers. I am trying to keep these attached to the matrix throughout.
r.num<-seq(from=1,to=nrow(mat),by=1) #vector of row numbers
c.num<-seq(from=0, to=(ncol(mat)+1),by=1) #vector of col numbers (adj for r.num)
mat.1<-cbind(r.num,mat)
mat.2<-rbind(c.num,mat.1)
Now I have a 10x10 matrix with identifiers. I can select my rows by creating a random vector and subsetting the matrix.
rand <- sample(r.num,3)
temp1 <- rbind(mat.2[1,],mat.2[rand,]) #keep the identifier row
This works well! Now I want to randomly select 3 columns. This is where I am running into trouble. I tried doing it the same way.
rand2 <- sample(c.num,3)
temp2 <- cbind(temp1[,1],temp1[,rand2])
The problem is that I end up with some row and column sums that are 0. I can eliminate columns that sum to 0 first.
temp3 <- temp1[,which(colSums(temp1[2:nrow(temp1),])>0)]
cols <- which(colSums(temp1[2:nrow(temp1),2:ncol(temp1)])>0)
rand3 <- sample(cols,3)
temp4 <- cbind(temp3[,1],temp3[,rand3])
But I end up with an error message. For some reason, R does not like to subset the matrix this way.
So my question is, is there a better way to subset the matrix by the random vector "rand3" after the zero columns have been removed OR is there a better way to randomly select three complementary rows and columns such that there are none that sum to 0?
Thank you so much for your help!
If I understood your problem, I think this would work:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
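# start from an all-zero matrix so the loop body runs at least once
smallmatrix = matrix(0, nrow=3, ncol=3)
# redraw until no row or column of the 3x3 subset sums to 0
while(any(colSums(smallmatrix) == 0) || any(rowSums(smallmatrix) == 0)){
  cols = sample(ncol(mat),3)
  rows = sample(nrow(mat),3)
  smallmatrix = mat[rows,cols]
}
# keep the original column and row numbers as dimnames
colnames(smallmatrix) = cols
rownames(smallmatrix) = rows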
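Since you want to repeat this 1000 times, it may help to wrap the loop in a function. The sketch below is only one way to do it: the name sample_submatrix and the max_tries cap are my own additions (the cap guards against looping forever if no valid subset exists), and the indices are sorted so the subset keeps the original row and column order.

```r
mat <- matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
                0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
                1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow = 9)

# hypothetical helper: draw a k x k subset with no all-zero row or column
sample_submatrix <- function(mat, k = 3, max_tries = 1000) {
  for (attempt in seq_len(max_tries)) {
    rows <- sort(sample(nrow(mat), k))
    cols <- sort(sample(ncol(mat), k))
    sub <- mat[rows, cols, drop = FALSE]
    if (all(rowSums(sub) > 0) && all(colSums(sub) > 0)) {
      # store the original row/column numbers as dimnames
      dimnames(sub) <- list(rows, cols)
      return(sub)
    }
  }
  stop("no valid submatrix found in ", max_tries, " attempts")
}

set.seed(1)
s <- sample_submatrix(mat)
s
# the original indices can be recovered from the dimnames
as.integer(rownames(s)); as.integer(colnames(s))
```

Something like replicate(1000, sample_submatrix(mat), simplify = FALSE) would then collect the 1000 draws in a list.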

Return value from column indicated in same row

I'm stuck with a simple loop that takes more than an hour to run, and need help to speed it up.
Basically, I have a matrix with 31 columns and 400 000 rows. The first 30 columns have values, and the 31st column has a column-number. I need to, per row, retrieve the value in the column indicated by the 31st column.
Example row: [26,354,72,5987..,461,3] (this means that the value in column 3 is sought after (72))
The too slow loop looks like this:
a <- rep(0,nrow(data)) #To pre-allocate memory
for (i in 1:nrow(data)) {
a[i] <- data[i,data[i,31]]
}
I would think this would work:
a <- data[,data[,31]]
... but it results in "Error: cannot allocate vector of size 2.8 Mb".
I fear that this is a really simple question, so I've spent hours trying to understand apply, lapply, reshape, and more, but somehow I can't get a grip on the vectorization concept in R.
The matrix actually has even more columns that also go into the a-parameter, which is why I don't want to rebuild the matrix, or split it.
Your support is highly appreciated!
Chris
t(data[,1:30])[30*(0:399999)+data[,31]]
This works because you can reference matrices both in array format and in vector format (the transposed block t(data[,1:30]) is a vector of length 30*400000 in this case), counting column-wise first. To count row-wise, you use the transpose.
Single-index notation for the matrix may use less memory. This would involve doing something like:
i <- nrow(data)*(data[,31]-1) + 1:nrow(data)
a <- data[i]
Below is an example of single-index notation for matrices in R. In this example, the index of the per-row maximum is appended as the last column of a random matrix. This last column is then used to select the per-row maxima via single-index notation.
## create a random (10 x 5) matrix
M <- matrix(rpois(50,50),10,5)
## use the last column to index the maximum value of the first 5
## columns
MM <- cbind(M,apply(M,1,which.max))
## linear index = nrow * (column ID - 1) + row ID
i <- nrow(MM)*(MM[,ncol(MM)]-1) + 1:nrow(MM)
all(MM[i] == apply(M,1,max))
Using an index matrix is an alternative that will probably use more memory but is slightly clearer:
ii <- cbind(1:nrow(MM),MM[,ncol(MM)])
all(MM[ii] == apply(M,1,max))
Try to change the code to work a column at a time:
M <- matrix(rpois(30*400000,50),400000,30)
MM <- cbind(M,apply(M,1,which.max))
a <- rep(0,nrow(MM))
for (i in 1:(ncol(MM)-1)) {
a[MM[, ncol(MM)] == i] <- MM[MM[, ncol(MM)] == i, i]
}
This sets all elements in a with the values from column i if the last column has value i. It took longer to build the matrix than to calculate vector a.
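To convince yourself that the vectorized variants above return the same values as the original loop, here is a small check on a made-up matrix. The names dat, sel, n_row and n_val are placeholders, and 5 value columns stand in for the 30 in the question.

```r
set.seed(7)
n_row <- 8; n_val <- 5
vals <- matrix(rpois(n_row * n_val, 50), n_row, n_val)
# last column holds, per row, the number of the column to read
dat <- cbind(vals, sample(n_val, n_row, replace = TRUE))
sel <- dat[, n_val + 1]

# reference: the row-by-row lookup from the question
by_loop <- vapply(seq_len(n_row), function(i) dat[i, sel[i]], numeric(1))

# single (linear) index: nrow * (column - 1) + row
by_linear <- dat[n_row * (sel - 1) + seq_len(n_row)]

# two-column index matrix: (row, column) pairs
by_matrix <- dat[cbind(seq_len(n_row), sel)]

# transpose of the value block, indexed column-wise
by_transpose <- t(dat[, 1:n_val])[n_val * (seq_len(n_row) - 1) + sel]

all(by_loop == by_linear, by_loop == by_matrix, by_loop == by_transpose)
```

On the real data the index-matrix form would be a <- data[cbind(seq_len(nrow(data)), data[, 31])].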
