selecting columns specified by a random vector in R - r

I have a large matrix from which I would like to randomly extract a smaller matrix. (I want to do this 1000 times, so ultimately it will be in a for loop.) Say for example that I have this 9x9 matrix:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
From this matrix, I would like a random 3x3 subset. The trick is that I do not want any of the row or column sums in the final matrix to be 0. Another important thing is that I need to know the original number of the rows and columns in the final matrix. So, if I end up randomly selecting rows 4, 5, and 7 and columns 1, 3, and 8, I want to have those identifiers easily accessible in the final matrix.
Here is what I've done so far.
First, I create a vector of row numbers and column numbers. I am trying to keep these attached to the matrix throughout.
r.num<-seq(from=1,to=nrow(mat),by=1) #vector of row numbers
c.num<-seq(from=0, to=(ncol(mat)+1),by=1) #vector of col numbers (adj for r.num)
mat.1<-cbind(r.num,mat)
mat.2<-rbind(c.num,mat.1)
Now I have a 10x10 matrix with identifiers. I can select my rows by creating a random vector and subsetting the matrix.
rand <- sample(r.num,3)
temp1 <- rbind(mat.2[1,],mat.2[rand,]) #keep the identifier row
This works well! Now I want to randomly select 3 columns. This is where I am running into trouble. I tried doing it the same way.
rand2 <- sample(c.num,3)
temp2 <- cbind(temp1[,1],temp1[,rand2])
The problem is that I end up with some row and column sums that are 0. I can eliminate columns that sum to 0 first.
temp3 <- temp1[,which(colSums(temp1[2:nrow(temp1),])>0)]
cols <- which(colSums(temp1[2:nrow(temp1),2:ncol(temp1)])>0)
rand3 <- sample(cols,3)
temp4 <- cbind(temp3[,1],temp3[,rand3])
But I end up with an error message. For some reason, R does not like to subset the matrix this way.
So my question is, is there a better way to subset the matrix by the random vector "rand3" after the zero columns have been removed OR is there a better way to randomly select three complementary rows and columns such that there are none that sum to 0?
Thank you so much for your help!

If I understood your problem, I think this would work:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
smallmatrix = matrix(0, nrow=3, ncol=3)
# resample until no row or column of the 3x3 subset sums to 0
while(any(colSums(smallmatrix) == 0) || any(rowSums(smallmatrix) == 0)){
  cols = sample(ncol(mat),3)
  rows = sample(nrow(mat),3)
  smallmatrix = mat[rows,cols]
}
# label the subset with the original column and row numbers
colnames(smallmatrix) = cols
rownames(smallmatrix) = rows
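Since the question mentions repeating the draw 1000 times, the resampling idea can be wrapped in a function. The following is only a sketch and not part of the original answer: the name sample_submatrix and the max_tries cap are hypothetical additions, and it assumes the entries of mat are non-negative, so that a zero sum means an all-zero row or column.

```r
mat <- matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
                0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
                1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow = 9)

# Hypothetical helper: draw random rows/columns until no row or column sum is 0.
# max_tries is an assumed safety cap so the loop cannot run forever.
sample_submatrix <- function(mat, n_rows = 3, n_cols = 3, max_tries = 1000) {
  for (attempt in seq_len(max_tries)) {
    rows <- sort(sample(nrow(mat), n_rows))
    cols <- sort(sample(ncol(mat), n_cols))
    sub <- mat[rows, cols, drop = FALSE]
    if (all(rowSums(sub) > 0) && all(colSums(sub) > 0)) {
      # keep the original row and column numbers as dimnames
      dimnames(sub) <- list(as.character(rows), as.character(cols))
      return(sub)
    }
  }
  stop("no valid submatrix found in ", max_tries, " tries")
}

set.seed(42)
# a list of 1000 independent random 3x3 subsets
subsets <- replicate(1000, sample_submatrix(mat), simplify = FALSE)
```

The original indices of any draw can be recovered with as.integer(rownames(s)) and as.integer(colnames(s)), which avoids binding identifier rows and columns onto the data.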

Related

Count number of rows in each column in a dataframe that specify a specific condition

I am new to R, so I am sorry if this seems like a stupid question.
So basically I have a dataframe with 100 rows and 3 different columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values in each column that are greater than that column's threshold.
Edit: Sorry for the incomplete question.
So essentially what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies each threshold to its respective column of the dataframe (so there is one threshold for every column of the dataframe). The number of elements of each column that “respect” their threshold should then be put in a vector. So for example:
Column 1: values = 1,2,3. Threshold = 3 (only values lower than 3)
Column 2: values = 4,5,6. Threshold = 6 (only values lower than 6)
Output: A vector (2,2) since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!!
Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option to resolve your question is to use sapply(), which applies a function over a list or vector. In this case, I create a vector for the columns in df with 1:ncol(df). Inside the function, you can count the number of values less than a given threshold by summing the number of TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
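The question asks for a function that takes a dataframe and a vector of thresholds, so the same idea could be packaged as below. This is a sketch: the name count_below is hypothetical, and mapply() is swapped in for sapply() so that each column is paired directly with its threshold.

```r
# Hypothetical wrapper: count, per column, the values strictly below that column's threshold.
count_below <- function(df, thresholds) {
  stopifnot(length(thresholds) == ncol(df))  # one threshold per column
  mapply(function(column, threshold) sum(column < threshold), df, thresholds)
}

df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
counts <- count_below(df, threshold)
print(counts)
```

The result is a vector with one count per column, named after the columns of df.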

R add empty row after 8 rows

I would like to add an empty row after every 8 rows in my whole dataset. The current dataset looks like this:
[image: original dataset]
and the final result I would like to get is:
[image: expected dataset]
Thanks in advance.
This answer is for numeric columns.
For the first eight observations, you can generate a vector of eight ones followed by an NA: v <- c(rep(1, 8), NA)
and then take the Kronecker product of this with the original data: kronecker(a, v) (after applying as.matrix to both a and v).
This should be extendable to every 8th row using a loop.
For the character column, you can first separate it out, rbind an NA row after every 8th element using a loop defined by seq, and then cbind it back to the numeric matrix.
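cur <- rbind(df[1:8,], NA)
# assumes nrow(df) > 8; append each further block of up to 8 rows, followed by an NA row
for(i in seq(from = 9, to = nrow(df), by = 8)) {
  cur <- rbind(cur, df[i:min(i+7, nrow(df)),], NA)
}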
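A loop-free variant of the same rbind idea is sketched below. It is an alternative, not the answerer's method: the function name insert_blank_rows and the example data are hypothetical, and it assumes a data frame whose columns can all hold NA. Note that it also appends an NA row after the last (possibly shorter) block.

```r
# Hypothetical helper: split the rows into blocks of `every` rows and
# append an all-NA row to each block before stacking them again.
insert_blank_rows <- function(df, every = 8) {
  block <- ceiling(seq_len(nrow(df)) / every)   # block id for each row
  pieces <- lapply(split(df, block), function(chunk) rbind(chunk, NA))
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL                         # drop the "1.1", "1.2", ... row names
  out
}

# Made-up example data with one numeric and one character column
df <- data.frame(x = 1:20, y = letters[1:20])
out <- insert_blank_rows(df, every = 8)
print(out)
```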

Matrix subset dimensions

This is a very trivial example, but I do have real data where I am experiencing this particular problem. For simplicity, let's say I have a matrix in R called x, 20 rows x 3 columns
x <- matrix(0, nrow=20, ncol=3)
Then I take a subset of the matrix using an index i. Depending on the algorithm iteration, i can be a single integer, for example i <- 4, or multiple integers, for example i <- c(4:7) (in other words, in one iteration i may be a single integer and in the next iteration i is multiple integers). I'd like to know the size of the resulting subset:
xsubset <- x[i,]
Then I use the dim command
dim(xsubset)
and I get the result: NULL
What do I have to do to determine the number of rows and columns in xsubset?
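One common way to handle this, sketched here under the assumption that the subset should always stay a matrix, is to pass drop = FALSE when indexing. By default R drops a single-row result to a plain vector, which has no dim attribute.

```r
x <- matrix(0, nrow = 20, ncol = 3)

i <- 4
dropped <- x[i, ]                 # single row is simplified to a vector
print(dim(dropped))               # NULL

single <- x[i, , drop = FALSE]    # stays a 1 x 3 matrix
print(dim(single))

i <- 4:7
multi <- x[i, , drop = FALSE]     # 4 x 3 matrix, same code path
print(dim(multi))
```

With drop = FALSE, nrow() and ncol() also work uniformly for both the single-index and multi-index case.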

calculation of block quantities in a matrix in R

I have a very general question on data manipulations in R, and I am seeking a convenient and fast way. Suppose I have a matrix of dimension (R)-by-(nxm), i.e. R rows and n times m columns.
set.seed(999)
n = 5; m = 10; R = 100
ncol = m*n
mat = matrix(rnorm(n*m*R), nrow=R, ncol=ncol)
Now I want to have a new matrix (call it new.mat) of dimension (R)-by-(m), i.e. given a certain row of mat, I want to calculate a number (say sum) for the first n elements, then a number for the next n elements, and so on. In this way, the first row of mat ends up with m numbers. The same thing is done for every other row of mat.
For the given example above, the 1st element of the 1st row of the new matrix new.mat should be sum(mat[1,1:5]), the 2nd element is sum(mat[1,6:10]), and the last element is sum(mat[1,46:50]). The 2nd row of new.mat is (sum(mat[2,1:5]), sum(mat[2,6:10]),...).
If possible, avoiding for loops is preferred. Thank you!
rowsum is a useful function here. You will have to do a bit of transposing to get what you want.
You need to create a grouping vector that is something like c(1,1,1,1,1,2,2,2,2,2,....,10,10,10,10,10):
grp <- rep(seq_len(ceiling(ncol(mat)/5)), each = 5, length.out = ncol(mat))
# this will also work, but may be less clear why.
# grp <- (seq_len(ncol(mat))-1) %/%5
rowsum computes column sums across rows of a numeric matrix-like object for each level of a grouping variable.
You are looking for row sums across columns, so you will have to transpose your results (and your input):
t(rowsum(t(mat),grp))
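As a sanity check, the rowsum result can be compared against the sums spelled out in the question. The sketch below also shows an alternative that reshapes mat into a three-dimensional array; that alternative is an assumption of this example rather than part of the answer, and the name alt is hypothetical.

```r
set.seed(999)
n <- 5; m <- 10; R <- 100
mat <- matrix(rnorm(n * m * R), nrow = R, ncol = n * m)

# Grouping vector 1,1,1,1,1,2,...,10: one group per block of n columns
grp <- rep(seq_len(m), each = n)
new.mat <- t(rowsum(t(mat), grp))

# Alternative: view mat as an R x n x m array and sum over the middle dimension
arr <- array(mat, dim = c(R, n, m))
alt <- apply(arr, c(1, 3), sum)

# Both approaches agree, and match the hand-written sums from the question
stopifnot(isTRUE(all.equal(new.mat, alt, check.attributes = FALSE)))
stopifnot(isTRUE(all.equal(new.mat[1, 1], sum(mat[1, 1:5]))))
stopifnot(isTRUE(all.equal(new.mat[1, 10], sum(mat[1, 46:50]))))
```

The array view works because R stores matrices column by column, so columns 1 to n form the first slice, columns n+1 to 2n the second, and so on.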

Error: (subscript) logical subscript too long

Can someone let me know why I am getting this error and how I can fix it?
What I am trying to do is remove the rows that have a 1 in any column whose sum is less than 10.
Here is the code:
a0=rep(1,40)
a=rep(0:1,20)
b=c(rep(1,20),rep(0,20))
c0=c(rep(0,12),rep(1,28))
c1=c(rep(1,5),rep(0,35))
c2=c(rep(1,8),rep(0,32))
c3=c(rep(1,23),rep(0,17))
c4=c(rep(1,6),rep(0,34))
x=matrix(cbind(a0,a,b,c0,c1,c2,c3,c4),nrow=40,ncol=8)
nam <- paste("V",2:9,sep="")
colnames(x)<-nam
dat <- cbind(y=rnorm(40,50,7),x)
#===================================
toSum <- colSums(dat)
Col <- Val <- NULL
for(i in 1:length(toSum)){
if(toSum[i]<10){
Col <- c(Col,colnames(dat)[i])
Val <- c(Val,toSum[i])}
}
cs <- colSums(dat) < 10
indx <- dat[,which(cs)]==0
for(i in 1:dim(indx)[2]){
datnw <- dat[indx[,i],]
dat <- datnw}
datnw2 <- dat[, -which(cs)]
Thanks
If I understand correctly what you're trying to achieve, you might best write it this way:
cs <- colSums(dat) < 10
dat[rowSums(dat[, cs, drop = FALSE]) == 0, !cs]
This means: for any column with sum less than 10 (called a “small column” hereafter), drop any row which has a 1 in that column. So you only keep rows which have a zero in all those small columns. You drop the small columns as well, as they would only contain zeros in any case.
In your code, indx is a logical matrix with 40 rows, one for each row of input, and one column for each small column in the input. You use the first column of indx to remove the rows with a 1 in the first small column. This results in a new value for dat, which is a few rows shorter than the original. In the next iteration of the loop, you use the second column of indx in an attempt to remove more rows. But this won't work: after the first iteration, dat has fewer than 40 rows, but the second column of indx still has all 40 entries. This is what's causing the error: you're subscripting the rows of a matrix with fewer than 40 rows using a logical vector of length 40.
You could combine the three columns of your indx into a single vector suitable to subscript the rows of interest using the following expression:
apply(indx, 1, all)
This will have a TRUE value in its result for exactly those rows which have TRUE in each column. However, I guess I'd prefer my code above over this, as it is much shorter to write. The most likely reason to prefer the apply version is if your data may contain negative numbers, so that a row sum of zero does not imply an all-zero row. That is not a problem in your example data.
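To see that the two approaches agree on the question's data, here is a small self-contained sketch. The variable names res_sum and res_apply are hypothetical, the seed is arbitrary, and drop = FALSE is added as a precaution in case only one column falls under the cutoff.

```r
set.seed(1)  # arbitrary seed; y only needs a column sum well above 10
x <- cbind(rep(1, 40), rep(0:1, 20), c(rep(1, 20), rep(0, 20)),
           c(rep(0, 12), rep(1, 28)), c(rep(1, 5), rep(0, 35)),
           c(rep(1, 8), rep(0, 32)), c(rep(1, 23), rep(0, 17)),
           c(rep(1, 6), rep(0, 34)))
colnames(x) <- paste0("V", 2:9)
dat <- cbind(y = rnorm(40, 50, 7), x)

cs <- colSums(dat) < 10                       # the "small columns"
small <- dat[, cs, drop = FALSE]

# Row-sum version: keep rows that are zero in every small column
res_sum <- dat[rowSums(small) == 0, !cs, drop = FALSE]

# apply version: combine the per-column tests into one logical vector first
keep <- apply(small == 0, 1, all)
res_apply <- dat[keep, !cs, drop = FALSE]

stopifnot(identical(res_sum, res_apply))
print(dim(res_sum))
```

Because the row filter is computed once against the full 40-row matrix, no subscript is ever longer than the object it indexes.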

Resources