I'm trying to create a sparse matrix where each row has at most n entries, each an integer within a certain range, which I could then use as an adjacency matrix for social network analysis. For example, an 80x80 matrix where each row has 10 or fewer entries that are integers from 1 to 4. The goal is to represent the sort of data you would get from a social networking survey in which respondents select values between 1 and 4 to indicate their relationship with up to 10 of the possibilities/columns in the survey.
I can create a sparse matrix using the "rsparsematrix" function, and with the density argument I can approximate the required number of responses, but I can't control the number of responses per row, and I would have to do additional processing to convert the random values to integers within my desired range.
e.g. I could start with something like
M1 <- rsparsematrix(80, 80, density = .1, symmetric = FALSE)
A more promising approach (from https://www.r-bloggers.com/casting-a-wide-and-sparse-matrix-in-r/) is to generate the values as a data frame and then use "transform" to prepare them for casting into a matrix. This lets me control the integer values, but it still doesn't limit the number of responses per row.
Example code from the blog follows:
set.seed(11)
N = 10
data = data.frame(
  row = sample(1:3, N, replace = TRUE),
  col = sample(LETTERS, N, replace = TRUE),
  value = sample(1:3, N, replace = TRUE))
data = transform(data,
  row = factor(row),
  col = factor(col))
This could be tweaked to give the required 80x80 matrix, but it doesn't solve the problem of limiting the responses per row, and duplicate entries in the same row/column combination will produce out-of-range values, since duplicates are resolved by taking the sum.
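To see the summing problem concretely, here is a minimal sketch (the data frame dupes is made up for illustration, and xtabs stands in for the blog's casting step): two responses land on the same row/column cell and are added, producing a 6 even though individual values are capped at 4.
dupes <- data.frame(row = c(1, 1), col = c("A", "A"), value = c(2, 4))
xtabs(value ~ row + col, data = dupes)  # the single cell holds 2 + 4 = 6, out of range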
Any suggestions would be most appreciated.
As a bonus question, how would you then create random rows of null responses? For example, within the 80x80 matrix, how might you introduce 40 random rows with no values? As in the description above, this would correspond to missing survey data.
You can try to build the sparse matrix up using its row (i), column (j), and value (x) components. This involves sampling subject to your row and value constraints.
# constraints
values <- 1:4
maxValuesPerRow <- 10
nrow <- 80
ncol <- 80
# sample values: how many values each row gets (at most maxValuesPerRow)
set.seed(1)
nValuesForEachRow <- sample(maxValuesPerRow, nrow, replace=TRUE)
# create matrix
library(Matrix)
i <- rep(seq_len(nrow), nValuesForEachRow) # row
j <- unlist(lapply(nValuesForEachRow, sample, x=seq_len(ncol))) # which columns
x <- sample(values, sum(nValuesForEachRow), replace=TRUE) # values
sm <- sparseMatrix(i=i, j=j, x=x)
# check
dim(sm)
table(rowSums(sm>0))
table(as.vector(sm))
Note: you can't just sample all the columns at once, as below, since that can produce duplicate row/column pairs within a row; hence the per-row sampling via lapply above.
j <- sample(seq_len(ncol), sum(nValuesForEachRow), replace=TRUE)
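For the bonus question in the original post (rows of null responses), a minimal sketch building on the same i/j/x construction: sample 40 row indices and drop their triplets before building the matrix, passing dims so the emptied rows are kept.
# bonus: blank out 40 random rows to mimic respondents with no survey data
nullRows <- sample(nrow, 40)
keep <- !(i %in% nullRows)
sm2 <- sparseMatrix(i = i[keep], j = j[keep], x = x[keep], dims = c(nrow, ncol))
sum(rowSums(sm2 > 0) == 0)  # should be 40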
The code below will do what you want. It generates your random sparse matrix, rounds it to whole numbers, then, for every row with more than 10 entries, randomly sets entries to NA until only 10 remain. Finally, it replaces every non-NA entry with a random integer between 1 and 4.
library(Matrix)
M1 <- as.data.frame(as.matrix(rsparsematrix(80, 80, density = .1, symmetric = FALSE)))
M1 <- as.data.frame(round(M1))  # round all entries to whole numbers
M1 <- as.data.frame(sapply(M1, function(x) ifelse(x == 0, NA, x)))  # zeros become NA
rows <- which(apply(M1, 1, function(x) sum(!is.na(x))) > 10)  # rows with more than 10 entries
for (i in rows)
{
  filled <- which(!is.na(M1[i, ]))
  toNA <- setdiff(filled, sample(filled, 10, replace = FALSE))
  M1[i, toNA] <- NA
}
for (i in 1:nrow(M1))
{
  filled <- which(!is.na(M1[i, ]))
  M1[i, filled] <- sample(1:4, length(filled), replace = TRUE)
}
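A quick sanity check on the result, assuming the loops above ran on the same M1: every row should have at most 10 responses and all remaining values should lie in 1 to 4.
table(apply(M1, 1, function(x) sum(!is.na(x))))  # responses per row, all <= 10
table(unlist(M1), useNA = "ifany")  # values are NA or in 1:4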
I have a large data frame (50k rows by 5k columns). I would like to make a smaller data frame using the following rule: for a given k, 0 < k < n, select the largest set of columns such that at least k rows have non-NA values in all of these columns.
This seems like it might be too hard for a computer to do on a big data frame, but I'm hoping it is possible. I have written code for this operation.
It seems my way of doing this is too complex. It relies on (1) computing a list of all possible subsets of the set of columns, and then (2) checking how many shared rows they have. For even small numbers of columns (1) gets too slow (e.g. 45 seconds for 25 columns).
Question: Is it theoretically possible to get the largest set of columns sharing at least k non-na rows? If so, what is a more realistic approach?
@alexis_laz's elegant answer to a similar question takes an inverse approach to mine, examining all (fixed-size) subsets of the observations/samples/draws/units and checking which variables are present in them.
Taking combinations of n observations is difficult for large n. For example, length(combn(1:500, 3, simplify = FALSE)) yields 20,708,500 combinations for 500 observations, and on my computer it fails to produce the combinations for sizes greater than 3. This makes me worry that it won't be feasible for large n and p with either approach.
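For scale, choose() gives the counts without materializing the combinations, which puts a quick check on the figure above:
choose(500, 3)  # 20708500, matching the count above
choose(500, 4)  # about 2.57e9, far too many to enumerate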
I have included an example matrix for reproducibility.
require(dplyr)
# generate example matrix
set.seed(123)
n = 100
p = 25
missing = 25
mat = rnorm(n * p)
mat[sample(1:(n*p), missing)] = NA
mat = matrix(mat, nrow = n, ncol = p)
colnames(mat) = 1:p
# matrix reporting whether a value is na
hasVal = 1-is.na(mat)
system.time(
# collect all possible subsets of the columns' indices
nameSubsets <<- unlist(lapply(1:ncol(mat), combn, x = colnames(mat), simplify = FALSE),
recursive = FALSE,
use.names = FALSE)
)
# how many observations have all of the subset's variables
countObsWithVars = function(varsVec){
  selectedCols = as.matrix(hasVal[, varsVec])
  countInRow = apply(selectedCols, 1, sum)  # for each row, number of non-NA values
  numMatching = sum(countInRow == length(varsVec))  # rows that have all selected columns
  numMatching
}
system.time(
numObsWithVars <<- unlist(lapply(nameSubsets, countObsWithVars))
)
# collect results into a data.frame
df = data.frame(subSetNum = 1:length(numObsWithVars),
numObsWithVars = numObsWithVars,
numVarsInSubset = unlist(lapply(nameSubsets, length)),
varsInSubset = I(nameSubsets))
# find the largest set of columns for each number of rows
maxdf = df %>% group_by(numObsWithVars) %>%
  filter(numVarsInSubset == max(numVarsInSubset)) %>%
  arrange(numObsWithVars)
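As an aside, step (2) can at least be vectorized; here is a sketch of a drop-in replacement for countObsWithVars using rowSums, which avoids the per-row apply call (it speeds up the counting but does nothing about the combinatorial explosion in step (1)):
countObsWithVarsFast = function(varsVec){
  selectedCols = hasVal[, varsVec, drop = FALSE]
  sum(rowSums(selectedCols) == length(varsVec))  # rows with all selected columns present
}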
I have a large matrix from which I would like to randomly extract a smaller matrix. (I want to do this 1000 times, so ultimately it will be in a for loop.) Say for example that I have this 9x9 matrix:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
From this matrix, I would like a random 3x3 subset. The trick is that I do not want any of the row or column sums in the final matrix to be 0. Another important thing is that I need to know the original number of the rows and columns in the final matrix. So, if I end up randomly selecting rows 4, 5, and 7 and columns 1, 3, and 8, I want to have those identifiers easily accessible in the final matrix.
Here is what I've done so far.
First, I create a vector of row numbers and column numbers. I am trying to keep these attached to the matrix throughout.
r.num <- seq(from = 1, to = nrow(mat), by = 1)  # vector of row numbers
c.num <- seq(from = 0, to = ncol(mat), by = 1)  # vector of col numbers (0 labels the r.num column)
mat.1 <- cbind(r.num, mat)
mat.2 <- rbind(c.num, mat.1)
Now I have a 10x10 matrix with identifiers. I can select my rows by creating a random vector and subsetting the matrix.
rand <- sample(r.num,3)
temp1 <- rbind(mat.2[1,],mat.2[rand,]) #keep the identifier row
This works well! Now I want to randomly select 3 columns. This is where I am running into trouble. I tried doing it the same way.
rand2 <- sample(c.num,3)
temp2 <- cbind(temp1[,1],temp1[,rand2])
The problem is that I end up with some row and column sums that are 0. I can eliminate columns that sum to 0 first.
temp3 <- temp1[,which(colSums(temp1[2:nrow(temp1),])>0)]
cols <- which(colSums(temp1[2:nrow(temp1),2:ncol(temp1)])>0)
rand3 <- sample(cols,3)
temp4 <- cbind(temp3[,1],temp3[,rand3])
But I end up with an error message. For some reason, R does not like to subset the matrix this way.
So my question is, is there a better way to subset the matrix by the random vector "rand3" after the zero columns have been removed OR is there a better way to randomly select three complementary rows and columns such that there are none that sum to 0?
Thank you so much for your help!
If I understood your problem, I think this would work:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
# start with an all-zero matrix so the while condition is TRUE on entry
smallmatrix = matrix(0, nrow = 3, ncol = 3)
# keep drawing until no row or column of the sample sums to 0
while(any(apply(smallmatrix, 2, sum) == 0) | any(apply(smallmatrix, 1, sum) == 0)){
  cols = sample(ncol(mat), 3)
  rows = sample(nrow(mat), 3)
  smallmatrix = mat[rows, cols]
}
# keep the original row/column indices as dimnames
colnames(smallmatrix) = cols
rownames(smallmatrix) = rows
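Since the original post mentions repeating this 1000 times, one way to package it (the function name draw3x3 is made up here for illustration) is to wrap the rejection loop in a function and call it with replicate:
draw3x3 = function(mat){
  smallmatrix = matrix(0, nrow = 3, ncol = 3)
  while(any(rowSums(smallmatrix) == 0) | any(colSums(smallmatrix) == 0)){
    rows = sample(nrow(mat), 3)
    cols = sample(ncol(mat), 3)
    smallmatrix = mat[rows, cols]
  }
  rownames(smallmatrix) = rows  # original row indices
  colnames(smallmatrix) = cols  # original column indices
  smallmatrix
}
results = replicate(1000, draw3x3(mat), simplify = FALSE)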