I want to generate a random n by n matrix in R with discrete values ranging from 1 to n. The tricky part is that I want each value to be unique both in its row and in its column.
For example, if n=3 the matrix could look like:
1 2 3
2 3 1
3 1 2
or it could look like this:
2 3 1
1 2 3
3 1 2
Does anyone have an idea of how to generate this kind of matrix?
What you want is called a Latin square. Here's one function (from the Cookbook for R; see also here and a bunch of other search results online) that generates them:
latinsquare <- function(len, reps=1, seed=NA, returnstrings=FALSE) {
# Save the old random seed and use the new one, if present
if (!is.na(seed)) {
if (exists(".Random.seed")) { saved.seed <- .Random.seed }
else { saved.seed <- NA }
set.seed(seed)
}
# This matrix will contain all the individual squares
allsq <- matrix(nrow=reps*len, ncol=len)
# Store a string id of each square if requested
if (returnstrings) { squareid <- vector(mode = "character", length = reps) }
# Get a random element from a vector (the built-in sample function annoyingly
# has different behavior if there's only one element in x)
sample1 <- function(x) {
if (length(x)==1) { return(x) }
else { return(sample(x,1)) }
}
# Generate each of n individual squares
for (n in 1:reps) {
# Generate an empty square
sq <- matrix(nrow=len, ncol=len)
# If we fill the square sequentially from top left, some latin squares
# are more probable than others. So we have to do it random order,
# all over the square.
# The rough procedure is:
# - randomly select a cell that is currently NA (call it the target cell)
# - find all the NA cells sharing the same row or column as the target
# - fill the target cell
# - fill the other cells sharing the row/col
# - If it ever is impossible to fill a cell because all the numbers
# are already used, then quit and start over with a new square.
# In short, it picks a random empty cell, fills it, then fills in the
# other empty cells in the "cross" in random order. If we went totally randomly
# (without the cross), the failure rate is much higher.
while (any(is.na(sq))) {
# Pick a random cell which is currently NA
k <- sample1(which(is.na(sq)))
i <- (k-1) %% len +1 # Get the row num
j <- floor((k-1) / len) +1 # Get the col num
# Find the other NA cells in the "cross" centered at i,j
sqrow <- sq[i,]
sqcol <- sq[,j]
# A matrix of coordinates of all the NA cells in the cross
openCell <-rbind( cbind(which(is.na(sqcol)), j),
cbind(i, which(is.na(sqrow))))
# Randomize fill order
openCell <- openCell[sample(nrow(openCell)),]
# Put center cell at top of list, so that it gets filled first
openCell <- rbind(c(i,j), openCell)
# There will now be three entries for the center cell, so remove duplicated entries
# Need to make sure it's a matrix -- otherwise, if there's just
# one row, it turns into a vector, which causes problems
openCell <- matrix(openCell[!duplicated(openCell),], ncol=2)
# Fill in the center of the cross, then the other open spaces in the cross
for (c in 1:nrow(openCell)) {
# The current cell to fill
ci <- openCell[c,1]
cj <- openCell[c,2]
# Get the numbers that are unused in the "cross" centered on i,j
freeNum <- which(!(1:len %in% c(sq[ci,], sq[,cj])))
# Fill in this location on the square
if (length(freeNum)>0) { sq[ci,cj] <- sample1(freeNum) }
else {
# Failed attempt - no available numbers
# Re-generate empty square
sq <- matrix(nrow=len, ncol=len)
# Break out of loop
break;
}
}
}
# Store the individual square into the matrix containing all squares
allsqrows <- ((n-1)*len) + 1:len
allsq[allsqrows,] <- sq
# Store a string representation of the square if requested. Each unique
# square has a unique string.
if (returnstrings) { squareid[n] <- paste(sq, collapse="") }
}
# Restore the old random seed, if present
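# (saved.seed is a long vector, so compare it with identical(); the seed lives in the global environment)
if (!is.na(seed) && !identical(saved.seed, NA)) { assign(".Random.seed", saved.seed, envir = globalenv()) }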
if (returnstrings) { return(squareid) }
else { return(allsq) }
}
mats is a list of such matrices. The code below uses r2dtable to generate N random n x n matrices whose elements are chosen from 0, 1, ..., n-1 and whose row and column margins are each given by margin. It then keeps only those in which every column contains each of 0:(n-1) exactly once, and adds one to each matrix to give the result. The number of matrices returned can vary, and as n gets larger you have to generate a huge number of matrices N to get just a few. When I tried n <- 3 below, mats was a list of 24 matrices out of 100, but with n <- 4 it found only 1 out of 100.
set.seed(123)
N <- 100 # no of tries
n <- 3 # rows of matrix (= # cols)
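# Every column and every row must contain each of 0:(n-1) exactly once
# (for n >= 4 the column test plus fixed margins is not enough on its own)
check <- function(x) { s <- seq_len(nrow(x)) - 1; all(apply(x, 2, sort) == s) && all(apply(x, 1, sort) == s) }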
margin <- sum(seq_len(n))-n
margins <- rep(margin, n)
L <- r2dtable(N, r = margins, c = margins)
mats <- lapply(Filter(check, L), "+", 1)
Here is an attempt:
x <- c(1,2,3)
out <- NULL
for(i in 1:3){
y <- c(x[1 + (i+0) %% 3], x[1 + (i+1) %% 3], x[1 + (i+2) %% 3])
out <- rbind(out,y)
}
This gives:
> out
[,1] [,2] [,3]
y 2 3 1
y 3 1 2
y 1 2 3
For the general case:
n <- 4
x <- 1:n
out <- NULL
for(i in 1:n){
y <- x[1 + ((i+0:(n-1))%%n)]
out <- rbind(out,y)
}
If I'm not wrong this is the expected result:
> out
[,1] [,2] [,3] [,4]
y 2 3 4 1
y 3 4 1 2
y 4 1 2 3
y 1 2 3 4
Shorter:
n <- 4
x <- 1:n
vapply(x, function(i) x[1 + ((i+0:(n-1))%%n)], numeric(n))
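The cyclic construction above always produces the same square for a given n. As a rough sketch of one way to randomize it (the name random_latin is made up for illustration), the rows, columns and symbols of the cyclic square can each be shuffled. This keeps the Latin property but, as far as I know, does not sample uniformly from all Latin squares, so treat it as an approximation:

```r
# Hypothetical helper: randomize a cyclic Latin square of order n.
# Assumption: non-uniform sampling over Latin squares is acceptable.
random_latin <- function(n) {
  # Cyclic base square: entry (i, j) is ((i + j - 2) mod n) + 1
  base <- outer(seq_len(n), seq_len(n), function(i, j) (i + j - 2) %% n + 1)
  # Shuffling rows and columns keeps every row and column a permutation
  sq <- base[sample(n), sample(n), drop = FALSE]
  # Relabel the symbols with a random permutation of 1..n
  relabel <- sample(n)
  matrix(relabel[sq], nrow = n)
}

set.seed(1)
random_latin(5)
```

Each call returns a different square unless the seed is fixed first.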
Here is one version that generates all possible rows for such a matrix and then takes them one by one, restricting the selection to the valid choices each time:
n <- 9
allrows <- combinat::permn(n)
takerows <- function(taken, all) {
available <- rep(TRUE, length(all))
for(i in 1:nrow(taken)) {
available <- sapply(all, function(x) all((x-taken[i,])!=0)) & available
}
matrix(all[[which(available)[sample(sum(available), 1)]]], nrow=1)
}
magicMat <- takerows(matrix(rep(0, n), ncol=n), allrows)
for(i in 1:(n-1)) {
magicMat <- rbind(magicMat, takerows(magicMat, allrows))
}
> magicMat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 5 3 1 4 2 8 6 7 9
[2,] 9 8 6 2 1 3 7 4 5
[3,] 4 5 7 8 9 2 3 6 1
[4,] 3 9 2 1 6 7 5 8 4
[5,] 1 6 5 3 8 4 2 9 7
[6,] 7 2 4 9 3 5 8 1 6
[7,] 6 4 8 5 7 1 9 3 2
[8,] 8 1 9 7 5 6 4 2 3
[9,] 2 7 3 6 4 9 1 5 8
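Whichever generator is used, it can be handy to verify the result. Below is a small checking function as a sketch; is_latin is a name invented here, and it assumes the symbols are exactly 1..n:

```r
# Hypothetical helper: TRUE if m is an n x n Latin square on the symbols 1..n
is_latin <- function(m) {
  n <- nrow(m)
  if (ncol(m) != n) return(FALSE)
  # A row or column is valid when it is a permutation of 1..n
  is_perm <- function(v) identical(sort(as.integer(v)), seq_len(n))
  all(apply(m, 1, is_perm)) && all(apply(m, 2, is_perm))
}

ok  <- matrix(c(2, 3, 1,
                1, 2, 3,
                3, 1, 2), nrow = 3, byrow = TRUE)
bad <- matrix(c(1, 2, 3,
                1, 2, 3,
                3, 1, 2), nrow = 3, byrow = TRUE)
is_latin(ok)   # TRUE
is_latin(bad)  # FALSE: the first column contains 1 twice
```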
I am trying to read several .RLS (spreadsheet) files into one matrix using R. These files have very formulaic names, which makes them easy to work with. My plan (see code below) is to create the file names in a vector called names, then use a for loop to read each file and add its second column to a matrix. However, I have encountered an issue. I have tested the names part of my code, and I can read the tables into R individually. However, when I try to put them all together into the matrix collected using the second for loop, I get an error that says "incorrect number of subscripts on matrix". I am not sure what this means. Any advice would be welcome.
library(tidyverse)
collector <- function(min, max){
collected <- matrix(nrow = 601, ncol = max - min + 2)
names = c()
for (i in 1:(max-min+1)){
names[i] = paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",(i+min-1),".RLS")
}
for (j in 1:(max-min+1)){
e <- read_table(names[j], col_names=FALSE)
collected[,j+1] = e[,2]
}
}
test <- collector(15, 23)
test
Regarding the issue, it is probably because read_table returns a tibble, and a tibble doesn't drop dimensions with [, so e[,2] is still a one-column tibble rather than a vector. Instead, we need [[.
collector <- function(min, max){
collected <- matrix(nrow = 601, ncol = max - min + 2)
names = c()
for (i in 1:(max-min+1)){
names[i] = paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",(i+min-1),".RLS")
}
for (j in 1:(max-min+1)){
e <- read_table(names[j], col_names=FALSE)
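collected[,j+1] = e[[2]] ## changed: [[ extracts the column as a plain vector
}
collected # return the filled matrix (a for loop on its own returns NULL)
}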
Instead of initializing names as an empty vector, we can create a vector of the required length and then assign with [i]. Other than that, the code works with dummy data:
collector <- function(min, max){
i1 <- max - min + 1
collected <- matrix(nrow = 601, ncol = max - min + 2)
names = character(i1)
for (i in 1:i1){
names[i] = paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A",(i+min-1),".RLS")
}
for (j in 1:i1){
e <- cbind(NA, 1:601) # created dummy data
collected[,j+1] = e[,2]
}
collected
}
test <- collector(15, 23)
Testing:
> head(test)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] NA 1 1 1 1 1 1 1 1 1
[2,] NA 2 2 2 2 2 2 2 2 2
[3,] NA 3 3 3 3 3 3 3 3 3
[4,] NA 4 4 4 4 4 4 4 4 4
[5,] NA 5 5 5 5 5 5 5 5 5
[6,] NA 6 6 6 6 6 6 6 6 6
NOTE: The last part, reading the data, couldn't be tested. It may be that some of the files don't contain data and thus couldn't be read. Also, paste0 (like paste) is vectorized, so the first loop is not really needed.
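To illustrate the point about vectorization, here is a minimal sketch that builds all the file names in one call. The wrapper name make_names is invented for this example; the path and prefix are copied from the question:

```r
# Build all file names in one vectorized call (no loop needed).
# make_names is a hypothetical wrapper; the path comes from the question.
make_names <- function(min, max) {
  paste0("D:/CHY 498/UV-Vis/22822/BH4_3/12321A", min:max, ".RLS")
}

file_names <- make_names(15, 23)
length(file_names)  # 9
file_names[1]
```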
I have three 4x4x1000 arrays A, B, C. I want to combine each 4x4 slice from A, B, C to generate 1 billion combinations. I am currently trying to circularly shift the arrays B and C along the 3rd dimension, but my code (in R) is very slow and takes ~6 minutes to run, while similar code in MATLAB takes only ~4 s. However, all my other code is in R, so I wonder whether there is a faster way to run this in R. Thanks a lot.
library(abind) # provides abind(), used in the loop below
m=4
n=1000
a1=array(runif(m*m*n,0,1), c(m,m,n))
a2=a1
a3=a1
Sys.time()
for (i in 1:n) {
for (j in 1:n) {
a3 = abind(a3[,,2:n],a3[,,1])
}
a2 = abind(a2[,,2:n],a2[,,1])
}
Sys.time()
# more simplified example:
m = 2
n = 3
a1 = array(1:(m*m*n), c(m,m,n))
a1
To do "circular shift", we do not need to physically change data n times.
We can simply calculate the end indexes and subset the data once.
For example, we want to shift 2 times:
n3 <- 2
new_ind <- c((1:n)[-(1:n3)], (1:n)[1:n3]) # calculate indices
new_ind
# [1] 3 1 2
a2_v2 = a1[,,new_ind] # one subset
a2_v2
# , , 1
#
# [,1] [,2]
# [1,] 9 11
# [2,] 10 12
#
# , , 2
#
# [,1] [,2]
# [1,] 1 3
# [2,] 2 4
#
# , , 3
#
# [,1] [,2]
# [1,] 5 7
# [2,] 6 8
P.S. There is no point in shifting n (the size of the 3rd dimension) times,
as we will get back the initial data (that is what you did in your example).
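The index calculation can be wrapped in a small function so that any shift amount works, including shifts larger than n. This is only a sketch; circ_shift3 is a made-up name, and it assumes a 3-dimensional array shifted along its last dimension:

```r
# Hypothetical helper: circularly shift a 3-d array by k slices along the
# third dimension using a single subsetting operation.
circ_shift3 <- function(a, k) {
  n <- dim(a)[3]
  idx <- ((seq_len(n) - 1 + k) %% n) + 1  # k = 2, n = 3 gives 3 1 2
  a[, , idx, drop = FALSE]
}

a1 <- array(1:12, c(2, 2, 3))
shifted <- circ_shift3(a1, 2)
shifted[, , 1]  # the slice that was originally third
```

Because only one subset is taken per shift, nothing is copied repeatedly the way the abind loop does.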
I want to find the first index k of an array at which the running total up to k is bigger than a given cutoff. In code this looks as follows:
k <- 0
agg <- 0
while (agg < cutoff) {
k <- k +1
agg <- sum(array[1:k])
}
I was told there is a way to rewrite this without the loop, and that which would be helpful. I'm new to R and couldn't find the way. Any thoughts on this?
First we find the array of partial sums:
x <- 1:10
partial_sums <- Reduce('+', x, accumulate = TRUE)
partial_sums
[1] 1 3 6 10 15 21 28 36 45 55
Next we find the indices of all the elements of partial_sums that are bigger than cutoff:
cutoff <- 17
indices <- which(partial_sums > cutoff)
indices[1]
[1] 6
Note that indices could be empty if no partial sum exceeds the cutoff, in which case indices[1] is NA.
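The same idea can be written more compactly with the built-in cumsum, which computes the partial sums directly. A minimal sketch, with first_over as an invented name:

```r
# Hypothetical helper: first index at which the running total exceeds cutoff.
# Returns NA when the total never exceeds the cutoff.
first_over <- function(x, cutoff) {
  which(cumsum(x) > cutoff)[1]
}

first_over(1:10, 17)   # 6
first_over(1:10, 100)  # NA
```

Using >= instead of > would reproduce the stopping condition of the while loop in the question, which exits as soon as the total reaches the cutoff.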
You can use the following:
set.seed(123) # in order to have reproducible "random" numbers
m1 <- matrix(sample(10),nrow = 5,ncol = 2)# create a matrix
m1
[,1] [,2]
[1,] 7 5
[2,] 4 2
[3,] 9 8
[4,] 1 6
[5,] 3 10
cutoff <- 5 #applying cutoff value
apply(m1,2,function(x){x<cutoff})#checking each column using apply instead of loop
OR:
which(m1 < cutoff) #results in the indices of m1 that comply to the condition <cutoff
[1] 2 4 5 7
EDIT
cutoff<-30# a new cutoff
v1<-unlist(lapply(seq_along(1:(nrow(m1)*ncol(m1))),function(x){sum(m1[1:x])}))#adding the values of each cell
which(v1>=cutoff)[1]#find the 1st of occurrence
I have two tables in R (females and males) with presence-absence data. I'd like to do pairwise comparisons between them (row-by-row) to find the number of cells not shared between each pair (i.e the sum of cells equal to 1 on the female but not on the male and vice-versa).
I know that the cross product (%*%) does the opposite of what I need. It creates a new matrix containing the number of shared cells between each pair of males and females (i.e. the sum of cells equal to 1 in both).
Here is an example dataset:
females <- as.data.frame(matrix(c(0,0,0,1,1,0,1,0,1,0,1,0,1,0,1,1,1,0,1,1,1,0,1,1,1), nrow=5, byrow=T))
males <- as.data.frame(matrix(c(1,0,0,1,1,0,1,0,1,1,1,0,1,0,1,1,1,0,1,1,1,0,1,0,1), nrow=5, byrow=T))
rownames(females) <-c ("female_1","female_2","female_3","female_4","female_5")
rownames(males) <-c ("male_1","male_2","male_3","male_4","male_5")
So, if I do the cross product
as.matrix(females) %*% t(as.matrix(males))
I get this
male_1 male_2 male_3 male_4 male_5
female_1 2 2 1 2 1
female_2 1 2 0 2 0
female_3 2 1 3 2 3
female_4 3 3 2 4 2
female_5 3 2 3 3 3
But I need this (only first row shown)
male_1 male_2 male_3 male_4 male_5
female_1 1 1 3 2 3
.
.
In reality, my dataset is not symmetrical (I have 47 females and 32 males).
Thanks for any help!!!
Set up an object to receive results:
xy <- matrix(NA, nrow(females), nrow(males))
for ( x in 1:nrow(females) ){
for(y in 1:nrow(males) ){
xy[x,y] <- sum(females[x, 1:ncol(females)] != males[y,1:ncol(males)])}}
This could also have been done with nested sapply calls, which is a bit cleaner since there is no need for a separate setup step (but only a little bit cleaner, and contrary to popular myth not any faster). The outer sapply returns one column per female, so the result is transposed with t() to put females in rows:
xy <- t(sapply( 1:nrow(females) ,
function(x) sapply( 1:nrow(males) ,
function(y) sum( females[x, 1:ncol(females)] != males[y,1:ncol(males)]) )))
xy
#-----
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 3 2 3
[2,] 3 1 5 2 5
[3,] 2 4 0 3 0
[4,] 1 1 3 0 3
[5,] 1 3 1 2 1
dimnames(xy) <- list( rownames(females), rownames(males) )
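Since the question starts from the matrix product, it may help to see that the mismatch count can be written with two products as well. This is a sketch that assumes the data are 0/1 matrices (the data frames in the question would need as.matrix first); a cell differs when it is 1 in the female and 0 in the male, or the reverse:

```r
# Example data from the question, stored directly as 0/1 matrices
females <- matrix(c(0,0,0,1,1, 0,1,0,1,0, 1,0,1,0,1, 1,1,0,1,1, 1,0,1,1,1),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(paste0("female_", 1:5), NULL))
males <- matrix(c(1,0,0,1,1, 0,1,0,1,1, 1,0,1,0,1, 1,1,0,1,1, 1,0,1,0,1),
                nrow = 5, byrow = TRUE,
                dimnames = list(paste0("male_", 1:5), NULL))

# (1 in female, 0 in male) + (0 in female, 1 in male), for every pair of rows
mismatch <- females %*% t(1 - males) + (1 - females) %*% t(males)
mismatch["female_1", ]  # 1 1 3 2 3, the row requested in the question
```

Nothing here requires the same number of females and males, so it should carry over to the 47 x 32 case.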
inverseCross <- function(females, males){
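inverse <- data.frame(a=integer(), b=integer(), c=integer(), d=integer(), e=integer()) # one column per male; assumes exactly 5 males
for(i in 1:nrow(females)){
tempRow <- NULL # reset for every female, otherwise the row keeps growing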
for(j in 1:nrow(males)){
sum <- 0
for(k in 1: ncol(males)){
if(females[i,k] != males[j,k]){
sum <- sum + 1
}
}
tempRow <- c(tempRow, sum)
}
inverse[i,] <- tempRow
}
colnames(inverse) <- rownames(males)
rownames(inverse) <- rownames(females)
inverse
}
I am having trouble finding a vectorized form of a specific loop in R. My objective is to improve the performance of the loop, because it has to be run thousands of times in the algorithm.
I want to find the position of the lowest value in a particular array section defined by a vector 'Level' for each row.
Example:
Level = c(2,3)
Let first row of array X be: c(2, -1, 3, 0.5, 4).
Searching for the position of the lowest value in the range 1:Level[1] of the row (that is (2, -1)), I get a 2, because -1 < 2 and -1 stands on second position of the row. Then, searching the position of the lowest value in the second range (Level[1]+1):(Level[1]+Level[2]) (that is (3, 0.5, 4)), I get a 4, because 0.5 < 3 < 4 and 0.5 stands on fourth position of the row.
I have to perform this over each row in the array.
My solution to the problem works as follows:
Level = c(2,3,3) #elements per section, here: 3 sections with 2,3 and 3 levels
rows = 10 #number of rows in array X
X = matrix(runif(rows*sum(Level),-5,5),rows,sum(Level)) #array with 10 rows and sum(Level) columns, here: 8
Position_min = matrix(0,rows,length(Level)) #array in which the position of minimum values for each section and row are stored
for(i in 1:rows){
for(j in 1:length(Level)){ #length(Level) is number of intervals, here: 3
if(j == 1){coeff=0}else{coeff=1}
Position_min[i,j] = coeff*sum(Level[1:(j-1)]) + which(X[i,(coeff*sum(Level[1:(j-1)])+1):sum(Level[1:j])] == min(X[i,(coeff*sum(Level[1:(j-1)])+1):sum(Level[1:j])]))
}
}
It works fine but I would prefer a solution with better performance. Any ideas?
This will remove the outer level of the loop:
Level1=c(0,cumsum(Level))
for(j in 1:(length(Level1)-1)){
Position_min[,j]=max.col(-X[,(Level1[j]+1):Level1[j+1]])+(Level1[j])
}
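The max.col idea can be packaged as a function. The sketch below makes two small assumptions beyond the answer: it passes drop = FALSE so that a section with a single column still works, and it uses ties.method = "first" so that ties are resolved deterministically (the default for max.col is random). The name section_min_pos is invented for this example:

```r
# Hypothetical helper: per-row position of the minimum within each section.
# level[j] is the number of columns in section j.
section_min_pos <- function(X, level) {
  ends <- cumsum(level)
  starts <- ends - level + 1
  out <- matrix(0, nrow(X), length(level))
  for (j in seq_along(level)) {
    # drop = FALSE keeps a one-column section as a matrix
    block <- -X[, starts[j]:ends[j], drop = FALSE]
    # The row-wise maximum of -X is the row-wise minimum of X
    out[, j] <- max.col(block, ties.method = "first") + starts[j] - 1
  }
  out
}

set.seed(1)
Level <- c(2, 3, 3)
X <- matrix(runif(10 * sum(Level), -5, 5), 10, sum(Level))
section_min_pos(X, Level)
```

The loop runs once per section rather than once per row, so its cost should stay small as long as there are far fewer sections than rows.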
Here is a "fully vectorized" solution with no explicit loops:
findmins <- function(x, level) {
series <- rep(seq_along(level), level) # use the level argument, not the global Level
x <- split(x, factor(series))
minsSplit <- as.numeric(sapply(x, which.min))
minsSplit + c(0, cumsum(level[-length(level)]))
}
Position_min_vectorized <- t(apply(X, 1, findmins, Level))
identical(Position_min, Position_min_vectorized)
## [1] TRUE
You may get better performance by turning your matrix into a list of rows and then using parallel's mclapply() (which relies on forking, so it offers no parallelism on Windows):
X_list <- split(X, factor(1:nrow(X)))
do.call(rbind, parallel::mclapply(X_list, findmins, Level))
## [,1] [,2] [,3]
## 1 1 5 6
## 2 2 3 6
## 3 1 4 7
## 4 1 5 6
## 5 2 5 7
## 6 2 4 6
## 7 1 5 8
## 8 1 5 8
## 9 1 3 8
## 10 1 3 8