How to make data randomization faster in R? - r

I have a very big data set and I'm computing thousands of models on it. For every model I need to randomize my data 100 times. This randomization step makes my script very slow.
Could someone help me make this step faster?
Here is my code:
for (l in seq(repeat.times)) {
  y <- as.matrix(dfr[1])
  x <- as.matrix(dfr[2:ncol(dfr)])
  # Random Generation
  x.random.name <- sample(colnames(x), 1, replace = FALSE)
  x.random.1 <- sample(x[, x.random.name], nrow(y), replace = FALSE)
  x <- cbind(x, x.random.1)
  ...
For example:
> x
A B C D E
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
> y
[,1]
[1,] 10
[2,] 20
[3,] 30
[4,] 40
After randomization:
> x
A B C D E x.random.1
[1,] 1 5 9 13 17 10
[2,] 2 6 10 14 18 12
[3,] 3 7 11 15 19 9
[4,] 4 8 12 16 20 11

This is much faster, if I understand the OP's requirement correctly:
x
## A B C D E
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
y
## [,1]
## [1,] 10
## [2,] 20
## [3,] 30
## [4,] 40
xncol <- ncol(x)
ynrow <- nrow(y)
require(microbenchmark)
microbenchmark(xrand <- sapply(1:100, FUN = function(iter) {
sample(x[, sample(1:xncol, 1)], ynrow)
}), times = 1L)
## Unit: milliseconds
## expr min
## xrand <- sapply(1:100, FUN = function(iter) { sample(x[, sample(1:xncol, 1)], ynrow) }) 1.083906
## lq median uq max neval
## 1.083906 1.083906 1.083906 1.083906 1
x <- cbind(x, xrand)
x
## A B C D E
## [1,] 1 5 9 13 17 8 16 2 18 5 3 10 10 14 9 19 6 6 15 18 2 13 13 15 18 7 20 17 11 13 1 16 1 20 1 9 19 14 20
## [2,] 2 6 10 14 18 7 14 3 20 8 4 12 9 13 10 20 8 8 13 20 1 14 15 16 20 6 19 19 10 16 2 15 4 17 4 12 20 15 19
## [3,] 3 7 11 15 19 5 15 1 19 7 2 11 12 15 11 18 7 7 14 17 4 15 16 14 19 8 17 18 9 14 4 14 2 18 3 11 18 16 17
## [4,] 4 8 12 16 20 6 13 4 17 6 1 9 11 16 12 17 5 5 16 19 3 16 14 13 17 5 18 20 12 15 3 13 3 19 2 10 17 13 18
##
## [1,] 5 13 2 3 5 2 5 8 4 6 19 3 7 19 4 7 6 4 17 9 18 9 5 3 1 15 8 19 19 3 19 15 15 1 1 10 15 19 11 6 5 17 7
## [2,] 7 15 1 1 7 1 6 6 3 8 18 2 6 17 2 6 5 3 18 10 17 11 8 1 3 13 6 17 18 4 17 16 13 4 3 11 16 18 9 8 8 18 6
## [3,] 8 14 3 2 8 3 8 7 2 7 20 1 8 18 3 8 8 1 20 12 19 10 6 2 2 16 5 20 17 2 18 13 16 3 4 12 13 20 12 7 7 20 8
## [4,] 6 16 4 4 6 4 7 5 1 5 17 4 5 20 1 5 7 2 19 11 20 12 7 4 4 14 7 18 20 1 20 14 14 2 2 9 14 17 10 5 6 19 5
##
## [1,] 3 3 15 19 2 12 16 11 18 7 10 11 5 12 12 10 1 2 19 2 16 17 11
## [2,] 4 2 13 20 1 11 15 12 17 5 11 12 6 10 9 11 4 3 18 3 14 19 9
## [3,] 1 4 16 18 4 10 14 9 19 8 12 9 8 11 11 9 3 4 20 4 13 20 12
## [4,] 2 1 14 17 3 9 13 10 20 6 9 10 7 9 10 12 2 1 17 1 15 18 10
The key step, of course, is the line below, which I have wrapped in microbenchmark purely for benchmarking purposes:
xrand <- sapply(1:100, FUN = function(iter) { sample(x[, sample(1:xncol, 1)], ynrow) })

Here is a one-liner:
# Data
x<-matrix(1:10^4,nrow=10)
# Generate 2000 replicates.
replicate(2000,x[order(runif(nrow(x))),sample(ncol(x),1)])
Or even just:
replicate(2000,sample(x[,sample(ncol(x),1)]))
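If the permuted columns are needed alongside the original data, the result of replicate can be bound on directly. A small sketch using the same example x as above (the names xrand and x2 are made up here):

```r
# Example data: a 10 x 1000 matrix, as above
x <- matrix(1:10^4, nrow = 10)

# Generate 100 permuted columns in one call, then attach them to x
xrand <- replicate(100, sample(x[, sample(ncol(x), 1)]))
x2 <- cbind(x, xrand)

dim(x2)  # 10 rows, 1000 original columns + 100 random columns
```

Each column of xrand is a permutation of one randomly chosen column of x, so the combined matrix can feed straight into the modelling loop.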

I found that you could dramatically reduce the runtime by moving the construction of x and y outside the loop. Then you only build the new augmented matrix inside the loop:
y <- as.matrix(dfr[1])
XX <- as.matrix(dfr[2:ncol(dfr)])
for (l in seq(repeat.times)) {
  # Random Generation (note: sample from XX, which now holds the predictors)
  x.random.name <- sample(colnames(XX), 1, replace = FALSE)
  x.random.1 <- sample(XX[, x.random.name], nrow(y), replace = FALSE)
  x <- cbind(XX, x.random.1)
}
So I've moved x out of the loop and renamed it XX; in your analysis you would then continue to use the newly built x. In my benchmark this sped things up by nearly two orders of magnitude.

Related

Error in R function rmcorr: Error in psych::r.con(rmcorrvalue, errordf, p = CI.level) : number of subjects must be greater than 3

I'm trying to do a repeated measures correlation in R using rmcorr, but received the above error, even though I have more than 3 subjects.
> scores$SUBJECT
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
[36] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[71] 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5
[106] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
[141] 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8
[176] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
[211] 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11
[246] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
[281] 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 14 14 14
[316] 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 15 15 15 15
[351] 15 15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 17
[386] 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18 18 18 18 18 18 18
[421] 18 18 18 18 18 18 18 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
[456] 19 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21 21 21
[491] 21 21 21 21 21 21 21 21 21 21 21 21 21 21
# Convert data types
scores$SUBJECT<-factor(scores$SUBJECT)
scores$FACTOR1<-factor(scores$FACTOR1)
scores$FACTOR2<-factor(scores$FACTOR2)
Interestingly, I was able to perform the correlation on some subsets of the data but not others.
# SUBSETS
subset1 <- subset(scores, FACTOR1 == "m1")
subset1a <- subset(subset1, FACTOR2 == "a")
subset1b <- subset(subset1, FACTOR2 == "b")
subset1c <- subset(subset1, FACTOR2 == "c")
subset2 <- subset(scores, FACTOR1 == "mp")
subset2a <- subset(subset2, FACTOR2 == "a")
subset2b <- subset(subset2, FACTOR2 == "b")
subset2c <- subset(subset2, FACTOR2 == "c")
rmcorr(participant = subset1$SUBJECT, measure1 = subset1$SCORE, measure2 = subset2$SCORE, dataset = scores)
rmcorr(participant = subset1a$SUBJECT, measure1 = subset1a$SCORE, measure2 = subset2a$SCORE, dataset = scores)
rmcorr(participant = subset1b$SUBJECT, measure1 = subset1b$SCORE, measure2 = subset2b$SCORE, dataset = scores)
rmcorr(participant = subset1c$SUBJECT, measure1 = subset1c$SCORE, measure2 = subset2c$SCORE, dataset = scores)
Specifically
rmcorr(participant = subset1$SUBJECT, measure1 = subset1$SCORE, measure2 = subset2$SCORE, dataset = scores)
worked, but all of the other calls to rmcorr generated the error. Does anyone know where I went wrong?
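One way to narrow this down (a sketch, not a confirmed diagnosis): rmcorr needs enough subjects contributing complete, paired measurements, and the calls above pass measure1 and measure2 from two *different* subsets, so the vectors may not even line up row by row. Checking the lengths and the overlap of subjects between the two subsets is a cheap first test. The scores data frame below is a hypothetical stand-in built purely for illustration:

```r
# Hypothetical stand-in for the real data: 4 subjects, unequal rows per level
scores <- data.frame(
  SUBJECT = c(1, 1, 2, 2, 3, 3, 4),
  FACTOR1 = c("m1", "mp", "m1", "mp", "m1", "mp", "m1"),
  SCORE   = c(10, 12, 9, 11, 8, 13, 7)
)

subset1 <- subset(scores, FACTOR1 == "m1")
subset2 <- subset(scores, FACTOR1 == "mp")

# If the two subsets differ in length, measure1 and measure2
# cannot be paired element-wise at all
nrow(subset1) == nrow(subset2)

# Subjects present in both subsets -- only these can contribute pairs
common <- intersect(subset1$SUBJECT, subset2$SUBJECT)
length(common)
```

If length(common) drops to 3 or fewer in a given subset pair, an error like the one above would be consistent with that.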

Is there any method to sort the matrix by both column and row in R?

Could you help me? I have a matrix in which the first column and row are the IDs, and I need to sort it by both column and row ID.
Thanks!
Two thoughts:
mat <- matrix(1:25, nr=5, dimnames=list(c('4',3,5,2,1), c('4',3,5,2,1)))
mat
# 4 3 5 2 1
# 4 1 6 11 16 21
# 3 2 7 12 17 22
# 5 3 8 13 18 23
# 2 4 9 14 19 24
# 1 5 10 15 20 25
If you want a strictly alphabetic ordering, then this will work:
mat[order(rownames(mat)),order(colnames(mat))]
# 1 2 3 4 5
# 1 25 20 10 5 15
# 2 24 19 9 4 14
# 3 22 17 7 2 12
# 4 21 16 6 1 11
# 5 23 18 8 3 13
This will not work well if the names are intended to be ordered numerically:
mat <- matrix(1:30, nr=3, dimnames=list(c('2',1,3), c('4',3,5,2,1,6,7,8,9,10)))
mat
# 4 3 5 2 1 6 7 8 9 10
# 2 1 4 7 10 13 16 19 22 25 28
# 1 2 5 8 11 14 17 20 23 26 29
# 3 3 6 9 12 15 18 21 24 27 30
mat[order(rownames(mat)),order(colnames(mat))]
# 1 10 2 3 4 5 6 7 8 9
# 1 14 29 11 5 2 8 17 20 23 26
# 2 13 28 10 4 1 7 16 19 22 25
# 3 15 30 12 6 3 9 18 21 24 27
Note the column order is alphabetic: (1, 10, 2, ...). For numeric ordering, you need a slight modification:
mat[order(as.numeric(rownames(mat))),order(as.numeric(colnames(mat)))]
# 1 2 3 4 5 6 7 8 9 10
# 1 14 11 5 2 8 17 20 23 26 29
# 2 13 10 4 1 7 16 19 22 25 28
# 3 15 12 6 3 9 18 21 24 27 30
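If you want one helper that covers both cases, a small sketch (the function name sort_by_id is made up here) can try a numeric interpretation of the dimnames first and fall back to alphabetic order:

```r
# Sort a matrix by its row and column names, numerically when possible
sort_by_id <- function(m) {
  ord <- function(nm) {
    num <- suppressWarnings(as.numeric(nm))
    if (anyNA(num)) order(nm) else order(num)  # fall back to alphabetic
  }
  m[ord(rownames(m)), ord(colnames(m)), drop = FALSE]
}

mat <- matrix(1:30, nrow = 3,
              dimnames = list(c('2', 1, 3), c('4', 3, 5, 2, 1, 6, 7, 8, 9, 10)))
sort_by_id(mat)  # columns come out 1, 2, ..., 10
```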

How to find all pairs of two lists, and categorize them without repetitions?

We are preparing a program in which 18 people discuss topics in pairs, switching partners each round until everyone has talked to everyone. That means 153 discussions: 9 pairs talking in parallel in each round, for 17 rounds. I tried to formulate a matrix showing who should talk to whom in order to avoid chaos, but could not succeed. For the sake of simplicity everyone is given a number, so the bottom line is: I need all pairs of combinations of the numbers from 1 to 18 (I did that with the combn function), but then these pairs should be arranged into 17 rounds so that each number appears only once per round. Any ideas?
Let's first look at a simpler problem with 6 persons. The following matrix lists who (rows) is talking to whom (columns) in which round (entry):
So for example in round 1 (yellow) we have the following pairs:
(1-2), (3-5), (4-6)
For round 2 (green) we would have:
(1-3), (2-6), (4-5)
and so on.
Thus, basically we are looking for a symmetric Latin square (i.e. in each row and in each column each entry appears only once; cf. Latin squares on Wikipedia).
The latin square in the box can be easily generated via an addition table:
inner_ls <- function(k) {
res <- outer(0:(k-1), 0:(k-1), function(i, j) (i + j) %% k)
## replace zeros by k
res[res == 0] <- k
## replace diagonal by NA
diag(res) <- NA
res
}
inner_ls(5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 1 2 3 4
# [2,] 1 NA 3 4 5
# [3,] 2 3 NA 5 1
# [4,] 3 4 5 NA 2
# [5,] 4 5 1 2 NA
So all that is left is to append the last row (and column) with the missing round number:
full_ls <- function(k) {
i_ls <- inner_ls(k - 1)
last_row <- apply(i_ls, 1, function(row) {
rounds <- 1:(k - 1)
rounds[!rounds %in% row]
})
res <- cbind(rbind(i_ls, last_row), c(last_row, NA))
rownames(res) <- colnames(res) <- 1:k
res
}
full_ls(6)
# 1 2 3 4 5 6
# 1 NA 1 2 3 4 5
# 2 1 NA 3 4 5 2
# 3 2 3 NA 5 1 4
# 4 3 4 5 NA 2 1
# 5 4 5 1 2 NA 3
# 6 5 2 4 1 3 NA
With that you get your assignment matrix as follows:
full_ls(18)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# 1 NA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# 2 1 NA 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 2
# 3 2 3 NA 5 6 7 8 9 10 11 12 13 14 15 16 17 1 4
# 4 3 4 5 NA 7 8 9 10 11 12 13 14 15 16 17 1 2 6
# 5 4 5 6 7 NA 9 10 11 12 13 14 15 16 17 1 2 3 8
# 6 5 6 7 8 9 NA 11 12 13 14 15 16 17 1 2 3 4 10
# 7 6 7 8 9 10 11 NA 13 14 15 16 17 1 2 3 4 5 12
# 8 7 8 9 10 11 12 13 NA 15 16 17 1 2 3 4 5 6 14
# 9 8 9 10 11 12 13 14 15 NA 17 1 2 3 4 5 6 7 16
# 10 9 10 11 12 13 14 15 16 17 NA 2 3 4 5 6 7 8 1
# 11 10 11 12 13 14 15 16 17 1 2 NA 4 5 6 7 8 9 3
# 12 11 12 13 14 15 16 17 1 2 3 4 NA 6 7 8 9 10 5
# 13 12 13 14 15 16 17 1 2 3 4 5 6 NA 8 9 10 11 7
# 14 13 14 15 16 17 1 2 3 4 5 6 7 8 NA 10 11 12 9
# 15 14 15 16 17 1 2 3 4 5 6 7 8 9 10 NA 12 13 11
# 16 15 16 17 1 2 3 4 5 6 7 8 9 10 11 12 NA 14 13
# 17 16 17 1 2 3 4 5 6 7 8 9 10 11 12 13 14 NA 15
# 18 17 2 4 6 8 10 12 14 16 1 3 5 7 9 11 13 15 NA
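To read the pairs for a given round back out of the schedule matrix, one small sketch (the helper name pairs_for_round is made up; inner_ls and full_ls are repeated from above so the snippet is self-contained):

```r
# Definitions from above
inner_ls <- function(k) {
  res <- outer(0:(k - 1), 0:(k - 1), function(i, j) (i + j) %% k)
  res[res == 0] <- k
  diag(res) <- NA
  res
}

full_ls <- function(k) {
  i_ls <- inner_ls(k - 1)
  last_row <- apply(i_ls, 1, function(row) {
    rounds <- 1:(k - 1)
    rounds[!rounds %in% row]
  })
  res <- cbind(rbind(i_ls, last_row), c(last_row, NA))
  rownames(res) <- colnames(res) <- 1:k
  res
}

# Extract the discussion pairs for round r from the schedule matrix
pairs_for_round <- function(m, r) {
  idx <- which(m == r, arr.ind = TRUE)
  idx[idx[, 1] < idx[, 2], , drop = FALSE]  # keep each unordered pair once
}

pairs_for_round(full_ls(6), 1)  # rows (1,2), (3,5), (4,6), as in the example
```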

R Generate a vector with increasing and then decreasing elements

How do I generate a vector in the form
1 2 ... 19 20 19 ... 2 1
Is it possible using the c() function?
You can use seq as well as the rev function for this purpose.
> c(1:20, seq(19,1,-1))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
As suggested by @jimbou,
> c(1:20, 19:1)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
> c(1:20, rev(1:19))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
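Another compact option (a sketch): pmin of an ascending and a descending sequence peaks in the middle, which gives the same up-then-down shape in one expression:

```r
# min(i, 40 - i) rises from 1 to 20 at position 20, then falls back to 1
pmin(1:39, 39:1)
```

In general, for a peak of n, use pmin(1:(2*n - 1), (2*n - 1):1).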

Changing every set of 5 rows in R

I have a dataframe that looks like this:
df <- list()
df$a <- 1:20
df$b <- 2:21
df$c <- 3:22
df <- as.data.frame(df)
> df
a b c
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
6 6 7 8
7 7 8 9
8 8 9 10
9 9 10 11
10 10 11 12
11 11 12 13
12 12 13 14
13 13 14 15
14 14 15 16
15 15 16 17
16 16 17 18
17 17 18 19
18 18 19 20
19 19 20 21
20 20 21 22
I would like to add another column to the data frame (df$d) so that each block of 5 rows takes the value of df$a at the start of that block (i.e. df$a[seq(1, nrow(df), 5)]).
I have tried the manual way, but was wondering whether a for loop or a shorter method could do this easily. I'm new to R, so I apologize if this seems trivial to some people.
"Manual" way:
df$d[1:5] <- df$a[1]
df$d[6:10] <- df$a[6]
df$d[11:15] <- df$a[11]
df$d[16:20] <- df$a[16]
>df
a b c d
1 1 2 3 1
2 2 3 4 1
3 3 4 5 1
4 4 5 6 1
5 5 6 7 1
6 6 7 8 6
7 7 8 9 6
8 8 9 10 6
9 9 10 11 6
10 10 11 12 6
11 11 12 13 11
12 12 13 14 11
13 13 14 15 11
14 14 15 16 11
15 15 16 17 11
16 16 17 18 16
17 17 18 19 16
18 18 19 20 16
19 19 20 21 16
20 20 21 22 16
I have tried
for (i in 1:nrow(df))
{df$d[i:(i+4)] <- df$a[seq(1, nrow(df), 4)]}
But this is not going the way I want it to. What am I doing wrong?
This should work:
df$d <- rep(df$a[seq(1,nrow(df),5)],each=5)
And here's a data.table solution:
library(data.table)
dt = data.table(df)
dt[, d := a[1], by = (seq_len(nrow(dt))-1) %/% 5]
I'd use logical indexing after initializing to NA
df$d <- NA
df$d <- rep(df$a[ c(TRUE, rep(FALSE,4)) ], each=5)
df
#--------
a b c d
1 1 2 3 1
2 2 3 4 1
3 3 4 5 1
4 4 5 6 1
5 5 6 7 1
6 6 7 8 6
7 7 8 9 6
8 8 9 10 6
9 9 10 11 6
10 10 11 12 6
11 11 12 13 11
12 12 13 14 11
13 13 14 15 11
14 14 15 16 11
15 15 16 17 11
16 16 17 18 16
17 17 18 19 16
18 18 19 20 16
19 19 20 21 16
20 20 21 22 16
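For completeness, a base-R sketch that computes each row's block-start index directly (this also works when nrow(df) is not a multiple of 5, since the last partial block simply reuses its own first row):

```r
df <- data.frame(a = 1:20, b = 2:21, c = 3:22)

# Map row i to the first row of its 5-row block: 1,1,1,1,1,6,6,...
block_start <- ((seq_len(nrow(df)) - 1) %/% 5) * 5 + 1
df$d <- df$a[block_start]

head(df$d, 7)  # 1 1 1 1 1 6 6
```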
