In R, how do I locally shuffle a vector's elements - r

I have the following vector in R. Think of them as a vector of numbers.
x = c(1,2,3,4,...100)
I want to randomize this vector "locally" based on some input number the "locality factor". For example if the locality factor is 3, then the first 3 elements are taken and randomized followed by the next 3 elements and so on. Is there an efficient way to do this? I know if I use sample, it would jumble up the whole array.
Thanks in advance

Arun didn't like how inefficient my other answer was, so here's something very fast just for him ;)
It requires just one call each to runif() and order(), and doesn't use sample() at all.
x <- 1:100
k <- 3
n <- length(x)
x[order(rep(seq_len(ceiling(n/k)), each=k, length.out=n) + runif(n))]
# [1] 3 1 2 6 5 4 8 9 7 11 12 10 13 14 15 18 16 17
# [19] 20 19 21 23 22 24 27 25 26 29 28 30 33 31 32 36 34 35
# [37] 37 38 39 40 41 42 43 44 45 47 48 46 51 49 50 52 54 53
# [55] 55 57 56 58 60 59 62 63 61 66 64 65 68 67 69 71 70 72
# [73] 75 74 73 76 77 78 81 80 79 84 82 83 86 85 87 89 88 90
# [91] 93 92 91 94 96 95 97 98 99 100
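To see why a single order() call is enough, it may help to look at the sort keys: the integer part is the block number and the runif() draw adds a fraction strictly between 0 and 1, so sorting can only rearrange elements inside a block. A minimal sketch, wrapping the one-liner in a hypothetical helper called local_shuffle (the name and the check at the end are illustrative, not part of the answer):

```r
# Hypothetical wrapper around the one-liner above.
local_shuffle <- function(x, k) {
  n <- length(x)
  # Block id for each position: 1,1,1,2,2,2,... truncated to length n.
  block <- rep(seq_len(ceiling(n / k)), each = k, length.out = n)
  # The uniform jitter reorders elements inside a block but never
  # moves an element across a block boundary.
  x[order(block + runif(n))]
}

set.seed(1)
x <- 1:10
y <- local_shuffle(x, 3)

# Each block of the result holds the same values as the same block of x.
block <- (seq_along(x) - 1) %/% 3
same_blocks <- mapply(setequal, split(x, block), split(y, block))
stopifnot(all(same_blocks))
```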

General solution:
Edit: As @MatthewLundberg comments, the issue I pointed out with "repeating numbers in x" can be easily overcome by working on seq_along(x), which means the shuffled values are indices into x. So, it'd be like so:
k <- 3
x <- c(2,2,1, 1,3,4, 4,6,5, 3)
x.s <- seq_along(x)
y <- sample(x.s)
x[unlist(split(y, (match(y, x.s)-1) %/% k), use.names = FALSE)]
# [1] 2 2 1 3 4 1 4 5 6 3
Old answer:
The bottleneck here is the number of calls to sample(). As long as your numbers don't repeat, I think you can do this with just one call to sample() in this manner:
k <- 3
x <- 1:20
y <- sample(x)
unlist(split(y, (match(y,x)-1) %/% k), use.names = FALSE)
# [1] 1 3 2 5 6 4 8 9 7 12 10 11 13 14 15 17 16 18 19 20
To put everything together in a function (I like the name scramble from @Roland's answer):
scramble <- function(x, k=3) {
x.s <- seq_along(x)
y.s <- sample(x.s)
idx <- unlist(split(y.s, (match(y.s, x.s)-1) %/% k), use.names = FALSE)
x[idx]
}
scramble(x, 3)
# [1] 2 1 2 3 4 1 5 4 6 3
scramble(x, 3)
# [1] 1 2 2 1 4 3 6 5 4 3
To shorten the answer (and make it faster) even more, following @flodel's comment:
scramble <- function(x, k=3L) {
x.s <- seq_along(x)
y.s <- sample(x.s)
x[unlist(split(x.s[y.s], (y.s-1) %/% k), use.names = FALSE)]
}
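A quick sanity check of this last version, assuming the scramble() definition just above is copied unchanged. The input with repeated values and the comparison are illustrative only:

```r
# scramble() copied from the answer above.
scramble <- function(x, k = 3L) {
  x.s <- seq_along(x)
  y.s <- sample(x.s)
  x[unlist(split(x.s[y.s], (y.s - 1) %/% k), use.names = FALSE)]
}

set.seed(123)
x <- c(2, 2, 1, 1, 3, 4, 4, 6, 5, 3)  # repeated values on purpose
y <- scramble(x, 3)

# Sorting inside each block should give identical results for x and y,
# because only the order within a block is allowed to change.
block <- (seq_along(x) - 1) %/% 3
sorted_x <- unlist(lapply(split(x, block), sort), use.names = FALSE)
sorted_y <- unlist(lapply(split(y, block), sort), use.names = FALSE)
stopifnot(identical(sorted_x, sorted_y))
```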

For the record, the boot package (a recommended package that ships with standard R installations) includes an internal function permutation.array(), accessed here with :::, that is used for just this purpose:
x <- 1:100
k <- 3
ii <- boot:::permutation.array(n = length(x),
R = 2,
strata = (seq_along(x) - 1) %/% k)[1,]
x[ii]
# [1] 2 1 3 6 5 4 9 7 8 12 11 10 15 13 14 16 18 17
# [19] 21 19 20 23 22 24 26 27 25 28 29 30 33 31 32 36 35 34
# [37] 38 39 37 41 40 42 43 44 45 46 47 48 51 50 49 53 52 54
# [55] 57 55 56 59 60 58 63 61 62 65 66 64 67 69 68 72 71 70
# [73] 75 73 74 76 77 78 79 80 81 82 83 84 86 87 85 89 88 90
# [91] 93 91 92 94 95 96 97 98 99 100

This will drop elements at the end (with a warning):
locality <- 3
x <- 1:100
c(apply(matrix(x, nrow=locality, ncol=length(x) %/% locality), 2, sample))
## [1] 1 2 3 4 6 5 8 9 7 12 10 11 13 15 14 16 18 17 19 20 21 22 24 23 26 25 27 28 30 29 32 33 31 35 34 36 38 39 37
## [40] 42 40 41 43 44 45 47 48 46 51 49 50 54 52 53 55 57 56 58 59 60 62 61 63 64 65 66 67 69 68 71 72 70 74 75 73 78 77 76
## [79] 80 81 79 83 82 84 87 85 86 88 89 90 92 93 91 96 94 95 99 98 97
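If the trailing elements should be kept rather than dropped, one option is to shuffle the complete blocks through the matrix and handle the remainder separately. A sketch under that assumption, where shuffle_blocks and perm are hypothetical names:

```r
shuffle_blocks <- function(x, locality) {
  n_full <- (length(x) %/% locality) * locality  # elements in complete blocks
  head_part <- x[seq_len(n_full)]
  tail_part <- x[seq_len(length(x) - n_full) + n_full]  # leftover, may be empty

  # sample.int() permutes positions, so this stays safe when a block
  # holds a single number (sample(5) would draw from 1:5 instead).
  perm <- function(v) v[sample.int(length(v))]

  shuffled <- if (n_full > 0) {
    c(apply(matrix(head_part, nrow = locality), 2, perm))
  } else {
    head_part
  }
  c(shuffled, perm(tail_part))
}

set.seed(7)
res <- shuffle_blocks(1:100, 3)
length(res)  # 100: nothing is dropped
```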

v <- 1:16
scramble <- function(vec,n) {
res <- tapply(vec,(seq_along(vec)+n-1)%/%n,
FUN=function(x) x[sample.int(length(x), size=length(x))])
unname(unlist(res))
}
set.seed(42)
scramble(v,3)
#[1] 3 2 1 6 5 4 9 7 8 12 10 11 15 13 14 16
scramble(v,4)
#[1] 2 3 1 4 5 8 6 7 10 12 9 11 14 15 16 13

I like Matthew's approach way better but here was the way I did the problem:
x <- 1:100
fact <- 3
y <- ceiling(length(x)/fact)
unlist(lapply(split(x, rep(1:y, each =fact)[1:length(x)]), function(x){
if (length(x)==1) return(x)
sample(x)
}), use.names = FALSE)
## [1] 3 1 2 6 4 5 8 9 7 11 10 12 13 15 14 17 16 18
## [19] 20 21 19 24 23 22 26 27 25 29 30 28 31 32 33 35 34 36
## [37] 39 37 38 41 42 40 45 43 44 47 46 48 51 49 50 52 53 54
## [55] 57 56 55 59 60 58 63 62 61 64 66 65 67 68 69 70 71 72
## [73] 75 73 74 77 76 78 80 79 81 82 84 83 85 86 87 90 89 88
## [91] 92 91 93 96 94 95 98 99 97 100

Related

What should I do when the if statement in a for loop is false and I don't wanna take any operation for the false and then directly test next value?

I wanna find multiples of 2 between 0 and 100 and save these multiples in a vector.
This is my code:
i <- c(0:100)
a <- c()
for (value in i) {
if (i %% 2 == 0) {
a[i+1] <- i
}
}
#> Warning in if (i%%2 == 0) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (i%%2 == 0) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (i%%2 == 0) {: the condition has length > 1 and only the first
#> element will be used
...
print(a)
#> [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
#> [19] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
#> [37] 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
#> [55] 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
#> [73] 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
#> [91] 90 91 92 93 94 95 96 97 98 99 100
Created on 2020-06-12 by the reprex package (v0.3.0)
The result that I expected should be "0,2,4,6,8,10,12...".
Where am I wrong?
Based on the way 'a' is initialized (i.e. as a NULL vector), we can concatenate 'value' onto it whenever the condition is satisfied
a <- c()
for(value in i) if(value %%2 == 0) a <- c(a, value)
a
#[1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66
#[35] 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100
In the OP's code, the condition inside if is evaluated with the whole vector i instead of 'value', which results in the warning message because if/else expects a single TRUE/FALSE element
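The difference between the two conditions can be seen from their lengths. The sketch below also shows a pre-allocated variant of the loop (the names keep and j are illustrative) in which nothing needs to happen for the FALSE case. Note that from R 4.2.0 onward a condition of length greater than one is an error rather than a warning:

```r
i <- 0:100
length(i %% 2 == 0)      # 101: one TRUE/FALSE per element, too many for if()
value <- i[4]
length(value %% 2 == 0)  # 1: a single TRUE/FALSE, which is what if() expects

# Pre-allocated variant: record a logical flag per element and subset once
# at the end. No else branch is needed; a FALSE condition just moves on.
keep <- logical(length(i))
for (j in seq_along(i)) {
  if (i[j] %% 2 == 0) keep[j] <- TRUE
}
a <- i[keep]
```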
This can be done without a loop in R as these are vectorized operations
i[!i %% 2]
Instead of checking every value of i, why not generate a sequence with a step of 2?
i <- 0:100
seq(min(i), max(i), 2)
# [1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
#[19] 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70
#[37] 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100

How to cut the values in a regular interval and define them into the separate group? [duplicate]

This question already has answers here:
Split a vector into chunks
(22 answers)
Closed 3 years ago.
How to cut the values (1 to 100) in a regular interval (25) and place them into 4 groups as below:
sdr <- c(1:100)
Group1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Group2: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Group3: 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
Group4: 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
Any suggestion, please.
You could use split
sdr <- 1:100
split(sdr, rep(1:4, each = 25))
#$`1`
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#
#$`2`
# [1] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#
#$`3`
# [1] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
#
#$`4`
# [1] 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
#[20] 95 96 97 98 99 100
This returns a list with 4 vector elements.
Also note that the c() around 1:100 is not necessary.
Or we can define the number of groups
ngroup <- 4
split(sdr, rep(1:ngroup, each = length(sdr) %/% ngroup))
giving the same result.
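Both calls work cleanly when the vector length is an exact multiple of the group size. When it is not, a common alternative is to build the grouping from the element positions. A sketch, where chunk_size is an illustrative name and the length 103 is chosen only to show the shorter last group:

```r
sdr <- 1:103
chunk_size <- 25

# Positions 1..25 map to group 1, 26..50 to group 2, and so on;
# the last group simply ends up shorter.
groups <- split(sdr, ceiling(seq_along(sdr) / chunk_size))
lengths(groups)
#  1  2  3  4  5
# 25 25 25 25  3
```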
You can make a dataframe for your groups and then transpose using t:
df <- t(data.frame(Group1 = c(1:25), Group2 = c(26:50), Group3 = c(51:75), Group4 = c(76:100)))

Generate sequence with alternating increments in R? [duplicate]

This question already has answers here:
Get a seq() in R with alternating steps
(6 answers)
Closed 6 years ago.
I want to use R to create the sequence of numbers 1:8, 11:18, 21:28, etc. through 1000 (or the closest it can get, i.e. 998). Obviously typing that all out would be tedious, but since the sequence increases by one 7 times and then jumps by 3 I'm not sure what function I could use to achieve this.
I tried seq(1, 998, c(1,1,1,1,1,1,1,3)) but it does not give me the results I am looking for so I must be doing something wrong.
This is a perfect case for vectorisation (and recycling) in R; it is worth reading about both.
(1:100)[rep(c(TRUE,FALSE), c(8,2))]
# [1] 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 21 22 23 24 25 26 27 28 31 32
#[27] 33 34 35 36 37 38 41 42 43 44 45 46 47 48 51 52 53 54 55 56 57 58 61 62 63 64
#[53] 65 66 67 68 71 72 73 74 75 76 77 78 81 82 83 84 85 86 87 88 91 92 93 94 95 96
#[79] 97 98
rep(seq(0,990,by=10), each=8) + seq(1,8)
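The arithmetic version above also relies on recycling: the block starts are repeated 8 times each, and the offsets 1 to 8 are recycled along them. A small check, assuming an upper bound of 1000, that it agrees with filtering 1:1000 directly (the names starts, offsets and s are illustrative):

```r
starts <- rep(seq(0, 990, by = 10), each = 8)  # 0 (8 times), 10 (8 times), ...
offsets <- seq(1, 8)                           # recycled: 1..8, 1..8, ...
s <- starts + offsets

# Same numbers as 1:1000 without those ending in 9 or 0.
x <- 1:1000
stopifnot(all(s == x[!((x %% 10) %in% c(0, 9))]))
range(s)
# [1]   1 998
```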
You want to exclude numbers that are 0 or 9 (mod 10). So you can try this too:
n <- 1000 # upper bound
x <- 1:n
x <- x[! (x %% 10) %in% c(0,9)] # filter out (0, 9) mod (10)
head(x,80)
# [1] 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 21 22 23 24 25 26 27
# 28 31 32 33 34 35 36 37 38 41 42 43 44 45 46 47 48 51 52 53 54 55 56 57
# 58 61 62 63 64 65 66 67 68 71 72 73 74 75 76 77 78 81 82 83 84 85
# 86 87 88 91 92 93 94 95 96 97 98
Or in a single line using Filter:
Filter(function(x) !((x %% 10) %in% c(0,9)), 1:100)
# [1] 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 21 22 23 24 25 26 27 28 31 32 33 34 35 36 37 38 41 42 43 44 45 46 47 48 51 52 53 54 55 56 57
# [48] 58 61 62 63 64 65 66 67 68 71 72 73 74 75 76 77 78 81 82 83 84 85 86 87 88 91 92 93 94 95 96 97 98
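With a loop (the result must be initialized first, otherwise vector refers to the base R function of that name): vector <- c(); for (value in seq(1, 991, 10)) vector <- c(vector, seq(value, value + 7))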

Subsetting Data frame or matrix based on criteria of values

Suppose I have a matrix or a data frame and I want only those values that are greater than 15 and no values between 85 and 90 both inclusive
a<-matrix(1:100,nrow = 10, ncol = 10)
rownames(a) <- LETTERS[1:10]
colnames(a) <- LETTERS[1:10]
A B C D E F G H I J
A 1 11 21 31 41 51 61 71 81 91
B 2 12 22 32 42 52 62 72 82 92
C 3 13 23 33 43 53 63 73 83 93
D 4 14 24 34 44 54 64 74 84 94
E 5 15 25 35 45 55 65 75 85 95
F 6 16 26 36 46 56 66 76 86 96
G 7 17 27 37 47 57 67 77 87 97
H 8 18 28 38 48 58 68 78 88 98
I 9 19 29 39 49 59 69 79 89 99
J 10 20 30 40 50 60 70 80 90 100
Note: You can convert it into dataframe if you know this kind of operation is possible in dataframe
Now I want my result in a format where only the values that are greater than 5 and less than 85 are retained, and everything else is deleted and replaced with a blank space.
My desired output is like below
   A  B  C  D  E  F  G  H  I   J
A    11 21 31 41 51 61 71 81  91
B    12 22 32 42 52 62 72 82  92
C    13 23 33 43 53 63 73 83  93
D    14 24 34 44 54 64 74 84  94
E  5 15 25 35 45 55 65 75 85  95
F  6 16 26 36 46 56 66 76     96
G  7 17 27 37 47 57 67 77     97
H  8 18 28 38 48 58 68 78     98
I  9 19 29 39 49 59 69 79     99
J 10 20 30 40 50 60 70 80    100
Is there any kind of function in R which can take my condition and produce the desired result? I want to be able to change the condition according to the problem. I searched Stack Overflow but didn't find anything like this. I don't want to format based on rows or columns.
I tried
a[a> 5 & a!=c(85:90)]
but this give me values and looses the structure.
Assuming that 'a' is a matrix, we can assign NA to the values of 'a' that are in 86:90 or less than 5 (a < 5). Here, I am not assigning '' because that would change the class from numeric to character. Assigning NA is also more useful for later processing.
a[a %in% 86:90 | a<5] <- NA
However, if we need it to be ''
a[a %in% 86:90 | a<5] <- ""
If we are using a data.frame
a1 <- as.data.frame(a)
a1[] <- lapply(a1, function(x) replace(x, x %in% 86:90| x <5, ""))
a1
#   A  B  C  D  E  F  G  H  I   J
#A    11 21 31 41 51 61 71 81  91
#B    12 22 32 42 52 62 72 82  92
#C    13 23 33 43 53 63 73 83  93
#D    14 24 34 44 54 64 74 84  94
#E  5 15 25 35 45 55 65 75 85  95
#F  6 16 26 36 46 56 66 76     96
#G  7 17 27 37 47 57 67 77     97
#H  8 18 28 38 48 58 68 78     98
#I  9 19 29 39 49 59 69 79     99
#J 10 20 30 40 50 60 70 80    100
NOTE: This changes the class of each column to character
In the OP's code, a!=c(85:90) will not work as intended as the 85:90 will recycle to the length of the 'a' and the comparison will be between the corresponding values in the recycled value and 'a'. Instead, we need to use %in% for a vector with length > 1.
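The recycling effect is easier to see on a small vector. The following sketch uses made-up values (v is illustrative) to contrast != with %in%, and then shows that assigning through a logical index keeps the matrix shape:

```r
v <- c(85, 86, 90, 85)

# != compares element by element; 85:86 is recycled to 85, 86, 85, 86.
v != 85:86
# [1] FALSE FALSE  TRUE  TRUE   (the last 85 is compared with 86)

# %in% asks, for each element, whether it occurs anywhere in the set.
!(v %in% 85:86)
# [1] FALSE FALSE  TRUE FALSE

# Assigning through a logical index leaves the dimensions untouched.
a <- matrix(1:100, nrow = 10)
a[a %in% 86:90 | a < 5] <- NA
dim(a)
# [1] 10 10
```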

Argument "partial" of the sort function in R

?sort states that the partial argument may be NULL or a vector of indices for partial sorting.
I tried:
x <- c(1,3,5,2,4,6,7,9,8,10)
sort(x)
## [1] 1 2 3 4 5 6 7 8 9 10
sort(x, partial=5)
## [1] 1 3 4 2 5 6 7 9 8 10
sort(x, partial=2)
## [1] 1 2 5 3 4 6 7 9 8 10
sort(x, partial=4)
## [1] 1 2 3 4 5 6 7 9 8 10
I am not sure what partial means when sorting a vector.
As ?sort states,
If partial is not NULL, it is taken to contain indices of elements of the result
which are to be placed in their correct positions in the sorted array by partial sorting.
In other words, the following assertion is always true:
stopifnot(sort(x, partial=pt_idx)[pt_idx] == sort(x)[pt_idx])
for any x and pt_idx, e.g.
x <- sample(100) # input vector
pt_idx <- sample(1:100, 5) # indices for partial arg
This behavior is different from the one defined in the Wikipedia article on partial sorting. In R sort()'s case we are not necessarily computing k smallest elements.
For example, if
print(x)
## [1] 91 85 63 80 71 69 20 39 78 67 32 56 27 79 9 66 88 23 61 75 68 81 21 90 36 84 11 3 42 43
## [31] 17 97 57 76 55 62 24 82 28 72 25 60 14 93 2 100 98 51 29 5 59 87 44 37 16 34 48 4 49 77
## [61] 13 95 31 15 70 18 52 58 73 1 45 40 8 30 89 99 41 7 94 47 96 12 35 19 38 6 74 50 86 65
## [91] 54 46 33 22 26 92 53 10 64 83
and
pt_idx
## [1] 5 54 58 95 8
then
sort(x, partial=pt_idx)
## [1] 1 3 2 4 5 6 7 8 11 12 9 10 13 15 14 16 17 18 23 30 31 27 21 32 36 34 35 19 20 37
## [31] 38 33 29 22 26 25 24 28 39 41 40 42 43 48 46 44 45 47 51 50 52 49 53 54 57 56 55 58 59 60
## [61] 62 64 63 61 65 66 70 72 73 69 68 71 67 79 78 82 75 81 80 77 76 74 89 85 88 87 83 84 86 90
## [91] 92 93 91 94 95 96 97 99 100 98
Here x[5], x[54], ..., x[8] are placed in their correct positions - and we cannot say anything else about the remaining elements. HTH.
EDIT: Partial sorting may reduce the sorting time, of course, if you are only interested in finding some of the order statistics, for example.
require(microbenchmark)
x <- rnorm(100000)
microbenchmark(sort(x, partial=1:10)[1:10], sort(x)[1:10])
## Unit: milliseconds
## expr min lq median uq max neval
## sort(x, partial = 1:10)[1:10] 2.342806 2.366383 2.393426 3.631734 44.00128 100
## sort(x)[1:10] 16.556525 16.645339 16.745489 17.911789 18.13621 100
Regarding the statement "Here x[5], x[54], ..., x[8] are placed in their correct positions": I don't think that is correct. It should read "in the result, i.e. the sorted x, result[5], result[54], ..., result[8] will hold the right values from x."
Quote from the R manual:
If partial is not NULL, it is taken to contain indices of elements of
the result which are to be placed in their correct positions in the
sorted array by partial sorting. For each of the result values in a
specified position, any values smaller than that one are guaranteed to
have a smaller index in the sorted array and any values which are
greater are guaranteed to have a bigger index in the sorted array.
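The guarantee quoted above can be checked directly for a single index. A minimal sketch, with the seed and the choice k = 30 picked arbitrarily for illustration:

```r
set.seed(42)
x <- sample(100)
k <- 30
r <- sort(x, partial = k)

# Position k holds the value a full sort would put there ...
stopifnot(r[k] == sort(x)[k])
# ... everything before it is no larger and everything after it no smaller,
stopifnot(all(r[seq_len(k - 1)] <= r[k]),
          all(r[(k + 1):length(r)] >= r[k]))
# but neither side is guaranteed to be sorted itself.
is.unsorted(r)
```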
