I'm afraid I can't find the answer to this so I need your expertise.
I need to randomize a set of data in R, where the datasets are sets of choices (represented by letters here) in pairs; however, I also need to, for each set, randomize the order of precedence (which goes first). Additionally, I need to include a negative control (XX). It would look something like this:
(1) X A or A X
(2) X B or B X
(3) X C or C X
(4) X D or D X
(5) X E or E X
(6) XX
I can randomize 1-6 easily enough using sample(1:6, 4), but I don't know how to add randomization of the pair order as well. Any ideas would be great!
Assuming your original set of data looks like this:
li
[[1]]
[1] "X" "A"
[[2]]
[1] "X" "B"
[[3]]
[1] "X" "C"
[[4]]
[1] "X" "D"
[[5]]
[1] "X" "E"
[[6]]
[1] "X" "X"
You can randomize it both at the level of the list and at the level of each pair like this:
lapply(li, function(pair) pair[sample(1:2)])[sample(1:6)]
[[1]]
[1] "X" "D"
[[2]]
[1] "B" "X"
[[3]]
[1] "E" "X"
[[4]]
[1] "X" "X"
[[5]]
[1] "X" "A"
[[6]]
[1] "C" "X"
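For completeness, here is a minimal self-contained sketch of the same idea. The construction of li and the seed are my assumptions about how the data might be set up; they are not from the question.

```r
# Build the five X/letter pairs plus the negative control (assumed layout)
li <- c(lapply(LETTERS[1:5], function(l) c("X", l)), list(c("X", "X")))

set.seed(42)  # arbitrary seed, only for reproducibility
# Shuffle within each pair, then shuffle the order of the pairs
randomized <- lapply(li, function(pair) pair[sample(2)])[sample(length(li))]
randomized
```

Using sample(length(li)) instead of sample(1:6) keeps the code working if pairs are added or removed.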
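If I understand the question correctly, the code below is a little brute force, but I believe it answers your question:
s <- c("a", "b", "c", "d", "e", "x")
n <- 6
# first column: the choices in random order; second column: always "x"
(x <- cbind(sample(s, n), rep("x", n)))
for (i in 1:n) {
  # swap the two entries of row i with probability 1/2
  if (sample(1:2, 1) == 2) {
    tmp <- x[i, 1]
    x[i, 1] <- x[i, 2]
    x[i, 2] <- tmp
  }
}
x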
If I understand correctly, you have LETTERS[1:5] and LETTERS[24] (X) that you are sampling from twice, with random ordering. This should do it:
c(sample(LETTERS[c(1:5, 24)], 1), LETTERS[24])[sample(2)]
Broken down:
c( ## combination of
sample(LETTERS[c(1:5, 24)], 1), ## A:E, X, sampled once
LETTERS[24]) ## and X
[sample(2)] ## re-sampled
e.g.
set.seed(1337)
c(sample(LETTERS[c(1:5, 24)], 1), LETTERS[24])[sample(2)]
[1] "X" "D"
A list of some possible outcomes:
set.seed(1337)
replicate(10, c(sample(LETTERS[c(1:5, 24)], 1), LETTERS[24])[sample(2)])
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "X"  "C"  "X"  "X"  "E"  "X"  "C"  "E"  "X"  "X"
[2,] "D"  "X"  "X"  "A"  "X"  "A"  "X"  "X"  "D"  "X"
I would like to generate an array of data matrices, where each matrix refers to a different scenario. For the purpose of this example I have included only one scenario.
p13=0.493;p43=0.325;p25=0.335;p35=0.574;p12=0.868
std_e2=sqrt(1-p12^2);std_e2
std_e3=sqrt(1-(p13^2+p43^2));std_e3
std_e5=sqrt(1-(p25^2+p35^2+2*p25*p35*(p13*p12)));std_e5
scenario_1<-matrix(c(3,0,0,0,0),ncol = 5,nrow = 1);scenario_1
genereting_fuction<- function(n,scenario){
sample <- vector("list")
for (i in scenario){
x1=rnorm(n)+scenario[i,1]
x4=rnorm(n)+scenario[i,4]
x2=x1*p12+std_e2*rnorm(n)+scenario[i,2]
x3=x1*p13+x4*p43+std_e3*rnorm(n)+scenario[i,3]
x5=x2*p25+x3*p35+std_e5*rnorm(n)+scenario[i,5]
sample[[i]]=cbind(x1,x2,x3,x4,x5)
colnames(sample[[i]])<-c("x1","x2","x3","x4","x5")
}
sample
}
array_scenari<-array(dim=c(5,5,2));array_scenari
for(j in 1:nrow(scenario_1)){
set.seed(1234)
dati_prova<- sapply(rep(1,5), function(x) genereting_fuction(x,scenario_1),simplify = 'array');dati_prova
dati_prova<-do.call(rbind.data.frame, dati_prova);dati_prova<-as.matrix(dati_prova);dati_prova
dati_prova<-as.matrix(dati_prova)
array_scenari[,,j]<-dati_prova[]
}
I can't understand why it doesn't work and gives me an error
It's hard to give a concrete answer without a working reproducible example, since your function contains externally defined variables. However, the source of the error is clear. When you create an empty array with array() it has a single fixed dimension:
matrix_classification <- array()
dim(matrix_classification)
#> [1] 1
And if you try to write into its third dimension you get an error:
k <- 1
matrix_classification[, , k] <- "x"
#> Error in matrix_classification[, , k] <- "x": incorrect number of subscripts
If you want to write into an array you should define its dimensions first. For example, the following creates an empty 5 x 5 x 5 array:
matrix_classification <- array("", dim = c(5, 5, 5))
dim(matrix_classification)
#> [1] 5 5 5
And if we want to write a matrix into the kth slice we can do:
matrix_classification[, , k] <- matrix(sample(letters, 25), nrow = 5)
matrix_classification[,,1]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,] "a"  "d"  "k"  "t"  "c"
#> [2,] "f"  "b"  "n"  "m"  "s"
#> [3,] "u"  "q"  "y"  "o"  "j"
#> [4,] "l"  "g"  "h"  "w"  "v"
#> [5,] "r"  "i"  "e"  "p"  "z"
Created on 2022-03-06 by the reprex package (v2.0.1)
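The same pattern works for numeric data. Below is a minimal sketch that preallocates a three-dimensional array and fills one slice per scenario in a loop. The sizes and the names arr, n_rows, n_cols and n_scen are hypothetical and only there for illustration.

```r
n_rows <- 5; n_cols <- 5; n_scen <- 2   # hypothetical sizes
arr <- array(NA_real_, dim = c(n_rows, n_cols, n_scen))

set.seed(1234)  # arbitrary seed
for (k in seq_len(n_scen)) {
  # each slice must be exactly n_rows x n_cols, otherwise the assignment fails
  arr[, , k] <- matrix(rnorm(n_rows * n_cols), nrow = n_rows, ncol = n_cols)
}
dim(arr)
```

The key point is that whatever is assigned to arr[, , k] has to match the first two dimensions of the array.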
I want to randomize/shuffle a vector. Some of the vector elements are identical. After shuffling, identical elements should have a minimum distance of three (i.e. two other elements should be between identical elements).
Consider the following example vector in R:
x <- rep(LETTERS[1:5], 3) # Create example vector
x
# [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
If I shuffle my vector using the sample function, some of the identical elements may end up too close together. For instance, if I use the following R code, the element "C" appears twice in a row, at positions 5 and 6:
set.seed(53135)
sample(x) # sample() function puts same elements too close
# [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"
How could I ensure that identical elements have a minimum distance of three?
So basically we need to repeatedly sample one element from the x vector that has not been selected in the last min.dist draws. Using purrr's reduce we can achieve this:
library(purrr)
min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"
Bundled in a function
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
> shuffle(x, 3)
[1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
[1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
[1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
[1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"
After @27ϕ9's comment:
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))
A B C D E
3 3 3 3 3
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
UPDATE
After some trial and error, and given that you will not always have enough elements left to keep them min.dist apart, I came up with the solution below. It is also the most thoroughly commented of the versions above:
shuffle <- function(x, min.dist=2){
  stopifnot(min.dist < length(unique(x)))
  reduce(integer(length(x)-1), function(.x, ...){
    # whether the value is in the tail of the aggregated vector
    in.tail <- x %in% tail(.x, min.dist)
    # whether a value still hasn't reached the max frequency
    freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
    # whether a value isn't in the aggregated vector
    yet <- !x %in% .x
    # the if is there basically to account for the cases when we don't have enough vars to space out the vectors
    c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else x[which(freq.got)[1]] )
  }, .init=sample(x,1))
}
Now running table(shuffle(rep(LETTERS[1:5], 3), 2)) will always return 3 for every value, and we can say with some certainty that identical values in the vector are spaced with a minimum distance of 2. The only way to guarantee that no identical elements end up too close together is to use min.dist = length(unique(x)) - 1. Otherwise there can be up to r < min.dist elements that are not min.dist away from their previous occurrence, and if such elements exist they will be in the length(x) + 1 - 1:min.dist subset of the resulting vector, that is, in its last min.dist positions.
Just to be completely certain, we can use a loop to check whether the tail of the output vector has unique values (remove the print statement; I used it just for demonstration purposes):
shuffler <- function(x, min.dist=2){
while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
l
}
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"
A B C D E
3 3 3 3 3
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
[1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"
A B C D E
3 3 3 3 3
Update:
shuffler <- function(x, min.dist=2){
while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
l
}
This new version tests rigorously whether the elements in the tail of the vector are at least min.dist apart. The previous version works for min.dist=2, but this new version tests more thoroughly.
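To check a whole result rather than only its tail, a small base-R helper can report the smallest gap between identical elements. This is only a sketch; min_gap is a hypothetical name, not part of the answer above. In the question's terms, "a minimum distance of three" corresponds to min_gap(v) >= 3, which is what min.dist = 2 aims for in shuffle().

```r
# Hypothetical helper: smallest distance between two equal elements of v
# (Inf if no value repeats)
min_gap <- function(v) {
  # positions of each distinct value, then the differences between them
  gaps <- unlist(lapply(split(seq_along(v), v), diff))
  if (length(gaps) == 0) Inf else min(gaps)
}

min_gap(c("A", "B", "C", "A", "B", "C"))  # 3
min_gap(c("B", "A", "E", "D", "C", "C"))  # 1, the two "C"s are adjacent
```

With that, something like stopifnot(min_gap(shuffler(x, 2)) >= 3) could serve as a sanity check on any of the shufflers in this thread.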
If your data is large, then it may be (way) faster to rely on probability to do that kind of task.
Here's an example:
prob_shuffler = function(x, min.dist = 2){
  n = length(x)
  res = sample(x)
  OK = FALSE
  # We loop until we have a solution
  while(!OK){
    OK = TRUE
    for(i in 1:min.dist){
      # We check if identical elements are 'i' steps away
      pblm = res[1:(n-i)] == res[-(1:i)]
      if(any(pblm)){
        if(sum(pblm) >= (n - i)/2){
          # back to square 1
          res = sample(x)
        } else {
          # we pair each identical element with
          # an extra one
          extra = sample(which(!pblm), sum(pblm))
          id_reshuffle = c(which(pblm), extra)
          res[id_reshuffle] = sample(res[id_reshuffle])
        }
        # We recheck from the beginning
        OK = FALSE
        break
      }
    }
  }
  res
}
Even though the while loop looks scary, in practice convergence is fast. Of course, the lower the probability of having two identical characters within min.dist of each other, the faster the convergence.
The current solutions by @Abdessabour Mtk and @Carles Sans Fuentes work but, depending on the size of the input data, quickly become prohibitively slow. Here's a benchmark:
library(microbenchmark)
x = rep(c(letters, LETTERS), 10)
length(x)
#> [1] 520
microbenchmark(prob_shuffler(x, 1), shuffler_am(x, 1), shuffler_csf(x, 1), times = 10)
#> Unit: microseconds
#>                 expr       min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 1)    87.001   111.501    155.071   131.801   192.401    264.401    10
#>    shuffler_am(x, 1) 17218.100 18041.900  20324.301 18740.351 22296.301  26495.200    10
#>   shuffler_csf(x, 1) 86771.401 88550.501 118185.581 95582.001 98781.601 341826.701    10
microbenchmark(prob_shuffler(x, 2), shuffler_am(x, 2), shuffler_csf(x, 2), times = 10)
#> Unit: microseconds
#>                 expr     min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 2)   140.1   195.201   236.3312   245.252   263.202    354.101    10
#>    shuffler_am(x, 2) 18886.2 19526.901 22967.6409 21021.151 26758.800  29133.400    10
#>   shuffler_csf(x, 2) 86078.1 92209.901 97151.0609 97612.251 99850.101 107981.401    10
microbenchmark(prob_shuffler(x, 3), shuffler_am(x, 3), shuffler_csf(x, 3), times = 10)
#> Unit: microseconds
#>                 expr       min        lq        mean     median       uq        max neval
#>  prob_shuffler(x, 3)   318.001   450.402    631.5312    573.352    782.2   1070.401    10
#>    shuffler_am(x, 3) 19003.501 19622.300  23314.4808  20784.551  28281.5  32885.101    10
#>   shuffler_csf(x, 3) 87692.701 96152.202 101233.5411 100925.201 108034.7 113814.901    10
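We can note two things: a) as expected, the speed of prob_shuffler depends on min.dist, while that of the other methods hardly does; b) prob_shuffler is much faster even for just 520 observations: at the median, roughly 35 to 140 times faster than shuffler_am and more than 170 times faster than shuffler_csf (and it scales).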
Of course, if the probability of having two identical characters within min.dist of each other is extremely high, the recursive methods should be faster. But in most practical cases, the probability method is faster.
I hope this answer works for you. It uses only base R. I have left the print statements in (commented out) in case you want to check it line by line:
x <- rep(LETTERS[1:5], 3) # Create example vector
shuffle <- function(x, min_dist=3){
  #init variables
  result<-c() # result vector
  count<-0
  vec_use<-x
  vec_keep<-c()
  for(i in 1:length(x)){
    # print(paste0("iteration =", i))
    if (count>min_dist){
      valback<-vec_keep[1]
      # print(paste0("value to be returned:", valback))
      ntimes_valback<-(table(vec_keep)[valback])
      vec_use<- c(vec_use,rep(valback,ntimes_valback))
      # print(paste0("vec_use after giving back valbak =", valback))
      # print(paste0(vec_use,","))
      vec_keep <- vec_keep[!vec_keep %in% valback]
      # print(paste0("vec_keep after removing valback =", valback))
      # print(paste0(vec_keep,","))
    }
    val<-sample(vec_use,1)
    # print(paste0("val = ",val))#remove value
    vec_keep<- c(vec_keep,x[x %in% val])
    vec_keep<-vec_keep[1:(length(vec_keep)-1)]#removing 1 letter
    # print(paste0("vec_keep ="))
    # print(paste0(vec_keep,","))
    vec_use <- vec_use[!vec_use %in% val]
    # print(paste0("vec_use ="))
    # print(paste0(vec_use,","))
    result[i]<-val
    count<-count+1
  }
  return(result)
}
shuffle(x)
"C" "D" "B" "E" "C" "A" "B" "D" "E" "A" "C" "D" "B" "E" "C"
I am okay with Python and NumPy, but I am trying R and this seemingly simple operation has me stuck.
This is what I have, but I get a "number of items to replace is not a multiple of replacement length" error.
# find sample variance with n
samples <- matrix(0, nrow=num_samples, ncol=samp_size)
for(i in 1:10000){
temp <- sample(population, samp_size, replace=TRUE)
samples[i] = temp
}
samples[0]
I am not hung up on using matrices; it can be an array, vector, list or anything else. I am just looking for a standard way of doing this, because searching online did not give me a quick answer for this basic operation.
The simplest solution is just to initialize samples with your actual samples:
set.seed(123)
n_samples <- 5
n_obs <- 10
population <- letters
samples <- matrix(sample(population, n_obs*n_samples, replace=TRUE),
nrow=n_samples, ncol=n_obs)
But to do it the way you've started, you just need to let R know you'd like to put entries into all columns of samples, like this: samples[i, ].
It's similar to using : syntax in Numpy: array[i, :].
samples <- matrix(0, nrow=n_samples, ncol=n_obs)
for(i in 1:n_samples){
temp <- sample(population, n_obs, replace=TRUE)
samples[i,] = temp
}
Either way, the result is a 5 x 10 character matrix, for example:
samples
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "h" "u" "k" "w" "y" "b" "n" "x" "o" "l"
[2,] "y" "l" "r" "o" "c" "x" "g" "b" "i" "y"
[3,] "x" "s" "q" "z" "r" "s" "o" "p" "h" "d"
[4,] "z" "x" "r" "u" "a" "m" "t" "f" "i" "g"
[5,] "d" "k" "k" "j" "d" "d" "g" "m" "g" "w"
Have you tried replicate?
samples <- t(replicate(n = num_samples, sample(population, samp_size, replace=TRUE),simplify = "array"))
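This will return a matrix of dimensions num_samples x samp_size: replicate() itself returns a samp_size x num_samples matrix, which t() then transposes so that each row is one sample.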
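A minimal end-to-end sketch of the replicate() approach, including the variance with divisor n that the comment in the question mentions. The population, the sizes and the seed are hypothetical values picked for illustration.

```r
set.seed(1)                      # arbitrary seed
population  <- rnorm(1000)       # hypothetical population
num_samples <- 100
samp_size   <- 10

# replicate() returns samp_size x num_samples; t() puts one sample per row
samples <- t(replicate(num_samples, sample(population, samp_size, replace = TRUE)))
dim(samples)

# Per-sample variance with divisor n (var() divides by n - 1, so rescale)
var_n <- apply(samples, 1, var) * (samp_size - 1) / samp_size
```

Here apply(samples, 1, ...) plays the role of the explicit loop over rows, so no preallocation is needed.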
I am writing a program that (as a part of it) automatically creates dendrograms from an input dataset.
For each node/split I want to extract all the labels that are under that node and the location of that node on the dendrogram plot (for further plotting purposes).
So, let's say my data looks like this:
> Ltrs <- data.frame("A" = c(3,1), "B" = c(1,1), "C" = c(2,4), "D" = c(6,6))
> dend <- as.dendrogram(hclust(dist(t(Ltrs))))
> plot(dend)
[Figure: the dendrogram]
Now I can extract the location of the splits/nodes:
> library(dendextend)
> nodes <- get_nodes_xy(dend)
> nodes <- nodes[nodes[,2] != 0, ]
> nodes
[,1] [,2]
[1,] 1.875 7.071068
[2,] 2.750 3.162278
[3,] 3.500 2.000000
Now I want to get all the labels under a node, for each node (/row from the 'nodes' variable).
This should look something like this:
$`1`
[1] "D" "C" "B" "A"
$`2`
[1] "C" "B" "A"
$`3`
[1] "B" "A"
Can anybody help me out? Thanks in advance :)
How about something like this?
library(tidyverse)
library(dendextend)
Ltrs <- data.frame("A" = c(3,1), "B" = c(1,1), "C" = c(2,4), "D" = c(6,6))
dend <- as.dendrogram(hclust(dist(t(Ltrs))))
accumulator <- list();
myleaves <- function(anode){
if(!is.list(anode))return(attr(anode,"label"))
accumulator[[length(accumulator)+1]] <<- (reduce(lapply(anode,myleaves),c))
}
myleaves(dend);
ret <- rev(accumulator); #generation was depth first, so root was found last.
Better test this. I am not very trustworthy. In particular, I really hope the list ret is in an order that makes sense, otherwise it's going to be a pain associating the entries with the correct nodes! Good luck.
The function partition_leaves() from dendextend extracts all leaf labels for each node and returns a list ordered in the same way as the get_nodes_xy() output. With your example,
library(dendextend)
Ltrs <- data.frame("A" = c(3,1), "B" = c(1,1), "C" = c(2,4), "D" = c(6,6))
dend <- as.dendrogram(hclust(dist(t(Ltrs))))
plot(dend)
partition_leaves(dend)
yields:
[[1]]
[1] "D" "C" "A" "B"
[[2]]
[1] "D"
[[3]]
[1] "C" "A" "B"
[[4]]
[1] "C"
[[5]]
[1] "A" "B"
[[6]]
[1] "A"
[[7]]
[1] "B"
Filtering the list by vector length will give output similar to the desired one.
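If dendextend is not available, a base-R sketch on the same example can collect the labels of the internal nodes only. The helper names leaf_labels and inner_node_labels are hypothetical, and this walks the tree directly rather than calling partition_leaves().

```r
Ltrs <- data.frame(A = c(3, 1), B = c(1, 1), C = c(2, 4), D = c(6, 6))
dend <- as.dendrogram(hclust(dist(t(Ltrs))))

# Hypothetical helper: all leaf labels below a node, left to right
leaf_labels <- function(node) {
  if (is.leaf(node)) return(attr(node, "label"))
  unlist(lapply(seq_along(node), function(i) leaf_labels(node[[i]])))
}

# Hypothetical helper: pre-order walk keeping only internal nodes (splits)
inner_node_labels <- function(node) {
  if (is.leaf(node)) return(list())
  below <- lapply(seq_along(node), function(i) inner_node_labels(node[[i]]))
  c(list(leaf_labels(node)), do.call(c, below))
}

res <- inner_node_labels(dend)
lengths(res)  # 4 3 2 for this example
```

Since the walk is pre-order, like the partition_leaves() output above, the entries should line up with the rows of nodes that remain after dropping the leaves, but that is worth checking on your own data.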
I have a matrix that contains integer values that represent the index of the item in an array and I'd like to switch out item 1 for the values[1] and so on for each item in the values array.
Some code to demonstrate what I'd like
> m = matrix(1:3, ncol=3, nrow=3)
> m
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 3 3 3
> replace(m, 1="a", 2="b", 3="c")
> m
[,1] [,2] [,3]
[1,] "a" "a" "a"
[2,] "b" "b" "b"
[3,] "c" "c" "c"
Basically it takes 1 and turns it into "a" and so on. It seems like if I try to do this with a for loop it changes after the first iteration from int to string and since I'd like to do this with any object type that's not great behavior.
I can think of three possibilities to solve this:
m <- matrix(1:3, 3, 3) # Your data
1
Either define a function that takes a vector in the correct matching order (the first entry matches the first unique value in m, etc.)
vec <- c("Ralf", "Jhons", "Pete")
Then you can define a simple function such as
Match_func <- function(x, y) "dim<-"(y[match(x, unique(x))], dim(x))
Test
Match_func(m, vec)
# [,1] [,2] [,3]
# [1,] "Ralf" "Ralf" "Ralf"
# [2,] "Jhons" "Jhons" "Jhons"
# [3,] "Pete" "Pete" "Pete"
2
The second option will be to define your manual replace function, something like
Match_func2 <- function(x, ...) {
temp <- list(...)[[1]]
"dim<-"(temp[match(x, as.numeric(names(temp)))], dim(x))
}
Test
Match_func2(m, c("1" = "a", "2" = "b", "3" = "c"))
# [,1] [,2] [,3]
# [1,] "a" "a" "a"
# [2,] "b" "b" "b"
# [3,] "c" "c" "c"
3
You can also make use of plyr::revalue
library(plyr)
Match_func3 <- function(x, ...) {
temp <- list(...)[[1]]
"dim<-"(revalue(as.character(x), temp), dim(x))
}
Test
Match_func3(m, c("1" = "a", "2" = "b", "3" = "c"))
# [,1] [,2] [,3]
# [1,] "a" "a" "a"
# [2,] "b" "b" "b"
# [3,] "c" "c" "c"
Note: The last approach is the safest in case you don't want to replace all the unique values
Here's an option, starting with a character matrix so that you don't need to worry about making a copy or coercion of the original matrix.
m = matrix(as.character(1:3), ncol=3, nrow=3)
old <- as.character(1:3)
new <- c("a", "b", "c")
for (i in 1:length(old)) {
m <- ifelse(m == old[i], new[i], m)
}
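Since the matrix entries are already the indices into the values vector, plain vector indexing may be all that is needed. A minimal sketch, assuming a lookup vector called values (a hypothetical name) whose i-th element replaces the integer i:

```r
m <- matrix(1:3, ncol = 3, nrow = 3)
values <- c("a", "b", "c")   # values[1] replaces 1, values[2] replaces 2, ...

# Index the lookup vector with the matrix, then restore the shape
out <- matrix(values[m], nrow = nrow(m), ncol = ncol(m))
out
```

Because m itself is never modified, there is no int-to-string coercion along the way, and the same indexing works for numeric or logical lookup vectors too.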