everyone!
how to generate a vector which satisfy some conditions?
Problem: generate a vector a such that length(a)=400000 which is made up of 8 elements:0, 5, 10, 50, 500, 5000, 50000, 300000. Each element appears a set number of times, namely 290205, 100000, 8000, 1600, 160, 32, 2, 1, respectively. Further, a is blocked into 4,000 "groups" of 100 consecutive elements; call them a_k, k=1,...,4000. These groups must satisfy the following:
The sum of every group exceeds 150, i.e. sum_i a_k_i>150 for all k.
The elements 5, 10 and 50 appear between 25 and 29 times in each group, i.e. for all k, the set {i|a_i_k in (5,10,50)} has magnitude between 25 and 29.
0 never appears more than 8 times in a row in any group.
I have tried this many times, but it does not seem to work:
My current code is as follows:
T <- 4*10^(5) # data size
x <- c(0, 5, 10, 50, 500, 5000, 50000, 300000) #seed vector
t <- c(290205, 100000, 8000, 1600, 160, 32, 2, 1) #frequency
A <- matrix(0, 4000, 100) #4000 groups
k <- rep(0, times = 8) #record the number of seeds
for(m in 1:4000) {
p <- (t - k)/(T - 100*(m - 1)) #seed probability
A[, m] <- sample(x, 100, replace = TRUE, prob = p) #group m
sm <- 0
i <- 0
for(j in 1:92) {
if(sum(A[m,j:j + 8])==0){
if(A[m,j] > 0 & A[m,j] < 500) {i <- i+1}
sm <- sm+A[100*m+j]
}
else j <- 0
}
if (sm >= 150 & i > 24 & i < 30 & j != 0) {
m <- m + 1
for (n in seq_len(x)) {
k[n] <- sum(A[, m+1] == x[n]) + k[n]
}
}
}
How about just doing it by construction? For example:
amat<-matrix(rep(c(rep(rep(c(0,5),c(8,3)),8),
rep(c(0,NA),c(8,4))),4000),nrow=100)
amat[97:100,1:2205]<-c(rep(10,3),0)
amat[97:98,2206:4000]<-c(5,5)
amat[99:100,2206:2897]<-c(10,10)
amat[99:100,2898]<-c(5,50)
amat[99:100,2899:3307]<-c(5,50)
amat[99:100,3308:3902]<-c(50,50)
amat[which(is.na(amat))]<-rep(c(10,500,5000,5e4,3e5),c(1,160,32,2,1))
a<-c(amat)
This satisfies all your conditions:
Element counts:
>sapply(c(0,5,10,50,500,5000,50000,300000),function(x)length(which(a==x)))
[1] 290205 100000 8000 1600 160 32 2 1
Group sums:
> table(colSums(amat)>=150)
TRUE
4000
5,10,50 frequency:
> table(sapply(1:4000,function(x)abs(sum(amat[,x] %in% c(5,10,50))-27)<=2))
TRUE
4000
Runs of 0:
> table(sapply(1:4000,function(x)max(rle(amat[,x])$lengths[rle(amat[,x])$values==0])<=8))
#If this is slow, we can just use max(rle(amax[,x]))<=8
# because there aren't many valid groups with strings of 9+
# non-0 elements
TRUE
4000
if in fact we're never allowed to have strings of 9 0s, we'll need to make a slight adjustment to groups 2:2206, because, e.g. a[100:108]==0
I can start it off and maybe someone can help get to the next step. My approach is to start with the constraints and let sample work out the numbers.
set.seed(77)
choose <- c(0,5,10,50,500,5000,50000,300000)
freqs <- c(290205,100000,8000,1600,160,32,2,1)
probs <- freqs/sum(freqs)
check.sum <- function(vec) sum(vec) >= 150
check.interval <- function(vec) abs(sum(vec %in% c(5,10,50))-27)<=2
check.runs <- function(vec, runmax=8) max(rle(vec)$lengths[rle(vec)$values==0]) <= runmax
check.all <- function(vector) {
logicals <- c(check.sum(vector),
check.runs(vector),
check.runs(vector)
)
return(all(logicals))
}
nums <- NULL
res <- list()
for(i in 1:4000) {
nums <- numeric(100)
while(!check.all(nums)) {nums <- sample(choose, 100, replace=T,prob=probs)}
res[i] <- list(nums)
}
str(res)
List of 4000
$ : num [1:100] 1e+01
So this gets you a list of 4,000 groups of 100 numbers that fit the constraints. It only took about two seconds of system time.
Next step is for someone to get a way to build something similar except eliminate 300000 once it is used, and 50000 once it is used twice and so on.
Inspired by #plafort's approach, I've come up with the following that seems to work very quickly and should be capable of generating all vectors satisfying your conditions:
elts<-c(0,5,10,50,500,5000,50000,300000)
freq<-c(290205,100000,8000,1600,160,32,2,1)
ngrp<-4000L
grp.cond1<-function(x)sum(x)>=150
grp.cond2<-function(x)abs(sum(x %in% c(5,10,50))-27)<=2
grp.cond3<-function(x)max(rle(x)$lengths[rle(x)$values==0])<=8
check.all<-function(mat){
all(sapply(1:ncol(mat),function(y)grp.cond1(mat[,y])),
sapply(1:ncol(mat),function(y)grp.cond2(mat[,y])),
sapply(1:ncol(mat),function(y)grp.cond3(mat[,y])))}
while(!check.all(amat)){amat<-matrix(sample(rep(elts,freq)),ncol=ngrp)}
a<-c(amat)
I've also written the code in a way that should be easy to generalize to other element sets/counts, group numbers, and group-wise conditions.
Unfortunately it seems these conditions are pretty stringent, and it may take a long time to generate an acceptable a. I let the while loop run ~1300 times with no success...
Thanks for everyone! I have figured out my problem.
rm(list = ls())
media <- matrix(rep(rep(c(0,5,NA),c(72,25,3)),4000),nrow=100)
media[98:100,1:2400] <-c(10,10,10)
media[98:99,2401:3200] <-c(50,10)
media[98:99,3201:4000] <-c(50,0)
media[100,2401:4000] <-rep(c(0,500,5000,50000,300000),c(1405,160,32,2,1))
obj1 <- matrix(0,100L,4000)
obj2 <-obj1
grp.cond<-function(x) max(rle(x)$lengths[rle(x)$values==0])<=8
elts<-c(0,5,10,50,500,5000,50000,300000)
for(i in 1:4000){
freq<-c(sapply(elts, function(x) length(which(media[,i]==x))))
while(!grp.cond(obj1[,i])){obj1[,i]<-c(sample(rep(elts,freq)))}
i<-i+1
}
elts1<-c(1:4000)
freq1<-rep(1,times=4000)
a1<-sample(rep(elts1,freq1))
for(i in 1:4000){obj2[,i]<-obj1[,a1[i]]}
a <- c(obj2)
Related
recently I am trying to mimic a game.
I am going to throw 2 dice at the same time. If the sum of 2 dice is greater than or equals to 10, I win 1 point.
If it is lower than 10, I lose 1 point. I will do this for 1000 times.
At the very beginning, I draw 2000 random samples with set.seed (1234)
set.seed(1234)
d = sample(c(1:6), size = 2000, replace = T)
d
And then, I turn it into a matrix, and sum each row
a = matrix(d, nrow=1000, ncol=2, byrow=T)
t = rowSums(a)
t
Now, I have 1000 elements (sum of two dice each time). I would like to create a vector X to calculate the point that I can get.
However, how can I apply if statement to create vector X in this time?
Thank you very much
Do you mean this?
X <- ifelse(t>=10,1,-1)
or
X <- 2*(t>=10)-1
Using case_when
library(dplyr)
case_when(t >= 10 ~ 1, TRUE ~ -1)
You could assign a temporary variable and assign points by comparing the values.
tmp <- t
t[tmp >= 10] <- 1
t[tmp < 10] <- -1
Or without a temporary variable.
t1 <- c(-1, 1)[(t >= 10) + 1]
I'm trying to "pseudo-randomize" a vector in R using a while loop.
I have a vector delays with the elements that need to be randomized.
I am using sample on a vector values to index randomly into delays. I cannot have more than two same values in a row, so I am trying to use an if else statement. If the condition are met, the value should be added to random, and removed from delays.
When I run the individual lines outside the loop they are all working, but when I try to run the loop, one of the vector is populated as NA_real, and that stops the logical operators from working.
I'm probably not great at explaining this, but can anyone spot what I'm doing wrong? :)
delay_0 <- rep(0, 12)
delay_6 <- rep(6, 12)
delays <- c(delay_6, delay_0)
value <- c(1:24)
count <- 0
outcasts <- c()
random <- c(1,2)
while (length(random) < 27) {
count <- count + 1
b <- sample(value, 1, replace = FALSE)
a <- delays[b]
if(a == tail(random,1) & a == head(tail(random,2),1) {
outcast <- outcasts + 1
}
else {
value <- value[-(b)]
delays <- delays[-(b)]
random <- c(random,a)
}
}
Two problems with your code:
b can take a value that is greater than the number of elements in delays. I fixed this by using sample(1:length(delays), 1, replace = FALSE)
The loop continues when delays is empty. You could either change length(random) < 27 to length(random) < 26 I think or add length(delays) > 0.
The code:
delay_0 <- rep(0, 12)
delay_6 <- rep(6, 12)
delays <- c(delay_6, delay_0)
value <- c(1:24)
count <- 0
outcasts <- c()
random <- c(1, 2)
while (length(random) < 27 & length(delays) > 0) {
count <- count + 1
b <- sample(1:length(delays), 1, replace = FALSE)
a <- delays[b]
if (a == tail(random, 1) & a == head((tail(random, 2)), 1))
{
outcast <- outcasts + 1
}
else {
value <- value[-(b)]
delays <- delays[-(b)]
random <- c(random, a)
}
}
I just want to say first that I'm pretty new to R coding. I wrote up some R code which will run over thousands of iterations. The code works and gets the results that I need, however it takes way too long to run. I'll first explain what the code is doing and then the code it self. How can I make this more efficient and make it run in a relatively short time over 200K+ iterations?
There is a while loop which runs until the total dollars reach the target dollars. First I generate a random number, which I look up on the Prob column in the first table below which returns the Dist column (this value is stored as a string). I parse the string and get a value based on the distribution and add it to a vector. Then I use this value to do a another look up on the second table below and get a factor and save these factors for each value in a second vector. I do this loop until I reach my target dollars. Then I multiple the two vectors to get my result vector. This while loop is then looped 200K+ times.
Prob Range Dist
.12 5000 rgamma(1, 3, , 900) + 1000
.70 100000 rgamma(1, 1, , 900) + 5000
.85 350000 rgamma(1,0.9, , 150000) + 200000
.95 1500000 rgamma(1,0.8, , 230000) + 200000
1.0 2500000 runif(1, 1500000, 2500000)
Range Factor
5000 rweibull(1, 20, 1.1)
100000 rweibull(1, 30, 1.2)
250000 rweibull(1, 25, 1.5)
2500000 rweibull(1, 25, 1.8)
Sample code is below. I've used dummy values in many places, there is other operations having a couple more similar operations as below. Ruing this 100 times takes about a minute. When I run it thousands of time, it will take too long. How can I make this code more efficient?
t <- proc.time()
#inputs
sims <- 100
totalD <- 0
totalRev <- c(150000000)
i <- 0
set.seed(1)
ProbRnge <- matrix(c(0.12, 0.70, 0.85, 0.95, 1,
5000, 100000, 350000, 1500000, 2500000,
1000, 5000, 100000, 350000, 1500000), ncol=3)
Dis1 <- c("rgamma(1, 3.0268, , 931.44) + 1000", "rgamma(1, 1.0664, , 931.44) + 5000",
"rgamma(1, 1.0664, , 931.44) + 5000", "rgamma(1, 1.0664, , 931.44) + 5000",
"runif(1, 1250000, 2000000)")
SizeRnge <- c(5000, 100000, 250000, 2500000)
Dis2 <- c("rweibull(1, 20, 1.1)", "rweibull(1, 30, 1.2)", "rweibull(1, 25, 1.5)",
"rweibull(1, 25, 1.8)")
#simulation loop
for (j in 1:sims) {
TotalDTemp <- NULL
FacTmp <- NULL
TotalDTemp <- vector()
FacTmp <- vector()
# loop while total simulated reached target total.
while(totalD < totalRev[1])
{
i = i + 1
#find where random number falls in range and look up distribution and calculate value and store in vector
row_i <- which.max(ProbRnge[,1] > runif(1))
tmpSize <- max(min(eval(parse(text=Dis1[row_i])), ProbRnge[row_i, 2]), ProbRnge[row_i, 3])
if (totalD + tmpSize > totalRev[1]) {
tmpSize = totalRev[1] - totalD
totalD = totalD + tmpSize
} else {
totalD = totalD + tmpSize }
TotalDTemp [i] <-tmpSize
# take value an lookup up factor to apply and store in vector
row_i <- which.max(SizeRnge > tmpSize)
tempRTR <- max(min(eval(parse(text=Dis2[row_i])), 2), 1)
FacTmp [i] <- tempRTR
}
DfacTotal <- TotalDTemp * FacTmp
totalD = 0
i = 0
}
proc.time() - t
If you profile your code, you see that what is taking the most of time is parsing the expressions. You could do that beforehand (before the loops) by computing
expr1 <- lapply(Dis1, function(text) parse(text = text))
expr2 <- lapply(Dis2, function(text) parse(text = text))
And then using eval(expr1[[row_i]]) instead of eval(parse(text=Dis1[row_i])).
For me, this reduces computation time from 45 sec to less than 2 sec.
I am looking for a efficient way in R to derive possible combinations.
I have a data frame with 3 columns and on the basis first column contents I am calculating all the possible combinations.
df <- data.frame("H" = c("H1","H2","H3","H4"), "W1" = c(95, 0, 85 ,0) , "W2" = c(50, 85, 0,0))
df$H <- as.character.factor(df$H)
nH <- nrow(df)
nW <- 2
library(plyr)
library(gtools)
if(nW<=5){
# Find all possible combinations
mat1 <- matrix(nrow = 0, ncol = nH)
for(i in 1:nH){
# mat1 <- rbind.fill.matrix(mat1, combinations(nH,nH-(i-1),df$H))
mat1 <- rbind.fill.matrix(mat1, t(combn(df$H,nH-(i-1))))
}
df_comb <- data.frame(mat1)
}
View(df_comb)
df_comb gives correct output. Above code works good for small data sets but when the values for H column is more than 15 , R results into out of memory.
Looking for ways in which calculation of combinations in above scenario can be done efficiently in R till H1, H2 .... H49, H50.
EDIT:
Tried a different Approach, Now after certain number of possible combinations (in below case - 32767), applied random sampling to generate combinations using ratio method.
nH <- 26
nW <- 2
if(nW<=5){
# Find all possible combinations ~~~~~ Random Sampling
ncomb <- 0
for(i in 1:nH){
ncomb <- ncomb + choose(nH, nH-(i-1))
}
nmax <- 10000 # Total number of combinations cannot exceed 10000
mat1 <- matrix( nrow = 0, ncol = nH)
for(i in 1:nH){ # For each Group 26C1 26C2 26C3 ..... 26C25 26C26
ncombi <- choose(nH, nH-(i-1)) #For i = 1 , 26C25
ncombComputed <- ceiling(nmax/ncomb*choose(nH, nH-(i-1)))
if(ncomb <= 32767 ){ # This condition is independent of NMAX - For 15
#Combinations
print("sefirst")
final <- mat1
print(paste(nH," ",i))
abc <- combinations(nH,nH-(i-1),df$herbicide)
mat1 <- rbind.fill.matrix(mat1, combinations(nH,nH-(i-1),df$H))
}
else {
print(i)
print("second")
combi <- matrix( nrow = 0, ncol = nH-(i-1))
#random sampling
while(nrow(combi) < ncombComputed){
combi<- rbind(combi,sort(sample(df$herbicide,nH-(i-1))))
combi <- unique(combi)
}
mat1 <- rbind.fill.matrix(mat1, combi)
}
}
df_comb_New <- data.frame(mat1)
}
The above code gives the result but for 26 Entries its taking 36 seconds for 10000 Combinations.Now I am looking that is there a way to optimize the while loop so that execution becomes faster or any other way to achieve the same result in efficient manner.
I have data in blocks[[i]] where i = 4 to 6 like so
Stimulus Response PM
stretagost s <NA>
colpublo s <NA>
zoning d <NA>
epilepsy d <NA>
resumption d <NA>
incisive d <NA>
440 rows in each block[[i]].
Currently my script does some stuff to 1 randomly selected item out of every 15 trials (except for the first 5 trials every 110, also I have it set so I can never choose rows less than 2 apart) for each block [[i]].
What I would like to be able to do is do stuff to 1 item from every 15 trials, randomly selected out of only those where response == "d". i.e., I don't want my random selection to ever do stuff to rows where response=="s". I have no idea how to achieve this but here is the script I have so far, which just randomly chooses 1 row out of each 15:
PMpositions <- list()
for (i in 4:6){
positions <- c()
x <- 0
for (j in c(seq(5, 110-15, 15),seq(115, 220-15, 15),seq(225, 330-15, 15),seq(335,440-15, 15)))
{
sub.samples <- setdiff(1:15 + j, seq(x-2,x+2,1))
x <- sample(sub.samples, 1)
positions <- c(positions,x)
}
PMpositions[[i]] <- positions
blocks[[i]]$Response[PMpositions[[i]]] <- Wordresponse
blocks[[i]]$PM[PMpositions[[i]]] <- PMresponse
blocks[[i]][PMpositions[[i]],]$Stimulus <- F[[i]]
}
I ended up dealing with it like so
PMpositions <- list()
for (i in 1:3){
startingpositions <- c(seq(5, 110-15, 15),seq(115, 220-15, 15),seq(225, 330-15,
15),seq(335, 440-15, 15))
positions <- c()
x <- 0
for (j in startingpositions)
{
sub.samples <- setdiff(1:15 + j, seq(x-2,x+2,1))
x <- sample(sub.samples, 1)
positions <- c(positions,x)
}
repeat {
positions[which(blocks[[i]][positions,2]==Nonwordresponse)]<-
startingpositions[which(blocks[[i]][positions,2]==Nonwordresponse)]+sample(1:15,
size=length(which(blocks[[i]][positions,2]==Nonwordresponse)), replace = TRUE)
distancecheck<- which ( abs( c(positions[2:length(positions)],0)-positions ) < 2)
if (length(positions[which(blocks[[i]][positions,2]==Nonwordresponse)])== 0 & length
(distancecheck)== 0) break
}
PMpositions[[i]] <- positions
blocks[[i]]$Response[PMpositions[[i]]] <- Wordresponse
blocks[[i]]$PM[PMpositions[[i]]] <- PMresponse
blocks[[i]][PMpositions[[i]],]$Stimulus <- as.character(NF[[i]][,1])
Nonfocal[[i]] <- blocks[[i]]
}
I realised when getting stuck on repeat loops that sometimes I have 15 "s" in response in a row! doh. Would be nice to be able to fix this but it is ok for what I need, when I get stuck I'm just running it again (the location of d/s are randomly generated).
EDIT: Here's a different approach that only samples 'd' rows. It's pretty customized code, but the main idea is to use the prob argument to only sample rows where "Response"=="d" and set the probably of sampling all other rows to zero.
Response <- rep(c("s","d"),220)
chunk <- sort(rep(1:30,15))[1:440] # chunks of 15 up to 440
# function to randomly sample from each set of 15 rows
sampby15 <- function(i){
sample((1:440)[chunk==i], 1,
# use the `prob` argument to only sample 'd' values
prob=rep(1,length=440)[chunk==i]*(Response=="d")[chunk==i])
}
s <- sapply(1:15,FUN=sampby15) # apply to each chunk to get sample rows
Response[s] # confirm only 'd' values
# then you have code to do whatever to those rows...
So the really basic function you'll want to operate on each block is like this:
subsetminor <- function(dataset, only = "d", rows = 1) {
remainder <- subset(dataset, Response == only)
return(remainder[sample(1:nrow(remainder), size = rows), ])
}
We can spruce it up a bit to avoid rows next to each other:
subsetminor <- function(dataset, only = "d", rows = 1) {
remainder <- subset(dataset, Response == only)
if(rows > 1) {
sampled <- sample(1:nrow(remainder), size = rows)
pairwise <- t(combn(sampled, 2))
while(any(abs(pairwise[, 1] - pairwise[, 2]) <= 2)) {
sampled <- sample(1:nrow(remainder), size = rows)
pairwise <- t(combn(sampled, 2))
}
}
out <- remainder[sampled, ]
return(out)
}
The above can be simplified/DRY'd out quite a bit, but it should get the job done.