I am running a coin-toss simulation with a loop which runs about 1 million times.
Each time through the loop I wish to retain the table output from the rle command. Unfortunately a simple append does not seem to be appropriate: each time I run the loop I get a slightly different amount of data, which seems to be one of the sticking points.
This code gives an idea of what I am doing:
N <- 5 # number of times to run
rlex <- NULL
# begin loop #############################
for (i in 1:N) { # repeat N times
  x <- sample(0:1, 100000, replace = TRUE)
  rlex <- append(rlex, rle(x))
}
table(rlex) # doesn't work
table(rle(x)) # only tabulates the last iteration
So instead of having five separate rle results (in this simulation, 1 million in the full version), I want one merged rle table. Hope this is clear. Obviously my actual code is a bit more complex, hence any solution should be as close to what I have specified as possible.
UPDATE: The loop is an absolute requirement. No ifs or buts. Perhaps I can pull out the table(rle(x)) data and put it into a matrix. However, again, the stumbling block is that some of the less frequent run lengths do not always turn up in each loop. Thus I guess I am looking to conditionally fill a matrix based on the run-length number?
Last update before I give up: Retaining the rle$values will mean that too much data is being retained. My simulation is large-scale and I really only wish to retain the table output of the rle. Either I retain each table(rle(x)) for each loop and combine by hand (there will be thousands), or I find a programmatic way to keep the data (yes, for both zeroes and ones) and have one table that is formed by merging the individual loops as I go along.
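To make the merging idea concrete, here is a rough sketch of the kind of running merge I mean, ignoring the 0/1 split for brevity (total is just a placeholder name):
# keep one running table of run lengths, merged by name
total <- integer(0)
for (i in 1:5) {
  x <- sample(0:1, 100000, replace = TRUE)
  tab <- table(rle(x)$lengths)
  new <- setdiff(names(tab), names(total))
  total[new] <- 0L # add slots for run lengths not seen in earlier loops
  total[names(tab)] <- total[names(tab)] + as.integer(tab)
}
total[order(as.numeric(names(total)))] # one merged table across all loops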
Either this is easyish to do, as specified, or I will not be doing it. It may seem a silly idea/request, but that should be incidental to whether it can be done.
Seriously last time. Here is an animated gif showing what I expect to happen.
After each iteration of the loop, data is added to the table. This is as clear as I am going to be able to communicate it.
OK, attempt number 4:
N <- 5
set.seed(1)
x <- NULL
for (i in 1:N){
  x <- rbind(x, table(rle(sample(0:1, 100000, replace = TRUE))))
}
x <- as.data.frame(x)
x$length <- as.numeric(rownames(x))
aggregate(x[, 1:2], list(x[[3]]), sum)
Produces:
Group.1 0 1
1 1 62634 62531
2 2 31410 31577
3 3 15748 15488
4 4 7604 7876
5 5 3912 3845
6 6 1968 1951
7 7 979 971
8 8 498 477
9 9 227 246
10 10 109 128
11 11 65 59
12 12 24 30
13 13 21 11
14 14 7 10
15 15 0 4
16 16 4 2
17 17 0 1
18 18 0 1
If you want the aggregation inside the loop, do:
N <- 5
set.seed(1)
x <- NULL
for (i in 1:N){
  x <- rbind(x, table(rle(sample(0:1, 100000, replace = TRUE))))
  y <- aggregate(x, list(as.numeric(rownames(x))), sum)
  print(y)
}
Following up @CarlWitthoft's answer, you probably want:
N <- 5
rlex <- NULL
for (i in 1:N) {
  x <- sample(0:1, 100000, replace = TRUE)
  rlex <- append(rlex, rle(x)$lengths)
}
since I think you don't care about the $values component (i.e. whether each run is a run of zeros or ones).
Result: one long vector of run lengths.
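Tabulating that vector at the end then gives the single merged table the question asks for:
table(rlex) # one combined table of run-length counts across all iterations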
But this would probably be a lot more efficient:
maxlen <- 30
rlemat <- matrix(nrow = N, ncol = maxlen)
for (i in 1:N) {
  x <- sample(0:1, 100000, replace = TRUE)
  rlemat[i, ] <- table(factor(rle(x)$lengths, levels = 1:maxlen))
}
Result: an N by maxlen table of run lengths from each iteration.
If you only want to save the total number of runs of each length you could try:
rlecumsum <- rep(0, maxlen)
for (i in 1:N) {
  x <- sample(0:1, 100000, replace = TRUE)
  rlecumsum <- rlecumsum + table(factor(rle(x)$lengths, levels = 1:maxlen))
}
Result: a vector of length maxlen giving the total number of runs of each length across all iterations.
And here's my final answer:
rlecumtab <- matrix(0, ncol = 2, nrow = maxlen)
for (i in 1:N) {
  x <- sample(0:1, 100000, replace = TRUE)
  r1 <- rle(x)
  rtab <- table(factor(r1$lengths, levels = 1:maxlen), r1$values)
  rlecumtab <- rlecumtab + rtab
}
Result: a maxlen by 2 table of the total numbers of run lengths across all iterations, divided by type (0-run vs 1-run).
You need to read the help page for rle. Consider:
names(rlex) #"lengths" "values" "lengths" "values" .... and so on
In the meantime, I strongly suggest you spend some time reading up on statistical methods. There is zero (+/- epsilon) chance that running a binomial simulation a million times will tell you anything you won't learn after a few hundred tries, unless your coin has p=1e-5 :-).
I'm trying to identify common elements across multiple vectors, with all combinations possible.
I had previously tried this one here, but it doesn't quite work out because it only retrieves the common elements between 2 groups.
Take this example: I have 10 vectors (varying in number of elements) that may have common elements with one or more other vectors. It is also possible that some elements are exclusive to some groups. As an example, here is the data:
#Creating a mock example: 10 groups, with varying number of elements:
set.seed(753)
for (i in 1:10){
  assign(paste0("grp_", i), paste0("target_", sample(1:40, sample(20:34))))
}
Simply put, I want to do something analogous to a Venn diagram, but put into a data frame/matrix with the counts instead. Something like this (note that here I am just adding a snapshot of random parts of how the result data frame/matrix should look):
grp1 grp2 grp3 grp4 grp1.grp4.grp5.grp8.grp10
grp1 - 16 12 20 5
grp2 16 - 10 20 4
grp3 12 10 - 16 3
grp4 20 20 16 - 5
grp1.grp4.grp5.grp8.grp10 5 4 3 5 10
grp1.grp2.grp3.grp4.grp5.grp6.grp7.grp8.grp9.grp10 0 0 0 0 0
grp1.grp2.grp3.grp4.grp5.grp6.grp7.grp8.grp9.grp10
grp1 3
grp2 6
grp3 4
grp4 1
grp1.grp4.grp5.grp8.grp10 5
grp1.grp2.grp3.grp4.grp5.grp6.grp7.grp8.grp9.grp10 2
From the table above, please also note that counts that have the same row and column names mean that they are exclusive to that particular group (e.g. count on row1/col1 means that there are 88 exclusive elements).
Any help is very much appreciated!
EDIT: the real counts for the expected final matrix has now been added.
OK, if I understood it all well, let's give it a try. Note that I added your sample data to a list, so we can index it for the intersections.
set.seed(753)
grps <- list()
for (i in 1:10){
  grps[[i]] <- paste0("target_", sample(1:40, sample(20:34)))
}
You want all combinations of the 10 groups, which gives 2^10 - 1 = 1023 of them, hence a 1023 x 1023 result.
Making it flexible makes testing a bit easier ;)
The key here is that I keep them as a list of integer vectors that we can use to index grps.
N <- 10
combinations <- unlist(sapply(1:N, function(n) combn(1:N, n, simplify = FALSE)), recursive = FALSE)
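A quick sanity check on the count:
length(combinations) # 1023, i.e. 2^10 - 1 non-empty subsets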
Now we loop twice over the combinations, comparing the intersections of each pair of the 1023 combinations. The nested sapply calls give us the nice 1023 x 1023 matrix you want.
results <- sapply(seq_along(combinations), function(i) {
  sapply(seq_along(combinations), function(j) {
    length(intersect(
      Reduce(intersect, grps[combinations[[i]]]),
      Reduce(intersect, grps[combinations[[j]]])
    ))
  })
})
Now we create the names as shown in your example; they are based on the combinations we created and used earlier.
names <- sapply(combinations, function(x) paste("grp", x, sep = "", collapse = "."))
Create the colnames and rownames of the matrix
colnames(results) <- rownames(results) <- names
It seems in your output you don't want values on the diagonal, so we change those to NA:
diag(results) <- NA
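You can then read off any overlap count by name, using the grp1.grp4-style names built above, for example:
results["grp1", "grp4"]
results["grp1.grp4.grp5.grp8.grp10", "grp1"]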
I am trying to run a summation on each row of a data frame. Let's say I want to take the sum of 100n^2, from n=1 to n=4.
> df <- data.frame(n = seq(1:4), a = rep(100))
> df
n a
1 1 100
2 2 100
3 3 100
4 4 100
Simpler example:
Let's make fun our example summation function. I can pull the 100 out because I can just multiply it in later.
fun <- function(x) {
  i <- seq(1, x, 1)
  sum(i^2)
}
I want to then apply this function to each row to the dataframe, where df$n provides the upper bound of the summation.
The desired outcome would be as follows, in df$b:
> df
n a b
1 1 100 1
2 2 100 5
3 3 100 14
4 4 100 30
To achieve these results I've tried the apply function
apply(df$n, 1, fun)
and also with df converted into a matrix
mat <- as.matrix(df)
apply(mat[1, ], 1, fun)
Both return an error:
Error in seq.default(1, x, 1) : 'to' must be of length 1
I understand this error, in that I understand why seq requires a 'to' value of length 1. I don't know how to go forward.
I have also tried the same while reading the dataframe as a matrix.
Maybe less simple example:
In my case I only need to multiply the results above, df$b, by 100 (or df$a) to get my final answer for each row. In other cases, though, the second value might be more entrenched, for example a^i. How would I call on both variables, a and n?
Underlying question:
My underlying goal is to apply a summation to each row of a dataframe (or a matrix). The above questions stem from my attempt to do so using seq(), as I saw advised in an answer on this site. I will gladly accept an answer that obviates the above questions with a different way to run a summation.
If we are applying seq, note that it doesn't take a vector for from or to. So we can loop over the elements of df$n and apply the function to each:
df$b <- sapply(df$n, fun)
df$b
#[1] 1 5 14 30
Or we can Vectorize the function:
Vectorize(fun)(df$n)
#[1] 1 5 14 30
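For the "maybe less simple" case where the second column enters the summand (e.g. the sum of a^i for i = 1..n), a sketch using mapply over both columns (fun2 is an illustrative name):
fun2 <- function(n, a) {
  i <- seq_len(n)
  sum(a^i) # summand uses both variables
}
df$c <- mapply(fun2, df$n, df$a) # element-wise over the two columns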
Motivation: I am currently trying to rethink my coding such as to exclude for-loops where possible. The below problem can easily be solved with conventional for-loops, but I was wondering if R offers a possibility to utilize the apply-family to make the problem easier.
Problem: I have a matrix, say X (n x k matrix) and two matrices of start and stop indices, called index.starts and index.stops, respectively. They are of size n x B and it holds that index.stops = index.starts + m for some integer m. Each pair index.starts[i,j] and index.stops[i,j] are needed to subset X as X[ (index.starts[i,j]:index.stops[i,j]),]. I.e., they should select all the rows of X in their index range.
Can I solve this problem using one of the apply functions?
Application: (Not necessarily important for understanding my problem.) In case you are interested, this is needed for a block-bootstrap application in a time series setting. X represents the original sample. index.starts is sampled as replicate(repetitionNumber, sample.int((n-r), ceiling(n/r), replace=TRUE)) and index.stops is obtained as index.stops = index.starts + m. What I want in the end is a collection of rows of X. In particular, I want to resample, repetitionNumber times, ceiling(n/r) blocks of length r from X.
Example:
# generate data
n <- 100 # the size of your sample
B <- 5   # the number of columns for index.starts and index.stops,
         # and equivalently the number of block bootstraps to sample
k <- 2   # the number of variables in X
X <- matrix(rnorm(n*k), nrow = n, ncol = k)
# take a random sample of the indices 1:100 to get index.starts
r <- 10  # this is the block length
# get a sample of the indices 1:(n-r), and take ceiling(n/r) of these
# (for n=100 and r=10, ceiling(n/r) = n/r = 10); replicate this B times
index.starts <- replicate(B, sample.int((n-r), ceiling(n/r), replace = TRUE))
index.stops <- index.starts + r
#Now can I use apply-functions to extract the r subsequent rows that are
#paired in index.starts[i,j] and index.stops[i,j] for i = 1,2,...,10 = ceiling(n/r) and
#j=1,2,3,4,5=B ?
It's probably way more complicated than what you want/need, but here is a first approach. Just comment if that helps you in any way and I am happy to help.
My approach uses (multiple) *apply functions. The first lapply "loops" over the 1:B cases; it first calculates the start and end points, which are combined into take.rows (the subsetting indices). Next, the initial matrix is subsetted by take.rows (and returned in a list). As a last step, the standard deviation is taken for each column of the subsetted matrices (as a dummy function).
The code (with heavy commenting) looks like this:
# you can use lapply in parallel mode if you want to speed up the code...
lapply(1:B, function(i){
  starts <- sample.int((n-r), ceiling(n/r), replace = TRUE)
  # [1] 64 22 84 26 40 7 66 12 25 15
  ends <- starts + r
  take.rows <- Map(":", starts, ends)
  # [[1]]
  # [1] 64 65 66 67 68 69 70 71 72 73 74
  # ...
  res <- lapply(take.rows, function(subs) X[subs, ])
  # res is now a list of 10 with the ten subsets
  # [[1]]
  #            [,1]        [,2]
  # [1,]  0.2658915 -0.18265235
  # [2,]  1.7397478  0.66315385
  # ...
  # say you want to compute something (sd in this case), you can do the following,
  # but better to do the computing directly in the former "lapply(take.rows...)"
  res2 <- t(sapply(res, function(tmp){
    apply(tmp, 2, sd)
  })) # simplify into a vector/data.frame
  #           [,1]      [,2]
  # [1,] 1.2345833 1.0927203
  # [2,] 1.1838110 1.0767433
  # [3,] 0.9808146 1.0522117
  # ...
  return(res2)
})
Does that point you in the right direction/give you the answer?
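If you would rather use the index.starts and index.stops matrices exactly as built in the question, here is a compact sketch (boot.samples is an illustrative name; note that a:(a+r) gives blocks of r+1 rows under the question's convention):
boot.samples <- lapply(seq_len(B), function(j) {
  # stack the ceiling(n/r) blocks of rows for bootstrap replicate j
  rows <- unlist(Map(":", index.starts[, j], index.stops[, j]))
  X[rows, , drop = FALSE]
})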
I wish to find the speediest way to find up to 1000 possible combinations of 'n' integers that sum to a target integer.
For example. Say I wanted to sum to the number '20'. I want to find up to 1000 combinations of four integers that sum to this number. The integers can repeat themselves. I also have a condition that the integer must not be smaller than a particular number, in this case 4.
target <- 20 # the number I wish to sum to
lowest <- 4  # the smallest integer I allow
size <- 4    # the number of integers I wish to use to sum
maxposs <- target - ((size-1) * lowest) # given the lowest, this is the max possible integer; in my example it is 8
This is how I have started to work this out: using combn to find all combinations of the four chosen integers and then filtering to those that sum to my target.
m <- combn(rep(lowest:maxposs, size), size)
m1 <- m[, colSums(m) == target]
Here, 'm1' has 245 columns. There are only this many solutions. The last few columns:
# [,238] [,239] [,240] [,241] [,242] [,243] [,244] [,245]
#[1,] 4 4 4 4 4 4 5 5
#[2,] 5 5 5 6 7 4 6 4
#[3,] 7 4 5 4 4 5 4 5
#[4,] 4 7 6 6 5 7 5 6
However, in my real application, I can be dealing with very high integers (summing up to 1000) and want to limit myself to a random sample of 1000 possible combinations. As this is for a randomization statistical test, speed is of the essence. I wonder if anyone knows of a faster way of doing this. My way doesn't feel intuitively quick.
my_matrix <- matrix(nrow = 1000, ncol = 4)
i <- 1
nn <- 1000
while (i <= 1000) {
  x <- sample(x = 4:nn, size = 3)
  y <- nn - sum(x)
  if (y >= 4) {
    my_matrix[i, ] <- c(x, y)
    i <- i + 1
  }
}
Per Gavin's suggestion, redone with a preallocated matrix. Now this runs in 0.158 seconds, twice as fast as before, and probably scales better.
What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries, how do I select 3 such that their sum does not exceed some value. I'm just using a while loop but that seems pretty inefficient.
You could face a combinatorial explosion. This simulates the selection of combinations of 3 EEs from a set of 20 with salaries at a mean of 60 and an sd of 20. It shows that, from the enumeration of the 1140 combinations, you will find only 263 having a sum of salaries less than 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20 ,
function(x){paste(sample(letters[1:20], 6) ,
collapse="")}), sals = rnorm(20, 60, 20))
> head(salry)
EEnams sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3) , 2, function(x) sum(salry[x, "sals"]) < 150))
[1] 263
If you had 1000 EE's then you would have:
> choose(1000, 3) # Combination possibilities
# [1] 166,167,000 Commas added to output
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
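A minimal sketch of that sequential scheme, reusing the salry data frame from the answer above (sample_constrained is an illustrative name):
sample_constrained <- function(df, k = 3, limit = 150) {
  repeat {
    chosen <- integer(0)
    remaining <- limit
    pool <- seq_len(nrow(df))
    failed <- FALSE
    for (j in seq_len(k)) {
      # restrict to unchosen cases that still fit under the remaining budget
      pool <- setdiff(pool[df$sals[pool] < remaining], chosen)
      if (length(pool) == 0) { failed <- TRUE; break } # dead end: restart
      pick <- if (length(pool) == 1) pool else sample(pool, 1)
      chosen <- c(chosen, pick)
      remaining <- remaining - df$sals[pick]
    }
    if (!failed) return(df[chosen, ])
  }
}
sample_constrained(salry, k = 3, limit = 150)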
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## create data frame
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sums of each group
sx <- colSums(x)
## keep the sums meeting the constraint (here: sum < 1)
sxc <- sx[sx < 1]
## sampling with replacement
sample(sxc, 10, replace = TRUE)
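If you need the actual group members rather than just their sums, you can carry the index triples alongside (a sketch reusing x and sx from above):
idx <- combn(seq_len(N), 3) # column i holds the row indices behind x[, i]
ok <- idx[, sx < 1, drop = FALSE] # triples meeting the constraint
ok[, sample(ncol(ok), 10, replace = TRUE)] # 10 sampled triples of cases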