Data generation: Creating a vector of vectors in R

I have a vector of positive integers of unknown length. Let's call it vector a with elements a[1], a[2], ...
I want to perform calculations on every vector b satisfying 0 <= b[i] <= a[i] for all i.
The following does not work:
for(b in 0:a)
{
# calculations
}
The best I have come up with is:
probabilities <- function(a, p)
{
  # k[j] is the place value for position j, so that each integer i
  # encodes a unique vector b in a mixed-radix system with radices a+1
  k <- a
  k[1] <- 1
  h <- rep(0, sum(a) + 1)
  for (i in 2:length(a))
  {
    k[i] <- k[i-1] * (a[i-1] + 1)
  }
  for (i in 0:(prod(a+1) - 1))  # prod(a+1) distinct vectors b, indexed from 0
  {
    # decode i into the vector b
    b <- a
    for (j in 1:length(a))
    {
      b[j] <- floor(i/k[j]) %% (a[j] + 1)
    }
    # probability of exactly b[j] successes in a[j] trials, multiplied over j
    t <- 1
    for (j in 1:length(a))
    {
      t <- t * choose(a[j], b[j]) * p[j]^b[j] * (1-p[j])^(a[j]-b[j])
    }
    h[sum(b) + 1] <- h[sum(b) + 1] + t
  }
  return(h)
}
In the middle of my function is where I create b. I start by setting b equal to a (so that it is the same length), then replace each element of b with a value that is rather tricky to calculate. This seems like an inefficient solution. It works, but it is fairly slow as the numbers get large. Any ideas for how I can cut down on processing time? Essentially, b behaves like an odometer: the first time through, b is all zeros; then it is 1, 0, 0, 0, ... The first element keeps incrementing until it reaches a[1], then b[2] increments and b[1] resets to 0, and b[1] starts incrementing again.
I know the math is sound; I just do not trust that it is efficient. I studied combinatorics for a few years, but have never studied computational complexity theory, so coming up with a fast algorithm is a bit beyond my realm of knowledge. Any ideas would be helpful!
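One possible shortcut (a sketch of my own, not from the question): h is the distribution of sum(b) for independent binomial counts, so it can be built by convolving the individual binomial distributions with dbinom() and convolve(), with no need to enumerate the prod(a+1) vectors b at all.
probabilities_conv <- function(a, p) {
  h <- 1
  for (j in seq_along(a)) {
    # convolve(x, rev(y), type = "open") is R's idiom for convolving two
    # sequences; here it folds one binomial pmf into the running distribution
    h <- convolve(h, rev(dbinom(0:a[j], a[j], p[j])), type = "open")
  }
  h  # length sum(a) + 1; h[s + 1] is P(sum(b) == s)
}
If the full set of vectors b is itself needed, expand.grid(lapply(a, function(ai) 0:ai)) enumerates all of them in one call, one row per b. Note that convolve() works via the FFT, so results can differ from the exact sums by tiny rounding errors.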

Related

Understanding Breakpoint function: how for loops work inside functions

I have the following exercise to be solved in R. Under the exercise, there is a hint towards the solution.
Exercise: If there are no ties in the data set, the function above will produce breakpoints with h observations in the interval between two consecutive breakpoints (except the last two perhaps). If there are ties, the function will by construction return unique breakpoints, but there may be more than h observations in some intervals.
Hint:
my_breaks <- function(x, h = 5) {
  x <- sort(x)
  breaks <- xb <- x[1]
  k <- 1
  for (i in seq_along(x)[-1]) {
    if (k < h) {
      k <- k + 1
    } else {
      if (xb < x[i-1] && x[i-1] < x[i]) {
        xb <- x[i-1]
        breaks <- c(breaks, xb)
        k <- 1
      }
    }
  }
However, I am having a hard time understanding the above function, particularly the following lines:
for (i in seq_along(x)[-1]) {
  if (k < h) {
    k <- k + 1
Question:
How is the for loop supposed to act on k if k is previously defined as 1 and i is different from k? How are the breakpoints chosen according to the h = 5 gap if the for loop is not acting on x directly? Can someone explain to me how this function works?
Thanks in advance!
First, note that your example is incomplete: the return value and the final brace are missing. Here is the complete version.
my_breaks <- function(x, h = 5) {
  x <- sort(x)
  breaks <- xb <- x[1]
  k <- 1
  for (i in seq_along(x)[-1]) {
    if (k < h) {
      k <- k + 1
    } else {
      if (xb < x[i-1] && x[i-1] < x[i]) {
        xb <- x[i-1]
        breaks <- c(breaks, xb)
        k <- 1
      }
    }
  }
  breaks
}
Let's check if it works.
my_breaks(c(1,1,1:5,8:10), 2)
#[1] 1 2 4 8
my_breaks(c(1,1,1:5,8:10), 5)
#[1] 1 3
As you can see, everything is fine. And what is seq_along(x)[-1]? It is equivalent to 2:length(x): seq_along(x) produces the indices 1, 2, ..., length(x), and [-1] drops the first one. So the for loop walks over the indices of x in order, skipping the first element.
What is the k variable for? It counts how many observations have passed since the last breakpoint, so that a new breakpoint is only considered after h observations have accumulated.
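A quick illustration of that index expression (a hypothetical toy vector):
x <- c(10, 20, 30, 40)
seq_along(x)       # 1 2 3 4
seq_along(x)[-1]   # 2 3 4, the same as 2:length(x)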

How to make an R function that loops over two lists

I have an event A that is triggered when the majority of coin tosses in a series of tosses comes up heads. I have an unfair coin and I'd like to see how the likelihood of A changes as the number of tosses change and the probability in each toss changes.
This is my function, assuming 3 tosses:
n <- 3
#victory requires majority of tosses heads
#tosses only occur in odd intervals
k <- seq(n/2+.5,n)
victory <- function(n, k, p){
  for (i in p) {
    x <- 0
    for (i in k) {
      x <- x + choose(n, k) * p^k * (1-p)^(n-k)
    }
    z <- x
  }
  return(z)
}
p <- seq(0,1,.1)
victory(n,k,p)
My hope is the victory() function would:
find the probability of each of the outcomes where the majority of tosses are heads, given a particular value p
sum up those probabilities and add them to a vector z
go back and do the same thing given another probability p
I tested this with n <- 3, k <- c(2,3) and p <- (.5,.75) and the output was 0.75000, 0.84375. I know that the output should've been 0.625, 0.0984375.
I wasn't able to get exactly the result you wanted, but maybe can help you along a bit.
When looping in R, the vector you are looping over stays unchanged; only the loop variable changes. For example, see the differences between these loops:
test <- seq(0,1,length.out = 5)
for (i in test) {
  print(test)
}
for (i in test) {
  print(i)
}
for (i in 1:length(test)) {
  print(test[i])
}
In your function, the outer loop first sets i to an element of p, then the inner loop immediately reuses i for the elements of k, and in any case the body uses the unchanged whole vectors k and p rather than single elements.
You are also assigning to z on each pass of the outer loop over p, overwriting the previous value each time.
Try the version below. I am still not getting the answer you say you expect, but it might help you find where the error is (printing values along the way, or using debug(victory), might also be helpful):
victory <- function(n, k, p){
  z <- list()
  for (i in 1:length(p)) {
    x <- 0
    for (j in 1:length(k)) {
      x <- x + choose(n, k[j]) * p[i]^k[j] * (1-p[i])^(n-k[j])
    }
    z[i] <- x
  }
  return(z)
}
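As a further sketch (my addition, not part of the original answer): dbinom(k, n, p) computes exactly the term choose(n, k) * p^k * (1-p)^(n-k) and is vectorized over k, so sapply() over p removes both explicit loops.
victory_vec <- function(n, k, p) {
  # sum the binomial pmf over the winning counts k, once per probability in p
  sapply(p, function(pi) sum(dbinom(k, n, pi)))
}
victory_vec(3, c(2, 3), c(0.5, 0.75))
# [1] 0.50000 0.84375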

Simplify Simulations on R

As I mentioned in a previous question, I am brand new to programming and have no prior experience, but am very happy to be learning.
However, I've run into the following problem. My professor has given us this function:
sim1 <- function(n) {
  xm <- matrix(nrow=n, ncol=2)
  for (i in 1:n) {
    d <- rnorm(1)
    if (runif(1) < 0.5) {
      xm[i,1] <- 1
      xm[i,2] <- 2.5*d + 69
    } else {
      xm[i,1] <- 0
      xm[i,2] <- 2*d + 64
    }
  }
  return(xm)
}
With the following task: Try to improve the efficiency of this code. Use speed.test to see if it is improved for generating n=1000 observations.
I have finally been able to figure out what this code does; nonetheless, I am completely lost on how I could possibly make it more efficient.
Any help means a whole lot.
Thank you!
If possible, don't use loops in R. rep() and rnorm() will fill vectors with 5, 10, or 500,000 values in one call, very quickly. Calling rnorm(1) 500,000 times is a waste and much slower than simply calling rnorm(500000). It's like taking a Ferrari for a drive, going one foot and stopping, going one foot and stopping, over and over, to get to your destination.
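You can see the gap with a quick timing comparison (exact numbers depend on the machine):
system.time(for (i in 1:500000) d <- rnorm(1))  # 500,000 calls: slow
system.time(d <- rnorm(500000))                 # one call: fast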
This function will return statistically identical results as your function. However, instead of using loops, it does things in the R way.
sim2 <- function(n) {
  n1 <- floor(n/2)                # this is how many of the else clause we'll do
  n2 <- n - n1                    # this is how many of the if clause we'll do
  col11 <- rep(0, n1)             # bam! we have a vector filled with 0s
  col12 <- (rnorm(n1) * 2) + 64   # bam! vector filled with deviates
  col21 <- rep(1, n2)             # bam! vector filled with 1s
  col22 <- (rnorm(n2) * 2.5) + 69 # bam! vector filled with deviates
  xm <- cbind(c(col11, col21), c(col12, col22)) # now we have a matrix, 2 cols, n rows
  return(xm[sample(nrow(xm)),])   # shuffle the rows, return matrix
}
No loops! The functionality might be obvious but in case it is not, I'll explain. First, n1 & n2 are simply to split the size of n appropriately (accounting for odd numbers).
Next, the binomial process (i.e., if(runif(1) < 0.5) {} else {}) per element can be eliminated since we know that in sim1, half of the matrix falls into the if condition and half in the else (see proof below). We don't need to decide for each element over and over and over which random path to take when we know that it's 50/50. So, we're going to do ALL the else 50% first: we fill a vector with n/2 0s (col11) and another with n/2 random deviates (mean = 0, sd = 1 by default) and, for each deviate, multiply by 2 and add 64, with result vector col12. That 50% is done.
Next, we finish the second 50% (the if portion). We fill a vector with n/2 1s (col21) and another with random deviates and, for each deviate, multiply by 2.5 and add 69.
We now have 4 vectors that we'll turn into a matrix.
STEP 1: Glue col11 (filled with n/2 0s) and col21 (filled with n/2 1s) together using the c function to get one vector (n elements).
STEP 2: Glue col12 and col22 (filled with the deviates) together using c to get another n-element vector. Note: the 0s/1s stay associated with the correct deviates from the 64/69 formulas.
STEP 3: Use cbind to make a matrix (xm) out of the two vectors: the 0/1 vector becomes column 1, the deviate vector column 2.
STEP 4: Get the number of rows in the matrix (which should just be n) using nrow.
STEP 5: Make a shuffled vector with all the row numbers randomly ordered using sample.
STEP 6: Make a new (unnamed) matrix putting xm's rows in the order given by the shuffled vector.
The point of steps 4-6 is just to randomly order the rows, since the binomial process in sim1 would have produced a random order of rows.
This version runs 866% faster!
> system.time({ sim1(500000)})
user system elapsed
1.341 0.179 1.527
> system.time({ sim2(500000)})
user system elapsed
0.145 0.011 0.158
If you're concerned about proof that this maintains the integrity of the binomial process, consider that the binomial process does two things: 1) It associates 1 with the 2.5*d+69 equation and 0 with the 2*d + 64 equation - the association is maintained since rows are shuffled intact; 2) 50% go in the if clause and 50% in the else clause, as proved below.
sim3 <- function(n) {
  a <- 0
  for (j in 1:n) {
    if (runif(1) < 0.5) {
      a <- a + 1
    }
  }
  return(a/n)
}
> sim3(50)
[1] 0.46
> sim3(5000)
[1] 0.4926
> sim3(10000)
[1] 0.5022
> sim3(5000000)
[1] 0.4997844
The binomial process produces 50% 1s and 50% 0s (column 1).
I'll do what I think is the most obvious step, namely move rnorm() out of the loop and take advantage of its vectorized nature (as rawr alluded to).
sim2 <- function(n) {
  xm <- matrix(nrow=n, ncol=2)
  d <- rnorm(n)
  for (i in 1:n) {
    if (runif(1) < 0.5) {
      xm[i,1] <- 1
      xm[i,2] <- 2.5*d[i] + 69
    } else {
      xm[i,1] <- 0
      xm[i,2] <- 2*d[i] + 64
    }
  }
  return(xm)
}
n <- 1e3
set.seed(1); system.time(s1 <- sim1(n)); system.time(s2 <- sim2(n))
# user system elapsed
# 0.019 0.004 0.023
# user system elapsed
# 0.010 0.000 0.009
t.test(s1[,2], s2[,2]) # Not identical, but similar, again alluded to by rawr
Just that gives us a reasonable improvement. A similar thing can be done with runif() as well, but I'll leave that to you.
If you want some reading material, I can recommend Hadley Wickham's Advanced R and its chapter on optimising code.
And in case you're wondering, it is indeed possible to eliminate both the loop and the conditionals.
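For instance, here is a sketch of one fully vectorized variant (my own, not necessarily what the answer had in mind): draw all the coin flips and all the deviates at once, then build both columns with vectorized arithmetic.
sim4 <- function(n) {
  heads <- runif(n) < 0.5            # one flip per row, all at once
  d <- rnorm(n)                      # one deviate per row, all at once
  cbind(as.numeric(heads),           # column 1: 1 for heads, 0 for tails
        ifelse(heads, 2.5 * d + 69, 2 * d + 64))
}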
One optimization I can suggest is to create the matrix with 0 as the default value. Once the matrix has been created filled with 0s, there is no need to assign 0 inside the loop.
The modified code looks like this:
sim1 <- function(n) {
  # create matrix filled with 0s
  xm <- matrix(0, nrow=n, ncol=2)
  for (i in 1:n) {
    d <- rnorm(1)
    if (runif(1) < 0.5) {
      xm[i,1] <- 1
      xm[i,2] <- 2.5*d + 69
    } else {
      # xm[i,1] <- 0 --- no longer needed
      xm[i,2] <- 2*d + 64
    }
  }
  return(xm)
}

Interpreting [R] Greatest Common Divisor (GCD) (and LCM) Function in {numbers} package

I don't have a background in programming (except for wrestling with R to get things done), and I'm trying to verbalize what the function for the greatest common divisor in the R {numbers} package does at each step. I need help with understanding the flow of steps within the function:
function (n, m)
{
  stopifnot(is.numeric(n), is.numeric(m))
  if (length(n) != 1 || floor(n) != ceiling(n) ||
      length(m) != 1 || floor(m) != ceiling(m))
    stop("Arguments 'n', 'm' must be integer scalars.")
  if (n == 0 && m == 0)
    return(0)
  n <- abs(n)
  m <- abs(m)
  if (m > n) {
    t <- n
    n <- m
    m <- t
  }
  while (m > 0) {
    t <- n
    n <- m
    m <- t %% m
  }
  return(n)
}
<environment: namespace:numbers>
For instance, in the if (m > n) {} part, n becomes t and ultimately becomes m? I'm afraid to ask, because it may be painfully obvious, but I don't know what is going on. The same applies, I guess, to the while part, with %% perhaps being modulo.
What it says is:
Stop if either m or n is not numeric, is more than one number, or has decimals, and return the message "Arguments 'n', 'm' must be integer scalars."
If both are zero, return zero.
Work with absolute values from now on.
Make sure that n > m, because the algorithm applied in the next step expects the larger number first. If this is not the case, swap them: first stash n in the temporary variable t, then assign m to n (at this point both n and m hold the original m), and finish by retrieving the value from t and assigning it to m.
Now they apply the Euclidean algorithm to find the GCD, in its more efficient modulo form: instead of repeated subtractions, the larger of the two numbers is replaced by its remainder when divided by the smaller.
Inside the loop, the current smaller number m will play the larger number's role in the next iteration, so it is assigned to n; to do that, the current n is first moved out of the way into the temporary variable t. Then the remainder from dividing the old larger number (now stored in t) by m replaces m.
As long as there is a remainder (a nonzero modulo), the process goes on, with what was the smaller number now playing the role of the big guy. When the remainder reaches zero, the last nonzero divisor, held in n, is returned.
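A worked trace of the while loop may help (hypothetical inputs n = 48, m = 18):
n <- 48; m <- 18
while (m > 0) {
  t <- n; n <- m; m <- t %% m
  cat("n =", n, " m =", m, "\n")
}
# n = 18  m = 12
# n = 12  m = 6
# n = 6   m = 0
n  # 6, the GCD of 48 and 18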
ADDENDUM:
Now that I know how to read this function, I see that it is limited to two numbers as input. So I entertained myself putting together a function that can work with three integers:
require(numbers)
GCF <- function(x, y, z){
  # tabulate() turns the prime factor list into exponents indexed by the prime:
  # e.g. primeFactors(12) is c(2, 2, 3), so tabulate() gives c(0, 2, 1)
  tab.x <- tabulate(primeFactors(x))
  tab.y <- tabulate(primeFactors(y))
  tab.z <- tabulate(primeFactors(z))
  max.len <- max(length(tab.x), length(tab.y), length(tab.z))
  tab_x <- c(tab.x, rep(0, max.len - length(tab.x)))
  tab_y <- c(tab.y, rep(0, max.len - length(tab.y)))
  tab_z <- c(tab.z, rep(0, max.len - length(tab.z)))
  GCD_elem <- numeric()
  for (i in 1:max.len){
    # each prime contributes itself raised to its smallest exponent
    GCD_elem[i] <- i^min(tab_x[i], tab_y[i], tab_z[i])
  }
  GrCD <- prod(GCD_elem)
  print(GrCD)
}
Also for the LCM:
LCM <- function(x, y, z){
  tab.x <- tabulate(primeFactors(x))
  tab.y <- tabulate(primeFactors(y))
  tab.z <- tabulate(primeFactors(z))
  max.len <- max(length(tab.x), length(tab.y), length(tab.z))
  tab_x <- c(tab.x, rep(0, max.len - length(tab.x)))
  tab_y <- c(tab.y, rep(0, max.len - length(tab.y)))
  tab_z <- c(tab.z, rep(0, max.len - length(tab.z)))
  LCM_elem <- numeric()
  for (i in 1:max.len){
    # each prime contributes itself raised to its largest exponent
    LCM_elem[i] <- i^max(tab_x[i], tab_y[i], tab_z[i])
  }
  LCM <- prod(LCM_elem)
  print(LCM)
}
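A quick sanity check of both helpers (my own example values):
GCF(12, 18, 24)  # 6  = 2^1 * 3^1
LCM(12, 18, 24)  # 72 = 2^3 * 3^2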

Recursion in a prime generator

I'm making a prime generator, and to make it more efficient, I'm trying to test numbers only against the primes I've already found, rather than against all numbers less than the square root of the number being tested. I'm trying to get a to be my list of primes, but I'm not sure how to make it recur inside my second for loop. I think this is only testing against a <- 2 and not a <- c(a,i).
x <- 3:1000
a <- 2
for (i in x) {
  for (j in a) {
    if (i %% j == 0) {
      next
    } else {
      a <- unique(c(a, i))
    }
  }
}
a
The solution might be to cut out the second loop and instead compare your proposed prime to the entire vector at once, like:
x <- 3:1000
a <- 2
for (i in x) {
if (!any(i %% a == 0)) {
a <- c(a,i)
}
}
That seemed to work for me.
A non-recursive modification, using a simple prime function that's about as fast as you can make this in R, is below. Rather than cycling through each individual value and testing its primality, it removes all of the multiples of primes in big chunks. This isolates each subsequent remaining value as a prime. So, it takes out the 2x values, then the 3x values; 4 is already gone, so the 5x values go next. It's the most efficient way to do it in R.
primest <- function(n){
  p <- 2:n
  i <- 1
  while (p[i] <= sqrt(n)) {
    # keep values not divisible by p[i], plus p[i] itself
    p <- p[p %% p[i] != 0 | p == p[i]]
    i <- i + 1
  }
  p
}
(you might want to see this stack question for faster methods using a sieve and also my timings of the function. What's above will run 50, maybe 500x faster than the version you're working from.)
