R Poisson simulation

I need to write a function which calculates the number of arrivals until time t in n trials. The arguments should be lambda (which I can assume to be between 0.1 and 1), the time t (which I can assume to be less than or equal to 1), and the number of counts to be sampled.
I've previously written a function which takes a vector of length n which has the first n-1 elements as inter-event times and the nth element as the time t, and it counts the number of events which occur before t.
inp <- readline(prompt="Input vector with each element separated by a space: ")
inp <- strsplit(inp," ")
inp <- as.integer(as.vector(inp[[1]]))
t <- tail(inp, n=1)
c.e <- function(x) {
inp = x
stopped = NA
for (i in seq_along(inp)) {
runsum <- sum(inp[1:i])
cat("The sum of the", i, "first elements is", runsum, "\n")
if (runsum > tail(inp, 1)) {
stopped = i - 1
break
}
}
stopped
}
cat(c.e(inp), "events occur in", t, "time units")
(e.g. inputting 1 2 3 4 7 would output that 3 events occur in 7 time units)
I think I need to use, and possibly edit, this function to get it to do what I need it to do, but I'm really not sure how. Any help would be appreciated :)

You could edit this into your old function if you want the printing and the input handling that you've already got, but as for what you've actually asked for, it seems like all that you need is counts <- function(lambda, t, n) rpois(n, lambda * t).
rpois generates the trial results: n is the number of trials, and since the number of arrivals up to time t in a Poisson process with rate lambda is Poisson distributed with mean lambda * t, each draw is the count for one trial.
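As a sanity check that ties this back to the inter-event times in the question, here is a minimal sketch. It assumes the arrivals form a homogeneous Poisson process with rate lambda, so inter-event times are exponential with rate lambda and the count up to time t has mean lambda * t. The names count_direct and count_by_interarrival are made up for illustration.

```r
# Hypothetical helpers; both assume a homogeneous Poisson process with rate lambda.

# Draw the n counts directly: arrivals up to time t are Poisson(lambda * t).
count_direct <- function(lambda, t, n) {
  rpois(n, lambda * t)
}

# Simulate each trial from exponential inter-event times and count the
# events whose cumulative arrival time does not exceed t.
count_by_interarrival <- function(lambda, t, n) {
  vapply(seq_len(n), function(trial) {
    elapsed <- 0
    events <- 0L
    repeat {
      elapsed <- elapsed + rexp(1, rate = lambda)
      if (elapsed > t) break
      events <- events + 1L
    }
    events
  }, integer(1))
}

set.seed(1)
direct <- count_direct(lambda = 0.5, t = 1, n = 10000)
simulated <- count_by_interarrival(lambda = 0.5, t = 1, n = 10000)
c(mean(direct), mean(simulated))  # both should be close to lambda * t = 0.5
```

The second helper is slower but mirrors the "sum inter-event times until t is exceeded" logic of the original function, which makes it a useful cross-check.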

Related

Function that returns the numbers of subintervals?

I'm trying to create a function that takes a vector (file) and returns how many subintervals of various sizes fit without exceeding the length of the vector. The function works well for many values, but for some specific values it exceeds the length and I don't know why. A reproducible example follows; note that when box_size = 101 (the subinterval size) the function returns ninbox = 5 (the number of subintervals) and not 4, as it should.
file = rnorm(500);N<-length(file)
box_size <-c(92,101,111)
j=1;ninbox2<-0;aux_ninbox<-0;aux_ninbox[1]<-box_size[j];ninbox<-0;sum_box<-0
for(j in 1:length(box_size)){
while(aux_ninbox<=N){
ninbox<-ninbox+1
sum_box[ninbox]<- box_size[j]
aux_ninbox<-sum_box[ninbox]+ aux_ninbox
ninbox2[j]<-ninbox
}
aux_ninbox<-0;aux_ninbox[1]<-box_size[j];ninbox<-0;sum_box<-0
}
ninbox2
For instance, if the subinterval size is 101 and the vector length is 500, the function counts how many subintervals of size 101 fit in 500. In this case, 101+101+101+101 = 404 (4 subintervals), because the next one would exceed the length of the vector (500). The function has an error, because it returns 5 for intervals of size 101. But note that for intervals of size 92 and 111 it works perfectly!
Okay, now I get it. I simplified your code a little bit:
file <- rnorm(500)
N <- length(file)
box_size <- c(92,101,111)
ninbox2 <- 0
for (j in 1:length(box_size)){
aux_ninbox <- box_size[j]
ninbox <- 0
sum_box <- 0
while (aux_ninbox<=N){
ninbox <- ninbox+1
sum_box[ninbox] <- box_size[j]
aux_ninbox <- sum_box[ninbox]+ aux_ninbox
ninbox2[j] <- ninbox
}
}
ninbox2
There were a few issues with where your variables were initialized:
aux_ninbox[1] <- box_size[j]
should not be assigned at the end of your for-loop. You have to put it at the beginning, before the while-loop starts. That caused your error, since the second iteration of your for-loop started from the previous box_size of 92 instead of 101.
If you just want to get the number of boxes, you could simply use
N %/% box_size
This performs integer division of the length of your vector by each box size, discarding the remainder. See "Arithmetic Operators" in the R help (?Arithmetic) for more information about integer division (%/%) and modulo (%%).
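For the numbers in the question, a short sketch of the integer-division approach (the variable names n_boxes and left_over are only illustrative):

```r
N <- 500
box_size <- c(92, 101, 111)

n_boxes <- N %/% box_size   # whole boxes that fit: 5 4 4
left_over <- N %% box_size  # elements not covered by a whole box: 40 96 56
```

Both operators are vectorised, so no loop over box_size is needed.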

How to determine the time-complexity of a simple palindrome R code?

I am pretty new to calculating time-complexity of an algorithm or a code, so I'm not sure what will be the complexity of this next function:
isPalindrome <- function(num){
if(num < 0) return(F)
rev <- 0
orig_num <- num
while(num != 0){
pop <- num %% 10
num <- num %/% 10
rev <- rev*10 + pop
}
if(orig_num == rev) return(T)
else return(F)
}
And calling the function, e.g. isPalindrome(122221) will return TRUE.
The basic idea is that a reverse number is being calculated and then compared against the original number, if they are equal then it is a palindrome.
My basic intuition was that in order to calculate the reverse number the while loop will go through every digit, so e.g for a 4 digit number like 1221 there will be 4 actions to be made (with some execution time to complete each), and so if my number becomes 2 times larger with respect to its digits, e.g 12222221 then I will need 8 actions to be made. Then, my input grew by 2 and time also grew by 2, so the time-complexity should be O(n). Is this correct?
Your intuition is right: your algorithm runs in O(n) with respect to the number of digits. Expressed in terms of the value n of the number itself, that is O(floor(log10(n)) + 1), which is the digit-counting function, i.e. O(log n).
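One way to see this empirically is to count how often the loop body runs. The helper below is a hypothetical stripped-down version of the while loop from the question, not part of the original function:

```r
# Hypothetical helper: counts how many times the while loop body runs.
count_loop_steps <- function(num) {
  steps <- 0
  while (num != 0) {
    num <- num %/% 10
    steps <- steps + 1
  }
  steps
}

count_loop_steps(1221)          # 4
count_loop_steps(12222221)      # 8
floor(log10(12222221)) + 1      # 8, the number of decimal digits
```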

The sum of the first n odd integers

I am trying to create a function that takes the sum of the first n odd integers, i.e the summation from i=1 to n of (2i-1).
If n = 1 it should output 1
If n = 2 it should output 4
I'm having problems using a for loop which only outputs the nth term
n <-2
for (i in 1:n)
{
y<-((2*i)-1)
}
y
In R we try to avoid for loops:
cumsum ( seq(1,2*n, by=2) )
Or just use 'sum' if you don't want the series of partial sums.
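A small worked example of both calls, with n = 5 chosen arbitrarily for illustration:

```r
n <- 5
odds <- seq(1, 2 * n, by = 2)  # the first n odd integers: 1 3 5 7 9
partial_sums <- cumsum(odds)   # 1 4 9 16 25
total <- sum(odds)             # 25
```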
There's actually no need to use a loop or to construct the sequence of the first n odd numbers here -- this is an arithmetic series so we know the sum of the first n elements in closed form:
sum.first.n.odd <- function(n) n^2
sum.first.n.odd(1)
[1] 1
sum.first.n.odd(2)
[1] 4
sum.first.n.odd(100)
[1] 10000
This should be a good deal more efficient than any solution based on for or sum because it never computes the elements of the sequence.
[[Just seeing the title -- the OP apparently knows the analytic result and wanted something else...]]
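The closed form follows from sum_{i=1}^{n} (2i - 1) = 2 * n(n + 1)/2 - n = n^2. A quick sketch that checks it against the sequence-based sum for n = 1 to 100 (the function names here are made up for the comparison):

```r
sum_first_n_odd_closed <- function(n) n^2
sum_first_n_odd_seq <- function(n) sum(seq(1, 2 * n, by = 2))

ns <- 1:100
agree <- vapply(
  ns,
  function(n) sum_first_n_odd_closed(n) == sum_first_n_odd_seq(n),
  logical(1)
)
all(agree)  # TRUE
```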
Try this:
sum=0
n=2
for(i in seq(1,2*n,2)){
sum=sum+i
}
But, of course, R is rather slow when working with loops. That's why one should avoid them.

How can I skip increments in R 'for' loop?

I need to find stretches of values above 0 in a numeric vector, where each stretch has at least 10 members. I do not want to check every single position, as that would be very time-consuming (the vector has over 10 million elements).
Here is what I'm trying to do (very preliminary as I can't figure out how to skip increments in for loop):
1. Check if x[i] (start position) is positive.
a) if positive, check to see if x[i+10] (end position) is positive (since we want at least length 10 of positive integers)
* if positive, check every position in between to see if positive
* if negative, move to x[i+11], skip positions (e.g. new start position is x[i+12]) in between start & end positions since we would not get >10 members if negative end position is included.
x <- rnorm(50, mean=0, sd=4)
for(i in 1:length(x)){
if(x[i]>0){ # IF START POSITION IS POSITIVE
flag=1
print(paste0(i, ": start greater than 1"))
if(x[i+10]>0){ # IF END POSITION POSITIVE, THEN CHECK ALL POSITIONS IN BETWEEN
for(j in (i+1):(i+9)){ # parentheses needed: i+1:i+9 parses as i + (1:i) + 9
if(x[j]>0){ # IF POSITION IS POSITIVE, CHECK NEXT POSITION IF POSITIVE
print(paste0(j, ": for j1"))
}else{ # IF POSITION IS NEGATIVE, THEN SKIP CHECKING & SET NEW START POSITION
print(paste0(j, ": for j2"))
i <- i+11
break;
}
}
}else{ # IF END POSITION IS NOT POSITIVE, START CHECK ONE POSITION AFTER END POSITION
i <- i+11
}
}
}
The issue I have is that even when I manually increment i, the for loop i value masks the new set value. Appreciate any insight.
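The masking described here is how R's for loop behaves: i is reassigned from the sequence at the start of every iteration, so changes made inside the body are lost. A minimal sketch of the difference, using a while loop when the index has to jump (the vectors visited_for and visited_while exist only to record which indices were seen):

```r
# A for loop reassigns i from the sequence on every iteration,
# so changing i inside the body does not skip anything.
visited_for <- integer(0)
for (i in 1:6) {
  visited_for <- c(visited_for, i)
  i <- i + 2L
}
visited_for    # 1 2 3 4 5 6

# With a while loop the index is under your control.
visited_while <- integer(0)
i <- 1L
while (i <= 6L) {
  visited_while <- c(visited_while, i)
  i <- i + 3L
}
visited_while  # 1 4
```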
I dunno if this approach is as efficient as Curt F's, but how about
runs <- rle(x>0)
And then working with the regions defined by runs$lengths>=10 & runs$values==TRUE (">=" because you want at least 10 members)?
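To turn the rle output into start and end positions, something along these lines should work. This is a sketch that assumes x has no NA values; the function name positive_runs_rle and the min_len argument are made up here.

```r
# Hypothetical helper built on rle(); assumes x contains no NA values.
positive_runs_rle <- function(x, min_len = 10) {
  runs <- rle(x > 0)
  ends <- cumsum(runs$lengths)          # last index of every run
  starts <- ends - runs$lengths + 1     # first index of every run
  keep <- runs$values & runs$lengths >= min_len
  data.frame(start = starts[keep], end = ends[keep])
}

# Positive runs of length 12, 3 and 10, separated by single negatives.
x <- c(-1, rep(1, 12), -1, rep(1, 3), -2, rep(2, 10))
positive_runs_rle(x)
# start 2 and 19, end 13 and 28
```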
Here is a solution that finds stretches of ten positive numbers in a vector of length ten million. It does not use the loop approach suggested in the OP.
The idea here is to take the cumulative sum of the logical expression vec>0. The difference between the cumulative sums at positions n and n-10 will be 10 only if all ten values at positions n-9 through n are positive.
filter is an easy and relatively fast way to calculate these differences.
#generate random data
vec <- runif(1e7,-1,1)
#cumulative sum
csvec <- cumsum(vec>0)
#construct a filter that will find the difference between the nth value with the n-10th value of the cumulative sign vector
f11 <- c(1,rep(0,9),-1)
#apply the filter
fv <- filter(csvec, f11, sides = 1)
#find where the difference as computed by the filter is 10
inds <- which(fv == 10)
#check a few results
> vec[(inds[1]-9):(inds[1])]
[1] 0.98457526 0.03659257 0.77507743 0.69223183 0.70776891 0.34305865 0.90249491 0.93019927 0.18686722 0.69973176
> vec[(inds[2]-9):(inds[2])]
[1] 0.0623790 0.8489058 0.3783840 0.8781701 0.6193165 0.6202030 0.3160442 0.3859175 0.8416434 0.8994019
> vec[(inds[200]-9):(inds[200])]
[1] 0.0605163 0.7921233 0.3879834 0.6393018 0.2327136 0.3622615 0.1981222 0.8410318 0.3582605 0.6530633
#check all the results
> prod(sapply(1:length(inds),function(x){prod(sign(vec[(inds[x]-9):(inds[x])]))}))
[1] 1
I played around with system.time() to see how long the various steps took. On my not-very-powerful laptop the longest step was filter(), which took just over half a second for a vector of length ten million.
Vectorised solution using only basic commands:
x <- runif(1e7,-1,1) # generate random vector
y <- which(x<=0) # find boundaries i.e. negatives and zeros
dif <- diff(y) # find distances between consecutive boundaries
drange <- which(dif > 10) # find distances more than 10
starts <- y[drange]+1 # starting positions of sequence
ends <- y[drange+1]-1 # last positions of sequence
The first range you want is from x[starts[1]] to x[ends[1]], etc.
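One thing to watch with the boundary approach is that a positive run touching the very start or end of the vector has no boundary on one side. A possible variant, sketched here with sentinel boundaries at 0 and length(x) + 1 (the function name and the min_len argument are assumptions, not part of the answer above):

```r
# Hypothetical variant of the boundary approach with sentinel positions,
# so that runs touching either end of the vector are also reported.
find_positive_runs <- function(x, min_len = 10) {
  bounds <- c(0, which(x <= 0), length(x) + 1)
  gaps <- diff(bounds) - 1            # number of positives between boundaries
  keep <- which(gaps >= min_len)
  data.frame(start = bounds[keep] + 1, end = bounds[keep + 1] - 1)
}

# Runs of 10, 3 and 12 positives; the first and last touch the vector ends.
x <- c(rep(1, 10), -1, rep(1, 3), 0, rep(2, 12))
find_positive_runs(x)
# start 1 and 16, end 10 and 27
```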

Aligning sequences with missing values

The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question:
I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
setup
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
rep( seq(0,length.out=5,by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors this is what it looks like (remember, think of these as times):
What I've tried
I want to:
1. calculate the shift (theLag in my example above)
2. calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1]-theLag. In fact, we have ground.truth[1] == observations[1+lagI]-theLag for some lagI.
To calculate this, I thought I'd use cross-correlation (ccf function).
However, whenever I do this I get a lag with a max. cross-correlation of 0, meaning ground.truth[1] == observations[1] - theLag. But I've tried this in examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. modify idx_to_keep to make sure it doesn't have 1 in it).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?) so I was going to work it out later.
Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
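Here is a reproducible sketch of both steps. To keep the comparisons exact it assumes an integer lag and a fixed set of dropped samples (both chosen for illustration); with a non-integer lag, floating-point rounding may require rounding the differences before tabulating or matching.

```r
# Ground truth as in the question.
ground.truth <- rep(seq(1, by = 4, length.out = 10), 5) +
  rep(seq(0, length.out = 5, by = 4 * 10 + 30), each = 10)

# Assumptions for a reproducible example: a fixed set of dropped samples
# and an integer lag, so that the subtraction is exact.
dropped <- c(1, 7, 12, 18, 23, 29, 34, 41, 46, 50)
idx_to_keep <- setdiff(seq_along(ground.truth), dropped)
theLag <- 37
observations <- ground.truth[idx_to_keep] + theLag

# Every observation contributes the true lag exactly once to the table.
diffs <- outer(observations, ground.truth, "-")
counts <- table(diffs)
estimated_lag <- as.numeric(names(counts)[counts == length(observations)])

# Map each observation back to its position in ground.truth.
idx <- which(ground.truth %in% (observations - estimated_lag))
```

Note that outer() builds a length(observations) by length(ground.truth) matrix, so this is only practical for fairly short sequences.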
The following should work if your time series are not too long. You have two vectors of time-stamps, the second one being a shifted and incomplete copy of the first, and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one, compute how bad the alignment is, by matching each observed timestamp with the closest "truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
# Put all the values in a data.frame
d1 <- data.frame(g="truth", value=x)
d2 <- data.frame(g="observed", value=y+u)
d <- rbind(d1,d2)
# For each observed value, find the next truth value
# (we could take the nearest, on either side,
# but it would be more complicated)
d <- sqldf("
SELECT A.g, A.value,
( SELECT MIN(B.value)
FROM d AS B
WHERE B.g='truth'
AND B.value >= A.value
) AS next
FROM d AS A
WHERE A.g = 'observed'
")
# If u is greater than the lag, there are missing values.
# If u is smaller, the differences decrease
# as we approach the lag.
if(any(is.na(d))) {
return(Inf)
} else {
return( sum(d$`next` - d$value, na.rm=TRUE) )
}
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time
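If you would rather not depend on sqldf, the same "distance to the next truth value" loss can be sketched in base R. The function name next_truth_loss and the small data set below are made up for illustration; the function takes the two vectors as arguments instead of reading x and y from the enclosing environment.

```r
# Hypothetical base-R version of the loss: for each shifted observation,
# take the distance to the next truth value at or after it.
next_truth_loss <- function(u, truth, observed) {
  shifted <- observed + u
  next_value <- vapply(shifted, function(v) {
    candidates <- truth[truth >= v]
    if (length(candidates) > 0) min(candidates) else NA_real_
  }, numeric(1))
  if (anyNA(next_value)) return(Inf)
  sum(next_value - shifted)
}

truth <- c(1, 4, 9, 16, 25)
lag <- 2.5
observed <- truth[c(1, 2, 4, 5)] + lag

next_truth_loss(-2.5, truth, observed)  # 0: perfectly aligned
next_truth_loss(-3, truth, observed)    # 2: each observation is 0.5 early
next_truth_loss(0, truth, observed)     # Inf: last observation has no next truth value
```

As in the sqldf version, the loss is zero when u equals minus the lag, so the lag estimate is the negated minimiser.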

Resources