How to compute autocorrelation function in R

I'd like to know the best way to compute autocorrelation function as defined below.
For i=1,2,... I would like to compute the i-th autocorrelation function acf.
This is the sum, from k = 1 to n-i, of +1 if v(k) = v(k+i) or -1 if v(k) is different from v(k+i), where n is the length of a vector.
For example:
if v<-c(0,1,1,0,0) and i = 2. Then
acf(v) = (-1) + (-1) + (-1) = -3
Thanks!

What about using R-help? There you should have found the acf function.
v = c(1,1,0,0,1,0,1,0,1)
acf(v,plot=F) -> acf_v
acf_v[2]

I created a function to do it, but I am still looking for a shorter and more efficient way.
Here is the function:
> v<-c(0,1,1,1,0,1,1)
> acf_bit <-function(vec,lag) {
+ m<-length(vec)
+ t<-0
+ for (k in 1:(m-lag)) {
+ if (vec[k]==vec[k+lag]) {t<-t+1}
+ else {t<-t-1}
+ }
+ return(t)
+ }
> acf_bit(v,2)
[1] -1
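The loop can also be written without explicit iteration. The following is a minimal sketch rather than a tested library routine, and the name acf_bit_vec is made up for illustration. It compares the vector with a copy of itself shifted by lag, then maps matches to +1 and mismatches to -1.

```r
# Hypothetical helper: vectorized version of the +1/-1 agreement count.
acf_bit_vec <- function(vec, lag) {
  n <- length(vec)
  stopifnot(lag >= 1, lag < n)
  # TRUE where v(k) == v(k + lag), for k = 1 .. n - lag
  same <- vec[seq_len(n - lag)] == vec[(lag + 1):n]
  # TRUE -> +1, FALSE -> -1, then add up
  sum(2 * same - 1)
}

acf_bit_vec(c(0, 1, 1, 1, 0, 1, 1), 2)  # -1, same as the loop version
acf_bit_vec(c(0, 1, 1, 0, 0), 2)        # -3, the example from the question
```

Note that this is the agreement count defined in the question, not the sample autocorrelation returned by stats::acf().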

Related

Multi-parameter optimization in R

I'm trying to estimate parameters that will maximize the likelihood of a certain event. My objective function looks like this:
event_prob = function(p1, p2) {
x = ((1-p1-p2)^4)^67 *
((1-p1-p2)^3*p2)^5 *
((1-p1-p2)^3*p1)^2 *
((1-p1-p2)^2*p1*p2)^3 *
((1-p1-p2)^2*p1^2) *
((1-p1-p2)*p1^2*p2)^2 *
(p1^3*p2) *
(p1^4)
return(x)
}
In this case, I'm looking for p1 and p2 in [0,1] that will maximize this function. I tried using optim() in the following manner:
aaa = optim(c(0,0),event_prob)
but I'm getting an error "Error in fn(par, ...) : argument "p2" is missing, with no default".
Am I using optim() wrong? Or is there a different function (package?) I should be using for multi-parameter optimization?
This problem can in fact be solved analytically.
The objective function simplifies to
F(p1,p2) = (1-p1-p2)^299 * p1^18 * p2^11
which is to be maximised over the region
C = { (p1,p2) | 0<=p1, 0<=p2, p1+p2<=1 }
Note that F is 0 if p1=0 or p2=0 or p1+p2=1, while if none of those hold then F is positive. Thus the maximum of F occurs in the interior of C.
Taking the log:
f(p1,p2) = 299*log(1-p1-p2) + 18*log(p1) + 11*log(p2)
In fact it is as easy to solve the more general problem: maximise f over C, where
f(p1,..,pN) = b*log(1-p1-..-pN) + Sum{ a[j]*log(p[j]) }
where b and each a[j] are positive and
C = { (p1,..,pN) | 0<p[j], j=1..N and p1+..+pN<1 }
The critical point occurs where all the partial derivatives of f are zero, which is at
-b/(1-p1-..-pN) + a[j]/p[j] = 0, j=1..N
which can be written as
b*p[j] + a[j]*(p1+..+pN) = a[j], j=1..N
or
M*p = a
where M = b*I + a*Ones', and Ones is a vector with each component equal to 1.
The inverse of M is
inv(M) = (1/b)*(I - a*Ones'/(b + Ones'*a))
Thus the unique critical point is
p^ = inv(M)*a
= a/(b + Sum{ a[i] })
Since there is a maximum, and only one critical point, the critical point must be the maximum.
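The closed form can be checked numerically. The sketch below is an illustration only: the table `factors` is my own transcription of the exponents in the question's event_prob (an assumption worth double-checking against the function), and all variable names are made up. It collects the total exponents, applies p^ = a/(b + Sum a), and confirms that the partial derivatives of f vanish there.

```r
# Rows: exponent of (1-p1-p2), of p1, of p2 inside one factor of event_prob,
# and the outer power applied to that factor (transcribed from the question).
factors <- rbind(
  c(q = 4, p1 = 0, p2 = 0, times = 67),
  c(3, 0, 1, 5),
  c(3, 1, 0, 2),
  c(2, 1, 1, 3),
  c(2, 2, 0, 1),
  c(1, 2, 1, 2),
  c(0, 3, 1, 1),
  c(0, 4, 0, 1)
)

# Total exponent of each term in the simplified product
expo <- colSums(factors[, c("q", "p1", "p2")] * factors[, "times"])
b <- expo[["q"]]
a <- expo[c("p1", "p2")]

# Closed-form critical point
p_hat <- a / (b + sum(a))

# Partial derivatives of f at p_hat; both should be numerically zero
grad <- -b / (1 - sum(p_hat)) + a / p_hat

print(expo)
print(p_hat)
print(grad)
```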
Based on Erwin Kalvelagen's comment: Redefine your function event_prob:
event_prob = function(p) {
p1 = p[1]
p2 = p[2]
x = ((1-p1-p2)^4)^67 *
((1-p1-p2)^3*p2)^5 *
((1-p1-p2)^3*p1)^2 *
((1-p1-p2)^2*p1*p2)^3 *
((1-p1-p2)^2*p1^2) *
((1-p1-p2)*p1^2*p2)^2 *
(p1^3*p2) *
(p1^4)
return(x)
}
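You may want to set limits to ensure that p1 and p2 fulfill your constraints. Note that optim() minimizes by default, so pass control=list(fnscale=-1) to maximize, and start from an interior point: at c(0.5,0.5) the factor (1-p1-p2) is 0, so the function value and its gradient are 0 there. The box limits alone do not enforce p1+p2<=1, and because the function values are extremely small the default tolerances may stop the search early; maximizing the log of the objective is numerically more reliable.
optim(c(0.1,0.1),event_prob,method="L-BFGS-B",lower=0,upper=1,control=list(fnscale=-1))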
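As a rough sketch of the numerical route, one can maximize the log of the objective instead of the raw product. The function name log_event_prob is hypothetical, and the choice of the default Nelder-Mead method with a -Inf return outside the feasible region is my own assumption about a reasonable setup, not something from the answers.

```r
# Log of event_prob, written factor by factor to mirror the question.
# Returns -Inf outside the region p1 > 0, p2 > 0, p1 + p2 < 1.
log_event_prob <- function(p) {
  p1 <- p[1]
  p2 <- p[2]
  q <- 1 - p1 - p2
  if (q <= 0 || p1 <= 0 || p2 <= 0) return(-Inf)
  67 * log(q^4) +
    5 * log(q^3 * p2) +
    2 * log(q^3 * p1) +
    3 * log(q^2 * p1 * p2) +
    log(q^2 * p1^2) +
    2 * log(q * p1^2 * p2) +
    log(p1^3 * p2) +
    log(p1^4)
}

# fnscale = -1 turns optim's minimization into maximization.
# Nelder-Mead (the default) tolerates non-finite values away from the start.
fit <- optim(c(0.1, 0.1), log_event_prob, control = list(fnscale = -1))
fit$par
fit$convergence  # 0 indicates that optim reports successful convergence
```

The maximizer of the log is the same as the maximizer of the original function, since log is strictly increasing.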

What is wrong with my R for-loop that sums a series?

Here is my function that does a loop:
answer = function(a,n) {
for (k in 0:n) {
x =+ (a^k)/factorial(k)
}
return(x)
}
answer(1,2) should return 2.5 as it is the calculated value of
1^0 / 0! + 1^1 / 1! + 1^2 / 2! = 1 + 1 + 0.5 = 2.5
But I get
answer(1,2)
#[1] 0.5
Looks like it fails to accumulate all three terms and just stores the newest value every time. += does not work so I used =+ but it is still not right. Thanks.
answer = function(a,n) {
x <- 0 ## initialize the accumulator
for (k in 0:n) {
x <- x + (a^k)/factorial(k) ## note how to accumulate value in R
}
return(x)
}
answer(1, 2)
#[1] 2.5
There is also a "vectorized" solution:
answer = function(a,n) {
x <- a ^ (0:n) / factorial(0:n)
return(sum(x))
}
In this case you don't need to initialize anything. R will allocate memory behind that <- and sum.
You are using Taylor expansion to approximate exp(a). See this Q & A on the theme. You may want to pay special attention to the "numerical convergence" issue mentioned in my answer.
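To see why the original loop returned 0.5, it may help to look at how R reads `=+`. The snippet below is a small illustration with made-up values; taylor_exp is a hypothetical name for the vectorized sum.

```r
# R has no += operator. `x =+ 5` is parsed as x = (+5):
# a plain assignment of the value with a unary plus, so nothing accumulates.
x <- 10
x =+ 5
x   # 5, not 15

# Vectorized partial sum of the Taylor series of exp(a)
taylor_exp <- function(a, n) sum(a^(0:n) / factorial(0:n))

taylor_exp(1, 2)                  # 1 + 1 + 0.5
abs(taylor_exp(1, 15) - exp(1))   # the gap shrinks quickly as n grows
```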

Automating a function to return an expression with math constants and unknowns

I am trying to build a transitions matrix from Panel data observations in order to obtain the ML estimators of a weighted transitions matrix. A key step is obtaining the individual likelihood function for individuals. Say you have the following data frame:
ID Feature1 Feature2 Transition
120421006 10000 1 ab
120421006 12000 0 ba
120421006 10000 1 ab
123884392 3000 1 ab
123884392 2000 0 ba
908747738 1000 1 ab
The idea is to return, for each agent, the log-likelihood of his path. For agent 120421006, for instance, this boils down to (ignoring the initial term)
LL = log(exp(Yab)/(1 + exp(Yab))) + log(exp(Yba)/(1 + exp(Yba))) +
log(exp(Yab)/(1 + exp(Yab)))
i.e,
log(exp(Y_transition)/(1 + exp(Y_transition)))
where Y_transition = x*Feature1 + y*Feature2 for that transition, and x and y are unknowns.
For example, for individual 120421006, this would boil down to an expression with three elements, since he transitions thrice, and the function would return
LL = log(exp(10000x + 1y)/ 1 + exp(10000x + 1y)) +
log(exp(12000x + 0y)/ 1 + exp(12000x + 0y)) +
log(exp(10000x + 1y)/ 1 + exp(10000x + 1y))
And here's the catch: I need x and y to return as unknowns, since the objective is to obtain a sum over the likelihoods of all individuals in order to pass it to an ML estimator. How would you automate a function that returns this output for all IDs?
Many thanks in advance
First you have to decide how flexible your function has to be. I am leaving it fairly rigid, but you can adapt it as you like.
You have to supply an initial guess for the parameters, which is passed to the optimizer, and then declare the data and variables to be used in the estimation.
Assuming you will always have only 2 variables (you can change it later):
y <- function(initial_param, data, features){
x = initial_param[1]
y = initial_param[2]
F1 = data[, features[1]]
F2 = data[, features[2]]
LL = log(exp(F1 * x + F2 * y) / (1 + exp(F1 * x + F2 * y)))
return(-sum(LL))
}
This function returns the negative of the summed log-likelihood, because most optimizers look for the parameters at which the function reaches a minimum by default.
To find your parameters, pass the likelihood function y, the starting values, the data set, and a vector with the names of your variables to nlm(). Note that nlm() takes the starting values in its argument p and hands them to y as its first argument:
nlm(f = y, p = your_starting_guess, data = your_data,
features = c("name_of_first_feature", "name_of_second_feature"), iterlim=1000, hessian=FALSE)
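With features as large as 10000, exp(F1 * x + F2 * y) overflows for moderate x, and the ratio becomes NaN. A possible workaround, sketched here under my own naming (neg_loglik and paths are hypothetical), is to use plogis() with log.p = TRUE, which computes log(exp(eta)/(1 + exp(eta))) directly.

```r
# Data frame rebuilt from the table in the question
paths <- data.frame(
  ID = c(120421006, 120421006, 120421006, 123884392, 123884392, 908747738),
  Feature1 = c(10000, 12000, 10000, 3000, 2000, 1000),
  Feature2 = c(1, 0, 1, 1, 0, 1)
)

# Negative log-likelihood; plogis(eta, log.p = TRUE) is the log of the
# logistic function, evaluated without forming exp(eta) explicitly.
neg_loglik <- function(param, data, features) {
  eta <- data[[features[1]]] * param[1] + data[[features[2]]] * param[2]
  -sum(plogis(eta, log.p = TRUE))
}

feats <- c("Feature1", "Feature2")
neg_loglik(c(0, 3), paths, feats)   # agrees with the naive formula here
neg_loglik(c(1, 1), paths, feats)   # stays finite where exp() would overflow

# Per-ID log-likelihood contributions at x = 0, y = 3
eta <- paths$Feature1 * 0 + paths$Feature2 * 3
tapply(plogis(eta, log.p = TRUE), paths$ID, sum)
```

The function has the same argument order as y above, so it can be passed to nlm() in the same way.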
Create the function:
fun=function(x){
a=paste0("exp(",x[1],"*x","+",x[2],"*y)")
parse(text=paste("sum(",paste0("log(",a,"/(1+",a,"))"),")"))
}
by(test[2:3],test[,1],fun)
sum(log(exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y)/(1 +
exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y))))
--------------------------------------------------------------------
sum(log(exp(c(3000, 2000) * x + c(1, 0) * y)/(1 + exp(c(3000,
2000) * x + c(1, 0) * y))))
--------------------------------------------------------------------
sum(log(exp(1000 * x + 1 * y)/(1 + exp(1000 * x + 1 * y))))
taking an example of x=0 and y=3 we can solve this:
x=0
y=3
sapply(by(test[2:3],test[,1],fun),eval)
[1] -0.79032188 -0.74173453 -0.04858735
in your example above:
x=0
y=3
log(exp(10000*x + 1*y)/ (1 + exp(10000*x + 1*y))) + # note the parentheses around the denominator
log(exp(12000*x + 0*y)/ (1 + exp(12000*x + 0*y))) +
log(exp(10000*x + 1*y)/( 1 + exp(10000*x + 1*y)))
[1] -0.7903219
To get what you need within the comments:
fun1=function(x){
a=paste0("exp(",x[1],"*x","+",x[2],"*y)")
paste("sum(",paste0("log(",a,"/(1+",a,"))"),")")
}
paste(by(test[2:3],test[,1],fun1),collapse = "+")
[1] "sum( log(exp(c(10000, 12000, 10000)*x+c(1, 0, 1)*y)/(1+exp(c(10000, 12000, 10000)*x+c(1, 0, 1)*y))) )+sum( log(exp(c(3000, 2000)*x+c(1, 0)*y)/(1+exp(c(3000, 2000)*x+c(1, 0)*y))) )+sum( log(exp(1000*x+1*y)/(1+exp(1000*x+1*y))) )"
But it is not clear why you would group them and then sum all of the groups. That is the same as summing over all rows without grouping by ID, which would be simpler and faster.

Get derivative in R

I'm trying to take the derivative of an expression:
x = read.csv("export.csv", header=F)$V1
f = expression(-7645/2* log(pi) - 1/2 * sum(log(w+a*x[1:7644]^2)) + (x[2:7645]^2/(w + a*x[1:7644]^2)),'a')
D(f,'a')
x is simply an integer vector, a and w are the variables I'm trying to find by deriving. However, I get the error
"Function '[' is not in Table of Derivatives"
Since this is my first time using R I'm rather clueless what to do now. I'm assuming R has got some problem with my sum function inside of the expression?
After following the advice I now did the following:
y <- x[1:7644]
z <- x[2:7645]
f = expression(-7645/2* log(pi) - 1/2 * sum(log(w+a*y^2)) + (z^2/(w + a*y^2)),'a')
Deriving this gives me the error "sum is not in the table of derivatives". How can I make sure the expression considers each value of y and z?
Another Update:
y <- x[1:7644]
z <- x[2:7645]
f = expression(-7645/2* log(pi) - 1/2 * log(w+a*y^2) + (z^2/(w + a*y^2)))
d = D(f,'a')
uniroot(eval(d),c(0,1000))
I've eliminated the "sum" function and just entered y and z. Now, 2 questions:
a) How can I be sure that this is still the expected behaviour?
b) Uniroot doesn't seem to like "w" and "a" since they're just symbolic. How would I go about fixing this issue? The error I get is "object 'w' not found"
This should work:
Since you have two terms being added f+g, the derivative D(f+g) = D(f) + D(g), so let's separate both like this:
g = expression((z^2/(w + a*y^2)))
f = expression(- 1/2 * log(w+a*y^2))
See that sum() was removed from expression f, because the multiplying constant was moved into the sum() and the D(sum()) = sum(D()). Also the first constant was removed because the derivative is 0.
So:
D(-7645/2*log(pi) + sum(-1/2 * log(w+a*y^2)) + (z^2/(w + a*y^2))) = D( constant + sum(f) + g ) = sum(D(f)) + D(g)
Which should give:
sum(-(1/2 * (y^2/(w + a * y^2)))) + -(z^2 * y^2/(w + a * y^2)^2)
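The same result can be obtained with D() and then evaluated by supplying numeric values for w and a, which also avoids the "object 'w' not found" error. This is only a sketch: the data vector is made up, the names objective and score_a are hypothetical, and I assume both terms are summed over the observations so that the derivative is a single number.

```r
# Symbolic pieces without sum() or indexing, so D() can handle them
f <- quote(-1/2 * log(w + a * y^2))
g <- quote(z^2 / (w + a * y^2))
df_da <- D(f, "a")
dg_da <- D(g, "a")

x <- c(1, 2, 0.5, 3, 1.5)   # made-up data standing in for export.csv
y <- x[-length(x)]          # x[1:(n-1)]
z <- x[-1]                  # x[2:n]

# Objective (constant dropped) and its derivative with respect to a.
# The symbols are given values through the list passed to eval().
objective <- function(a, w) {
  vals <- list(a = a, w = w, y = y, z = z)
  sum(eval(f, vals)) + sum(eval(g, vals))
}
score_a <- function(a, w) {
  vals <- list(a = a, w = w, y = y, z = z)
  sum(eval(df_da, vals)) + sum(eval(dg_da, vals))
}

# Compare with a central finite difference at a = 0.3, w = 1
h <- 1e-6
numeric_da <- (objective(0.3 + h, 1) - objective(0.3 - h, 1)) / (2 * h)
c(symbolic = score_a(0.3, 1), numeric = numeric_da)
```

A function of a alone, such as function(a) score_a(a, w = 1), is the kind of argument uniroot() expects, provided the derivative actually changes sign on the chosen interval.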
expression takes only a single expr input, not a vector, and it is beyond R's abilities to vectorize that.
You can also do this with a for loop:
foo <- c("1+2","3+4","5*6","7/8")
result <- numeric(length(foo))
foo <- parse(text=foo)
for(i in seq_along(foo))
result[i] <- eval(foo[[i]])

Evaluating logarithm of expression, given logarithms of variables

I have to programmatically determine the value of the expression:
S = log(x1y1 + x2y2 + x3y3 ...)
Using only the values of:
lxi = log(xi)
lyi = log(yi)
Calculating anti-logs of each of lxi and lyi would probably be impractical and is not desired ...
Is there any way this evaluation can be broken down into a simple summation?
EDIT
I saw a C function somewhere that does the computation in a simple summation:
double log_add(double lx, double ly)
{
double temp,diff,z;
if (lx<ly) {
temp = lx; lx = ly; ly = temp;
}
diff = ly-lx;
z = exp(diff);
return lx+log(1.0+z);
}
The return values are added for each pair of values, and this seems to be giving the correct answer. But I'm not able to figure out how and why it's working!
The direct way is to perform two exponentiations:
ln(x+y) = ln(e^ln(x) + e^ln(y))
The log_add function uses a slightly different approach to get the same result with only one:
ln(x+y) = ln((x+y)x/x)
= ln((x+y)/x) + ln(x)
= ln(1 + y/x) + ln(x)
= ln(1 + e^ln(y/x)) + ln(x)
= ln(1 + e^(ln(y)-ln(x))) + ln(x)
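Applied to the original problem, log(xi*yi) = lxi + lyi, so S can be built by folding log_add over those sums. The following is an illustrative port to R, not the original author's code; log_sum_products is a hypothetical helper that does the same in one pass by factoring out the largest term.

```r
# R port of the C log_add from the question: log(x + y) from log(x), log(y)
log_add <- function(lx, ly) {
  hi <- max(lx, ly)
  lo <- min(lx, ly)
  hi + log1p(exp(lo - hi))   # exp() of a non-positive number cannot overflow
}

# Hypothetical helper: log(sum(x * y)) computed from the logs only
log_sum_products <- function(lx, ly) {
  l <- lx + ly               # log(x_i * y_i)
  m <- max(l)
  m + log(sum(exp(l - m)))   # factor out the largest term
}

x <- c(0.5, 2, 3)            # made-up values, used only to check the result
y <- c(4, 0.1, 7)

log_sum_products(log(x), log(y))
Reduce(log_add, log(x) + log(y))   # pairwise accumulation, as in the C code
log(sum(x * y))                    # direct computation for comparison
```

All three lines should print the same value up to rounding; the log-domain versions also remain usable when the individual products would underflow.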
