Given a -log10(P) value, I'd like to calculate the Z score in log space, how would I do that?
So, given the following code, how would I recode the last line so that it calculates Z from log10P in log space?
Z=10
log10P = -1*(pnorm(-abs(Z),log.p = T)*1/log(10) + log10(2))
Z== -1*(qnorm(10^-log10P/2)) # <- this needs to be in log space.
qnorm also has a log.p argument analogous to pnorm's, so you can reverse the operations that you used to get log10P in the first place (it took me a couple of tries to get this right ...)
I rearranged your log10P calculation slightly.
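To spell out the algebra the code below reverses (using Phi for the standard normal CDF): the stored value log10P is -log10(p) with p = 2*Phi(-|Z|), so log10P = -(log(Phi(-|Z|)) + log(2))/log(10). Solving for the log-probability gives log(Phi(-|Z|)) = -log10P*log(10) - log(2), which is exactly the quantity handed to qnorm(..., log.p=TRUE), so nothing ever has to be exponentiated back to linear space.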
log10P_from_Z <- function(Z) {
abs((pnorm(-abs(Z),log.p=TRUE)+log(2))/log(10))
}
Z_from_log10P <- function(log10P) {
-1*qnorm(-(log10P*log(10))-log(2), log.p=TRUE)
}
We can check the round-trip accuracy (i.e. convert from Z to -log10(p) and back, and see how close we get to the original value). This works perfectly for values around 20, but does incur a little bit of round-off error for large values (one would have to look more carefully to see whether anything can be remedied here).
zvec <- seq(20,400)
err <- sapply(zvec, function(z) {
abs(Z_from_log10P(log10P_from_Z(z))-z)
})
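A quick way to see how this error behaves (a small follow-up of my own, not part of the check above) is to summarise err or plot it against zvec:
summary(err)   # essentially zero near Z = 20, growing slowly for larger Z
plot(zvec, err, type = "l", xlab = "Z", ylab = "absolute round-trip error")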
Using R, before assessing some metric of accuracy on my own manual implementation of DFT, I wanted to do a sanity check on how well stats::fft() performs by doing the following:
sig.ts = ts( sin(2*pi*freq1*t) + sin(2*pi*freq2*t) );
sig.rt = fft(fft(sig.ts)/N, inverse="true");
#the two plots so perfectly align that you can't see them both
max(abs(sig.ts - sig.rt)) / max(sig.ts);
#arbitrary crude accuracy metric=1.230e-15 - EXCELLENT!
But I wanted to write the code for DFT myself, to ensure I understand it, then invert it in the hopes that it would be the same:
##The following is the slow DFT for now, not the FFT...
sR = 102.4; #the number of Hz at which we sample
freq1=3; freq2=12; #frequency(ies) of the wave
t = seq(1/sR,10, 1/sR);
sig.ts = ts( sin(2*pi*freq1*t) + sin(2*pi*freq2*t) );
N=length(t); kk=seq(0,N/2-1, 1); nn=seq(0,N-1, 1);
for(k in kk){
sig.freqd[k]=0;
for(n in nn){
sig.freqd[k] = sig.freqd[k] + sig.ts[n+1]*exp(-j*2*pi*n*k/N); } }
sig.freqd = (1/N)*sig.freqd; #for Normalization
#Checking the "accuracy" of my manual implementation of DFT...
sig.freqd_inv=Re(fft(sig.freqd, inverse="true"));
plot(t[1:100], window(sig.ts,end=100), col="black", type="l",lty=1,lwd=1, xaxt="n");
lines(t[1:100],window(sig.freqd_inv,end=100), col="red", type="l",lty=1,lwd=1, xaxt="n");
axis(1, at=seq(round(t[1],1),round(t[length(t)],1), by=0.1), las=2);
max(abs(sig.ts[1:(N/2-1)] - sig.freqd_inv)) / max(sig.ts[1:(N/2-1)]); #the metric here =1.482 unfortunately
Even without the metric, the plot makes it obvious that something's off here - the result has lower amplitude, is maybe out of phase, and is more jagged. In all of my self-studying, I will say that I am a bit confused about how sensitive this all is to vector length... as well as how to ensure that the imaginary component's phase information is taken into account when plotting.
Bottom line, any insight into what's wrong with my DFT algorithm would be helpful. I don't want to just blackbox my use of functions - I want to understand these things more deeply before moving on to more complicated functions.
Thanks,
Christian
The main issues arise from the signal indexing. First, to get a full transform usable by R's fft(..., inverse = TRUE), you would need to compute all N coefficients (even though the coefficients above N/2-1 could be obtained by symmetry).
Then you should realize that array indexing in R is 1-based. So, while indexing sig.freqd[k], the index k should start at 1 instead of 0. Since the arguments to exp(-1i*2*pi*n*k/N) should start with n=0 and k=0, you'll need to adjust the indices:
kk=seq(1,N, 1); nn=seq(1,N, 1);
for(k in kk){
sig.freqd[k]=0i;
for(n in nn){
sig.freqd[k] = sig.freqd[k] + sig.ts[n]*exp(-1i*2*pi*(n-1)*(k-1)/N);
}
}
I've also changed your usage of j to 1i to represent the imaginary unit, since that's the notation recognized by R (and R was complaining about it when I tried your posted sample as-is). If you had defined j = 1i, that wouldn't have affected the results.
Note also that R's fft is unnormalized. So to obtain the same result for the forward transform, your DFT implementation should not include the 1/N normalization. On the other hand, you will need to add this factor as a final step in order to get the full-circle forward+backward transform to match the original signal.
With these changes (plus pre-allocating sig.freqd so that the indexed assignments work) you should have the following code:
##The following is the slow DFT for now, not the FFT...
sR = 102.4; #the number of Hz at which we sample
freq1=3; freq2=12; #frequency(ies) of the wave
t = seq(1/sR,10, 1/sR);
sig.ts = ts( sin(2*pi*freq1*t) + sin(2*pi*freq2*t) );
N=length(t); kk=seq(1,N, 1); nn=seq(1,N, 1);
sig.freqd = complex(N); #pre-allocate the complex coefficient vector so sig.freqd[k] assignment works
for(k in kk){
sig.freqd[k]=0i;
for(n in nn){
sig.freqd[k] = sig.freqd[k] + sig.ts[n]*exp(-1i*2*pi*(n-1)*(k-1)/N);
}
}
#Checking the "accuracy" of my manual implementation of DFT...
sig.freqd_inv=(1/N)*Re(fft(sig.freqd, inverse="true"));
plot(t[1:100], window(sig.ts,end=100), col="black", type="l",lty=1,lwd=2, xaxt="n");
lines(t[1:100],window(sig.freqd_inv,end=100), col="red", type="l",lty=2,lwd=1, xaxt="n");
axis(1, at=seq(round(t[1],1),round(t[length(t)],1), by=0.1), las=2);
max(abs(sig.ts - sig.freqd_inv)) / max(sig.ts)
This should yield a metric around 1.814886e-13, which is probably more in line with what you were expecting. The corresponding plot should also show the original signal and the round-trip signal overlapping.
I am working on a script that should estimate the probability of having at least 2 out of n people having a same birthday within k days from eachother. To estimate this I have the following function:
birthdayRangeCheck.prob = function(nPeople, seperation, nSimulations) {
count = 0
for (i in 1:nSimulations) {
count = count + birthdayRangeCheck(nPeople, seperation)
}
return(count / nSimulations)
}
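birthdayRangeCheck itself isn't shown in the post. A hypothetical stand-in (an assumption to make the snippet self-contained, not the poster's actual implementation) that returns 1 when at least two of the nPeople have birthdays within seperation days of each other, ignoring year wrap-around, might be:
birthdayRangeCheck = function(nPeople, seperation) {
  bdays = sort(sample(1:365, nPeople, replace = TRUE))   # draw birthdays uniformly over the year
  as.integer(any(diff(bdays) <= seperation))             # 1 if any two consecutive birthdays are close enough
}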
Now, just entering simple values for nPeople, seperation, and nSimulations gives me a sensible number.
e.g.
birthdayRangeCheck.prob(10,4,100)
-> 0.75
However when I want to plot the probability as a function of nPeople, and seperation I stumble upon the following problem:
x = 1:999
y = 0:998
z = outer(X = x, Y = y, FUN = birthdayRangeCheck.prob, nSimulations = 100)
numerical expression has 576 elements: only the first used... (a lot of times)
So it seems like outer is not passing single elements of x and y, but rather the vectors themselves, which is the opposite of what outer should do, right?
Am I overlooking something? I can't figure out what is causing this error. (Replacing FUN with e.g. sin(x+y) works like a charm, so I did pin it down to the function itself. But since the function works just fine with numeric arguments, I don't see why R doesn't just pass single elements of x and y as arguments.)
Any help would be greatly appreciated. Thanks ;)
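As an illustration of why that warning appears (this example is mine, not from the post): outer() builds the full grid of argument pairs and then calls FUN once with two long vectors, so FUN must be vectorized; a scalar-only function can be wrapped, e.g. with Vectorize().
f <- function(a, b) { cat("lengths passed to FUN:", length(a), length(b), "\n"); a + b }
invisible(outer(1:3, 1:4, f))   # prints a single line: FUN receives two vectors of length 12
# a possible fix for the scalar-only function above (hypothetical, not verified against the poster's code):
# z <- outer(X = x, Y = y, FUN = Vectorize(birthdayRangeCheck.prob), nSimulations = 100)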
I'd first like to describe my problem:
What I want to do is to calculate the number of spikes in prices in a 24-hour window, while I possess half-hourly data.
I have seen all Stackoverflow posts like e.g. this one:
Rollapply for time series
(If there are more relevant ones, please let me know ;) )
As I cannot and probably also should not upload my data, here's a minimal example:
I simulate a random variable, convert it to an xts object, and use a user defined function to detect "spikes" (of course pretty ridiculous in this case, but illustrates the error).
library(xts)
##########Simulate y as a random variable
y <- rnorm(n=100)
##########Add a date variable so i can convert it to a xts object later on
yDate <- as.Date(1:100)
##########bind both variables together and convert to a xts object
z <- cbind(yDate,y)
z <- xts(x=z, order.by=yDate)
##########use the rollapply function on the xts object:
x <- rollapply(z, width=10, FUN=mean)
The function works as it is supposed to: it takes the 10 preceding values and calculates the mean.
Then, I defined my own function to find peaks: a peak is a local maximum (higher than the m points around it) AND at least as big as the mean of the time series + h.
This leads to:
find_peaks <- function (x, m,h){
shape <- diff(sign(diff(x, na.pad = FALSE)))
pks <- sapply(which(shape < 0), FUN = function(i){
z <- i - m + 1
z <- ifelse(z > 0, z, 1)
w <- i + m + 1
w <- ifelse(w < length(x), w, length(x))
if(all(x[c(z : i, (i + 2) : w)] <= x[i + 1])&x[i+1]>mean(x)+h) return(i + 1) else return(numeric(0))
})
pks <- unlist(pks)
pks
}
And it works fine. Back to the example:
plot(yDate,y)
#Is supposed to find the points which are higher than 3 points around them
#and higher than the average:
#Does so, so works.
points(yDate[find_peaks(y,3,0)],y[find_peaks(y,3,0)],col="red")
However, using the rollapply() function leads to:
x <- rollapply(z,width = 10,FUN=function(x) find_peaks(x,3,0))
#Error in `[.xts`(x, c(z:i, (i + 2):w)) : subscript out of bounds
I first thought the error might occur because the function could run into a negative index for the first points, because of the m parameter. Sadly, setting m to zero does not change the error.
I have tried to trace this error too, but do not find the source.
Can anyone help me out here?
Edit: a picture of spikes (on the Australian electricity market): find_peaks(20,50) determines the red points to be spikes; find_peaks(0,50) additionally flags the blue ones as spikes (therefore the second parameter h is important, because the blue points are clearly not what we want to analyse when we talk about spikes).
I'm still not entirely sure what it is that you are after. On the assumption that, given a window of data, you want to identify whether its center is greater than the rest of the window while also being greater than the mean of the window + h, then you could do the following:
peakfinder = function(x,h = 0){
xdat = as.numeric(x)
meandat = mean(xdat)
center = xdat[ceiling(length(xdat)/2)]
ifelse(all(center >= xdat) & center >= (meandat + h),center,NA)
}
y <- rnorm(n=100)
z = xts(y, order.by = as.Date(1:100))
plot(z)
points(rollapply(z,width = 7, FUN = peakfinder, align = "center"), col = "red", pch = 19)
Although it would appear to me that if the center point is greater than its neighbours, it is necessarily greater than the local mean too, so this part of the function would not be necessary if h >= 0. If you want to use the global mean of the time series, just substitute the calculation of meandat with the pre-calculated global mean, passed as an argument to peakfinder.
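For instance, a sketch of that global-mean variant (the name peakfinder_global and the globalmean argument are illustrative, not from the original answer):
peakfinder_global = function(x, globalmean, h = 0){
  xdat = as.numeric(x)
  center = xdat[ceiling(length(xdat)/2)]
  # keep the center only if it tops its window and exceeds the global mean + h
  ifelse(all(center >= xdat) & center >= (globalmean + h), center, NA)
}
points(rollapply(z, width = 7, FUN = peakfinder_global, globalmean = mean(y), h = 0, align = "center"),
       col = "blue", pch = 19)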
I'm a new user to R, and I am trying to create a function that will simulate a random walk. The issue for me is trying to integrate some initial values smoothly. Say I have this basic function.
y(t) = y(t-2) + eps(t)
Epsilon (or eps(t)) will be the randomness factor. I want to define y(-1)=0, and y(0)=0.
Here is my code:
ran.walk=function(n){ # 'n' steps will be the input
eps=rnorm(n) # creates a vector taking random values from N(0,1)
y= c(eps[1], eps[2]) # this will set up my initial vector
for (i in 3:n){
ytemp = y[i-2] + eps[i] ## !!! problem is here. Details below !!!
y= c(y, ytemp)
}
return(y)
}
I'm trying to get this to start adding y3, y4, y5, etc., but I think there is a flaw in this design... I'm not sure if I should just set up two separate lines, with an if statement testing if n is even or odd, perhaps with:
if (i %% 2 == 1) # using modulus
Since,
y1= eps1,
y2= eps2,
y3= y1 + eps3,
y4= y2 + eps4,
y5= y3 + eps5 and so on...
Currently, I see the error in my code: I have y1 and y2 concatenated, but I don't think it knows how to incorporate y[1].
Can I define beforehand somehow y[-1]=0, and y[0]=0 ? I tried this also and got an error.
Thank you kindly in advance for any assistance. This is my first time attempting a for loop with recursion.
-N (sorry for any formatting issues, I had a lot of problems getting this question to go through)
I found that your odd and even series are independent of one another. Assuming that is the case, I just split the problem into two series and use cumsum to get each random walk. The final data frame includes the random numbers and the random walk, so you can check that it is working properly.
Hoping it helps
ran.walk=function(n) {
eps=rnorm(ceiling(n / 2)*2)
dim(eps) <- c(2,ceiling(n/2))
# since each series is independent, we can tally each one in its own
eps2 <- apply(eps, 1, cumsum)
# and just reorganize it
eps2 <- as.numeric(t(eps2))
rndwlk <- data.frame(rnd=as.numeric(eps), walk=eps2)
# remove the extra value if needed
rndwlk <- rndwlk[1:n,]
return(rndwlk)
}
ran.walk(13)
After taking a break with my piano, it came to me. It's funny how simple the answer becomes once you discover it... almost trivial.
Setting the initial value to be a vector, that is:
[y(1) = y(-1) + eps(1), y(2)= y(0) + eps(2)]
everything works out. It is still true that the evens and odds don't interact, but there is no reason to specify any of that.
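In code, the fix presumably amounts to something like this (my reconstruction of what's described above, assuming n >= 3, not the poster's verbatim solution):
ran.walk = function(n){
  eps = rnorm(n)              # random increments from N(0,1)
  y = c(eps[1], eps[2])       # y(1) = y(-1) + eps(1) = eps(1); y(2) = y(0) + eps(2) = eps(2)
  for (i in 3:n){
    y[i] = y[i-2] + eps[i]    # each new term builds on the value two steps back
  }
  return(y)
}
plot(ran.walk(200), type = "l")   # quick visual check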
The method to split the iterations with modulus, then concatenating it back into the main vector would also work, but is unnecessary and more complicated. Shorter is better for users and computers. As Einstein said, make it as simple as possible, but no simpler.
I've got the first line down which is defining the function:
f <- function(x) 3034*log(x)+2305.84*log(1-x)-1517*log(1-x)
Now the problem I'm having is I need to find all the x values where
f(x)=-1947.92 but I've got no idea what the command is to do this?
Normally I would say you should use uniroot(), after modifying the function to return zero at the target, but that will be problematic here:
target <- -1947.92
f <- function(x) 3034*log(x)+2305.84*log(1-x)-1517*log(1-x)
g <- function(x) f(x)-target
uniroot(g,interval=c(1e-4,1-1e-4))
## Error in uniroot(g, interval = c(1e-04, 1 - 1e-04)) :
## f() values at end points not of opposite sign
What's going on is that your curve crosses zero in two places. uniroot() requires that you bracket the root:
Let's take a look:
curve(g(x))
abline(h=0,col=2)
Zoom in:
curve(g(x),from=0.75,to=0.85)
abline(h=0,col=2)
Now we can either just eyeball this (i.e. use interval=c(1e-4,0.8) or interval=c(0.8,1-1e-4), depending on which root we're interested in) or find the location of the maximum numerically and use it to split the interval:
opt1 <- optim(g,par=0.5,method="L-BFGS-B",lower=1e-4,upper=1-1e-4,
control=list(fnscale=-1)) ## maximize rather than min
then use opt1$par as your cut-point. (Or you could do some simple calculus: the maximum [point where the derivative wrt x is zero] is much easier to compute than the roots ...)
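For example (a sketch continuing from the optim() call above, not part of the original answer):
cut <- opt1$par                                           # x-location of the maximum of g
root_left  <- uniroot(g, interval = c(1e-4, cut))$root    # root to the left of the maximum
root_right <- uniroot(g, interval = c(cut, 1-1e-4))$root  # root to the right of the maximum
c(root_left, root_right)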
Alternatively, you could ask Wolfram Alpha ...