Understanding Event Coincidence Analysis from CoinCalc R package
I have two binarized event series (eventA and eventB) and I want to know whether there is any coincidence between them. So I'll use the new package CoinCalc to investigate the potential relation between the two.
library(CoinCalc) # note that the package is not visible (at least for me) in CRAN. I got it from GitHub https://github.com/JonatanSiegmund/CoinCalc
# two binary events
eventA= c(0,1,0,0,1,1,0,0,1,1,0,0,1,0,1,0,1,0,1,1,1,1,0,0,0,1,1,1,0,0,1,1,0,1,1,0,1,0,0,0,1,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,1,1,0,0,1,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,0,1,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,0,0,1,1,0,1,0,1,1,0,1,0,0,0,1,0,0,1,0,1)
eventB = c(0,1,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,1,0,1,0,1,1,0,1,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,1,1,0,0,1,0,1,1,1,1,1,1,0,0,1,1,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0,0,0,0,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,1,0,0,0)
# run ECA analysis
ca.out <- CC.eca.ts(eventA, eventB, delT = 2, tau = 2)
This yields:
$`NH precursor`
[1] TRUE

$`NH trigger`
[1] FALSE

$`p-value precursor`
[1] 0.2544052

$`p-value trigger`
[1] 0.003287963

$`precursor coincidence rate`
[1] 0.8243243

$`trigger coincidence rate`
[1] 0.9285714
I want to make sure I'm understanding this properly. Based on the results, the null hypothesis can only be rejected for the trigger, which is statistically significant at the 0.003 level, with a coincidence rate of 0.92 (very high; is this equivalent to R²?). Can this be interpreted as eventB having a strong influence on eventA, but not the other way around?
Then I can plot these two events using the CC.plot function:
CC.plot(eventA,eventB,dates=c(1900:2040),delT=2, tau=2, seriesAname = 'EventA', seriesBname = 'EventB')
Which yields a plot of the two series (image not shown here).
Is there any way to modify the graphical parameters in CC.plot? The dummy years are not visible in this plot, and I'd like to change fonts, sizes, colours, etc. Is there any way to draw the same figure by calling the model output (ca.out)?
Thanks in advance!
I'll try to answer your questions:
Question #1: The most important problem that I see in your example is that your events are not "rare". Therefore the most important precondition of the analytical significance test that you used by default (sigtest="poisson") is not fulfilled. Another "problem" is that the events in both series seem to be clustered (which may also be an effect of the high number of events). I would recommend using sigtest="shuffle.surrogate", which is more appropriate for this case. More information about the significance tests can be found in Siegmund et al. 2017 (http://www.sciencedirect.com/science/article/pii/S0098300416305489).
Executing this reveals that both coincidence rates are not significant. By the way: with such a high number of events it is extremely unlikely that you would ever get a significant coincidence rate, because the chance that simultaneities occur at random is very high.
Nevertheless, if the trigger coincidence rate were significant and the precursor rate were not, your interpretation would be a possible one.
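To give an intuition for what the shuffle-surrogate test does, here is a rough sketch in Python rather than R. This is not CoinCalc's actual code, and the exact window convention for delT and tau is my assumption; check the package documentation and the paper above for the precise definitions. The idea is to repeatedly shuffle one series and count how often a random ordering produces a coincidence rate at least as high as the observed one.

import random

def precursor_rate(a, b, delT=2, tau=2):
    # Fraction of events in series a preceded by at least one event
    # in series b within the window [t - tau - delT, t - tau].
    hits, total = 0, 0
    for t, ev in enumerate(a):
        if ev:
            total += 1
            window = range(max(t - tau - delT, 0), max(t - tau + 1, 0))
            if any(b[s] for s in window):
                hits += 1
    return hits / total if total else float("nan")

def shuffle_pvalue(a, b, n_surrogates=2000, **kwargs):
    # Shuffle-surrogate test: the p-value is the fraction of random
    # reorderings of b whose rate reaches the observed one.
    observed = precursor_rate(a, b, **kwargs)
    b_shuffled = list(b)
    exceed = 0
    for _ in range(n_surrogates):
        random.shuffle(b_shuffled)
        if precursor_rate(a, b_shuffled, **kwargs) >= observed:
            exceed += 1
    return exceed / n_surrogates

With almost half of all time steps being events, a shuffled series reproduces the observed rate very easily, which is exactly why the rates come out as not significant here.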
Question #2: The problem with the plot is, again, that there are too many events (compared to what the method was originally designed for). This is why everything looks so messy. The function was meant to be more of a help to explain how the method works and what you have done.
If you, for example, only plot 20 years of your data:
CC.plot(eventA[120:140],eventB[120:140],dates=c(2020:2040),delT=2, tau=2, seriesAname = 'EventA', seriesBname = 'EventB')
you will get a much better image, which is still not very pretty due to the high event density of almost 50%.
(figure: CoinCalc plot)
For now, there are no options to change the plot parameters. This might come in a future version of the package.
I hope that this helps you a bit!
Recommendations (functions/solutions) to apply in OpenMDAO instead of boolean conditions (if/else)
I have been working for a couple of months with OpenMDAO and I find myself struggling with my code when I want to impose conditions to replicate a physical/engineering behaviour. I have tried using sigmoid functions, but I am still not convinced by that, due to the difficulty of trading off sensitivity against numerical stabilization. Most of the time I get overflows in exp, so I end up including other conditionals (like np.where), thereby losing linearity.

outputs['sigmoid'] = 1 / (1 + np.exp(-x))

I was looking for another kind of step function, or something like it, able to keep linearity and differentiability for the ease of the optimization. I don't know if something like that exists or if there is any strategy that can help me. If it helps, I am working with an OpenConcept benchmark, which uses vectorized computations and Simpson's rule numerical integration. Thank you very much.

PS: This is my first ever question on Stack Overflow, so I would like to apologize in advance for any error or bad practice committed. I hope to eventually collaborate and become active in the community.

Update after Justin's answer: I will take the opportunity to define my problem a little bit more, along with the strategy I tried. I am trying to monitor and control thermodynamic conditions inside a tank. One of the things is to take action when pressure P1 reaches a certain threshold P2. For defining this:

eval = (inputs['P1'] - inputs['P2']) / (inputs['P1'] + inputs['P2'])
# P2 = threshold [Pa]
# P1 = calculated pressure [Pa]
k = 100  # steepness control
outputs['sigmoid'] = 1 / (1 + np.exp(-eval * k))

eval was defined in order to avoid overflows by normalizing the values, so that corrections are taken when the threshold is reached. In a very similar way, I defined a function to check whether there is still mass (so flow can continue between systems):

eval = inputs['mass'] / inputs['max']
k = 50
outputs['sigmoid'] = (1 / (1 + np.exp(-eval * k)))**3

max is also used for normalizing the value, and the exponent is added so the function reaches zero before entering the negative domain. (Plot omitted; it seems I cannot post images yet because of my reputation.)

It may be important to highlight that both mass and pressure are calculated from coupled ODE integration, in which these activation functions take part. I guess OpenConcept's nature is to 'explore' a lot of possible values before arriving at the solution, so most of the time it gives negative, infeasible values for mass and pressure, creating overflows. For that reason I sometimes try to include:

eval[np.where(eval > 1.5)] = 1.5
eval[np.where(eval < -1.5)] = -1.5

That is not a beautiful but sometimes effective solution. I try to avoid using it, since I sense that these bounds make the solver's and optimizer's work more difficult.
I could give you a more complete answer if you distilled your question down to a specific code example of the function you're wrestling with and its expected input range. If you provide that code sample, I'll update my answer.

Broadly, this is a common challenge when using gradient-based optimization. You want some kind of behavior like an if-condition to turn something on/off, and in many cases that's a fundamentally discontinuous function. To work around that we often use sigmoid functions, but these do have some of the numerical challenges you pointed out. You could try a hyperbolic tangent as an alternative, though it may suffer the same kinds of problems. I will give you two broad options:

Option 1

Sometimes it's OK (even if not ideal) to leave the purely discrete conditional in the code. Let's say you wanted to represent a simple piecewise function:

y = 2x; x >= 0
y = 0;  x < 0

There is a sharp corner in that function right at 0. That corner is not differentiable, but the function is fine everywhere else. This is very much like the absolute value function in practice, though you might not draw the analogy looking at the piecewise definition, because the piecewise nature of abs is often hidden from you.

If you know (or at least can check after the fact) that your final answer will not lie right on or very near that C1 discontinuity, then it's probably fine to leave the code the way it is. Your derivatives will be well defined everywhere but right at 0, and you can simply pick the left or the right answer at 0. It's not strictly mathematically correct, but it works fine as long as you're not ending up stuck right there.

Option 2

Apply a smoothing function. This can be a sigmoid or a simple polynomial. The exact nature of the smoothing function is highly specific to the kind of discontinuity you are trying to approximate. In the case of the piecewise function above, you might be tempted to define that function as:

2x*sig(x)

That would give you roughly the correct behavior, and it would be differentiable everywhere. But Wolfram Alpha shows that it actually undershoots a little. That's probably undesirable, so you can increase the exponent (the steepness of the sigmoid) to mitigate it. This, however, is where you start to get underflow and overflow problems. So to work around that, and to make a better-behaved function all around, you could instead define a three-part piecewise polynomial:

y = 2x;                   x >= a
y = c0 + c1*x + c2*x**2;  -a <= x < a
y = 0;                    x < -a

You can solve for the coefficients as a function of a by matching the value and slope of the two outer branches at x = -a and x = a (please double check my algebra before using this!):

c0 = a/2
c1 = 1
c2 = 1/(2a)

The nice thing about this approach is that it will never undershoot and go negative. You can also make a reasonably small and still get decent numerics. But if you try to make it too small, c2 will obviously blow up.

In general, I consider the sigmoid function to be a bit of a blunt instrument. It works fine in many cases, but if you try to make it approximate a step function too closely, it's a nightmare. If you want to represent physical processes, I find that polynomial fillet functions work more nicely. It takes a little effort to derive the polynomial, because you want it to be C1 continuous on both sides of the blend. So you have to construct the system of equations to solve for it as a function of the polynomial order and the specific relaxation width you want (e.g. 0.1).
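For concreteness, here is a minimal NumPy sketch of that three-part blend; the function name and the default width a are illustrative, and it uses the coefficients derived above:

import numpy as np

def smooth_piecewise(x, a=0.1):
    # C1-continuous blend of y = 2x (x >= a) and y = 0 (x <= -a).
    # On [-a, a], y = c0 + c1*x + c2*x**2 matches the value and slope
    # of both outer branches: c0 = a/2, c1 = 1, c2 = 1/(2a).
    c0, c1, c2 = a / 2.0, 1.0, 1.0 / (2.0 * a)
    mid = c0 + c1 * x + c2 * x ** 2
    return np.where(x >= a, 2.0 * x, np.where(x <= -a, 0.0, mid))

print(smooth_piecewise(np.array([-0.2, 0.0, 0.2])))  # -> 0.0, 0.05, 0.4

Because every branch is a polynomial, there is no exp to overflow, and the derivative is continuous at both x = -a and x = a.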
My go-to has generally been to consult the table of activation functions on Wikipedia: https://en.wikipedia.org/wiki/Activation_function

I've had good luck with the sigmoid and the hyperbolic tangent, scaling them such that we can choose the lower and upper values, as well as choosing the location of the activation on the x-axis and the steepness. Dymos uses a vectorization that I think is similar to OpenConcept's, and I've had success with numpy.where there as well, providing derivatives for each possible "branch" taken. It is true that you may have issues with derivative mismatches if you have an analysis point right on the transition, but often I've had success despite that. If the derivative at the transition becomes a hindrance, then implementing a sigmoid or ReLU is more appropriate. If x is of a magnitude such that it can cause overflows, consider applying units or using scaling to put it within reasonable limits, if you cannot bound it directly.
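As an illustration of that scaling (a sketch, not OpenMDAO/Dymos API; the function name and defaults are made up for the example), note that writing the sigmoid in terms of tanh also sidesteps the exp overflow, since np.tanh simply saturates for large arguments:

import numpy as np

def scaled_activation(x, lower=0.0, upper=1.0, x0=0.0, k=10.0):
    # Ramp from `lower` to `upper`, centered at x0 with steepness k.
    # Uses the identity 1/(1 + exp(-z)) == 0.5*(1 + tanh(z/2)),
    # and tanh never overflows, it just saturates at +/-1.
    return lower + 0.5 * (upper - lower) * (1.0 + np.tanh(0.5 * k * (x - x0)))

print(scaled_activation(np.array([-1e6, 0.0, 1e6])))  # ~[0.0, 0.5, 1.0]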
What if the FD steps varied w.r.t. output/input?
I am using the finite difference scheme to find gradients. Let's say I have 2 outputs (y1, y2) and 1 input (x) in a single component, and I know in advance that the sensitivity of y1 with respect to x is not the same as the sensitivity of y2 to x. Thus I could potentially use two different steps for them, as in:

self.declare_partials(of='y1', wrt='x', method='fd', step=0.01, form='central')
self.declare_partials(of='y2', wrt='x', method='fd', step=0.05, form='central')

There is nothing that stops me (algorithmically), but it is not clear what the OpenMDAO gradient calculation would do exactly in this case. Does it exchange information between the cases where the steps are different by looking at the step ratios, or does it simply treat them independently, therefore doubling the computational time?
I just tested this, and it does the finite difference twice with the two different step sizes, saving only the requested outputs for each step. I don't think we could do anything with the ratios as you suggested, as the reason for using different step sizes to resolve individual outputs is that you don't trust the accuracy of the outputs at the smaller (or larger) step size.
This is a fair question about the effect of the API. In typical FD applications you would get only 1 function call per design variable for forward and backward difference, and 2 function calls for central difference. However, in this case you have asked for two different step sizes for two different outputs, both with central difference. So here you'll end up with 4 function calls to compute all the derivatives: dy1_dx will be computed using the step size of 0.01, and dy2_dx will be computed with a step size of 0.05. There is no crosstalk between the two FD calls, and you do end up with more function calls than you would have if you had just specified a single step size via:

self.declare_partials(of='*', wrt='x', method='fd', step=0.05, form='central')

If the cost is something you can bear, and you get improved accuracy, then you can use this method to get different step sizes for different outputs.
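Put together as a self-contained component, the setup looks roughly like this (a sketch; the component name, variable names, and compute expressions are illustrative):

import openmdao.api as om

class TwoOutputs(om.ExplicitComponent):
    def setup(self):
        self.add_input('x', val=1.0)
        self.add_output('y1', val=0.0)
        self.add_output('y2', val=0.0)
        # Two FD declarations with different steps: a separate
        # central-difference loop runs for each unique setting, so
        # derivatives here cost 4 compute() calls instead of 2.
        self.declare_partials(of='y1', wrt='x', method='fd', step=0.01, form='central')
        self.declare_partials(of='y2', wrt='x', method='fd', step=0.05, form='central')

    def compute(self, inputs, outputs):
        outputs['y1'] = inputs['x'] ** 2
        outputs['y2'] = 1e3 * inputs['x'] ** 3

prob = om.Problem()
prob.model.add_subsystem('comp', TwoOutputs())
prob.setup()
prob.run_model()
prob.check_partials(compact_print=True)  # compare FD against exact values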
Computer Adaptive Testing 1PL Ability Calculation Math: How to implement?
Preamble: I have been implementing my own CAT system. The resources that have helped me most are these:

An On-line, Interactive, Computer Adaptive Testing Tutorial, 11/98 -- A good explanation of how to pick a test question based on which one would return the most information. Fascinating idea, really. The equations are not illustrated with examples, however... but there is a simulation to play with. Unfortunately the simulation is down!

Computer-Adaptive Testing: A Methodology Whose Time Has Come -- This has similar equations, although it does not use IRT or the Newton-Raphson method. It is also Rasch, not 3PL. It does, however, have a BASIC program that is far more explicit than the usual equations that are cited. I have converted portions of the program in order to get my own system to experiment with, but I would prefer to use 1PL and/or 3PL.

Rasch Dichotomous Model vs. One-parameter Logistic Model -- This clears some stuff up, but perhaps only makes me more dangerous at this stage.

Now, the question. I want to be able to measure someone's ability level based on a series of questions that are rated at a 1PL difficulty level and, of course, the person's answers and whether or not they are correct. First I need a function that calculates the probability of answering a given item correctly. This equation gives the probability function for 1PL:

Probability correct = e^(ability - difficulty) / (1 + e^(ability - difficulty))

I'll go with this one arbitrarily for now. Using an ability estimate of 0, we get the following probabilities for items of these difficulties:

-0.3 --> 0.574442516811659
-0.2 --> 0.549833997312478
-0.1 --> 0.52497918747894
 0.0 --> 0.5
 0.1 --> 0.47502081252106
 0.2 --> 0.450166002687522
 0.3 --> 0.425557483188341

This makes sense. A problem targeting their level is 50/50... and the questions are harder or easier depending on which direction you go. The harder questions have a smaller chance of coming out correct.

Now... consider a test taker that has done five questions at these difficulties: -0.1, 0, 0.1, 0.2, 0.1. Assume they got them all correct except the one at difficulty 0.2. Assuming an ability level of 0... I would want some equations to indicate that this person is slightly above average. So... how do I calculate that with 1PL?

This is where it gets hard. Looking at the equations on the various pages... I would start with an assumed ability level and then gradually adjust it after each question, more or less like the following:

Starting ability: B0 = 0
Ability after problem 1: B1 = B0 + [summations and function evaluated for item 1 at ability B0]
Ability after problem 2: B2 = B1 + [summations and functions evaluated for items 1-2 at ability B1]
Ability after problem 3: B3 = B2 + [summations and functions evaluated for items 1-3 at ability B2]
Ability after problem 4: B4 = B3 + [summations and functions evaluated for items 1-4 at ability B3]
Ability after problem 5: B5 = B4 + [summations and functions evaluated for items 1-5 at ability B4]

And so on. Just from reading papers on this, this is the gist of what the algorithm should be doing. But there are so many different ways to do it. The behaviour of my code is clearly wrong, as I get division-by-zero errors... so this is where I get lost. I've messed with information functions and done derivatives, but my college-level math is not cutting it. Can someone explain to me how to do this part? The literature I've read is short on examples, and the descriptions of the math appear incomplete to me.
I suppose I'm asking how to do this with a 3PL model that assumes c is always zero and a is always 1.7 (or maybe -1.7, whatever works). I was trying to get to 1PL somehow anyway.

Edit: A visual guide to item response theory is the best explanation of how to do this I've seen so far, but the text gets confusing at the most critical point. I'm closer to getting this, but I'm still not understanding something. Also... the pattern of summations and functions isn't in this text like I expected.
How to do this: the following is an inefficient solution, but it works and is reasonably intuitive. The last link I mentioned in the edit explains this.

You are given a probability function, a set of question difficulties, and a corresponding set of evaluations, i.e. whether or not the test taker got each item correct. With that, you can get a series of functions that tell you the chance of the test taker giving exactly that response. Now... multiply all of those functions together. We now have a big mess! But it's a single function in terms of the unknown ability variable that we want to find. Next... run a slew of numbers through this function. Whichever returns the maximum value is the test taker's ability level. This can be used either to determine the standard error or to pick the next question for computer adaptive testing.
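A minimal Python sketch of that brute-force maximum-likelihood search (the function names and the grid bounds are illustrative; a Newton-Raphson iteration over the same likelihood is the faster, standard alternative):

import math

def p_correct(ability, difficulty):
    # 1PL response probability: e^(b - d) / (1 + e^(b - d))
    return math.exp(ability - difficulty) / (1.0 + math.exp(ability - difficulty))

def likelihood(ability, difficulties, responses):
    # Chance of this exact response pattern at the given ability:
    # the product of p for correct items and (1 - p) for misses.
    result = 1.0
    for d, correct in zip(difficulties, responses):
        p = p_correct(ability, d)
        result *= p if correct else (1.0 - p)
    return result

def estimate_ability(difficulties, responses, lo=-4.0, hi=4.0, steps=801):
    # Brute-force grid search for the ability maximizing the likelihood.
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda b: likelihood(b, difficulties, responses))

# The example from the question: all correct except the 0.2 item.
print(estimate_ability([-0.1, 0.0, 0.1, 0.2, 0.1],
                       [True, True, True, False, True]))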
Trying to use the Naive Bayes Learner in R but predict() giving different results than model would suggest
I'm trying to use the naive Bayes learner from e1071 to do spam analysis. This is the code I use to set up the model:

library(e1071)
emails = read.csv("emails.csv")
emailstrain = read.csv("emailstrain.csv")
model <- naiveBayes(type ~ ., data = emailstrain)

There are two sets of emails that both have a 'statement' and a type. One is for training and one is for testing. When I run model and just read the raw output, it seems that it gives a higher-than-zero percent chance of a statement being spam when it is indeed spam, and the same is true for when it is not. However, when I try to use the model to predict the testing data with

table(predict(model, emails), emails$type)

I get

        ham  spam
  ham  2086   321
  spam    2     0

which seems wrong. I also tried using the training set to test the data on, and in this case it should give quite good results, or at least as good as what was observed in the model. However it gave

        ham  spam
  ham  2735   420
  spam    0     6

which is only slightly better than with the testing set. I think something must be wrong with how the predict function is working.

Here is how the data files are set up, with some examples of what's inside:

type,statement
ham,How much did ur hdd casing cost.
ham,Mystery solved! Just opened my email and he's sent me another batch! Isn't he a sweetie
ham,I can't describe how lucky you are that I'm actually awake by noon
spam,This is the 2nd time we have tried to contact u. U have won the £1450 prize to claim just call 09053750005 b4 310303. T&Cs/stop SMS 08718725756. 140ppm
ham,"TODAY is Sorry day.! If ever i was angry with you, if ever i misbehaved or hurt you? plz plz JUST SLAP URSELF Bcoz, Its ur fault, I'm basically GOOD"
ham,Cheers for the card ... Is it that time of year already?
spam,"HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870..k"
ham,"When people see my msgs, They think Iam addicted to msging... They are wrong, Bcoz They don\'t know that Iam addicted to my sweet Friends..!! BSLVYL"
ham,Ugh hopefully the asus ppl dont randomly do a reformat.
ham,"Haven't seen my facebook, huh? Lol!"
ham,"Mah b, I'll pick it up tomorrow"
ham,Still otside le..u come 2morrow maga..
ham,Do u still have plumbers tape and a wrench we could borrow?
spam,"Dear Voucher Holder, To claim this weeks offer, at you PC please go to http://www.e-tlp.co.uk/reward. Ts&Cs apply."
ham,It vl bcum more difficult..
spam,UR GOING 2 BAHAMAS! CallFREEFONE 08081560665 and speak to a live operator to claim either Bahamas cruise of£2000 CASH 18+only. To opt out txt X to 07786200117

I would really love suggestions. Thank you so much for your help.
Actually the predict function works just fine. Don't get me wrong, but the problem is in what you are doing. You are building the model using the formula type ~ ., right? It is clear what we have on the left-hand side of the formula, so let's look at the right-hand side.

In your data you have only two variables - type and statement - and because type is the dependent variable, the only thing that counts as an independent variable is statement. So far everything is clear.

Let's take a look at the naive Bayes classifier. The a priori probabilities are obvious, right? What about the conditional probabilities? From the classifier's point of view you have only one categorical variable (your sentences). To the classifier, it is only a list of labels, and all of them are unique, so the a posteriori probabilities will be close to the a priori ones. In other words, the only thing we can tell when we get a new observation is that its probability of being spam is equal to the probability of a message being spam in your training set.

If you want to use any machine learning method to work with natural language, you have to pre-process your data first. Depending on your problem that could, for example, mean stemming, lemmatization, computing n-gram statistics, or tf-idf. Training the classifier is the last step.
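To make the pre-processing point concrete, here is a rough sketch in Python/scikit-learn rather than R (in R you would typically reach for a text-mining package such as tm to build the same kind of document-term matrix): the statements are converted to bag-of-words counts, so the classifier sees shared tokens rather than one unique label per sentence.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny subset of the data from the question.
texts = [
    "How much did ur hdd casing cost.",
    "Cheers for the card ... Is it that time of year already?",
    "HOT LIVE FANTASIES call now 08707509020 Just 20p per min",
    "UR GOING 2 BAHAMAS! CallFREEFONE 08081560665 and speak to a live operator",
]
labels = ["ham", "ham", "spam", "spam"]

# Bag-of-words counts: each column is a token, not a whole sentence.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(texts)

model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["call now to claim your prize"])))
# likely 'spam', because of tokens shared with the spam examples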
Arithmetic library for tracking worst-case error
(edited) Is there any library or tool that allows for knowing the maximum accumulated error in arithmetic operations? For example, if I make some iterative calculation:

myVars = initialValues;
while (notEnded) {
    myVars = updateMyVars(myVars)
}

I want to know at the end not only the calculated values, but also the potential error (the range of possible values if the results of each individual operation took the range limits for each operand). I have already written a Java class called EADouble.java (EA for Error Accounting) which holds and updates the maximum positive and negative errors along with the calculated value, for some basic operations, but I'm afraid I might be reinventing a square wheel. Any libraries/whatever in Java/whatever? Any suggestions?

Updated on July 11th: Examined existing libraries and added link to sample code.

As commented by fellows, there is the concept of interval arithmetic, and there was a previous question (A good uncertainty (interval) arithmetic library?) on the topic. There are just a couple of small issues with my intent:

I care more about the "main" value than about the upper and lower bounds. However, adding that extra value to an open library should be straightforward.

Accounting for the error as an independent floating point value might allow for finer accuracy (e.g. for addition, the upper bound would be incremented by just half a ULP instead of a whole ULP).

Libraries I had a look at:

ia_math (Java. I would just have to add the main value. My favourite so far.)
Boost/numeric/interval (C++. Very complex/complete.)
ErrorProp (Java. Accounts for the value, and for the error as a standard deviation.)

The sample code (TestEADouble.java) runs a ballistic simulation and a calculation of the number e without problems. However, those are not very demanding scenarios.
Probably way too late, but have a look at BIAS/Profil: http://www.ti3.tuhh.de/keil/profil/index_e.html

It is pretty complete and simple, it accounts for computer (rounding) error, and if your errors are centered it gives easy access to your nominal value through Mid(...).
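For intuition, here is a toy sketch of the interval idea in Python rather than Java (the class and method names are made up for the illustration; a real library such as Profil additionally widens each result by the rounding error via directed rounding, which this sketch omits):

class Interval:
    """Toy interval arithmetic: carry [lo, hi] bounds through operations."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The extrema of a product lie at one of the four corner products.
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(min(corners), max(corners))

    def mid(self):
        # The "main" (nominal) value, analogous to Profil's Mid(...).
        return 0.5 * (self.lo + self.hi)

x = Interval(1.0, 1.1) * Interval(2.0, 2.1) - Interval(0.5)
print(x.lo, x.mid(), x.hi)  # bounds bracket every possible result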