I have two binarized events (eventA and eventB), I want to know if there is any coincidence in these two events. So I'll use the new Package CoinCalc to investigate the potential relation between these two.
library(CoinCalc) #note that the package is not visible (at least for) me in CRAN. I got it from GitHub https://github.com/JonatanSiegmund/CoinCalc
two binary events
eventA= c(0,1,0,0,1,1,0,0,1,1,0,0,1,0,1,0,1,0,1,1,1,1,0,0,0,1,1,1,0,0,1,1,0,1,1,0,1,0,0,0,1,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,1,1,0,0,1,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,0,1,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,0,0,1,1,0,1,0,1,1,0,1,0,0,0,1,0,0,1,0,1)
eventB = c(0,1,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,1,0,1,0,1,1,0,1,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,1,1,0,0,1,0,1,1,1,1,1,1,0,0,1,1,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0,0,0,0,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,1,0,0,0)
run ECA analysis
ca.out <- CC.eca.ts(eventA, eventB,delT=2,tau=2)
this yields:
$NH precursor
1 TRUE
$NH trigger
1 FALSE
$p-value precursor
1 0.2544052
$p-value trigger
1 0.003287963
$precursor coincidence rate
1 0.8243243
$trigger coincidence rate
1 0.9285714
I want to make sure I'm understanding this properly. Based on the results, the null hypothesis can only be rejected for the trigger, which is statistically significant at the 0.003 level, and the coincidence rate is 0.92 (very high, is this equivalent to R2?). Can this be interpreted that eventB has a strong influence on eventA, but not the opposite?
Then I can plot these two events using the CC.plot function:
CC.plot(eventA,eventB,dates=c(1900:2040),delT=2, tau=2, seriesAname = 'EventA', seriesBname = 'EventB')
Which yields:
Is there any way to modify the graphical parameters in CC.plot? The dummy years are not visible in this plot. I'd like to change fonts, size, colours, etc. Is there any way to draw the same figure by calling the model output (ca.out)?
Thanks in advance!
I'll try to answer your questions:
Question #1: The most important problem that I see in your example is that your events are not "rare". Therefore the most important pre-condition of the analytical significance test that you used by default (sigtest="poisson") in not fulfilled. Another "problem" is, that the events in both series seem to be clustered (may also be an effect of the high number of events). I would recommend to use sigtest="shuffle.surrogate" which is more appropriate for this case. More information about the significance test can be found at Siegmund et al. 2017 (http://www.sciencedirect.com/science/article/pii/S0098300416305489)
Executing this reveals that both coincidence rates are not significant. By the way: with such a high number of events it is extremely unlikely that you would ever get a 'significant coincidence rate', because the chance that simultaneities occur by random is very very high.
Nevertheless, if the trigger coincidence rate would be significant and the precursor not, your interpretation is a possible one.
Question #2: The problem with the plot is again, that there are too many events (compared to what the method was originally designed for). This is why everything looks so messy. The function was ment to be more like a help to explain how the method works and what you have done.
If you e.g. only plot e.g. 20 years of your data
CC.plot(eventA[120:140],eventB[120:140],dates=c(2020:2040),delT=2, tau=2, seriesAname = 'EventA', seriesBname = 'EventB')
you will get a much better image, that yet, due to the high event-density of almost 50%, is not very nice.
CoinCalc plot
For now, there are no options to change the plot parameters. This might come for a future version of the package.
I hope that this helps you a bit!
I've been doing BigO recently, and I get the formula ok, but I've written a piece of code that takes and input and returns a time taken to complete a sort. So I have the input and time, how do I use this to classify what sort of BigO it is? I've made graphs and can see which sort they are but I can't do it using the formula? I'm not strong on maths which I think is my problem here!
For instance I get:
Size Time Operations
200 2 163648
400 1 162240
800 15 2489456
1600 6 10247376
3200 19 40858160
6400 79 165383984
12800 318 656588080
25600 1274 2624318128
51200 5059 10476803408
102400 20333 41969291968
I know that this is O(n^2) by looking at the graph and comparing, but how do I prove it?
Yes, you can sample a thousand different input sizes, and then try to derive a Big-O value from that, but you shouldn't - not only because it doesn't actually prove anything, but because that isn't the point.
The way to prove O(n^2) is to prove it on the code itself, not through experiments. The actual running time isn't important, because Big-O notation doesn't say anything about that - in simple terms, it only specifies the dominant term of whatever formula you would use to calculate the exact running time, in the sense of the number of operations executed for that function. Constants are thrown away, and so are smaller terms - the actual running time of a function might be 1000n^2+1000000n, but that's still O(n^2).
You can't mathematically prove anything from this table; the complexity might be O(1) if Time remains at 20333 for all larger values.
The best you can do is try fitting several curves to this table and selecting the best fit according to Occam's razor.
You can't prove it by looking at the timings, you can only prove it by analysing the code to see how many steps are performed. The reason for this is that the time taken is a function not only of your program but many other things outside of your control as well.
For example, who can say whether your machine didn't spend an inordinate amount of time in other processes during one particular test run of your program? This sort of thing can be minimsed to a point using statistical methods but the proof requires solid data.
What you can do is to look at some of your data points to get support for the contention that it's O(n2). Have a look at the last four entries:
Input Time
128 318
256 1274 1274 / 318 = 4.006
512 5059 5059 / 1274 = 3.971
1024 20333 20333 / 5059 = 4.019
You can see that each doubling of the input size has a multiplier effect of the time of about 4 which would tend to indicate an O(n2) property.
But this is support only. It applies only to that particular range of input values and, as stated, is subject to factors outside your control. Note also that the support would be harder to see if the time taken was not a simple one. For example, if the time function was t = n2/10 + 123n + 123456789, it would be a little harder to figure out.
Just by making a comparison between the values may not make any sense.However,if you plot a graph using this values( x-axis : input , y-axis:time),you will get a curve or a linear shape or whatever.Using this information,you can predict the BigO value of that function.Of course there may be(not always) some interrupts that affects the running of that process,but that does not last during the whole period.It is slight overhead that cannot affect the result.
In order to predict the BigO value , you will need some Calculus knowledge in order to make the analogy between the shape and BigO result.
For example,let's say that you got a linear shape and you know that it means O(n).In that point,you reached that result because you know the shape of a linear function graph and your graph looks like it.In order to reach the true proof , you have to draw both your functions curve and the graph of the mathematical function that has the closest shape to your graph.
There are some other functions like Big-Theta , Small-Omega that binds your function from upper or from lower.The mathematical function could be both of them,but as a result,your Big-O function is the closest one to that shape.
I am trying to optimize a bit of code, and am puzzled about information from summaryRprof(). In particular, it looks like a number of calls are made to external C programs, but I'm not able to pin down which C program, from which R function. I am planning to resolve this through a bunch of slicing and dicing of the code, but wondered if I am overlooking some better way to interpret the profiling data.
The highest-consuming function is .Call, which is apparently a generic description for calls to C code; the next leading functions appear to be assignment operations:
$by.self
self.time self.pct total.time total.pct
".Call" 2281.0 54.40 2312.0 55.14
"[.data.frame" 145.0 3.46 218.5 5.21
"initialize" 123.5 2.95 217.5 5.19
"$<-.data.frame" 121.5 2.90 121.5 2.90
"as.vector" 110.5 2.64 416.0 9.92
I decided to focus on the .Call to see how this arises. I looked through the profiling file to find those entries with .Call in the call stack, and the following are the top entries in the call stack (by count of # of appearances):
13640 "eval"
11252 "["
7044 "standardGeneric"
4691 "<Anonymous>"
4658 "tryCatch"
4654 "tryCatchList"
4652 "tryCatchOne"
4648 "doTryCatch"
This list is as clear as mud: I have <Anonymous> and standardGeneric in there.
I believe this is due to calls to functions in the Matrix package, but that's because I'm looking at the code and that package appears to be the only possible source of C code. However, a lot of different functions from Matrix are called in this package, and it seems very difficult to determine which function is consuming this time.
So, my question is pretty basic: is there some way of deciphering and attributing these calls (e.g. .Call, <Anonymous>, etc.) in some other way? The plot of the call graph for this code is rather tricky to render, given the # of functions involved.
The fallback tactics I see are to either (1) comment out bits of code (and hack around to make the code work with this) to see where the time consumption occurs, or to (2) wrap certain operations inside of other functions and see when those functions appear on the call stack. The latter is inelegant, but it seems like it's the best way to add a tag to the call stack. The former is unpleasant because it takes quite some time to run the code, and iteratively uncommenting code and rerunning is unpleasant.
May I suggest you use the profr package. This is another bit of Hadley magic. It's a wrapper around Rprof and gives a visulation of the call stack and timings.
I find profr very easy to use and interpret. For example, here is a profile of a bit of ddply example code and the resulting profr plot:
library(profr)
p <- profr(
ddply(baseball, .(year), "nrow"),
0.01
)
plot(p)
You can immediately see the following:
How ddply calls ldply, llply and loop_apply.
Inside loop_apply there is a .Call function.
You can confirm this by reading the source code for loop_apply:
> plyr:::loop_apply
function (n, f, env = parent.frame())
{
.Call("loop_apply", as.integer(n), f, env)
}
<environment: namespace:plyr>
Edit. There is something very odd about the ggplot.profr method. I have proposed the following fix to Hadley. (You may wish to try this on your example.)
ggplot.profr <- function (data, ..., minlabel = 0.1, angle = 0){
if (!require("ggplot2", quiet = TRUE))
stop("Please install ggplot2 to use this plotting method")
data$range <- diff(range(data$time))
ggplot(as.data.frame(data), aes(y=level)) +
geom_rect(
#aes(xmin=(level), xmax=factor(level)+1, ymin=start, ymax=end),
aes(ymin=level-0.5, ymax=level+0.5, xmin=start, xmax=end),
#position = "identity", stat = "identity", width = 1,
fill = "grey95",
colour = "black", size = 0.5) +
geom_text(aes(label = f, x = start + range/60),
data = subset(data, time > max(time) * minlabel), size = 4, angle = angle, vjust=0.5, hjust = 0) +
scale_x_continuous("time") +
scale_y_continuous("level")
}
It seems that the short answer is "No" and the long answer is "Yes, but you're not going to enjoy this." Even answering this question is going to take some time (so stick around, I may be updating it).
There are several basic things to get one's head around when working with profiling in R:
First, there are many different ways to think about profiling. It is quite typical to think in terms of a call stack. At any given instant, this is the sequence of function calls that are active, essentially nested within each other (subroutines, if you will). This is quite useful for understanding the state of evaluations, where functions will return, and lots of other things that are important for seeing things as the computer / interpreter / OS may see them. Rprof does call stack profiling.
Second, a different perspective is that I've got a bunch of code and a particular call is taking a long time: which line in my code caused that call to be made? This is line profiling. R doesn't have line profiling, as far as I can tell. This is in contrast with Python and Matlab, which both have line profilers.
Third, the map from from lines to calls is surjective, but it is not bijective: given a particular call stack we cannot guarantee that we can map it back to the code. In fact, call stack analyses often summarize the calls completely out of context of the whole stack (i.e. cumulative times are reported no matter where that call was on all of the different stacks in which it occurred).
Fourth, even though we have these constraints, we can put on our statistical hats and analyze the call stack data carefully and see what we can make of it. The call stack information is data and we like data, don't we? :)
Just a quick intro to a call stack. Let's just assume that our call stack looked like this:
"C" "B" "A"
This means that function A called B which then calls C (the order is reversed), and the call stack is 3 levels deep. In my code, the call stack gets to as many as 41 levels deep. Since the stacks can be so deep and are presented in reverse order, this is more interpretable by software than by a human. Naturally, we begin cleaning and transforming this data. :)
Now, our data really comes along looking like:
".Call" "subCsp_cols" "[" "standardGeneric" "[" "eval" "eval" "callGeneric"
"[" "standardGeneric" "[" "myFunc2" "myFunc1" "eval" "eval" "doTryCatch"
"tryCatchOne" "tryCatchList" "tryCatch" "FUN" "lapply" "mclapply"
"<Anonymous>" "%dopar%"
Miserable, isn't it? It even has duplicates of things like eval, some guy called <Anonymous> - probably some darn hacker. (Anonymous is legion, by the way. :-))
The first step in transforming this into something useful was to split each line of Rprof() output and reverse the entries (via strsplit and rev). The first 12 entries (last 12 if you look at the raw call stack, rather than the post-rev version) were the same for every line (of which there were about 12000, the sampling interval was 0.5 seconds - so about 100 minutes of profiling), and these can be discarded.
Remember, we're still interested in knowing which line(s) led to .Call, which took so much time. Before we get to that question, we put on our statistical caps: the profiling reports, e.g. from summaryRprof, profr, ggplot, etc., only reflect the cumulative time spent for a given call or for calls beneath a given call. What does this cumulative information not tell us? Bingo: whether that call was made many times, or a few, and whether the time spent was constant over all invocations of that call or whether there are some outliers. A particular function might be executed 100 times or 100K times, but all of the cost may come from a single invocation (it shouldn't, but we don't know until we look at the data).
This only begins to describe the fun. The A->B->C example doesn't reflect the way things may really appear, such as A->B->C->D->B->E. Now, "B" may be counted a couple of times. What's more, suppose that a lot of time is spent in the C level, but we never sample at precisely that level, only seeing its child calls in the stack. We may see a sizable time for "total.time", but not for "self.time". If there are lots of different child calls under C, we may lose sight of what to optimize - should we take out C altogether or tweak the children, B, D, and E?
Just to account for the time spent, I took the sequences and ran them through digest, storing counts for the digested values, via hash. I also split up the sequences, storing {(A),(A,B), (A,B,C), etc.}. This doesn't seem so interesting, but removing singletons from the counts helps a lot in cleaning up the data. We can also store the time spent in each call by using rle(). This is useful for analyzing the distribution of time spent for a given call.
Still we're nowhere closer to finding the actual time spent per line of code. We'll never get lines of code from the call stack. A simpler way to do this is to store a list of times throughout the code, which stores the output of proc.time(), for a given invocation. Taking the difference of these times reveals which lines or sections of code are taking a long time. (Hint: that's what we're really looking for, not the actual calls.)
However, we have this call stack and we might as well do something useful. Going up the stack is somewhat interesting, but if we rewind the profile information to a little earlier, we can find which calls tend to precede the longer running calls. This allows us to look for landmarks in the call stack - positions where we can tie a call to a particular line of code. This makes it a bit easier to map more calls back to code, if all we have is the call stack, rather than instrumented code. (As I keep mentioning: out of context, there isn't a 1:1 mapping, but at a fine enough granularity, especially in repeatedly hit calls that are distinctive, you may be able to find landmarks in the calls that map to code.)
Altogether, I was able to find which calls were taking a lot of time, whether that was based on 1 long interval or many small ones, what the distribution of time spent was like, and, with some effort, I was able to map the most important & time consuming calls back to the code and discover which parts of the code could benefit the most from rewriting or from a change in algorithms.
Statistical analyses of the call stack is loads of fun, but investigating a particular call based on cumulative time consumption is not a very good way to go. The cumulative time consumed by a call is informative on a relative basis, but it doesn't enlighten us as whether one or many calls consumed this time, nor the depth of the call in the stack, nor the section of code responsible for the invocations. The first two things can be addressed via a bit more R code, while the latter is best pursued through instrumented code.
As R doesn't yet have line profilers like Python and Matlab, the simplest way to handle this is to just instrument one's code.
A line in a profile file might look like
"strsplit" ".parseTabix" ".readVcf" "readVcf" "standardGeneric" "readVcf" "system.time"
which says, reading right to left, that the outermost function was system.time, which invoked readVcf, which was an S4 generic that dispatched to a readVcf method, invoking a function .readVcf, which invoked .parseTabix, which finally called strsplit.
Here we read in the profile file, sort the lines, tally them up (using rle -- run length encoding), then select the six most common paths in the profile file
r = rle(sort(readLines("readVcf.Rprof"))
o = order(r$lengths, decreasing=TRUE)
r$values[head(o)]
This
r$lengths[head(o)]
tells us how many times each of those call stacks were sampled.
There are some common patterns that can help interpret this. Here's an S4 generic being dispatched to its method
"readVcf" "standardGeneric" "readVcf"
an lapply iterating over its function
"FUN" "lapply"
and a tryCatch surrounding a .Call
".Call" "doTryCatch" "tryCatchOne" "tryCatchList" "tryCatch"
Usually one tries to profile relatively small chunks of code, rather than a whole script, with the small chunk identified by, e.g., stepping through the code interactively or making some educated guesses about what parts are likely to be slow. The fact that .Call is the most commonly sampled function is not encouraging -- it suggests most of the time is already being spent in C. Likely your best bet will involve coming up with a better overall algorithm, rather than say a brute force approach.