I have two binarized event series (eventA and eventB), and I want to know whether there is any coincidence between them. I am using the new package CoinCalc to investigate the potential relation between the two.
library(CoinCalc) # note: the package is not visible on CRAN (at least for me); I got it from GitHub: https://github.com/JonatanSiegmund/CoinCalc
# two binary events
eventA= c(0,1,0,0,1,1,0,0,1,1,0,0,1,0,1,0,1,0,1,1,1,1,0,0,0,1,1,1,0,0,1,1,0,1,1,0,1,0,0,0,1,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,1,1,0,0,1,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,0,1,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,0,0,1,1,0,1,0,1,1,0,1,0,0,0,1,0,0,1,0,1)
eventB = c(0,1,0,0,0,0,1,0,0,0,1,1,1,1,1,0,0,0,0,1,0,1,0,0,0,1,0,1,0,1,0,1,0,1,1,0,1,1,1,0,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,0,1,1,1,0,0,1,0,1,1,1,1,1,1,0,0,1,1,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0,0,0,0,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,1,0,0,0)
# run ECA analysis
ca.out <- CC.eca.ts(eventA, eventB,delT=2,tau=2)
this yields:
$NH precursor
[1] TRUE
$NH trigger
[1] FALSE
$p-value precursor
[1] 0.2544052
$p-value trigger
[1] 0.003287963
$precursor coincidence rate
[1] 0.8243243
$trigger coincidence rate
[1] 0.9285714
I want to make sure I'm understanding this properly. Based on the results, the null hypothesis can only be rejected for the trigger test, which is statistically significant (p ≈ 0.003), and the trigger coincidence rate is 0.93 (very high; is this equivalent to R²?). Can this be interpreted as meaning that eventB has a strong influence on eventA, but not the other way round?
Then I can plot these two events using the CC.plot function:
CC.plot(eventA,eventB,dates=c(1900:2040),delT=2, tau=2, seriesAname = 'EventA', seriesBname = 'EventB')
Which yields:
Is there any way to modify the graphical parameters in CC.plot? The dummy years are not visible in this plot. I'd like to change fonts, size, colours, etc. Is there any way to draw the same figure by calling the model output (ca.out)?
Thanks in advance!
I'll try to answer your questions:
Question #1: The most important problem I see in your example is that your events are not "rare". The main precondition of the analytical significance test that you used by default (sigtest="poisson") is therefore not fulfilled. Another "problem" is that the events in both series seem to be clustered (this may also be an effect of the high number of events). I would recommend using sigtest="shuffle.surrogate", which is more appropriate for this case. More information about the significance tests can be found in Siegmund et al. 2017 (http://www.sciencedirect.com/science/article/pii/S0098300416305489)
Running the analysis with this setting reveals that neither coincidence rate is significant. By the way: with such a high number of events it is extremely unlikely that you would ever get a "significant" coincidence rate, because the probability that coincidences occur by chance is very high.
Nevertheless, if the trigger coincidence rate were significant and the precursor coincidence rate were not, your interpretation would be a possible one.
Question #2: The problem with the plot is, again, that there are too many events (compared to what the method was originally designed for). This is why everything looks so messy. The function was meant more as an aid to explain how the method works and what you have done.
If you plot only about 20 years of your data, e.g.
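To make the idea behind a shuffle-surrogate test concrete, here is a minimal Python sketch. It is not CoinCalc's implementation: the window convention (a B event somewhere in [t - tau - delT, t - tau] before each A event), the function names, and the number of surrogates are assumptions chosen for illustration, and the package may handle series boundaries differently.

```python
import random

def precursor_rate(a, b, delT, tau):
    """Share of A events with at least one B event in [t - tau - delT, t - tau]."""
    a_times = [t for t, x in enumerate(a) if x]
    if not a_times:
        return 0.0
    hits = 0
    for t in a_times:
        hi = t - tau
        lo = max(0, hi - delT)
        if hi >= 0 and any(b[lo:hi + 1]):
            hits += 1
    return hits / len(a_times)

def shuffle_p_value(a, b, delT, tau, n_surrogates=1000, seed=0):
    """Estimate how often randomly shuffled series reach the observed rate."""
    rng = random.Random(seed)
    observed = precursor_rate(a, b, delT, tau)
    a_s, b_s = list(a), list(b)
    exceed = 0
    for _ in range(n_surrogates):
        # Shuffling keeps the number of events but destroys their timing.
        rng.shuffle(a_s)
        rng.shuffle(b_s)
        if precursor_rate(a_s, b_s, delT, tau) >= observed:
            exceed += 1
    return observed, exceed / n_surrogates

# Tiny hypothetical series, only to show the call.
a = [0, 0, 1, 0, 0, 1]
b = [1, 0, 0, 1, 0, 0]
rate, p = shuffle_p_value(a, b, delT=0, tau=2, n_surrogates=200)
print(rate, p)
```

With dense series, most shuffled versions also produce high coincidence rates, which is why the estimated p-value tends to be large in that situation.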
CC.plot(eventA[120:140],eventB[120:140],dates=c(2020:2040),delT=2, tau=2, seriesAname = 'EventA', seriesBname = 'EventB')
you will get a much clearer image, although, due to the high event density of almost 50%, it is still not very nice.
[Figure: CoinCalc plot]
For now, there are no options to change the plot parameters. This might come in a future version of the package.
I hope that this helps you a bit!
I have a large sales database from a 'home and construction' retailer.
I need to find out which customers of the store are electricians, plumbers, painters, etc.
My first approach was to select the articles related to a specialty (for example, wires [article] are related to electricians [specialty]) and then, based on customer sales, identify who those customers are.
But this is a lot of work.
My second approach is to do a cluster segmentation first and then discover which cluster belongs to which specialty. (This is a lot better because I would be able to discover new segments.)
But how can I do that? What type of clustering should I use: k-means, fuzzy? Which variables should go into the model? Should I use PCA to find out how many clusters to look for?
The header of my data (simplified):
customer_id | transaction_id | transaction_date | item_article_id | item_group_id | item_category_id | item_qty | sales_amt
Any help would be appreciated
(Sorry for my English.)
You want to identify classes of customers based on what they buy (I presume this is for marketing reasons). This calls for a clustering approach. I will talk you through the entire setup.
The clustering space
Let us first consider what exactly you are clustering: either orders or customers. In either case, the way you characterize the items and the distances between them is the same. I will discuss the basic case for orders first, and then explain the considerations that apply to clustering by customers instead.
For your purpose, an order is characterized by what articles were purchased, and possibly also how many of them. In terms of a space, this means that you have a dimension for each type of article (item_article_id), for example the "wire" dimension. If all you care about is whether an article is bought or not, each item has a coordinate of either 0 or 1 in each dimension. If some order includes wire but not pipe, then it has a value of 1 on the "wire" dimension and 0 on the "pipe" dimension.
However, there is something to say for caring about the quantities. Perhaps plumbers buy lots of glue while electricians buy only small amounts. In that case, you can set the coordinate in each dimension to the quantity of the corresponding article (presumably item_qty). So suppose you have three articles, wire, pipe and glue, then an order described by the vector (2, 3, 0) includes 2 wire, 3 pipe and 0 glue, while an order described by the vector (0, 1, 4) includes 0 wire, 1 pipe and 4 glue.
If there is a large spread in the quantities for a given article, i.e. if some orders include orders of magnitude more of some article than other orders, then it may be helpful to work with a log scale. Suppose you have these four orders:
2 wire, 2 pipe, 1 glue
3 wire, 2 pipe, 0 glue
0 wire, 100 pipe, 1 glue
0 wire, 300 pipe, 3 glue
The former two orders look like they may belong to electricians while the latter two look like they belong to plumbers. However, if you work with a linear scale, order 3 will turn out to be closer to orders 1 and 2 than to order 4. We fix that by using a log scale for the vectors that encode these orders (I use the base 10 logarithm here, but it does not matter which base you take because they differ only by a constant factor):
(0.30, 0.30, 0)
(0.48, 0.30, -2)
(-2, 2, 0)
(-2, 2.48, 0.48)
Now order 3 is closest to order 4, as we would expect. Note that I have used -2 as a special value to indicate the absence of an article, because the logarithm of 0 is not defined (log(x) tends to negative infinity as x tends to 0). -2 means that we pretend that the order included 1/100th of the article; you could make the special value more or less extreme, depending on how much weight you want to give to the fact that an article was not included.
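The effect of the log scale can be checked with a few lines of Python. This is only a sketch of the four example orders; the helper names are made up for illustration, and -2 is the stand-in value for an absent article as described in the text.

```python
import math

ABSENT = -2.0  # stands in for log10(0), i.e. "1/100th of an article"

# (wire, pipe, glue) quantities of the four example orders
orders = [(2, 2, 1), (3, 2, 0), (0, 100, 1), (0, 300, 3)]

def to_log(order):
    """Base-10 log of each quantity, with ABSENT for a quantity of zero."""
    return tuple(math.log10(q) if q > 0 else ABSENT for q in order)

def nearest(points, i):
    """Index of the point closest to points[i] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: math.dist(points[i], points[j]))

linear_nn = nearest(orders, 2)                     # neighbour of order 3, raw quantities
log_nn = nearest([to_log(o) for o in orders], 2)   # neighbour of order 3, log scale
print("linear scale: order 3 is closest to order", linear_nn + 1)
print("log scale:    order 3 is closest to order", log_nn + 1)
```

On the linear scale the nearest neighbour of order 3 is one of the small orders; on the log scale it is order 4.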
The input to your clustering algorithm (regardless of which algorithm you take, see below) will be a position matrix with one row for each item (order or customer), one column for each dimension (article), and either the presence (0/1), amount, or logarithm of the amount in each cell, depending on which you choose based on the discussion above. If you cluster by customers, you can simply sum the amounts from all orders that belong to that customer before you calculate what goes into each cell of your position matrix (if you use the log scale, sum the amounts before taking the logarithm).
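As a rough sketch of how such a position matrix could be built per customer, the following Python snippet sums quantities over all of a customer's orders. The sample rows are hypothetical; only the column names customer_id, item_article_id and item_qty are taken from the question.

```python
from collections import defaultdict

# Hypothetical transaction rows: (customer_id, item_article_id, item_qty)
rows = [
    ("c1", "wire", 2), ("c1", "pipe", 1), ("c1", "wire", 3),
    ("c2", "pipe", 100), ("c2", "glue", 4),
]

def position_matrix(rows):
    """One row per customer, one column per article, summed quantities in the cells."""
    totals = defaultdict(lambda: defaultdict(float))
    for customer, article, qty in rows:
        totals[customer][article] += qty
    customers = sorted(totals)
    articles = sorted({article for _, article, _ in rows})
    matrix = [[totals[c].get(a, 0.0) for a in articles] for c in customers]
    return customers, articles, matrix

customers, articles, matrix = position_matrix(rows)
print(articles)
for name, row in zip(customers, matrix):
    print(name, row)
```

To cluster by orders instead, the same function can be keyed on transaction_id rather than customer_id. A log transform, if wanted, would be applied to the summed cells afterwards.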
Clustering by orders rather than by customers gives you more detail, but also more noise. Customers may be consistent within an order but not between orders; perhaps a customer sometimes behaves like a plumber and sometimes like an electrician. This is a pattern that you will only find if you cluster by orders. You will then find how often each customer belongs to each cluster; perhaps 70% of somebody's orders belong to the electrician type and 30% belong to the plumber type. On the other hand, a plumber may buy only pipe in one order and only glue in the next. Only if you cluster by customers and sum the amounts of their orders do you get a balanced view of what each customer needs on average.
From here on I will refer to your position matrix by the name my.matrix.
The clustering algorithm
If you want to be able to discover new customer types, you probably want to let the data speak for themselves as much as possible. A good old-fashioned hierarchical clustering with complete linkage (CLINK) may be an appropriate choice in this case. In R, you simply do hclust(dist(my.matrix)) (this will use the Euclidean distance measure, which is probably good enough in your case). It will join closely neighbouring items or clusters together until all items are categorized in a hierarchical tree. You can treat any branch of the tree as a cluster, observe typical article amounts for that branch and decide whether that branch represents a customer segment by itself, should be split in sub-branches, or joined with a sibling branch instead. The advantage is that you find the "full story" of which items and clusters of items are most similar to each other and how much. The disadvantage is that the outcome of the algorithm does not tell you where to draw the borders between your customer segments; you can cut up the clustering tree in many ways, so it's up to your interpretation how you want to identify your customer types.
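To illustrate what complete linkage does, here is a deliberately naive Python sketch that merges clusters until a chosen number remains (roughly what cutting the tree at a given level gives you). It is not how hclust is implemented and it is far too slow for a large database; the function name and the stopping rule are assumptions made for the example.

```python
import math

def complete_linkage(points, n_clusters):
    """Repeatedly merge the two clusters whose farthest members are closest."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Complete linkage: distance between clusters = largest pairwise distance.
        return max(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        pairs = [(x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# The four log-scaled example orders (wire, pipe, glue) from above.
log_orders = [(0.30, 0.30, 0.0), (0.48, 0.30, -2.0),
              (-2.0, 2.0, 0.0), (-2.0, 2.48, 0.48)]
print(complete_linkage(log_orders, 2))
```

On these four points the two "electrician" orders end up in one cluster and the two "plumber" orders in the other.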
On the other hand, if you are comfortable fixing the number of clusters (k) beforehand, k-means is a very robust way to get just any segmentation of your customers in k distinct types. In R, you would do kmeans(my.matrix, k). For marketing purposes, it may be sufficient to have (say) 5 different profiles of customers that you make custom advertisement for, rather than treating all customers the same. With k-means you don't explore all of the diversity that is present in your data, but you might not need to do so anyway.
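For readers who want to see what k-means does internally, here is a minimal pure-Python sketch of the standard assign-and-update loop (Lloyd's algorithm). The random initialisation, the iteration cap and the sample points are assumptions for illustration; a library implementation such as R's kmeans is the better choice for real data.

```python
import math
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Alternate between assigning points to centers and recomputing the centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = list(centers)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                new_centers[c] = tuple(sum(col) / len(members)
                                       for col in zip(*members))
        if new_centers == centers:  # converged: labels match the final centers
            break
        centers = new_centers
    return labels, centers

# Hypothetical two-article position matrix with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, 2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary; only which points share a label is meaningful.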
If you don't want to fix the number of clusters beforehand, but you also don't want to manually decide where to draw the borders between the segments afterwards, there is a third possibility. You start with the k-means algorithm and let it generate a number of cluster centers that is much larger than the number of clusters you hope to end up with (for example, if you hope to end up with about 10 clusters, let the k-means algorithm look for 200 clusters). Then, use the mean shift algorithm to further cluster the resulting centers. You will end up with a smaller number of compact clusters. The approach is explained in more detail by James Li over here. You can use the mean shift algorithm in R with the ms function from the LPCM package, see this documentation.
About using PCA
PCA will not tell you how many clusters you need. PCA answers a different question: which variables seem to represent a common underlying (hidden) factor. In a sense, it is a way to cluster variables, i.e. properties of entities, not to cluster the entities themselves. The number of principal components (common underlying factors) is not indicative of the number of clusters needed. PCA can still be interesting if you want to learn something about the predictive value of each article about a customer's interests.
Sources
Michael J. Crawley, 2005. Statistics. An Introduction using R.
Gerry P. Quinn and Michael J. Keough, 2002. Experimental Design and Data Analysis for Biologists.
Wikipedia: hierarchical clustering, k-means, mean shift, PCA
I'm trying to use the Naive Bayes Learner from e1071 to do spam analysis. This is the code I use to set up the model.
library(e1071)
emails=read.csv("emails.csv")
emailstrain=read.csv("emailstrain.csv")
model<-naiveBayes(type ~.,data=emailstrain)
There are two sets of emails, each with a 'statement' and a 'type' column. One is for training and one is for testing. When I run
model
and just read the raw output, it seems to give a higher-than-zero probability of a statement being spam when it is indeed spam, and the same is true when the statement is not spam. However, when I try to use the model to predict the testing data with
table(predict(model,emails),emails$type)
I get that
       ham  spam
ham   2086   321
spam     2     0
which seems wrong. I also tried testing on the training set itself; in this case it should give quite good results, or at least as good as what was observed in the model. However, it gave
       ham  spam
ham   2735   420
spam     0     6
which is only slightly better than with the testing set. I think there must be something wrong with how the predict function is working.
How the data files are set up, with some examples of what's inside:
type,statement
ham,How much did ur hdd casing cost.
ham,Mystery solved! Just opened my email and he's sent me another batch! Isn't he a sweetie
ham,I can't describe how lucky you are that I'm actually awake by noon
spam,This is the 2nd time we have tried to contact u. U have won the £1450 prize to claim just call 09053750005 b4 310303. T&Cs/stop SMS 08718725756. 140ppm
ham,"TODAY is Sorry day.! If ever i was angry with you, if ever i misbehaved or hurt you? plz plz JUST SLAP URSELF Bcoz, Its ur fault, I'm basically GOOD"
ham,Cheers for the card ... Is it that time of year already?
spam,"HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870..k"
ham,"When people see my msgs, They think Iam addicted to msging... They are wrong, Bcoz They don\'t know that Iam addicted to my sweet Friends..!! BSLVYL"
ham,Ugh hopefully the asus ppl dont randomly do a reformat.
ham,"Haven't seen my facebook, huh? Lol!"
ham,"Mah b, I'll pick it up tomorrow"
ham,Still otside le..u come 2morrow maga..
ham,Do u still have plumbers tape and a wrench we could borrow?
spam,"Dear Voucher Holder, To claim this weeks offer, at you PC please go to http://www.e-tlp.co.uk/reward. Ts&Cs apply."
ham,It vl bcum more difficult..
spam,UR GOING 2 BAHAMAS! CallFREEFONE 08081560665 and speak to a live operator to claim either Bahamas cruise of£2000 CASH 18+only. To opt out txt X to 07786200117
I would really love suggestions. Thank you so much for your help
Actually, the predict function works just fine. Don't get me wrong, but the problem is in what you are doing. You are building the model using this formula: type ~ ., right? It is clear what we have on the left-hand side of the formula, so let's look at the right-hand side.
In your data you have only two variables, type and statement, and because type is the dependent variable, the only thing that counts as an independent variable is statement. So far everything is clear.
Let's take a look at the Bayesian classifier. The a priori probabilities are obvious, right? What about the conditional probabilities? From the classifier's point of view you have only one categorical variable (your sentences). To the classifier it is just a list of labels. All of them are unique, so the a posteriori probabilities will be close to the a priori ones.
In other words, the only thing we can tell when we get a new observation is that the probability of it being spam is equal to the proportion of spam messages in your training set.
If you want to use any machine learning method on natural language, you have to pre-process your data first. Depending on your problem this could mean, for example, stemming, lemmatization, or computing n-gram statistics or tf-idf weights. Training the classifier is the last step.
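As a rough illustration of why pre-processing matters, here is a small Python sketch that splits each statement into words and trains a word-count (multinomial) naive Bayes classifier with add-one smoothing. This is not what e1071 does internally; the tokenizer, the function names and the tiny training set (a few rows copied from the question) are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def train(rows):
    """rows: iterable of (type, statement) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, statement in rows:
        class_counts[label] += 1
        word_counts[label].update(tokenize(statement))
    return class_counts, word_counts

def predict(model, statement):
    class_counts, word_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(statement):
            if word in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing avoids log(0)
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

rows = [
    ("ham", "How much did ur hdd casing cost."),
    ("ham", "Cheers for the card ... Is it that time of year already?"),
    ("ham", "Do u still have plumbers tape and a wrench we could borrow?"),
    ("spam", "HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870..k"),
    ("spam", "UR GOING 2 BAHAMAS! CallFREEFONE 08081560665 and speak to a live operator to claim either Bahamas cruise of£2000 CASH 18+only. To opt out txt X to 07786200117"),
]
model = train(rows)
print(predict(model, "call now to claim cash"))
print(predict(model, "do u still have the card"))
```

Because the features are now individual words rather than whole sentences, a new message can share evidence with the training messages, which is what the classifier needs in order to move away from the prior.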
My goal is to independently calculate the number of items an enemy would drop after it is killed. For example, say there are 50 potions each with a 50% chance of being dropped, I'd like to randomly return a number from 0 to 50, based on independent trials.
Currently, this is the code I'm using:
int droppedItems(int n, float probability) {
    int count = 0;
    for (int x = 1; x <= n; ++x) {
        if (random() <= probability) {
            ++count;
        }
    }
    return count;
}
Here probability is a number from 0.0 to 1.0, random() returns a value from 0.0 to 1.0, and n is the maximum number of items to be dropped. The code above is C++, but I'm actually using Visual Basic 6, so there are no libraries to help with this.
This code works flawlessly. However, I'd like to optimize this so that if n happens to be 999999, it doesn't take forever (which it currently does).
Use the binomial distribution. Wiki - Binomial Distribution
Ideally, use the libraries for whatever language this pseudocode will be written in. There's no sense in reinventing the wheel unless of course you are trying to learn how to invent a wheel.
Specifically, you'll want something that will let you generate random values given a binomial distribution with a probability of success in any given trial and a number of trials.
EDIT :
I went ahead and did this (in python, since that's where I live these days). It relies on the very nice numpy library (hooray, abstraction!):
>>> import numpy
>>> numpy.random.binomial(99999, 0.5)
49853
>>> numpy.random.binomial(99999, 0.5)
50077
And, using timeit.Timer to check execution time:
# timing it across 10,000 iterations for 99,999 items per iteration
>>> import timeit
>>> timeit.Timer(stmt="numpy.random.binomial(99999,0.5)", setup="import numpy").timeit(10000)
0.00927[... seconds]
EDIT 2 :
As it turns out, there isn't a simple way to implement a random number generator based off of the binomial distribution.
There is an algorithm you can implement without library support which will generate random variables from the binomial distribution. You can view it here as a PDF
My guess is that given what you want to use it for (having monsters drop loot in a game), implementing the algorithm is not worth your time. There's room for fudge factor here!
I would change your code like this (note: this is not a binomial distribution):
Use your current code for small values, say n up to 100.
For n greater than one hundred, calculate the value of count for 100 using your current algorithm and then multiply the result by n/100.
Again, if you really want to figure out how to implement the BTPE algorithm yourself, you can - I think the method I give above wins in the trade off between effort to write and getting "close enough".
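As @IamChuckB pointed out already, the key term is binomial distribution. When the number of Bernoulli trials (the number of items in your example) is large and the success probability is small, a good approximation is the Poisson distribution, which is much simpler to calculate and draw numbers from (the exact algorithm is spelled out in the linked Wikipedia article). For probabilities that are not small, such as the 50% in your example, the normal approximation with mean n*p and variance n*p*(1-p) is the usual choice.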
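One way to avoid looping over every trial when n is large is the normal approximation to the binomial distribution (mean n*p, variance n*p*(1-p)). The Python sketch below is an approximation, not an exact binomial sampler; the cut-off of 100 trials and the function name are arbitrary choices for illustration. It needs only a uniform random() function, so the same steps should carry over to a language without library support.

```python
import math
import random

def dropped_items(n, p, rng=random, exact_limit=100):
    """Approximate number of successes in n independent trials with probability p."""
    if n <= exact_limit:
        # Small n: simulate every trial, as in the original loop.
        return sum(1 for _ in range(n) if rng.random() < p)
    # Large n: a single draw from a normal distribution with the
    # binomial's mean and standard deviation.
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    # Box-Muller transform: two uniform numbers give one standard normal number.
    u1 = 1.0 - rng.random()  # lies in (0, 1], so log(u1) is defined
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    k = round(mean + sd * z)
    return min(n, max(0, k))  # clamp to the valid range 0..n

print(dropped_items(50, 0.5))
print(dropped_items(999999, 0.5))
```

The cost for large n is constant, and the result keeps roughly the right spread, unlike scaling up a 100-trial result, which inflates the variance.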
Is there a formula or algorithm that can prioritize items based on a weight and a date? For instance, a critical item would always be at the top of the list, while two normal items would be prioritized based on their due dates.
Scheduling is one of the most-studied areas of computer science, which is convenient, because it gives a lot of prior art that you can learn from.
Perhaps the easiest approach is Earliest Deadline First -- where you schedule the task with the first deadline and work on it until it blocks. Then work on the next earliest deadline. The downside is that low-priority tasks that take a long time might stall higher-priority tasks.
It might be worthwhile to determine if your scheduling must be hard, firm, or soft -- sometimes it makes more sense to drop some tasks completely and finish nearly everything on time than to finish everything but half a second too late.
Yes. This can be done by defining a comparison function that checks priority first, i.e.
// Returns n < 0, 0, or n > 0 if value1 is less than, equal to, or greater than value2
compare(value1, value2) {
    if (value1.priority != value2.priority) {
        return value1.priority - value2.priority;
    }
    return value1.date - value2.date;
}
Alternatively, the following function returns a single value calculated from the date and the priority; this value can be used to compare tasks and order them by priority (and then by date):
// Returns one sortable number: priority dominates, date breaks ties (assumes date values stay below MAX_DATE_VALUE)
task.GetValue() {
    return me.GetDateAsIntegerValue() + MAX_DATE_VALUE * me.GetPriority();
}
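The same "priority first, then date" ordering can be sketched in Python with a tuple as sort key, which avoids having to pick a MAX_DATE_VALUE. The task list and the convention that a higher priority number means more important are assumptions made for this example.

```python
from datetime import date

# Hypothetical tasks; a higher priority number means more important here.
tasks = [
    {"name": "normal, due later", "priority": 1, "due": date(2024, 3, 1)},
    {"name": "critical", "priority": 9, "due": date(2024, 6, 1)},
    {"name": "normal, due sooner", "priority": 1, "due": date(2024, 2, 1)},
]

# Sort by descending priority, then by ascending due date.
ordered = sorted(tasks, key=lambda t: (-t["priority"], t["due"]))
for t in ordered:
    print(t["name"])
```

The critical task comes first regardless of its due date, and the two normal tasks are ordered by due date.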
But just as sarnold mentioned, this is a highly studied area.
A different way to look at this is as a ranking problem. If you take these two values, weight and priority as inputs, you can create a table of paired comparisons that decompose items into their inputs (weight and priority) and outputs are relative orderings.
Consider, say, item 42 and item 69, denoted X42 and X69: if you have their weights and priority (W42, P42) and (W69, P69), you'd like to know if X42 should appear before X69, after it, or at an equal position. If you have a training set, you can tag whether one is preferred to the other.
What we're lacking here is a method for comparing these. A very simple method is to use logistic regression on the differences, i.e. a simple function f( (W_A - W_B), (P_A - P_B)), or f((W42 - W69),(P42 - P69)), in this case. If the result is above some threshold, then A is preferred to B, otherwise B is preferred to A. You can use this to sort the results.
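A minimal Python sketch of sorting with such a pairwise function is shown below. The coefficients are hypothetical placeholders (in practice they would be fitted by logistic regression on tagged pairs), and the item values are invented for illustration.

```python
import math
from functools import cmp_to_key

# Hypothetical coefficients; normally estimated from a tagged training set.
BETA_W, BETA_P = 1.5, 0.8

def prob_a_before_b(a, b):
    """Logistic function of the differences in the two inputs."""
    z = BETA_W * (a["w"] - b["w"]) + BETA_P * (a["p"] - b["p"])
    return 1.0 / (1.0 + math.exp(-z))

def compare(a, b):
    """Negative if a is preferred to b, positive if b is preferred, 0 if tied."""
    prob = prob_a_before_b(a, b)
    if prob > 0.5:
        return -1
    if prob < 0.5:
        return 1
    return 0

items = [
    {"id": "X69", "w": 1, "p": 2},
    {"id": "X42", "w": 3, "p": 1},
    {"id": "X7", "w": 2, "p": 2},
]
ranked = sorted(items, key=cmp_to_key(compare))
print([item["id"] for item in ranked])
```

With a purely additive function and a threshold of 0.5, this gives the same order as sorting by the weighted sum of the inputs; the pairwise form matters mainly for how the coefficients are learned.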
As usual, most of the results online are not very accessible to beginners. Here's a short chapter that may be helpful in understanding the logistic regression. However, if you'd like to address such matters in more depth, the statistics StackExchange site would be better.
You'll have to decide: (1) if what you're looking at can be decomposed into an additive function of the weight and priority, and, if so, (2) the loss function or objective function that you need to minimize, so that you can get the optimal parameters for this additive function. An ordinal logistic model is one choice, ordinal probit another, and there are tons of other options. If you don't use an additive function (i.e. a linear combination), you'll have a challenging range of possibilities to consider, so it's best to start with something simple.
You can separate the tasks by rating the impact 1-10 (10 being highest) and the output needed 1-10 (also 10 being hardest)
You add the numbers together and divide by two. The result will be the priority ranking of your task 1-10 (10 being most important).
Example:
Check Emails: impact 2 output 1 = 1.5
Call potential customer: impact 10 output 2 = 6
From this example the calling of the customer would then be placed in a higher priority than checking emails.
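The averaging rule can be written as a short Python sketch; the task names and scores are the ones from the example, and the function name is made up for illustration.

```python
def priority(impact, output):
    """Average of the impact and output ratings (each 1-10)."""
    return (impact + output) / 2

# (impact, output) ratings from the example
tasks = {
    "Check emails": (2, 1),
    "Call potential customer": (10, 2),
}

# Highest priority first.
ranked = sorted(tasks, key=lambda name: priority(*tasks[name]), reverse=True)
for name in ranked:
    print(name, priority(*tasks[name]))
```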