For a report I have to find association rules in a data set of transactions. I downloaded this data set:
http://archive.ics.uci.edu/ml/datasets/online+retail
Then I deleted some columns, converted the values to nominal, and normalized, and ended up with this: https://ufile.io/gz3do
So I thought I had a data set of transactions on which I could use FP-growth and Apriori, but I'm not getting any rules.
It just tells me: No rules found!
Can someone please explain what, if anything, I'm doing wrong?
One reason could be that your support and/or confidence thresholds are too high; try lower ones, e.g. a support and confidence level of 0.001%. Another reason could be that your data set simply doesn't contain any association rules at those thresholds. Try another data set that is known to contain association rules for a given minimum support and confidence.
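If it helps to sanity-check the data outside your current tool, here is a minimal sketch in Python using the mlxtend library (an assumed choice, and the toy transactions are made up) with deliberately low thresholds:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Toy transactions; replace with the item lists from your data set.
transactions = [
    ['milk', 'bread'],
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
]

# One-hot encode: one row per transaction, one boolean column per item.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Deliberately low thresholds, per the advice above.
itemsets = fpgrowth(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(itemsets, metric='confidence', min_threshold=0.01)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])

If even this prints nothing on your data, the problem is more likely in how the transactions were encoded than in the thresholds.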
In Firestore I'm wondering if there is a way to have a field heuristic1 and get all data between two heuristic1 values, but order the results based on a heuristic2.
I ask because the documentation at the bottom of both of these pages
https://firebase.google.com/docs/firestore/query-data/order-limit-data
https://firebase.google.com/docs/firestore/query-data/query-cursors
seems slightly contradictory.
What I want to be able to do is
Ref.startAt(some h1 value).endAt(some second h1 value).orderBy(h2).
I know I'd probably have to index by h1 but even then I'm not sure if there is a way to do this.
Update:
I didn't test this well enough to see that it doesn't produce the desired ordering. The OP asked the question again and got an answer from a Firebase team member:
Because Cloud Firestore doesn't support ordering by a different field
than the supplied inequality, you won't be able to sort by name
directly from the query. Instead you'd need to sort client-side once
you've fetched the data.
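To illustrate the client-side sort the team member describes, here is a minimal sketch with the google-cloud-firestore Python client (an assumed choice; the cities, population, and name fields match the example below):

from google.cloud import firestore

db = firestore.Client()

# The range (inequality) filter is on population, so the query itself
# can only be ordered by population, not by name.
query = (db.collection('cities')
           .where('population', '>=', 1000)
           .where('population', '<=', 2000))

docs = [snap.to_dict() for snap in query.stream()]

# Sort by the second field client-side.
docs.sort(key=lambda d: d.get('name', ''))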
The API supports the capability you want, although I don't see an example in the documentation that shows it.
The ordering of the query terms is important. Suppose you have a collection of cities and the fields of interest are population (h1) and name (h2). To get the cities with population in range 1000 to 2000, ordered by name, the query would be:
citiesRef.orderBy("population").orderBy("name").startAt(1000).endAt(2000)
This query requires a composite index, which you can create manually in the console. Or as the documentation there indicates, the system will help you:
Instead of defining a composite index manually, run your query in your
app code to get a link for generating the required index.
I've downloaded a data set from UCI Machine Learning Repository. In the description of the data set, they talk about "predictive attribute" and "non-predictive attribute". What does it mean and how can you identify them in a data set?
Predictive attributes are attributes that may help your prediction.
Non-predictive attributes are known to not help. For example, a record id, user number, etc. Unique keys usually fall into this category.
To me, it looks like "attributes" refers to the types of data points available; a predictive attribute would therefore be a data point that can be used to "predict" something, such as MYCT, MMIN, MMAX, CACH, CHMIN, or CHMAX. The non-predictive attributes would be the vendor and model names. PRP seems to be the goal field, and ERP is the linear regression's guess.
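For instance, with the Computer Hardware data set this maps onto code roughly as follows (a pandas sketch; machine.data is the file UCI ships):

import pandas as pd

cols = ['vendor', 'model', 'MYCT', 'MMIN', 'MMAX',
        'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP']
df = pd.read_csv('machine.data', names=cols)

# Drop the non-predictive identifiers and the published ERP estimate;
# PRP is the goal (target) field.
X = df.drop(columns=['vendor', 'model', 'PRP', 'ERP'])
y = df['PRP']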
I use Orange to build association rules on a sparse medical dataset, but I can't find a way to impose syntactic constraints on rule production.
It seems that in Orange I can only choose min support, min confidence, and the max number of rules, but I'm interested in having a specific set of events on the right or on the left side of the implications.
For example, I might be interested only in rules that have a specific item I(x) in the consequent, or rules that have a specific item I(y) in the antecedent, or combinations of the above constraints.
Rules are usually not generated directly as rules, but derived from frequent item sets.
To derive association rules, you also need to know the support of every possible subset. Computing and storing these subsets is the challenge; extracting the rules from the frequent item sets afterwards is not very difficult or expensive.
So you may as well apply the constraints only to the input data, or filter the output rules after generation. If you apply the constraints too early or in the wrong place, you may violate the monotonicity requirements needed for a correct result.
You can try the latest Orange 3. There is an updated Orange3-Associate add-on available (installable via Options > Add-ons) that seems to do exactly what you ask for: it lets you filter induced itemsets/rules by the number of items and/or by regular expressions.
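If you prefer scripting to the GUI, here is a rough sketch of the filter-after-generation approach with Orange3-Associate (signatures as I read them in the add-on's documentation; the toy matrix and the target item index are made up):

import numpy as np
from orangecontrib.associate.fpgrowth import frequent_itemsets, association_rules

# Toy transactions-by-items boolean matrix; columns are item indices.
X = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=bool)

itemsets = dict(frequent_itemsets(X, 0.3))   # min support 30%
rules = association_rules(itemsets, 0.6)     # min confidence 60%

# Syntactic constraint applied after generation: keep only rules with
# item 2 (your I(x)) in the consequent.
wanted = [(ante, cons, supp, conf)
          for ante, cons, supp, conf in rules
          if 2 in cons]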
I created a cube that contains five attributes and a metric, and I want to create a document from this cube with different visualisations for each attribute. The problem is that the data in the cube is aggregated based on all the attributes in the cube, so when you add a grid with one attribute and the metric, the numbers will not be correct.
Is there any way to make the metric dynamically aggregate depending on the attribute in use?
This depends on what kind of metric you have in the cube. The best way to achieve aggregation across all attributes is obviously to have the most granular, least aggregated data in the cube, but understandably this is not always possible.
If your metric is a simple SUM metric, then you can set the dynamic aggregation settings on the metric to SUM, and it should aggregate appropriately regardless of the attributes you place on your document/report, unless your attribute relationships are not set up correctly or there are many-to-many relationships between some of those attributes.
If your metric is a distinct count metric, then the approach is slightly different and has been covered previously in a few places. Here is one, written for an older version of MicroStrategy, but the logic can still be applied to newer versions:
http://community.microstrategy.com/t5/tkb/articleprintpage/tkb-id/architect/article-id/1695
I'm setting up Graphite, and hit a problem with how data is represented on the screen when there aren't enough pixels.
I found this post whose first answer is very close to what I'm looking for:
No, what is probably happening is that you're looking at a graph with more datapoints than pixels, which forces Graphite to aggregate the datapoints. The default aggregation method is averaging, but you can change it to summing by applying the cumulative() function to your metrics.
Is there any way to get this cumulative() behavior by default?
I've modified my storage-aggregation.conf to use 'aggregationMethod = sum', but I believe this is for historical data and not for data that's displayed in the UI.
When I apply cumulative() everything is perfect, I'm just wondering if there's a way to get this behavior by default.
I'm guessing that even though you've modified your storage-aggregation.conf to use 'aggregationMethod = sum', the metrics you've already created have not changed their aggregationMethod. The rules in storage-aggregation.conf only affect newly created metrics.
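For reference, this is the kind of entry in storage-aggregation.conf that makes new counter metrics sum (the pattern here is just an example; adjust it to your naming):

[sum_counts]
pattern = ^stats_counts\.
xFilesFactor = 0
aggregationMethod = sum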
To change your existing metrics to be summed instead of averaged, you'll need to use whisper-resize.py. Or you can delete your existing metrics and they'll be recreated with sum.
Here's an example of what you might need to run:
whisper-resize.py --xFilesFactor=0.0 --aggregationMethod=sum /opt/graphite/storage/whisper/stats_counts/path/to/your/metric.wsp 10s:28d 1m:84d 10m:1y 1h:3y
Make sure to run that as the same user who owns the file, or at least make sure the files have the same ownership when you're done; otherwise they won't be writable for new data.
Another possibility if you're using statsd is that you're just using metrics under stats instead of stats_counts. From the statsd README:
In the legacy setting rates were recorded under stats.counter_name
directly, whereas the absolute count could be found under
stats_count.counter_name. With disabling the legacy namespacing those
values can be found (with default prefixing) under
stats.counters.counter_name.rate and stats.counters.counter_name.count
now.
Basically, metrics are aggregated differently under the different namespaces when using statsd, and you want things under stats_counts or stats.counters for values that should be summed.
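To make that concrete, a counter sent with the Python statsd client (a hypothetical example; exact paths depend on your prefix settings) ends up as:

import statsd

client = statsd.StatsClient('localhost', 8125)
client.incr('page.views')  # increment a counter

# With legacy namespacing disabled, Graphite typically receives:
#   stats.counters.page.views.rate   (per-second rate, averaged on rollup)
#   stats.counters.page.views.count  (absolute count, which should be summed)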