How to impose syntactic constraints in Apriori in Orange

I use Orange to build association rules on a sparse medical dataset, but I can't find a way to impose syntactic constraints on rule production.
It seems that in Orange I can only set min support, min confidence, and the max number of rules, whereas I'm interested in having a specific set of events on the right or the left side of the implications.
For example, I may be interested only in rules that have a specific item I(x) appearing in the consequent, or rules that have a specific item I(y) appearing in the antecedent, or combinations of the above constraints.

Rules are usually not generated directly as rules, but from frequent itemsets.
To derive association rules, you also need to know the support of every possible subset. Computing and storing these subsets is the challenge; extracting the rules from the frequent itemsets is not very difficult or expensive.
So you may as well apply the constraints only to the input data, or filter the output rules after generation. If you apply the constraints too early or in the wrong place, you may violate the monotonicity property required for a correct result.

Try the latest Orange 3. There is an updated Orange3-Associate add-on (installable via Options > Add-ons) that appears to do exactly what you ask for: it lets you filter induced itemsets/rules by number of items and/or by regular expressions.
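For example, with the add-on's scripting interface you can mine the itemsets yourself and apply the syntactic constraint as a plain filter over the generated rules. A minimal sketch, assuming Orange3-Associate's frequent_itemsets/association_rules functions and a one-hot transaction matrix; the item index is illustrative:

import numpy as np
from orangecontrib.associate.fpgrowth import frequent_itemsets, association_rules

# X: transactions (rows) x items (columns), boolean one-hot encoding; toy data here
X = np.random.rand(100, 8) > 0.7

itemsets = dict(frequent_itemsets(X, min_support=0.05))
rules = association_rules(itemsets, min_confidence=0.5)

ITEM_X = 3  # illustrative column index of the item I(x)
# keep only rules with I(x) in the consequent; test against the antecedent for I(y)
filtered = [(antecedent, consequent, support, confidence)
            for antecedent, consequent, support, confidence in rules
            if ITEM_X in consequent]

Since the constraint is applied after generation, the mining itself stays correct, as noted above.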

Related

How to ensure there is a single edge in a graph for a given order_id?

In my current scenario I have product, customer, and seller nodes in my graph ecosystem. The problem I am facing is that I have to ensure the uniqueness of
(customer)-[buys]->(product)
with order_item_id as a property of the buys edge. I have to ensure that there is a unique buys edge for a given order_item_id. In this way I want my graph insertion to remain idempotent, with no repeated buys edges created for a given order_item_id.
Creating an order_item_id property key:

// reuse the property key if it already exists (Groovy treats null as false)
if (!mgmt.getPropertyKey("order_item_id")) {
    order_item_id = mgmt.makePropertyKey("order_item_id").dataType(Integer.class).make();
} else {
    order_item_id = mgmt.getPropertyKey("order_item_id");
}
What I have found so far is that building a unique index might solve my problem, like:

if (mgmt.getGraphIndex("order_item_id")) {
    ridIndexBuilder = mgmt.getGraphIndex("order_item_id")
} else {
    // composite index on the order_item_id key with a uniqueness constraint
    ridIndexBuilder = mgmt.buildIndex("order_item_id", Edge.class).addKey(order_item_id).unique().buildCompositeIndex()
}
Or I can also use something like:

mgmt.buildEdgeIndex(graph.getOrCreateEdgeLabel("product"), "uniqueOrderItemId", Direction.BOTH, order_item_id)
How should I ensure the uniqueness of a single buys edge for a given order_item_id? (I don't have a use case to search based on order_item_id.)
What is the basic difference between creating an index on an edge using buildIndex and using buildEdgeIndex?
You cannot enforce the uniqueness of properties at the edge level, i.e. between two vertices (see this question on the mailing list). If I understand your problem correctly, building a CompositeIndex on edges with a uniqueness constraint for a given property should address your problem, even though you do not plan to search these edges by the indexed property. However, this could lead to performance issues when inserting many edges simultaneously, due to locking. Depending on the rate at which you insert data, you may have to skip the locking (= skip the uniqueness constraint) and risk duplicate edges, then handle the deduplication yourself at read time and/or run post-import batch jobs to clean up potential duplicates.
buildIndex() builds a global graph index (either a CompositeIndex or a MixedIndex). This kind of index typically lets you quickly find the starting points of your traversal within a graph.
buildEdgeIndex(), however, builds a local "vertex-centric index", which is meant to speed up traversal between vertices with a potentially high degree (a huge number of incident edges, incoming and/or outgoing). Such vertices are called "super nodes" (see the "A solution to the supernode problem" blog post from Aurelius), and although they tend to be quite rare, the likelihood of traversing them isn't that low.
Reference: Titan documentation, Ch. 8 - Indexing for better Performance.

Does supplying an ancestor improve filter query performance in Objectify / Google Cloud Datastore?

I was wondering which of these two is superior in terms of performance (and perhaps cost?):
1) ofy().load().type(A.class).ancestor(parentKey).filter("foo", bar).list()
2) ofy().load().type(A.class).filter("foo", bar).list()
Logically, I would assume (1) will perform better, as it restricts the domain on which the filtering is performed (though obviously there's an index). However, I was wondering if I may be missing some consideration or factor.
Thank you.
Including an ancestor should not significantly affect query performance (and as you note, each option requires a different index).
The more important consideration when deciding to use ancestor queries is whether you require strongly consistent query results. If you do, you'll need to group your entities into entity groups and use an ancestor query (#1 in your example).
The trade-off of grouping entities into entity groups is that each entity group only supports 1 write per second.
You can find a more detailed discussion of structuring data for strong consistency here: https://developers.google.com/datastore/docs/concepts/structuring_for_strong_consistency
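For comparison, here is what the two variants look like with the Python google-cloud-datastore client (a sketch only; the question uses Objectify, and the kind, property, and parent names are illustrative):

from google.cloud import datastore

client = datastore.Client()
parent_key = client.key("Parent", 123)  # hypothetical parent kind and id
bar = "some-value"                      # illustrative filter value

# (1) ancestor query: strongly consistent, but confined to one entity group
q1 = client.query(kind="A", ancestor=parent_key)
q1.add_filter("foo", "=", bar)
strongly_consistent = list(q1.fetch())

# (2) global query: eventually consistent, spans all entity groups
q2 = client.query(kind="A")
q2.add_filter("foo", "=", bar)
eventually_consistent = list(q2.fetch())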

How tightly can MarkLogic search scores be controlled?

Our database contains documents with a lot of metadata, including relationships between those documents. Fictional example:
<document>
<metadata>
<document-number>ID 12345 : 2012</document-number>
<publication-year>2012</publication-year>
<cross-reference>ID 67890 : 1995</cross-reference>
<cross-reference>ID 67890 : 1998</cross-reference>
<cross-reference>ID 67891 : 2000</cross-reference>
<cross-reference>ID 12345 : 2004</cross-reference>
<supersedes>ID 12345 : 2004</supersedes>
...
</metadata>
</document>
<document>
<metadata>
<document-number>ID 12345 : 2004</document-number>
<publication-year>2004</publication-year>
<cross-reference>ID 67890 : 1995</cross-reference>
<cross-reference>ID 67890 : 1998</cross-reference>
<cross-reference>ID 67891 : 2000</cross-reference>
<cross-reference>ID 12345 : 2012</cross-reference>
<cross-reference>ID 12345 : 2001</cross-reference>
<superseded-by>ID 12345 : 2012</superseded-by>
<supersedes>ID 12345 : 2001</supersedes>
...
</metadata>
</document>
We're using a one-box search, based on the MarkLogic Search API, to allow users to search these documents. The search grammar describes a variety of constraints and search options, but mostly (and by default) users search via a field defined to include most of the metadata elements, with (somewhat) carefully chosen weights (what really matters here is that document-number has the highest weight).
The problem is that the business wants quite specific ordering of results, and I can't think of a way to achieve it using the Search API.
The requirement that's causing trouble is this: if the user's search matches a document number (say they search for "12345"), then all documents with that document number should be at the top of the result set, ordered by descending date. It's easy enough to get them to the top of the result set; document-number has the highest weight, so sorting by score works fine. The problem is that the secondary sort by date doesn't work: even though all the document-number matches have higher scores than other documents, they don't have the same score, so they end up ordered by how often the search term appears in the rest of the metadata, which isn't really meaningful at all.
What I think we really need is a way of having the Search API score results simply by the highest-weighted element that matches the search term, without reference to any other matches in the document. I've had a look at the scoring algorithms and can't see one that does that; have I missed something, or is this just not possible? Obviously, it doesn't have to be score that we order by; if there's some other way to get at the score of the single best match in a document and use it for sorting, that would be fine.
Is there some other solution I haven't even thought of?
I thought of doing two searches (one on document-number, and one on the whole metadata tree) and then combining the results, but that seems likely to cause a lot of pain with pagination and performance, which sort of defeats the purpose of using the Search API in the first place.
I should add that it is correct to have those other matches in the result-set, so we can't just search only on document-number.
I think you've reached the limits of what the high-level search API can do for you. I have a few tricks to suggest, though. These won't be 100% robust, but they might be good enough for the business. Then you can get on with the application. Sorry if I sound cynical or dismissive, but I don't believe in micromanaging search results.
Simplest possible: re-sort the first page in memory. That first page could be a bit larger than the page you show to the user. Because it is still limited in size, you can make the rules for this fairly complex without suffering much. That would fix your 'descending date' problem. The results from page 1 wouldn't quite match up with page 2, but that might be good enough.
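A sketch of that idea (Python for brevity; the result-record fields are hypothetical), fetching a somewhat larger first page and re-sorting it before display:

def resort_first_page(results, query_number, page_size):
    # results are already in score order, as returned by the search;
    # pull the document-number matches to the front, newest first
    matches = [r for r in results if query_number in r["doc_number"]]
    rest = [r for r in results if query_number not in r["doc_number"]]
    matches.sort(key=lambda r: r["date"], reverse=True)
    return (matches + rest)[:page_size]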
Taking the next step up in complexity, consider using document quality to handle the descending-date issue. This approach is used by http://markmail.org among others. As each document is inserted or updated, set its document quality to a number derived from the date: days, weeks, or months since 1970, or since some other fixed date. Newer results will tend to float to the top. As long as no other boosts swamp the date-based boost, you should get close to what you want.
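The quality value itself can be as simple as a day count. A sketch of the derivation (in MarkLogic you would compute this at ingest time and apply it with xdmp:document-set-quality):

from datetime import date

def quality_from_date(published: date) -> int:
    # days since a fixed epoch, so newer documents get a higher quality boost
    return (published - date(1970, 1, 1)).days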
There might also be some use in analyzing the query to extract the potential boosting terms. If necessary you could then run xdmp:exists(cts:search(doc(), $query)) on each boosting term as if it were a standalone query, bailing out as soon as you find a true() result: that tells you which query term to boost with an absurdly high weight so that its matches float to the top.
Once you know what the boosting term is, rewrite the entire query to set all other term weights to much lower values, perhaps even 0. The lower those weights, the less the non-boosting terms will interfere with the date-based quality and the boosting weight. If there is no boosting term, you might want to make other adjustments. All this is less expensive than it sounds, by the way: aside from the xdmp:exists calls, it's just in-memory expression evaluation.
Again, though, these are all just tricks to nudge the scores. They won't give you the absolute control over ranking that you're looking for. In my experience, attempts to micromanage scores are doomed to failure. My bet is that your users would be happier with raw TF/IDF, whatever your business managers say.
Another way to do it is to use two searches, as you suggest. Put a range index on document-number (and ideally on the document date), extract any potential document-number values from the query (search:parse, extract, then search:resolve is a good strategy), then execute a cts:element-range-query for docs matching those document-number values, with date descending. If there aren't enough results to fill your N-result page, get the next N-x results from the Search API. You can keep track of the documents returned in the first result set and exclude those URIs from the second one, so keeping track of the pagination won't be too bad (see the sketch below).
This might not perform as well as the first solution, but the time difference of the additional range-index query combined with a shorter Search API query should be negligible for most purposes.
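A sketch of the pagination bookkeeping for the two-search approach (Python for brevity; the two fetch callables stand in for the range query and the Search API call, both returning records with a "uri" field):

def page_results(fetch_number_matches, fetch_search_results, page, page_size):
    # document-number matches come first (already date-descending),
    # then search-api results, minus anything already returned
    needed = (page + 1) * page_size
    combined = fetch_number_matches(0, needed)
    seen = {r["uri"] for r in combined}
    offset = 0
    while len(combined) < needed:
        batch = fetch_search_results(offset, page_size)
        if not batch:
            break
        offset += len(batch)
        combined += [r for r in batch if r["uri"] not in seen]
        seen |= {r["uri"] for r in batch}
    return combined[page * page_size : needed]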

Give me a practical use case of a multiset

I would like to know a few practical use cases (all the better if they are not tied to any particular programming language). I can associate Sets, Lists, and Maps with practical use cases.
For example, if you wanted a glossary of a book, where the terms are listed alphabetically and a location/page number is the value, you would use a TreeMap (an ordered Map).
Somehow, I can't associate multisets with any "practical" use case. Does someone know of any uses?
http://en.wikipedia.org/wiki/Multiset does not tell me enough :)
PS: If you think this should be community-wiki'ed, that's okay. The only reason I did not do it was "There is a clear objective way to answer this question".
Lots of applications. For example, imagine a shopping cart. It can contain more than one instance of an item, e.g. 2 CPUs, 3 graphics boards, etc. So it is a multiset. One simple implementation is to keep a count of each distinct item, i.e. store the information "2 CPUs, 3 graphics boards", and so on.
I'm sure you can think of lots of other applications.
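In Python, for instance, collections.Counter is exactly such a count-keeping multiset; a small illustration:

from collections import Counter

cart = Counter()
cart.update(["cpu", "cpu"])          # 2 CPUs
cart.update(["graphics board"] * 3)  # 3 graphics boards

print(cart["cpu"])         # 2
print(sum(cart.values()))  # 5 items in the cart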
A multiset is useful in many situations in which you'd otherwise have a Map. Here are three examples.
Suppose you have a class Foo with an accessor getType(), and you want to know, for a collection of Foo instances, how many have each type.
Similarly, a system could perform various actions, and you could use a Multiset to keep track of how many times each action occurred.
Finally, to determine whether two collections contain the same elements, ignoring order but paying attention to how often instances are repeated, simply call
HashMultiset.create(collection1).equals(HashMultiset.create(collection2))
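The same three uses, sketched with Python's collections.Counter in place of Guava's Multiset (the data is illustrative):

from collections import Counter

# 1. How many instances have each type?
types = ["a", "b", "a", "c", "a"]  # stand-in for foo.getType() values
type_counts = Counter(types)       # Counter({'a': 3, 'b': 1, 'c': 1})

# 2. How many times did each action occur?
actions = Counter()
actions["save"] += 1

# 3. Same elements, ignoring order but respecting multiplicity?
same = Counter([1, 2, 2]) == Counter([2, 1, 2])  # True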
In some fields of mathematics, a "set" is treated as a multiset for all purposes. For example, in linear algebra, a set of vectors is treated as a multiset when testing for linear dependency: the multiset {v, v} is linearly dependent, even though as a plain set it would collapse to the singleton {v}.
Thus, implementations in these fields should benefit from the use of multisets.
You may say linear algebra isn't practical, but that is a whole different debate...
A shopping cart is a multiset: you can put several instances of the same item in it when you want to buy more than one.

Visualisation of Tree Hierarchy in HTML

I am looking for inspiration for the interaction design of a hierarchy/tree structure (products with a number of subproducts, and rules that apply when selecting subproducts).
I want a tree with a visible connection between sub-nodes and their parent node, and I also want to visualize which rules apply when selecting them.
Typical rules:
mandatory: select exactly one of one subproduct
optional: select 0 or more of several subproducts
mutually exclusive: select only one of several subproducts
I hope you get the idea.
I am looking for any inspiration in this area. Any suggestions, examples, or tips are welcome.
Here are several:
http://thejit.org/
http://www.jsviz.org/blog/
http://vis.stanford.edu/protovis/
If you are willing to use something other than HTML/JavaScript, Flare is an excellent library for Adobe Flash.
I've used the InfoVis library for such a scenario (here's the demo). You could use distinct node colors for the different selection rules, together with some textual description, although it wouldn't be very intuitive at first.
The default tree orientation is horizontal, which may look odd, but it makes sense when you add textual node names of variable length.
