OptaPlanner: Difference between filtering during or after join - constraints

Is there any difference, semantically or performance wise, between
cf.forEach(Entity.class)
        .join(cf.forEach(Entity.class),
                Joiners.filtering(myBiPredicate))
...
vs
cf.forEach(Entity.class)
        .join(cf.forEach(Entity.class))
        .filter(myBiPredicate)
...
Also, are the Joiners executed in the order given? (I.e. is it advised to, for example, put filters that prune a lot of pairs early in the argument list?)

To answer your first question - no, there is no practical difference between the two constraints. Joiners.filtering(...) exists so that you can apply a filter in ifExists(...) calls; when doing a normal join, you might as well just do a filter after the join.
To answer your second question, I wrote an entire blog post. Yes, it is better to prune early.
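For reference, a minimal sketch of the two equivalent forms side by side (Shift and its overlaps(...) method are hypothetical; only the constraint wiring matters):

import org.optaplanner.core.api.score.buildin.hardsoft.HardSoftScore;
import org.optaplanner.core.api.score.stream.Constraint;
import org.optaplanner.core.api.score.stream.ConstraintFactory;
import org.optaplanner.core.api.score.stream.Joiners;

class OverlapConstraints {
    // Form 1: the predicate is applied as part of the join.
    Constraint joinerForm(ConstraintFactory cf) {
        return cf.forEach(Shift.class)
                .join(cf.forEach(Shift.class),
                        Joiners.filtering((a, b) -> a.overlaps(b)))
                .penalize(HardSoftScore.ONE_SOFT)
                .asConstraint("Overlap (filtering joiner)");
    }

    // Form 2: plain join followed by a post-join filter; the same pairs survive.
    Constraint filterForm(ConstraintFactory cf) {
        return cf.forEach(Shift.class)
                .join(cf.forEach(Shift.class))
                .filter((a, b) -> a.overlaps(b))
                .penalize(HardSoftScore.ONE_SOFT)
                .asConstraint("Overlap (post-join filter)");
    }
}

In an ifExists(...) call, by contrast, there is no BiConstraintStream to call filter(...) on, which is why Joiners.filtering(...) exists.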

Related

Removing duplicated rows in the most efficient way

I am wondering what the best way is to remove duplicate rows while keeping the other data within the row.
For instance, let's say I have two rows that have the same ID, and I want to remove either one and keep just one copy of that row, based on that field (or fields).
Online searches suggest that the answer is probably | summarize arg_max(TimeField, *) by ID. However, since KQL is a columnar database, this operation is "heavy" by design and will increase the processing time, probably significantly.
I was wondering if there is no way around it, or whether there is a more efficient way?
Thank you!
I tried to remove entire duplicate rows using arg_max, but given the nature of the function, it is making the query time out.
"It is making the query to timeout."
The default timeout is 4 minutes.
You can increase it up to an hour.
2.
"Since KQL is a columnar database, this operation is "heavy" by design."
This might be a "heavy" operation due to the cardinality of the IDs.
You might need to use summarize with hint.strategy=shuffle instead of the default algorithm.
It has nothing to do with columnar.
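For example, a sketch of the shuffled version of the query quoted in the question (MyTable stands in for the real table name):

MyTable
| summarize hint.strategy=shuffle arg_max(TimeField, *) by ID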

Imperative matching in TinkerPop / CosmosDB

I want to make a query like this one:
g.V().match(
as('foo').hasLabel('bar'),
as('foo').out('baz').hasId('123'),
as('foo').out('baz').hasId('456')
)
.select('foo').by('id')
which is meant to select the ids of all nodes of type bar which have baz-typed edges to all of the specified nodes.
However, CosmosDB only supports a subset of TinkerPop Gremlin, and match() is among the traversal steps that are not supported.
What is a way to formulate the above query using only supported constructs?
You can do something like this
g.V().hasLabel('bar').as('a').out('baz').hasId(within('123','456')).select('a').id()
In a large number of cases you can avoid using the match step.
Cheers
Kelvin
I think you want an "and" condition for the ids in the adjacent vertices so I would go for one of these approaches:
g.V().hasLabel('bar')
  .and(out('baz').hasId('123'), out('baz').hasId('456'))
  .id()
or perhaps (note that the "2" in is(2) should equal the number of ids you are trying to validate):
g.V().hasLabel('bar')
  .where(out('baz').hasId('123','456').count().is(2))
  .id()
I think I prefer this second approach, as you only iterate out('baz') once to confirm the filter. You might make it a little more performant by including limit(2), as follows:
g.V().hasLabel('bar')
  .where(out('baz').hasId('123','456').limit(2).count().is(2))
  .id()
In that way the where() filter will exit as early as possible depending on how quickly you find the "123" and "456" vertices. I suppose imposing the limit() would also protect you from situations where you have more than one edge to those vertices and you end up counting more than what you actually need for the where() to succeed.

Does supplying an ancestor improve filter queries performance in objectify / google cloud datastore?

I was wondering which of these two is superior in terms of performance (and perhaps cost?):
1) ofy().load().type(A.class).ancestor(parentKey).filter("foo", bar).list()
2) ofy().load().type(A.class).filter("foo", bar).list()
Logically, I would assume (1) will perform better as it restricts the domain on which to perform the filtering (though obviously there's an index). However, I was wondering if I may be missing some consideration/factor.
Thank you.
Including an ancestor should not significantly affect query performance (and as you note, each option requires a different index).
The more important consideration when deciding to use ancestor queries is whether you require strongly consistent query results. If you do, you'll need to group your entities into entity groups and use an ancestor query (#1 in your example).
The trade-off of grouping entities into entity groups is that each entity group only supports 1 write per second.
You can find a more detailed discussion of structuring data for strong consistency here: https://developers.google.com/datastore/docs/concepts/structuring_for_strong_consistency
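To make the entity-group point concrete, here is a small sketch of what option #1 implies for the model (ParentEntity is hypothetical; standard Objectify annotations assumed):

import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;
import com.googlecode.objectify.annotation.Parent;

// @Parent places each A in its parent's entity group, which is what makes
// the ancestor query in option #1 strongly consistent. The trade-off: each
// entity group sustains roughly one write per second.
@Entity
class ParentEntity {
    @Id Long id;
}

@Entity
class A {
    @Id Long id;
    @Parent Key<ParentEntity> parent; // entity-group membership
    @Index String foo;                // must be indexed to use filter("foo", bar)
}

// Strongly consistent (option #1):
//   ofy().load().type(A.class).ancestor(parentKey).filter("foo", bar).list();
// Eventually consistent (option #2):
//   ofy().load().type(A.class).filter("foo", bar).list();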

"SELECT * FROM..." VS "SELECT ID FROM..." Performance [duplicate]

This question already has answers here:
select * vs select column
(12 answers)
Closed 9 years ago.
As someone who is fairly new to SQL and doesn't use it much, I'm sure there is an answer to this question out there, but I don't know what to search for to find it, so I apologize.
Question: if I have a bunch of rows in a table with many columns but only need to get back the IDs, which is faster, or are they the same speed?
SELECT * FROM...
vs
SELECT ID FROM...
You asked about performance in particular vs. all the other reasons to avoid SELECT *, so it is performance to which I will limit my answer.
On my system, SQL Profiler initially indicated less CPU overhead for the ID-only query, but with the small number of rows involved, each query took the same amount of time.
I think this was really only due to the ID-only query being run first, though. On re-run (in the opposite order), they showed equally little CPU overhead.
(The SQL Profiler screenshot is not reproduced here.)
With extremely high column and row counts, or extremely wide rows, there may be a perceptible difference in the database engine, but nothing glaring showed up here.
Where you will really see the difference is in sending the result set back across the network! The ID-only result set will typically be much smaller of course - i.e. less to send back.
Never use * to return all columns in a table; it's lazy. You should only extract the data you need.
So SELECT field FROM ... is faster.
There are several reasons you should never (never ever) use SELECT * in production code:
since you're not giving your database any hints as to what you want, it will first need to check the table's definition in order to determine the columns on that table. That lookup will cost some time - not much in a single query - but it adds up over time.
in SQL Server (not sure about other databases), if you need only a subset of columns, there's always a chance a non-clustered index might be covering that request (i.e., contain all the columns needed). With SELECT *, you give up that possibility right from the get-go. In that case, the data would be retrieved from the index pages (if they contain all the necessary columns), so disk I/O and memory overhead would be much lower compared to a SELECT * query; see the sketch below this answer.
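To make the covering-index point concrete, here is a small runnable sketch. It uses SQLite through JDBC purely for illustration (the thread is about SQL Server, but the idea is the same); the table and index names are made up, and it assumes the org.xerial sqlite-jdbc driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CoveringIndexDemo {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = c.createStatement()) {
            st.execute("CREATE TABLE t (id INTEGER, payload TEXT, more TEXT)");
            st.execute("CREATE INDEX idx_t_id ON t(id)");
            // The ID-only query can be answered from the index alone;
            // SQLite typically reports "USING COVERING INDEX" here.
            explain(st, "EXPLAIN QUERY PLAN SELECT id FROM t WHERE id = 1");
            // SELECT * must also visit the table rows to fetch the other columns.
            explain(st, "EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 1");
        }
    }

    private static void explain(Statement st, String sql) throws Exception {
        try (ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("detail"));
            }
        }
    }
}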
Long answer short: selecting only the columns you need will always be at least as fast, because SELECT * forces the engine to fetch every column of every matching row, rules out covering indexes, and returns more data over the wire. This is a best-practice thing that you should adopt very early on.
For the second part, you should probably post a separate question instead of piggybacking off this one. That makes it easier to distinguish what you are asking about.

What is an index in SQLite?

I don't understand what an index is or does in SQLite (NOT SQL). I think it allows for sorting in ascending and descending order and quicker access to data, but I'm just guessing here.
Why not SQL? The answer is the same, though the internal details will differ between implementations.
Putting an index on a column tells the database engine to build, unsurprisingly, an index that allows it to quickly locate rows when you search for certain values in a column, without having to scan every row in the table.
A simple (and probably suboptimal) index might be built with an ordinary binary search tree.
Yes, indexes are all about improved data access performance (but at the cost of extra storage).
http://en.wikipedia.org/wiki/Index_(database)
An index (in any database) is a list of some kind which associates a sorted (or at least, quickly searchable) list of keys with information about where to find the rest of the data associated with the key.
You may not be finding information about this on the Internet because you're assuming it's a SQLite concept, but it's not - it's a general computer engineering concept.
Think about an address book. If you are searching for the phone number of Rossi Mario, you know that surnames are ordered alphabetically, so you can go to the letter R, then search for the letter o, and so on. Indexes do the same: they are collections of references to entries that speed up some operations a lot.
Searching an unordered address book would be much slower: you would have to start from the first name on the first page and search through all the pages until you found the name you were looking for.
"I think it allows for sorting in ascending and descending order and quicker access to data."
Yes, that's what it's for. Indexes create the abstraction of having sorted data, which speeds up searches significantly. With an index backed by a balanced binary search tree, searches take O(log N) instead of O(N) time.
What the other answers haven't mentioned is that most databases use indexes to implement UNIQUE (and therefore also PRIMARY KEY) constraints. To ensure uniqueness, you have to be able to detect whether the key is already there, and that means you want fast searches for it.
Take a look in your SQLite database. Those sqlite_autoindex_ indices were created to enforce UNIQUE constraints.
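If you want to see this for yourself, here is a small sketch using SQLite through JDBC (hypothetical table; assumes the org.xerial sqlite-jdbc driver on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AutoIndexDemo {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = c.createStatement()) {
            // The UNIQUE constraint forces SQLite to build an index so that
            // duplicate keys can be detected quickly on every insert.
            st.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)");
            try (ResultSet rs = st.executeQuery(
                    "SELECT name FROM sqlite_master WHERE type = 'index'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name")); // e.g. sqlite_autoindex_users_1
                }
            }
        }
    }
}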
The same as an index in any SQL (YES SQL) RDBMS.
You can see the SQLite query optimizer considers indexes: http://www.sqlite.org/optoverview.html
Speed up searching and sorting
Different types of SQLite indices speed up searching and sorting in different ways.
The following tutorial explains this in a great way.
