Performance difference in Solr for several vs. one filter queries

Is there any performance or other difference between providing several filter queries as separate fq parameters versus providing a single one with all constraints joined with AND?
E.g. fq=field1:foo&fq=field2:bar vs. fq=(field1:foo) AND (field2:bar)
Obviously the first method is more readable and manageable, but I'm sending long queries (suboptimal, but there are reasons for that) via POST, and the library I use doesn't handle POST array parameters very well: they come out as fq[0]=...&fq[1]=..., which Solr does not recognise. Hence I'm considering joining everything into a single fq parameter to avoid that, and I wonder whether it has any other consequences apart from being an ugly crutch.
Solr version is 4.5, if that matters.

The two forms give you the same result:
E.g. fq=field1:foo&fq=field2:bar vs. fq=(field1:foo) AND (field2:bar)
For performance, though, prefer splitting the constraints into separate fq parameters. Each fq is cached as its own entry in Solr's filterCache, so individual filters can be reused by later queries that share only some of the constraints, whereas a single combined fq is cached as one entry and is only reused when exactly the same combination appears again. In my experience the split version has always been faster, and filter queries in general are a great way to speed requests up.
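For illustration, here is roughly what the two request shapes look like from Python's requests library (hypothetical host and core name); passing a list makes the client repeat the fq parameter the way Solr expects:

import requests

solr = "http://localhost:8983/solr/mycore/select"

# Separate fq parameters: requests repeats the key for each list element,
# and Solr caches each filter as its own filterCache entry.
requests.post(solr, data={"q": "*:*", "fq": ["field1:foo", "field2:bar"], "wt": "json"})

# Single combined fq: cached as one entry, reusable only for this exact combination.
requests.post(solr, data={"q": "*:*", "fq": "(field1:foo) AND (field2:bar)", "wt": "json"})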

Related

Gremlin: Does calling expensive steps after cheaper ones work as an optimization?

I have a big Gremlin query that basically filters results; it is made of many has() and where() steps that can be written in any order and give the same result. Some of them are expensive and some are cheap.
If I call the cheaper steps first, I would guess the expensive ones get executed over fewer iterations because many vertices have already been filtered out. That is true when coding in any ordinary language, but in a database implementation I don't know whether the Gremlin steps are executed in the order they are written.
I know this kind of thing usually depends on the Gremlin database implementation, but maybe you can give me some kind of general answer. I've also tried to run some benchmarks, but building good ones for my specific case is too time-consuming, so maybe you can help me with your knowledge of how these databases are implemented internally.
As you mention, it really does depend on the query engine and the way optimized query plans are developed. Some engines will try to reorder parts of queries based on the estimated cardinality of elements being tested. Amazon Neptune works that way, for example. In general it is best to filter out as much as possible as soon as possible. So in a social network you would not want to start with something like g.V().hasLabel('person') unless you are confident the query engine is able to reorder such queries.
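As a rough illustration (gremlinpython, with hypothetical labels and property names, and assuming an existing GraphTraversalSource g), placing the cheap, selective filters first means the expensive step only runs over the survivors if the engine does execute steps in written order:

from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P

results = (g.V()
            .has("person", "country", "NZ")                    # cheap and selective: run first
            .has("person", "active", True)                     # cheap: run next
            .where(__.out("purchased").count().is_(P.gt(10)))  # expensive: now sees far fewer vertices
            .toList())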

Gremlin correlated queries kill performance

I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some batch feature for submitting multiple queries and getting their result sets back that I don't know about, that would solve the problem. As in: hey Gremlin, run these three queries in parallel and give me their results.
Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank. So...
g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")
That query is 10x more expensive than simply getting the edges using:
g.V().outE("someLabel").has("someProperty","someValue")
So if I want to get a set of vertices with their edges or blank placeholders, I have two options: Make two discrete queries and "join" the data in the API or make one very expensive query.
I'm working from the assumption that in Gremlin we "do it in one trip", and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principle. It also adds round-trip overhead.
OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal so let me apologize up front if there's a better way...
g.inject(true).
union(
__.V().not(outE("someLabel")).constant().as("ridiculous"),
__.V().outE("someLabel").as("ridiculous")
).
select("ridiculous")
In essence, I have to write the query twice: once for the traversal with the edge I want and once more for the traversal where the edge is missing. So, if I have n present/not-present checks I'm going to need 2^n copies of the query, each with its own combination of checks, to get optimal performance. Unfortunately, taking that approach runs the risk of a stack overflow, not to mention making the code impossible to manage reliably.
Your original query returned vertex-edge pairs, whereas your answer returns only edges.
You could just run g.E().hasLabel("somelabel") to get the same result.
Probably a faster alternative to your original query might be:
g.E().hasLabel("somelabel").as("e").outV().as("v").select("v","e")
Or
g.V().as("v").outE("somelabel").as("e").select("v","e")
If Gremlin has some batch feature for submitting multiple queries and getting their result sets back that I don't know about, that would solve the problem
Gremlin/TinkerPop do not have such functionality built in. There is at least one graph that does have some form of Gremlin batching - DataStax Graph...not sure about others.
I'm also not sure I really have any answer that you might find useful, but while I wouldn't expect a 10x difference in performance between those two traversals, I would expect the first to perform worse on most graph databases. Basically, the use of named steps with as() puts path-calculation requirements on the traversal, which increases cost. When optimizing Gremlin, one of my earliest steps is to look for ways to factor out anything that might do that.
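For example, one variation on the alternatives above that avoids labelled steps entirely is to use project() instead of as()/select(). A sketch in gremlinpython, assuming an existing GraphTraversalSource g and the names from the question; whether it actually helps will depend on the implementation:

from gremlin_python.process.graph_traversal import __

pairs = (g.E()
          .hasLabel("someLabel")
          .has("someProperty", "someValue")
          .project("e", "v")        # build the pair without path tracking
          .by(__.identity())        # the edge itself
          .by(__.outV())            # its out-vertex
          .toList())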
This question seems related to your other question on Jagged Result Array, but I'm having trouble maintaining the context from one question to the other to understand how to expound further.

Getting large number of entities from datastore

Following this question, I am able to store a large number (>50k) of entities in the datastore. Now I want to access all of them in my application. I have to perform mathematical operations on them, and it always times out. One way is to use the TaskQueue again, but that would be an asynchronous job. I need a way to access these 50k+ entities in my application and process them without timing out.
Part of the accepted answer to your original question may still apply, for example a manually scaled instance with a 24-hour deadline, or a VM instance. For a price, of course.
Some speedup may be achieved by using memcache.
Side note: depending on the size of your entities you may need to keep an eye on the instance memory usage as well.
Another possibility would be to switch to a faster instance class (and with more memory as well, but also with extra costs).
But all such improvements might still not be enough. The best approach would still be to give your entity data processing algorithm a deeper thought - to make it scalable.
I'm having a hard time imagining a computation so monolithic that it can't be broken into smaller pieces, each of which doesn't need all the data at once. I'm almost certain there has to be some way of using partial computations, maybe storing partial results, so that you can split the problem and handle it in smaller pieces across multiple requests.
As an extreme (academic) example think about CPUs doing pretty much any super-complex computation fundamentally with just sequences of simple, short operations on a small set of registers - it's all about how to orchestrate them.
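As a hedged sketch of that idea (hypothetical entity kind, property names, and batch size), query cursors plus the deferred library let each request touch only one chunk and carry the running result forward:

from google.appengine.ext import ndb, deferred

class Item(ndb.Model):             # hypothetical entity kind holding the data
    value = ndb.FloatProperty()

class PartialResult(ndb.Model):    # hypothetical place to persist the running total
    total = ndb.FloatProperty()

BATCH_SIZE = 500

def process_chunk(cursor_urlsafe=None, running_total=0.0):
    cursor = ndb.Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    items, next_cursor, more = Item.query().fetch_page(BATCH_SIZE, start_cursor=cursor)
    running_total += sum(i.value for i in items)   # the "heavy" math, on one chunk only
    if more and next_cursor:
        # hand the rest of the work to a new task, well inside the request deadline
        deferred.defer(process_chunk, next_cursor.urlsafe(), running_total)
    else:
        PartialResult(id="grand-total", total=running_total).put()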
Here's a nice article describing a drastic reduction of the overall duration of a computation (no clue if it's anything like yours) by using a nice approach (also interesting because it's using the GAE Pipeline API).
If you post your code you might get some more specific advice.

Convention for combining GET parameters with AND?

I'm designing an API and I want to allow my users to combine a GET parameter with AND operators. What's the best way to do this?
Specifically I have a group_by parameter that gets passed to a Mongo backend. I want to allow users to group by multiple variables.
I can think of two ways:
?group_by=alpha&group_by=beta
or:
?group_by=alpha,beta
Is either one to be preferred? I've consulted a few API design references, but none of them seems to take a view on this.
There is no strict preference. The advantage to the first approach is that many frameworks will turn group_by into an array or similar structure for you, whereas in the second approach you need to parse out the values yourself. The second approach is also less verbose, which may be relevant if your query string is particularly large.
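A small framework-agnostic illustration of the difference, using just Python's standard library:

from urllib.parse import parse_qs

# Repeated parameter: most frameworks (and parse_qs) hand you a list directly.
parse_qs("group_by=alpha&group_by=beta")          # {'group_by': ['alpha', 'beta']}

# Comma-separated value: you split the single value yourself.
params = parse_qs("group_by=alpha,beta")          # {'group_by': ['alpha,beta']}
fields = params["group_by"][0].split(",")         # ['alpha', 'beta']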
You may also want to test with the first approach that the query strings always come into your framework in the order the client sent them. Some frameworks have a bug where that doesn't happen.

Heavy calculations in mysql

I'm using ASP.NET together with MySQL, and my site requires some quite heavy calculations on the data. These calculations are done in MySQL; I thought it would be easier to do them there so I could work with the data easily in ASP.NET, databinding my gridviews and other data controls without much code.
But I'm starting to think I made a mistake doing all the calculations in the back end, because everything seems quite slow. In one of my queries I need to get data from several tables at once and add them together into my gridview, so what I do in that query is:
Select from four different tables, each with some inner joins. Then union them together using UNION ALL. Then some summing and grouping.
I can't post the full query here because it's quite long. But do you think the way I've done the calculations is bad? Would it be better to do them in ASP.NET? What is your opinion?
MySQL does not allow embedded assembly, so writing a query which will input a sequence of RGB records and output MPEG-4 is probably not the best idea.
On the other hand, it's good in summing or averaging.
Seeing what calculations you are talking about will surely help to improve the answer.
Self-update:
MySQL does not allow embedded assembly, so writing a query which will input a sequence of RGB records and output MPEG-4 is probably not the best idea.
I'm really thinking about how to implement this query, and worse than that: I feel it's possible :)
My experience with MySQL is that it can be quite slow for calculations. I would suggest moving a substantial amount of the work (particularly the grouping and summing) to ASP.NET. This has not been my experience with other database engines, but a 30-day moving average in MySQL seems quite slow.
Without knowing the actual query, it sounds like you are doing relational/table work in your query, which RDBMSs are good at, so it seems you are doing it in the right place... The problem may be in the optimization of the query: you may want to run EXPLAIN on it, get an idea of the query plan MySQL is producing, and try to optimize that way.
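As a minimal sketch (hypothetical connection details, table names, and query, using mysql-connector-python) of where to look in the plan output:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="mydb")
cur = conn.cursor()
cur.execute(
    "EXPLAIN "
    "SELECT t1.id, SUM(t2.amount) AS total "
    "FROM t1 INNER JOIN t2 ON t2.t1_id = t1.id "
    "GROUP BY t1.id"
)
for row in cur.fetchall():
    print(row)   # check the 'type', 'key' and 'rows' columns for each table in the plan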
