OmniSci: "Sorting the result would be too slow" - omniscidb

I am using OmniSci with pymapd to get ~5M rows of data.
Running a select with
SELECT a, b, c, d
FROM my_table
ORDER BY a, b
Fails with the following database error:
Sorting the result would be too slow
For this query, I really don't care if it's slow.
Can I make this work on OmniSci (even slowly), or should I leave sorting to pandas/etc?

You can run this query by disabling the watchdog that's blocking it. If you have an omnisci.conf file, just add enable-watchdog=false to it;
if you are starting omnisci_server from the command line, append --enable-watchdog=false.
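For example, with that exact flag (paths and service setup vary by install, so treat this as a sketch):
# in omnisci.conf
enable-watchdog = false

# or when launching the server directly from the command line
omnisci_server --enable-watchdog=false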
That said, the omnisci database is extremely fast at filtering, aggregating, and joining large volumes of data, but it's not so fast at projecting and/or sorting medium or large volumes of data, because those operations are currently serialized; of course, we are working to improve those aspects.

What could cause a sqlite application to slow down over time with high load?

I'll definitely need to update this based on feedback so I apologize in advance.
The problem I'm trying to solve is roughly this.
The graph shows disk utilization in the Windows Task Manager. My SQLite application is a web server that takes in JSON requests with timestamps, looks up the existing entry in a two-column key/value table, merges the request into the existing item (items don't grow over time), and then writes it back to the database.
The db is created as follows. I've experimented with and without WAL, with no difference.
createStatement().use { it.executeUpdate("CREATE TABLE IF NOT EXISTS items ( key TEXT NOT NULL PRIMARY KEY, value BLOB );") }
The write/set is done as follows
try {
    val insertStatement = "INSERT OR REPLACE INTO items (key, value) VALUES (?, ?)"
    prepareStatement(insertStatement).use {
        it.setBytes(1, keySerializer.serialize(key))
        it.setBytes(2, valueSerializer.serialize(value))
        it.executeUpdate()
    }
    commit()
} catch (t: Throwable) {
    rollback()
    throw t
}
I use a single database connection the entire time which seems to be ok for my use case and greatly improves performance relative to getting a new one for each operation.
val databaseUrl = "jdbc:sqlite:${System.getProperty("java.io.tmpdir")}/$name-map-v2.sqlite"
if (connection?.isClosed == true || connection == null) {
    connection = DriverManager.getConnection(databaseUrl)
}
I'm effectively serializing access to the db. I'm pretty sure the default threading mode for the sqlite driver is to serialize and I'm also doing some serializing in kotlin coroutines (via actors).
I'm load testing the application locally and I notice that disk utilization spikes around the one-minute mark, but I can't determine why. I know that throughput plummets when that happens, though. I expect the server to chug along at a more or less constant rate. The db in these tests is pretty small too, hardly reaching 1 MB.
Hoping people can recommend some next steps or set me straight as far as performance expectations. I'm assuming there is some sqlite specific thing that happens when throughput is very high for too long, but I would have thought it would be related to WAL or something (which I'm not using).
I have a theory, but it's a bit far-fetched.
The fact that you hit a performance wall after some time makes me think that either a buffer somewhere is filling up, or some other kind of data accumulation threshold is being reached.
Where exactly the culprit is, I'm not sure.
So, I'd run the following tests.
// At the beginning
connection.setAutoCommit(true);
If the problem is in the driver side of the rollback transaction buffer, then this will slightly (hopefully) slow down operations, "spreading" the impact away from the one-minute mark. Instead of getting fast operations for 59 seconds and then some seconds of full stop, you get not so fast operations the whole time.
In case the problem is further down the line, try
PRAGMA JOURNAL_MODE=MEMORY
or even PRAGMA SYNCHRONOUS=OFF, which stops SQLite from syncing writes to disk
(the data will be more at risk in case of a catastrophic power-down).
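If you want to try those through the same JDBC connection used above, a minimal Kotlin sketch (both settings trade durability for speed, so keep them confined to the load-testing experiment):
// issue the suggested PRAGMAs on the existing connection before the test run
connection.createStatement().use { stmt ->
    stmt.execute("PRAGMA JOURNAL_MODE=MEMORY")
    stmt.execute("PRAGMA SYNCHRONOUS=OFF")
}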
Finally, another possibility is that the page translation buffer gets filled after a sufficient number of different keys has been entered. You can test this directly by doing these two tests:
1) Pre-fill the database with all the keys in ascending order and a large request, then start updating that same large set of keys.
2) Run the test with only a very few keys.
If the slowdown does not occur in the above cases, then it's either TLB buffer management that's not up to the challenge, or database fragmentation is a problem.
It might be the case that issuing
PRAGMA PAGE_SIZE=32768
upon database creation solves or mitigates the problem. Conversely, PRAGMA PAGE_SIZE=1024 could "spread" the problem, avoiding performance bottlenecks.
Another thing to try is closing the database connection and reopening it when it gets older than, say, 30 seconds. If this works, we'll still need to understand why it works (in this case I expect the JDBC driver to be at fault).
First of all, I want to say that I do not use exactly your driver for SQLite, and I use different devices in my work (but how different are they really?).
From what I see (correct me if I'm wrong), you use one transaction per insert statement. For every request you hit the disk and the memory, open, close, etc. That can't be fast.
The first thing I do when I have to do inserts in SQLite is to group them and use a single transaction for the whole batch. That way, you are using your resources in batches.
One transaction, many insert statements, a single commit. If there is a problem with a batch, handle the valid rows separately, log the faulty ones, and move on to the next batch of requests.
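A minimal Kotlin/JDBC sketch of that batching idea, reusing the items table and the serialize-to-bytes approach from the question (writeBatch and the rows parameter are illustrative names; autoCommit is assumed to be off, as in the original code):
import java.sql.Connection

// Sketch only: write many already-serialized (key, value) pairs in ONE transaction.
fun writeBatch(connection: Connection, rows: List<Pair<ByteArray, ByteArray>>) {
    val sql = "INSERT OR REPLACE INTO items (key, value) VALUES (?, ?)"
    try {
        connection.prepareStatement(sql).use { stmt ->
            for ((key, value) in rows) {
                stmt.setBytes(1, key)
                stmt.setBytes(2, value)
                stmt.addBatch()        // queue the row instead of executing it immediately
            }
            stmt.executeBatch()        // execute the whole batch in one go
        }
        connection.commit()            // single commit for many inserts
    } catch (t: Throwable) {
        connection.rollback()          // drop the whole batch; log the faulty requests and move on
        throw t
    }
}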

Faster way to read data from oracle database in R

I have the following situation:
library(RODBC)
channel <- odbcConnect("ABC", uid="DEF", pwd="GHI")
df <- sqlQuery(channel,query)
The number of rows is 10M+. Is there any faster way to read the data?
The data is in an Oracle database.
This should definitely be a comment, but it would be too long for that purpose.
When executing SQL there are a few likely bottlenecks:
1. Executing the query itself
2. Downloading the data from the database
3. Converting the data to align with language-specific types (e.g. R integers rather than BIGINT, etc.)
If your query runs fast when executed directly in the database UI, it is unlikely that the bottleneck is the query execution itself. This is also immediately clear if your query only contains simple [RIGHT/LEFT/INNER/OUTER] JOINs, as these are not "complex" query operators as such. Slow execution is more often caused by more complex nested queries using WITH clauses or window functions. The solution here would be to create a VIEW so that the query is pre-optimized.
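For instance, something along these lines (names are purely illustrative; the body would be your actual WITH/window-function query):
-- wrap the heavy query once so R only has to select from the view
CREATE OR REPLACE VIEW my_report_v AS
SELECT t.customer_id,
       SUM(t.amount) AS total_amount
FROM transactions t
GROUP BY t.customer_id;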
Now what is more likely to be the problem is 2. and 3. You state that your table has 10M rows. Let's assume your table is financial and has only 5 columns, all 8-byte floats ( FLOAT(8) ) or 8-byte integers; the amount of data to be downloaded is then (8 * 5 * 10M / 1024^3) GiB, i.e. roughly 0.37 GiB, which itself takes some time to download depending on your connection. Assuming a 10 MB/s connection, the download under optimal conditions would take at least 37 seconds. And this is the case where your data has only 5 columns! It is not unlikely that you have many more.
Now for 3. it is more difficult to predict the amount of time spent without careful code profiling. This is the step where RODBC, odbc or RJDBC has to convert the data into types that R understands. I am sorry to say that here it becomes a question of "trial and error" to figure out which packages work best. However, for Oracle specifically, I would assume DBI + odbc, or ROracle (which seems to be developed by Oracle themselves?), would be a rather safe bet for a good contender.
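A minimal sketch of the DBI + odbc route, reusing the DSN and credentials from the question (whether it is actually faster than RODBC depends on the driver and your environment):
library(DBI)
# connect through the same ODBC DSN used with RODBC above
con <- dbConnect(odbc::odbc(), dsn = "ABC", uid = "DEF", pwd = "GHI")
df  <- dbGetQuery(con, query)   # same query string as before
dbDisconnect(con)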
Do however keep in mind, that the total time spent on getting data imported from any database into R is an aggregate of the above measures. Some databases provide optimized methods for downloading queries/tables as flat-files (csv, parquet etc) and this can in some cases speed up the query quite significantly, but at the cost of having to read from disk. This often also becomes more complex compared to executing the query itself, so one has to evaluate whether it is worth the trouble, or whether it is worth just waiting for the original query to finish executing within R.

Clickhouse - How often clickhouse triggers a merge operation and how to control it?

I read that 10-15 minutes after an insert into a MergeTree table, ClickHouse triggers a merge operation.
Is there a way to tell it to reduce that interval, to make it merge a bit more often?
Also, I noticed that even in old partitions there are several parts and not only one; how come?
No control. No interval.
You should not rely on the merge process. It has its own complicated algorithm to balance the number of parts. Merging has no goal of producing a single final part, because that's not efficient and would waste disk I/O and CPU.
You can trigger an unscheduled, forced merge with the OPTIMIZE TABLE command.
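For example (database, table and partition id are illustrative; FINAL forces merging down to a single part, which can be expensive on large tables):
OPTIMIZE TABLE my_db.my_table FINAL;
-- or target one partition by its id
OPTIMIZE TABLE my_db.my_table PARTITION ID '202109' FINAL;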

Gremlin console keeps returning "Connection to server is no longer active" error

I tried to run a Gremlin query adding a property to vertices through the Gremlin console.
g.V().hasLabel("user").has("status", "valid").property(single, "type", "valid")
I constantly get this error:
org.apache.tinkerpop.gremlin.jsr223.console.RemoteException: Connection to server is no longer active
This error happens after the query has been running for one or two minutes.
I tried some simple queries like g.V().limit(10) and they work fine.
Since the affected vertex count is more than 4 million, I'm not sure if it is failing due to a timeout issue.
I also tried to split it into small batches:
g.V().hasLabel("user").has("status", "valid").hasNot("type").limit(200000).property(single, "type", "valid")
It succeeded for the first few batches and then started failing again.
Are there any recommendations for updating millions of vertices?
The precise approach you take may vary depending on the backend graph database and storage you are using as well as the capacity of the hardware being used.
The capacity of the hardware where Gremlin Server is running, in terms of the number of CPUs and, most importantly, memory, will also be a factor, as will the query timeout setting.
To do this in Gremlin, if you had a way to identify distinct ranges of vertices easily you could split this up into multiple threads each doing batches of updates. If the example you show is representative of your actual need then that is likely not possible in this case.
Likewise some graph databases provide a bulk load capability that is often a good way to do large batch updates but probably not an option here as you need to do essentially a conditional update based on looking at the current presence (or not) of a property.
Without more information about your data model and hardware etc. the best answer is probably to do two things:
Use smaller limits. Maybe try 5K or even just 1K at first and work up from there until you find a reliable sweet spot.
Increase the query timeout settings.
You may need to experiment to find the sweet spot for your environment as the capacity of the hardware will definitely play a role in situations like this as well as how you write your query.
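For example, a smaller batch combined with a longer per-request timeout might look something like this (the 5000 limit and the 10-minute timeout are just starting points to experiment with; evaluationTimeout is the TinkerPop 3.4+ option name, older servers call it scriptEvaluationTimeout):
g.with('evaluationTimeout', 600000L).
  V().hasLabel("user").has("status", "valid").
  hasNot("type").limit(5000).
  property(single, "type", "valid")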

Script size passed to Informix via ODBC

I'm trying to ascertain if there are any limits to the size of a script passed to Informix via ODBC.
My Informix script size is going to run into a few megabytes (approximately 3.5K INSERT rows to a TEMP table), and is of the form...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...followed by a section to return a SELECT list based on an existing table...
SELECT
t1.field_1,
t1.field_2,
...
t1.field_n,
t2.field_2
FROM
table_1 AS t1
INNER JOIN
temp_table_2 AS t2
ON t1.field_1 = t2.field_1
Are there any limits to the size of the script, or, for that matter, the in-memory temp table? I'm estimating (hoping?) that 3.5K rows (we're only looking at one or two columns) would not cause an issue or affect the server in an adverse way (there's easily enough memory). Please note that my only communication method is via ODBC, and this is a proprietary database - I cannot create actual data tables on the server.
The reason I'm asking is that, previously, I generated a script that was a considerable size, but instead of putting the 3.5K IDs in a TEMP table (with associated data), I used an IN condition to look up the IDs only (processing could take place once the records were located). However, I cannot be certain whether it was the script editor (which was some kind of interface to the database) that baulked, limits to the IN condition, or the size of the script itself that caused the problem; basically, the script would not run. After that we wrote a script in vi, saved it to a folder and attempted to execute it, with similar (but not the same) results (sorry - I don't have the error messages from either process; this was done a little while ago).
Any Informix-oriented tips in this area would really be appreciated! :o)
Which version of Informix are you using? Assuming it is either 12.10 or 14.10, then there is no specific limit on the size of a set of statements, but a monstrosity like you're proposing is cruel and unusual punishment for a database server (it is definitely abusing your server).
It can also be moderately risky; you have to ensure you quote any data provided by the user correctly to avoid the problem of Little Bobby Tables.
You should be preparing one INSERT statement with two placeholder values:
INSERT INTO table(field_1, field_2) VALUES(?,?)
You should then execute this repeatedly, providing the different values. This will be more effective than making the server parse 3,500 similar statements. In ESQL/C, you can declare an INSERT cursor which will buffer the sets of values, reducing the round trips to the server — that can also be very valuable. I'm not sure whether that's an option in ODBC; probably not.
At the very least, you should experiment with using a prepared statement. Sending 3,500 x 60+ bytes, roughly 210 KB, to the server is doable. But you'd be sending less data to the server (though there'd be more round trips, which can be a factor) if you use the prepared statement and execute it repeatedly with new parameters each time. And you avoid the security risks of converting the values to strings. (Since you've not stated the types of the values, it's not certain there's a risk. If they're numeric, or things like date and time, they're very low risk. If they're character strings, the risk is considerable - not insuperable, but not negligible.)
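A rough C/ODBC sketch of that prepare-once, execute-many pattern (error checking omitted; the statement handle, input arrays and integer column types are assumptions for illustration only, and insert cursors are not shown since they may not be available through ODBC):
#include <sql.h>
#include <sqlext.h>

/* Sketch only: insert n_rows (field_1, field_2) pairs through one prepared statement. */
void insert_rows(SQLHSTMT hstmt, const long *field_1, const long *field_2, int n_rows)
{
    SQLINTEGER f1, f2;
    SQLLEN ind1 = 0, ind2 = 0;

    /* Prepare the parameterised INSERT once... */
    SQLPrepare(hstmt,
               (SQLCHAR *) "INSERT INTO temp_table_2 (field_1, field_2) VALUES (?, ?)",
               SQL_NTS);
    SQLBindParameter(hstmt, 1, SQL_PARAM_INPUT, SQL_C_SLONG, SQL_INTEGER, 0, 0, &f1, 0, &ind1);
    SQLBindParameter(hstmt, 2, SQL_PARAM_INPUT, SQL_C_SLONG, SQL_INTEGER, 0, 0, &f2, 0, &ind2);

    /* ...then execute it repeatedly; only the bound values change */
    for (int i = 0; i < n_rows; i++) {
        f1 = (SQLINTEGER) field_1[i];
        f2 = (SQLINTEGER) field_2[i];
        SQLExecute(hstmt);
    }
}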
Older versions of Informix had smaller limits on the size of a set of statements — 64 KiB, and before that, 32 KiB. You're unlikely to be using an old enough version for that to be a problem, but the rules have changed over time.
