Script size passed to Informix via ODBC - sql-insert

I'm trying to ascertain if there are any limits to the size of a script passed to Informix via ODBC.
My Informix script size is going to run into a few megabytes (approximately 3.5K INSERT rows to a TEMP table), and is of the form...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...followed by a section to return a SELECT list based on an existing table...
SELECT
t1.field_1,
t1.field_2,
...
t1.field_n,
t2.field_2
FROM
table_1 AS t1
INNER JOIN
temp_table_2 AS t2
ON t1.field_1 = t2.field_1
Are there any limits to the size of the script, or, for that matter, the memory table? I'm estimating (hoping?) that 3.5K rows (we're only looking at one or two columns) would not cause an issue, or affect the server in an adverse way (there's easily be enough memory). Please note that my only communication method is via ODBC, and this is a proprietary database - I cannot create actual data tables on the server.
The reason I'm asking, is that, previously, I generated a script that was a considerable size, but, instead of putting the 3.5k IDs in a TEMP table (with associated data), I used an IN condition to look for the IDs only (processing could take place once the records were located). However, I cannot be certain whether it was the script editor (which was some kind of interface to the database) that baulked, limits to the IN condition, or the size of the script itself, that caused a problem, but basically the script would not run. After this we VIed a script, saving it to a folder and attempted to execute this, with similar (but not the same) results (sorry - I don't have the error messages from either process - this was done a little while ago).
Any Informix oriented tips for in this area would really be appreciated! :o)

Which version of Informix are you using? Assuming it is either 12.10 or 14.10, then there is no specific limit on the size of a set of statements, but a monstrosity like you're proposing is cruel and unusual punishment for a database server (it is definitely abusing your server).
It can also be moderately risky; you have to ensure you quote any data provided by the user correctly to avoid the problem of Little Bobby Tables.
You should be preparing one INSERT statement with two placeholder values:
INSERT INTO table(field_1, field_2) VALUES(?,?)
You should then execute this repeatedly, providing the different values. This will be more effective than making the server parse 3,500 similar statements. In ESQL/C, you can declare an INSERT cursor which will buffer the sets of values, reducing the round trips to the server — that can also be very valuable. I'm not sure whether that's an option in ODBC; probably not.
At the very least, you should experiment with using a prepared statement. Sending 3,500 x 60+ bytes = 210 KiB to the server is doable. But you'd be sending less volume of data to the server (but there'd be more round trips — which can be a factor) if you use the prepared statement and execute it repeatedly with new parameters each time. And you avoid the security risks of converting the values to strings. (Since you've not stated the types of the values, it's not certain there's a risk. If they're numeric, or things like date and time, they're very low risk. If they're character strings, the risk of is considerable — not insuperable, but not negligible.)
Older versions of Informix had smaller limits on the size of a set of statements — 64 KiB, and before that, 32 KiB. You're unlikely to be using an old enough version for that to be a problem, but the rules have changed over time.

Related

What could cause a sqlite application to slow down over time with high load?

I'll definitely need to update this based on feedback so I apologize in advance.
The problem I'm trying to solve is roughly this.
The graph shows Disk utilization in the Windows task manager. My sqlite application is a webserver that takes in json requests with timestamps, looks up the existing entry in a 2 column key/value table, merges the request into the existing item (they don't grow over time), and then writes it back to the database.
The db is created as follows. I've experimented with and without WAL without difference.
createStatement().use { it.executeUpdate("CREATE TABLE IF NOT EXISTS items ( key TEXT NOT NULL PRIMARY KEY, value BLOB );") }
The write/set is done as follows
try {
val insertStatement = "INSERT OR REPLACE INTO items (key, value) VALUES (?, ?)"
prepareStatement(insertStatement).use {
it.setBytes(1, keySerializer.serialize(key))
it.setBytes(2, valueSerializer.serialize(value))
it.executeUpdate()
}
commit()
} catch (t: Throwable) {
rollback()
throw t
}
I use a single database connection the entire time which seems to be ok for my use case and greatly improves performance relative to getting a new one for each operation.
val databaseUrl = "jdbc:sqlite:${System.getProperty("java.io.tmpdir")}/$name-map-v2.sqlite"
if (connection?.isClosed == true || connection == null) {
connection = DriverManager.getConnection(databaseUrl)
}
I'm effectively serializing access to the db. I'm pretty sure the default threading mode for the sqlite driver is to serialize and I'm also doing some serializing in kotlin coroutines (via actors).
I'm load testing the application locally and I notice that disk utilization spikes around the one minute mark but I can't determine why. I know that throughput plummets when that happens though. I expect the server to chug along at a more or less constant rate. The db in these tests is pretty small too, hardly reaches 1mb.
Hoping people can recommend some next steps or set me straight as far as performance expectations. I'm assuming there is some sqlite specific thing that happens when throughput is very high for too long, but I would have thought it would be related to WAL or something (which I'm not using).
I have a theory but it's a bit farfetched.
The fact that you hit a performance wall after some time makes me think that either a buffer somewhere is filling up, or some other kind of data accumulation threshold is being reached.
Where exactly the culprit is, I'm not sure.
So, I'd run the following tests.
// At the beginning
connection.setAutoCommit(true);
If the problem is in the driver side of the rollback transaction buffer, then this will slightly (hopefully) slow down operations, "spreading" the impact away from the one-minute mark. Instead of getting fast operations for 59 seconds and then some seconds of full stop, you get not so fast operations the whole time.
In case the problem is further down the line, try
PRAGMA JOURNAL_MODE=MEMORY
PRAGMA SYNCHRONOUS=OFF disables the rollback journal synchronization
(The data will be more at risk in case of a catastrophic powerdown).
Finally, another possibility is that the page translation buffer gets filled after a sufficient number of different keys has been entered. You can test this directly by doing these two tests:
1) pre-fill the database with all the keys in ascending order and a large request, then start updating the same many keys.
2) run the test with only very few keys.
If the slowdown does not occur in the above cases, then it's either TLB buffer management that's not up to the challenge, or database fragmentation is a problem.
It might be the case that issuing
PRAGMA PAGE_SIZE=32768
upon database creation might solve or mitigate the problem. Conversely, PRAGMA PAGE_SIZE=1024 could "spread" the problem avoiding performance bottlenecks.
Another thing to try is closing the database connection and reopening it when it gets older than, say, 30 seconds. If this works, we'll still need to understand why it works (in this case I expect the JDBC driver to be at fault).
First of all, I want to say that I do not use exactly your driver for sqlite, and I use different devices in my work. (but how different are they really?)
From what I see, correct me if im wrong, you use one transaction, for one insert statement. You get request, you use the disc, you use the memory, open, close etc... every time. This can't work fast.
The first thing I do when I have to do inserts in sqlite is to group them, and use a single transaction to do it. That way, you are using your resources in batches.
One transaction, many insert statements, single commit. If there is a problem with a batch, handle the valid separately, log the faulty, move the next batch of requests.

Faster way to read data from oracle database in R

I have the following situation:
library(RODBC)
channel <- odbcConnect("ABC", uid="DEF", pwd="GHI")
df <- sqlQuery(channel,query)
The number of rows is 10M+. Is there any faster way to read the data?
The data is in oracle database.
This definitely should be a comment but will be too long for the purposes.
When executing SQL there are a few likely bottlenecks
Executing the query itself
Download the data from the database
Converting the data to align with language specific types (eg. R integers rather than BIGINT etc.
If your query runs fast when executed directly on the database UI, it is unlikely that the bottleneck comes when executing the query itself. This is also immediately clear if your query only contains simple [RIGHT/LEFT/INNER/OUTER] JOIN, as these are not "complex" query operators as such. This is more often caused by more complex nested queries using WITH clauses or window functions. The solution here would be to create a VIEW such that the query will be pre-optimized.
Now what is more likely to be the problem is 2. and 3. You state that your table has 10M data points. Lets assume your table is financial and has only 5 columns, which are all 8bit floats ( FLOAT(8) ) or 8 bit integers, the amount of data to be downloaded is (8 * 5 * 10 M / 1024^3) Gbit = 0.37 Gbit, which itself will take some time to download depending on your cconnection. Assuming you have a 10 Mbit connection the download under optimal condition would be taking at least 37 seconds. And this is the case where your data has only 5 columns! It is not unlikely that you have many more.
Now for 3. it is more difficult to predict the amount of time spent without careful code profiling. This is the step where RODBC, odbc or RJDBC will have to convert the data into types that R understands. I am sorry to say that here it becomes a question of "trial and error" to figure out which packages work best. However for oracle specifics, I would assume DBI + odbc + ROracle (seems to be developed by oracle themselves??) would be a rather safe bet for a good contender.
Do however keep in mind, that the total time spent on getting data imported from any database into R is an aggregate of the above measures. Some databases provide optimized methods for downloading queries/tables as flat-files (csv, parquet etc) and this can in some cases speed up the query quite significantly, but at the cost of having to read from disk. This often also becomes more complex compared to executing the query itself, so one has to evaluate whether it is worth the trouble, or whether it is worth just waiting for the original query to finish executing within R.

OmnisSci: "Sorting the result would be too slow"

I am using OmniSci with pymapd to get ~5M rows of data.
Running a select with
SELECT a, b ,c, d
FROM my_table
ORDER BY a, b
Fails with the following database error:
Sorting the result would be too slow
For this query, I really don't care it would be slow.
Can I make this work on OmniSci (even slowly), or should I leave sorting to pandas/etc?
You can run this query disabling the watchdog that's blocking you, so if You have an omnisci.conf file you have just to add enable-watchdog=false to it;
if you are running the omnisci_server from the command line, to start the server, append --enable-watchdog=false.
Said that the omnisci database is extremely fast filtering, aggregating, and joining large volumes of data, but it's not so fast projecting and/or sorting medium or large volume of data, because the operations are serialized right now; of course, we are working to improve those aspects.

Performance issue for session with TPT write

I have session which taking data from AS400 and loading into Teradata. Source having 182 columns and ~19 millions records. I am using TPT write connection for writing data on Teradata. Target table is truncate and load and mapping is straight mapping. Still I am getting 6 throughput (Row/sec).
Break it down: is it a source/transformation/target bottleneck?
Modify the mapping and add a filter (criterion: FALSE) straight after the source and see how fast it runs, do the same right before target (if you have a little bit more than ‘straight’)
Then get back to us :)
More concrete: 182 columns is a bit much... perhaps you need to increase the buffer block size to allow for at least 100 rows per block

Teradata UDF's in C++

I have gone through the official documentation of Teradata.
I am planning to write a table function (UDF in C++) which accepts 2 columns as input, processes the input and converts it to a std::map< string,string > or array of structs and passes it to some other function which accepts input as array of structs/std::map< string,string >. My questions are:
If I pass 2 columns from a table, how can know the number of rows in new temporary table??How can I accept the values, passed as a column from Teradata query statement into the UDF??
Are the things, given in the appendix of the documentation such as phase checking,like TBL_BUILD, TBL_PRE_INIT etc., mandatory to be included in the code, for building of the table and other purposes??
You cannot return any data structures from a table UDF. For each row, you have value stack entries to put your values for returning columns. And each value must be a Teradata recognized data type.
Also, you need to be careful with STL libraries. In fact, avoid dealing with memory allocation / deallocation and use stack instead. This is because if you have any memory leaks or memory fragmentation, then from time to time you will have to restart your server, which is not good for a production system.
Since the table UDF is called for each row and phase, the only way to save your data is via the context. You also do not want to have spend too much time processing for each row, as that will be a major performance issue.
You may want to take a look at table operator if you intend to maintain a complicated in-memory structure.

Resources