I have a session that takes data from AS400 and loads it into Teradata. The source has 182 columns and ~19 million records. I am using a TPT write connection to write the data to Teradata. The target table is truncate-and-load and the mapping is a straight mapping. Still, I am getting a throughput of only 6 rows/sec.
Break it down: is it a source/transformation/target bottleneck?
Modify the mapping and add a filter (condition: FALSE) straight after the source and see how fast it runs; then do the same right before the target (if the mapping has a little bit more in it than ‘straight’).
Then get back to us :)
More concrete: 182 columns is a bit much... perhaps you need to increase the buffer block size to allow for at least 100 rows per block
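As a rough illustration of the "at least 100 rows per block" rule of thumb, the required block size is just the row width times 100. The average column width below is a made-up assumption (the question does not state the column types), so substitute the real row width from the source definition.

```python
def min_block_bytes(row_bytes: int, rows_per_block: int = 100) -> int:
    """Smallest buffer block that still holds `rows_per_block` rows."""
    return row_bytes * rows_per_block


# ASSUMPTION: 50 bytes per column on average; the real figure depends
# on the actual datatypes of the 182 source columns.
assumed_avg_column_bytes = 50
row_bytes = 182 * assumed_avg_column_bytes
needed = min_block_bytes(row_bytes)

print(f"row width ~{row_bytes} bytes, block size >= {needed} bytes")
```

With a wide row like this, a small default block may hold only a handful of rows, which is why the block size is worth checking first.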
I have the following situation:
library(RODBC)
channel <- odbcConnect("ABC", uid="DEF", pwd="GHI")
df <- sqlQuery(channel,query)
The number of rows is 10M+. Is there any faster way to read the data?
The data is in an Oracle database.
This definitely should be a comment but will be too long for the purposes.
When executing SQL there are a few likely bottlenecks:
1. Executing the query itself
2. Downloading the data from the database
3. Converting the data to align with language-specific types (e.g. R integers rather than BIGINT)
If your query runs fast when executed directly in the database UI, it is unlikely that the bottleneck is the execution of the query itself. The same holds if your query only contains simple [RIGHT/LEFT/INNER/OUTER] JOINs, as these are not "complex" query operators as such. Slow execution is more often caused by complex nested queries using WITH clauses or window functions. The solution there would be to create a VIEW (or, where the database supports it, a materialized view, so that the result is precomputed).
Now what is more likely to be the problem is 2. and 3. You state that your table has 10M rows. Let's assume your table is financial and has only 5 columns, all of which are 8-byte floats (FLOAT(8)) or 8-byte integers. The amount of data to be downloaded is then 8 * 5 * 10M bytes = 400 MB (about 0.37 GiB, or 3.2 Gbit), which will take some time to download depending on your connection. Assuming you have a 10 Mbit/s connection, the download under optimal conditions would take at least 320 seconds, i.e. more than five minutes. And this is the case where your data has only 5 columns! It is not unlikely that you have many more.
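The estimate can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: every value is treated as 8 bytes (not 8 bits), the link speed is in decimal megabits per second, and protocol overhead is ignored, so the result is a lower bound.

```python
def transfer_seconds(rows: int, columns: int, bytes_per_value: int,
                     link_mbit_per_s: float) -> float:
    """Lower bound on download time, ignoring protocol overhead."""
    total_bits = rows * columns * bytes_per_value * 8
    return total_bits / (link_mbit_per_s * 1_000_000)


# Figures from the example: 10M rows, 5 columns, 8-byte values, 10 Mbit/s.
payload_bytes = 10_000_000 * 5 * 8
seconds = transfer_seconds(10_000_000, 5, 8, 10)

print(f"payload: {payload_bytes / 1024**3:.2f} GiB")
print(f"minimum transfer time: {seconds:.0f} s")
```

Plugging in the real column count and network speed gives a quick idea of whether the download alone explains the wait.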
Now for 3. it is more difficult to predict the amount of time spent without careful code profiling. This is the step where RODBC, odbc or RJDBC will have to convert the data into types that R understands. I am sorry to say that here it becomes a question of "trial and error" to figure out which packages work best. However, for Oracle specifically, I would assume DBI + odbc + ROracle (seems to be developed by Oracle themselves??) would be a rather safe bet for a good contender.
Do however keep in mind, that the total time spent on getting data imported from any database into R is an aggregate of the above measures. Some databases provide optimized methods for downloading queries/tables as flat-files (csv, parquet etc) and this can in some cases speed up the query quite significantly, but at the cost of having to read from disk. This often also becomes more complex compared to executing the query itself, so one has to evaluate whether it is worth the trouble, or whether it is worth just waiting for the original query to finish executing within R.
I am using OmniSci with pymapd to get ~5M rows of data.
Running a select with
SELECT a, b, c, d
FROM my_table
ORDER BY a, b
Fails with the following database error:
Sorting the result would be too slow
For this query, I really don't care that it would be slow.
Can I make this work on OmniSci (even slowly), or should I leave sorting to pandas/etc?
You can run this query by disabling the watchdog that's blocking you. If you have an omnisci.conf file, you just have to add enable-watchdog=false to it;
if you are starting omnisci_server from the command line, append --enable-watchdog=false.
That said, the OmniSci database is extremely fast at filtering, aggregating, and joining large volumes of data, but it's not as fast at projecting and/or sorting medium or large volumes of data, because those operations are serialized right now; of course, we are working to improve those aspects.
I read that 10-15 minutes after an insert into a MergeTree table, ClickHouse triggers a merge operation.
Is there a way to tell it to reduce that interval, to make it merge a bit more often?
Also, I noticed that even in old partitions there are several parts and not only one. How come?
No control. No interval.
You should not rely on the merge process. It has its own complicated algorithm to balance the number of parts. Merging does not aim to do a final merge down to 1 part, because that is inefficient and wastes disk I/O and CPU.
You can trigger an unscheduled forced merge using the OPTIMIZE TABLE command.
I'm trying to ascertain if there are any limits to the size of a script passed to Informix via ODBC.
My Informix script size is going to run into a few megabytes (approximately 3.5K INSERT rows to a TEMP table), and is of the form...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...followed by a section to return a SELECT list based on an existing table...
SELECT
t1.field_1,
t1.field_2,
...
t1.field_n,
t2.field_2
FROM
table_1 AS t1
INNER JOIN
temp_table_2 AS t2
ON t1.field_1 = t2.field_1
Are there any limits to the size of the script, or, for that matter, the memory table? I'm estimating (hoping?) that 3.5K rows (we're only looking at one or two columns) would not cause an issue, or affect the server in an adverse way (there should easily be enough memory). Please note that my only communication method is via ODBC, and this is a proprietary database - I cannot create actual data tables on the server.
The reason I'm asking, is that, previously, I generated a script that was a considerable size, but, instead of putting the 3.5k IDs in a TEMP table (with associated data), I used an IN condition to look for the IDs only (processing could take place once the records were located). However, I cannot be certain whether it was the script editor (which was some kind of interface to the database) that baulked, limits to the IN condition, or the size of the script itself, that caused a problem, but basically the script would not run. After this we VIed a script, saving it to a folder and attempted to execute this, with similar (but not the same) results (sorry - I don't have the error messages from either process - this was done a little while ago).
Any Informix-oriented tips in this area would really be appreciated! :o)
Which version of Informix are you using? Assuming it is either 12.10 or 14.10, then there is no specific limit on the size of a set of statements, but a monstrosity like you're proposing is cruel and unusual punishment for a database server (it is definitely abusing your server).
It can also be moderately risky; you have to ensure you quote any data provided by the user correctly to avoid the problem of Little Bobby Tables.
You should be preparing one INSERT statement with two placeholder values:
INSERT INTO table(field_1, field_2) VALUES(?,?)
You should then execute this repeatedly, providing the different values. This will be more effective than making the server parse 3,500 similar statements. In ESQL/C, you can declare an INSERT cursor which will buffer the sets of values, reducing the round trips to the server — that can also be very valuable. I'm not sure whether that's an option in ODBC; probably not.
At the very least, you should experiment with using a prepared statement. Sending 3,500 x 60+ bytes (roughly 210 KB or more) to the server is doable. But if you use the prepared statement and execute it repeatedly with new parameters each time, you'd be sending a smaller volume of data to the server (though there'd be more round trips, which can be a factor). And you avoid the security risks of converting the values to strings. (Since you've not stated the types of the values, it's not certain there's a risk. If they're numeric, or things like date and time, they're very low risk. If they're character strings, the risk is considerable: not insuperable, but not negligible.)
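A minimal sketch of the prepared-statement approach follows. It uses Python's built-in sqlite3 module purely as a stand-in so the example is runnable anywhere; against Informix you would go through your ODBC driver instead, and the table name temp_ids and the sample values are made up for illustration. The point is only that the SQL text contains placeholders and the values travel separately, so quoting is never done by hand.

```python
import sqlite3


def load_rows(conn, rows):
    """Insert all rows through one parameterised INSERT statement."""
    # Hypothetical temp table standing in for the TEMP table in the question.
    conn.execute("CREATE TEMP TABLE temp_ids (field_1 INTEGER, field_2 TEXT)")
    # The statement text is fixed; only the bound values change per row.
    conn.executemany(
        "INSERT INTO temp_ids (field_1, field_2) VALUES (?, ?)", rows
    )
    return conn.execute("SELECT COUNT(*) FROM temp_ids").fetchone()[0]


conn = sqlite3.connect(":memory:")  # stand-in for the real ODBC connection
# Values containing a quote character need no manual escaping.
rows = [(i, f"O'Brien {i}") for i in range(3500)]
count = load_rows(conn, rows)
stored = conn.execute(
    "SELECT field_2 FROM temp_ids WHERE field_1 = ?", (7,)
).fetchone()[0]

print(count, stored)
```

The SELECT that joins against the temp table can then be run as a separate statement on the same connection.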
Older versions of Informix had smaller limits on the size of a set of statements — 64 KiB, and before that, 32 KiB. You're unlikely to be using an old enough version for that to be a problem, but the rules have changed over time.
What is the most bandwidth efficient way to unidirectionally synchronise a list of data from one server to many clients?
I have a sizeable chunk of data (perhaps 20,000 50-byte records) which I need to periodically synchronise to a series of clients over the Internet (perhaps 10,000 clients). Records may be added, removed or updated only at the server end.
Something similar to bittorrent? Or even using bittorrent. Or maybe invent a wrapper around bittorrent.
(Assuming you pay for bandwidth on your server and not the others ...)
Ok, so we've got some detail now - perhaps 10 GB of total (uncompressed) data, every 3 days, so that's 100 GB per month.
That's actually not really a sizeable chunk of data these days. Whose bandwidth are you trying to save - yours, or your clients'?
Does the data perhaps compress very readily? For raw binary data it's not uncommon to achieve 50% compression, and if the data happens to have a lot of repeated patterns within it then 80%+ is possible.
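How well a given payload compresses is easy to measure before designing anything more elaborate. The sketch below uses Python's zlib on synthetic records; the record layout is invented for the example, and real data may compress much better or much worse.

```python
import random
import zlib


def compression_saving(payload: bytes) -> float:
    """Fraction of bytes saved by zlib at maximum compression level."""
    return 1 - len(zlib.compress(payload, 9)) / len(payload)


random.seed(0)
# ASSUMPTION: 20,000 made-up 50-byte records with plenty of repeated structure.
records = [
    f"id={i:08d};status={random.choice(['OK', 'FAIL'])};region=EU"
    .ljust(50)
    .encode()
    for i in range(20_000)
]
payload = b"".join(records)
saving = compression_saving(payload)

print(f"{len(payload)} bytes, {saving:.0%} saved by compression")
```

Running the same function over a real export shows whether plain compression of the full list is already good enough.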
That said, if you really do need a system that can just transfer the changes, my thoughts are:
make sure you've got a well defined primary key field - use that as your key to identify each record
record a timestamp for each record to say when it last changed
have each client tell you the timestamp of the last change it knows of, so you can calculate the deltas
ensure that full downloads are possible too, in case clients get out of sync
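The points above can be sketched as a small in-memory model. Several details here are assumptions rather than part of the suggestion itself: a logical counter stands in for the timestamp, deletions are tracked with tombstones so clients learn about removed records, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    clock: int = 0                                # logical "timestamp"
    records: dict = field(default_factory=dict)   # key -> (value, changed_at)
    deleted: dict = field(default_factory=dict)   # key -> deleted_at (tombstone)

    def upsert(self, key, value):
        self.clock += 1
        self.records[key] = (value, self.clock)
        self.deleted.pop(key, None)

    def remove(self, key):
        self.clock += 1
        if self.records.pop(key, None) is not None:
            self.deleted[key] = self.clock

    def changes_since(self, since):
        """Delta for a client whose last known change is `since`."""
        changed = {k: v for k, (v, t) in self.records.items() if t > since}
        removed = [k for k, t in self.deleted.items() if t > since]
        return changed, removed, self.clock

    def full_snapshot(self):
        """Full download, for clients that are out of sync."""
        return {k: v for k, (v, _) in self.records.items()}, self.clock


@dataclass
class Client:
    data: dict = field(default_factory=dict)
    last_seen: int = 0

    def sync(self, server):
        changed, removed, now = server.changes_since(self.last_seen)
        self.data.update(changed)
        for key in removed:
            self.data.pop(key, None)
        self.last_seen = now

    def resync(self, server):
        self.data, self.last_seen = server.full_snapshot()


server, client = Server(), Client()
server.upsert("a", 1)
server.upsert("b", 2)
client.sync(server)       # first sync transfers everything
server.upsert("a", 10)
server.remove("b")
client.sync(server)       # second sync transfers only the two changes
print(client.data)
```

In a real deployment the tombstones would need to be kept at least as long as the oldest client can lag behind; anything older falls back to the full download.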