I have gone through the official documentation of Teradata.
I am planning to write a table function (a UDF in C++) that accepts 2 columns as input, processes them into a std::map<string, string> or an array of structs, and passes the result to another function that takes an array of structs / std::map<string, string> as input. My questions are:
If I pass 2 columns from a table, how can I know the number of rows in the new temporary table? And how do I accept the values passed as columns from the Teradata query statement inside the UDF?
Are the things given in the appendix of the documentation, such as the phase checking (TBL_BUILD, TBL_PRE_INIT, etc.), mandatory to include in the code for building the table and other purposes?
You cannot return any data structures from a table UDF. For each row, you place the values for your output columns into value stack entries, and each value must be a data type Teradata recognizes.
Also, you need to be careful with the STL. In fact, avoid heap allocation/deallocation altogether and use the stack instead: if you have any memory leaks or memory fragmentation, you will have to restart your server from time to time, which is not acceptable for a production system.
Since the table UDF is called for each row and each phase, the only way to save your data between calls is via the context. You also do not want to spend too much time processing each row, as that will be a major performance issue.
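To make the phase and context handling concrete, here is a rough skeleton of a C table function. Treat it as a sketch only: names such as FNC_GetPhase, FNC_TblAllocCtx and FNC_TblGetCtx are quoted from memory of the Teradata C UDF library, and the exact parameter list is dictated by your CREATE FUNCTION definition, so check everything against the SQL External Routine Programming manual.

/* Sketch only: verify FNC_GetPhase / FNC_TblAllocCtx / FNC_TblGetCtx and the
   parameter-passing convention against your Teradata manual. */
#include <sqltypes_td.h>

typedef struct {
    int rows_emitted;              /* whatever must survive between calls */
} my_ctx;

void my_table_udf(INTEGER *in_col1, VARCHAR_LATIN *in_col2,   /* input columns  */
                  INTEGER *out_col1, VARCHAR_LATIN *out_col2, /* output columns */
                  int *in1_i, int *in2_i, int *out1_i, int *out2_i,
                  char sqlstate[6])                           /* real signature has more args */
{
    FNC_Phase phase;
    my_ctx   *ctx;

    FNC_GetPhase(&phase);
    switch (phase) {
    case TBL_PRE_INIT:
        break;                                     /* e.g. decide participation */
    case TBL_INIT:
        ctx = (my_ctx *) FNC_TblAllocCtx(sizeof(my_ctx));  /* per-AMP scratchpad */
        ctx->rows_emitted = 0;
        break;
    case TBL_BUILD:
        ctx = (my_ctx *) FNC_TblGetCtx();          /* the only state kept between calls */
        /* copy one output row into *out_col1 / *out_col2, set the indicators,
           or signal that there are no more rows to build */
        ctx->rows_emitted++;
        break;
    case TBL_END:
        break;                                     /* clean up */
    }
}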
You may want to take a look at a table operator if you intend to maintain a complicated in-memory structure.
I have the following situation:
library(RODBC)
channel <- odbcConnect("ABC", uid="DEF", pwd="GHI")
df <- sqlQuery(channel,query)
The number of rows is 10M+. Is there any faster way to read the data?
The data is in an Oracle database.
This should really be a comment, but it will be too long for that purpose.
When executing SQL there are a few likely bottlenecks:
1. Executing the query itself
2. Downloading the data from the database
3. Converting the data to align with language-specific types (e.g. R integers rather than BIGINT, etc.)
If your query runs fast when executed directly in the database UI, it is unlikely that the bottleneck is the query execution itself. This is also immediately clear if your query only contains simple [RIGHT/LEFT/INNER/OUTER] JOINs, as these are not "complex" query operators as such. Slow execution is more often caused by complex nested queries using WITH clauses or window functions. The solution there would be to create a VIEW so that the query is pre-optimized.
Now what is more likely to be the problem is 2. and 3. You state that your table has 10M data points. Let's assume your table is financial and has only 5 columns, all 8-byte floats (FLOAT(8)) or 8-byte integers. The amount of data to be downloaded is then 8 * 5 * 10M bytes = 400 MB (about 0.37 GiB, or roughly 3 Gbit), which itself will take some time depending on your connection. Assuming a 10 Mbit/s connection, the download under optimal conditions would take at least around 320 seconds, i.e. over five minutes. And this is the case where your data has only 5 columns! It is not unlikely that you have many more.
Now for 3. it is more difficult to predict the amount of time spent without careful code profiling. This is the step where RODBC, odbc or RJDBC has to convert the data into types that R understands. I am sorry to say that here it becomes a question of trial and error to figure out which packages work best. However, for Oracle specifically, I would assume DBI + odbc, or ROracle (which seems to be developed by Oracle themselves?), would be a rather safe bet for a good contender.
Do, however, keep in mind that the total time spent getting data imported from any database into R is an aggregate of the above steps. Some databases provide optimized methods for exporting queries/tables as flat files (CSV, Parquet, etc.), and this can in some cases speed things up quite significantly, but at the cost of having to read from disk. It also often adds complexity compared to executing the query directly, so one has to evaluate whether it is worth the trouble, or whether it is better to just wait for the original query to finish executing within R.
I'm trying to ascertain if there are any limits to the size of a script passed to Informix via ODBC.
My Informix script size is going to run into a few megabytes (approximately 3.5K INSERT rows to a TEMP table), and is of the form...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...
INSERT INTO table (field_1, field_2) VALUES (value_1, value_2)
...followed by a section to return a SELECT list based on an existing table...
SELECT
t1.field_1,
t1.field_2,
...
t1.field_n,
t2.field_2
FROM
table_1 AS t1
INNER JOIN
temp_table_2 AS t2
ON t1.field_1 = t2.field_1
Are there any limits to the size of the script, or, for that matter, to the TEMP table? I'm estimating (hoping?) that 3.5K rows (we're only looking at one or two columns) would not cause an issue or affect the server in an adverse way (there should easily be enough memory). Please note that my only communication method is via ODBC, and this is a proprietary database - I cannot create actual data tables on the server.
The reason I'm asking is that I previously generated a script of considerable size, but instead of putting the 3.5K IDs into a TEMP table (with associated data), I used an IN condition to look up the IDs only (processing could take place once the records were located). However, I cannot be certain whether it was the script editor (which was some kind of interface to the database), limits on the IN condition, or the size of the script itself that caused the problem; basically, the script would not run. After this we wrote a script in vi, saved it to a folder and attempted to execute that, with similar (but not identical) results (sorry, I don't have the error messages from either attempt; this was done a little while ago).
Any Informix-oriented tips in this area would really be appreciated! :o)
Which version of Informix are you using? Assuming it is either 12.10 or 14.10, there is no specific limit on the size of a set of statements, but a monstrosity like the one you're proposing is cruel and unusual punishment for a database server (it is definitely abusing your server).
It can also be moderately risky; you have to ensure you quote any data provided by the user correctly to avoid the problem of Little Bobby Tables.
You should be preparing one INSERT statement with two placeholder values:
INSERT INTO table(field_1, field_2) VALUES(?,?)
You should then execute this repeatedly, providing the different values. This will be more effective than making the server parse 3,500 similar statements. In ESQL/C, you can declare an INSERT cursor which will buffer the sets of values, reducing the round trips to the server — that can also be very valuable. I'm not sure whether that's an option in ODBC; probably not.
At the very least, you should experiment with using a prepared statement. Sending 3,500 x 60+ bytes (roughly 210 KB) to the server is doable. But you'd be sending a smaller volume of data to the server (though there'd be more round trips, which can be a factor) if you use the prepared statement and execute it repeatedly with new parameters each time. And you avoid the security risks of converting the values to strings. (Since you've not stated the types of the values, it's not certain there's a risk. If they're numeric, or things like dates and times, they're very low risk. If they're character strings, the risk is considerable - not insuperable, but not negligible.)
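For reference, a minimal sketch of that approach through the raw ODBC C API is shown below (handle allocation, connection setup and error checking are omitted; the table and column names are taken from your example, and the VARCHAR length of 30 for field_2 is just an assumption):

#include <sql.h>
#include <sqlext.h>
#include <string.h>

/* hstmt is a statement handle already allocated on an open connection. */
void insert_rows(SQLHSTMT hstmt, const SQLINTEGER *ids,
                 char values[][31], int nrows)
{
    SQLINTEGER id;
    SQLCHAR    value[31];
    SQLLEN     id_ind = 0, value_ind = SQL_NTS;   /* value buffer is null-terminated */

    /* Parse and optimize the statement once... */
    SQLPrepare(hstmt,
               (SQLCHAR *)"INSERT INTO temp_table_2 (field_1, field_2) VALUES (?, ?)",
               SQL_NTS);

    /* ...bind the parameter buffers once... */
    SQLBindParameter(hstmt, 1, SQL_PARAM_INPUT, SQL_C_SLONG, SQL_INTEGER,
                     0, 0, &id, 0, &id_ind);
    SQLBindParameter(hstmt, 2, SQL_PARAM_INPUT, SQL_C_CHAR, SQL_VARCHAR,
                     30, 0, value, sizeof(value), &value_ind);

    /* ...then execute once per row, refreshing the bound buffers each time.
       Each SQLExecute is still a round trip, but the server no longer has to
       parse 3,500 separate INSERT statements. */
    for (int i = 0; i < nrows; i++) {
        id = ids[i];
        strcpy((char *)value, values[i]);
        SQLExecute(hstmt);
    }
}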
Older versions of Informix had smaller limits on the size of a set of statements — 64 KiB, and before that, 32 KiB. You're unlikely to be using an old enough version for that to be a problem, but the rules have changed over time.
I'm confused about the advantage of embedded key-value databases over the naive solution of just storing one file on disk per key. For example, databases like RocksDB, Badger, SQLite use fancy data structures like B+ trees and LSMs but seem to get roughly the same performance as this simple solution.
For example, Badger (which is the fastest Go embedded db) takes about 800 microseconds to write an entry. In comparison, creating a new file from scratch and writing some data to it takes about 150 microseconds with no optimization.
EDIT: to clarify, here's the simple implementation of a key-value store I'm comparing with the state-of-the-art embedded dbs. Just hash each key to a string filename, and store the associated value as a byte array at that filename. Reads and writes are ~150 microseconds each, which is faster than Badger for single operations and comparable for batched operations. Furthermore, the disk space used is minimal, since we don't store any extra structure besides the actual values.
I must be missing something here, because the solutions people actually use are super fancy and optimized using things like bloom filters and B+ trees.
But Badger is not about writing "an" entry:
My writes are really slow. Why?
Are you creating a new transaction for every single key update? This will lead to very low throughput.
To get best write performance, batch up multiple writes inside a transaction using single DB.Update() call.
You could also have multiple such DB.Update() calls being made concurrently from multiple goroutines.
That leads to issue 396:
I was looking for fast storage in Go and so my first try was BoltDB. I need a lot of single-write transactions. Bolt was able to do about 240 rq/s.
I just tested Badger and I got a crazy 10k rq/s. I am just baffled
That is because:
LSM tree has an advantage compared to B+ tree when it comes to writes.
Also, values are stored separately in value log files so writes are much faster.
You can read more about the design here.
One of the main points (hard to replicate with simple reads/writes of files) is:
Key-Value separation
The major performance cost of LSM-trees is the compaction process. During compactions, multiple files are read into memory, sorted, and written back. Sorting is essential for efficient retrieval, for both key lookups and range iterations. With sorting, the key lookups would only require accessing at most one file per level (excluding level zero, where we’d need to check all the files). Iterations would result in sequential access to multiple files.
Each file is of fixed size, to enhance caching. Values tend to be larger than keys. When you store values along with the keys, the amount of data that needs to be compacted grows significantly.
In Badger, only a pointer to the value in the value log is stored alongside the key. Badger employs delta encoding for keys to reduce the effective size even further. Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs.
Your question assumes that the only operations needed are single random reads and writes. Those are the worst-case scenarios for log-structured merge (LSM) approaches like Badger or RocksDB. A range query, where all keys or key-value pairs in a range get returned, leverages sequential reads (due to the adjacency of sorted key-value pairs within files) to read data at very high speed. For Badger, you mostly get that benefit for key-only or small-value range queries, since keys are stored in an LSM while large values are appended to a not-necessarily-sorted log file. For RocksDB, you get fast range queries over whole key-value pairs.
The previous answer somewhat addresses the advantage on writes: the use of buffering. If you write many key-value pairs, rather than storing each in a separate file, LSM approaches hold them in memory and eventually flush them in a single file write. There's no free lunch, so asynchronous compaction must be done to remove overwritten data and to prevent queries from having to check too many files.
Previously answered here. Mostly similar to the other answers provided here, but it makes one important additional point: two files in a filesystem can't occupy the same block on disk. If your records are, on average, significantly smaller than the typical filesystem block size (4-16 KiB), storing them as separate files incurs substantial storage overhead.
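To make that overhead visible, here is a rough C sketch of the file-per-key baseline described in the question (no key hashing or error handling; the data/ directory name is made up). Even a 5-byte value ends up consuming a whole filesystem block plus an inode:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Naive "key-value store": one file per key under a fixed directory. */
static void kv_put(const char *key, const void *val, size_t len)
{
    char path[512];
    snprintf(path, sizeof(path), "data/%s", key);   /* real code would hash/escape the key */

    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);  /* directory + inode update */
    write(fd, val, len);                                      /* allocates at least one block */
    close(fd);
}

/* How much space the value really takes on disk. */
static long kv_disk_usage(const char *key)
{
    char path[512];
    struct stat st;
    snprintf(path, sizeof(path), "data/%s", key);
    stat(path, &st);
    return (long)st.st_blocks * 512;   /* st_blocks is in 512-byte units */
}

int main(void)
{
    mkdir("data", 0755);
    kv_put("user:42", "hello", 5);
    /* On most filesystems this prints 4096 or more for a 5-byte value. */
    printf("%ld bytes allocated on disk\n", kv_disk_usage("user:42"));
    return 0;
}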
I need to construct a priority queue in R where I will put the ordered seed objects (or the indices of the objects) for the OPTICS clustering algorithm.
One possibility is to implement it as a heap in its array representation, pass the heap array into each insert and decrease-key call, and return the changed array and reassign it in the calling function. In that case the reassignment will make performance very poor: every time an insert or decrease operation is executed, the entire array needs to be copied twice, once when calling and once when returning and reassigning.
Another possibility is to write the heap operations inline in the calling function instead of calling out to them. This results in code repetition and cumbersome code.
Is there any pointer-like access, as we have in C?
Can I declare user-defined functions in S3 or S4 classes in R? In that case I think calls to these functions still require the same reassignment after returning (unlike C++/Java classes, which operate on the object itself; am I right?).
Is there any built-in way to insert and extract an object from a queue in O(log n) time in R?
Is there any other way to achieve the goal, that is, to maintain priority-based insertion and removal of the seeds according to an object's reachability distance in the OPTICS algorithm, other than explicitly sorting after each insertion?
R5 classes (Reference classes) define mutable objects and are very similar to Java classes: they should allow you to avoid the copies when the object is modified.
Note that you do not just need a priority queue.
It actually needs to support efficient updates, too. A simple heap is not sufficient: you need to keep a hashmap in sync with it so you can find objects efficiently when updating their values, and then repair the heap at the changed position.
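Since the question mentions C-style pointer access: here is a sketch in C, for concreteness, of the structure described above, a binary min-heap plus a position map supporting insert, decrease-key and pop-min in O(log n). Object ids are assumed to be small non-negative integers so they can index the arrays directly; with R Reference classes you would hold the same three vectors as mutable fields and update them in place.

#include <stdio.h>

#define MAXN 1024   /* upper bound on both the number of objects and their ids */

/* Indexed binary min-heap: heap[] holds object ids ordered by key,
 * pos[id] is the slot of id inside heap[] (-1 if not queued),
 * key[id] is its priority, e.g. the OPTICS reachability distance. */
static int    heap[MAXN];
static int    pos[MAXN];
static double key[MAXN];
static int    n = 0;          /* current heap size */

static void swap_slots(int a, int b)
{
    int tmp = heap[a]; heap[a] = heap[b]; heap[b] = tmp;
    pos[heap[a]] = a;
    pos[heap[b]] = b;
}

static void sift_up(int i)
{
    while (i > 0 && key[heap[(i - 1) / 2]] > key[heap[i]]) {
        swap_slots(i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void sift_down(int i)
{
    for (;;) {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && key[heap[l]] < key[heap[smallest]]) smallest = l;
        if (r < n && key[heap[r]] < key[heap[smallest]]) smallest = r;
        if (smallest == i) return;
        swap_slots(i, smallest);
        i = smallest;
    }
}

static void push(int id, double k)           /* insert a new seed */
{
    key[id] = k;
    heap[n] = id;
    pos[id] = n;
    sift_up(n++);
}

static void decrease_key(int id, double k)   /* reachability distance got smaller */
{
    key[id] = k;
    sift_up(pos[id]);
}

static int pop_min(void)                     /* extract the seed with the smallest key */
{
    int top = heap[0];
    swap_slots(0, --n);
    pos[top] = -1;
    sift_down(0);
    return top;
}

int main(void)
{
    push(7, 3.5); push(2, 1.25); push(5, 9.0);
    decrease_key(5, 0.5);
    printf("%d %d %d\n", pop_min(), pop_min(), pop_min());   /* prints: 5 2 7 */
    return 0;
}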
In my code I define a new MPI user-defined data type.
I was wondering whether an MPI_Barrier call must follow MPI_Type_commit, or must be placed at the point where the new data type is first used, so that all the processes acknowledge and agree on the definition of the new data type.
Thanks.
No - there is no communication within the MPI_Type_* calls; they are completely local. In particular, processes don't necessarily have to agree on the definition of a new type.
If rank 1 sends a new data type to rank 0, all they have to agree on is the amount of data, not the layout of the type. For instance, imagine rank 1 was sending all of its (say, 2D) local array to rank 0; it might just choose to send an MPI_Type_contiguous of NX*NY floats. But rank 0 might be receiving this into a larger global array; it might choose to receive it into a subarray type of the global array. Even if those data types had the same names, they can describe different final layouts in memory, as long as the total amount of data is the same.
MPI datatypes are the private business of the process that creates them. They do not need to match; in fact, it is possible and perfectly legal for a receiving process to use a type map that differs from that of the sending process (as long as it doesn't lead to memory corruption, of course). As such, there is no synchronization whatsoever when creating or committing a type.
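To make the contiguous-versus-subarray example above concrete, here is a minimal sketch (run it with at least two ranks; the array sizes and the patch offset are arbitrary) in which the sender and the receiver commit completely different types that only agree on the total element count, with no barrier anywhere near MPI_Type_commit:

#include <mpi.h>
#include <stdio.h>

#define NX 4
#define NY 3
#define GX 8          /* the global array is larger than the local one */
#define GY 6

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* Sender describes its whole local block as one contiguous type. */
        float local[NX * NY];
        for (int i = 0; i < NX * NY; i++) local[i] = (float)i;

        MPI_Datatype block;
        MPI_Type_contiguous(NX * NY, MPI_FLOAT, &block);
        MPI_Type_commit(&block);                 /* purely local, no communication */

        MPI_Send(local, 1, block, 0, 0, MPI_COMM_WORLD);
        MPI_Type_free(&block);
    } else if (rank == 0) {
        /* Receiver uses a completely different type: an NX x NY patch of a
         * larger GX x GY array. Only the total element count has to match. */
        float global[GX * GY] = {0};
        int sizes[2]    = {GX, GY};
        int subsizes[2] = {NX, NY};
        int starts[2]   = {2, 1};                /* where the patch lands */

        MPI_Datatype patch;
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_C, MPI_FLOAT, &patch);
        MPI_Type_commit(&patch);                 /* again: no synchronization */

        MPI_Recv(global, 1, patch, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Type_free(&patch);
        printf("rank 0 received %d floats into the global array\n", NX * NY);
    }

    MPI_Finalize();
    return 0;
}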