Perform in situ updates in Teradata

What does it mean to perform in situ updates in Teradata? I could not find anything via Google. Any feedback would be very much appreciated.
PS:
This is an excerpt from a best practice doc I am digesting:
"Update processing of large amounts of data is perhaps the most inefficient operation that is routinely performed on Teradata. In almost all cases it is more efficient to insert the changed rows to another table rather than updating them in situ."
DON'T: Perform in situ updates or deletes unless unavoidable.
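If I understand it correctly, the alternative the doc describes would look something like this (a sketch with made-up table names; as I understand it, Teradata logs in-place UPDATEs row by row to the transient journal, while an INSERT ... SELECT into an empty table takes a fast path):

-- Sketch with hypothetical tables: apply changes by building a new
-- table instead of updating rows in situ.
INSERT INTO customer_new
SELECT c.customer_id,
       COALESCE(ch.new_status, c.status) AS status,
       c.created_at
FROM customer c
LEFT JOIN status_changes ch
       ON ch.customer_id = c.customer_id;
-- Afterwards, drop/rename so the new table replaces the original.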

Related

Partitioning vs extra database

Where I work, we have a dilemma. We are using a database (MariaDB 10) that has one table that is growing very large (107.4 GiB as I write this, i.e. 1.181 million rows). This of course affects the performance of the system.
A coworker and I had a discussion; he suggested using partitions on that table. This would likely increase performance, but it does not reduce the size of the DB.
I, however, have been working on a cron job that will move data older than two years from that table to an exact copy of the database in another location.
I feel that this is the more effective way. I expect that doing this will not only increase performance (except while the cron job is running) but will also reduce the size of the table.
We don't expect that our customers are interested in this old data anyway.
The question is: what would you choose? I prefer my option, because the old data is not used anyway and it keeps the main DB a lot cleaner; my coworker prefers his solution because it means less load at all times and customers can still access the old data.
I have read about some of the pros of partitioning but haven't yet found a comparison between partitioning and moving old data to another database/place.
The table in question is used by several queries. This is the most important insert:
INSERT INTO ".$defaultDataTable." (
sensor_data_type_id,
sequence_number,
value,
flag,
datetime
) VALUES (
'".Database::esc($sdtid)."',
'".Database::esc($valueSequence)."',
'".Database::esc($value)."',
'".Database::esc($valueSensorDataFlagsExtended)."',
'".Database::esc($valueDateTime)."'
);
The data is selected on several pages of the application; one example is the following.
SELECT
ws_sensor_data_type.sensor_data_type_id as sensor_data_type_id,
ws_sensor_data_type.name as sensor_data_type_name,
ws_sensor_data_type.equation_id as equation_id,
ws_sensor.name as sensor_name,
ws_equation.description as data_type_name,
ws_basestation.network_id as network_id,
ws_basestation.name as basestation_name,
ws_basestation.worldwide_id as worldwide_id,
ws_client.name as client_name,
ws_sensor.device_type_id as device_type,
ws_sensor.device_id as device_id
FROM
ws_sensor_data_type,
ws_sensor,
ws_basestation,
ws_client_basestation,
ws_client,
ws_equation
WHERE ws_sensor.sensor_id = ws_sensor_data_type.sensor_id
AND ws_sensor.basestation_id = ws_basestation.basestation_id
AND ws_basestation.basestation_id = ws_client_basestation.basestation_id
AND ws_client_basestation.client_id = ws_client.client_id
AND ws_sensor_data_type.equation_id = ws_equation.equation_id
AND ws_sensor_data_type.sensor_data_type_id = '".Database::esc($sdtid)."'
");
In this example, the data, along with some other information, is selected to create a .CSV export file.
The CREATE TABLE statement will follow, as I am creating a copy of the development DB right now to test partitioning on.
We do not use UUIDs, so that should not be a problem.
It depends.
Partitioning does not inherently improve performance; only a very limited number of use cases show any performance improvement.
If you are only fetching "recent" rows from the table and you have adequate indexing, then "neither" is the answer -- your million rows could grow to a billion without any performance degradation.
If you are using UUIDs, you are doomed. Performance declines terribly once the data is too big to be cached.
You have done some "hand waving". So have I. If you want to continue this discussion, please provide more specifics. CREATE TABLE, sample queries, proposed partition mechanism, proposed mechanism for accessing 'old' data, etc.
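One of the few clear partitioning wins is time-based purging, which matches the two-year retention idea above. A rough sketch, assuming MariaDB and the sensor-data table from the INSERT in the question (the table name and partition ranges are invented):

-- Sketch: RANGE partitioning by month, so old data can be removed
-- with a cheap metadata operation instead of a huge DELETE.
-- Caveat: every PRIMARY/UNIQUE key must then include `datetime`.
ALTER TABLE sensor_data
PARTITION BY RANGE (TO_DAYS(datetime)) (
    PARTITION p2015_01 VALUES LESS THAN (TO_DAYS('2015-02-01')),
    PARTITION p2015_02 VALUES LESS THAN (TO_DAYS('2015-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Purging data older than two years then becomes:
ALTER TABLE sensor_data DROP PARTITION p2015_01;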

Split large sqlite table by sessionid field

I am relatively new to sql(ite), and I'm learning as I go while working on a new project.
We have got millions of transaction rows in one "data" table, one field being a "sessionid" field.
Since I want to concentrate on in-session activity for now, I primarily need to look only at transactions from the same sessions.
My intuition is that it would be a lot faster to separate the database by session into many single-session tables than to always query for a single sessionid and then proceed. My question: is that correct? Will it make a difference?
Even if not: could you help me out and tell me how I could split the rows of the one "data" table into many session-specific tables, the rows staying the same? Plus one table that relates sessionIds to their tables?
Thanks!
A friend just told me the splitting-into-tables approach would be extremely inflexible, and that I should instead try adding an index on the sessionid column to access single sessions faster. Any thoughts on that, and how best to do it?
First of all, are you having any specific performance bottleneck with it till now? If yes, please describe it.
Having one table per session would probably speed up lookups and index maintenance (for INSERTs).
SQLite doesn't impose a limit on the number of tables, so you should be okay.
One other solution that provides easier maintenance is to create one table per day/week.
Depending on how long your sessions last, this could be feasible or not.
Related: https://stackoverflow.com/a/811862/89771
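For comparison, the index route suggested at the end of the question is a one-liner; a sketch using the table and column names from the question:

-- Sketch (SQLite): a plain index on sessionid usually beats splitting
-- the table, and it keeps the schema flexible.
CREATE INDEX idx_data_sessionid ON "data"(sessionid);

-- Fetching one session is then an indexed lookup:
SELECT * FROM "data" WHERE sessionid = :sid;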

What data store technology/solution allows very fast inserts, lookups and 'selects'

Here's my problem.
I want to ingest lots and lots of data... millions of rows now and, later, billions.
I have been using MySQL and I am playing around with PostgreSQL for now.
Inserting is easy, but before I insert I want to check whether that particular record already exists; if it does, I don't want to insert it. As the DB grows, this operation (obviously) takes longer and longer.
If my data were in a hashmap the lookup would be O(1), so I thought I'd create a hash index to help with lookups. But then I realised that if I have to compute the hash again every time, I will slow the process down massively (and if I don't compute the index, I don't have the O(1) lookup).
So I am in a quandary: is there a simple solution? Or a complex one? I am happy to try other datastores, but I need to be able to do reasonably complex queries, e.g. something similar to SELECT statements with WHERE clauses, so I am not sure whether NoSQL solutions are applicable.
I am very much a novice, so I wouldn't be surprised if there is a trivial solution.
NoSQL stores are good at handling huge volumes of inserts and updates.
MongoDB has a really good feature for update/insert (called upsert) that acts based on whether the document already exists.
Check out this page from the Mongo docs:
http://www.mongodb.org/display/DOCS/Updating#Updating-UpsertswithModifiers
You can also check out safe mode on the Mongo connection, which you can set to false to get more efficient inserts:
http://www.mongodb.org/display/DOCS/Connections
You could use CouchDB. It's NoSQL, so you can't run SQL queries per se, but you can create design documents that let you run map/reduce functions over your data.
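Staying in SQL, the check-then-insert round trip can also be eliminated by letting a unique key do the existence check. A sketch in MySQL syntax with invented names (PostgreSQL has an equivalent, INSERT ... ON CONFLICT DO NOTHING, in 9.5 and later):

-- Sketch: the UNIQUE key enforces "insert only if absent",
-- so no separate SELECT is needed before each INSERT.
CREATE TABLE events (
    event_key VARCHAR(64) NOT NULL,
    payload   TEXT,
    UNIQUE KEY uk_event (event_key)
);

INSERT IGNORE INTO events (event_key, payload)
VALUES ('abc123', 'some data');  -- duplicates are silently skipped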

Post-processing in SQL vs. in code

I have a general question about processing rows from a query. In general, I try to format/process my rows in SQL itself, using numerous CASE WHEN statements to pre-format the DB result, limiting rows and filling columns based on other columns.
However, you can also opt to just select all your rows and do the post-processing in code (ASP.NET in my case). Which approach do you think is best in terms of performance?
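For example, something like this (simplified, with invented columns):

-- Formatting done in SQL rather than in application code:
SELECT order_id,
       CASE WHEN status = 1 THEN 'Open'
            WHEN status = 2 THEN 'Shipped'
            ELSE 'Unknown'
       END AS status_label
FROM orders;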
Thanks in advance,
Stijn
I would recommend doing the processing in the code, unless you have network-bandwidth considerations. The simple reason is that it is generally easier to make code changes than database changes. Furthermore, performance is more often determined by the actual database query and disk access than by the amount of data returned.
However, I'm assuming that you are referring to "minor" formatting changes to the result. Standard WHERE clauses should naturally be done in the database.

Which is faster? Data retrieval

Is it quicker to make one trip to the database and bring back 3000+ rows, then manipulate them in .NET with LINQ, or to make 6 calls bringing back a couple of hundred rows at a time?
It will depend entirely on the speed of the database, the network bandwidth and latency, the speed of the .NET machine, the actual queries, etc.
In other words, we can't give you a truthful general answer. I know which sounds easier to code :)
Unfortunately this is the kind of thing which you can't easily test usefully without having an exact replica of the production environment - most test environments are somewhat different to the production environment, which could seriously change the results.
Is this for one user, or will many users be querying the data? The single database call will scale better under load.
Speed is only one consideration among many.
How flexible is your code? How easy is it to revise and extend when the requirements change? How easy is it for another person to read and maintain? How portable is it? What if you change to a different DBMS, or a different programming language? Are any of these considerations important in your case?
Having said that, go for the single round trip if all other things are equal or unimportant.
You mentioned that the single round trip might result in reading data you don't need. If all the data you need can be described in a single result table, then it should be possible to devise a query that produces that result. The result table might deliver some data in more than one row if the query denormalizes it; in that case, you might gain some speed by obtaining the data in several result tables and composing the result yourself.
You haven't given enough information to know how much programming effort it will be to compose a single query or to compose the data returned by 6 queries.
As others have said, it depends.
If you know which 6 SQL statements you're going to execute beforehand, you can bundle them into one call to the database, and return multiple result sets using ADO or ADO.NET.
http://support.microsoft.com/kb/311274
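A minimal sketch of that batching idea (SQL Server syntax; the table and column names are borrowed from the follow-up later in the thread):

-- One round trip, several result sets; read them in order with
-- SqlDataReader.NextResult() on the .NET side.
SELECT field1, field2 FROM tab WHERE userid = 1 AND typeid = 1;
SELECT field1, field2 FROM tab WHERE userid = 1 AND typeid = 2;
-- ...one SELECT per grid.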
The problem I have here is that I need it all, I just need it displayed separately...
The answer to your question is that 1 query for 3000 rows is better than 6 queries for 500 rows (given that you are bringing all 3000 rows back regardless).
However, there's no way you're going to (want to) display 3000 rows at a time, is there? In all likelihood, irrespective of using LINQ, you're going to want to run aggregating queries and get the database to do the work for you. You should be able to construct the SQL (or LINQ query) to perform all the required logic in one shot.
Without knowing what you're doing, it's hard to be more specific.
If you absolutely, positively need to bring back all the rows, then investigate the ToLookup() method on your LINQ IQueryable<T>. It's very handy for grouping results in non-standard ways.
Oh, and I highly recommend LINQPad (free) for trying out LINQ queries. It has loads of examples, and it shows you the SQL and lambda forms so you can familiarize yourself with the LINQ <-> lambda <-> SQL correspondence.
Well, the answer is always "it depends". Do you want to optimize for database load or for application load?
My general answer in this case would be to use queries as specific as possible at the database level, and therefore make the 6 calls.
Thx
I was kind of thinking "ballpark", but it sounds as though it's a judgment call... the difference is likely small.
I was thinking that getting all the data and manipulating it in .NET would be best. I have nothing concrete to base this on (hence the question); I just tend to feel that calls to the DB are expensive, and if I know I need all the data... get it in one hit?!?
Part of the problem is that you have not provided sufficient information for a precise answer. Obviously, the available resources need to be considered.
If you pull 3000 rows infrequently, it might work for you in the short term. However, if, say, 10,000 people execute the same query (ignoring cache effects), this could become a problem for both the app and the DB.
Now in the case of something like pagination, it makes sense to pull in just what you need. But that would be a general rule to try to only pull what is necessary. It's much more elegant to use a scalpel instead of a broadsword. =)
If you are talking about a query that has already been run by SQL (and so optimized by SQL Server), working with LINQ or a SqlDataReader might actually give the same performance.
The only difference will be "how hard will it be to maintain your code?"
LINQ doesn't send anything to the database until you ask for the result with .ToList(), .ToArray(), or even .Count(). LINQ builds your query dynamically, so it is exactly the same as having a SqlDataReader, but with runtime verification.
Rather than speculating, why don't you try both and measure the results?
It depends
1) If your connector implementation precaches a lot of objects AND you have big rows (for example BLOBs, country polygons, etc.), you have a problem: you have to download a LOT of data. I once optimized code with this problem; it was constantly downloading megabytes of garbage via localhost, and the software now runs 10 times faster because I made the precaching optional.
2) If your rows are small and there is a good chance that you need to read through all 3000, you are better off with one big result set.
3) If you don't use prepared statements, every query has to be parsed! A big result set might be better here, too.
Hope that helps.
I always stick to the rule of "bring in what I need" and nothing more... The problem I have here is that I need it all, I just need it displayed separately.
So say...
I have a table with userid and typeid. I want to select all records for a userid and display them on the page in grids, separated by typeid.
At the moment I call a sproc that does "select field1, field2 from tab where userid=1",
then on the page I set each grid's datasource to: from t in tab where typeid=2 select t;
rather than calling a different sproc, "select field1, field2 from tab where userid=1 and typeid=2", 6 times.
??
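A sketch of the single-round-trip version of that pattern, using the names from the comment above (typeid is selected too, so the page can split the rows itself):

-- One query for all grids; ORDER BY typeid keeps each grid's rows
-- contiguous, so the page can split the result without re-querying.
SELECT field1, field2, typeid
FROM tab
WHERE userid = 1
ORDER BY typeid;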
