Teradata Fastload - Sequential order from flat file

A flat file is ingested into a Teradata staging area using the FastLoad utility. After this load, a MERGE operation inserts/updates a target table from this data, based on the latest timestamp. I ran into a problem when the timestamp was the same for a customer. Let me explain using the following data in the flat file:
Cust1 | 123 | 15-May-2018 13:01:01
Cust1 | 234 | 15-May-2018 13:01:01
Cust2 | 111 | 15-May-2018 13:02:01
This is the order of the data in the flat file. As you can see, both records for Cust1 have the same timestamp, but the second record in the flat file is the latest, since the sequential write placed it in the second row.
How do I fetch this record to be used in the MERGE statement? Currently my MERGE statement partitions the data based on the TIMESTAMP value. Is there any way to find the sequential order when FastLoad runs, or some kind of row_id to use?
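One possible approach (a sketch only, not from the original post): as far as I know FastLoad itself does not expose a row id and rows in the staging table have no inherent order, so the sequence has to come from the file, e.g. by having the extract add a monotonically increasing sequence column (called file_seq here). The MERGE source can then pick the last record per customer with ROW_NUMBER(); all table and column names below are illustrative.
MERGE INTO target_table tgt
USING (
    -- keep only the latest row per customer: newest timestamp first,
    -- then the highest file sequence number as the tie-breaker
    SELECT cust_id, some_value, event_ts
    FROM   staging_table
    QUALIFY ROW_NUMBER() OVER (PARTITION BY cust_id
                               ORDER BY event_ts DESC, file_seq DESC) = 1
) AS src
ON (tgt.cust_id = src.cust_id)   -- assumes cust_id is the primary index of the target
WHEN MATCHED THEN UPDATE SET
    some_value = src.some_value,
    event_ts   = src.event_ts
WHEN NOT MATCHED THEN INSERT (cust_id, some_value, event_ts)
    VALUES (src.cust_id, src.some_value, src.event_ts);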

Related

Fastest way to dump all blobs in PL/SQL

I'm trying to dump all blobs in a large table to the file system. The table is like:
Name         Null?     Type
-----------  --------  -------------
ID           NOT NULL  NUMBER(19)
FILENAME     NOT NULL  VARCHAR2(256)
OFFSET_BLOB            BLOB
I need to dump all OFFSET_BLOBs to the file system in files named [filename].offsets.
My current approach is to create a stored procedure that iterates through all of the rows and then uses UTL_FILE.put_raw() to write the data to the file. It works fine, but the table has over 250 million rows and the current estimate is that it will take 5 days to complete. I tried adding parallel hints on the query with /*+ FULL PARALLEL(10) */ but it doesn't improve anything :(.
Do any of you have a better approach to drastically reduce the extract time to hours instead of days? Thank you!
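For reference, here is a minimal sketch of the row-by-row approach described above (the table name blob_table and the directory object DUMP_DIR are placeholders, not from the original post):
DECLARE
    l_file UTL_FILE.file_type;
    l_raw  RAW(32767);
    l_amt  PLS_INTEGER;
    l_pos  PLS_INTEGER;
    l_len  PLS_INTEGER;
BEGIN
    -- loop over every row that has a BLOB to dump
    FOR r IN (SELECT filename, offset_blob
              FROM   blob_table
              WHERE  offset_blob IS NOT NULL) LOOP
        l_file := UTL_FILE.fopen('DUMP_DIR', r.filename || '.offsets', 'wb', 32767);
        l_pos  := 1;
        l_len  := DBMS_LOB.getlength(r.offset_blob);
        -- read the BLOB in 32K chunks and append each chunk to the file
        WHILE l_pos <= l_len LOOP
            l_amt := 32767;
            DBMS_LOB.read(r.offset_blob, l_amt, l_pos, l_raw);
            UTL_FILE.put_raw(l_file, l_raw, TRUE);
            l_pos := l_pos + l_amt;
        END LOOP;
        UTL_FILE.fclose(l_file);
    END LOOP;
END;
/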

Slow query on table | WHERE x | ORDER by timestamp | DISTINCT a,b,c,d | TAKE 20 when table large

We are experiencing a sudden performance drop with a query structured like this:
table(tablename)
| where MeasurementName in ('ActiveJobId')
and MachineId == machineId
and SourceTimestamp <= from
and isnotnull( Value)
| order by SourceTimestamp desc
| distinct SourceTimestamp, MeasurementName, tostring(Value), SourceTimestampUtc
| take rows
tablename, machineId, from, and rows are all query parameters; rows is typically 20. The Value column is of type dynamic.
The table contains 240 Million entries, with about 64,000 matching the WHERE criteria. The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.
The query runs smoothly in the Staging database system, but started to degrade in performance on the Dev system, possibly because of the increased data volume.
If we remove the distinct clause, or move it behind the take clause, the query completes very fast (<1 s). The data contains about 5-10% duplicate entries.
To our understanding the query should be performed like this:
Prepare a filter for the source table, start at a specific datetime range
Order desc: walk backwards
Walk down the table and stop when you got 20 distinct rows
From the time it sometimes takes, it looks almost as if ADX walks the whole table, performs the distinct, and only then takes the topmost 20 rows.
The problem persists if we swap | order and | distinct around.
The problem disappears if we move | distinct to the end of the query, but then we often receive 1-2 items fewer than required.
Is there a logical error we make, can this query be rewritten, or are there better options at hand?
The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.
This part of the description doesn't match the filter in your query: and SourceTimestamp <= from - did you mean to use >= instead of <= ?
Is there a logical error we make, can this query be rewritten, or are there better options at hand?
If you can't eliminate the duplicates upstream, you can consider setting up a materialized view that performs the deduplication, then querying the view directly instead of the raw data. Also see Handle duplicate data.
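A minimal sketch of such a materialized view (Measurements is a placeholder for your actual table name, and the group-by keys are an assumption about what makes a row a duplicate):
.create materialized-view DedupMeasurements on table Measurements
{
    Measurements
    | summarize take_any(*) by MachineId, MeasurementName, SourceTimestamp
}
The original filter / order by / take pipeline can then run against DedupMeasurements without the | distinct step.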

Is there a way to clone a table in Kusto?

Is there a way to clone a table in Kusto exactly, so that it has all the extents of the original table? Even if it's not possible to retain the extents, is there at least a performant way to copy a table to a new table? I tried the following:
.set new_table <| existing_table;
It ran forever and got a timeout error. Is there a way to copy so that the Kusto engine recognizes this is just a dumb copy, so that instead of going through the Kusto engine it simply does a blob copy on the back end and points the new table to the copied blobs, bypassing the whole Kusto processing route?
1. Copying the schema and data of one table to another is possible using the command you mentioned (another option to copy the data is to export its content into cloud storage, then ingest the resulting storage artifacts using Kusto's ingestion API or a tool that uses it, e.g. LightIngest or ADF).
Of course, if the source table has a lot of data, then you would want to split this command into multiple ones, each dealing with a subset of the source data (which you can 'partition', for example, by time).
Below is just one example (it obviously depends on how much data you have in the source table):
.set-or-append [async] new_table <| existing_table | where ingestion_time() > X and ingestion_time() < X + 1h
.set-or-append [async] new_table <| existing_table | where ingestion_time() >= X+1h and ingestion_time() < X + 2h
...
Note that async is optional, and is there to avoid a potential client-side timeout (by default after 10 minutes). The command itself continues to run on the backend for up to a non-configurable timeout of 60 minutes (though it's strongly advised to avoid such long-running commands, e.g. by performing the "partitioning" mentioned above).
2. To your other question: There's no option to copy data between tables without re-ingesting the data (an extent / data shard currently can't belong to more than 1 table).
3. If you need to "duplicate" data being ingested into table T1 continuously into table T2, and both T1 and T2 are in the same database, you can achieve that using an update policy.
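A sketch of what that could look like (assuming T2 already exists with a schema matching the query output; here the query is a simple pass-through of T1, but in practice it usually points to a stored function that transforms the rows):
.alter table T2 policy update
@'[{"IsEnabled": true, "Source": "T1", "Query": "T1", "IsTransactional": false}]'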

Oracle APEX Search Engine

I'm using Oracle APEX 4.2.6 and Oracle DB 11gR2
I've an interactive report showing the list of clients.
The end user can modify the Name of the client.
My issue is that I have to find a way to allow the end user to find the modified client by searching for it with its old name.
For example, the end user modifies the name of a client from OLD NAME to NEW NAME.
In the search engine of the interactive report, the end user must be able to find the client by searching for it by its old name, OLD NAME.
Is there a way to manage this situation on the APEX side or the database side?
This is very much a database issue, not an APEX issue. When the user modifies the client name, you will need to record the old name somewhere: this could simply be an OLD_NAME column on the CLIENTS table (which would only support knowing the previous name for a single name change), or it could be a CLIENT_NAME_HISTORY table to which a row is added every time a client name is changed.
Having done that, your interactive report's SQL can then be modified to search both old and new names to find the client - for example:
select ...
from clients
where (name like :P1_NAME or old_name like :P1_NAME)
or
select ...
from clients c
where (c.name like :P1_NAME
       or exists (select null
                  from client_name_history h
                  where h.client_id = c.client_id
                  and h.name like :P1_NAME))
Note that I think you will need to create a page item for the name filter, because the built-in filter of the IR can only search data that is displayed in the report, which previous names will not be (presumably).
Having additional columns might not be a "scalable" solution. What if another user changes the name again? And again? And again?
A better approach would be to store this data in rows that are uniquely identified by a combination of the client's primary key and an object version identifier - this could be a number, a timestamp, or a date range. This is an approach that Oracle themselves use in many of their enterprise applications.
Examples of the data would look like the following.
1.) Using Object Version Number
Client Id | Client Name | Object Version Number
1 | Bob | 1
1 | Sam | 2
1 | Ed | 3
Here, every time a user changes the name an additional row is created maintaining the same client_id value but incrementing the object version number by 1. The highest ovn represents the latest value. You could also have a column called "latest_record" and insert a value of Y when creating a new record to show that this is the latest record (resetting the value in the previous latest record to N). Similarly, instead of a number, you can simply store the timestamp and use that to determine the latest record.
2.) Using Date Range
Client Id | Client Name | Start Date | End Date
1 | Bob | 01-Jan-2017 | 31-Jan-2017
1 | Sam | 02-Feb-2017 | 02-Mar-2017
1 | Ed | 03-Mar-2017 |
In this approach, you are specifying the period of time for which the name was valid. A use case would be an individual adopting the surname of their partner after marriage. In such a case, one name was valid from the date of birth to the date of marriage and another name was valid from the date of marriage onwards.
Once you prepare your data structure in this format, in the APEX report you just need to query the single name column. I feel additional tables and columns are unnecessary overhead in this case.
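For example, with the object-version-number layout above, the interactive report could return the current name of any client whose present or past name matches the search term (a sketch only; the client_names table and its columns mirror the example above, and :P1_NAME is the page item mentioned in the previous answer):
select c.client_id,
       max(c.client_name) keep (dense_rank last order by c.object_version_number) as client_name
  from client_names c
 where exists (select null
                 from client_names h
                where h.client_id = c.client_id
                  and upper(h.client_name) like '%' || upper(:P1_NAME) || '%')
 group by c.client_id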
Regards,
SJ

How to create lines/stops relationship

I'm not a database expert and I'm simply building a prototype app, so nothing really important.
Anyway, the app is about a subway: this subway has many lines and sometimes some stops are shared between lines (so, for example, stops 3 and 4 are stops of lines 2, 7 and 9).
So, I made up a SQLite stops table:
+---------+-------------+------+
| Field | Type | Auto |
+---------+-------------+------+
| id | integer | YES |
| name | varchar(20) | NO |
| lines | ? | NO |
+---------+-------------+------+
What's the best way to deal with shared stops? My idea was to create a lines table and then in the lines field of the stops table put a comma separated list of lines.id. I don't know why, but I feel there could be a better way.
Any suggestion is appreciated, and sorry for the really noob question.
I would keep it simple and use a table lines which has an ID (primary key) along with other metadata for a line (such as name):
lines
(id, name)
Then, create a table for the stops:
stops
(id, name)
Finally, you can create a bridge table which will connect lines with stops:
bridge
(lineId, stopId)
Each record in the bridge table represents one line having a given stop.
Note that using CSV to represent a line having multiple stops is totally not the way to go here, as it renders the powers of your relational database useless.
Update:
If you want to record the position of a stop in a given line (and assuming that positions would differ across lines), you could use the following table:
stopNumbers
(lineId, stopId, stopPosition)
The stop position can be obtained knowing the line's ID and the stop's ID.
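In SQLite terms, that schema could look roughly like this (table and column names follow the answer above; stopNumbers doubles as the bridge table, since it already holds one row per line/stop pair):
CREATE TABLE lines (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(20) NOT NULL
);
CREATE TABLE stops (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(20) NOT NULL
);
CREATE TABLE stopNumbers (
    lineId       INTEGER NOT NULL REFERENCES lines(id),
    stopId       INTEGER NOT NULL REFERENCES stops(id),
    stopPosition INTEGER NOT NULL,
    PRIMARY KEY (lineId, stopId)
);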
You need a many-to-many relation, which is stored in a separate table like this:
table lines_to_stops
line_fk
stop_fk
That's the relational world ...
Note that records in the database are not in any specific order. If you need to put the stops into a specific order (which you most probably do), you have to store that order in the database as well:
table lines_to_stops
line_fk
stop_fk
order_in_line
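For example, listing the stops of one line in order could then look like this (a sketch; the stops table with id and name columns is assumed from the question):
SELECT s.name
  FROM lines_to_stops ls
  JOIN stops s ON s.id = ls.stop_fk
 WHERE ls.line_fk = 2
 ORDER BY ls.order_in_line;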
