DataStage challenge - Unix

I have multiple txt files, each with 1 million records (say 10 files), and their names are listed in LIST_OF_FILES.txt.
I have created a sequence job and a parallel job to extract data from those files and load it into DB2 tables.
Imagine I am done with the first 2 files. While loading the 3rd file (say 10,000 records have been loaded into the table so far), the parallel job aborted due to some environmental issue.
Now I want to resume loading from record 10,001, the point where the job aborted.
JOB DESIGN
Execute command activity_1: wc -l LIST_OF_FILES.txt (counts the files to process).
Start Loop: Start: 1, Step: 1, To: output of Execute command activity_1.
Execute command activity_2: head -output_loop_counter LIST_OF_FILES.txt | tail -1 (where output_loop_counter is the current loop counter; this picks the Nth file name from the list).
Parallel job: the extract job that loads records from the file into the table.
Execute command activity_3: move the processed file to another folder.
End Loop: the above steps repeat until the last file is processed.

This is not an out-of-the-box capability. You need a job design that keeps track of the number of records processed, so that you can restart from a known good point. Keep in mind, too, that any pending transactions are rolled back if a job aborts - your design probably needs to check how many rows were actually loaded into the target table.
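For example, if the target table records which source file each row came from and a per-file sequence number (both hypothetical columns here), a query like the following tells you where to restart:

SELECT COALESCE(MAX(RECORD_SEQ), 0) AS LAST_LOADED_ROW   -- resume from this value + 1
FROM   MYSCHEMA.TARGET_TABLE                             -- hypothetical schema/table
WHERE  SOURCE_FILE = 'FILE_03.txt';                      -- the file being reloaded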

I would keep the sequence design above and alter the extract job to perform a lookup against the target table on its unique/primary key, assuming the table has one.
Set the lookup failure rule to reject and connect your load stage to the reject link. Rows that do find a match (i.e. records already loaded) can be dumped into a Copy stage and dead-ended. This loosely mimics the Change Capture stage, but with no sorting requirement and no warnings on values.
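In plain SQL terms, the extract job then behaves roughly like this (table and column names are hypothetical):

INSERT INTO MYSCHEMA.TARGET_TABLE (PK_COL, COL_A, COL_B)
SELECT S.PK_COL, S.COL_A, S.COL_B
FROM   MYSCHEMA.STAGED_FILE_ROWS S               -- rows read from the current file
WHERE  NOT EXISTS (SELECT 1
                   FROM   MYSCHEMA.TARGET_TABLE T
                   WHERE  T.PK_COL = S.PK_COL);  -- skip rows already loaded

Only the rows that fail the lookup (not yet in the target) reach the load stage; everything else is discarded.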

Related

DataFactory copies files multiple times when using wildcard

Hi all, complete ADF newbie here. I have a strange issue with Data Factory and surprisingly can't see that anyone else has experienced the same issue.
To summarize:
I have set up a basic copy activity from Blob storage to an Azure SQL database with no transformation steps
I have set up a trigger based on a wildcard name, i.e. any file loaded to blob storage whose name starts with IDT* will be copied to the database
I have loaded a few files to a specific location in Azure Blob
The trigger is activated
At first it looks like it all works, but a quick check of the record count shows that the same files have been imported X number of times
I have analysed what is happening: when I load my files to blob storage, they don't technically arrive at the exact same time. When file 1 hits the blob, the wildcard search is triggered and finds 1 file. When the 2nd file hits the blob some milliseconds later, the wildcard search is triggered again and this time processes 2 files (the first and the second).
The problem keeps compounding with the number of files loaded.
I have tried multiple things to get this fixed to no avail, because fundamentally it is behaving "correctly".
I have tried:
1. Deleting the file once it has been processed, but again, due to the timing issue the file is technically still there and can still be picked up.
2. Adding a loop to process 1 file at a time and delete it (by file name in the blob) before the next is loaded, but this hasn't worked (and I can't explain why).
3. Limiting ADF to only 1 concurrency; this reduces the number of duplicates but unfortunately they still occur.
4. Putting a wait timer at the start of the copy activity, but this causes a resource locking issue: I get an error saying that multiple waits are causing the process to fail.
5. A combination of 1, 2 and 3, which ends with an entirely different issue: it tries to find file X, which no longer exists because it was deleted as part of step 2 above.
I am really struggling with something that seems extremely basic, so I am sure I am overlooking something fundamental, as no one else seems to have this issue with ADF.

SQL Server Data Archiving

I have a SQL Azure database on which I need to perform some data archiving operation.
The plan is to move all the irrelevant data from the actual tables into Archive_* tables.
I have tables with up to 8-9 million records.
One option is to write a stored procedure that inserts the data into the new Archive_* tables and then deletes it from the actual tables.
But this operation is really time-consuming, running for more than 3 hours.
I am in a situation where I can't have more than an hour's downtime.
How can I make this archiving faster?
You can use Azure Automation to schedule execution of a stored procedure every day at the same time, during your maintenance window, where the stored procedure archives only the oldest week or month of data each time it runs, i.e. only data older than X weeks/months/years. Please read this article to create the runbook. In a few days you will have all the old data archived, and the runbook will keep doing the job from then on.
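A minimal sketch of such a procedure, with hypothetical table and column names (dbo.Orders, dbo.Archive_Orders, CreatedDate) and a one-week slice per run:

CREATE PROCEDURE dbo.usp_ArchiveOldestSlice
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @cutoff datetime2 = DATEADD(MONTH, -12, SYSUTCDATETIME());   -- keep 12 months online
    DECLARE @sliceStart datetime2 =
        (SELECT MIN(CreatedDate) FROM dbo.Orders WHERE CreatedDate < @cutoff);
    IF @sliceStart IS NULL RETURN;                                       -- nothing left to archive
    DECLARE @sliceEnd datetime2 = DATEADD(WEEK, 1, @sliceStart);
    IF @sliceEnd > @cutoff SET @sliceEnd = @cutoff;

    BEGIN TRANSACTION;
    INSERT INTO dbo.Archive_Orders                                       -- assumes identical column layout
        SELECT * FROM dbo.Orders
        WHERE CreatedDate >= @sliceStart AND CreatedDate < @sliceEnd;
    DELETE FROM dbo.Orders
        WHERE CreatedDate >= @sliceStart AND CreatedDate < @sliceEnd;
    COMMIT TRANSACTION;
END;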
You can't make it faster, but you can make it seamless. The first option is to have a separate task that moves data in portions from the source to the archive tables. In order to prevent lock escalation and overall performance degradation, I would suggest limiting the size of a single transaction: e.g. start a transaction, insert N records into the archive table, delete those records from the source table, commit the transaction. Continue for a few days until all the necessary data is transferred. The advantage of this approach is that if there is some kind of failure, you can restart the archival process and it will continue from the point of the failure.
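A minimal sketch of one such batch loop in T-SQL (table, key and date names are hypothetical; DELETE ... OUTPUT is used here to express the insert-then-delete pair as a single statement):

DECLARE @batch int = 10000;      -- rows per transaction
DECLARE @rows  int = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (@batch) FROM dbo.Orders
    OUTPUT deleted.* INTO dbo.Archive_Orders   -- assumes identical column layout
    WHERE CreatedDate < '2020-01-01';

    SET @rows = @@ROWCOUNT;                    -- rows moved in this batch

    COMMIT TRANSACTION;
END;

Because each batch commits on its own, a failure only loses the in-flight batch; re-running the loop simply picks up where it left off.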
The second option, which does not exclude the first one, really depends on how critical the performance of the source tables is for you and how many updates they receive. If that is not a problem, you can write triggers that pour every inserted/updated record into an archive table as it happens. Then, when you want to clean up, all you need to do is delete the obsolete records from the source tables; their copies will already be in the archive tables.
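A minimal sketch of such a trigger, again with hypothetical names (note that every update adds another copy of the row to the archive):

CREATE TRIGGER dbo.trg_Orders_Archive
ON dbo.Orders
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.Archive_Orders     -- assumes identical column layout
    SELECT * FROM inserted;
END;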
In both cases you will not need any downtime.

Controlling read locks on table for multithreaded plsql execution

I have a driver table with a flag that determines whether a record has been processed or not. I have a stored procedure that reads the table, picks a record up using a cursor, does some work (inserts into another table) and then updates the flag on the record to say it has been processed. I'd like to be able to execute the SP multiple times in parallel to increase throughput.
The obvious answer seemed to be to use FOR UPDATE SKIP LOCKED in the cursor's SELECT, but it seems this means I cannot commit within the loop (to update the processed flag and commit my inserts) without getting the ORA-01002 "fetch out of sequence" error.
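Roughly, the pattern looks like this (table and column names are simplified placeholders); the COMMIT inside the loop releases the row locks and invalidates the FOR UPDATE cursor, so the next fetch raises ORA-01002:

DECLARE
    CURSOR c_work IS
        SELECT id, payload
        FROM   driver_table
        WHERE  processed_flag = 'N'
        FOR UPDATE SKIP LOCKED;
BEGIN
    FOR rec IN c_work LOOP
        INSERT INTO target_table (id, payload)
        VALUES (rec.id, rec.payload);

        UPDATE driver_table
        SET    processed_flag = 'Y'
        WHERE  CURRENT OF c_work;

        COMMIT;   -- invalidates c_work; the next fetch fails with ORA-01002
    END LOOP;
END;
/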
Googling tells me Oracle AQ is the answer, but for the time being that option is not available to me.
Other suggestions? This must be a pretty common requirement, but I've been unable to find anything useful.
TIA!
A

sqlite first executed query slow after opening connection

I created an SQLite 3 database (using SQLite Expert Professional) with 1 table and more than 500,000 records.
If I run a simple query like:
select * from tableOne where entry like 'book one'
and it is my first command after connecting to the database, it takes a considerably long time to execute and retrieve the result (~15 seconds), but right after that first command everything goes back to normal and every command executes at a very acceptable speed.
Even if I close my application (I use pure Lua with SQLite modules, and within its logic all connections are properly closed), as long as Windows (8 x64) is running and has not been restarted, every command, even the first one, executes quickly; but after restarting Windows the first command is once again slow.
What is the reason?
How can I prevent this?
Most likely, after the first run the operating system's file cache has been loaded with your data, so subsequent queries are fast. Do you have an index on entry? An index allows efficient querying when filtering on entry. You may want to create one:
CREATE INDEX i_tableone_entry ON tableOne( entry );
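One way to check whether the index is actually used for this query (the output format depends on your SQLite version):

EXPLAIN QUERY PLAN
select * from tableOne where entry like 'book one';

Note that SQLite's LIKE optimization has conditions (e.g. collation settings), so EXPLAIN QUERY PLAN is the quickest way to confirm the index is picked up for this particular query.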

synchronize multiple map reduce jobs in hadoop

I have a use case where multiple jobs can run at the same time. The output of all the jobs has to be merged into a common master file in HDFS (containing key-value pairs) that has no duplicates. I'm not sure how to avoid the race condition that could crop up here: for example, Job 1 and Job 2 could simultaneously write the same value to the master file, resulting in duplicates. Appreciate your help on this.
Apache Hadoop doesn't support parallel writing to the same file. Here is the reference.
Files in HDFS are write-once and have strictly one writer at any time.
So multiple maps/jobs can't write to the same file simultaneously. A separate job, shell script, or other program has to be written to merge the outputs of the individual jobs into the master file.
