SRT file grows up to 16GB in the temp directory on Linux - Progress 4GL - OpenEdge

There is a program in production that exports audit data from audit tables. Its performance is very poor: it runs for 4 hours just to export 5GB of data. Before producing the output it ran out of disk space because the sort/temp (SRT) file grew to 16GB (far too large). Once a client-server connection succeeds, the client session runs a query that requires a results-list SRT temp file on the server side. I am fairly sure that generating this SRT file took 3 to 3.5 hours of the total run time.
Now the question is: will the SRT file contents be identical for the same query?
If so, can we reuse the SRT file directly to generate the data, so that we avoid regenerating it each time and save some time? It may be a naive question, but I just want to know whether it is possible.

No, it is not possible to use an SRT file from a previous session in another session.

Related

sqlite3 multiple inserts really slow

I have built a file archiver on Windows that uses sqlite3 to store files and takes advantage of multicore techniques to complete the archive faster.
I am now trying a backup of 100,000 files, and insertion is slow.
When I comment out the line that inserts, the app uses 100% CPU, which is normal. With the insertion line enabled, it rarely gets above 25%.
As the archiving progresses, insertion gets slower and slower, processing a few files per second at a CPU usage of 11%. No disk usage is shown, so the bottleneck can't be the disk.
I've set:
PRAGMA temp_store = MEMORY
PRAGMA journal_mode = MEMORY
PRAGMA synchronous = OFF
and the entire insertion is within a transaction.
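For reference, a minimal sketch of this setup in Python's sqlite3 module (the original archiver is a C/C++ Windows program; the table, the `files` iterator, and the database name here are made up):

import sqlite3

conn = sqlite3.connect('archive.db')          # hypothetical database name
conn.execute('PRAGMA temp_store = MEMORY')
conn.execute('PRAGMA journal_mode = MEMORY')
conn.execute('PRAGMA synchronous = OFF')

with conn:  # a single transaction around the whole insertion
    for name, data in files:  # `files` yields (name, bytes) pairs -- hypothetical
        conn.execute('INSERT INTO archive(name, content) VALUES (?, ?)',
                     (name, data))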
After further analysis, it seems SQLite's problem is binding the blob via sqlite3_bind_blob64() (if I pass 0 instead, it seems fine).
Why would SQLite have a problem inserting a raw blob of data into the archive?
Any ideas?
Thanks.
Your answer may lie here:
https://www.sqlite.org/threadsafe.html
Because it says there that:
The default mode is serialized.
which might explain your observations.
According to that document, you can either configure this at compile time (which I most definitely would not do myself) or via:
sqlite3_config(SQLITE_CONFIG_MULTITHREAD);
Just how stratospherically it then performs I wouldn't know.
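As a side note, if you want to check which threading mode your SQLite build uses from Python, the DB-API exposes it (a quick probe only, not part of the original answer):

import sqlite3

# DB-API 2.0 threadsafety level of the underlying build; on Python 3.11+
# this reflects SQLite's compile-time threading mode (serialized, multithread, ...)
print(sqlite3.threadsafety)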

How to speed up bulk importing into Google Cloud Datastore with multiple workers?

I have an Apache Beam-based Dataflow job that uses the VCF source to read a single text file (stored in Google Cloud Storage), transform the text lines into Datastore entities, and write them to the Datastore sink. The workflow works fine, but the cons I noticed are:
The write speed into Datastore is at most around 25-30 entities per second.
I tried --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100, but the execution seems to prefer one worker (see graph below: the target workers once increased to 2 but were reduced to 1 "based on the ability to parallelize the work in the currently running step").
I did not use an ancestor path for the keys; all the entities are of the same kind.
The pipeline code looks like this:

import apache_beam as beam
from apache_beam.io import vcfio  # provides ReadFromVcf in Beam releases that ship the VCF source
# import path in older Beam releases; newer ones use ...datastore.v1new.datastoreio
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore

def write_to_datastore(project, user_options, pipeline_options):
    """Creates a pipeline that writes entities to Cloud Datastore."""
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | 'Read vcf files' >> vcfio.ReadFromVcf(user_options.input)
         | 'Create my entity' >> beam.ParDo(ToEntityFn(), user_options.kind)  # ToEntityFn: user DoFn, not shown
         | 'Write to datastore' >> WriteToDatastore(project))
Because I have millions of rows to write into Datastore, it would take too long at 30 entities/sec.
Question: the input is just one huge gzipped file. Do I need to split it into multiple smaller files to trigger multiple workers? Is there any other way to make the import faster? Am I missing something in the num_workers setup? Thanks!
I'm not familiar with Apache Beam; this answer is from a general flow perspective.
Assuming there are no dependencies between entity data in the various input file sections, then yes, working with multiple input files should definitely help, as all these files could then be processed virtually in parallel (depending, of course, on the maximum number of available workers).
You might not need to split the huge gzipped file beforehand; it might be possible to simply hand off segments of the single input data stream to separate segment workers for writing, if the overhead of the handoff itself is negligible compared to the actual segment processing.
The overall performance limit would then be the speed of reading the input data, splitting it into segments, and handing them off to the segment workers.
A segment worker would further split the segment it receives into chunks of up to 500 entities, the maximum that can be written to the Datastore in a single batch operation. Depending on the Datastore client library used, it may be possible to perform this write asynchronously, allowing the splitting into chunks and the conversion to entities to continue without waiting for previous Datastore writes to complete.
The performance limit at the segment worker would then be the speed at which the segment can be split into chunks and the chunks converted to entities.
If async operations aren't available, or for even higher throughput, yet another handoff of each chunk to a chunk worker could be performed, with the chunk worker doing the conversion to entities and the Datastore batch write.
The performance limit at the segment worker level would then be just the speed at which the segment can be split into chunks and handed over to the chunk workers.
With such approach the actual conversion to entities and batch writing them to the datastore (async or not) would no longer sit in the critical path of splitting the input data stream, which is, I believe, the performance limitation in your current approach.
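To illustrate the 500-entity batch limit mentioned above, a hedged sketch using the google-cloud-datastore client directly (outside Beam; the project id and the entity list are hypothetical):

from google.cloud import datastore

client = datastore.Client(project='my-project')  # hypothetical project id

def write_in_batches(entities, batch_size=500):
    """Write entities in batches of at most 500, the Datastore per-call limit."""
    for i in range(0, len(entities), batch_size):
        client.put_multi(entities[i:i + batch_size])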
I looked into the design of vcfio. I suspect (if I understand correctly) that the reason I always get one worker when the input is a single file is a limitation of _VcfSource combined with a constraint of the VCF format: the format has a header section that defines how to interpret the non-header lines, so each worker reading the source has to process an entire file. When I split the single file into 5 separate files that share the same header, I successfully get up to 5 workers (but no more, probably for the same reason).
One thing I don't understand is why, if the number of workers that read is limited to 5 (in this case), we are also limited to only 5 workers for writing. Anyway, I think I have found an alternative way to trigger multiple workers with the Beam Dataflow runner: use pre-split VCF files. There is also a related approach in the GCP variant transforms project, in which vcfio has been significantly extended; it seems to support multiple workers with a single input VCF file. I wish the changes in that project could be merged into the Beam project too.
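A minimal sketch of the pre-split workaround described above, assuming the header is the block of lines starting with '#' (this naive version loads the whole file into memory, so a multi-GB input would need a streaming variant; the output file names are made up):

def split_vcf(path, parts=5):
    """Split a VCF file into `parts` files that all share the original header."""
    with open(path) as f:
        lines = f.readlines()
    header = [l for l in lines if l.startswith('#')]
    records = [l for l in lines if not l.startswith('#')]
    per_part = -(-len(records) // parts)  # ceiling division
    for i in range(parts):
        with open('%s.part%d.vcf' % (path, i), 'w') as out:
            out.writelines(header + records[i * per_part:(i + 1) * per_part])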

SQLite database WAL file size keeps growing

I am writing continuously into a db file that has PRAGMA journal_mode=WAL and PRAGMA journal_size_limit=0. My C++ program has two threads: one reader (queries at 15-second intervals) and one writer (inserts at 5-second intervals).
Every 3 minutes I pause insertion to run sqlite3_wal_checkpoint_v2() from the writer thread with the mode parameter SQLITE_CHECKPOINT_RESTART. To ensure that no read operations are active at that point, I set a flag indicating that checkpointing is about to take place and wait for the reader to complete (the connection stays open) before running the checkpoint. After the checkpoint completes, I signal the reader that it may resume querying.
sqlite3_wal_checkpoint_v2() returns SQLITE_OK and reports pnLog and pnCkpt as equal (around 4000), indicating that the complete WAL file has been synced into the main db file, so according to the documentation the next write should start from the beginning of the WAL. However, this does not seem to happen: subsequent writes cause the WAL file to grow indefinitely, eventually to several GB.
I did some searching and found that readers can cause checkpoint failure due to open transactions. However, the only reader I'm using ends its transaction before the checkpoint starts. What else could be causing the WAL file to keep growing?
This is far too late as an answer, but may be useful to other people.
According to the SQLite documentation, your expectations should be correct, but if you read this SO post, problems also arise with non-finalized statements. So if you only sqlite3_reset() your statement, there is still a chance the db will look busy or locked to a checkpoint. Note that this can also happen with the more aggressive SQLITE_CHECKPOINT_* modes.
Also, SQLITE_CHECKPOINT_TRUNCATE, if the checkpoint succeeds, truncates the -wal file to zero length. That can help you verify that all pages have really been written into the db.
Another discussion in which -wal files grow larger and larger due to unfinalized statements is this.
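For illustration, the checkpoint-after-finalizing idea in a minimal Python sketch (the original program is C++; in the C API "finalize" means sqlite3_finalize(), for which closing cursors is the rough Python equivalent, and the file name is made up):

import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('PRAGMA journal_mode=WAL')

# ... writer inserts, reader queries; make sure every statement is
# finalized (cursors closed), not merely reset, before checkpointing ...

busy, log_pages, ckpt_pages = conn.execute(
    'PRAGMA wal_checkpoint(RESTART)').fetchone()
# busy == 1 means a reader or unfinalized statement blocked the checkpoint;
# log_pages == ckpt_pages means the whole WAL was transferred into the db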

Is there any size/row limit for a .txt file?

This question may look like a duplicate, but I am not finding the answer I am looking for.
The problem: on Unix, one of the 4GL binaries fetches data from a table using a cursor and writes it to a .txt file.
The table contains around 50 million records.
The binary runs for a long time and never completes, and the .txt file stays at 0 bytes.
I want to know the possible reasons why the records are not written to the .txt file.
Note: there is enough disk space available.
Also, for 30 million records I get the data in the .txt file as expected.
The information you provide is insufficient to tell for sure why the file is not written.
In UNIX, a text file is just like any other file: a collection of bytes. No specific limit (or structure) is enforced on "row size" or "row count", although obviously some programs may have limits on the maximum supported line size and such (depending on their implementation).
When a program starts writing data to a file (i.e. once its internal buffer is flushed for the first time), the file will no longer be zero-sized, so clearly your binary is doing something else all that time (unless it wipes the file as part of cleanup).
Try running your executable via strace to see the file I/O activity - that would give some clues as to what is going on.
Try closing the writer if you are using one to write to the file. This achieves the dual purpose of closing the resource and flushing the remaining contents of the buffer.
Output computed by the CPU needs to be flushed if you are using any kind of buffered writer. I have encountered such situations a few times, and in almost all cases the issue was flushing the output.
In Java specifically, the usual best practice for writing data involves buffers: when the buffer limit is reached its contents are written to the file, but whatever is still sitting in a partially filled buffer is not. That data is lost if the program exits without flushing the buffered writer.
So, in your case, if the processing time is reasonable and the output is still not in the file, it may mean the output has been computed and placed in RAM but never written to the file (i.e. to disk) because it was not flushed.
You can also consider the answers to this question.
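The answer above is about Java, but the same pitfall exists with any buffered writer; a minimal Python sketch (the file name and row helpers are hypothetical):

# Buffered output only reaches the disk on flush or close.
with open('export.txt', 'w') as f:        # hypothetical output file
    for row in fetch_rows():              # fetch_rows(): stand-in for the cursor loop
        f.write(format_row(row) + '\n')   # format_row(): stand-in formatter
# leaving the `with` block closes the file, flushing whatever is buffered;
# exiting without close/flush is exactly how data ends up computed but not on disk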

encrypting and/or decrypting large files (AES) on a memory and storage constrained system, with "catastrophe recovery"

I have a fairly generic question, so please pardon me if it is a bit vague.
So, let's assume a file of 1 GB that needs to be encrypted and later decrypted on a given system.
The problem is that the system has less than 512 MB of free memory and about 1.5 GB of storage space (give or take), so with the file "onboard" we have roughly 500 MB of "hard drive scratch space" and less than 512 MB of RAM to "play with".
The system is not unlikely to experience an "unscheduled power down" at any moment during encryption or decryption, and needs to be able to successfully resume the process after being powered up again (and this seems like an extra-unpleasant nut to crack).
The questions are:
1) is it at all doable :) ?
2) what would be the best strategy to go about
a) encrypting/decrypting with so little scratch space (can't have the entire file lying around while decrypting/encrypting, need to truncate it "on the fly" somehow...)
and
b) implementing a disaster recovery that would work in such a constrained environment?
P.S.:
The cipher used has to be AES.
I looked into AES-CTR specifically, but it does not seem to bode all that well for the disaster-recovery shenanigans in an environment where you can't keep the entire decrypted file around until the end...
[edited to add]
I think I'll be doing it the Iserni way after all.
It is doable, provided you have a means to save the AES status vector together with the file position.
1) Save the AES status and the file position P to files STAGE1 and STAGE2.
2) Read one chunk (say, 10 megabytes) of encrypted/decrypted data.
3) Write the decrypted/encrypted chunk to an external scratch file, SCRATCH.
4) Log the fact that SCRATCH is complete.
5) Write SCRATCH over the original file at the same position.
6) Log the fact that SCRATCH has been successfully copied.
Then go back to step 1.
If you get a hard crash after step 1, and STAGE1 and STAGE2 disagree, you just restart and take the stage file with the earliest P as the good one.
If you get a hard crash during or after step 2, you lose 10 megabytes' worth of work: but the AES state and P are good, so you just repeat step 2.
If you crash at step 3, then on recovery you won't find the marker from step 4, so you will know that SCRATCH is unreliable and must be regenerated. Having STAGE1/STAGE2, you are able to do so.
If you crash at step 4, you will BELIEVE that SCRATCH must be regenerated even though you could avoid it, but you lose nothing in regenerating except a little time.
By the same token, if you crash during step 5, or before step 6 is committed to disk, you just repeat steps 5 and 6. You know you don't have to regenerate SCRATCH because step 4 was committed to disk. If you crash after step 1, you will still have a good SCRATCH to copy.
All this assumes that 10 MB is more than a cache's worth of data (the OS cache, plus the hard disk's if it is write-back). If it is not, raise the chunk size to 32 or 64 MB; recovery will be proportionately slower.
It might help to flush() and sync(), if these functions are available, after every write step has completed.
Total write time is a bit more than twice normal because of the need to "write twice" in order to be sure.
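A minimal sketch of this loop under stated assumptions: Python with the cryptography package, AES in CTR mode so the saved "AES status" reduces to the file position (the counter is derived from it), and read_stage()/write_stage() as hypothetical helpers that persist the position to the stage files with an fsync:

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

CHUNK = 10 * 1024 * 1024  # 10 MB chunks, as in the answer

def ctr_for_offset(nonce8, offset):
    """Initial 16-byte counter block for a block-aligned byte offset."""
    return nonce8 + (offset // 16).to_bytes(8, 'big')

def encrypt_resumable(path, key, nonce8):
    # Recovery: take the earliest position if STAGE1/STAGE2 disagree.
    pos = min(read_stage('STAGE1'), read_stage('STAGE2'))
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        while pos < size:
            write_stage('STAGE1', pos)                       # step 1
            write_stage('STAGE2', pos)
            f.seek(pos)
            chunk = f.read(CHUNK)                            # step 2
            enc = Cipher(algorithms.AES(key),
                         modes.CTR(ctr_for_offset(nonce8, pos))).encryptor()
            out = enc.update(chunk) + enc.finalize()
            with open('SCRATCH', 'wb') as s:                 # step 3
                s.write(out)
                s.flush()
                os.fsync(s.fileno())
            write_stage('SCRATCH_DONE', pos)                 # step 4
            f.seek(pos)
            f.write(out)                                     # step 5
            f.flush()
            os.fsync(f.fileno())
            write_stage('COPY_DONE', pos)                    # step 6
            pos += len(chunk)                                # back to step 1

CTR is convenient here precisely because the cipher state at any block boundary is a pure function of the position, so nothing beyond P has to be checkpointed.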
You have to work with the large file in chunks. Break a piece of the file off, encrypt it, and save it to disk; once saved, discard the unencrypted piece. Repeat. To decrypt, grab an encrypted piece, decrypt it, store the unencrypted chunk. Discard the encrypted piece. Repeat. When done decrypting the pieces, concatenate them.
Surely this is doable.
The "largest" (not large at all however) problem is that when you encrypt say 128 Mb of original data, you need to remove them from the source file. To do this you need to copy the remainder of the file to the beginning and then truncate the file. This would take time. During this step power can be turned off, but you don't care much -- you know the size of data you've encrypted (if you encrypt data by blocks with size multiple to 16 bytes, the size of the encrypted data will be equal to size that was or has to be removed from the decrypted file). Unfortunately it seems to be easier to invent the scheme than to explain it :), but I really see no problem other than extra copy operations which will slowdown the process. And no, there's no generic way to strip the data from the beginning of the file without copying the remainder to the beginning.
