Hadoop - job submission time on large data - dictionary

Has anyone run into problems submitting a job on large data? The data is around 5-10 TB uncompressed, in approximately 500K files. When we try to submit a simple Java MapReduce job, it mostly spends more than an hour in the getSplits() function call, and it takes multiple hours to appear in the job tracker. Is there any possible solution to this problem?

With 500K files, you are spending a lot of time tree walking to find all these files, which then need to be assigned to a list of InputSplits (the result of getSplits).
As Thomas points out in his answer, if your machine performing the job submission has a low amount of memory assigned to the JVM, then you're going to see issues with the JVM performing garbage collection to try and find the memory required to build up the splits for these 500K files.
To make matters worse, if these 500K files are splittable and larger than a single block size, then you'll get even more input splits to process the files (for a file of, say, 1GB with a block size of 256MB, you'll by default get 4 map tasks to process this file, assuming the input format and file compression support splitting the file). If this is applicable to your job (look at the number of map tasks spawned for your job: are there more than 500K?), then you can force fewer mappers to be created by amending the mapred.min.split.size configuration property to a size larger than the current block size (setting it to 1GB for the previous example means you'll get a single mapper to process the file, rather than 4). This will help the performance of the getSplits method, as the resulting list of splits will be smaller and require less memory.
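To make the 1GB / 256MB arithmetic concrete, here is a rough Python model of how the split count changes with the minimum split size. It is only a sketch: it assumes the split size is max(min size, min(max size, block size)) and that a remainder of up to 10% of a split is folded into the last split. The function name and the constants are made up for illustration and are not Hadoop API.

```python
MB = 1024 * 1024
GB = 1024 * MB
SPLIT_SLOP = 1.1  # assumed tolerance: the last split may be up to 10% oversized


def count_splits(file_size, block_size, min_split_size=1, max_split_size=2**63 - 1):
    """Estimate the number of input splits for one splittable file (toy model)."""
    split_size = max(min_split_size, min(max_split_size, block_size))
    splits = 0
    remaining = file_size
    # Carve off full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


default_splits = count_splits(1 * GB, 256 * MB)
raised_min_splits = count_splits(1 * GB, 256 * MB, min_split_size=1 * GB)
```

Under these assumptions default_splits is 4 and raised_min_splits is 1, which matches the example above. Multiply by 500K files and the size of the list that getSplits has to build changes accordingly.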
The second symptom of your problem is the time it takes to serialize the input splits to a file (client side), and then the deserialization time at the job tracker end. 500K+ splits are going to take time, and the jobtracker will have similar GC issues if it has a low JVM memory limit.

It largely depends on how "strong" your submission server is (or your laptop client); maybe you need to upgrade RAM and CPU to make the getSplits call faster.
I believe you ran into swap issues there, and the computation therefore takes many times longer than usual.

Related

Univocity - Parsing a fixedwidth flat file with one row - performance impact with 300 parallel threads

We have a project that deals with millions of transactions every day and has some tight SLAs. To parse the flat file that comes in as input into a bean, we used beanio, which worked well without load. But under load it's taking around 250ms to parse a flat file to a bean.
Requirement: A simple string has to be converted to a single bean (nested and converted).
We heard that univocity can do better here, and tried it with the settings below.
FixedWidthParserSettings settings = new FixedWidthParserSettings();
settings.getFormat().setLineSeparator("\n");
settings.setRecordEndsOnNewline(false);
settings.setHeaderExtractionEnabled(false);
settings.setIgnoreLeadingWhitespaces(false);
settings.setIgnoreTrailingWhitespaces(false);
settings.setMaxColumns(100);
settings.setMaxCharsPerColumn(100);
settings.setNumberOfRecordsToRead(1);
settings.setReadInputOnSeparateThread(false);
settings.setInputBufferSize(10*1024);
settings.setLineSeparatorDetectionEnabled(false);
settings.setColumnReorderingEnabled(false);
When running with JMeter with 200 parallel threads, the average time taken is 10ms (to parse and convert around 10 fields, whereas in the actual use case we have to do the same for around 500 fields),
but when we increased it to 300 or 350 parallel threads, the average time was around 300ms. Our total SLA, however, is around 10ms.
Any help here is highly appreciated!
You are probably running out of memory on your JVM. Try increasing the heap size with the -Xms and -Xmx flags. Also, too many threads won't help you if you don't have enough cores available.

How to speedup bulk importing into google cloud datastore with multiple workers?

I have an apache-beam based dataflow job that reads from a single text file (stored in google cloud storage) using the vcf source, transforms text lines into datastore Entities and writes them into the datastore sink. The workflow works fine, but the drawbacks I noticed are:
The write speed into datastore is at most around 25-30 entities per second.
I tried to use --autoscalingAlgorithm=THROUGHPUT_BASED --numWorkers=10 --maxNumWorkers=100 but the execution seems to prefer one worker (see graph below: the target workers once increased to 2 but reduced to 1 "based on the ability to parallelize the work in the currently running step").
I did not use ancestor path for the keys; all the entities are the same kind.
The pipeline code looks like this:
def write_to_datastore(project, user_options, pipeline_options):
    """Creates a pipeline that writes entities to Cloud Datastore."""
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | 'Read vcf files' >> vcfio.ReadFromVcf(user_options.input)
         | 'Create my entity' >> beam.ParDo(
             ToEntityFn(), user_options.kind)
         | 'Write to datastore' >> WriteToDatastore(project))
Because I have millions of rows to write into the datastore, it would take too long to write with a speed of 30 entities/sec.
Question: The input is just one huge gzipped file. Do I need to split it into multiple small files to trigger multiple workers? Is there any other way I can make the import faster? Am I missing something in the num_workers setup? Thanks!
I'm not familiar with Apache Beam; this answer is from a general flow perspective.
Assuming there are no dependencies to be considered between entity data in various input file sections then yes, working with multiple input files should definitely help as all these files could then be processed virtually in parallel (depending, of course, on the max number of available workers).
You might not need to split the huge zipfile beforehand; it might be possible to simply hand off segments of the single input data stream to separate data segment workers for writing, if the overhead of such a handoff is itself negligible compared to the actual data segment processing.
The overall performance limitation would be the speed of reading the input data, splitting it in segments and handoff to the segment data workers.
A data segment worker would further split the data segment it receives into smaller chunks, each holding up to the equivalent of the maximum of 500 entities that can be converted to entities and written to the datastore in a single batch operation. Depending on the datastore client library used, it may be possible to perform this operation asynchronously, allowing the split into chunks and conversion to entities to continue without waiting for the previous datastore writes to complete.
The performance limitation at the data segment worker would then be the speed at which the data segment can be split into chunks and the chunks converted to entities.
If async ops aren't available, or for even higher throughput, yet another handoff of each chunk to a chunk worker could be performed, with the chunk worker performing the conversion to entities and the datastore batch write.
The performance limitation at the data segment worker level would then be just the speed at which the data segment can be split into chunks and handed over to the chunk workers.
With such approach the actual conversion to entities and batch writing them to the datastore (async or not) would no longer sit in the critical path of splitting the input data stream, which is, I believe, the performance limitation in your current approach.
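The chunking step described above (at most 500 entities per batch write) could be sketched roughly like this in plain Python. Note that write_batch is a hypothetical callable standing in for whatever batch-write call your datastore client library provides; nothing here is a real Datastore or Beam API.

```python
from itertools import islice

MAX_BATCH = 500  # per-batch entity limit mentioned in the answer


def batches(items, size=MAX_BATCH):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_all(entities, write_batch):
    """Pass entities to `write_batch` (hypothetical) one chunk at a time."""
    written = 0
    for chunk in batches(entities):
        write_batch(chunk)  # could be handed off to a chunk worker instead
        written += len(chunk)
    return written
```

Because batches() works on any iterable, the data segment never has to be fully materialized before the first write goes out.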
I looked into the design of vcfio. I suspect (if I understand correctly) that the reason I always get one worker when the input is a single file is due to the limits of _VcfSource and the constraints of the VCF format. This format has a header part that defines how to translate the non-header lines. As a result, each worker that reads the source file has to work on an entire file. When I split the single file into 5 separate files that share the same header, I successfully get up to 5 workers (but no more, probably for the same reason).
One thing I don't understand: I can see why the number of workers that read is limited to 5 (in this case), but why are we limited to only 5 workers for writing? Anyway, I think I have found an alternative way to trigger multiple workers with the beam Dataflow-Runner (use pre-split VCF files). There is also a related approach in the gcp variant transforms project, in which vcfio has been significantly extended. It seems to support multiple workers with a single input vcf file. I wish the changes in that project could be merged into the beam project too.

How does Redis RDB persistence actually work behind the scenes?

I was going through Redis RDB persistence, and I have some doubts about one of its disadvantages.
My understanding so far:
We should use RDB persistence when we need to save a snapshot of the dataset currently in memory at some regular interval.
I can understand that this way we can lose some data if the server breaks down. But another disadvantage that I can't understand is how fork can be time consuming when persisting a large dataset using RDB.
Quoting from Documentation
RDB needs to fork() often in order to persist on disk using a child
process. Fork() can be time consuming if the dataset is big, and may
result in Redis to stop serving clients for some millisecond or even
for one second if the dataset is very big and the CPU performance not
great. AOF also needs to fork() but you can tune how often you want to
rewrite your logs without any trade-off on durability.
As far as I know, fork works like this: when a parent process forks, it creates a new child process, and we can have the child execute some code based on its pid, or we can give it a new executable to run using the exec() system call.
But what I don't understand is why this becomes a heavy task when the dataset is larger.
I think I know the answer, but I'm not sure about it.
Quoted from this link https://www.bottomupcs.com/fork_and_exec.xhtml
When a process calls fork then
the operating system will create a new process that is exactly the same as the parent process. This means all the state that was talked about previously is copied, including open files, register state and all memory allocations, which includes the program code.
According to the statement above, the whole Redis dataset will be copied to the child.
Am I understanding this right?
When a standard fork is called with copy-on-write, the OS must still copy all the page table entries, which can take time if you have small 4k pages and a huge dataset; this is what makes the actual fork() call slow.
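A back-of-the-envelope calculation shows why the page size matters. This sketch assumes 8 bytes per page table entry (typical for x86-64, but an assumption here) and counts only the lowest-level entries; the 24 GiB dataset size is a made-up example.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
PTE_SIZE = 8  # assumed bytes per page table entry


def page_table_cost(dataset_bytes, page_size, pte_size=PTE_SIZE):
    """Return (number of pages, bytes of page table entries to copy on fork)."""
    pages = -(-dataset_bytes // page_size)  # ceiling division
    return pages, pages * pte_size


small_pages = page_table_cost(24 * GIB, 4 * KIB)  # 4k pages
huge_pages = page_table_cost(24 * GIB, 2 * MIB)   # 2 MiB pages, for comparison
```

With 4k pages the hypothetical 24 GiB dataset maps to 6,291,456 pages, i.e. 48 MiB of entries that fork() has to copy even though no data pages are copied yet. With 2 MiB pages the same dataset needs only 12,288 entries (96 KiB).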
You can also find a lot of time and memory is required if your dataset is changing a lot in a sparse way, as copy-on-write semantics triggers the actual memory pages to be copied as changes are made to the original. Redis also performs incremental rehashing and maintains expiry etc. so an instance that is more active will typically take longer to save to disk.
More reading:
Faster forking of large processes on Linux?
http://kirkwylie.blogspot.co.uk/2008/11/linux-fork-performance-redux-large.html

How to Implement embarrassingly parallel task (FOR loop) WITHOUT MPI-IO?

Preamble:
I have a very large array (one-dimensional) and need to solve an evolution equation (a wave-like equation). I need to calculate an integral at each value of this array, store the resulting array of integrals, apply integration again to this array, and so on (in simple words, I apply an integral on a grid of values, store this new grid, apply integration again, and so on).
I used MPI-IO to spread over all nodes: there is a shared .dat file on my disc, each MPI copy reads this file (as a source for integration), performs integration and writes again to this shared file. This procedure repeats again and again. It works fine. The most time consuming part was the integration and file reading-writing was negligible.
Current problem:
Now I have moved to a 1024-CPU (16x64) HPC cluster and I'm facing the opposite problem: the calculation time is NEGLIGIBLE compared to the read-write process!!!
I tried to reduce the number of MPI processes: I use only 16 MPI processes (to spread over the nodes) + 64 threads with OpenMP to parallelize my computation inside each node.
Again, reading and writing is the most time consuming part now.
Question
How should I modify my program, in order to utilize the full power of 1024 CPUs with minimal loss?
The important point, is that I cannot move to the next step without completing the entire 1D array.
My thoughts:
Instead of reading-writing, I can ask my rank=0 (master rank) to send-receive the entire array to all nodes (MPI_Bcast). So, instead of every node doing I/O, only one node will do it.
Thanks in advance!!!
I would look here and here. FORTRAN code for the second site is here and C code is here.
The idea is that you don't give the entire array to each processor. You give each processor only the piece it works on, with some overlap between processors so they can handle their mutual boundaries.
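To illustrate the "piece plus overlap" idea, here is a small pure-Python sketch that only computes which index ranges each rank would own and which it would need to read. It deliberately contains no MPI calls; the function name and the halo parameter are invented for this example.

```python
def partition_with_halo(n, nranks, halo=1):
    """Split indices 0..n-1 across ranks.

    Returns one (own_start, own_end, read_start, read_end) tuple per rank,
    with all end indices exclusive. The read range extends the owned range
    by `halo` cells on each side, clipped to the array bounds.
    """
    base, extra = divmod(n, nranks)
    parts = []
    start = 0
    for rank in range(nranks):
        size = base + (1 if rank < extra else 0)  # spread the remainder
        end = start + size
        parts.append((start, end, max(0, start - halo), min(n, end + halo)))
        start = end
    return parts


layout = partition_with_halo(10, 3, halo=1)
```

For 10 cells on 3 ranks with a halo of 1, rank 1 owns indices 4..6 but reads 3..7. After each step, neighbouring ranks would only need to exchange those few boundary cells rather than the whole array.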
Also, you are right to save your computation to disk every so often. And I like MPI-IO for that. I think it is the way to go. But the codes in the links will allow you to run without reading every time. And, for my money, writing out the data every single time is overkill.

Asynchronous programs showing locality of reference?

I was reading this excellent article which gives an introduction to Asynchronous programming here http://krondo.com/blog/?p=1209 and I came across the following line which I find hard to understand.
Since there is no actual parallelism(in async), it appears from our diagrams that an asynchronous program will take just as long to execute as a synchronous one, perhaps longer as the asynchronous program might exhibit poorer locality of reference.
Could someone explain how locality of reference comes into the picture here?
Locality of reference, like that Wikipedia article mentions, is the observation that when some data is accessed (on disk, in memory, whatever), other data near that location is often accessed as well. This observation makes sense since developers tend to group similar data together. Since the data are related, they're often processed together. Specifically, this is known as spatial locality.
For a weak example, imagine computing the sum of an array or doing a matrix multiplication. The data representing the array or matrix are typically stored in contiguous memory locations, and for this example, once you access one specific location in memory, you will be accessing others close to it as well.
Computer architecture takes locality of reference into account. Operating systems have the notion of "pages" which are (roughly) 4KB chunks of data that can be paged in and out individually (moved between physical memory and disk). When you touch some memory that's not resident (not physically in RAM), the OS will bring the entire page of data off disk and into memory. The reason for this is locality: you're likely to touch other data around what you just touched.
Additionally, CPUs have the concept of caches. For example, a CPU might have an L1 (level 1) cache, which is really just a big block of on-CPU data that the CPU can access faster than RAM. If a value is in the L1 cache, the CPU will use that instead of going out to RAM. Following the principle of the locality of reference, when a CPU accesses some value in main memory, it will bring that value and all values near it into the L1 cache. This set of values is known as a cache line. Cache lines vary in size, but the point is that when you access the first value of an array, the CPU might have to get it from RAM, but subsequent accesses (close in proximity) will be faster since the CPU brought the whole bundle of values into the L1 cache on the first access.
So, to answer your question: if you imagine a synchronous process computing the sum of a very large array, it will touch memory locations in order one after the other. In this case, your locality is good. In the asynchronous case, however, you might have n threads each taking a slice of the array (of size 1/n) and computing the sub-sum. Each thread is touching a potentially very different location in memory (since the array is large) and since each thread can be switched in and out of execution, the actual pattern of data access from the point of view of the OS or CPU is poor. The L1 cache on a CPU is finite, so if Thread 1 brings in a cache line (due to an access), this might evict the cache line of Thread 2. Then, when Thread 2 goes to access its array value, it has to go to RAM, which will bring in its cache line again and potentially evict the cache line of Thread 1, and so on. Depending on the system resources and usage as a whole, this pattern could happen on the OS/page level as well.
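The eviction ping-pong described above can be imitated with a toy cache model. This is only a simulation under loud assumptions: a fully associative LRU cache, 8 array elements per cache line, and a switch between workers after every single access. Real caches and schedulers behave differently, so treat the numbers as an illustration only.

```python
from collections import OrderedDict

LINE = 8  # assumed number of array elements per simulated cache line


def count_misses(accesses, capacity_lines):
    """Count misses in a toy fully associative LRU cache."""
    cache = OrderedDict()
    misses = 0
    for index in accesses:
        line = index // LINE
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict least recently used line
    return misses


def interleaved(n, workers):
    """Round-robin over equal slices: one element per worker per turn."""
    slice_len = n // workers
    for offset in range(slice_len):
        for w in range(workers):
            yield w * slice_len + offset


sequential_misses = count_misses(range(1024), capacity_lines=2)
interleaved_misses = count_misses(interleaved(1024, 4), capacity_lines=2)
```

In this model the sequential scan misses once per cache line (128 misses for 1024 elements), while four interleaved workers sharing a two-line cache miss on every one of the 1024 accesses, because each worker's line is evicted before that worker gets its next turn.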
The poorer locality of reference results in poorer cache usage -- each time you do a thread switch, you can expect that most of what's in the cache relates to that previous thread, not the current one, so most reads will get data from main memory instead of the cache.
He's ultimately wrong though, at least for quite a few programs. The reason is pretty simple: even though you gain nothing on CPU-bound code, when you can combine some CPU-bound code with some I/O bound code, you can expect an overall speed improvement. You can, for example, initiate a read or write, then switch to doing computation while the disk is busy, then switch back to the I/O bound thread when the disk finishes its work.
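As a rough illustration of starting I/O and computing while it is in flight, here is a small asyncio sketch. The disk read is faked with asyncio.sleep, and fake_read and overlap_demo are made-up names, so this shows only the ordering of events, not real disk behaviour.

```python
import asyncio


async def fake_read(log, delay=0.05):
    """Stand-in for a slow disk read (simulated with a sleep)."""
    log.append("read started")
    await asyncio.sleep(delay)
    log.append("read finished")
    return b"data"


async def overlap_demo():
    log = []
    task = asyncio.create_task(fake_read(log))  # initiate the "I/O"
    await asyncio.sleep(0)                      # yield once so the read starts
    total = sum(range(1000))                    # CPU work while the "disk" is busy
    log.append("compute finished")
    data = await task                           # switch back once the I/O is done
    return log, total, data


log, total, data = asyncio.run(overlap_demo())
```

The log shows the computation finishing between the start and the end of the read, which is the overlap the answer is talking about.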
