icCube incremental vs single load performance - iccube

I'm using flat file data sources with the incremental load functionality and am seeing different performance depending on how I do the load. I have 3 datasets {d1,d2,d3} with d1 and d2 being the same size and d3 being 3 times larger. I am doing the following test on a machine with 16GB memory:
Load d1 - time: 1m07s
incrementally load d2 - time: 2m53s
incrementally load d3 - runs out of memory
On the other hand, if I do a single load of d1+d2+d3, the total time is 5m29s and there are no memory issues.
Is this just a matter of memory overhead when doing incremental vs. single loads, or should I be managing the load differently to get better performance?

Incremental load was implemented to support real-time scenarios and does not follow the same logic as a normal load.
The additional data is pre-loaded into memory, which is why it takes more memory. During this pre-load the schema is still available; once the new data is fully pre-loaded and a first quality check is done, the schema is write-locked and the actual load is performed. This keeps the schema locked for only a few milliseconds.
Incremental load is therefore suited to real-time scenarios with a 'small' amount of additional data, not really to your scenario.
Aren't the slow times due to the fact that you're running out of memory (a lot of GCs)?
Hope that helps.
PS: If you need additional support, please contact support directly.

Related

Univocity - Parsing a fixedwidth flat file with one row - performance impact with 300 parallel threads

We have a project that deals with millions of transactions every day and has some tight SLAs. To parse the incoming flat file into a bean we used BeanIO, which worked well without load, but under load it takes around 250 ms to parse a flat file into a bean.
Requirement: a simple string has to be converted to a single bean (nested and converted).
We heard that univocity can do better here, so we tried it with the settings below.
FixedWidthParserSettings settings = new FixedWidthParserSettings();
settings.getFormat().setLineSeparator("\n");
settings.setRecordEndsOnNewline(false);
settings.setHeaderExtractionEnabled(false);
settings.setIgnoreLeadingWhitespaces(false);
settings.setIgnoreTrailingWhitespaces(false);
settings.setMaxColumns(100);
settings.setMaxCharsPerColumn(100);
settings.setNumberOfRecordsToRead(1);
settings.setReadInputOnSeparateThread(false);
settings.setInputBufferSize(10*1024);
settings.setLineSeparatorDetectionEnabled(false);
settings.setColumnReorderingEnabled(false);
When running with JMeter at 200 parallel threads, the average time taken is 10 ms (to parse and convert around 10 fields, whereas in the actual use case we have to do the same for around 500 fields),
but when we increased it to 300 or 350 parallel threads, the average time was around 300 ms. Our total SLA is around 10 ms.
Any help here is highly appreciated!
You are probably running out of memory on your JVM. Try increasing the heap with the -Xms and -Xmx flags. Also, too many threads won't help you if you don't have enough cores available.
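A minimal sketch of both suggestions, assuming the parser is built once per thread and reused for every record (the field widths, heap sizes, and class name are hypothetical, not taken from the original setup):

// Start the JVM with a larger, fixed-size heap, e.g. java -Xms2g -Xmx2g -jar bench.jar
import com.univocity.parsers.fixed.FixedWidthFields;
import com.univocity.parsers.fixed.FixedWidthParser;
import com.univocity.parsers.fixed.FixedWidthParserSettings;

public class ReusableFixedWidthParser {

    // One parser per thread: rebuilding FixedWidthParserSettings for every record is wasted work.
    private static final ThreadLocal<FixedWidthParser> PARSER = ThreadLocal.withInitial(() -> {
        FixedWidthFields fields = new FixedWidthFields(10, 20, 70); // hypothetical field widths
        FixedWidthParserSettings settings = new FixedWidthParserSettings(fields);
        settings.getFormat().setLineSeparator("\n");
        settings.setReadInputOnSeparateThread(false); // avoid spawning an extra thread per parse
        return new FixedWidthParser(settings);
    });

    // Parses a single record without recreating any parser state.
    public static String[] parse(String record) {
        return PARSER.get().parseLine(record);
    }
}

Beyond that, cap the number of worker threads at roughly the number of available cores; 300-350 threads on a typical 8-16 core machine mostly adds context switching and GC pressure rather than throughput.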

sparklyr sdf_collect and dplyr collect function on large tables in Spark takes ages to run?

I am running R Studio and R 3.5.2.
I have loaded around 250 parquet files using sparklyr::spark_read_parquet from S3a.
I need to collect the data from Spark (installed by sparklyr):
spark_install(version = "2.3.2", hadoop_version = "2.7")
But for some reason it takes ages to do the job. Sometimes the task is distributed to all CPUs and sometimes only one is working.
Please advise how you would solve the dplyr::collect or sparklyr::sdf_collect "takes ages" issue.
Please also understand that I can't provide the data; with a small amount of data it runs significantly faster.
That is expected behavior. dplyr::collect, sparklyr::sdf_collect and Spark's native collect all bring the data to the driver node.
Even if that is feasible (you need at least 2-3 times more memory than the actual size of the data, depending on the scenario), it is bound to take a long time, with the driver's network interface being the most obvious bottleneck.
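To put rough, hypothetical numbers on that: if the 250 parquet files expand to 100 GB of in-memory rows (parquet is compressed columnar storage, so the in-memory size is usually several times the on-disk size), the 2-3x rule above means the driver alone would need roughly 200-300 GB of RAM to hold the collected result, all of it pulled through a single network interface.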
In practice, if you're going to collect all the data anyway, it typically makes more sense to skip the network and platform overhead and load the data directly using native tools (given the description, that would mean downloading the data to the driver machine and converting it to an R-friendly format file by file).

How Redis RDB persistence actually works behind the scenes?

I was going through Redis RDB persistence and have some doubts regarding one of its disadvantages.
Understanding so far:
We should use RDB persistence when we need to save a snapshot of the dataset currently in memory at some regular interval.
I can understand that this way we can lose some data in case of a server breakdown. But another disadvantage I can't understand is how fork can be time consuming when persisting a large dataset using RDB.
Quoting from the documentation:
RDB needs to fork() often in order to persist on disk using a child
process. Fork() can be time consuming if the dataset is big, and may
result in Redis to stop serving clients for some millisecond or even
for one second if the dataset is very big and the CPU performance not
great. AOF also needs to fork() but you can tune how often you want to
rewrite your logs without any trade-off on durability.
As far as I know, when a parent process forks it creates a new child process, and we can choose which code the child executes based on its PID, or we can give it a new executable to run using the exec() system call.
But what I don't understand is how this becomes a heavy task when the dataset is large.
I think I know the answer, but I'm not sure about it.
Quoting from this link: https://www.bottomupcs.com/fork_and_exec.xhtml
When a process calls fork then
the operating system will create a new process that is exactly the same as the parent process. This means all the state that was talked about previously is copied, including open files, register state and all memory allocations, which includes the program code.
As per the above statement, the whole Redis dataset will be copied to the child.
Am I understanding this right?
When a standard fork is called with copy-on-write, the OS must still copy all the page table entries, which can take time if you have small 4k pages and a huge dataset; this is what makes the actual fork() call slow.
You can also find that a lot of time and memory is required if your dataset is changing a lot in a sparse way, as copy-on-write semantics trigger the actual memory pages to be copied when the original is modified. Redis also performs incremental rehashing, maintains key expiry, etc., so a more active instance will typically take longer to save to disk.
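As a rough, back-of-the-envelope illustration of the page-table cost (the 48 GB dataset size is hypothetical):
48 GB / 4 KB per page ≈ 12.6 million page-table entries
12.6 million entries x 8 bytes ≈ 100 MB of page tables
The data pages themselves stay shared copy-on-write, but that ~100 MB of bookkeeping has to be duplicated before the child can run, which is where the fork() latency comes from.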
More reading:
Faster forking of large processes on Linux?
http://kirkwylie.blogspot.co.uk/2008/11/linux-fork-performance-redux-large.html

Virtuoso 7 crashes while bulk loading

I am trying to create a local SPARQL endpoint for Freebase for running some local experiments. While using Virtuoso 7, I regularly see the server getting killed by the OOM killer. I have followed all the required steps as mentioned here, and I have also made the required changes to my virtuoso.ini file as mentioned in RDF Performance Tuning.
My system configuration is:
8 CPUs, 2.9 GHz
16 GB RAM
I have enough hard disk too.
Regarding data dumps, I have split the freebase data dump (23GB gzipped, approx 250 GB uncompressed) into 10 smaller gzipped files containing 200,000,000 triples each.
Following are the changes I made to virtuoso.ini
NumberOfBuffers = 1360000
MaxDirtyBuffers = 1000000
MaxCheckpointRemap = 340000 # (1/4th of NumberOfBuffers)
Along with this I have set vm.swappiness = 10 as mentioned in 2.
Am I missing something obvious?
P.S.:
I did try virtuoso-opensource-6.1 too. But it appeared to be too slow.
One interesting observation was that during the bulk loading process, virtuoso-6.1's memory consumption rose very slowly, but that might be because the indexing itself was slow.
Another observation was that virtuoso-6.1 occupies almost negligible memory at start time (on the order of 500 MB), whereas virtuoso-7 starts with approx. 6500 MB and grows quickly.
Any help in this regard would be highly appreciated.
The number of buffers you are using is a little bit too high. Do not forget that some memory is also consumed by the OS and other processes.
Which exact version do you use? (development or stable branch?)
Do you use disk striping?
I loaded Freebase into Virtuoso 7 too, but I used smaller files: circa 260 gzipped files, 10 million triples each (circa 100 MB per file). A commit is executed after every file load.
Maybe it would be easier for you to use images with Virtuoso preloaded with Freebase.
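For example, since each Virtuoso buffer caches one 8 KB page, a more conservative configuration on a 16 GB box that leaves headroom for the OS and the bulk loader might look like the following (the exact values are an assumption to tune, not an official recommendation):
NumberOfBuffers = 1000000 # ~8 GB of 8 KB buffers instead of ~10.9 GB
MaxDirtyBuffers = 750000 # roughly 3/4 of NumberOfBuffers
MaxCheckpointRemap = 250000 # roughly 1/4 of NumberOfBuffers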

Hadoop - job submission time on large data

Has anyone faced problems submitting a job on large data? The data is around 5-10 TB uncompressed, spread over approximately 500K files. When we try to submit a simple Java MapReduce job, it mostly spends more than an hour in the getSplits() function call and takes multiple hours to appear in the job tracker. Is there any possible solution to this problem?
With 500K files, you are spending a lot of time walking the directory tree to find all these files, which then need to be assigned to a list of InputSplits (the result of getSplits).
As Thomas points out in his answer, if the machine performing the job submission has a low amount of memory assigned to the JVM, then you're going to see the JVM doing a lot of garbage collection trying to find the memory required to build up the splits for these 500K files.
To make matters worse, if these 500K files are splittable and larger than a single block size, you'll get even more input splits to process the files (for a file of, say, 1 GB with a block size of 256 MB, you'll by default get 4 map tasks to process that file, assuming the input format and file compression support splitting it). If this is applicable to your job (look at the number of map tasks spawned for your job: are there more than 500K?), then you can force fewer mappers to be created by amending the mapred.min.split.size configuration property to a size larger than the current block size (setting it to 1 GB for the previous example means you'll get a single mapper to process the file, rather than 4). This will also help the performance of getSplits, as the resulting list of splits will be smaller and require less memory.
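A minimal sketch of that change using the old mapred API referenced above (the class name, job name, and paths are hypothetical, and the mapper/reducer configuration is omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LargeInputJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LargeInputJob.class);
        conf.setJobName("large-input-job"); // hypothetical job name
        // Ask for splits of at least 1 GB so the 1 GB / 256 MB example above yields 1 map task instead of 4.
        // (On newer Hadoop versions the property is mapreduce.input.fileinputformat.split.minsize.)
        conf.setLong("mapred.min.split.size", 1024L * 1024 * 1024);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // input directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output directory
        JobClient.runJob(conf); // mapper/reducer classes omitted for brevity
    }
}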
The second symptom of your problem is the time it takes to serialize the input splits to a file (client side) and then deserialize them at the job tracker end. 500K+ splits are going to take time, and the job tracker will have similar GC issues if it has a low JVM memory limit.
It largely depends on how "strong" your submission server (or laptop client) is; maybe you need to upgrade RAM and CPU to make the getSplits call faster.
I believe you ran into swap issues there, and the computation therefore takes multiple times longer than usual.
