Garbage collection in an R process running inside Docker - r

I have a process where we currently handle a large amount of data per day, perform a map-reduce style set of functions, and use only the output of those functions. We currently run a code sequence that looks like the one below:
lapply(start_times, function(start_time) {
  <get_data>           # fetch the data for this start_time (one day)
  <setofoperations>    # series of functions that transform the data and return dataframes
})
So currently we loop through start times, which lets us fetch the data for a particular day, analyse it, and output dataframes of results per day. The set of operations is a series of functions that successively work on the data and return dataframes.
While running this in a Docker container with a memory limit, we often see the process run out of memory when it is dealing with large data (around 250-500 MB) over a period of days, and R isn't able to garbage-collect effectively.
I'm trying to monitor the process using cAdvisor and can see the memory spikes, but I'm not really able to understand them any better.
If R does a lazy GC, ideally the process should be able to reuse the memory over and over. Is there something that is not being captured by the GC process?
How can an R process reclaim more memory when it's the primary process running in the Docker container?

Related

Why does the memory in my PC decrease when it almost reaches the limit?

I am running a piece of code in R. It's parallelized, running on 8 cores. Interestingly enough, when my memory usage reaches 15-and-something GB, it drops to 10 GB (my maximum memory is 16 GB). I am curious what is actually happening in the background. In the end, I get the complete data from all 8 cores, so I assume that data doesn't get lost. Does the PC store it somewhere on the SSD to free memory?
For more information: I loop over time series data and perform a lot of calculations, which I store in multiple vectors. When the code finishes looping, it stores all the previous vectors in a list.
While the code is running, if I start opening many Chrome tabs, which require a lot of memory, my code may take longer to run but still retrieves all the data (though it sometimes crashes).
I'm very curious what is happening here.
It's impossible to say without the specific code, but most likely it's due to R's garbage collection running only when necessary, i.e. when more memory needs to be allocated. Unlike other languages such as Python, R does not immediately garbage-collect objects when they go out of scope, and in particular, if the R objects have an underlying pointer to a C/C++ object, garbage collection can be held off until long after the object becomes unreachable.
If this variable memory usage is a problem, you can try adding explicit calls to gc() at key points in your code.
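To make that contrast concrete, here is a small, purely illustrative Python snippet (not from the original question; it assumes CPython, whose reference counting frees most objects the moment the last reference disappears, and BigBuffer is just a hypothetical stand-in for a large object). The final gc.collect() call is the rough Python analogue of calling gc() explicitly in R.

import gc
import weakref

class BigBuffer:
    """Hypothetical stand-in for a large object such as a data frame."""
    pass

obj = BigBuffer()
weakref.finalize(obj, lambda: print("freed as soon as the last reference went away"))

del obj             # on CPython the refcount hits zero here and the object is freed immediately
print("after del")  # the finalizer message above is printed before this line

gc.collect()        # explicit collection, the rough analogue of R's gc()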
Yes, you are right: the PC sometimes uses the hard disk as memory. It is known as swap memory. When your RAM gets overloaded, it sends some of the data to the hard disk and stores it there temporarily.

InfluxDB 2.0 has slow write performance with Python client compared to MariaDB

I am new to InfluxDB and I am trying to compare the performance of MariaDB and InfluxDB 2.0. To do so, I run a benchmark of about 350,000 rows which are stored in a txt file (30 MB).
I use executemany to write multiple rows into the database when using MariaDB, which took about 20 seconds for all rows (using Python).
So I tried the same with InfluxDB using the Python client; attached are the major steps of how I do it.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import WriteOptions

#Configuring the write api (the client itself is created beforehand with InfluxDBClient(...))
write_api = client.write_api(write_options=WriteOptions(batch_size=10_000, flush_interval=5_000))
#Creating the Point (having 7 fields in total)
p = Point("Test").field("column_1", value_1).field("column_2", value_2)
#Appending the point to create a list
data.append(p)
#Then writing the data as a whole into the database. I do this after collecting 200,000 points
#(this had the best performance), then I clear the variable "data" to start again
write_api.write("bucket", "org", data)
When executing this, it takes about 40 seconds, which is double the time of MariaDB.
I have been stuck on this problem for quite some time now, because the documentation suggests writing in batches, which I do, and in theory it should be faster than MariaDB.
But I am probably missing something.
Thanks in advance!
It takes some time to shovel 20MB of anything onto the disk.
executemany probably does batching. (I don't know the details.)
It sounds like InfluxDB does not do as good a job.
To shovel lots of data into a table:
Given a CSV file, LOAD DATA INFILE is the fastest. But if you have to first create that file, it may not win the race.
"Batched" INSERTs are very fast: INSERT ... VALUE (1,11), (2, 22), ... For 100 rows, that runs about 10 times as fast as single-row INSERTs. Beyond 100 or so rows, it gets into "diminishing returns".
Combining separate INSERTs into a "transaction" avoids transactional overhead. (Again there is "diminishing returns".)
There are a hundred packages between the user and the database; InfluxDB's client is yet another one. I don't know the details.
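As a rough sketch of the batched-INSERT-plus-single-transaction advice above (using MariaDB Connector/Python; the connection details, table name and column names are placeholders, not taken from the question):

import mariadb

rows = [(i, float(i) * 1.5) for i in range(350_000)]   # stand-in for the rows parsed from the txt file

conn = mariadb.connect(user="user", password="secret",
                       host="localhost", database="benchmark")
conn.autocommit = False                  # group the inserts into one transaction
cur = conn.cursor()

BATCH = 10_000
for start in range(0, len(rows), BATCH):
    # executemany sends each slice as a bulk insert rather than one round trip per row
    cur.executemany("INSERT INTO test_table (column_1, column_2) VALUES (?, ?)",
                    rows[start:start + BATCH])

conn.commit()                            # a single commit avoids per-row transactional overhead
conn.close()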

What is task parallelism in the context of Airflow?

Task parallelism in general is when multiple tasks run on the same or different sets of data. But what does it mean in the context of Airflow, when I change the parallelism parameter in the airflow.cfg file?
For instance, say I want to run a data processor on a batch of data. Will setting parallelism to 32 split the data into 32 sub-batches and run the same task on those sub-batches?
Or maybe, if I somehow have 32 batches of data originally instead of 1, am I able to run the data processor on all 32 batches (i.e. 32 tasks run at the same time)?
The setting won't "split the data" within your DAG.
From the docs:
parallelism: This variable controls the number of task instances that run simultaneously across the whole Airflow cluster.
If you want parallel execution of a task, you will need to break it up further, meaning create more tasks, where each task does less work. That can come in handy for some ETLs.
For example:
Let's say you want to copy yesterday's records from MySQL to S3.
You could do it with a single MySQLToS3Operator that reads yesterday's data in a single query. However, you could also break it into 2 MySQLToS3Operators each reading 12 hours of data, or 24 operators each reading one hour of data. That is up to you and the limitations of the services you are working with.
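A minimal sketch of the 24-way hourly split described above, so that Airflow's parallelism setting (together with the DAG- and pool-level limits) decides how many of the copies run at once. The import path and parameter names of MySQLToS3Operator vary between provider versions, and the table, bucket and connection ids below are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.mysql_to_s3 import MySQLToS3Operator

with DAG(
    dag_id="mysql_to_s3_hourly_split",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for hour in range(24):
        # 24 independent tasks with no dependencies between them, so the
        # scheduler may run them concurrently up to the parallelism limit
        MySQLToS3Operator(
            task_id=f"copy_hour_{hour:02d}",
            query=(
                "SELECT * FROM events "
                f"WHERE created_at >= '{{{{ ds }}}} {hour:02d}:00:00' "
                f"AND created_at < '{{{{ ds }}}} {hour:02d}:00:00' + INTERVAL 1 HOUR"
            ),
            s3_bucket="my-bucket",
            s3_key=f"events/{{{{ ds }}}}/hour={hour:02d}.csv",
            mysql_conn_id="mysql_default",
            aws_conn_id="aws_default",
        )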

How to implement an embarrassingly parallel task (FOR loop) without MPI-IO?

Preamble:
I have a very large (one-dimensional) array and need to solve an evolution equation (a wave-like equation). I need to calculate an integral at each value of this array, store the resulting array of integrals, and apply the integration again to this array, and so on (in simple words, I apply the integral on a grid of values, store this new grid, apply the integration again, and so on).
I used MPI-IO to spread the work over all nodes: there is a shared .dat file on disk, each MPI copy reads this file (as a source for integration), performs the integration, and writes back to this shared file. This procedure repeats again and again. It worked fine: the most time-consuming part was the integration, and the file reading/writing was negligible.
Current problem:
Now I have moved to a 1024-CPU (16 nodes x 64 cores) HPC cluster and I'm facing the opposite problem: the calculation time is now negligible compared to the read-write time!
I tried to reduce the number of MPI processes: I use only 16 MPI processes (one per node, to spread over the nodes) plus 64 OpenMP threads to parallelize the computation inside each node.
Again, reading and writing is now the most time-consuming part.
Question
How should I modify my program in order to utilize the full power of 1024 CPUs with minimal loss?
The important point is that I cannot move to the next step without completing the entire 1D array.
My thoughts:
Instead of reading and writing files, I can ask rank 0 (the master rank) to send the entire array to all nodes and receive it back (MPI_Bcast). So, instead of each node doing I/O, only one node will do it.
Thanks in advance!!!
I would look here and here. FORTRAN code for the second site is here and C code is here.
The idea is that you don't give the entire array to each processor. You give each processor only the piece it works on, with some overlap between processors so they can handle their mutual boundaries.
Also, you are right to save your computation to disk every so often. And I like MPI-IO for that. I think it is the way to go. But the codes in the links will allow you to run without reading every time. And, for my money, writing out the data every single time is overkill.
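A minimal mpi4py sketch of the decomposition described above: each rank owns only its slice of the 1D grid plus one ghost cell on each side, exchanges boundaries with its neighbours every step instead of going through a shared file, and only occasionally gathers everything on rank 0 to write a checkpoint. The initial data, the update rule and the file names are placeholders, not the asker's actual integration:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N = 1_000_000                                        # global grid size (placeholder)
counts = [N // size + (1 if r < N % size else 0) for r in range(size)]
offset = sum(counts[:rank])

local = np.zeros(counts[rank] + 2)                   # owned cells plus 2 ghost cells
local[1:-1] = np.sin(2 * np.pi * (offset + np.arange(counts[rank])) / N)  # placeholder initial data

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):
    # halo exchange: send the first/last owned cell, receive the neighbour's into the ghost cells
    comm.Sendrecv(local[1:2], dest=left, recvbuf=local[-1:], source=right)
    comm.Sendrecv(local[-2:-1], dest=right, recvbuf=local[:1], source=left)

    # placeholder local update standing in for the real integration step
    local[1:-1] = 0.5 * (local[:-2] + local[2:])

    if step % 10 == 0 and step > 0:
        # occasional checkpoint: gather only the owned parts (no ghosts) on rank 0 and write once
        pieces = comm.gather(local[1:-1].copy(), root=0)
        if rank == 0:
            np.save(f"checkpoint_{step}.npy", np.concatenate(pieces))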

Hadoop - job submission time on large data

Has anyone faced problems submitting a job on large data? The data is around 5-10 TB uncompressed, in approximately 500K files. When we try to submit a simple Java map-reduce job, it mostly spends more than an hour in the getSplits() function call, and it takes multiple hours to appear in the job tracker. Is there any possible solution to this problem?
With 500K files, you are spending a lot of time tree-walking to find all these files, which then need to be assigned to a list of InputSplits (the result of getSplits).
As Thomas points out in his answer, if your machine performing the job submission has a low amount of memory assigned to the JVM, then you're going to see issues with the JVM performing garbage collection to try and find the memory required to build up the splits for these 500K files.
To make matters worse, if these 500K files are splittable and larger than a single block size, you'll get even more input splits to process the files (for a file of, say, 1 GB with a block size of 256 MB, you'll by default get 4 map tasks to process this file, assuming the input format and file compression support splitting the file). If this is applicable to your job (look at the number of map tasks spawned for your job; are there more than 500K?), then you can force fewer mappers to be created by amending the mapred.min.split.size configuration property to a size larger than the current block size (setting it to 1 GB for the previous example means you'll get a single mapper to process the file, rather than 4). This will help the performance of the getSplits method: the resulting list of splits will be smaller, requiring less memory.
The second symptom of your problem is the time it takes to serialize the input splits to a file (client side), and then the deserialization time at the job tracker end. 500K+ splits is going to take time, and the jobtracker will have similar GC issues if it has a low JVM memory limit.
It largely depends on how "strong" your submission server (or your laptop client) is; maybe you need to upgrade its RAM and CPU to make the getSplits call faster.
I believe you ran into swap issues there, and the computation therefore takes multiple times longer than usual.
