Maximum memory size for an XCom in Airflow

I was wondering if there is any memory size limit for an XCom in Airflow?

Airflow is NOT a processing framework. It is not Spark, nor Flink. Airflow is an orchestrator, and the best orchestrator at that. There are no optimisations for processing big data in Airflow, nor a way to distribute it (maybe with one executor, but that is another topic). If you try to exchange big data between your tasks, you will end up with a memory overflow error! Oh, and do you know the XCom size limit in Airflow?
It depends on the database you use:
SQLite: 2 GB
Postgres: 1 GB
MySQL: 64 KB
Yes, 64 kilobytes for MySQL! Again, use XComs only for sharing small amounts of data.
ref: https://marclamberti.com/blog/airflow-xcom/

After looking at the source code, it looks like there is none; the type is a large binary in SQLAlchemy (see the source code).
So, according to the documentation, it is an unlengthed binary type for the target platform, such as BLOB on MySQL and BYTEA on PostgreSQL.
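For reference, this is roughly how such a column looks when declared with SQLAlchemy. This is a minimal illustrative sketch, not Airflow's actual XCom model; the table and class names are made up, and the import path assumes SQLAlchemy 1.4+:

# Sketch: a LargeBinary column in SQLAlchemy. LargeBinary maps to the
# platform's unlengthed binary type (BYTEA on PostgreSQL, BLOB on MySQL),
# so the practical size limit comes from the database backend itself.
from sqlalchemy import Column, Integer, LargeBinary, String
from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+ import path

Base = declarative_base()

class DemoXCom(Base):  # hypothetical model, not Airflow's real XCom class
    __tablename__ = "demo_xcom"
    id = Column(Integer, primary_key=True)
    key = Column(String(512))
    value = Column(LargeBinary)  # BYTEA / BLOB depending on the backend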

According to the source code (check the linked source), the maximum XCom size is 48 KB.
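Whatever the exact backend limit, the practical advice is the same: push a small reference (a file path, an object-store key) through XCom and let the downstream task load the data itself. Below is a minimal sketch using the classic PythonOperator API on a recent Airflow 2.x; the DAG id, task ids and the file path are placeholders:

# Sketch: exchange a small reference via XCom instead of the data itself.
# DAG id, task ids and the file path are made up for illustration.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(ti):
    # Write the large dataset somewhere external (disk, S3, GCS, ...)
    # and push only its location, a few bytes, through XCom.
    path = "/tmp/large_dataset.parquet"
    ti.xcom_push(key="dataset_path", value=path)

def transform(ti):
    # Pull the small reference and load the data inside this task.
    path = ti.xcom_pull(task_ids="extract", key="dataset_path")
    print(f"Processing data from {path}")

with DAG(
    dag_id="xcom_reference_demo",
    start_date=datetime(2023, 1, 1),
    schedule=None,   # Airflow 2.4+ argument name
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task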

Related

dotnet-gcdump unexpected dump size and impact

I'm running a simple CRUD app built with ASP.NET Core and EF Core 3.1 in a docker swarm cluster on ubuntu. I'm only using managed code.
The container has a 10GB memory limit specified. I can inspect a running container and verify that this limit is actually set, I also see that DOTNET_RUNNING_IN_CONTAINER is set to true. When the app is started the memory consumption is about 700MB and it slowly builds up. Once it reaches 7GB (I see it in container generic stats) I start getting OutOfMemoryExceptions and it stays at this level for days. So the first question is
Why doesn't it go up to 10 GB?
Anyway, I expect memory leaks, so I have the dotnet-gcdump tool installed in this same container, and I go ahead and collect a dump for future analysis with dotnet-gcdump collect. Once I execute this command, I see the memory consumption of the running container drop from 7GB to 3GB and stay at this level. The resulting .gcdump file itself, though, is only ~200MB in size, with nothing suspicious in it. So the next questions are:
How does the collection of a dump reduce memory consumption? I'd assume it's doing GC with LOH compaction but it doesn't mention it in the docs.
Why isn't this memory freed automatically if the tool is able to do it?
Why is a resulting dump only 200 MB in size?
As the gcdump documentation explains: "GC dumps are created by triggering a GC in the target process, turning on special events, and regenerating the graph of object roots from the event stream".
Thus, it directly answers your question 2 - it triggers a full GC, which may or may not be compacting, but it collects gen2 for sure. It also answers question 4 - it is not a "memory dump" but a special kind of diagnostics data about the object graph (dependencies and type names), without the data itself.
And as regards questions 1 and 3 - it is an example of the GC being not "aggressive" enough. It is kind of a "living on the edge" problem, when the process almost meets the container's limits and the GC is sometimes not able to interpret that correctly. In other words, it thinks it has enough space, but it doesn't. Please be warned that this is a super-oversimplification. In such a case, full GCs may not happen, or happen too late. I would confirm that by observing the process with dotnet-trace using the gc-collect profile.
As a solution, consider setting the limit manually, by using GCHeapHardLimit, to some clearly smaller value like 8GB.

Query execution taking time in Presto with pinot connector

We are using Apache Pinot as the source system. We have loaded 10GB of TPCH data into Pinot. We are using Presto as the query execution engine, via the Pinot connector.
We are trying a simple configuration. Presto is installed on a CentOS machine with 8 CPUs and 64GB RAM. Only one worker instance is running, with an embedded coordinator. Pinot is installed on a CentOS machine with 4 CPUs and 64GB RAM. One controller, one broker, one server and one ZooKeeper are running.
Running a query on the Lineitem table involving a group-by roll-up takes 23 seconds. Around 20 seconds is spent transferring 2.3GB of data from Pinot to Presto.
Another query, involving a join between Lineitem, Nation, Partsupply and Region with a group-by cube, takes around 2 minutes. Data transfer takes around 25 seconds of this. Most of the remaining time is spent in join and aggregation computation.
Is this normal performance with presto-pinot?
If not, what am I missing?
Do I need to increase hardware? Increase the number of Presto/Pinot processes?
Any specific presto properties I should consider modifying?
Thanks for your help in advance
Please list the queries so that we can provide a better answer. At a high level, the Presto Pinot connector tries to push down most of the computation (filter, aggregation, group by) to Pinot and minimize the amount of data that needs to be pulled from Pinot.
There are always queries that require a full table scan, where the computation cannot be pushed to Pinot. Query latency can be higher in such cases. Pinot recently added a streaming API that can improve the latency further.
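To illustrate the difference, an aggregation that can be pushed down to Pinot only ships the grouped results back to Presto, while selecting raw columns forces the rows themselves to be transferred. A rough sketch using the presto-python-client package; the host, port, catalog and schema names are assumptions, and the column names follow the standard TPC-H lineitem schema, so adjust them to your setup:

# Sketch: a pushdown-friendly query vs. one that pulls raw rows from Pinot.
# Connection details, catalog/schema and column names are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator",  # assumed hostname
    port=8080,
    user="analyst",
    catalog="pinot",    # assumed catalog name for the Pinot connector
    schema="default",
)
cur = conn.cursor()

# Filter + group by + aggregation are candidates for pushdown to Pinot,
# so only the small grouped result set travels from Pinot to Presto.
cur.execute(
    "SELECT l_returnflag, l_linestatus, SUM(l_extendedprice) AS revenue "
    "FROM lineitem "
    "WHERE l_shipdate <= DATE '1998-09-01' "
    "GROUP BY l_returnflag, l_linestatus"
)
print(cur.fetchall())

# Selecting raw columns cannot be answered by an aggregation inside Pinot;
# Presto has to pull the matching rows, which dominates the query time.
cur.execute("SELECT l_orderkey, l_extendedprice, l_discount FROM lineitem LIMIT 10")
print(cur.fetchall())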

How do I know whether an active PostgreSQL query is still "working"?

I am appending a large data frame (20 million rows) from R to PostgreSQL (9.5) using caroline::dbWriteTable2. I can see that this operation has created an active query, with the waiting flag set to f, using:
select *
from pg_stat_activity
where datname = 'dbname'
The query has been running for a long time (more than an hour) and I am wondering whether it is stalled. In my Windows 7 Resource Monitor I can see that the PostgreSQL server process is using CPU, but it is not listed under Disk Activity.
What other things can I do to check that the query has not been stalled for whatever reason?
Basically, if the backend is using CPU time, it is not stalled. SQL queries can run for a very long time.
There is no comfortable way to determine what a working PostgreSQL backend is currently doing; you can use something like strace on Linux to monitor the system calls issued or gdb to get a stack trace. If you know your way around the PostgreSQL source, and you know the plan of the active query, you can then guess what it is doing.
My advice is to take a look at the query plan (EXPLAIN) and check whether there are some expensive operations (high cost). That may be causing the long execution time.
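If you prefer to watch the backend from a second connection rather than with strace or gdb, you can poll pg_stat_activity periodically; on 9.5 the waiting column tells you whether the backend is blocked on a lock. A rough sketch with psycopg2, where the connection parameters and database name are placeholders:

# Sketch: poll pg_stat_activity from a second connection to watch the backend.
# The "waiting" column exists on 9.5; later versions replaced it with
# wait_event_type / wait_event.
import time
import psycopg2

conn = psycopg2.connect(dbname="dbname", user="postgres", host="localhost")
conn.autocommit = True

with conn.cursor() as cur:
    for _ in range(10):
        cur.execute(
            """
            SELECT pid, state, waiting, now() - query_start AS runtime, query
            FROM pg_stat_activity
            WHERE datname = %s AND state <> 'idle'
            """,
            ("dbname",),
        )
        for row in cur.fetchall():
            print(row)
        time.sleep(30)  # a long-running but active query keeps the same pid and query_start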

SQLite Abnormal Memory Usage

We are trying to integrate SQLite into our application and populate it as a cache. We are planning to use it as an in-memory database, and we are using it for the first time. Our application is C++ based.
Our application interacts with the master database to fetch data and performs numerous operations. These operations generally concern one table, which is quite large.
We replicated this Table in SQLite and following are the observations:
Number of Fields: 60
Number of Records: 100,000
As the data population starts, the memory of the application shoots up drastically from 120MB to ~1.4GB. At this point our application is idle and not doing any major operations, but normally, once the operations start, memory utilization shoots up further. With SQLite as an in-memory DB and this high memory usage, we don't think we will be able to support this many records.
Q. Is there a way to find the size of the database when it is in memory?
When I create the DB on disk, the DB size comes to ~40MB, but the memory usage of the application still remains very high.
Q. Is there a reason for this high usage? All buffers have been cleared and, as said before, in this case the DB is not in memory.
Any help would be deeply appreciated.
Thanks and Regards
Sachin
A few questions come to mind...
What is the size of each record?
Do you have memory leak detection tools for your platform?
I used SQLite in a few resource-constrained environments in a way similar to how you're using it, and after fixing bugs it was small, stable and fast.
IIRC it was unclear when to clean up certain things used by the SQLite API, and when we used tools to find the memory leaks it was fairly easy to see where the problem was.
See this:
PRAGMA shrink_memory
This pragma causes the database connection on which it is invoked to free up as much memory as it can, by calling sqlite3_db_release_memory().
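Regarding the first question, you can estimate the size of an in-memory database from its page count and page size, and shrink_memory can be issued on the same connection. A minimal sketch, shown here with Python's sqlite3 module for brevity; the same PRAGMAs are available from C++ through sqlite3_exec, and sqlite3_db_release_memory() can be called directly:

# Sketch: estimate the size of an in-memory SQLite database and ask the
# connection to release unused memory. Table name and row count are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO cache (payload) VALUES (?)",
    (("x" * 100,) for _ in range(100_000)),
)

# Database size = page_count * page_size.
page_count = conn.execute("PRAGMA page_count").fetchone()[0]
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
print(f"database size: {page_count * page_size / 1024 / 1024:.1f} MiB")

# Free as much memory as this connection can give back.
conn.execute("PRAGMA shrink_memory")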

Why are SQLite transactions bound to harddisk rotation?

There's a following statement in SQLite FAQ:
A transaction normally requires two complete rotations of the disk platter, which on a 7200RPM disk drive limits you to about 60 transactions per second.
As far as I know, there's a cache on the hard disk, and there might also be an extra cache in the disk driver, which abstracts the operation perceived by the software from the actual operation against the disk platter.
Then why and how exactly are transactions so strictly bound to disk platter rotation?
From Atomic Commit In SQLite
2.0 Hardware Assumptions
SQLite assumes that the operating system will buffer writes and that a write request will return before data has actually been stored in the mass storage device. SQLite further assumes that write operations will be reordered by the operating system. For this reason, SQLite does a "flush" or "fsync" operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. We are told that the flush and fsync primitives are broken on some versions of Windows and Linux. This is unfortunate. It opens SQLite up to the possibility of database corruption following a power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLite assumes that the operating system that it is running on works as advertised. If that is not quite the case, well then hopefully you will not lose power too often.
Because it ensures data integrity by making sure the data is actually written onto the disk rather than held in memory. Thus, if the power goes off or something, the database is not corrupted.
This video http://www.youtube.com/watch?v=f428dSRkTs4 talks about the reasons why (e.g. because SQLite is actually used in a lot of embedded devices, where the power might well suddenly go off).
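The practical consequence is that the fsync at commit, not the individual statement, dominates: wrapping many writes in one transaction costs roughly one pair of platter rotations instead of one pair per row. A small sketch illustrating the difference with Python's sqlite3 module; the file name is a placeholder and timings will vary with your disk:

# Sketch: batching writes into a single transaction amortizes the commit
# fsync, so it is far faster than committing each row separately.
import sqlite3
import time

conn = sqlite3.connect("demo.db", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")

# One transaction (and one fsync) per insert.
start = time.time()
for i in range(100):
    conn.execute("INSERT INTO t VALUES (?)", (i,))
print(f"autocommit: {time.time() - start:.2f}s for 100 rows")

# One transaction (and one fsync) for all inserts.
start = time.time()
conn.execute("BEGIN")
for i in range(10_000):
    conn.execute("INSERT INTO t VALUES (?)", (i,))
conn.execute("COMMIT")
print(f"single transaction: {time.time() - start:.2f}s for 10,000 rows")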
