Impala via JDBC admitting queries sequentially from threads in the same Java process

I am using Impala JDBC driver 2.5.32 on Cloudera 5.8.1. My Java code spawns 10 different threads. Each thread instantiates its own java.sql.Connection from DriverManager and runs one "insert into target_table select field1, field2 ... fieldN, count(*) from source_table group by field1, field2 ... fieldN" query. The query is big enough to run for a few minutes, and each query has a nontrivial peak memory usage.
I can see that even though all the threads have submitted their queries and are waiting for responses, on the Impala side only one query is executed at a time.
However, if I run them as different Java processes instead of threads, the queries run in parallel. Could someone throw some light on why this could happen? Please let me know of any additional details that I can provide.
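For reference, a minimal Java sketch of the setup described above (the JDBC URL and the table and column names are placeholders, not the original code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ParallelImpalaInserts {
    // Placeholder URL; adjust host, port and database for your cluster.
    private static final String URL = "jdbc:impala://impala-host:21050/default";

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                // Each thread gets its own Connection from DriverManager,
                // as described in the question.
                try (Connection conn = DriverManager.getConnection(URL);
                     Statement stmt = conn.createStatement()) {
                    stmt.executeUpdate(
                        "insert into target_table "
                            + "select field1, field2, count(*) "
                            + "from source_table group by field1, field2");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }
}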

Related

Apache Airflow connection hook is instantiating multiple connections

Background: the Apache Airflow documentation reads:
Hooks
Hooks act as an interface to communicate with the external shared resources in a DAG. For example, multiple tasks in a DAG can require access to a MySQL database. Instead of creating a connection per task, you can retrieve a connection from the hook and utilize it.
I have tried spawning 10 tasks against different databases: MySQL, Postgres, MongoDB. Please note that I am using one database (e.g. MySQL) in one DAG (consisting of 10 tasks).
But all tasks are instantiating a new connection.
Example of my task:
from airflow.hooks.postgres_hook import PostgresHook

conn_string = kwargs.get('conn_id')  # the Airflow connection ID passed to the task
pg = PostgresHook(postgres_conn_id=conn_string)
pg_query = "...."
records = pg.get_records(pg_query)
Why is Airflow instantiating a new connection when the documentation itself reads "... multiple tasks in a DAG can require access to a MySQL database. Instead of creating a connection per task, you can retrieve a connection from the hook and utilize it ..."?
What is being missed here?
I believe what that part of the documentation means is that hooks prevent you from redefining the same credentials over and over again. By "connection" they are referring to an Airflow connection you define in the web interface, not an actual network connection to a host.
If you think about it this way:
A task can be scheduled on any of the 3 Airflow worker nodes.
Your 10 tasks are divided between these 3.
How would they be able to share the same network connection if they run on different hosts? It would be very hard to maintain those network connections across workers.
But don't worry, it also took ages for me to understand what they meant there.

R services for SQL and T-SQL, memory issues when running multiple stored procedures

We are running R Services for SQL on an Azure VM to create a web interface to a modeling algorithm. We are using T-SQL to call a very complicated and memory-intensive R script. It runs fine when we submit a single job, but we get a memory allocation error when we submit another job before the first has finished. Eventually we will need to queue hundreds of jobs that will run over many hours. We assume that it is initiating two R processes which use up the resources. What is the best way to force the R jobs to run sequentially rather than simultaneously? We have looked at updating the resource pool with MAX_PROCESSES = 0, with no success (we have already adjusted memory resources in this way). We are considering using the Azure Service Bus Queue, but we are hoping there might be simpler options. We would appreciate any advice on how others have dealt with this sort of issue.
Thanks 10^6.

How fast are CEL events written to the database?

My program should notify its subscribers about new calls, ended calls, transfers and so on.
I can listen for CEL events on the AMI, but a simpler solution would be to query the database every X seconds and handle the records from there, since I'll have to do that anyway to handle calls that took place while my program wasn't running.
(Yeah, I know, usually pushed events are better than polling, but not in this case, IMO.)
But I'm not sure how quickly CEL events are dumped into the database. Is there any delay or queue?
I've tested on my local Asterisk and events appeared in the database right away, but on some highly loaded instances this may not be the case.
There is no queue.
When the backend database CEL driver is loaded, it initializes the connection to the database. When an event happens, it basically blocks the execution of that call until the database operation finishes (succeeds, fails, or times out).
If ODBC is used, that is a bit different, as I remember: it handles database transactions and cursors, but there is still no queue. I'm not sure about connection pooling.
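For illustration, a minimal Java sketch of the polling approach the question describes, assuming a MySQL CEL backend; the connection details are placeholders, and the auto-increment id column is an assumption (eventtype, eventtime and uniqueid are standard CEL fields):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CelPoller {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/asterisk"; // placeholder
        long lastSeenId = 0; // remember the last row we handled
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                 "select id, eventtype, eventtime, uniqueid "
                     + "from cel where id > ? order by id")) {
            while (true) {
                // Fetch only rows written since the previous poll.
                ps.setLong(1, lastSeenId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastSeenId = rs.getLong("id");
                        System.out.printf("%s %s %s%n",
                            rs.getString("eventtime"),
                            rs.getString("eventtype"),
                            rs.getString("uniqueid"));
                    }
                }
                Thread.sleep(5000); // poll every X seconds
            }
        }
    }
}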

How do I know whether an active PostgreSQL query is still "working"?

I am appending a large data frame (20 million rows) from R to PostgreSQL (9.5) using caroline::dbWriteTable2. I can see that this operation has created an active query, with the waiting flag set to f, using:
select *
from pg_stat_activity
where datname = 'dbname'
The query has been running for a long time (more than an hour) and I am wondering whether it is stalled. In the Windows 7 Resource Monitor I can see that the PostgreSQL server process is using CPU, but it is not listed under Disk Activity.
What else can I do to check that the query has not stalled for whatever reason?
Basically, if the backend is using CPU time, it is not stalled. SQL queries can run for a very long time.
There is no comfortable way to determine what a working PostgreSQL backend is currently doing; you can use something like strace on Linux to monitor the system calls it issues, or gdb to get a stack trace. If you know your way around the PostgreSQL source and you know the plan of the active query, you can then guess what it is doing.
My advice is to take a look at the query plan (EXPLAIN) and check whether there are expensive operations (high cost) that could explain the long execution time.

sqlite database connection/locking question

Folks
I am implementing a file-based queue (see my earlier question) using SQLite. I have the following threads running in the background:
thread-1 empties a memory structure into the "queue" table (an insert into the "queue" table).
thread-2 reads and "processes" the "queue" table; it runs every 5 to 10 seconds.
thread-3 runs very infrequently; it purges old data that is no longer needed from the "queue" table and also runs VACUUM so the size of the database file remains small.
Now the behavior that I would like is for each thread to get whatever lock it needs (waiting with a timeout if possible) and then complete its transaction. It is OK if the threads do not run concurrently; what is important is that a transaction, once begun, does not fail due to locking errors such as "database is locked".
I looked at the transaction documentation, but there does not seem to be a timeout facility (I am using JDBC). Can the timeout be set to a large value on the connection?
One solution (untried) I can think of is to have a connection pool with a maximum of one connection. That way only one thread can connect at a time, so we should not see any locking errors. Are there better ways?
Thanx!
If it were me, I'd use a single database connection handle. If a thread needs it, it can acquire it within a critical section (or mutex, or similar) - this is basically a poor man's connection pool with only one connection in the pool :) It can do its business with the database. When done, it exits the critical section (or releases the mutex). You won't get locking errors if you carefully use the single db connection.
-Don
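A minimal Java sketch of the single-shared-connection approach Don describes, assuming the SQLite JDBC driver is on the classpath and an illustrative queue(id, payload) schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// A poor man's connection pool with exactly one connection: all threads
// share one handle and serialize access via synchronized methods, so no
// two transactions ever compete for the database lock.
public class SingleConnectionQueue {
    private final Connection conn;

    public SingleConnectionQueue(String dbPath) throws Exception {
        conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
    }

    // thread-1: flush an item from the in-memory structure into the queue table
    public synchronized void enqueue(String payload) throws Exception {
        try (PreparedStatement ps =
                 conn.prepareStatement("insert into queue(payload) values (?)")) {
            ps.setString(1, payload);
            ps.executeUpdate();
        }
    }

    // thread-3: purge old rows, then compact the file
    public synchronized void purge(long olderThanId) throws Exception {
        try (PreparedStatement ps =
                 conn.prepareStatement("delete from queue where id < ?")) {
            ps.setLong(1, olderThanId);
            ps.executeUpdate();
        }
        try (Statement st = conn.createStatement()) {
            st.execute("vacuum"); // autocommit must be on for VACUUM
        }
    }
}

Because the methods are synchronized, the three background threads never run concurrently against the database, which matches the "it is OK if threads do not run concurrently" requirement in the question.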
