Multiple query execution in cloudera impala - cloudera

Is it possible to execute multiple queries at the same time in impala ? If yes, how does impala handle it?

I would certainly do some tests on your own, but I was not able to get multiple queries to execute:
I was using Impala connection, and reading query from a .sql file. This works for single commands.
from impala.dbapi import connect
# actual server and port changed for this post for security
conn=connect(host='impala server', port=11111,auth_mechanism="GSSAPI")
cursor = conn.cursor()
cursor.execute((open("sandbox/z_temp.sql").read()))
This is the error I received.
HiveServer2Error: AnalysisException: Syntax error in line 2:
This is what the SQL looked like in the .sql file.
Select * FROM database1.table1;
Select * FROM database1.table2;
I was able to run multiple commands with the SQL commands in separate .sql files iterating over all .sql files in a specified folder.
#Create list of file names for recon .sql files this will be sorted
#Numbers at begining of filename are important to sort so that files will be executed in correct order
file_names = glob.glob('folder/.sql')
asc_names = sorted(file_names, reverse = False)
filename = ""
for file_name in asc_names:
str_filename = str(file_name)
print(filename)
query = (open(str_filename).read())
cursor = conn.cursor()
# creates an error log dataframe to print, or write to file at end of job.
try:
# Each SQL command must be executed seperately
cursor.execute(query)
df_id= pd.DataFrame([{'test_name': str_filename[-40:], 'test_status': 'PASS'}])
df_log = df_log.append(df_id, ignore_index=True)
except:
df_id= pd.DataFrame([{'test_name': str_filename[-40:], 'test_status': 'FAIL'}])
df_log = df_log.append(df_id, ignore_index=True)
continue
Another way to do this would be to have all of the SQL statements in one .sql file separated by ; then loop thru the .sql file splitting statements out by ; running one at a time.
from impala.dbapi import connect
from impala.util import as_pandas
conn=connect(host='impalaserver', port=11111, auth_mechanism='GSSAPI')
cursor = conn.cursor()
# split SQL statements from one file seperated by ';', Note: last command will not have semicolon at end.
sql_file = open("sandbox/temp.sql").read()
sql = sql_file.split(';')
for cmd in sql:
# This gets rid of the non printing characters you may have
cmd = cmd.replace('/r','')
cmd = cmd.replace('/n','')
# This runs your SQL commands one at a time.
cursor.execute(cmd)
print(cmd)

Impala can execute multiple queries at the same time as long as it doesn't hit the memory cap.

You can issue a command like impala-shell -f <<file_name>>, where the file has multiple queries each complete query separated by a semi colon (;)

If you are a python geek, you can even try the impyla package to create multiple connections and run all your queries at once.
pip install impyla

Related

How to use the result of a query (bigquery operator) in another task-airflow

I have a project in google composser that aims to submit on a daily basis.
The code below does that, it works fine.
with models.DAG('reporte_prueba',
schedule_interval=datetime.timedelta(weeks=4),
default_args=default_dag_args) as dag:
make_bq_dataset = bash_operator.BashOperator(
task_id='make_bq_dataset',
# Executing 'bq' command requires Google Cloud SDK which comes
# preinstalled in Cloud Composer.
bash_command='bq ls {} || bq mk {}'.format(
bq_dataset_name, bq_dataset_name))
bq_audit_query = bigquery_operator.BigQueryOperator(
task_id='bq_audit_query',
sql=query_sql,
use_legacy_sql=False,
destination_dataset_table=bq_destination_table_name)
export_audits_to_gcs = bigquery_to_gcs.BigQueryToCloudStorageOperator(
task_id='export_audits_to_gcs',
source_project_dataset_table=bq_destination_table_name,
destination_cloud_storage_uris=[output_file],
export_format='CSV')
download_file = GCSToLocalFilesystemOperator(
task_id="download_file",
object_name='audits.csv',
bucket='bucket-reportes',
filename='/home/airflow/gcs/data/audits.csv',
)
email_summary = email_operator.EmailOperator(
task_id='email_summary',
to=['aa#bb.cl'],
subject="""Reporte de Auditorías Diarias
Institución: {institution_report} día {date_report}
""".format(date_report=date,institution_report=institution),
html_content="""
Sres.
<br>
Adjunto enviamos archivo con Reporte Transacciones Diarias.
<br>
""",
files=['/home/airflow/gcs/data/audits.csv'])
delete_bq_table = bash_operator.BashOperator(
task_id='delete_bq_table',
bash_command='bq rm -f %s' % bq_destination_table_name,
trigger_rule=trigger_rule.TriggerRule.ALL_DONE)
(
make_bq_dataset
>> bq_audit_query
>> export_audits_to_gcs
>> delete_bq_table
)
export_audits_to_gcs >> download_file >> email_summary
With this code, I create a table (which is later deleted) with the data that I need to send, then I pass that table to storage as a csv.
then I download the .csv to the local airflow directory to send it by mail.
The question I have is that if I can avoid the part of creating the table and taking it to storage. since I don't need it.
for example, execute the query with BigqueryOperator and access the result in ariflow, thereby generating the csv locally and then sending it.
I have the way to generate the CSV but my biggest doubt is how (if it is possible) to access the result of the query or pass the result to another airflow task
Though I wouldn't recommend passing results of sql queries across tasks, XComs in airflow are generally used for the communication between tasks.
https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
Also you need to create a custom operator to return query results, as I "believe" BigQueryOperator doesn't return query results.

Using Beeline as an example (vs hive cli)?

I have a sqoop job ran via oozie coordinator. After a major upgrade we can no longer use hive cli and were told to use beeline. I'm not sure how to do this? Here is the current process:
I have a hive file: hive_ddl.hql
use schema_name;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
SET mapreduce.map.memory.mb=16384;
SET mapreduce.map.java.opts=-Xmx16G;
SET hive.exec.compress.output=true;
SET mapreduce.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
drop table if exists 'table_name_stg' purge;
create external table if not exists 'table_name_stg'
(
col1 string,
col2 string,
...
)
row format delimited
fields terminated by '\001'
stored as textfile
location 'my/location/table_name_stg';
drop table if exists 'table_name' purge;
create table if not exists 'table_name'
stored as parquet
tblproperties('parquet.compress'='snappy') as
select * from schema.tablename_stg
drop table if exists 'table_name_stg' purge;
This is pretty straight forward, make a stage table, then use that to make the final table stuff...
it's then called in a .sh file as such:
hive cli -f $HOME/my/path/hive_ddl.hql
I'm new to most of this and not sure what beeline is, and couldn't find any examples of how to use it to accomplish the same thing my hivecli is. I'm hoping it's as simple as calling the hive_ddl.hql file differently, versus having to rewrite everything.
Any help is greatly appreciated.
Beeline is a command line shell supported in hive. In your case you can replace hive cli with a beeline command in the same .sh file. Would look roughly like the one given below.
beeline -u hiveJDBCUrl and -f test.hql
You can explore more about the beeline command options by going to the below link
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline%E2%80%93CommandLineShell

Pass .sql file into DBI::dbGetQuery with comments as the first line into R / RStudio

I am trying to pass a .sql file into DBI::dbGetQuery, but I run into problems whenever there is comment in the first line. For example:
runs_fine.sql
SELECT * FROM mytable
does_not_work.sql
-- Some comment
SELECT * FROM mytable
Then in R:
library(odbc)
library(tidyverse)
con <- dbConnect(odbc(), "my_connection")
dbGetQuery(con, read_file("sql/runs_fine.sql")) # success
dbGetQuery(con, read_file("sql/does_not_work.sql")) # fails
Motivation: I would like to leverage the convenient sql preview function from RStudio, but this requires the first line be the following for any .sql file:
-- !preview conn=con
...which causes the dbGetQuery to fail. I suppose a workaround would be to use some sort of read.file function that can skip lines or avoid comment lines.
NOTE: If a comment line is in the middle of a .sql file, dbGetQuery succeeds as expected and avoids the appropriate commented lines.
DBMS is CHD Impala, kerberized. The return value from read_file() begins with:
"-- !preview conn=con\r\n\r\nSELECT \r\n ..."

How to export the query results from a Database Manager to a CVS file in SQLite?

I am using DataGrip or SQLiteStudio (database managers) to run a series of queries in a database which guide me to find the information that I require. The queries works well and the results are shown in the console of the Dabase Manager. However, I need to export the results that appears in the database manager console into a CVS file.
I have seen everybody works directly in the shell, but I need (I have to) to use a DB manager to run the queries (so far the queries that I need to run in one step are about 600 lines).
In the sqlite3 shell I am able to run (and works)
(.headers on)
(.mode csv)
(.output C:/filename.csv)
(select * from "6000_1000_Results";)
(.output stdout)
However, running this code in the sql editor of the DB manager, doesn´t work at all.
--(.....)
--(around 600 lines before)
--(.....)
"Material ID",
"Material Name",
SUM("Quantity of Material") Quantity
FROM
"6000_1000_Results_Temp"
GROUP BY
"DataCenterID", "Material ID";
------------------------------------------------------------
--(HERE IS WHERE I NEED TO EXPORT THE RESULTS IN A CVS FILE)
------------------------------------------------------------
.headers on
.mode csv
.output C:/NextCloudLuis/TemproDB.git/csvtest.csv
select * from "6000_1000_Results";
.output stdout
.show
DROP TABLE IF EXISTS "6000_1000_Results_Temp";
DROP TABLE IF EXISTS "6000_1000_Results";
Datagrip do not show any error, it runs the queries in a few seconds, but there´s no file anywhere, SQLiteStudio gives a syntax error.
Finally I solve this issue in the next steps:
I run over python all the queries using the lybrary sqlite3. Then the result of all the queries are saved in a pandas dataframe. Then the pandas dataframe is exported to a cvs and xlxs file.
Here is the python code:
import sqlite3
import queries
conn = sqlite3.connect("tempro.db") #make the database connection to python
level4000 = queries.level4000to1000(conn) #I call the function in queries.py
level4000.to_csv('Level4000to1000.csv') #export result ro cvs
level4000.to_excel('Level4000to1000.xlsx') #export result ro xlsx
conn.close()
and here is the python file where I save all the queries (queries.py)
import sqlite3
from sqlite3 import Error
import pandas as pd
def level4000to1000(conn):
cur = conn.cursor()
cur.executescript(
"""
/* Here I put all the 600 lines of queries */
DROP TABLE IF EXISTS "4000_1000_Results_Temp";
DROP TABLE IF EXISTS "4000_1000_Results";
/* Here more and more lines */
--To keep the results from all queries
CREATE TABLE "4000_1000_Results_Temp" (
"DeviceID" INTEGER,
"Device Name" TEXT,
SUM("Quantity of Material") Quantity
FROM
"4000_1000_Results_Temp"
GROUP BY
"DeviceID", "Material ID";
""")
df = pd.read_sql_query('''SELECT * FROM "4000_1000_Results";''', conn)
cur.executescript("""DROP TABLE IF EXISTS "4000_1000_Results_Temp";
DROP TABLE IF EXISTS "4000_1000_Results";""")
return df #returns a dataframe with the info results from the queries
In the end, it seems that there is no way to export results to file formats as cvs using SQL coding.

Create a stored procedure using RMySQL

Background: I am developing a rscript that pulls data from a mysql database, performs a logistic regression and then inserts the predictions back into the database. I want the entire system to be self contained in the script in case of database failure. This includes all mysql stored procedures that the script depends on to aggregate the data on the backend since these would be deleted in such a database failure.
Question: I'm having trouble creating a stored procedure from an R script. I am running the following:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
DROP PROCEDURE IF EXISTS Test.Tester;
DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END //
DELIMITER ;
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
I however get the following error that seems to involve the DELIMITER change.
Error in .local(conn, statement, ...) :
could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
EN' at line 2
What I've Done: I have spent quite a bit of time searching for the answer, but have come up with nothing. What am I missing?
Just wanted to follow up on this string of comments. Thank you for your thoughts on this issue. I have a couple Python scripts that need to have this functionality and I began researching the same topic for Python. I found this question that indicates the answer. The question states:
"The DELIMITER command is a MySQL shell client builtin, and it's recognized only by that program (and MySQL Query Browser). It's not necessary to use DELIMITER if you execute SQL statements directly through an API.
The purpose of DELIMITER is to help you avoid ambiguity about the termination of the CREATE FUNCTION statement, when the statement itself can contain semicolon characters. This is important in the shell client, where by default a semicolon terminates an SQL statement. You need to set the statement terminator to some other character in order to submit the body of a function (or trigger or procedure)."
Hence the following code will run in R:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)

Resources