Using Beeline as an example (vs hive cli)? - oozie

I have a sqoop job run via an oozie coordinator. After a major upgrade we can no longer use the hive CLI and were told to use beeline. I'm not sure how to do this. Here is the current process:
I have a hive file: hive_ddl.hql
use schema_name;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
SET mapreduce.map.memory.mb=16384;
SET mapreduce.map.java.opts=-Xmx16G;
SET hive.exec.compress.output=true;
SET mapreduce.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
drop table if exists `table_name_stg` purge;
create external table if not exists `table_name_stg`
(
col1 string,
col2 string,
...
)
row format delimited
fields terminated by '\001'
stored as textfile
location 'my/location/table_name_stg';
drop table if exists `table_name` purge;
create table if not exists `table_name`
stored as parquet
tblproperties('parquet.compression'='SNAPPY') as
select * from table_name_stg;
drop table if exists `table_name_stg` purge;
This is pretty straightforward: make a staging table, then use it to build the final table...
it's then called in a .sh file as such:
hive -f $HOME/my/path/hive_ddl.hql
I'm new to most of this and not sure what beeline is, and I couldn't find any examples of how to use it to accomplish the same thing the hive CLI does. I'm hoping it's as simple as calling the hive_ddl.hql file differently, rather than having to rewrite everything.
Any help is greatly appreciated.

Beeline is a command-line shell supported in Hive; it connects to Hive through a HiveServer2 JDBC connection. In your case you can replace the hive CLI call with a beeline command in the same .sh file. It would look roughly like the one given below.
beeline -u <hiveJDBCUrl> -f test.hql
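Applied to your job, the line in the .sh file might then become something like the sketch below; the JDBC URL is a placeholder (add a Kerberos principal or -n/-p credentials as your cluster requires), and the path is the one from your current script:
#!/bin/bash
# Run the existing DDL file through HiveServer2 instead of the deprecated hive CLI.
# jdbc:hive2://<hs2-host>:10000/schema_name is a placeholder URL - substitute your HiveServer2 host/port.
beeline -u "jdbc:hive2://<hs2-host>:10000/schema_name" \
        -f "$HOME/my/path/hive_ddl.hql"
The SET statements and DDL inside hive_ddl.hql should run unchanged, since beeline simply submits them to HiveServer2.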
You can explore more about the beeline command options by going to the below link
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline%E2%80%93CommandLineShell

Related

Informatica IPC - UNIX script fail

I have created a unix script to be executed after the session finishes.
The script basically counts the lines of a specific file and then creates a trailer with this specific structure:
T000014800000000000000000000000000000
T - for trailer
0000148 - number of lines
00000000000000000000000000000 - filler
I have tested the script on Mac. I already know the environments are totally different, but I want to know what needs to be changed in order to execute this script successfully in IPC.
After execution I get the following error message:
The shell command failed with exit code 126.
I invoke the script as follows:
sh -c "$PMRootDir/scripts/exec_trailer_unix.sh $PMRootDir/TgtFiles"
#! /bin/sh
TgtFiles=$1
TgtFilesBody=$TgtFiles/body.txt
TgtFilesTrailer=$TgtFiles/trailer.txt
# line count of the body file
string1=$(sed -n '$=' $TgtFilesBody)
# a string of eight '0' characters used for padding (relies on bash brace expansion)
pad=$(printf '%0.1s' "0"{1..8})
padlength=8
string2='T'
# 'T' followed by the line count, zero-padded so the prefix is padlength (8) characters
string3=$(printf '%s%*.*s%s\n' "$string2" 0 $((padlength - ${#string1} - ${#string2} )) "$pad" "$string1")
string4='00000000000000000000000000000'
# append the filler zeros
string5=$(printf '%s%*.*s%s\n' "$string3" 0 $((${#string3} - ${#string4} )) "$string4")
echo $string5 > $TgtFilesTrailer
Any idea would be great.
Thanks in advance.
Please check the below points.
It looks like a permission issue. Please log in as the Informatica user (the user that runs the Informatica daemon) and run this command. You should be able to see the errors.
sh -c "$PMRootDir/scripts/exec_trailer_unix.sh $PMRootDir/TgtFiles"
Sometimes the server variable $PMRootDir in UNIX doesn't get interpreted and can result in a null value. Please use echo $PMRootDir to check that it resolves, after logging into UNIX as the above user.
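Exit code 126 usually means the shell found the script but was not allowed to execute it, so a quick sketch of what you could check as that user (paths taken from the question) would be:
ls -l $PMRootDir/scripts/exec_trailer_unix.sh        # check the owner and execute bits
chmod +x $PMRootDir/scripts/exec_trailer_unix.sh     # add execute permission if it is missing
sh -x $PMRootDir/scripts/exec_trailer_unix.sh $PMRootDir/TgtFiles   # trace the run to surface any remaining errors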
You can also create the trailer file easily using Informatica itself.
Just add an Aggregator transformation right before the actual target (group by a dummy field to calculate count(*)), then an Expression transformation to build those strings, and then a trailer-file target, as in the flow below. Just 3 more transformations.
           |--> AGG --> EXP --> Trailer Target file
Final Tr --|
           |--> Final Target

How to keep Jupyter cell ids from changing all the time and thereby spamming the VCS diffs?

As discussed in q/66678305, newer Jupyter versions store, in addition to the source code and output of each cell, an ID for the purpose of e.g. linking to a cell.
However, these IDs aren't stable but often change even when the cell's source code was not touched. As a result, if you have the .ipynb file under version control with e.g. git, the commits end up having lots of rather funny sounding “changed lines” that don't correspond to any actual change made in the commit. Like,
 {
   "cell_type": "code",
   "execution_count": null,
-  "id": "respected-breach",
+  "id": "incident-winning",
   "metadata": {},
   "outputs": [],
Is there a way to prevent this?
Answer for Git on Linux. Probably also works on MacOS, but not Windows.
It is good practice to not VCS the .ipynb files as saved by Jupyter, but instead a filtered version that does not contain all the volatile information. For this purpose, various git hooks are available; the one I'm using is based on https://github.com/toobaz/ipynb_output_filter/blob/master/ipynb_output_filter.py.
Strangely enough, it turns out this script cannot be modified to remove the "id" field from cells. Namely, if you try to remove that field in the filtering loop, like with
for field in ("prompt_number", "execution_number", "id"):
    if field in cell:
        del cell[field]
then the write function from jupyter_nbformat will just put an id back in. It is possible to merely change the id to something constant, but then Jupyter will complain about nonunique ids.
As a hack to circumvent this, I now use this filter with a simple grep to delete the ID:
#!/bin/bash
grep -v '^ *"id": "[a-z\-]*",$'
Store that in e.g. ~/bin/ipynb_output_filter.sh, make it executable (chmod +x ~/bin/ipynb_output_filter.sh) and ensure you have the following ~/.gitattributes file:
*.ipynb filter=dropoutput_ipynb
and in your git config (either global ~/.gitconfig or project)
[core]
    attributesfile = ~/.gitattributes
[filter "dropoutput_ipynb"]
    clean = ~/bin/ipynb_output_filter.sh
    smudge = cat
If you want to use a standard python filter in addition to that, you can invoke it before the grep in ~/bin/ipynb_output_filter.sh, like
#!/bin/bash
~/bin/ipynb_output_filter.py | grep -v '^ *"id": "[a-z\-]*",$'
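Equivalently, instead of editing the files by hand, the same setup can be registered from the command line (a sketch, assuming the filter script lives at ~/bin/ipynb_output_filter.sh as above):
echo '*.ipynb filter=dropoutput_ipynb' >> ~/.gitattributes
git config --global core.attributesfile ~/.gitattributes
git config --global filter.dropoutput_ipynb.clean ~/bin/ipynb_output_filter.sh
git config --global filter.dropoutput_ipynb.smudge cat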

How to include environment variables in an SQLite3 query

I'm working on writing a batch file to pull data from multiple SQLite3 databases. I need to set 2 system environment variables and then use those two environment variables within my SELECT statement, as this will run against multiple databases that share the same schema but belong to different site locations (currently 14 different locations). What I have come up with so far is my main queryrun.bat file:
@echo off
setx /m DateStart '20200101'
setx /m DateEnd '20200103'
sqlite3 SiteID40.db < query40.dat
TIMEOUT /t 3
sqlite3 SiteID41.db < query41.dat
Then the query#.dat files look like this:
.mode list
.separator ,
.output results40.csv
SELECT * FROM FilesActive WHERE CompactDate >= %DateStart% and CompactDate <= %DateEnd%;
.quit
%DateStart% and %DateEnd% are where I want to insert the system variables of the same name. I have tried using both $DateStart and %DateStart%, with no luck.
Any constructive help is appreciated. I must RE-ITERATE though that this is SQLite3 NOT MS SQL Server or mySQL.
Using the .param command in sqlite, something like this should work:
Change the .bat call from sqlite3 SiteID40.db < query40.dat to
sqlite3 SiteID40.db ".param set :DateStart %DateStart%" ".param set :DateEnd %DateEnd%" ".read query.dat"
Change the query#.dat file SELECTs to:
SELECT * FROM FilesActive WHERE CompactDate >= :DateStart and CompactDate <= :DateEnd;
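For completeness, if you ever drive the same query from a POSIX shell (e.g. on Linux or WSL) rather than a .bat file, the equivalent sketch uses ordinary shell variables with the same .param binding:
#!/bin/bash
DateStart=20200101
DateEnd=20200103
# bind the shell variables as named SQLite parameters, then run the shared query file
sqlite3 SiteID40.db ".param set :DateStart $DateStart" ".param set :DateEnd $DateEnd" ".read query.dat"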

How to display PostgreSQL System variables

I am looking for a bash utility, such as mysqladmin, that could list all system variable values on the running Postgres instance.
Is there a utility that could be used like mysqladmin: mysqladmin -pxxxxx variables?
Like, say:
psql -qAtc 'select * from pg_settings';
?
or, if you just want key/value:
psql -qAtc 'SELECT name, setting FROM pg_settings';
?
Note that these will show settings as they apply to the current user running the command. So if there are ALTER DATABASE ... SET ... or ALTER USER ... SET ... options in effect you'll see those values, not the underlying ones from postgresql.conf.
For more details on formatting and control over output, see man psql.
If you want human-readable output rather than machine-friendly output, use psql -qc, leaving out the -At (meaning "unaligned, tuples-only").
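If you only need a single parameter rather than the whole list, the same approach narrows down easily, for example:
psql -qAtc "SELECT current_setting('work_mem');"
psql -qAtc "SELECT name, setting, unit, source FROM pg_settings WHERE name = 'shared_buffers';"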

Avoid message "-- Loading resources from .sqliterc"

Minor problem, nevertheless irritating: is there a way to prevent the following message from appearing each time I make a query:
-- Loading resources from /Users/ThG/.sqliterc
As a stupid workaround, this works:
<. sqlite your_sqlite.db 'select * from your_table'
This is because the current code does this:
if( stdin_is_interactive ){
  utf8_printf(stderr,"-- Loading resources from %s\n",sqliterc);
}
Forcing a stdin redirect thwarts this due to this piece of code:
stdin_is_interactive = isatty(0);
This works as well:
sqlite -batch your_sqlite.db 'select * from your_table'
due to this piece of code:
}else if( strcmp(z,"-batch")==0 ){
  /* Need to check for batch mode here to so we can avoid printing
  ** informational messages (like from process_sqliterc) before
  ** we do the actual processing of arguments later in a second pass.
  */
  stdin_is_interactive = 0;
}
but it's longer, so kind of defeats the purpose.
I know that this question is PRETTY old now, but simply deleting '/Users/ThG/.sqliterc' should solve the problem. '.sqliterc' is a configuration file for sqlite's interactive command line front-end. If you don't spend a lot of time in there, you won't miss the file.
That resource msg comes out on stderr, and it's followed by a blank line, so you could get rid of it with something like this (wrapped up in a script file of its own):
#!/bin/bash
sqlite3 -init /your/init/file /your/sqlite3/file.db "
your
SQL
cmds
" 2>/dev/null | sed -e1d
When using sqlite in shell scripts, you usually don't even want your ~/.sqliterc to be loaded at all. This works well for me:
sqlite3 -init <(echo)
Explanation:
-init specifies the file to load instead of ~/.sqliterc.
<(echo) uses Process Substitution to provide a path to a temporary empty file.
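A complete invocation might then look like this (your_sqlite.db and the query are placeholders, as in the earlier examples):
sqlite3 -init <(echo) your_sqlite.db 'select * from your_table'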
A bit late, but #levant pied almost had the solution: you need to pass an additional -interactive to silence the "-- Loading resources from" message.
$ sqlite3 -batch -interactive
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
sqlite> .q
You can simply rename your config file to disable the warning, and revert the rename afterwards to keep the configuration for later use.
I use the following:
#suppress default configuration warnings
mv $HOME/.sqliterc $HOME/.backup.sqliterc
# sqlite3 scripts...
#revert
mv $HOME/.backup.sqliterc $HOME/.sqliterc
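If you want the rename to be undone even when one of the sqlite3 commands fails, you could wrap it in a trap; a sketch of the same idea:
#!/bin/bash
mv "$HOME/.sqliterc" "$HOME/.backup.sqliterc"
# restore the config no matter how the script exits
trap 'mv "$HOME/.backup.sqliterc" "$HOME/.sqliterc"' EXIT
# sqlite3 scripts...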
