I need to perform a count on a Hive table, output the result into a text file, and drop it at another location as a trigger.
The hive output currently looks like this:
+-------------+----------+
|     _c0     |   _c1    |
+-------------+----------+
| 2020-03-01  | 3203500  |
+-------------+----------+
I tried options like the following:
hive -e 'select CURRENT_DATE, count(*) from db.table;' | sed 's/[[:space:]]\+/,/g' > /trigger/trigger_file.txt
But it's not giving the expected result. What else can I try?
The expected outcome inside the .txt file is as follows:
2020-03-01,3203500
You may replace your sed command with
awk -F'[| ]+' '$2 ~ /[0-9]{4}-[0-9]{2}-[0-9]{2}/{print $2","$3}'
The -F'[| ]+' sets the field separator to the regex [| ]+, which matches one or more pipe or space characters. The script then grabs all records whose second field matches a date-like pattern ([0-9]{4}-[0-9]{2}-[0-9]{2}) and prints their second and third column values with a comma in between.
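For example, plugged into the command from the question (same table and trigger path, and assuming hive -e really does print the boxed table shown above), the full pipeline would look like:
hive -e 'select CURRENT_DATE, count(*) from db.table;' | awk -F'[| ]+' '$2 ~ /[0-9]{4}-[0-9]{2}-[0-9]{2}/{print $2","$3}' > /trigger/trigger_file.txt
On the sample output above this prints only the data row, giving 2020-03-01,3203500 in the file; the border and header lines have no date-like second field and are skipped.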
To avoid all the replacing with sed etc., try this approach using concat_ws(',', col1, col2, ...); the output will then already be comma-separated:
hive -e "select CONCAT_WS(',', CAST(CURRENT_DATE AS STRING), CAST(count(*) AS STRING)) from Mytable" > /home/user/Mycsv.csv
(The query is wrapped in double quotes so the literal comma passed to CONCAT_WS survives the shell, and the values are cast because CONCAT_WS expects string arguments.)
Hive provides an inbuilt command to write into files:
INSERT OVERWRITE LOCAL DIRECTORY '/home/docs/temp' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select * from db.table;
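Hive writes the result as one or more plain files inside that directory (typically named something like 000000_0; the exact names aren't guaranteed), so a follow-up step can concatenate them into the trigger file, for example:
cat /home/docs/temp/* > /trigger/trigger_file.txt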
Another way:
hive -S -e 'set hive.cli.print.header=false; select * from db.table' | sed 's/[[:space:]]\+/,/g' > /home/docs/temp.csv
Related
I have a .sql file containing hundreds of Hive queries, and I want their output in multiple files: for the 1st query an abc.txt file gets created, for the 2nd query an xyz.txt file gets created, and so on, so that 100 queries produce 100 output files with their results.
If your main .sql file has semicolon-separated SQL queries, you could use an awk command like this to generate separate hive commands with output files.
tr '\n' ' ' < yourqueryfile | awk 'BEGIN {RS=";"} \
{gsub(/(^ +| +$)/, "", $0);printf "hive -e \"%s\" >OUT_"NF".txt\n",$0}'
RS=";" - sets record separator to ";"
tr - to replace newlines between queries to single space.
gsub - to trim leading and trailing spaces.
The command will generate multiple hive command lines like this.
hive -e "select 1" >OUT_2.txt
hive -e "select 3 from ( select 4 )" >OUT_7.txt
hive -e "select name from t union select n from t2" >OUT_9.txt
hive -e "select * from c" >OUT_4.txt
I have a file that has lines that look like this
LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=ABCD,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=ABCD,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=XYZ,&FIELD2-0&FIELD3-1&FIELD9-0
LINEID3:FIELD1=XYZ,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=PQRS,&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=PQRS,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=PQRS,&FIELD7-0&FIELD8-0;
I'm interested only in the lines that begin with LINEID1, and only in some elements (FIELD1, FIELD2, FIELD4 and FIELD9) from those lines. The output should look like this (no & signs; they can be replaced with |):
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0;
FIELD1=PQRS|FIELD4-0|FIELD9-0;
If additional information is required, do let me know and I'll post it in an edit. Thanks!!
This is not exactly what you asked for, but no one else has answered yet and it is close enough to get you started!
awk -F'[&:]' '/^LINEID1:/{print $2,$3,$5,$6}' OFS='|' file
Output
FIELD1=ABCD,|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ,|FIELD2-0|FIELD9-0|
FIELD1=PQRS,|FIELD3-1|FIELD9-0;|
The -F sets the Input Field Separator to colon or ampersand. Then it looks for lines starting LINEID1: and prints the fields you need. The OFS sets the Output Field Separator to the pipe symbol |.
Pure awk:
awk -F ":" ' /LINEID1[^0-9]/{gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2); gsub(/,*&+/,"|",$2); print $2} ' file
Updated to give proper formatting and to omit LINEID11, etc...
Output:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
Explanation:
awk -F ":" - split lines into LHS ($1) and RHS ($2) since output only requires RHS
/LINEID1[^0-9]/ - return only lines that match LINEID1 and also ignores LINEID11, LINEID100 etc...
gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2) - remove all fields that aren't 1, 4 or 9 on the RHS
gsub(/,*&+/,"|",$2) - clean up the leftover delimiters on the RHS
To select rows from data with Unix command lines, use grep, awk, perl, python, or ruby (in increasing order of power & possible complexity).
To select columns from data, use cut, awk, or one of the previously mentioned scripting languages.
First, let's get only the lines with LINEID1 (assuming the input is in a file called input).
grep '^LINEID1' input
will output all the lines beginning with LINEID1.
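On the sample input above, that first step alone prints just the LINEID1 lines:
LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID1:FIELD1=XYZ,&FIELD2-0&FIELD3-1&FIELD9-0
LINEID1:FIELD1=PQRS,&FIELD3-1&FIELD4-0&FIELD9-0;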
Next, extract the columns we care about:
grep '^LINEID1' input | # extract lines with LINEID1 in them
cut -d: -f2 | # extract column 2 (after ':')
tr ',&' '\n\n' | # turn ',' and '&' into newlines
egrep 'FIELD[1249]' | # extract only fields FIELD1, FIELD2, FIELD4, FIELD9
tr '\n' '|' | # turn newlines into '|'
sed -e $'s/\\|\\(FIELD1\\)/\\\n\\1/g' -e 's/\|$//'
The last line inserts newlines in front of the FIELD1 lines, and removes any trailing '|'.
That last sed pattern is a little more challenging because sed doesn't accept literal newlines in its replacement text. To get one in, a bash $'...' escape is used, which then requires extra escaping throughout that string.
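For reference, just before that last sed stage the stream is one long '|'-joined line (reconstructed from the sample input above):
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;|FIELD1=XYZ|FIELD2-0|FIELD9-0|FIELD1=PQRS|FIELD4-0|FIELD9-0;|
The sed breaks it apart at every |FIELD1 boundary and strips the trailing |.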
Here's the output from the above command:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
This command took only a couple of minutes to cobble up.
Even so, it's bordering on the complexity threshold where I would shift to perl or ruby because of their excellent string processing.
The same script in ruby might look like:
#!/usr/bin/env ruby
#
while line = gets do
  if line.chomp =~ /^LINEID1:(.*)$/
    f1, others = $1.split(',')
    fields = others.split('&').map {|f| f if f =~ /FIELD[1249]/}.compact
    puts [f1, fields].flatten.join("|")
  end
end
Running this script on the same input file produces the same output as above:
$ ./parse-fields.rb < input
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
I am new to this.
I have a file with more than 1000 records. Each column is separated with a pipe delimiter "|".
I would like to see only the records which have empty columns.
Waiting for a valuable answer.
Thanks,
Sandeep
With egrep you can grep for several patterns at once.
You would like to find lines starting with an empty column (^|),
lines with an empty column in the middle (||), and lines
ending with an empty column (|$).
egrep uses | as an OR operator in the pattern, so all the | characters above must be escaped with a \ character.
Result:
$ cat file
a1|b1|c1
|b2|c2
a3||c3
a4|b4|
$ egrep "^\||\|\||\|$" file
|b2|c2
a3||c3
a4|b4|
If I have sample file data like this:
|void|bar
foo|void|bar
foo||bar
foo|void|
I have to look for ||, | at the beginning and | at the end. This works for me:
cat data | grep "^|\|||\||$"
Explanation of grep:
^| - beginning of line
\| - OR
|| - two of them
\| - OR
|$ - end of line
Output:
|void|bar
foo||bar
foo|void|
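An awk equivalent of the same idea, splitting on | and printing any line that has at least one empty field (a sketch, not from the original answers; data is the sample file name used above):
awk -F'|' '{for (i=1; i<=NF; i++) if ($i == "") {print; next}}' data
It prints the same three lines as the grep above.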
I have a large list of ids that I want to use in a query in sqlite3. I can loop over them one by one in the shell, Perl or R, or do clever hacking with xargs or comma-concatenation in order to query in batches and be more efficient, but I was wondering whether there is a way of loading the file directly into a temp table, or doing a 'where in ([read file])'. What is the standard way of dealing with this common situation?
You mean sqlite3 the command-line shell, right?
Yes, that one has an option to load a file, in something-separated format, into a (temporary) table.
create temporary table ids (id integer primary key);
.import ids.txt ids
Where ids.txt would have each id on a separate line. If you wanted to import multiple columns, you'd have to set the separator using the .separator command beforehand, but for only one column it shouldn't matter.
Note that integer primary key is faster than any other column type because it aliases the rowid, and SQLite stores tables in b-trees indexed by rowid. So use it if you can.
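A minimal end-to-end sketch of that (the database, table and column names here are placeholders, not from the question):
sqlite3 mydb.sqlite <<'EOF'
CREATE TEMPORARY TABLE ids (id INTEGER PRIMARY KEY);
.import ids.txt ids
SELECT t.* FROM mytable t JOIN ids ON ids.id = t.id;
EOF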
I came up with a long one-line solution: create a string of comma-separated and quoted elements, and pipe it to xargs. xargs will put as many of them as the command line allows and repeat the command as many times as necessary.
If I have a tab-separated file with my elements in the second column:
perl -lane 'push @x,$F[1]; END{ print join(", ",map{q{\"}.$_.q{\"}}@x)}' file.txt |\
xargs -t -I{} sqlite3 -separator $'\t' -header database.db \
'select * from xtable where mycolumn in ('{}') and myothercolumn="foo" order by bar'
I need to join two files on two fields. However, I should retrieve all the values in file 1 even if the join fails, like a left outer join.
File 1:
01|a|jack|d
02|b|ron|c
03|d|tom|e
File 2:
01|a|nemesis|f
02|b|brave|d
04|d|gorr|h
output:
01|a|jack|d|nemesis|f
02|b|ron|c|brave|d
03|d|tom|e||
It's join -t '|' file1 file2 -a1
Options used:
t: Delimiter.
a: Decides the file number from which the unpaired lines have to be printed.
join -t '|' file1 file2 -a2 would do a right outer join.
Sample Run
[aman@aman test]$ cat f1
01|a|jack|d
02|b|ron|c
03|d|tom|e
[aman@aman test]$ cat f2
01|a|nemesis|f
02|b|brave|d
04|d|gorr|h
[aman@aman test]$ join -t '|' f1 f2 -a1
01|a|jack|d|a|nemesis|f
02|b|ron|c|b|brave|d
03|d|tom|e
To do exactly what the question asks is a bit more complicated than the previous answer and would require something like this:
sed 's/|/:/2' file1 | sort -t: >file1.tmp
sed 's/|/:/2' file2 | sort -t: >file2.tmp
join -t':' file1.tmp file2.tmp -a1 -e'|' -o'0,1.2,2.2' | tr ':' '|'
Unix join can only join on a single field AFAIK, so to "join two files on two fields" you must rewrite the files with a different delimiter separating the join fields from the rest; here the first two fields form the key. I'll use a colon :, however if : exists in any of the input you would need to use something else; a tab character, for example, might be a better choice for production use. I also re-sort on the new compound field, sort -t:, which for the example input files makes no difference but would for real-world data. sed 's/|/:/2' replaces the second occurrence of a pipe with a colon on each line of the file.
file1.tmp
01|a:jack|d
02|b:ron|c
03|d:tom|e
file2.tmp
01|a:nemesis|f
02|b:brave|d
04|d:gorr|h
Now we use join output filtered by tr with a few more advanced options:
-t':' specify the interim colon delimiter
-a1 left outer join
-e'|' specifies the replacement string for failed joins, basically the final output delimiter N-1 times where N is the number of pipe delimited fields joined to the right of the colon in file2.tmp. In this case N=2 so one pipe character.
-o'0,1.2,2.2' specifies the output format:
0 join field
1.2 field 2 of file1.tmp, i.e. everything right of colon
2.2 field 2 of file2.tmp
tr ':' '|' Finally we translate the colons back to pipes for the final output.
The output now matches the question's sample output exactly, which the previous answer did not:
01|a|jack|d|nemesis|f
02|b|ron|c|brave|d
03|d|tom|e||
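For comparison, a single-pass awk sketch of the same two-field left outer join (it assumes, as in the sample data, that both files are pipe-delimited with the key in the first two fields and two payload fields in file 2):
awk -F'|' 'NR==FNR{a[$1 FS $2]=$3 FS $4; next} {k=$1 FS $2; s=(k in a)?a[k]:FS; print $0 FS s}' file2 file1
It reads file2 into an array keyed on the first two fields, then appends the matching payload (or two empty fields) to every line of file1, giving the same three output lines as above.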
I recently had this issue with a very simple input file, just one field, hence no considerations of delimiters.
cat file1 > k1
cat file2 >> k1
sort k1 | uniq -c | grep "^ *1 "
will give you lines that occur in only 1 file
This is a special case; it may not be applicable or comparable to the techniques posted above, but I'm putting it out there in case it's useful to someone who's looking for left outer joins (i.e. unmatched cases only). Grepping for "^ *2 " will give you the matched cases. If you have a multi-field file (the more common case) but you only care about a single join field, you can use Awk to create a key-only file for each input file and then process as above, as in the sketch below.
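A sketch of that last idea, assuming pipe-delimited files with the join key in the first field (the delimiter and field number are assumptions):
awk -F'|' '{print $1}' file1 > k1
awk -F'|' '{print $1}' file2 >> k1
sort k1 | uniq -c | grep "^ *1 "
As before, keys with a count of 1 appear in only one of the two files.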