Extract a set of lines from a file using a separator in UNIX - unix

I have the 3 queries below in a single flat file. I want to extract one query at a time from the input file (e.g. the 2nd query). The separator between queries is ";" (semicolon). How can I do this?
input file: query.sql
select * from
DBNAME.table1;
select * from
DBNAME.table2
;
select * from
DBNAME.table3
WHERE date<= current_date-30;
output should be
Outputfile: query_out.sql
select * from
DBNAME.table2
;

You can use this awk,
awk 'BEGIN{RS=";"} NR==2{print $0}' yourfile.sql > output.sql

sat's answer does not trim blank lines between two sql requests, nor does it output the semicolon ending a request.
Provided you are using gawk (or any flavour of awk allowing RS to be a regular expression), the following will probably suit your needs:
awk 'BEGIN {RS=";[[:space:]]*"} NR==2 {printf "%s;\n",$0}' yourfile.sql
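For awk implementations where RS cannot be a regular expression, here is a minimal sketch of the same idea that keeps RS as the single character ";" and strips the leading whitespace inside the record instead. The function name extract_query and its argument order are hypothetical, chosen only for illustration.

```shell
# extract_query N FILE
# Print the Nth semicolon-terminated statement of FILE (hypothetical helper).
extract_query() {
    awk -v n="$1" 'BEGIN { RS = ";" }      # one record per statement
        NR == n {
            sub(/^[ \t\r\n]+/, "")          # drop whitespace left over after the previous ";"
            printf "%s;\n", $0              # RS was consumed, so put the ";" back
            exit
        }' "$2"
}
```

Called as `extract_query 2 query.sql > query_out.sql`, this should produce the second statement with its terminating semicolon.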

Related

Using awk on all columns for just part of column content

I am trying to find a solution for the following. I have a list of gene IDs in my first column and the related GO terms in all the other columns. The number of columns after each gene ID is therefore variable. The first few lines look like this:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
The GO terms are delimited with a tab. I want to keep the first column, with the IDs, and all the columns that contain "biological_process". But how do I do that using awk when there is no specific column to search in?
I basically want to use grep on columns, so I was trying something with awk (but I am not experienced in awk at all):
awk '/biological_process/' -> I get the full line
awk '{ print "biological_process" }' -> I only get biological process
Can someone help me out? Thanks!
AWK:
awk -F"GO:" '{printf "%s",$1}{for(i=2;i<=NF;i++) if ($i~/biological_process/)printf FS"%s",$i ;print ""}' file
1) -F"GO:" - use "GO:" string as separator
2) {printf "%s",$1} - print the first column (without new line)
3) for(i=2;i<=NF;i++) - run on all columns beside the first one
4) ($i~/biological_process/) - check if string exists in col
5) printf FS"%s",$i - if string exists in column print the separator and the string
6) print "" - print new line
input file used:
TRINITY_DN173118_c0_g1 GO:0000139^cellular_component^Golgi membrane
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0003677^molecular_function^DNA binding GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
output
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
Thanks to Ed Morton for the feedback; I have edited the answer :).
another similar awk
$ awk 'BEGIN {FS=OFS="\t"}
{line=$1;
for(i=2;i<=NF;i++) if($i~/biological_process/) line=line OFS $i;
print line}' file
TRINITY_DN173118_c0_g1
TRINITY_DN49436_c2_g1 GO:0006351^biological_process^transcription, DNA-templated
TRINITY_DN47442_c0_g1 GO:0006302^biological_process^double-strand break repair GO:0006310^biological_process^DNA recombination
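The tab-separated loop above can be wrapped so that the search term is a parameter rather than hard-coded. This is only a sketch: the function name filter_columns is hypothetical, and it assumes the input really is tab-delimited, as the question states.

```shell
# filter_columns PATTERN [FILE...]
# Keep column 1 plus every later tab-separated column matching PATTERN.
filter_columns() {
    pat=$1
    shift
    awk -v pat="$pat" 'BEGIN { FS = OFS = "\t" }
        {
            line = $1                                   # always keep the ID column
            for (i = 2; i <= NF; i++)
                if ($i ~ pat) line = line OFS $i        # append only matching columns
            print line
        }' "$@"
}
```

For example, `filter_columns biological_process file` would behave like the command above, and `filter_columns molecular_function file` would select the other term type.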

unix command to select file using delimiter "--"

sample1 presentation -- www.test0.com
command line input -- www.test1.com
...
In Unix, which command can I use to select only the second half, using the delimiter " -- "? I tried the 'cut' command, but cut -d only takes a single-character delimiter, so ' -- ' won't work since it has 4 characters.
You can use many tools to do this, here is an example in awk:
awk -F' -- ' '{ print $2 }' infilename
-F specifies the delimiter to split each line on. Here it is the full " -- " string, spaces included, so $2 is the second element of the line with no leading space left over.
sed 's/^.*-- \(.*\)/\1/' filename
will get you the field after -- in all lines of filename.
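As a sketch of two ways to split on the full " -- " delimiter, spaces included, so that the result carries no leading space: one uses awk with a multi-character field separator, the other uses only shell parameter expansion. Both function names are hypothetical.

```shell
# second_half [FILE...]: print what follows " -- " on each line, using awk.
second_half() {
    awk -F' -- ' '{ print $2 }' "$@"
}

# second_half_sh: the same for lines read from standard input, in plain sh.
# ${line#* -- } removes the shortest prefix ending in " -- ".
second_half_sh() {
    while IFS= read -r line; do
        printf '%s\n' "${line#* -- }"
    done
}
```

One difference to be aware of: for a line that contains no " -- ", the awk version prints an empty line, while the shell version prints the line unchanged.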

Need help parsing a file via UNIX commands

I have a file that has lines that look like this
LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=ABCD,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=ABCD,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=XYZ,&FIELD2-0&FIELD3-1&FIELD9-0
LINEID3:FIELD1=XYZ,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=PQRS,&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=PQRS,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=PQRS,&FIELD7-0&FIELD8-0;
I'm interested only in the lines that begin with LINEID1, and only in some elements (FIELD1, FIELD2, FIELD4 and FIELD9) from those lines. The output should look like this (no & signs; they can be replaced with |):
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0;
FIELD1=PQRS|FIELD4-0|FIELD9-0;
If additional information is required, do let me know, I'll post them in edits. Thanks!!
This is not exactly what you asked for, but no-one else is answering and it is pretty close for you to get started with!
awk -F'[&:]' '/^LINEID1:/{print $2,$3,$5,$6}' OFS='|' file
Output
FIELD1=ABCD,|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ,|FIELD2-0|FIELD9-0|
FIELD1=PQRS,|FIELD3-1|FIELD9-0;|
The -F sets the Input Field Separator to colon or ampersand. Then it looks for lines starting LINEID1: and prints the fields you need. The OFS sets the Output Field Separator to the pipe symbol |.
Pure awk:
awk -F ":" ' /LINEID1[^0-9]/{gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2); gsub(/,*&+/,"|",$2); print $2} ' file
Updated to give proper formatting and to omit LINEID11, etc...
Output:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
Explanation:
awk -F ":" - split lines into LHS ($1) and RHS ($2) since output only requires RHS
/LINEID1[^0-9]/ - return only lines that match LINEID1 and also ignores LINEID11, LINEID100 etc...
gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2) - remove all fields that aren't 1, 2, 4 or 9 on the RHS
gsub(/,*&+/,"|",$2) - clean up the leftover delimiters on the RHS
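A variant that selects the wanted fields by name, rather than deleting the unwanted ones, might look like the sketch below. It assumes every line has the shape LINEID:FIELD1=value,&FIELDn-v&... shown in the question; the function name pick_fields is hypothetical.

```shell
# pick_fields [FILE...]
# For LINEID1 lines only, print FIELD1 plus FIELD2/FIELD4/FIELD9, joined by "|".
pick_fields() {
    awk -F: '$1 == "LINEID1" {                  # exact match, so LINEID11 etc. are skipped
        n = split($2, part, "&")                # part[1] is "FIELD1=...,"
        out = part[1]
        sub(/,$/, "", out)                      # drop the comma after FIELD1
        for (i = 2; i <= n; i++)
            if (part[i] ~ /^FIELD[249]-/)       # keep only the wanted fields
                out = out "|" part[i]
        print out
    }' "$@"
}
```

Because the wanted field numbers live in one bracket expression, changing the selection means editing only `[249]`.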
To select rows from data with Unix command lines, use grep, awk, perl, python, or ruby (in increasing order of power & possible complexity).
To select columns from data, use cut, awk, or one of the previously mentioned scripting languages.
First, let's get only the lines with LINEID1 (assuming the input is in a file called input).
grep '^LINEID1' input
will output all the lines beginning with LINEID1.
Next, extract the columns we care about:
grep '^LINEID1' input | # extract lines with LINEID1 in them
cut -d: -f2 | # extract column 2 (after ':')
tr ',&' '\n\n' | # turn ',' and '&' into newlines
egrep 'FIELD[1249]' | # extract only fields FIELD1, FIELD2, FIELD4, FIELD9
tr '\n' '|' | # turn newlines into '|'
sed -e $'s/\\|\\(FIELD1\\)/\\\n\\1/g' -e 's/\|$//'
The last line inserts newlines in front of the FIELD1 lines, and removes any trailing '|'.
That last sed pattern is a little more challenging because sed doesn't like literal newlines in its replacement patterns. To put a literal newline, a bash escape needs to be used, which then requires escapes throughout that string.
Here's the output from the above command:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
This command took only a couple of minutes to cobble together.
Even so, it's bordering on the complexity threshold where I would shift to perl or ruby because of their excellent string processing.
The same script in ruby might look like:
#!/usr/bin/env ruby
#
while line = gets do
  if line.chomp =~ /^LINEID1:(.*)$/
    f1, others = $1.split(',')
    fields = others.split('&').map {|f| f if f =~ /FIELD[1249]/}.compact
    puts [f1, fields].flatten.join("|")
  end
end
Running this script on the same input file produces the same output as above:
$ ./parse-fields.rb < input
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;

Unix Concat multiple lines between specific keywords into a single line

My team gets Teradata DDL files generated through a front end tool. These files need to be corrected before executing.
A step in this is getting the DDL command on a single line
E.g.
create table ABC
(column A varchar2(100),
column B number(10)
);
replace view ABC_v as
select columnA, column B from
ABC;
should change to
create table ABC (column A varchar2(100),column B number(10));
replace view ABC_v as select columnA, column B from ABC;
In short, I am looking to replace every new line character with single space in a multi-line string.
The string can start with either create, replace or drop and it will always end with a ; (semicolon)
Thanks in advance for your help
Here's a simple solution in shell:
#!/bin/sh
while read first rest; do
  case "$first" in
    create|replace|drop) echo "" ;;
  esac
  printf "%s %s " "$first" "$rest"
done < inputfile
echo ""
This adds a blank line to the beginning of the output because I'm lazy. But you see the logic, I'm sure. To avoid the blank line, you can use a temporary variable to determine whether you've actually started pulling in input data yet.
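The temporary-variable idea can be sketched as follows. The flag name started and the function name join_ddl are hypothetical; the function reads from standard input instead of a fixed inputfile so it can be used in a pipeline.

```shell
# join_ddl: join DDL lines read from stdin, starting a new output line at each
# create/replace/drop keyword, without emitting a blank line at the top.
join_ddl() {
    started=0                       # becomes 1 once any input line has been printed
    while read -r first rest; do
        case "$first" in
            create|replace|drop)
                # break the line before a new statement, but not before the first one
                if [ "$started" -eq 1 ]; then echo ""; fi
                ;;
        esac
        started=1
        printf '%s %s ' "$first" "$rest"
    done
    echo ""
}
```

Usage would be `join_ddl < inputfile`. As in the original script, each output line keeps a trailing space.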
You could do something sort-of similar using awk:
awk '
  BEGIN {
    a["create"];
    a["replace"];
    a["drop"];
  }
  $1 in a && h {
    print substr(h,2);h="";
  }
  {
    h=h" "$0;
  }
  END {
    print substr(h,2);
  }
' inputfile
Instead of simply prepending a newline before keywords, this solution builds lines of output in variables, then prints them when they're complete.
Alternately, you could use sed to implement the same idea:
sed -rne '/^(create|replace|drop) /{;x;s/\n/ /g;/./p;d;};H;${;x;s/\n/ /g;p;}' inputfile
In all three of these solutions, I haven't bothered to check whether the input string ends in a semicolon. You can add that check to each of them once you decide how you want to handle that failure. (Report an error? Send the command via email? Ignore it?)
Note also that DDL, like SQL, should be able to interpret commands provided on multiple lines. SQL is whitespace agnostic -- an unquoted newline should be the same as a space (though perhaps Teradata behaves differently).
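Since the question says every statement ends with a semicolon, another option is to key on that terminator instead of the leading keyword. The sketch below joins lines until one ends in ";" and prints any unterminated remainder at the end, which is just one of the possible ways to handle that failure case. The function name join_statements is hypothetical.

```shell
# join_statements [FILE...]
# Join input lines with single spaces; a line ending in ";" completes a statement.
join_statements() {
    awk '{
        sub(/^[ \t]+/, "")                          # trim leading blanks
        sub(/[ \t\r]+$/, "")                        # trim trailing blanks and CR
        if ($0 == "") next                          # ignore empty lines
        stmt = (stmt == "" ? $0 : stmt " " $0)      # append with one space
        if ($0 ~ /;$/) { print stmt; stmt = "" }    # terminator seen: emit statement
    }
    END { if (stmt != "") print stmt }              # leftover text without a ";"
    ' "$@"
}
```

Note that this replaces each newline with a space, as the question requests, so the first sample statement comes out with a space after the comma and before the closing parenthesis.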

Field separator to be used only when not escaped, using awk

I have one question. Suppose I am using "=" as the field separator, and my string contains, for example,
abc=def\=jkl
If I use = as the field separator, it will be split into 3 fields:
abc def\ jkl
but since I have escaped the 2nd "=", my output should be
abc def\=jkl
Can anyone please suggest how I can achieve this?
Thanks in advance
I find it simplest to just convert the offending string to some other string or character that doesn't appear in your input records (I tend to use RS if it's not a regexp* since that cannot appear within a record, or the awk builtin SUBSEP otherwise since if that appears in your input you have other problems) and then process as normal other than converting back within each field when necessary, e.g.:
$ cat file
abc=def\=jkl
$ awk -F= '{
    gsub(/\\=/,RS)
    for (i=1; i<=NF; i++) {
        gsub(RS,"\\=",$i)
        print i":"$i
    }
}' file
1:abc
2:def\=jkl
* The issue with using RS if it is an RE (i.e. multiple characters) is that the gsub(RS...) within the loop could match a string that didn't get resolved to a record separator initially, e.g.
$ echo "aa" | gawk -v RS='a$' '{gsub(RS,"foo",$1); print "$1=<"$1">"}'
$1=<afoo>
When the RS is a single character, e.g. the default newline, that cannot happen so it's safe to use.
If your data looks like the example in your question, it can be done.
awk doesn't support look-around in regular expressions, so it is difficult to get what you want by setting FS alone.
If I were you, I would do some preprocessing to make the data easier for awk to handle. Or you could read the line and use other awk functions, e.g. gensub(), to remove the = characters you don't want in the result, and then split... But I guess you want to achieve this through the field separator, so I won't give those solutions here.
However, it can be done with the FPAT variable (available in gawk).
awk -vFPAT='\\w*(\\\\=)?\\w*' '...' file
this will work for your example. I am not sure if it will work for your real data.
let's make an example, to split this string: "abc=def\=jkl=foo\=bar=baz"
kent$ echo "abc=def\=jkl=foo\=bar=baz"|awk -vFPAT='\\w*(\\\\=)?\\w*' '{for(i=1;i<=NF;i++)print $i}'
abc
def\=jkl
foo\=bar
baz
I think you want that result, don't you?
my awk version:
kent$ awk --version|head -1
GNU Awk 4.0.2
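FPAT is specific to gawk. For other awk implementations, the placeholder idea from the first answer can be sketched with split() and SUBSEP, restoring the escaped sequence with index() and substr() so that no backslash ever appears in a gsub replacement string. The function name split_unescaped is hypothetical, and the sketch assumes SUBSEP does not occur in the input.

```shell
# split_unescaped [FILE...]
# Split each line on "=" but treat "\=" as part of the field; one field per line.
split_unescaped() {
    awk '{
        gsub(/\\=/, SUBSEP)                         # hide every escaped "\=" behind a placeholder
        n = split($0, part, "=")                    # split on the remaining, unescaped "="
        for (i = 1; i <= n; i++) {
            s = part[i]
            while ((p = index(s, SUBSEP)) > 0)      # put each "\=" back
                s = substr(s, 1, p - 1) "\\=" substr(s, p + 1)
            print s
        }
    }' "$@"
}
```

Run on the string abc=def\=jkl=foo\=bar=baz, this should print the same four fields as the FPAT example above.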
