Unix: concatenate multiple lines between specific keywords into a single line

My team gets Teradata DDL files generated through a front end tool. These files need to be corrected before executing.
One step in this is getting each DDL command onto a single line.
For example,
create table ABC
(column A varchar2(100),
column B number(10)
);
replace view ABC_v as
select columnA, column B from
ABC;
should change to
create table ABC (column A varchar2(100),column B number(10));
replace view ABC_v as select columnA, column B from ABC;
In short, I am looking to replace every new line character with single space in a multi-line string.
The string can start with either create, replace or drop and it will always end with a ; (semicolon)
Thanks in advance for your help

Here's a simple solution in shell:
#!/bin/sh
while read first rest; do
case "$first" in
create|replace|drop) echo "" ;;
esac
printf "%s %s " "$first" "$rest"
done < inputfile
echo ""
This adds a blank line to the beginning of the output because I'm lazy. But you see the logic, I'm sure. To avoid the blank line, you can use a temporary variable to determine whether you've actually started pulling in input data yet.
You could do something sort-of similar using awk:
awk '
BEGIN {
a["create"];
a["replace"];
a["drop"];
}
$1 in a && h {
print substr(h,2);h="";
}
{
h=h" "$0;
}
END {
print substr(h,2);
}
' inputfile
Instead of simply prepending a newline before keywords, this solution builds lines of output in variables, then prints them when they're complete.
Alternately, you could use sed to implement the same idea:
sed -rne '/^(create|replace|drop) /{;x;s/\n/ /g;/./p;d;};H;${;x;s/\n/ /g;p;}' inputfile
In all three of these solutions, I haven't bothered to check whether the input string ends in a semicolon. You can add that check to each of them once you decide how you want to handle that failure. (Report an error? Send the command via email? Ignore it?)
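As a rough illustration of the semicolon check mentioned above, here is a minimal Python sketch of the same keyword-based joining. The function name join_statements and the choice to return incomplete statements in a separate list are assumptions made for this example, not part of the shell, awk, or sed versions. It also assumes that a keyword at the start of a line always begins a new statement.

```python
KEYWORDS = ("create", "replace", "drop")  # statement starters named in the question


def join_statements(lines):
    """Collapse each multi-line statement into a single line.

    Returns (statements, incomplete): the joined statements that end in ';'
    and the ones that do not, so the caller decides how to handle the latter.
    """
    statements, incomplete, current = [], [], []

    def flush():
        # Emit the statement collected so far, if any.
        if not current:
            return
        joined = " ".join(current)
        (statements if joined.endswith(";") else incomplete).append(joined)
        current.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        first_word = line.split(None, 1)[0].lower()
        if first_word in KEYWORDS:
            flush()  # a keyword starts a new statement
        current.append(line)
    flush()
    return statements, incomplete
```

Called with the lines of an open file, it returns the one-line statements plus any that lacked the closing semicolon, which can then be reported or skipped as you prefer.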
Note also that DDL, like SQL, should be able to interpret commands provided on multiple lines. SQL is whitespace agnostic -- an unquoted newline should be the same as a space (though perhaps Teradata behaves differently).

Related

Can sort command be used to sort file based on multiple columns in a csv file

We have a requirement where we have a csv file with custom delimiter '||' (double-pipes) . We have 40 columns in the file and the file size is approximately between 400 to 500 MB.
We need to sort the file based on 2 columns, first on column 4 and then by column 17.
We found this command using which we can sort for one column, but not able to find a command which can sort based on both columns.
Since we use a delimiter with 2 characters, we are using awk command for sorting.
Command:
awk -F \|\| '{print $4}' abc.csv | sort > output.csv
Please advise.
If your inputs are not too fancy (no newlines in the middle of a record, for instance), the sort utility can almost do what you want, but it supports only one-character field separators. So || would not work. But wait, if you do not have other | characters in your files, we could just consider | as the field separator and account for the extra empty fields:
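sort -t'|' -k7,7 -k33,33 foo.csv
We sort by field 7 (instead of 4) and then by field 33 (instead of 17) because of these extra empty fields. The formula for the new field number is simply 2*N-1, where N is the original field number. Each key is written as -kN,N so that it covers that field only; a bare -kN would extend the key from field N to the end of the line.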
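The 2*N-1 renumbering can be checked with a few lines of Python. This is only an illustration on a made-up four-column line; the helper name single_pipe_field is hypothetical.

```python
def single_pipe_field(n):
    """Field number seen by sort -t'|' for the n-th '||'-separated column."""
    return 2 * n - 1


line = "a||b||c||d"
by_double = line.split("||")  # the real columns
by_single = line.split("|")   # what a one-character '|' separator sees

for n, value in enumerate(by_double, start=1):
    # sort numbers fields from 1, Python lists from 0, hence the "- 1"
    assert by_single[single_pipe_field(n) - 1] == value
```

Splitting on a single | leaves an empty field between every pair of real columns, which is exactly why column 4 becomes field 7 and column 17 becomes field 33.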
If you do have | characters inside your fields a simple solution is to substitute them all by one unused character, sort, and restore the original ||. Example with tabs:
sed 's/||/\t/g' foo.csv | sort -t$'\t' -k4,4 -k17,17 | sed 's/\t/||/g'
If tab is also used in your fields, choose any other unused character instead. Form feed (\f) or the file separator (ASCII code 28, that is, replace the 3 \t with \x1c) are good candidates.
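Using PROCINFO in gnu-awk, you can sort on a multi-character delimiter. Note that records sharing the same pair of key values overwrite each other in the array, so this only suits data where the pair (column 4, column 17) is unique:
awk -F '\\|\\|' '{a[$4,$17] = $0} END {
PROCINFO["sorted_in"]="@ind_str_asc"; for (i in a) print a[i]}' file.csv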
You could try the following awk code, written to follow your own attempt. It sets OFS to | (change the OFS value in the program if you want a different output field separator, such as a comma) and prints the 17th field along with the 4th. sort then uses the 1st and 2nd fields, because the original 4th and 17th fields are now the 1st and 2nd fields of its input. Note that only these two columns reach the output, not the whole record.
awk -F'\\|\\|' -v OFS='|' '{print $4,$17}' abc.csv | sort -t'|' -k1,1 -k2,2 > output.csv
The sort command works on physical lines, which may or may not be acceptable. CSV files can contain quoted fields which contain newlines, which will throw off sort (and most other Unix line-oriented utilities; it's hard to write a correct Awk script for this scenario, too).
If you need to be able to manipulate arbitrary CSV files, probably look to a dedicated utility, or use a scripting language with proper CSV support. For example, assume you have a file like this:
Title,Number,Arbitrary text
"He said, ""Hello""",2,"There can be
newlines and
stuff"
No problem,1,Simple undramatic single-line CSV
In case it's not obvious, CSV is fundamentally just a text file, with some restrictions on how it can be formatted. To be valid CSV, every record should be comma-separated; any literal commas or newlines in the data need to be quoted, and any literal quotes need to be doubled. There are many variations; different tools accept slightly different dialects. One common variation is TSV, which uses tabs instead of commas as delimiters.
Here is a simple Python script which sorts the above file on the second field.
import csv
import sys
# newline="" lets the csv module handle newlines inside quoted fields itself
with open("test.csv", "r", newline="") as csvfile:
    csvdata = csv.reader(csvfile)
    lines = [line for line in csvdata]
titles = lines.pop(0)  # comment out if you don't have a header
writer = csv.writer(sys.stdout)
writer.writerow(titles)  # comment out if you don't have a header
writer.writerows(sorted(lines, key=lambda x: x[1]))
Using sys.stdout for output is slightly unconventional; obviously, adapt to suit your needs. The Python csv library documentation is obviously not designed primarily to be friendly for beginners, but it should not be impossible to figure out, and it's not hard to find examples of working code.
In Python, sorted() returns a copy of a list in sorted order. There is also the list method sort(), which sorts a list in place. Both accept an optional key argument to specify a custom sort order. To sort on the 4th and 17th fields, use
sorted(lines, key=lambda x: (x[3], x[16]))
(Python's indexing is zero-based, so [3] is the fourth element.)
To use | as a delimiter, specify delimiter='|' in the csv.reader() and csv.writer() calls. Unfortunately, Python doesn't easily let you use a multi-character delimiter, so you might have to preprocess the data to switch to a single-character delimiter which does not occur in the data, or properly quote the fields which contain the character you selected as your delimiter.
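If the data contains no quoted fields, no embedded newlines, and no literal || inside a field, the csv module is not strictly needed and each line can simply be split on the two-character delimiter. The sketch below works under exactly those assumptions; the function name sort_double_pipe and its parameters are made up for this example. It compares fields as strings and holds the whole file in memory, which may be a concern at 400 to 500 MB.

```python
def sort_double_pipe(lines, columns=(4, 17), delimiter="||", has_header=False):
    """Sort delimited records on the given 1-based columns, compared as strings."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    header = [rows.pop(0)] if has_header and rows else []

    def key(row):
        fields = row.split(delimiter)
        # Python indexes from 0, so column c is fields[c - 1]
        return tuple(fields[c - 1] for c in columns)

    return header + sorted(rows, key=key)


# Small in-memory demonstration, sorting on columns 2 and 3
demo = ["x||b||2\n", "y||a||9\n", "z||b||1\n"]
for row in sort_double_pipe(demo, columns=(2, 3)):
    print(row)
```

For the real file, the same function can be passed an open file object instead of a list, and the returned rows written back out with a newline appended to each.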

How to select multiple columns which are not next to each other?

I have a dataset which I am trying to select the first 10 columns from, and the last 27 columns from (from the 125th column onwards to the final 152nd column).
awk 'BEGIN{FS="\t"} { printf $1,$2,$3,$4,$5,$6,$7,$8,$9,$10; for(i=125; i<=NF; ++i) printf $i""FS; print ""}' Bigdata.txt > Smalldata.txt
With trying this code it gives me the first 12 columns (with their data) and all the headers for all 152 columns from my original big data file. How do I select both columns 1-10 and 125-152 to go into a new file? I am new to linux and any guidence would be appreciated.
Don't reinvent the wheel: if you already know the number of columns, cut is the tool for this task.
$ cut -f1-10,125-152 bigdata
tab is the default delimiter.
If you don't know the number of columns, awk comes to the rescue!
$ cut -f1-10,$(awk -F'\t' '{print NF-27"-"NF; exit}' file) file
awk will print the end range by reading the first line of the file.
Using the KISS principle
awk 'BEGIN{FS=OFS="\t"}
{ c=""; for(i=1;i<=10;++i) { printf "%s%s", c, $i; c=OFS }
for(i=NF-27;i<=NF;++i) { printf "%s%s", c, $i }
printf "%s", ORS }' file
Could you please try the following? No sample input was provided, so I couldn't test it. You don't need to write out fields 1 to 10 by hand; a loop can do that too.
awk 'BEGIN{FS=OFS="\t"}{for(i=1;i<=10;i++){printf("%s%s",$i,OFS)};for(i=(NF-27);i<=NF;i++){printf("%s%s",$i,i==NF?ORS:OFS)}}' Input_file > output_file
You also don't need to worry about the headers: every line is printed the same way, with no line-specific logic, so the first line needs no special handling.
EDIT: One more point. It seems you want the values from both column ranges to appear on a single output line for each input line. The code above handles that: it prints OFS (a tab) between values and a newline only after the last field, so the selected fields of each Input_file line stay on one line.
Explanation: Adding detailed explanation here.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section here, which will be executed before Input_file is getting read.
FS=OFS="\t" ##Setting FS and OFS as TAB here.
} ##Closing BEGIN section here for this awk code.
{ ##Starting a new BLOCK which will be executed when Input_file is being read.
for(i=1;i<=10;i++){ ##Running a for loop which will run 10 times from i=1 to i=10 value.
printf("%s%s",$i,OFS) ##Printing value of specific field with OFS value.
} ##Closing for loop BLOCK here.
for(i=(NF-27);i<=NF;i++){ ##Starting a for loop over the last 28 fields (NF-27 to NF, i.e. columns 125 to 152 when NF is 152).
printf("%s%s",$i,i==NF?ORS:OFS) ##Printing field value and checking condition i==NF, if field is last field of line print new line else print OFS (a tab).
} ##Closing block for, for loop now.
}' Input_file > output_file ##Mentioning Input_file name here, whose output is going into output_file.
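For comparison, the same selection can be sketched in Python with list slicing. The function name select_columns and its defaults are assumptions for this example; tail=28 matches columns 125 to 152 of a 152-column file, mirroring the NF-27 used in the awk versions.

```python
def select_columns(line, head=10, tail=28, sep="\t"):
    """Keep the first `head` and the last `tail` fields of one delimited line."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) <= head + tail:
        # The two ranges would overlap, so keep the line unchanged
        return sep.join(fields)
    return sep.join(fields[:head] + fields[-tail:])


# Demonstration on a synthetic 152-column line holding the values 1..152
row = "\t".join(str(n) for n in range(1, 153))
print(select_columns(row))
```

Applied to every line of the input file, header included, this writes the first 10 and last 28 columns with the original tab separator.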

Passing variables to grep command in Tcl Script

I'm facing a problem while trying to pass a variable value to a grep command.
In essence, I want to grep out the lines which match my pattern and the pattern is stored in a variable. I take in the input from the user, and parse through myfile and see if the pattern exists(no problem here).
If it exists I want to display the lines which have the pattern i.e grep it out.
My code:
if {$a==1} {
puts "serial number exists"
exec grep $sn myfile } else {
puts "serial number does not exist"}
My input: SN02
My result when I run grep in Shell terminal( grep "SN02" myfile):
serial number exists
SN02 xyz rtw 345
SN02 gfs rew 786
My result when I try to execute grep in Tcl script:
serial number exists
The lines which match the pattern are not displayed.
Your (horrible IMO) indentation is not actually the problem. The problem is that exec does not automatically print the output of the exec'ed command*.
You want puts [exec grep $sn myfile]
This is because the exec command is designed to allow the output to be captured in a variable (like set output [exec some command])
* in an interactive tclsh session, as a convenience, the result of commands is printed. Not so in a non-interactive script.
To follow up on the "horrible" comment: your original code has no visual cues about where the "true" block ends and where the "else" block begins. Due to Tcl's word-oriented nature, it pretty much mandates the one true brace style of indentation.

How do you pass a parameter on awk command?

I tried this but it does not seem to work.
Please help thanks
TEST_STRING= test
echo Starting awk command
awk -v testString=$TEST_STRING'
BEGIN {
}
{
print testString
}
END {}
' file2
There are two problems here: You aren't actually assigning to TEST_STRING, and you're passing the program code in the same argument as the variable value. Both of these are caused by whitespace and quoting being in the wrong places.
TEST_STRING= test
...does not assign a value to TEST_STRING. Instead, it runs the command test, with an environment variable named TEST_STRING set to an empty value.
Perhaps instead you want
TEST_STRING=test
or
TEST_STRING=' test'
...if the whitespace is intentional.
Second, when passing a variable to awk with -v, the right-hand side should be double-quoted, and there must be unquoted whitespace between that value and the program to be passed to awk (or other values). That is to say:
awk -v testString=$TEST_STRING' BEGIN
...will, if TEST_STRING contains no whitespace, pass the BEGIN as part of the value of testString, not as a separate argument!
awk -v testString="$TEST_STRING" 'BEGIN
...on the other hand, ensures that the value of TEST_STRING is passed as part of the same argument as testString=, even if it contains whitespace -- and ensures that the BEGIN is passed as part of a separate argument.
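To see how the two command lines are divided into arguments, Python's shlex module can be used as a rough stand-in for the shell's word splitting. It follows POSIX-style quoting rules but performs no variable expansion, so the value test is written out literally here; treat this as an illustration of the quoting behaviour, not as a shell emulator.

```python
import shlex

# Quote opened right after the value: the program is glued onto the variable.
broken = shlex.split("awk -v testString=test' BEGIN {} {print testString}' file2")

# Whitespace between the quoted value and the quoted program: separate arguments.
fixed = shlex.split("awk -v testString=\"test\" 'BEGIN {} {print testString}' file2")

print(len(broken), broken)
print(len(fixed), fixed)
```

In the first case awk receives only four arguments and would treat file2 as its program text; in the second it receives five, with the program as its own argument.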

Field separator to be used only when it is not escaped, using awk

I have one question. Suppose I am using "=" as the field separator, and my string contains, for example,
abc=def\=jkl
If I use = as the field separator, it will be split into 3 fields:
abc def\ jkl
But since I have escaped the 2nd "=", my output should be
abc def\=jkl
Can anyone please suggest how I can achieve this?
Thanks in advance
I find it simplest to just convert the offending string to some other string or character that doesn't appear in your input records (I tend to use RS if it's not a regexp* since that cannot appear within a record, or the awk builtin SUBSEP otherwise since if that appears in your input you have other problems) and then process as normal other than converting back within each field when necessary, e.g.:
$ cat file
abc=def\=jkl
$ awk -F= '{
gsub(/\\=/,RS)
for (i=1; i<=NF; i++) {
gsub(RS,"\\=",$i)
print i":"$i
}
}' file
1:abc
2:def\=jkl
* The issue with using RS if it is an RE (i.e. multiple characters) is that the gsub(RS...) within the loop could match a string that didn't get resolved to a record separator initially, e.g.
$ echo "aa" | gawk -v RS='a$' '{gsub(RS,"foo",$1); print "$1=<"$1">"}'
$1=<afoo>
When the RS is a single character, e.g. the default newline, that cannot happen so it's safe to use.
If your data looks like the example in your question, it can be done.
awk doesn't support look-around in regexes, so it is a bit difficult to get what you want just by setting FS.
If I were you, I would do some preprocessing to make the data easier for awk to handle. Or you could read the line and use other awk functions, e.g. gensub(), to remove the = characters you don't want in the result, and then split... But I guess you want to achieve the goal through the field separator, so I won't go into those solutions.
However, it can be done with gawk's FPAT variable.
awk -vFPAT='\\w*(\\\\=)?\\w*' '...' file
This will work for your example. I am not sure if it will work for your real data.
Let's take an example and split this string: "abc=def\=jkl=foo\=bar=baz"
kent$ echo "abc=def\=jkl=foo\=bar=baz"|awk -vFPAT='\\w*(\\\\=)?\\w*' '{for(i=1;i<=NF;i++)print $i}'
abc
def\=jkl
foo\=bar
baz
I think you want that result, don't you?
my awk version:
kent$ awk --version|head -1
GNU Awk 4.0.2
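For comparison, in a language whose regex engine does support look-behind, the split can be written directly. The following Python sketch is not an awk solution; the name split_unescaped is made up for this example. It treats any = preceded by a backslash as escaped, so it does not handle an escaped backslash followed by a real separator.

```python
import re

# An '=' that is not immediately preceded by a backslash
UNESCAPED_EQUALS = re.compile(r"(?<!\\)=")


def split_unescaped(text):
    """Split text on '=' characters that are not escaped with a backslash."""
    return UNESCAPED_EQUALS.split(text)


for field in split_unescaped(r"abc=def\=jkl=foo\=bar=baz"):
    print(field)
```

The fields keep their backslashes, just as in the FPAT output above.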
