Text file formatting using regular expression

I am trying to format the text file below. The record order will always be like this:
Dept 0100 Batch Load Errors for 8/16/2016 4:45:56 AM
Case 1111111111
Rectype: ABCD
Key:UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
Case 2222222222
Rectype: ABCD
Key:UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
NTNB ERROR :Invalid Value NTNB_MCTR_SUBJ=AMOD
Case 3333333333
Rectype: WXYZ
Key:UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
into this output:
1111111111~ABCD~UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
2222222222~ABCD~UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID|NTNB ERROR :Invalid Value NTNB_MCTR_SUBJ=AMOD
3333333333~WXYZ~UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
I tried the regular expression below:
sed -r '/^Case/!d;$!N;/\nRectype/!D;s/\s+$/ /;s/(.*)\n(.*)/\2\1\n\1/;P;D' file.txt
but it only works up to the Rectype row; I can't get the rest.
Thank you.

It seems to me that you're not really looking for a regular expression. You're looking for text reformatting, and you appear to have selected regular expression matching in sed as the method by which you'll process fields.
Read about XY problems here. Thankfully, you've included raw data and expected output, which is AWESOME for a new StackOverflow member. (Really! Yay you!) So I can recommend an alternative that will likely work better for you.
It is awk, another command-line tool which, like sed, is installed on virtually every unix-like system on the planet.
$ awk -v RS= -v OFS="~" '!/^Case/{next} {sub(/^Key:/,"",$5); key=$5; for (f=6;f<=NF;f++) { if ($f=="NTNB") key=key "|"; else if ($f=="UMSV") key=key OFS; else key=key " "; key=key $f } print $2,$4,key}' inp2
1111111111~ABCD~UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
2222222222~ABCD~UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID|NTNB ERROR :Invalid Value NTNB_MCTR_SUBJ=AMOD
3333333333~WXYZ~UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
Here's what's going on.
awk -v RS= - This is important. It sets a "null" record separator, which tells awk that we're dealing with multi-line records. Records are terminated by a blank line, and fields within this record are separated by whitespace. (Space, tab, newline.)
-v OFS="~" - Set an output field separator of tilde, for convenience.
!/^Case/{next} - If the current record doesn't start with the word "Case", it's not a record we can handle, so skip it.
sub(/^Key:/,"",$5); key=$5; - Trim the word Key from the beginning of the fifth field, save the field to a variable.
for (f=6;f<=NF;f++) { - Step through the remaining fields
if ($f=="NTNB") key=key "|"; - setting the appropriate field separator.
else if ($f=="UMSV") key=key OFS; - ...
else key=key " "; - Or space if the text doesn't look like a new field.
key=key $f } - Finally, add the current field to our running variable,
print $2,$4,key} - and print everything.
NOTE: One thing this doesn't do is maintain spacing as you've shown in your "expected output" in your question. Two or more spaces will always be shrunk to just one space, since within each record, fields are separated by whitespace.
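To see paragraph mode in isolation, here is a minimal sketch on a made-up two-record sample (the Case numbers and the doubled space after "Rectype:" are illustrative, not taken from your file). It prints the record number, the field count and two of the fields, which should show both the blank-line record splitting and the fact that a run of whitespace counts as a single separator:

```shell
# Hypothetical sample: two records separated by one blank line.
# Note the two spaces after "Rectype:" in the first record.
out=$(printf 'Case 1111\nRectype:  ABCD\n\nCase 2222\nRectype: WXYZ\n' |
  awk -v RS= -v OFS='~' '{ print NR, NF, $2, $4 }')
printf '%s\n' "$out"
```

Each record comes out as one line with four fields, regardless of how the whitespace was laid out in the input.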
UPDATE per comments
Windows uses \r\n (CRLF) to end lines, whereas unix/linux use just \n (LF). Since your file is being generated in Windows, the "blank" lines actually contain an invisible CR, and awk never sees a record separator.
To see the "real" contents of your file, you can use tools like hexdump or od. For example:
$ printf 'foo\r\nbar\r\n' | od -c
0000000 f o o \r \n b a r \r \n
0000012
In your case, simply run:
$ od -c filename | less
(Or use more if less isn't available.)
Many systems have a package available called dos2unix which can convert text format.
If you don't have dos2unix available, you should be able to achieve the same thing using other tools. In GNU sed (read man sed to see if you have a -i option):
sed -i 's/\r$//' filename
Or in other sed variants, but with a shell (like bash) that supports $'...' quoting, so that the shell itself turns \r into a carriage return:
sed $'s/\r$//' inputfile > outputfile
Or a little less precisely, as it will remove all CRs even if they're not at the end of the line, you could use tr:
tr -d '\015' < inputfile > outputfile
Or if perl is available, you can use a substitution expression that's almost identical to the one for sed (perl documentation is readily available to tell you what the options do):
perl -i -pe 's/\r\n$/\n/g' filename
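If you want to convince yourself that the CRs really are the problem, a small sketch like the following may help. The input is a made-up two-record sample with Windows line endings; it counts how many records awk sees with and without stripping the CRs first:

```shell
# Hypothetical CRLF sample: the "blank" line is really a lone \r.
sample='Case 1\r\n\r\nCase 2\r\n'

# Without conversion, awk never finds an empty line, so everything is one record.
dirty=$(printf "$sample" | awk -v RS= 'END { print NR }')

# After deleting the CRs, the blank line separates two records.
clean=$(printf "$sample" | tr -d '\r' | awk -v RS= 'END { print NR }')

printf 'records before: %s, after: %s\n' "$dirty" "$clean"
```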
Good luck!

Related

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda

First off, you might want to think about your regular expression. While the one you have you say works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that start with a case insensitive "PALL", possibly preceded by any number of other characters that start with a capital letter, and that lines must not end in a dot (with grep -P, the \. inside the bracket expression is just an escaped dot, not "backslash or dot"). So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So .. in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K with optional Perl compatible RE), it does NOT provide you with any options for formatting the filename. Since you can't do anything about the way grep formats its output, you're limited to either parsing the output you've got, or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore that these lines contain a file and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to and not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
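Neither of those removes the .txt extension, which your desired output also drops. A possible variation, assuming every filename ends in a run of digits and underscores followed by .txt (as in your two examples), captures the date and the separator and throws away the rest. The sample line here is shortened and purely illustrative:

```shell
# Assumed filename shape: <letters>_<digits and underscores>.txt
# grep prints ':' after the filename for matches and '-' for -A context lines,
# so both separators are accepted and kept.
line='aja_EPL_1999_03_02.txt:PALLILENNUD : korraga'
out=$(printf '%s\n' "$line" |
  sed 's/^[^0-9]*\([0-9_]*\)\.txt\([:-]\)/\1\2/')
printf '%s\n' "$out"
```

In a real pipeline you would put that sed after your grep instead of feeding it a fixed string.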
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
  /\.$/ {next}                             # skip lines ending in a dot
  /^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll][Ll]/ {    # lines to match, "PALL" in any case
    f=FILENAME
    sub(/^[^0-9]*/,"",f)                   # strip unwanted part of filename, like sed
    printf "%s:%s\n", f, $0
    if ((getline) > 0)                     # simulate the "-A 1" from grep
      printf "%s:%s\n", f, $0
  }' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

How to read a value from recursive xml attribute in Unix using sed/awk/grep only

I have a config.xml file, and I need to retrieve the value of the element at the xpath
/domain/server/name
I can only use grep/sed/awk. Any help is appreciated.
The content of the xml is below where I need to retrieve the Server Name only.
<domain>
<server>
<name>AdminServer</name>
<port>1234</port>
</server>
<server>
<name>M1Server</name>
<port>5678</port>
</server>
<machine>
<name>machine01</name>
</machine>
<machine>
<name>machine02</name>
</machine>
</domain>
The output should be :
AdminServer
M1Server
I tried to do,
sed -ne '/<\/name>/ { s/<[^>]*>(.*)<\/name>/\1/; p }' config.xml

sed is only for simple substitutions on individual lines; doing anything else with sed is strictly for mental exercise, not for real code. That's not what you are trying to do, so you shouldn't even be considering sed. Just use awk:
$ awk -F'[<>]' 'p=="server" && $2=="name"{print $3} {p=$2}' file
AdminServer
M1Server
That will work with any awk on any UNIX box. If that's not all you need then edit your question to provide more truly representative sample input and expected output.
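For anyone wondering how that one-liner works: the field separator splits each line around < and >, so $2 is the tag name and $3 is the text after it, and p remembers the tag seen on the previous line. A name is printed only when the line right before it was <server>. Here is a small self-contained run on a cut-down copy of the sample (the shell variable xml is just a stand-in for your file):

```shell
# Cut-down copy of the question's sample, held in a variable for the demo.
xml='<domain>
<server>
<name>AdminServer</name>
</server>
<server>
<name>M1Server</name>
</server>
<machine>
<name>machine01</name>
</machine>
</domain>'

# p holds the previous line's tag; print the name only when that tag was "server".
out=$(printf '%s\n' "$xml" |
  awk -F'[<>]' 'p=="server" && $2=="name"{print $3} {p=$2}')
printf '%s\n' "$out"
```

The machine name is skipped because the tag on the line before it is machine, not server. This does rely on <name> being the first child of <server>, as it is in the sample.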

Try this command. Name your xml and supply that file as an input.
awk '/<server>/,/<\/server>/' < name.xml | grep "name" | cut -d ">" -f2 | cut -d "<" -f1
Output:
AdminServer
M1Server

Based on your sample Input_file shown, could you please try following.
awk -F"[><]" '/<\/server>/{a="";next} /<server>/{a=1;next} a && /<name>/{print $3}' Input_file

sed -n '/<server>/{n;s/\s*<[^>]*>//gp}'
For example, for the first match:
1. /<server>/
matches the line that contains "<server>".
2. n
The "n" command moves on to the next line, so the pattern space now holds "<name>AdminServer</name>".
3. s/\s*<[^>]*>//gp
replaces every match of "\s*<[^>]*>" with nothing, then prints the pattern space.
Type "info sed" for more on sed commands.

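You can extract the names with just sed, but note that this matches every <name> element, so on the sample it also prints machine01 and machine02: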
sed -n 's:.*<name>\(.*\)</name>.*:\1:p' config.xml

I feel dirty parsing XML in awk.
The following finds the correct depth of entry with the right tag name. It does not verify the path, though it depends on the elements you specified. While this works on your example data, it makes certain ugly assumptions and it's not guaranteed to work elsewhere:
awk -F'[<>]' '$2~/^(domain|server|name)$/{n++} n==3&&$2=="name"{print $3} /<\/(domain|server|name)>/{n--}' input.xml
A better solution would be to parse the XML itself.
$ awk -F'[<>]' -v check="domain.server.name" '$2~/^[a-z]/ { path=path "." $2 } substr(path,2)==check { print path " = " $3 } /<\// { sub(/\.[^.]*$/,"",path) }' input.xml
.domain.server.name = AdminServer
.domain.server.name = M1Server
Here it is split out for easier commenting.
$ awk -F'[<>]' -v check="domain.server.name" '
  # Split fields around pointy brackets. Supply a path to check.
  $2~/^[a-z]/ {                  # If we see an open tag,
    path=path "." $2             # append the current tag to our path.
  }
  substr(path,2)==check {        # If we match the given path,
    print path " = " $3          # print the result.
  }
  /<\// {                        # If the line contains a close tag,
    sub(/\.[^.]*$/,"",path)      # drop the last element from the path.
  }
' input.xml
Note that this solution barfs horribly if you feed it badly formatted XML. The recognition of tags could be improved, but may be sufficient if you have consistently formatted XML. It may barf horribly for other reasons too. Do not do this. Install the correct tools to parse XML properly.

Unix add a comma in the hundredths place and a $ to the last field

I have a number in the last field of my text file and I need to add a dollar sign to each line and a comma as the thousands separator in the number. So 10000 would now be $10,000.
one of the lines looks like this
World fair:399-454-9999:832 ponce Drive, Gary, IN 87878:3/22/62:24500
need it to look like this
World fair:399-454-9999:832 ponce Drive, Gary, IN 87878:3/22/62:$24,500

You can use the ' printf format flag to get the thousands groupings.
(I can't find a good reference for it but it is in the printf man page at least.)
The SUSv2 specifies one further flag character.
'
For decimal conversion (i, d, u, f, F, g, G) the output is to be grouped with thousands' grouping characters if the locale information indicates any. Note that many versions of gcc(1) cannot parse this option and will issue a warning. SUSv2 does not include %'F.
Then you just need a fairly simple application of awk.
awk -F : -v OFS=: '{$NF="$"sprintf("%\047d", $NF)}7' file
-F : sets the field separator to : so we get just the number in the final field
-v OFS=: sets the output field separator to : so awk puts the colons back for us
\047 is the octal code for a single quote to embed it in the single-quoted string easily
7 is a truth-y value to cause awk to print the line
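One caveat from the man page text above: the ' flag only groups digits if the current locale defines a grouping character, so in the C/POSIX locale you may get no commas at all. If that bites you, a locale-independent sketch is to build the grouped number by hand in awk. The input line below is a shortened, made-up version of yours, and the variable names n and s are mine:

```shell
# Peel three digits at a time off the right-hand end of the last field.
out=$(printf 'World fair:3/22/62:24500\n' |
  awk -F: -v OFS=: '{
    n = $NF; s = ""
    while (length(n) > 3) {
      s = "," substr(n, length(n) - 2) s   # prepend ",ddd" to the tail
      n = substr(n, 1, length(n) - 3)      # keep the remaining leading digits
    }
    $NF = "$" n s
    print
  }')
printf '%s\n' "$out"
```

This assumes the last field is a plain non-negative integer; decimals or a leading minus sign would need extra handling.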

The Perl Cookbook offers this regex solution:
sub commify {
my $text = reverse $_[0];
$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}
This can be incorporated into a specific solution:
perl -lpe 'BEGIN{sub commify {$t=reverse shift; $t=~s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g; reverse $t}} s/(\d+)$/chr(044).commify($1)/e' file
output:
World fair:399-454-9999:832 ponce Drive, Gary, IN 87878:3/22/62:$24,500
A solution using unpack:
perl -lpe 'BEGIN{sub commify {$b=reverse shift; @c=unpack("(A3)*", $b); reverse join ",", @c}} s/(\d+)$/chr(044).commify($1)/e' file
If you have the Number::Format library installed, there is a shorter solution:
perl -lpe 'BEGIN{use Number::Format "format_number"} s/(\d+)$/chr(044).format_number($1)/e' file
All of the above solutions use Perl's s/foo/bar/e substitute operator with the e flag, which eval's the bar section.
chr(044) is used to print the $ (otherwise it would be eval'd)

You can add the dollar signs and the comma separately:
sed -i "s/:\([0-9]*\)$/:\$\1/g" file.txt
sed -i "s/\([0-9]\)\([0-9]\{3\}\)$/\1,\2/g" file.txt
sed -i "s/\([0-9]\)\([0-9]\{3\}\)\([0-9]\{3\}\)$/\1,\2,\3/" zeros.txt

Need help parsing a file via UNIX commands

I have a file that has lines that look like this
LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=ABCD,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=ABCD,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=XYZ,&FIELD2-0&FIELD3-1&FIELD9-0
LINEID3:FIELD1=XYZ,&FIELD7-0&FIELD8-0;
LINEID1:FIELD1=PQRS,&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=PQRS,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=PQRS,&FIELD7-0&FIELD8-0;
I'm interested in only the lines that begin with LINEID1 and only some elements (FIELD1, FIELD2, FIELD4 and FIELD9) from that line. The output should look like this (no & signs; they can be replaced with |)
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0;
FIELD1=PQRS|FIELD4-0|FIELD9-0;
If additional information is required, do let me know, I'll post them in edits. Thanks!!

This is not exactly what you asked for, but no-one else is answering and it is pretty close for you to get started with!
awk -F'[&:]' '/^LINEID1:/{print $2,$3,$5,$6}' OFS='|' file
Output
FIELD1=ABCD,|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ,|FIELD2-0|FIELD9-0|
FIELD1=PQRS,|FIELD3-1|FIELD9-0;|
The -F sets the Input Field Separator to colon or ampersand. Then it looks for lines starting LINEID1: and prints the fields you need. The OFS sets the Output Field Separator to the pipe symbol |.

Pure awk:
awk -F ":" ' /LINEID1[^0-9]/{gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2); gsub(/,*&+/,"|",$2); print $2} ' file
Updated to give proper formatting and to omit LINEID11, etc...
Output:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
Explanation:
awk -F ":" - split lines into LHS ($1) and RHS ($2) since output only requires RHS
/LINEID1[^0-9]/ - return only lines that match LINEID1 and also ignores LINEID11, LINEID100 etc...
gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2) - remove all fields that aren't 1, 2, 4 or 9 on the RHS
gsub(/,*&+/,"|",$2) - clean up the leftover delimiters on the RHS
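If the delete-then-clean-up approach feels fragile, a possible alternative is to split the RHS into pieces and keep only the ones you name. This is a sketch under the assumption that FIELD1 always comes first and is followed by ",&", as in the sample; the array name part and the three input lines fed in by printf are just for the demo:

```shell
# Keep FIELD1 (always first) plus any of FIELD2, FIELD4, FIELD9 that are present.
out=$(printf '%s\n' \
    'LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;' \
    'LINEID2:FIELD1=ABCD,&FIELD5-1&FIELD6-0;' \
    'LINEID1:FIELD1=PQRS,&FIELD3-1&FIELD4-0&FIELD9-0;' |
  awk -F: '$1 == "LINEID1" {
    n = split($2, part, /,?&/)          # pieces between "&" (or ",&")
    line = part[1]                      # FIELD1=... comes first
    for (i = 2; i <= n; i++)
      if (part[i] ~ /^FIELD[249]-/)     # wanted fields only
        line = line "|" part[i]
    print line
  }')
printf '%s\n' "$out"
```

Comparing $1 with "LINEID1" exactly also takes care of LINEID11 and friends without a separate pattern.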

To select rows from data with Unix command lines, use grep, awk, perl, python, or ruby (in increasing order of power & possible complexity).
To select columns from data, use cut, awk, or one of the previously mentioned scripting languages.
First, let's get only the lines with LINEID1 (assuming the input is in a file called input).
grep '^LINEID1' input
will output all the lines beginning with LINEID1.
Next, extract the columns we care about:
grep '^LINEID1' input | # extract lines with LINEID1 in them
cut -d: -f2 | # extract column 2 (after ':')
tr ',&' '\n\n' | # turn ',' and '&' into newlines
egrep 'FIELD[1249]' | # extract only fields FIELD1, FIELD2, FIELD4, FIELD9
tr '\n' '|' | # turn newlines into '|'
sed -e $'s/\\|\\(FIELD1\\)/\\\n\\1/g' -e 's/\|$//'
The last line inserts newlines in front of the FIELD1 lines, and removes any trailing '|'.
That last sed pattern is a little more challenging because sed doesn't like literal newlines in its replacement patterns. To put a literal newline, a bash escape needs to be used, which then requires escapes throughout that string.
Here's the output from the above command:
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
This command took only a couple of minutes to cobble up.
Even so, it's bordering on the complexity threshold where I would shift to perl or ruby because of their excellent string processing.
The same script in ruby might look like:
#!/usr/bin/env ruby
#
while line = gets do
  if line.chomp =~ /^LINEID1:(.*)$/
    f1, others = $1.split(',')
    fields = others.split('&').map {|f| f if f =~ /FIELD[1249]/}.compact
    puts [f1, fields].flatten.join("|")
  end
end
Run this script on the same input file and the same output as above will occur:
$ ./parse-fields.rb < input
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;

How to extract multiple string occurences between two matching patterns using sed or grep commands

I am a newbie to unix and playing around with sed and awk commands.
My sample snort rule has multiple occurrences of keyword "content". I need to extract all data between content:" and "; to a file.
This sample contains one rule in single line. My actual file contains 30k of such rules.
1rule file contains
alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS (msg:"APP-DETECT Absolute Software Computrace outbound connection - search.namequery.com"; flow:to_server,established; content:"Host|3A| search.namequery.com|0D 0A|"; fast_pattern:only; http_header; content:"TagId: "; http_header; metadata:policy security-ips drop, ruleset community, service http; reference:url,absolute.com/support/consumer/technology_computrace; reference:url,www.blackhat.com/presentations/bh-usa-09/ORTEGA/BHUSA09-Ortega-DeactivateRootkit-PAPER.pdf; classtype:misc-activity; sid:26287; rev:4;) cat 4rules|sed 's/.*content:"\([^";]*\)".*/\1/'sdfjklhaskl;jdf;kljasdfsjkdfhnkl;asdjfklasdfja'sjkdsdfh;askldjf`
Expected output:
Host|3A| search.namequery.com|0D 0A|
TagId
\([^
I tried with the sed and grep commands below.
grep -Po '(?<=content:").*(?=";)' 1rule
sed 's/.*content:"\([^";]*\).*/\1/' 1rule
The output I got is not as expected:
Using grep, I could see all contents but there is intermediate data between them
sed gives me the last occurrence in a line along with non matching lines after the occurrence.
Please let me know how I can solve this problem.

With GNU grep (as in your question, taking advantage of the -P option for Perl-compatible regular expressions):
grep -Po 'content:"\K[^"]+' 1rule
\K drops what's been matched so far: the field label and the opening ".
[^"]+ then matches the content of the string up to, but excluding, the closing ".
Alternatively, try awk with the following:
awk -F'content:' '{
for (i=2;i<=NF;++i) {
split($i, a, /"/); print a[2]
}
}' 1rule
Splits the input line(s) into fields by separator content:
Loops over fields starting with index 2 (because field 1 is the string preceding the first content: substring).
Splits the field into tokens by " and prints the 2nd token, which is the string enclosed in "..." at the start of the field.
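If your grep has -o but not -P, a rough equivalent is to let grep print each content:"..." match on its own line and then trim the label and quotes with sed. This is only a sketch; the rule text below is a shortened, made-up stand-in for a real rule, and it assumes the quoted strings contain no escaped double quotes:

```shell
# Shortened, hypothetical rule with two content fields.
rule='alert tcp (msg:"demo"; content:"Host|3A| a.com|0D 0A|"; http_header; content:"TagId: "; sid:1;)'

# grep -o emits one match per line; sed strips the label and the quotes.
out=$(printf '%s\n' "$rule" |
  grep -o 'content:"[^"]*"' |
  sed 's/^content:"//; s/"$//')
printf '%s\n' "$out"
```

Because [^"]* cannot cross a double quote, each match stops at its own closing quote instead of running on to the last one in the line, which is what went wrong with the greedy .* in the question.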
