Extracting a random pattern after matching a word in following lines - unix

Extract household data corresponding to a keyword.
Z1/NEW "THE_PALM" 769 121003 1545
NEW HOUSE IN
SOMETHING SOMETHING
SN HOUSE CLASS
FIRST PSD93_PU 1579
CHAIRS
WOOD
SILVER SPOON
GREEN GARDEN
Z1/OLD "THE_ROSE" 786 121003 1343
NEW HOUSE OUT
SOMETHING NEW
SN HOUSE CLASS
FIRST_O PSD1000_ST 1432
CHAIRS
WOOD
GREEN GARDEN
BLACK PAINT
Z1/OLD "The_PURE" 126 121003 3097
NEW HOUSE IN
SOMETHING OLD
SN HOUSE CLASS
LAST_O JD4_GOLD 1076
CHAIRS
SILVER SPOON
I have a very large file. At the end of every house description there is a list of items about the house. For the houses containing SILVER SPOON, I want to extract the HOUSE ID (PSD93_PU in the data above) and the date (121003). I tried the following:
awk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=7 a=0 s="SILVER" infile > outfile
But the problem is that the number of lines above the keyword SILVER is so variable that I can't figure out a solution.

Assuming each new house starts with Z1:
$ awk '$1 ~ /^Z1/ { date=$4; id=""; f=0; next; } \
$1 == "SN" { f=1; next; } \
f == 1 { id=$2; f=0; next; } \
$1" "$2 == "SILVER SPOON" { print id,date }' file
On a new house, reset all the vars and grab the date.
If an SN is matched, then the next line contains the id.
Grab the id from that line.
If "SILVER SPOON" is found, print the id and date.
If it is not found, a new house will be met and the vars are reset.
test with given data:
$ awk '$1 ~ /^Z1/ { date=$4; id=""; f=0; next; } $1 == "SN" { f=1; next; } f == 1 { id=$2; f=0; next; } $1 == "SILVER" && $2 == "SPOON" { print id,date }' file
PSD93_PU 121003
JD4_GOLD 121003
note :
if anybody knows how and if $1 == "SILVER" && $2 == "SPOON" can be merged into one statement, that'd be nice :) -- like: $1,$2 == "SILVER SPOON"
edit:
it can be done with $1" "$2 == "SILVER SPOON" (concatenation binds tighter than comparison, so the joined string is what gets compared).
One could possibly omit the space and do $1$2 == "SILVERSPOON", but that would match even if $2 were empty and $1 contained the whole string, or if $1 were SILVERSPO and $2 were ON. So the space acts as a strict separator.
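Putting the pieces together, here is a runnable sketch of the accepted approach on the question's sample data (the file name houses.txt is hypothetical):

```shell
# houses.txt is a hypothetical name; the contents are the question's sample data.
cat > houses.txt <<'EOF'
Z1/NEW "THE_PALM" 769 121003 1545
NEW HOUSE IN
SOMETHING SOMETHING
SN HOUSE CLASS
FIRST PSD93_PU 1579
CHAIRS
WOOD
SILVER SPOON
GREEN GARDEN
Z1/OLD "THE_ROSE" 786 121003 1343
NEW HOUSE OUT
SOMETHING NEW
SN HOUSE CLASS
FIRST_O PSD1000_ST 1432
CHAIRS
WOOD
GREEN GARDEN
BLACK PAINT
Z1/OLD "The_PURE" 126 121003 3097
NEW HOUSE IN
SOMETHING OLD
SN HOUSE CLASS
LAST_O JD4_GOLD 1076
CHAIRS
SILVER SPOON
EOF

# On a Z1 header: remember the date, clear the id.
# After an SN line: the next line carries the id in field 2.
# On SILVER SPOON: print what was remembered.
awk '$1 ~ /^Z1/ { date=$4; id=""; f=0; next }
     $1 == "SN" { f=1; next }
     f == 1    { id=$2; f=0; next }
     $1" "$2 == "SILVER SPOON" { print id, date }' houses.txt
# prints:
# PSD93_PU 121003
# JD4_GOLD 121003
```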

Using sed:
sed -n -e 's/^Z1[^"]*"[^"]*"[ \t]*[0-9]*[ \t]*\([0-9]*\).*/\1/p'
-e '/^SN[ \t]*HOUSE/ { n; s/^[^ \t]*[ \t]*\([^ \t]*\).*/\1/p }'
Firstly, we call sed with the -n option in order to tell it to print only what we tell it to.
The first command will search for a particular pattern to extract the date. The pattern consists of:
^Z1: A line starting with the string "Z1".
[^"]*: zero or more characters that aren't double quotes
": double quote character
[^"]*: zero or more characters that aren't double quotes
[ \t]*: zero or more characters that are either tabs or spaces
[0-9]*: zero or more digits
[ \t]*: zero or more characters that are either tabs or spaces
\([0-9]*\): zero or more digits. The backslashed parentheses capture the match, i.e. the match is stored in the auxiliary variable \1.
.*: zero or more characters, effectively skipping all characters until the end of the line.
This matched line is then replaced with \1, which holds our captured content: the date. The p after the command tells sed to print the result.
The second line contains two commands grouped together (inside braces) so that they are only executed on the "address" before the braces. The address is a pattern, so that it is executed on every line that matches the pattern. The pattern consists of a line that starts with "SN" followed by a sequence of spaces or tabs, followed by the string "HOUSE".
When the pattern matches, we first execute the n next command, which loads the next line from input. Then, we extract the ID from the new line, in a way analogous to extracting the date. The substitute pattern to match is:
^[^ \t]*: a string that starts with zero or more characters that aren't spaces or tabs (whitespace).
[ \t]*: then has a sequence of zero or more spaces and/or tabs.
\([^ \t]*\): a sequence of non whitespace characters is then captured
.*: the remaining characters are matched so that they are skipped.
The replacement becomes the captured ID, and again we tell sed to print it out.
This will print out a line containing the date, followed by a line containing the ID. If you want a line in the format ID date, you can pipe the output of sed into another sed instance, as follows:
sed -n -e [...] | sed -n -e 'h;n;G;s/\n/ /;p'
(The second instance also needs -n and an explicit p; otherwise its n command would auto-print the date line on its own before reading the ID line.)
This sed instance performs the following operations:
Reads a line, and the h command tells it to store the line into the hold space (an auxiliary buffer).
Read the next line with the n command.
The G get command will append the contents of the hold space into the pattern space (the working buffer), so now we have the ID line followed by the date line.
Finally, we replace the new line character by a space, so the lines are joined into a single line.
Hope this helps =)
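As a runnable sketch of just this pairing step (the date/ID values are hypothetical stand-ins for the first sed's output):

```shell
# Join each date/ID pair, printing the ID first.
# -n plus an explicit p is needed so that n does not auto-print the held date line.
printf '121003\nPSD93_PU\n121003\nJD4_GOLD\n' |
sed -n -e 'h;n;G;s/\n/ /;p'
# prints:
# PSD93_PU 121003
# JD4_GOLD 121003
```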

If your records are separated by two or three blank lines and the line spacing before the household items is consistent, you could use GNU awk like this:
gawk --re-interval 'BEGIN { RS="\n{3}\n*"; FS="\n" } /SILVER SPOON/ { split($1, one, OFS); split($6, two, OFS); print two[2], one[4] }' file.txt
Results:
PSD93_PU 121003
JD4_GOLD 121003

Related

Use sed to replace all occurrences of strings which start with 'xy' and of length 5 or more

I am running AIX 6.1
I have a file which contains strings/words starting with some specific characters, say 'xy' or 'Xy' or 'xY' or 'XY' (case insensitive), and I need to mask the entire word/string with asterisks '*' if the word is greater than, say, 5 characters.
e.g. I need a sed command which when run against a file containing the below line...
This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings
should give below as the output
This is a test line xy12 which I need to replace specific strings
I tried the below commands (I did not yet get to the stage where I restrict the word length), but they do not work and display the full line without any substitutions.
I tried using \< and \> as well as \b for word identification.
sed 's/\<xy\(.*\)\>/******/g' result2.csv
sed 's/\bxy\(.*\)\b******/g' result2.csv
You can try with awk:
echo 'This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings' | awk 'BEGIN{RS=ORS=" "} !(/^[xX][yY]/ && length($0)>=5)'
The awk record separator is set to a space in order to be able to get the length of each word.
This works with GNU awk in --posix and --traditional modes.
With sed, for the mental exercise:
sed -E '
s/(^|[[:blank:]])([xyXY])([xyXY].{2}[^[:space:]]*)([^[:space:]])/\1#\3#/g
:A
s/(#[^#[:blank:]]*)[^#[:blank:]](#[#]*)/\1#\2/g
tA
s/#/*/g'
This requires that the text does not already contain a # character.
A simple POSIX awk version:
awk '{for(i=1;i<=NF;++i) if ($i ~ /^[xX][yY]/ && length($i)>=5) gsub(/./,"*",$i)}1'
This, however, does not keep the spacing intact (multiple spaces are collapsed into single ones); the following does:
awk 'BEGIN{RS=ORS=" "}(/^[xX][yY]/ && length($0)>=5){gsub(/./,"*")}1'
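For instance, a quick sketch (with made-up words) showing the spacing collapse mentioned above: once a field is modified, awk rebuilds $0 with single spaces.

```shell
# runs of spaces in the input collapse to single spaces in the output
echo 'a  b   xy12345  c' |
awk '{ for(i=1;i<=NF;++i) if ($i ~ /^[xX][yY]/ && length($i)>=5) gsub(/./,"*",$i) } 1'
# prints: a b ******* c
```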
You may use awk:
s='This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings xy123 xy1234 xy12345 xy123456 xy1234567'
echo "$s" | awk 'BEGIN {
ORS=RS=" "
}
{
for(i=1;i<=NF;i++) {
if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/)
gsub(/./,"*", $i);
print $i;
}
}'
A one liner:
awk 'BEGIN {ORS=RS=" "} { for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } }'
# => This is a test line ******* xy12 ***** ******* which I need to replace specific strings ***** ****** ******* ******** *********
See the online demo.
Details
BEGIN {ORS=RS=" "} - start of the awk: set both the input record separator (RS) and the output record separator (ORS) to a space
{ for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } } - iterate over each field (with for(i=1;i<=NF;i++)); if the current field ($i) is at least 5 characters long (length($i) >= 5) and (&&) it matches xy in any case followed by 1 or more alphanumeric chars ($i~/^[Xx][Yy][a-zA-Z0-9]+$/), replace each of its chars with * (with gsub(/./,"*", $i)), then print the current field value.
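A runnable sketch on a shortened version of the question's sample line (note that ORS=" " leaves a trailing space and no final newline, hence the closing echo):

```shell
s='This is a test line xy12345 xy12 Xy123 Xy11111'
echo "$s" | awk 'BEGIN { ORS=RS=" " }
{
  for (i=1; i<=NF; i++) {
    # mask xy-words of length >= 5, case-insensitively
    if (length($i) >= 5 && $i ~ /^[Xx][Yy][a-zA-Z0-9]+$/)
      gsub(/./, "*", $i)
    print $i
  }
}'
echo   # terminate the output line
# prints: This is a test line ******* xy12 ***** ******* (with a trailing space)
```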
This might work for you (GNU sed):
sed -r ':a;/\bxy\S{5,}\b/I!b;s//\n&\n/;h;s/[^\n]/*/g;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/;ta' file
If the current line does not contain a string which begins with xy case insensitive and 5 or more following characters, then there is no work to be done.
Otherwise:
Surround the string by newlines
Copy the pattern space (PS) to the hold space (HS)
Replace all characters other than newlines with *'s
Append the PS to the HS
Replace the PS with the HS
Swap the strings between the newlines retaining the remainder of the first line
Repeat

Split line into multiple lines of 42 Unix after last given char

I have a text file in unix formed from multiple long lines
ALTER Tit como(titel('42423432;434235111;757567562;2354679;5543534;6547673;32322332;54545453'))
ALTER Mit como(Alt('432322;434434211;754324237562;2354679;5543534;6547673;32322332;54545453'))
I need to split each line in multiple lines of no longer than 42 characters.
The split should be done at the end of last ";", and
so my ideal output file will be :
ALTER Tit como(titel('42423432;434235111; -
757567562;2354679;5543534;6547673; -
32322332;54545453'))
ALTER Mit como(Alt('432322;434434211; -
754324237562;2354679;5543534;6547673; -
32322332;54545453'))
I used fold -w 42 givenfile.txt | sed 's/ $/ -/g'
It splits the line, but it doesn't add the "-" at the end of each line and doesn't split after the ";".
any help is much appreciated.
Thanks !
awk -F';' '
w{
print""
}
{
w=length($1)
printf "%s",$1
for (i=2;i<=NF;i++){
if ((w+length($i)+1)<42){
w+=length($i)+1
printf";%s",$i
} else {
w=length($i)
printf"; -\n%s",$i
}
}
}
END{
print""
}
' file
This produces the output:
ALTER Tit como(titel('42423432;434235111; -
757567562;2354679;5543534;6547673; -
32322332;54545453'))
ALTER Mit como(Alt('432322;434434211; -
754324237562;2354679;5543534;6547673; -
32322332;54545453'))
How it works
Awk implicitly loops through each line of its input and each line is divided into fields. This code uses a single variable w to keep track of the current width of the output line.
-F';'
Tell awk to break fields on semicolons.
`w{print""}
If the last line was not completed, w>0, then print a newline to terminate it before we start with a new line.
w=length($1); printf "%s",$1
Print the first field of the new line and set w according to its length.
Loop over the remaining fields:
for (i=2;i<=NF;i++){
if ((w+length($i)+1)<42){
w+=length($i)+1
printf";%s",$i
} else {
w=length($i)
printf"; -\n%s",$i
}
}
This loops over the second to final fields of this line. Whenever we reach the point where we can't print another field without exceeding the 42 character limit, we print ; -\n.
END{print""}
Print a newline at the end of the file.
This might work for you (GNU sed):
sed -r 's/.{1,42}$|.{1,41};/& -\n/g;s/...$//' file
This globally replaces either 1 to 42 characters followed by end of line, or 1 to 41 characters followed by a ;, with the matched text plus  -\n (the & keeps the match, so  - and a newline are appended after it). The very last piece also gets the suffix, leaving three characters too many at the end, which s/...$// deletes.
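A runnable sketch on the question's input (GNU sed; -r enables extended regexps, and the file name lines.txt is hypothetical):

```shell
cat > lines.txt <<'EOF'
ALTER Tit como(titel('42423432;434235111;757567562;2354679;5543534;6547673;32322332;54545453'))
ALTER Mit como(Alt('432322;434434211;754324237562;2354679;5543534;6547673;32322332;54545453'))
EOF
# split each line at the last ; that keeps a chunk within the limit
sed -r 's/.{1,42}$|.{1,41};/& -\n/g;s/...$//' lines.txt
# prints:
# ALTER Tit como(titel('42423432;434235111; -
# 757567562;2354679;5543534;6547673; -
# 32322332;54545453'))
# ALTER Mit como(Alt('432322;434434211; -
# 754324237562;2354679;5543534;6547673; -
# 32322332;54545453'))
```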

How do I replace empty strings in a tsv with a value?

I have a tsv, file1, that is structured as follows:
col1 col2 col3
1 4 3
22 0 8
3 5
so that the last line would look something like 3\t\t5, if it was printed out. I'd like to replace that empty string with 'NA', so that the line would then be 3\tNA\t5. What is the easiest way to go about this using the command line?
awk is designed for this scenario (among a million others ;-) )
awk -F"\t" -v OFS="\t" '{
for (i=1;i<=NF;i++) {
if ($i == "") $i="NA"
}
print $0
}' file > file.new && mv file.new file
-F"\t" indicates that the field separator (also known as FS internally to awk) is the tab character. We also set the output field separator (OFS) to "\t".
NF is the number of fields on a line of data. $i gets evaluated as $1, $2, $3, ... for each value between 1 and NF.
We test if the $i th element is empty with if ($i == "") and when it is, we change the $i th element to contain the string "NA".
For each line of input, we print the line's ($0) value.
Outside the awk script, we write the output to a temp file, i.e. file > file.new. The && tests that the awk script exited without errors, and if OK, then moves the file.new over the original file. Depending on the safety and security use-case your project requires, you may not want to "destroy" your original file.
IHTH.
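As a runnable sketch of the same script, reading from a pipe instead of rewriting a file:

```shell
# feed the question's tsv on stdin; empty fields become NA
printf 'col1\tcol2\tcol3\n1\t4\t3\n22\t0\t8\n3\t\t5\n' |
awk -F"\t" -v OFS="\t" '{
  for (i=1; i<=NF; i++) {
    if ($i == "") $i="NA"   # empty field -> NA
  }
  print $0
}'
# the last line becomes 3<TAB>NA<TAB>5
```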
A straightforward approach is
sed -i 's/^\t/NA\t/;s/\t$/\tNA/;:0;s/\t\t/\tNA\t/;t0' file
sed -i edit file in place;
s/a/b/ replace a with b;
s/^\t/NA\t/ replace a \t at the beginning of the line with NA\t
(the first column becomes NA);
s/\t$/\tNA/ the same for the last column;
s/\t\t/\tNA\t/ insert NA in between \t\t;
:0; s///; t0 repeat the s/// while it keeps making replacements (in case there are several missing values in the line).
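A runnable sketch of the same substitutions, reading from a pipe (GNU sed, which understands \t; the loop label is written :a/ta here, equivalent to the numbered label above):

```shell
printf 'col1\tcol2\tcol3\n1\t4\t3\n3\t\t5\n' |
sed 's/^\t/NA\t/;s/\t$/\tNA/;:a;s/\t\t/\tNA\t/;ta'
# the last line becomes 3<TAB>NA<TAB>5
```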

using sed to turn paragraph to lines

Using sed and any basic commands, I'm trying to count the number of words in each separate passage that has many separate passages. Each passage begins with a specific number and increases. Example:
0:1.1 This is the first passage...
0:1.2 This is the second passage...
The difficult thing is that each passage is a word-wrapped paragraph, not a single line. I could count the words in each passage if they were on single lines. How can I do this? Thanks for the help.
I did figure how to count each passage with:
grep '[0-9]:[0-9]' file | wc -l
This awk solution might work for you:
awk '/^[0-9]:[0-9]\.[0-9]/{
if (pass_num) printf "%s, word count: %i\n", pass_num, word_count
pass_num=$1
word_count=-1
}
{ word_count+=NF }
END { printf "%s, word count: %i\n", pass_num, word_count }
' file
Test input:
# cat file
0:1.1 I am le passage one.
There are many words in me.
0:1.2 I am le passage two.
One two three four five six
Seven
0:1.3 I am "Hello world"
Test output:
0:1.1, word count: 11
0:1.2, word count: 12
0:1.3, word count: 4
How it works:
Each word is separated by empty space, so each word can be represented by each field in awk, i.e. word count in a line is equal to NF. The word count is summed up every line until the next passage.
When it encounters a new passage (indicated by the presence of a passage number), it:
prints out the previous passage's number and word count;
sets the passage number to the new passage number;
resets the passage word count (-1 because we don't want the passage number to be counted).
The END{..} block is needed because the final passage doesn't have a trigger that causes it to print out the passage number and word count.
The if (pass_num) is to suppress printf when awk encounters the first passage.
This might work for you (GNU sed):
sed -r ':a;$bb;N;/\n[0-9]+:[0-9]+\.[0-9]+/!s/\n/ /g;ta;:b;h;s/\n.*//;s/([0-9]+:[0-9]+\.[0-9]+)(.*)/echo "\1 = $(wc -w <<<"\2")"/ep;g;D' file
It forms each section into a single line then counts the words in the section less the section number (newlines are replaced by spaces).
$ cat file
0:1.1 This is the first passage...
welcome to the SO, you leart a lot of things here.
0:1.2 This is the second passage...
wer qwerqrq ewqr e
0:1.3 This is the second passage...
Using sed and GNU grep:
$ sed -n '/0:1.1/,/[0-9]:[0-9]\.[0-9]/{//!p}' file | grep -Eo '[[:alpha:]]*' | wc -l
11
0:1.1 -> put here the number of the passage whose words you want to count.
Here's one way with GNU awk:
awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' 'NF > 0 { print R ": " NF - 2 } { R = RT }'
If it is run on the file listed by doubledown, the output is:
0:1.1: 11
0:1.2: 12
0:1.3: 4
Explanation
This works by splitting the input into records according to [0-9]+:[0-9]+\\.[0-9]+ and splitting records into fields at whitespace. The record separator is off by one (RT holds the separator that ended the current record, which is the number of the next one), hence the { R = RT }; the field count is off by two because each record starts and ends with an FS, hence the NF - 2.
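A runnable sketch on the test input shown earlier (gawk only, since a regex RS and the RT variable are GNU extensions; the file name passages.txt is hypothetical):

```shell
cat > passages.txt <<'EOF'
0:1.1 I am le passage one.
There are many words in me.
0:1.2 I am le passage two.
One two three four five six
Seven
0:1.3 I am "Hello world"
EOF
# each passage number becomes the record separator of the record it starts
gawk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' \
  'NF > 0 { print R ": " NF - 2 } { R = RT }' passages.txt
# prints:
# 0:1.1: 11
# 0:1.2: 12
# 0:1.3: 4
```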
Edit - only count fields containing [:alnum:]
The above also counts e.g. ellipsis (...) as words, to avoid this do something like this:
awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' '
NF > 0 {
wc = NF-2
for(i=2; i<NF; i++)
if($i !~ /[[:alnum:]]+/)
wc--
print R ": " wc
}
{ R = RT }'

How do I delete a matching line and the previous one?

I need delete a matching line and one previous to it.
e.g In file below I need to remove lines 1 & 2.
I tried grep -v -B 1 "page.of." 1.txt
and I expected it to not print the matchning lines and the context.
I tried the How do I delete a matching line, the line above and the one below it, using sed? but could not understand the sed usage.
---1.txt--
**document 1** -> 1
**page 1 of 2** -> 2
testoing
testing
super crap blah
**document 1**
**page 2 of 2**
You want to do something very similar to the answer given:
sed -n '
/page . of ./ { #when pattern matches
n #read the next line into the pattern space
x #exchange the pattern and hold space
d #skip the current contents of the pattern space (previous line)
}
x #for each line, exchange the pattern and hold space
1d #skip the first line
p #and print the contents of pattern space (previous line)
$ { #on the last line
x #exchange pattern and hold, pattern now contains last line read
p #and print that
}'
And as a single line
sed -n '/page . of ./{n;x;d;};x;1d;p;${x;p;}' 1.txt
grep -v -B1 doesn't work because it skips those lines but then includes them again later (due to the -B1). To check this out, try the command on:
**document 1** -> 1
**page 1 of 2** -> 2
**document 1**
**page 2 of 2**
**page 3 of 2**
You will notice that the page 2 line really is skipped (the line after it also matches, so it is never pulled back in), but the page 1 line reappears: it is printed as -B1 context for the non-matching line that follows it.
There's a simple awk solution:
awk '!/page.*of.*/ { if (m) print buf; buf=$0; m=1} /page.*of.*/ {m=0}' 1.txt
The awk command says the following:
If the current line contains that "page ... of" string, it signals (m=0) that the buffered previous line must not be printed. Otherwise, the previous line (stored in buf) is printed and the buffer is set to the current line, so the output lags the input by one line.
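A runnable sketch on the question's file. Note the added END block (my addition, not in the answer above): without it, the last line of the file would be silently dropped whenever it is a line that should be kept.

```shell
cat > 1.txt <<'EOF'
**document 1**
**page 1 of 2**
testoing
testing
super crap blah
**document 1**
**page 2 of 2**
EOF
awk '!/page.*of.*/ { if (m) print buf; buf=$0; m=1 }
      /page.*of.*/ { m=0 }
     END           { if (m) print buf }' 1.txt
# prints:
# testoing
# testing
# super crap blah
```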
grep -vf <(grep -B1 "page.*of" file | sed '/^--$/d') file
Not too familiar with sed, but here's a perl expression to do the trick:
cat FILE | perl -e '@a = <STDIN>;
for( $i=0 ; $i <= $#a ; $i++ ) {
if($i > 0 && $a[$i] =~ /xxxx/) {
$a[$i] = "";
$a[$i-1] = "";
}
} print @a;'
edit:
where "xxxx" is what you are trying to match.
Thanks, I was trying to use the awk command given by Foo Bah
to delete the matching line and the previous one. I have to use it multiple times, so I put the matching part in a variable. The given awk command works, but with a variable it does not (i.e. it does not delete the matching & previous lines). I tried:
awk -vvar="page.*of.*" '!/$var/ { if (m) print buf; buf=$0; m=1} /$var/ {m=0}' 1.txt
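The /$var/ form cannot work: awk does not expand variables inside slashes, so the pattern is the literal characters $var. To use a variable pattern, match it dynamically with the ~ and !~ operators instead (sketch on a shortened version of the question's file):

```shell
printf '**document 1**\n**page 1 of 2**\ntestoing\ntesting\n' > 1.txt
# var holds a dynamic regex; $0 ~ var / $0 !~ var match it at run time
awk -v var="page.*of.*" \
  '$0 !~ var { if (m) print buf; buf=$0; m=1 }
   $0 ~ var  { m=0 }
   END       { if (m) print buf }' 1.txt
# prints:
# testoing
# testing
```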
