How do I delete a matching line and the previous one? - unix

I need to delete a matching line and the one previous to it.
E.g. in the file below I need to remove lines 1 & 2.
I tried grep -v -B 1 "page.of." 1.txt
and I expected it to not print the matching lines and their context.
I tried the answer to "How do I delete a matching line, the line above and the one below it, using sed?" but could not understand the sed usage.
--- 1.txt ---
**document 1** -> 1
**page 1 of 2** -> 2
testoing
testing
super crap blah
**document 1**
**page 2 of 2**

You want to do something very similar to the answer given there:
sed -n '
/page . of ./ {   # when the pattern matches
  n               # read the next line into the pattern space
  x               # exchange the pattern and hold space
  d               # delete the pattern space (the previous line) and start the next cycle
}
x                 # for every other line, exchange the pattern and hold space
1d                # skip the first line (the hold space was still empty)
p                 # print the contents of the pattern space (the previous line)
$ {               # on the last line
  x               # exchange pattern and hold; pattern now holds the last line read
  p               # and print that
}'
And as a single line
sed -n '/page . of ./{n;x;d;};x;1d;p;${x;p;}' 1.txt

grep -v -B1 doesn't work because it skips the matching lines but then re-includes them as context (due to the -B1). To check this out, try the command on:
**document 1** -> 1
**page 1 of 2** -> 2
**document 1**
**page 2 of 2**
**page 3 of 2**
You will notice that the page 2 line is correctly skipped (both it and the following line match, so it is never pulled back in as context), while the page 1 line reappears as -B1 context before the second document line.
There's a simple awk solution:
awk '!/page.*of.*/ { if (m) print buf; buf=$0; m=1 } /page.*of.*/ { m=0 } END { if (m) print buf }' 1.txt
The awk command works as follows:
It keeps the previous line in buf, so the output lags the input by one line. If the current line contains that "page ... of" string, it clears the flag m, signalling that the buffered previous line must not be printed. Otherwise it prints the buffered line (when m is set), stores the current line in buf, and sets m again. The END block flushes the last buffered line when it is still valid.
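Run against the sample 1.txt, both the sed one-liner above and this awk command should leave only:
testoing
testing
super crap blah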

Another way: build the set of lines to delete (every match plus the line before it) with grep -B1, strip the -- group separators with sed, then filter those lines out of the file:
grep -vf <(grep -B1 "page.*of" file | sed '/^--$/d') file

Not too familiar with sed, but here's a perl expression to do the trick:
cat FILE | perl -e '@a = <STDIN>;
for( $i=0 ; $i <= $#a ; $i++ ) {
    if($i > 0 && $a[$i] =~ /xxxx/) {
        $a[$i] = "";
        $a[$i-1] = "";
    }
} print @a;'
edit:
where "xxxx" is what you are trying to match.

Thanks, I was trying to use the awk command given by Foo Bah
to delete the matching line and the previous one. I have to use it multiple times, so for the matching part I use a variable. The given awk command works, but when using a variable it does not (i.e. it does not delete the matching and previous lines). I tried:
awk -vvar="page.*of.*" '!/$var/ { if (m) print buf; buf=$0; m=1} /$var/ {m=0}' 1.txt
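Inside /.../ awk treats $var as literal text rather than expanding the variable, which is why the command stops matching. Dynamic patterns need the ~ and !~ match operators against $0; a sketch of the same command rewritten that way:
awk -v var="page.*of.*" '$0 !~ var { if (m) print buf; buf=$0; m=1 } $0 ~ var { m=0 } END { if (m) print buf }' 1.txt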

Related

Use sed to replace all occurrences of strings which start with 'xy' and of length 5 or more

I am running AIX 6.1
I have a file which contains strings/words starting with some specific characters, say 'xy' or 'xY' or 'Xy' or 'XY' (case insensitive), and I need to mask the entire word/string with asterisks '*' if the word is, say, 5 or more characters long.
e.g. I need a sed command which when run against a file containing the below line...
This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings
should give below as the output
This is a test line ******* xy12 ***** ******* which I need to replace specific strings
I tried the commands below (I did not yet get to the stage where I restrict the word length), but they do not work and display the full line without any substitutions.
I tried using \< and \> as well as \b for word identification.
sed 's/\<xy\(.*\)\>/******/g' result2.csv
sed 's/\bxy\(.*\)\b/******/g' result2.csv
You can try with awk:
echo 'This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings' | awk 'BEGIN{RS=ORS=" "} !(/^[xX][yY]/ && length($0)>=5)'
The awk record separator is set to a space in order to be able to get the length of each word.
This works with GNU awk in --posix and --traditional modes.
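Note that this removes the matching words entirely rather than masking them; for the sample line it should emit:
This is a test line xy12 which I need to replace specific strings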
With sed, for the mental exercise:
sed -E '
  s/(^|[[:blank:]])([xyXY])([xyXY].{2}[^[:space:]]*)([^[:space:]])/\1#\3#/g
  :A
  s/(#[^#[:blank:]]*)[^#[:blank:]](#[#]*)/\1#\2/g
  tA
  s/#/*/g'
This requires that the text contain no # characters.
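For instance, reusing the script on the question's sample line (GNU sed):
echo 'This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings' | sed -E '
  s/(^|[[:blank:]])([xyXY])([xyXY].{2}[^[:space:]]*)([^[:space:]])/\1#\3#/g
  :A
  s/(#[^#[:blank:]]*)[^#[:blank:]](#[#]*)/\1#\2/g
  tA
  s/#/*/g'
# => This is a test line ******* xy12 ***** ******* which I need to replace specific strings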
A simple POSIX awk version:
awk '{for(i=1;i<=NF;++i) if ($i ~ /^[xX][yY]/ && length($i)>=5) gsub(/./,"*",$i)}1'
This, however, does not keep the spacing intact (multiple spaces are collapsed into one); the following does:
awk 'BEGIN{RS=ORS=" "}(/^[xX][yY]/ && length($0)>=5){gsub(/./,"*")}1'
You may use awk:
s='This is a test line xy12345 xy12 Xy123 Xy11111 which I need to replace specific strings xy123 xy1234 xy12345 xy123456 xy1234567'
echo "$s" | awk 'BEGIN {
ORS=RS=" "
}
{
for(i=1;i<=NF;i++) {
if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/)
gsub(/./,"*", $i);
print $i;
}
}'
A one liner:
awk 'BEGIN {ORS=RS=" "} { for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } }'
# => This is a test line ******* xy12 ***** ******* which I need to replace specific strings ***** ****** ******* ******** *********
Details
BEGIN {ORS=RS=" "} - start of the awk script: set both the record separator (RS) and the output record separator (ORS) to a single space
{ for(i=1;i<=NF;i++) {if(length($i) >= 5 && $i~/^[Xx][Yy][a-zA-Z0-9]+$/) gsub(/./,"*", $i); print $i; } } - iterate over each field (with for(i=1;i<=NF;i++)); if the current field ($i) is at least 5 characters long (length($i) >= 5) and (&&) it matches xy in any letter case followed by 1 or more alphanumeric chars ($i~/^[Xx][Yy][a-zA-Z0-9]+$/), then replace each of its characters with * (with gsub(/./,"*", $i)) and print the current field value.
This might work for you (GNU sed):
sed -r ':a;/\bxy\S{5,}\b/I!b;s//\n&\n/;h;s/[^\n]/*/g;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/;ta' file
If the current line does not contain a string which begins with xy (case insensitive) followed by 5 or more characters, then there is no work to be done.
Otherwise:
Surround the string by newlines
Copy the pattern space (PS) to the hold space (HS)
Replace all characters other than newlines with *'s
Append the PS to the HS
Replace the PS with the HS
Swap the strings between the newlines retaining the remainder of the first line
Repeat

How do I replace empty strings in a tsv with a value?

I have a tsv, file1, that is structured as follows:
col1 col2 col3
1 4 3
22 0 8
3 5
so that the last line would look something like 3\t\t5, if it was printed out. I'd like to replace that empty string with 'NA', so that the line would then be 3\tNA\t5. What is the easiest way to go about this using the command line?
awk is designed for this scenario (among a million others ;-) )
awk -F"\t" -v OFS="\t" '{
for (i=1;i<=NF;i++) {
if ($i == "") $i="NA"
}
print $0
}' file > file.new && mv file.new file
-F="\t" indicates that the field separator (also known as FS internally to awk) is the tab character. We also set the output field separator (OFS) to "\t".
NF is the number of fields on a line of data. $i gets evaluated as $1, $2, $3, ... for each value between 1 and NF.
We test whether the i-th field is empty with if ($i == ""), and when it is, we set that field to the string "NA".
For each line of input, we print the (possibly modified) line's ($0) value.
Outside the awk script, we write the output to a temp file, i.e. file > file.new. The && tests that the awk script exited without errors and, if so, moves file.new over the original file. Depending on the safety and security requirements of your project, you may not want to overwrite your original file.
IHTH.
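A quick way to check it, feeding the sample rows via printf instead of a file:
printf 'col1\tcol2\tcol3\n1\t4\t3\n22\t0\t8\n3\t\t5\n' |
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF;i++) if ($i == "") $i="NA"; print $0 }'
The last row should come out as 3\tNA\t5.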
A straightforward approach is
sed -i 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0' file
sed -i edit file in place;
s/a/b/ replace a with b;
s/^\t/NA\t/ replace \t at the beginning of the line with NA\t
(the first column becomes NA);
s/\t$/\tNA/ the same for the last column;
s/\t\t/\tNA\t/ insert NA in between \t\t;
:0 s///; t0 repeat s/// if there was a replacement (in case there are other missing values in the line).
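For example, on the problem row from the question (GNU sed; \t in a sed expression is a GNU extension):
printf '3\t\t5\n' | sed 's/^\t/NA\t/;s/\t$/\tNA/;:0 s/\t\t/\tNA\t/;t0'
which should print 3\tNA\t5.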

Delete all lines after a string but the last few

I have a file which contains some lines and some numbers. I want to delete all the lines after a particular string, let us say 'angles'. I can do this using sed as follows:
sed -n '/angles/q;p' inputfile
My question is: I want to delete all the lines after angles but retain the last four lines. How can I do this in sed?
The following should work for you:
tac inputfile | sed '5,/angles/d' | tac
Explanation: Reverse the file, keep the first 4 lines, and delete everything until the desired pattern is encountered. Reverse the result.
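For example, with the temp file used in the awk answer below (note that this variant also removes the angles line itself):
> tac temp | sed '5,/angles/d' | tac
wbj
ergthet
gheheh
1
2
3
4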
This might work for you (GNU sed):
sed '1,/angles/b;:a;$!{N;s/\n/&/4;Ta;D}' file
Print out all the lines from the first until the one containing the string angles. Then maintain a moving window of four lines and print it at the end of the file.
In awk :
awk '{
if(p!=1)print;
else{a[NR]=$0}
if($0~/angles/)p=1
}
END{
for(i=NR-3;i<=NR;i++)print a[i]
}' your_file
Tested below:
> cat temp
wbj
ergthet
gheheh
wergergkn angles gerg
er
ergnmer
rgnerk
1
2
3
4
> awk '{if(p!=1)print;else{a[NR]=$0};if($0~/angles/)p=1}END{for(i=NR-3;i<=NR;i++)print a[i]}' temp
wbj
ergthet
gheheh
wergergkn angles gerg
1
2
3
4
>
awk -v size=$(wc -l < file) '
/angles/ { skip=1 }
NR > (size-4) { skip=0 }
!skip
' file
I'm going to assume the string you're looking for may occur more than once, and you only want up to the first one. If that's true, reversing the file isn't the best strategy.
{ sed '/angles/q' file; tail -4 file; } > result
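Checked against the temp file from the earlier answer, this keeps the angles line and should produce the same eight lines as the awk version:
> { sed '/angles/q' temp; tail -4 temp; }
wbj
ergthet
gheheh
wergergkn angles gerg
1
2
3
4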

using sed to turn paragraph to lines

Using sed and any basic commands, I'm trying to count the number of words in each separate passage of a file that contains many passages. Each passage begins with a specific number, which increases. Example:
0:1.1 This is the first passage...
0:1.2 This is the second passage...
The difficult thing is that each passage is a word-wrapped paragraph, not a single line. I could count the words in each passage if they were on single lines. How can I do this? Thanks for the help.
I did figure out how to count the passages with:
grep '[0-9]:[0-9]' file | wc -l
This awk solution might work for you:
awk '/^[0-9]:[0-9]\.[0-9]/{
  if (pass_num) printf "%s, word count: %i\n", pass_num, word_count
  pass_num=$1
  word_count=-1
}
{ word_count+=NF }
END { printf "%s, word count: %i\n", pass_num, word_count }
' file
Test input:
# cat file
0:1.1 I am le passage one.
There are many words in me.
0:1.2 I am le passage two.
One two three four five six
Seven
0:1.3 I am "Hello world"
Test output:
0:1.1, word count: 11
0:1.2, word count: 12
0:1.3, word count: 4
How it works:
Words are separated by whitespace, so each word is a field in awk, i.e. the word count of a line is NF. The counts are summed up line by line until the next passage begins.
When it encounters a new passage (indicated by the presence of a passage number), it:
prints out the previous passage's number and word count
sets the passage number to the new passage number
resets the passage word count (-1 because we don't want the passage number itself to be counted)
The END{..} block is needed because the final passage doesn't have a trigger that causes it to print out the passage number and word count.
The if (pass_num) is to suppress printf when awk encounters the first passage.
This might work for you (GNU sed):
sed -r ':a;$bb;N;/\n[0-9]+:[0-9]+\.[0-9]+/!s/\n/ /g;ta;:b;h;s/\n.*//;s/([0-9]+:[0-9]+\.[0-9]+)(.*)/echo "\1 = $(wc -w <<<"\2")"/ep;g;D' file
It forms each section into a single line then counts the words in the section less the section number (newlines are replaced by spaces).
$ cat file
0:1.1 This is the first passage...
welcome to the SO, you leart a lot of things here.
0:1.2 This is the second passage...
wer qwerqrq ewqr e
0:1.3 This is the second passage...
Using sed and GNU grep:
$ sed -n '/0:1.1/,/[0-9]:[0-9]\.[0-9]/{//!p}' file | grep -Eo '[[:alpha:]]*' | wc -l
11
0:1.1 -> give here the number of the passage whose words you want to count.
Here's one way with GNU awk:
awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' 'NF > 0 { print R ": " NF - 2 } { R = RT }'
If it is run on the file listed by doubledown, the output is:
0:1.1: 11
0:1.2: 12
0:1.3: 4
Explanation
This works by splitting the input into records according to [0-9]+:[0-9]+\\.[0-9]+ and splitting records into fields at whitespace. RT holds the text that actually matched RS, so the passage number that terminates one record really belongs to the next record; the separator is off by one, hence the { R = RT }. The field counter is off by two because each record starts and ends with an FS, hence the NF - 2.
Edit - only count fields containing [:alnum:]
The above also counts e.g. the ellipsis (...) as a word; to avoid this, do something like:
awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' '
NF > 0 {
  wc = NF-2
  for(i=2; i<NF; i++)
    if($i !~ /[[:alnum:]]+/)
      wc--
  print R ": " wc
}
{ R = RT }'

Maximum number of characters in a field of a csv file using unix shell commands?

I have a csv file. In one of the fields, say the second field, I need to know the maximum number of characters. For example, given the file below:
adf,jlkjl,lkjlk
jf,j,lkjljk
jlkj,lkejflkj,adfafef,
jfje,jj,lkjlkj
jjee,eeee,ereq
the answer would be 8 because row 3 has 8 characters in the second field (lkejflkj). I would like to integrate this into a bash script, so common unix command line programs are preferred. Imaginary bonus points for explaining what the command is doing.
EDIT: Here is what I have so far
cut --delimiter=, -f 2 test.csv | wc -m
This gives me the total character count for the whole column rather than the maximum of a single field, so I still have progress to make.
I would use awk for the task. It uses a comma to split each line into fields and, for each line, checks whether the length of the second field is bigger than the value already saved.
awk '
BEGIN {
  FS = ","
}
{ c = length( $2 ) > c ? length( $2 ) : c }
END {
  print c
}
' infile
Use it as a one-liner and assign the return value to a variable, like:
num=$(awk 'BEGIN { FS = "," } { c = length( $2 ) > c ? length( $2 ) : c } END { print c }' infile)
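With the sample file from the question saved as test.csv, num should end up as 8 (the length of lkejflkj):
num=$(awk 'BEGIN { FS = "," } { c = length( $2 ) > c ? length( $2 ) : c } END { print c }' test.csv)
echo "$num"   # => 8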
Well @oob, you basically provided the answer with your last edit, and it's the simplest of all the answers given. However, I also like @Birei's answer, just because I enjoy AWK. :-)
I too had to find the longest possible value for a given field inside a text file today. Tested with your sample and got the expected 8.
cut -d, -f2 test.csv | wc -L
As you see, it is just a matter of using the correct option for wc (which I hope you have already figured out by now).
My solution is to loop over the lines. Then I exchange the commas for newlines to loop over the words, check which word is the longest, and save the data.
#!/bin/bash
lineno=1
matchline=0
matchlen=0
while IFS= read -r line; do
    words=$(echo "$line" | sed -e 's/,/\n/g')
    for word in $words; do
        # echo "line: $lineno; length: ${#word}; input: $word"
        if [ $matchlen -lt ${#word} ]; then
            matchlen=${#word}
            matchline=$lineno
        fi
    done
    lineno=$(($lineno + 1))
done < input.txt
echo max length is $matchlen in line $matchline
Bash and Coreutils Solution
There are a number of ways to solve this, but I vote for simplicity. Here's a solution that uses Bash parameter expansion and a few standard shell utilities to measure each line:
cut -d, -f2 /tmp/foo |
while read; do
  echo ${#REPLY}
done | sort -n | tail -n1
The idea here is to split out the CSV column, then use the parameter length expansion of the implicit REPLY variable to measure the characters on each line. When we sort the measurements numerically, the last line of the sorted output will hold the length of the longest field found.
cut out the desired column
print each line length
sort the line lengths
grab the max line length
cut -d, -f2 test.csv | awk '{print length($0);}' | sort -n | tail -n 1
