Extracting specific delimited fields using sed - unix

Learning sed and patterns.
I have lines of input that look like this:
1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1000001,P00248942,F,0-17,10,A,2,0,1,6,14,15200
1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
1000001,P00085442,F,0-17,10,A,2,0,12,14,,1057
1000002,P00285442,M,55+,16,C,4+,0,8,,,7969
1000003,P00193542,M,26-35,15,A,3,0,1,2,,15227
I need to extract the first, second, and last fields. The output for the first line would be something like
1000001 P00069042 8370
I have tried sed -n 's/,.*,.*,/ /p' but it only returns the first and last fields.
I have also tried sed -n 's/\([^,]*,[^,]*,\).*,/ /p' but it only returns the last field.
My approach is to delete everything between the second comma and the last comma, but I don't know how to specify the second comma.
I'm aware this can be done with cut or awk, but I'm trying to figure out sed.

sed is great for unstructured/raw stream data -- this isn't one of those.
Nonetheless, the trick to using sed to pick out "fields" is to:
Create a regex that matches the whole line
Use capture groups \(..\) to pick out the parts you want to save
Use the [^<c>]*<c> idiom to enforce non-greedy behavior where needed (<c> is any char)
Replace the entire line using back-references from the capture groups you saved
$ echo 1000001,P00069042,F,0-17,10,A,2,0,3,,,8370 |\
sed 's/^\([^,]*\),\([^,]*\),.*,\(.*\)$/\1 \2 \3/'
1000001 P00069042 8370
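The asker's own plan (delete everything between the second comma and the last comma) can also be written directly. This is a minimal sketch, assuming comma-separated input with at least four fields and no quoted commas; the function name first_second_last is made up for illustration.

```shell
# Hypothetical helper: reads CSV lines on stdin, prints fields 1, 2 and last.
first_second_last() {
    # \([^,]*\),\([^,]*\), pins down the first two fields; the greedy .*,
    # then swallows everything up to and including the LAST comma, so only
    # the final field is left after the replacement text.
    sed 's/^\([^,]*\),\([^,]*\),.*,/\1 \2 /'
}

printf '%s\n' \
    '1000001,P00069042,F,0-17,10,A,2,0,3,,,8370' \
    '1000002,P00285442,M,55+,16,C,4+,0,8,,,7969' |
    first_second_last
```

The second attempt in the question was close: it captured the first two fields but never put them back with \1, so they were replaced along with everything else.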

Related

Replace double consonant letters with one using sed command

How to replace double consonants with only one letter using sed Linux command. Example: WILLIAM -> WILIAM. grep -E '(.)\1+' commands finds the words that follow two same consonants in a row pattern, but how do I replace them with only one occurrence of the letter?
I tried
cat test.txt | head | tr -s '[^AEUIO\n]' '?'
tr is all or nothing; it will replace all occurrences of the selected characters, regardless of context. For regex replacement, look at sed - you even included this in your question's tags, but you don't seem to have explored how it might be useful?
sed 's/\(.\)\1/\1/g' test.txt
The dot matches any character; to restrict to only consonants, change it to [b-df-hj-np-tv-xz] or whatever makes sense (maybe extend to include upper case; perhaps include accented characters?)
The regex dialect understood by sed is more like the one understood by grep without -E (hence all the backslashes); though some sed implementations also support this option to select the POSIX extended regular expression dialect.
Neither sed nor tr needs cat to read standard input for them (though tr obscurely does not accept a file name argument). See tangentially also Useless use of cat?
Match one consonant, remember it in \( \), then match it again with \1 and replace the pair with the single remembered letter.
sed 's/\([bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ]\)\1/\1/g'
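The two answers above can be combined into one small sketch. It assumes ASCII input and treats y as a vowel; the function name squeeze_double_consonants is hypothetical.

```shell
# Hypothetical helper: collapse doubled consonants on stdin (y counts as a vowel here).
squeeze_double_consonants() {
    # LC_ALL=C keeps the bracket ranges byte-based and predictable.
    # The g flag handles several doubled pairs on the same line.
    LC_ALL=C sed 's/\([b-df-hj-np-tv-xzB-DF-HJ-NP-TV-XZ]\)\1/\1/g'
}

printf '%s\n' WILLIAM coffee BOOKKEEPER | squeeze_double_consonants
```

Doubled vowels are left alone, so coffee keeps its ee while the ff is reduced to a single f.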

Sed line match plus line below

I can find my lines with this pattern, but in some case the info is on the line after the match. How can I also get the line following my match line?
sed -n '/SQL3227W Record token/p' /log/PLAN_2015-08-16*.MSG >ERRORS.txt
Firstly, this looks like a job for grep:
grep -A 1 'SQL3227W Record token' /log/PLAN_2015-08-16*.MSG >ERRORS.txt
(-A 1 means to print an additional 1 line After the match).
Secondly, if you're using GNU sed, you can use a second address of +1 thus:
sed -n '/SQL3227W Record token/,+1p' /log/PLAN_2015-08-16*.MSG >ERRORS.txt
Otherwise (if you really must use non-GNU sed), print each matching line, then read the following line and print it too. If that following line is itself a match, loop back so that the line after it is printed as well.
Untested code:
#!/bin/sed -nf
/SQL3227W Record token/{
:a
p
n
/SQL3227W Record token/ba
p
}
sed is for simple substitutions on individual lines, that is all. For anything even slightly more interesting just use awk:
awk '/SQL3227W Record token/{c=2} c&&c--' file
See Printing with sed or awk a line following a matching pattern for other related idioms.
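The awk countdown idiom generalizes to any number of trailing lines. The sketch below wraps it in a hypothetical function, match_plus_following; the sample log lines are invented for the demo and are not taken from a real DB2 log.

```shell
# Hypothetical helper: print every line matching regex $1 plus the $2 lines after it.
match_plus_following() {
    # c is reset to n+1 on each match; "c && c--" is true (so the line is
    # printed) while c is non-zero, and counts c down by one per line.
    awk -v re="$1" -v n="$2" '$0 ~ re { c = n + 1 } c && c--'
}

printf '%s\n' \
    'ok line' \
    'SQL3227W Record token 1' \
    'detail for token 1' \
    'another ok line' |
    match_plus_following 'SQL3227W Record token' 1
```

Because the counter is reset on every match, two consecutive matching lines are both printed, followed by the line after the second one.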

How to find a pattern using sed?

How can I combine multiple filters using sed?
Here's my data set
sex,city,age
male,london,32
male,manchester,32
male,oxford,64
female,oxford,23
female,london,33
male,oxford,45
I want to identify all lines which contain MALE AND OXFORD. Here's my approach:
sed -n '/male/,/oxford/p' file
Thanks
You can associate a block with the first check and put the second in there. For example:
sed -n '/male/ { /oxford/ p; }' file
Or invert the check and action:
sed '/male/!d; /oxford/!d' file
However, since (as @Jotne points out) lines that contain female also contain male and you probably don't want to match them, the patterns should at least be amended to contain word boundaries:
sed -n '/\<male\>/ { /\<oxford\>/ p; }' file
sed '/\<male\>/!d; /\<oxford\>/!d' file
But since that looks like comma-separated data and the check is probably not meant to test whether someone went to male university, it would probably be best to use a stricter check with awk:
awk -F, '$1 == "male" && $2 == "oxford"' file
This checks not only if a line contains male and oxford but also if they are in the appropriate fields. The same can be achieved, somewhat less prettily, with sed by using
sed '/^male,oxford,/!d' file
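To see the difference between the substring test and the anchored test, here is a small sketch run against the data set from the question. The variable names data, loose and strict are only for illustration.

```shell
# Sample data from the question, held in a shell variable for the demo.
data='sex,city,age
male,london,32
male,manchester,32
male,oxford,64
female,oxford,23
female,london,33
male,oxford,45'

# Substring test: "female" contains "male", so female,oxford,23 slips through.
loose=$(printf '%s\n' "$data" | sed '/male/!d; /oxford/!d')

# Anchored test: the line must start with exactly male,oxford,
strict=$(printf '%s\n' "$data" | sed '/^male,oxford,/!d')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose filter returns three rows (including the female one), the strict filter only the two male,oxford rows.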
A single sed command can be used to solve this. Let's look at two variations of using sed:
$ sed -e 's/^\(male,oxford,.*\)$/\1/;t;d' file
male,oxford,64
male,oxford,45
$ sed -e 's/^male,oxford,\(.*\)$/\1/;t;d' file
64
45
Both have essentially the same regex:
^male,oxford,.*$
The interesting features are the capture group placement (either the whole line or just the age portion) and the use of ;t;d to discard non matching lines.
By doing it this way, we can avoid the requirement of using awk or grep to solve this problem.
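A label-less t followed by ;d may not be parsed the same way by every sed implementation. As an alternative sketch (not the answerer's method), the same filtering can be done with -n and the p flag of the s command. The function name ages_of_male_oxford is hypothetical.

```shell
# Hypothetical helper: print only the age of male,oxford rows read from stdin.
ages_of_male_oxford() {
    # -n turns off automatic printing; the p flag prints a line only when
    # the substitution actually happened on it.
    sed -n 's/^male,oxford,\(.*\)$/\1/p'
}

printf '%s\n' 'male,london,32' 'male,oxford,64' 'female,oxford,23' 'male,oxford,45' |
    ages_of_male_oxford
```

Non-matching lines are never printed, so no explicit d is needed.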
You can use awk
awk -F, '/\<male\>/ && /\<oxford\>/' file
male,oxford,64
male,oxford,45
It uses word-boundary anchors to avoid matching female.

Add " to the end of any line that ends in This or this using sed in unix

I have a file where a few lines end with tux. How do I add " to the end of any line that ends in words like this or This?
You could visit this site for more examples and general help with sed. Also check its "Regular expressions" tab, or search the web for something like "unix anchor characters".
For this actual problem, these are the relevant parts of the site:
Sed has the ability to specify which lines are to be examined and/or modified, by specifying addresses before the command. I will just describe the simplest version for now - the /PATTERN/ address. When used, only lines that match the pattern are given the command after the address. Briefly, when used with the /p flag, matching lines are printed twice:
sed '/PATTERN/p' file
And of course PATTERN is any regular expression.
Based on this, a command like the following prints the lines of your file that end with "this" or "This" (the -n option suppresses sed's default output, so only the matching lines appear, once each):
$ sed -n '/[tT]his$/p' yourfile
or, if you meant lines ending with "tux":
$ sed -n '/tux$/p' yourfile
For putting the double quotes at the end of these lines, you also need to understand:
$ has a special meaning (end of the input line) as an anchor character in regular expressions
... and the character "$" is the end anchor. The expression "A$" will match all lines that end with the capital A. If the anchor characters are not used at the proper end of the pattern, then they no longer act as anchors. The "$" is only an anchor if it is the last character.
how to use sed for substitution of characters (see the linked page)
Sed has several commands, but most people only learn the substitute command: s. The substitute command changes all occurrences of the regular expression into a new value. A simple example is changing "day" in the "old" file to "night" in the "new" file:
$ sed 's/day/night/' <old >new
Or another way (for UNIX beginners),
$ sed 's/day/night/' old >new
and for those who want to test this:
$ echo day | sed 's/day/night/'
This will output "night".
After these you can construct your own sed command, knowing that you can use these two parts together in one command like this:
$ sed '/[pP]atternAtTheEndOfLine$/s/$/patternToAddToEndOfTheLine/' yourfile
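Filling in the template for the question as asked gives the sketch below. The function name quote_lines_ending_in_this and the sample lines are made up for the demo.

```shell
# Hypothetical helper: append a double quote to lines ending in "this" or "This".
quote_lines_ending_in_this() {
    # The address /[tT]his$/ selects the lines; s/$/"/ replaces the (empty)
    # end-of-line position with a double quote.
    sed '/[tT]his$/s/$/"/'
}

printf '%s\n' 'I like this' 'Not that' 'Look at This' | quote_lines_ending_in_this
```

Lines that do not end in this or This pass through unchanged; to target tux instead, swap the address for /tux$/.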

excluding first and last lines from sed /START/,/END/

Consider the input:
=sec1=
some-line
some-other-line

foo
bar=baz

=sec2=
c=baz
If I wish to process only =sec1= I can for example comment out the section by:
sed -e '/=sec1=/,/=[a-z0-9]*=/s:^:#:' < input
... well, almost.
This will comment the lines including "=sec1=" and "=sec2=" lines, and the result will be something like:
#=sec1=
#some-line
#some-other-line
#
#foo
#bar=baz
#
#=sec2=
c=baz
My question is: What is the easiest way to exclude the start and end lines from a /START/,/END/ range in sed?
I know that in many cases refining the "s:::" clause can give a solution for the specific case, but I am after a generic solution here.
In "Sed - An Introduction and Tutorial" Bruce Barnett writes: "I will show you later how to restrict a command up to, but not including the line containing the specified pattern.", but I was not able to find where he actually shows this.
In the "USEFUL ONE-LINE SCRIPTS FOR SED" Compiled by Eric Pement, I could find only the inclusive example:
# print section of file between two regular expressions (inclusive)
sed -n '/Iowa/,/Montana/p' # case sensitive
This should do the trick:
sed -e '/=sec1=/,/=sec2=/ { /=sec1=/b; /=sec2=/b; s/^/#/ }' < input
This matches between sec1 and sec2 inclusively and then just skips the first and last line with the b command. This leaves the desired lines between sec1 and sec2 (exclusive), and the s command adds the comment sign.
Unfortunately, you do need to repeat the regexps for matching the delimiters. As far as I know there's no better way to do this. At least you can keep the regexps clean, even though they're used twice.
This is adapted from the SED FAQ: How do I address all the lines between RE1 and RE2, excluding the lines themselves?
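The same technique can be written with one command per line, a form that non-GNU seds should also accept (an assumption; only the general technique comes from the answer above). The function name comment_section_body is hypothetical and the input is a shortened version of the question's sample.

```shell
# Hypothetical helper: comment out the body of =sec1=, leaving both delimiter lines alone.
comment_section_body() {
    sed '/=sec1=/,/=sec2=/{
        /=sec1=/b
        /=sec2=/b
        s/^/#/
    }'
}

printf '%s\n' '=sec1=' 'some-line' 'foo' 'bar=baz' '=sec2=' 'c=baz' |
    comment_section_body
```

A bare b with no label jumps to the end of the script, so the delimiter lines are printed untouched while every other line in the range reaches the s command.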
If you're not interested in lines outside of the range, but just want the non-inclusive variant of the Iowa/Montana example from the question (which is what brought me here), you can write the "except for the first and last matching lines" clause easily enough with a second sed:
sed -n '/PATTERN1/,/PATTERN2/p' < input | sed '1d;$d'
Personally, I find this slightly clearer (albeit slower on large files) than the equivalent
sed -n '1,/PATTERN1/d;/PATTERN2/q;p' < input
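A quick side-by-side sketch of the two variants follows. The function names and the BEGIN/END sample are invented, and the regexes are assumed not to contain a slash.

```shell
# Hypothetical helpers; $1 = start regex, $2 = end regex, input on stdin.
between_two_seds() {
    # Print the inclusive range, then drop its first and last line.
    sed -n "/$1/,/$2/p" | sed '1d;$d'
}
between_one_sed() {
    # Delete everything up to the start line, quit at the end line,
    # print whatever is in between.
    sed -n "1,/$1/d;/$2/q;p"
}

sample=$(printf '%s\n' header BEGIN x y END trailer)
printf '%s\n' "$sample" | between_two_seds BEGIN END
printf '%s\n' "$sample" | between_one_sed BEGIN END
```

On this input both print x and y. They can differ on other inputs: the single-sed form stops at the first end line, and if the start pattern is on line 1 the 1,/PATTERN1/ range looks for a later match instead.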
Another way would be
sed -n '/begin/,/end/ {
/begin/n
/end/ !p
}'
/begin/n -> skip over the line that has the "begin" pattern
/end/ !p -> print all lines that don't have the "end" pattern
Taken from Bruce Barnett's sed tutorial http://www.grymoire.com/Unix/Sed.html#toc-uh-35a
I've used:
sed -n '/begin/,/end/{/begin\|end/!p}'
This will search all the lines between the patterns, then print everything not containing the patterns
you could also use awk
awk '/sec1/{f=1;print;next}f && !/sec2/{ $0="#"$0}/sec2/{f=0}1' file
