Join lines based on a delimiter using awk, sed in unix

I'd like to join lines that end with a | with the next line.
Example:
abcd|
ef
123456|
78|
90
Desired output:
abcdef
1234567890

This might work for you (GNU sed):
sed ':a;N;s/|\s*\n//;ta;P;D' file
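Roughly, what each command does:
:a          # Label to branch back to
N           # Append the next line to the pattern space
s/|\s*\n//  # Delete a pipe (plus optional whitespace) and the newline after it
ta          # If that substitution succeeded, loop back to join more lines
P           # Print the pattern space up to the first newline
D           # Delete up to the first newline and restart the cycle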

With GNU sed:
$ sed ':a;/| *$/{N;ba};s/| *\n//g' infile
abcdef
1234567890
This does the following:
:a # Label to branch to
/| *$/{ # If the line ends with a pipe and optional spaces
N # Append next line to pattern space
ba # Branch to label
}
s/| *\n//g # Remove all pipes followed by optional spaces and a newline
To get this working under BSD sed, the command has to be split up after labels and branching commands:
sed -e ':a' -e '/| *$/{N;ba' -e '};s/| *\n//g' infile

With GNU awk for multi-char RS:
$ awk -v RS='[|]\\s*\\n' -v ORS= '1' file
abcdef
1234567890
With other awks one way would be:
$ awk 'sub(/\|[[:blank:]]*$/,"") {s = s $0; next} {print s $0; s=""}' file
abcdef
1234567890
If your last input line can end with | you'd need to tell us how to handle that and add sample input/output including that case to your question.

You can try this sed:
sed ':loop; /| *$/{N;s/\n//g}; t loop' yourfile | sed 's/ *| *//g'
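Read as two stages (a rough annotation; the pipeline is split across lines only for layout):
sed ':loop; /| *$/{N;s/\n//g}; t loop' yourfile |  # while the line ends with a pipe, append the next line and drop the newline
  sed 's/ *| *//g'                                 # then strip the pipes and any spaces around them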

awk to the rescue! The following script removes the pipe from each line that contains one and suppresses that line's newline, so it is joined with the next line.
$ awk '{c=sub(/\|/,"")?"":"\n"; $1=$1; printf "%s%s",$0,c}' file
abcdef
1234567890
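The same one-liner, expanded with comments (a sketch of the logic):
awk '{
    c = sub(/\|/, "") ? "" : "\n"  # strip one pipe; if one was found, suppress the newline
    $1 = $1                        # touch a field so awk rebuilds $0 and squeezes extra whitespace
    printf "%s%s", $0, c           # print the (possibly joined) line
}' file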

I started with this
awk -F '|' 'NF>1 {printf "%s", $1} NF==1' file
But then I thought that made too many assumptions about how many fields you might have. So I ended up with
awk -F '|' '{if (NF>1 && $NF ~ /^[[:blank:]]*$/) {NF--; printf "%s", $0} else print}' file
which is much sturdier but less terse.

Considering the input text is stored in a file named infile, you can use this GNU sed command:
sed ':begin;$!N;s/|\s*\n//;tbegin;P;D' infile

awk '{p = sub(/\|[[:blank:]]*$/,"")} {printf "%s",$0} !p {print ""}' file
abcdef
1234567890

AWK or bash script to get the rows of a file where the specific column is equal to the given variable [duplicate]

I found some ways to pass external shell variables to an awk script, but I'm confused about ' and ".
First, I tried with a shell script:
$ v=123test
$ echo $v
123test
$ echo "$v"
123test
Then tried awk:
$ awk 'BEGIN{print "'$v'"}'
123test
$ awk 'BEGIN{print '"$v"'}'
123
Why the difference?
Lastly I tried this:
$ awk 'BEGIN{print " '$v' "}'
 123test 
$ awk 'BEGIN{print ' "$v" '}'
awk: cmd. line:1: BEGIN{print
awk: cmd. line:1: ^ unexpected newline or end of string
I'm confused about this.
Getting shell variables into awk may be done in several ways. Some are better than others. This should cover most of them.
Using -v (The best way, most portable)
Use the -v option: (P.S. use a space after -v or it will be less portable. E.g., awk -v var= not awk -vvar=)
variable="line one\nline two"
awk -v var="$variable" 'BEGIN {print var}'
line one
line two
This should be compatible with most awk implementations, and the variable is available in the BEGIN block as well.
If you have multiple variables:
awk -v a="$var1" -v b="$var2" 'BEGIN {print a,b}'
Warning. As Ed Morton writes, escape sequences will be interpreted, so \t becomes a real tab and not a literal \t if that is what you are searching for. This can be solved by using ENVIRON[] or accessing the value via ARGV[].
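For example, a quick illustration of the difference (the variable name is arbitrary):
variable='a\tb'
awk -v var="$variable" 'BEGIN {print var}'  # prints "a", a real tab, then "b" -- the \t was expanded
awk 'BEGIN {print ARGV[1]}' "$variable"     # prints a\tb literally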
PS If you have a vertical bar or other regexp metacharacters as separator, like |?( etc., they must be double escaped. For example, three vertical bars ||| becomes -F'\\|\\|\\|'. You can also use -F"[|][|][|]".
Example of getting data from a program/function into awk (here date is used):
awk -v time="$(date +"%F %H:%M" -d '-1 minute')" 'BEGIN {print time}'
Example of testing the contents of a shell variable as a regexp:
awk -v var="$variable" '$0 ~ var{print "found it"}'
Variable after code block
Here we get the variable after the awk code. This will work fine as long as you do not need the variable in the BEGIN block:
variable="line one\nline two"
echo "input data" | awk '{print var}' var="${variable}"
or
awk '{print var}' var="${variable}" file
Adding multiple variables:
awk '{print a,b,$0}' a="$var1" b="$var2" file
In this way, we can also set a different field separator (FS) for each file:
awk 'some code' FS=',' file1.txt FS=';' file2.ext
Variable after the code block will not work for the BEGIN block:
echo "input data" | awk 'BEGIN {print var}' var="${variable}"
Here-string
A variable can also be added to awk using a here-string, in shells that support them (including Bash):
variable="test"
awk '{print $0}' <<< "$variable"
test
This is the same as:
printf '%s' "$variable" | awk '{print $0}'
P.S. this treats the variable as a file input.
ENVIRON input
As TrueY writes, you can use the ENVIRON to print Environment Variables.
Setting and exporting a variable before running awk, you can print it out like this:
export X=MyVar
awk 'BEGIN{print ENVIRON["X"],ENVIRON["SHELL"]}'
MyVar /bin/bash
ARGV input
As Steven Penny writes, you can use ARGV to get the data into awk:
v="my data"
awk 'BEGIN {print ARGV[1]}' "$v"
my data
To get the data into the code itself, not just the BEGIN:
v="my data"
echo "test" | awk 'BEGIN{var=ARGV[1];ARGV[1]=""} {print var, $0}' "$v"
my data test
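A commented version of that trick (a sketch):
echo "test" | awk '
    BEGIN {var = ARGV[1]; ARGV[1] = ""}  # grab the value, then blank it so awk does not open it as a file
    {print var, $0}' "$v"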
Variable within the code: USE WITH CAUTION
You can use a variable within the awk code, but it's messy and hard to read, and as Charles Duffy points out, this version may also be a victim of code injection. If someone adds bad stuff to the variable, it will be executed as part of the awk code.
This works by expanding the shell variable inside the awk code, so its value becomes part of the program text.
If you want an awk program that changes dynamically with the use of variables, you can do it this way, but DO NOT use it for ordinary variables.
variable="line one\nline two"
awk 'BEGIN {print "'"$variable"'"}'
line one
line two
Here is an example of code injection:
variable='line one\nline two" ; for (i=1;i<=1000;++i) print i"'
awk 'BEGIN {print "'"$variable"'"}'
line one
line two
1
2
3
.
.
1000
You can add lots of commands to awk this way. You can even make it crash with invalid commands.
One valid use of this approach, though, is when you want to pass a symbol to awk to be applied to some input, e.g. a simple calculator:
$ calc() { awk -v x="$1" -v z="$3" 'BEGIN{ print x '"$2"' z }'; }
$ calc 2.7 '+' 3.4
6.1
$ calc 2.7 '*' 3.4
9.18
There is no way to do that using an awk variable populated with the value of a shell variable; you NEED the shell variable to expand to become part of the text of the awk script before awk interprets it. (See the comment below by Ed M.)
Extra info:
Use of double quote
It's always good to double-quote a variable: "$variable"
If not, multiple lines will be run together as one long single line.
Example:
var="Line one
This is line two"
echo $var
Line one This is line two
echo "$var"
Line one
This is line two
Other errors you can get without double quotes:
variable="line one\nline two"
awk -v var=$variable 'BEGIN {print var}'
awk: cmd. line:1: one\nline
awk: cmd. line:1: ^ backslash not last character on line
awk: cmd. line:1: one\nline
awk: cmd. line:1: ^ syntax error
And with single quotes, the value of the variable is not expanded:
awk -v var='$variable' 'BEGIN {print var}'
$variable
It seems that the good-old ENVIRON awk built-in hash is not mentioned at all. An example of its usage:
$ X=Solaris awk 'BEGIN{print ENVIRON["X"], ENVIRON["TERM"]}'
Solaris rxvt
You could pass in the command-line option -v with a variable name (v) and a value (=) of the environment variable ("${v}"):
% awk -vv="${v}" 'BEGIN { print v }'
123test
Or to make it clearer (with far fewer vs):
% environment_variable=123test
% awk -vawk_variable="${environment_variable}" 'BEGIN { print awk_variable }'
123test
You can utilize ARGV:
v=123test
awk 'BEGIN {print ARGV[1]}' "$v"
Note that if you are going to continue into the body, you will need to adjust ARGC:
awk 'BEGIN {ARGC--} {print ARGV[2], $0}' file "$v"
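The same command with comments on the ARGC adjustment (a sketch):
awk 'BEGIN {ARGC--}        # drop the last argument from the count so "$v" is not opened as a file
     {print ARGV[2], $0}' file "$v"   # ARGV[0] is "awk", ARGV[1] is "file", ARGV[2] holds the data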
I just changed Jotne's answer to work in a for loop.
for i in `seq 11 20`; do host myserver-$i | awk -v i="$i" '{print "myserver-"i" " $4}'; done
I had to insert the date at the beginning of the lines of a log file, and it is done like below:
DATE=$(date +"%Y-%m-%d")
awk '{ print "'"$DATE"'", $0; }' /path_to_log_file/log_file.log
It can be redirected to another file to save the output.
Pro Tip
It can come in handy to create a function that handles this, so you don't have to type everything every time. Using the selected solution, we get...
awk_switch_columns() {
    awk -v a="$1" -v b="$2" '{ t = $a; $a = $b; $b = t; print }'
}
And use it as...
echo 'a b c d' | awk_switch_columns 2 4
Output:
a d c b

Extract filename

So I am new to SED and Unix and I would like to replace the following file:
1500:../someFile.C:111 error
1869:../anotherFile.C:222 error
1869:../anotherFile2.Cxx:333 error
//thousands of more lines with same structure
With the following file:
someFile.C
anotherFile.C
anotherFile2.Cxx
Basically, I just want to extract the filename from every line.
So far, I have read the documentation on sed and the second answer here. My best attempt was to use a regex as follows:
sed "s/.\*\/.:.*//g" myFile.txt
Lots of ways to do this.
Sure, you could use sed:
sed 's/^[^:]*://;s/:.*//;s#\.\./##' input.txt
sed 's%.*:\.\./\([^:]*\):.*%\1%' input.txt
Or you could use a series of grep -o instances in a pipe:
grep -o ':[^:]*:' input.txt | grep -o '[^:]\{1,\}' | grep -o '/.*' | grep -o '[^/]\{1,\}'
You could even use awk:
awk -F: '{sub(/\.\.\//,"",$2); print $2}' input.txt
But the simplest way would probably be to use cut:
cut -d: -f2 input.txt | cut -d/ -f2
You can capture the substring between the last / and the following :, and replace the whole string with the captured group (\1).
sed 's#.*/\([^:]\+\).*#\1#g' myFile.txt
someFile.C
anotherFile.C
anotherFile2.Cxx
Or, with a little less escaping, sed with the -r flag:
sed -r 's#.*/([^:]+).*#\1#g' myFile.txt
Or, if you want to use grep, this will only work if your grep supports the -P flag, which enables PCRE (\K discards the part of the match before it, so only the filename is printed):
grep -oP '.*/\K[^:]+' myFile.txt
someFile.C
anotherFile.C
anotherFile2.Cxx

How to print the last but one record of a file using sed? [duplicate]

I have a file that has the following as the last three lines. I want to retrieve the penultimate line, i.e. 100.000;8438; 06:46:12.
.
.
.
99.900; 8423; 06:44:41
100.000;8438; 06:46:12
Number of patterns: 8438
I don't know the line number. How can I retrieve it using a shell script? Thanks in advance for your help.
Try this:
tail -2 yourfile | head -1
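More generally (a small sketch), the N-th line from the end:
tail -n 2 yourfile | head -n 1   # 2nd from the end; for the N-th from the end, use tail -n N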
A short sed one-liner inspired by https://stackoverflow.com/a/7671772/5287901
sed -n 'x;$p'
Explanation:
-n quiet mode: don't automatically print the pattern space
x: exchange the pattern space and the hold space (the hold space now stores the current line, and the pattern space the previous line, if any)
$: on the last line, p: print the pattern space (the previous line, which in this case is the penultimate line)
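A quick demonstration on a three-line input:
$ printf '%s\n' one two three | sed -n 'x;$p'
two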
Use this
tail -2 <filename> | head -1
ed and sed can do it as well.
str='
99.900; 8423; 06:44:41
100.000;8438; 06:46:12
Number of patterns: 8438
'
printf '%s' "$str" | sed -n -e '${x;1!p;};h' # print last line but one
printf '%s\n' H '$-1p' q | ed -s <(printf '%s' "$str") # same
printf '%s\n' H '$-2,$-1p' q | ed -s <(printf '%s' "$str") # print last line but two
From: Useful sed one-liners by Eric Pement
# print the next-to-the-last line of a file
sed -e '$!{h;d;}' -e x # for 1-line files, print blank line
sed -e '1{$q;}' -e '$!{h;d;}' -e x # for 1-line files, print the line
sed -e '1{$d;}' -e '$!{h;d;}' -e x # for 1-line files, print nothing
You don't need all of them, just pick one.
tail -n +2 <filename>
This prints from the second line to the last line.
To clarify what has already been said:
ec2thisandthat | sort -k 5 | grep 2012- | awk '{print $2}' | tail -2 | head -1
snap-e8317883
snap-9c7227f7
snap-5402553f
snap-3e7b2c55
snap-246b3c4f
snap-546a3d3f
snap-2ad48241
snap-d00150bb
returns
snap-2ad48241
tac <file> | sed -n '2p'

How to delete duplicate lines in file in unix?

I can delete duplicate lines in a file using the sort -u and uniq commands. Is that possible using sed or awk?
There's a "famous" awk idiom:
awk '!seen[$0]++' file
It has to keep the unique lines in memory, but it preserves the file order.
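The trick: seen[$0]++ evaluates to the number of times the line has been seen so far, so !seen[$0]++ is true only on a line's first occurrence. For example:
$ printf '%s\n' a b a c b | awk '!seen[$0]++'
a
b
c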
sort and uniq, if you only need to remove duplicates:
sort filename | uniq > filename2
If the file consists of numbers, use sort -n.
After sorting, we can use this sed command:
sed -E '$!N; /^(.*)\n\1$/!P; D' filename
If the file is unsorted, you can use it in combination with sort:
sort filename | sed -E '$!N; /^(.*)\n\1$/!P; D'
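Roughly, what the sed program does on each cycle:
$!N              # append the next line (except on the last line)
/^(.*)\n\1$/!P   # if the two lines are not identical, print the first one
D                # delete up to the first newline and restart with what remains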

Grep part of large file without splitting it

How can I grep a certain part of a large file, for example lines 1000 to 2000, up to line 1000, or from line 1000 on?
I don't want to split the file in smaller files.
You could use sed to pre-process (the q quits early, per Kent's suggestion):
sed -n '1000,2000{p;2000q}' file.txt | grep 'abc'
For line 1000 through the end of the file:
sed -n '1000,$p' file.txt | grep 'abc'
As a minor improvement over the sed solution by ravoori, refactor the grep into the sed:
sed -n '1000,2000{/pattern/p};2000q' file.txt
If you have the pattern in a variable, use double quotes:
sed -n '1000,2000{/'"$pattern"'/p};2000q' file.txt
Or equivalently in Awk:
awk 'NR>=1000 && /pattern/; NR==2000{exit}' file.txt
or with a variable:
awk -v pat="$pattern" 'NR>=1000 && $0~pat; NR==2000{exit}' file.txt
I'd suggest
head -2000 FILE.TXT | tail -1001 | grep XXX
as the neatest solution, because head does not have to read the huge file, just the first 2000 lines; note tail -1001, since lines 1000 through 2000 inclusive are 1001 lines. It essentially achieves what q does in the sed solution.
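In general, for lines M through N inclusive, take N - M + 1 lines from tail (a small sketch using shell arithmetic):
M=1000 N=2000
head -n "$N" FILE.TXT | tail -n "$((N - M + 1))" | grep XXX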
