sed: replace found pattern with another pattern in the same line - unix

I have a line like this:
"LExx 01236723 LE12"
I want to find LE12 and replace LExx with LE12.
Is that possible with sed?

you can try this:
$ echo "LExx 01236723 LE12" | sed 's/LExx\(.*LE\(..\)$\)/LE\2\1/'
LE12 01236723 LE12
explanation
s # substitute
/LExx # find LExx
\(.* # save rest in arg1 (\1)
LE\(..\)$ # find LE and save next 2 char in arg2 (\2)
\) # end arg1
/
LE\2\1 # print LE + arg2 + arg1
/
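The same nested-group idea reads a little more easily with extended regular expressions, where the parentheses need no backslashes. This is only a sketch, assuming a sed that accepts -E (GNU and BSD sed both do); the variable names line and fixed are made up for illustration:

```shell
line='LExx 01236723 LE12'
# Group 1 is everything after LExx; group 2 (nested inside it) is the two
# characters that follow the final LE. The replacement is LE + \2 + \1.
fixed=$(printf '%s\n' "$line" | sed -E 's/^LExx(.*LE(..))$/LE\2\1/')
printf '%s\n' "$fixed"
```

Anchoring with ^ is an extra assumption here: it only rewrites lines that start with LExx.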

Assuming both search strings are constants, then using GNU sed (running under bash) with the regexp extension \b (for word boundary):
sed '/LE12/s/\bLExx\b/LE12/' <<< "LExx 01236723 LE12"
Output:
LE12 01236723 LE12

Related

replace specific columns on lines not starting with specific character in a text file

I have a text file that looks like this:
>long_name
AAC-TGA
>long_name2
CCTGGAA
And a list of column numbers: 2, 4, 7. Of course I can have these as a variable like:
cols="2 4 7"
I need to replace each of those columns, in the rows that don't start with >, with a single character, e.g. an N, to result in:
>long_name
ANCNTGN
>long_name2
CNTNGAN
Additional details - the file has ~200K lines. All lines that don't start with > are the same length. Column indices will never exceed the length of the non > lines.
It seems to me that some combination of sed and awk must be able to do this quickly, but I cannot for the life of me figure out how to link it all together.
E.g. I can use sed to work on all lines that don't start with a > like this (in this case replacing all spaces with N's):
sed -i.bak '/^[^>]/s/ /N/g' input.txt
And I can use AWK to replace specific columns of lines as I want to like this (I think...):
awk '$2=N'
But I am struggling to stitch this together
With GNU awk, set i/o field separators to empty string so that each character becomes a field, and you can easily update them.
awk -v cols='2 4 7' '
BEGIN {
split(cols,f)
FS=OFS=""
}
!/^>/ {
for (i in f)
$(f[i])="N"
}
1' file
Also see Save modifications in place with awk.
You can generate a list of replacement commands first and then pass them to sed
$ printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ printf '2, 4, 7' | sed -E 's|[^0-9]*([0-9]+)[^0-9]*|/^>/! s/./N/\1\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ sed -f <(printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g') ip.txt
>long_name
ANCNTGN
>long_name2
CNTNGAN
Can also use {} grouping
$ printf '2 4 7' | sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|'
/^>/!{s/./N/2; s/./N/4; s/./N/7; }
Using any awk in any shell on every UNIX box:
$ awk -v cols='2 4 7' '
BEGIN { split(cols,c) }
!/^>/ { for (i in c) $0=substr($0,1,c[i]-1) "N" substr($0,c[i]+1) }
1' file
>long_name
ANCNTGN
>long_name2
CNTNGAN
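If the column list arrives as "2, 4, 7" rather than "2 4 7", the substr approach can probably be adapted by splitting on any run of non-digits. A rough sketch, assuming a POSIX awk; the sample lines are fed in with printf instead of a file, and the names cols and out are illustrative only:

```shell
cols='2, 4, 7'
out=$(printf '%s\n' '>long_name' 'AAC-TGA' '>long_name2' 'CCTGGAA' |
awk -v cols="$cols" '
BEGIN { n = split(cols, c, /[^0-9]+/) }   # split on anything that is not a digit
!/^>/ {
    for (i = 1; i <= n; i++)
        if (c[i] != "")                   # skip empty pieces from stray separators
            $0 = substr($0, 1, c[i] - 1) "N" substr($0, c[i] + 1)
}
1')
printf '%s\n' "$out"
```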

Remove consecutive duplicate words from a file using awk or sed

My input file looks like below:
“true true, rohith Rohith;
cold burn, and fact and fact good good?”
Output should look like:
"true, rohith Rohith;
cold burn, and fact and fact good?"
I am trying to do this with awk, but I couldn't get the desired result.
awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s ",$i,FS)}{printf("\n")}' input.txt
Could someone please help me here.
Regards,
Rohith
With GNU awk for the 4th arg to split():
$ cat tst.awk
{
n = split($0,words,/[^[:alpha:]]+/,seps)
prev = ""
for (i=1; i<=n; i++) {
word = words[i]
if (word != prev) {
printf "%s%s", seps[i-1], word
}
prev = word
}
print ""
}
$ awk -f tst.awk file
“true, rohith Rohith;
cold burn, and fact and fact good?”
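Without GNU awk's 4-argument split(), roughly the same effect can be had by peeling words off the line with match(). This is a sketch rather than a drop-in replacement: it assumes words are plain ASCII letters, and the function name dedupe_words is invented for the example.

```shell
dedupe_words() {
    awk '{
        rest = $0; out = ""; prev = ""
        # Peel off one word at a time, keeping the separator before it.
        while (match(rest, /[A-Za-z]+/)) {
            sep  = substr(rest, 1, RSTART - 1)
            word = substr(rest, RSTART, RLENGTH)
            if (word != prev)          # skip an immediate repeat and its separator
                out = out sep word
            prev = word
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest                 # rest holds any trailing non-word text
    }' "$@"
}

printf '%s\n' 'true true, rohith Rohith;' 'cold burn, and fact and fact good good?' | dedupe_words
```

Like the gawk version, it compares words case-sensitively and resets the previous word on every line.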
Just match the same backreference in sed:
sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'
How it works:
:l - create a label l to jump to. See tl below.
s - substitute
/
\(^\|[^[:alpha:]]\) - match beginning of the line or non-alphabetic character. This is so that the next part matches the whole word, not only suffix.
\([[:alpha:]]\{1,\}\) - match a word - one or more alphabetic characters.
[^[:alpha:]]\{1,\} - match a non-word - one or more non-alphabetic characters.
\2 - match the same thing as in the second \(...\) - ie. match the word.
\($\|[^[:alpha:]]\) - match the end of the line or a non-alphabetic character. This is so that we match the whole second word, not only its prefix.
/
\1\2\3 - substitute it for <beginning of the line or non-alphabetic prefix character><the word><end of the line or non-alphabetic suffix character found>
/
g - substitute globally. But because the regex never goes back over text it has already consumed, each pass collapses only two words at a time.
tl - jump to label l if the last s command was successful. This is here so that when the same word appears three times, like true true true, the copies are properly replaced by a single true.
Without the \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\) parts, true rue, for example, would be replaced by true, because the suffix rue followed by rue would match.
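To see the loop and the anchors at work, here is a small sketch using the same expression in -E form. It assumes GNU sed (for the one-line label syntax), and the function name dedupe is only for illustration:

```shell
dedupe() {
    sed -E ':l; s/(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+\2($|[^[:alpha:]])/\1\2\3/g; tl'
}

printf 'true true true\n' | dedupe   # three copies: the first pass leaves two, the loop finishes the job
printf 'true rue\n' | dedupe         # left alone, because rue is not a whole-word repeat
```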
Below are my other solutions, which also remove repeated words across lines.
My first solution was with uniq. First I transform the input into pairs of the form <non-alphabetical sequence separating words encoded in hex> <a word>. Then I run them through uniq -f1, which ignores the first field, and convert back. This will be very slow:
# recreate input
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
# insert zero byte after each word and non-word
# the -z option is from GNU sed
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
# for each pair (non-word, word)
xargs -0 -n2 sh -c '
# output hexadecimal representation of non-word
printf "%s" "$1" | xxd -p | tr -d "\n"
# and output space with the word
printf " %s\n" "$2"
' -- |
# uniq ignores empty fields - so make sure field1 always has something
sed 's/^/-/' |
# uniq while ignoring first field
uniq -f1 |
# for each pair (non-word in hex, word)
xargs -n2 bash -c '
# in a POSIX shell, use: printf "%s" "$1" | sed "s/^-//" | xxd -r -p
# change non-word from hex to characters
printf "%s" "${1:1}" | xxd -r -p
# output word
printf "%s" "$2"
' --
But then I noticed that sed does a good job of tokenizing the input: it places zero bytes between word and non-word tokens, so the stream is easy to read. I can skip repeated words in awk by reading the zero-separated stream in GNU awk and comparing each word with the last word read:
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
gawk -vRS='\0' '
NR%2==1{
nonword=$0
}
NR%2==0{
if (length(lastword) && lastword != $0) {
printf "%s%s", lastword, nonword
}
lastword=$0
}
END{
printf "%s%s", lastword, nonword
}'
Instead of the zero byte, any character that does not occur in the input can serve as the record separator, for example ^. That also works with non-GNU awk; I tested it with the mawk available on repl. I shortened the script by using shorter variable names here:
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r 's/[[:alpha:]]+/^&^/g' |
awk -vRS='^' '
NR%2{ n=$0 }
NR%2-1 && length(l) && l != $0 { printf "%s%s", l, n }
NR%2-1 { l=$0 }
END { printf "%s%s", l, n }
'
Tested on repl. The snippets output:
true, rohith Rohith;
cold burn, and fact and fact good?
Simple sed:
echo "true true, rohith Rohith;
cold burn, and fact and fact good good?" | sed -r 's/(\w+) (\1)/\1/g'
This is not exactly what you have shown in output but is close using gnu-awk:
awk -v RS='[^-_[:alnum:]]+' '$1 == p{printf "%s", RT; next} {p=$1; ORS=RT} 1' file
“true , rohith Rohith;
cold burn, and fact and fact good ?”
sed -E 's/(\w+) *\1/\1/g' sample.txt
sample.txt
“true true, rohith Rohith;
cold burn, and fact and fact good good?”
output:
:~$ sed -E 's/(\w+) *\1/\1/g' sample.txt
“true, rohith Rohith;
cold burn, and fact and fact good?”
Explanation
(\w+) *\1 - matches a word followed by optional spaces and the same word again; the parentheses capture the word so that the replacement \1 keeps a single copy
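One caveat worth checking before relying on this: because the pattern is not anchored to word boundaries and allows zero spaces, it can also fire inside a word. The sketch below shows that and a possible tightening; it assumes GNU sed (for \w and \b), and the variable names loose and strict are illustrative:

```shell
# The unanchored pattern also matches inside words: a doubled letter counts as a repeat.
loose=$(printf 'a good book\n' | sed -E 's/(\w+) *\1/\1/g')
# Anchoring both ends with \b and requiring at least one space avoids that.
strict=$(printf 'a good good book\n' | sed -E 's/\b(\w+)( +\1)+\b/\1/g')
printf '%s\n%s\n' "$loose" "$strict"
```

The first command prints a god bok, the second a good book.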
Depending on your expected input, this might work:
sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g ; s/ ([.,;:])/\1/g ; s/  / /g' myfile
([a-zA-Z0-9_-]+) = words that might be repeated.
( *)\1 = check if the previous word is repeated after a space.
s/ ([.,;:])/\1/g = removes extra spaces before punctuation (you might want to add characters to this group).
s/  / /g = collapses double spaces into one.
This works with GNU sed.

Use awk to replace word in file

I have a file with some lines:
a
b
c
d
I would like to cat this file into a awk command to produce something like this:
letter is a
letter is b
letter is c
letter is d
using something like this:
cat file.txt | awk 'letter is $1'
But it's not printing out as expected:
$ cat raw.txt | awk 'this is $1'
a
b
c
d
At the moment, you have no { action } block, so your condition evaluates the two empty variables this and is, concatenating them with the first field $1, and checks whether the result is true (a non-empty string). It is, so the default action prints each line.
It sounds like you want to do this instead:
awk '{ print "letter is", $1 }' raw.txt
Although in this case, you might as well just use sed:
sed 's/^/letter is /' raw.txt
This command matches the start of each line and adds the string.
Note that I'm passing the file as an argument, rather than using cat with a pipe.
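To see the difference between a bare pattern and an action block, here is a small sketch. It feeds in a blank line on purpose, since that is the one case where the pattern is false; the variable names as_pattern and as_action are only for illustration:

```shell
# "this" and "is" are unset variables, so the pattern is just the string value of $1.
# A non-empty string is true, so non-blank lines are printed; the blank line is not.
as_pattern=$(printf 'a\n\nb\n' | awk 'this is $1')
# With an action block and a quoted string literal, the text is actually printed.
as_action=$(printf 'a\n\nb\n' | awk '{ print "letter is", $1 }')
printf '%s\n---\n%s\n' "$as_pattern" "$as_action"
```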
Not sure if you wanted sed or awk but this is in awk:
$ awk '{print "letter is " $1}' file
letter is a
letter is b
letter is c
letter is d

Compare size of each cell in Unix Scripting

I want to compare each cell size/length and change its content depending on its length.
The current table is of format
AB
CD
AB
AB
CD
155668/01
AB
1233/10
I want to replace the cells which have length more than "2" to DE.
Output
AB
CD
AB
AB
CD
DE
AB
DE
I tried
awk -F "," '{ if($(#1) > "2") {print "DE"} else {print $1 }}'
It says syntax error.
If I use wc -m in place of $(#, the output is the same as the input.
The easiest way is to use sed:
sed '/^..$/!s/.*/DE/' file
In awk, you could say:
awk '!/^..$/ { $0 = "DE" } 1' file
In both cases, the idea is the same: if the line does not consist of exactly two characters, replace the whole line with DE. In the case of sed, the whole line is .*, in the case of awk, it is $0.
Try this -
$ awk '{print (length($1)>2?"DE":$1)}' f
AB
CD
AB
AB
CD
DE
AB
DE
The idiomatic way would be:
awk 'length($1) > 2 { $1 = "DE" } 1'
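The question's attempt used -F ",", so in case the cells actually sit side by side on one comma-separated line, the same length test can be applied to every field. This is a guess at the layout, not something the question confirms; row and out are illustrative names:

```shell
row='AB,CD,155668/01,AB,1233/10'
out=$(printf '%s\n' "$row" |
    awk 'BEGIN { FS = OFS = "," }
         { for (i = 1; i <= NF; i++) if (length($i) > 2) $i = "DE" }
         1')
printf '%s\n' "$out"
```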

How to delete space from an nth location using sed command?

input:
B0128001594  2015081100001 RESPDTL
output:
B0128001594  2015081100001RESPDTL
In the above example, I need the 3rd white space to be removed.
To remove the second block of spaces use:
$ sed 's/\s\+//2' <<< "B0128001594  2015081100001 RESPDTL"
B0128001594  2015081100001RESPDTL
#                         ^
#                         removed here
To remove the second space, use:
$ sed 's/ //2' <<< "B0128001594  2015081100001 RESPDTL"
B0128001594 2015081100001 RESPDTL
#           ^
#           removed here
Note the usage of \s\+: \s for a space (tab or space) and \+ to indicate one or many.
As andlrc suggests in comments, do use [[:space:]] to make it POSIX.
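A sketch of that POSIX spelling, assuming the input has two spaces between the first two fields (so the third whitespace character is the second block); input and out are illustrative names:

```shell
input='B0128001594  2015081100001 RESPDTL'
# [[:space:]]\{1,\} is the POSIX spelling of \s\+; the trailing 2 selects the second match.
out=$(printf '%s\n' "$input" | sed 's/[[:space:]]\{1,\}//2')
printf '%s\n' "$out"
```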
How about using awk?
echo "B0128001594 2015081100001 RESPDTL" | awk '{print $1,$2$3}'
You can use the following:
$ sed 's/^\([^ ]* *[^ ]*\) */\1/' input
B0128001594 2015081100001RESPDTL
Basically it replaces the first two columns and the trailing space, with the first two columns.
So A B C becomes A BC and A B C D becomes A BC D.
Usage: awk -f f.awk file.dat
$ cat f.awk
{
match($0, /[^ ]+[ ]+[^ ]+[ ]/)
print substr($0, 1, RLENGTH-1) substr($0, RLENGTH+1)
}
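The match() idea can be stretched to remove the nth run of blanks rather than only the second. A rough sketch, assuming a POSIX awk and that blanks means spaces or tabs; the function name remove_nth_gap is made up for this example:

```shell
remove_nth_gap() {   # usage: remove_nth_gap N  (reads stdin)
    awk -v n="$1" '{
        rest = $0; out = ""
        for (k = 1; k <= n && match(rest, /[ \t]+/); k++) {
            # Keep every run of blanks before the nth; drop the nth itself.
            keep = (k < n) ? RSTART + RLENGTH - 1 : RSTART - 1
            out  = out substr(rest, 1, keep)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}

printf 'A  B C D\n' | remove_nth_gap 2
```

Unlike the script above, this removes the whole run of blanks, and lines with fewer than n runs pass through unchanged.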

Resources