Create duplicate line based on maximum number of delimiters in a field - unix

I have a file which contains multiple fields and 2 types of delimiters. If the number of delimiters in one of the fields reaches a defined number then I want to split the field after the number is met onto the next line while replicating the first part of the line.
Is this possible in awk or sed?
Example
Input
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6,7,8,9,10|
a3|b|c|d|1,2|
Max Number = 6, to split on commas in field 5
Output
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6|
a2|b|c|d|7,8,9,10|
a3|b|c|d|1,2|

Assuming not more than one split would be required:
$ sed -E 's/^(([^|]+\|){4})(([^,]+,){5}[^,]+),(.*)/\1\3|\n\1\5/' ip.txt
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6|
a2|b|c|d|7,8,9,10|
a3|b|c|d|1,2|
-E enables extended regular expressions (ERE); some sed versions use the -r option instead
^(([^|]+\|){4}) first 4 columns delimited by |
(([^,]+,){5}[^,]+) 6 columns delimited by , (without trailing ,)
, comma between 6th and 7th column
(.*) rest of line
\1\3|\n\1\5 split as required (\n in the replacement is understood by GNU sed; other sed versions may need a literal newline)
The column and max number can be passed from shell variables too (example shown for bash)
$ col=5; max=6
$ sed -E "s/^(([^|]+\|){$((col-1))})(([^,]+,){$((max-1))}[^,]+),(.*)/\1\3|\n\1\5/" ip.txt
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6|
a2|b|c|d|7,8,9,10|
a3|b|c|d|1,2|
$ col=5; max=8
$ sed -E "s/^(([^|]+\|){$((col-1))})(([^,]+,){$((max-1))}[^,]+),(.*)/\1\3|\n\1\5/" ip.txt
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6,7,8|
a2|b|c|d|9,10|
a3|b|c|d|1,2|

awk to the rescue!
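awk -F\| -v OFS=\| -v c=',' '
{
    n = split($5, a, c)              # number of comma-separated items in field 5
    if (n > 6) {
        f = $5                       # remember the full field
        $5 = a[1] c a[2] c a[3] c a[4] c a[5] c a[6]
        print                        # first line: items 1 to 6
        $5 = f
        sub(/^([^,]+,){6}/, "", $5)  # drop the first 6 items; the rest is printed by the final 1
    }
}
1' file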
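Both answers above hard-code or assume a single split. If a field could hold more than twice the maximum, a loop over chunks may be easier to reason about. The following is only a sketch: the function name split_field and its two arguments (column number, maximum items) are made up for illustration, and it assumes | and , are the only delimiters, as in the question.

```shell
# Hypothetical helper: split_field COL MAX  (reads stdin, writes stdout)
split_field() {
  awk -F'|' -v OFS='|' -v col="$1" -v max="$2" '
  {
    n = split($col, a, ",")            # items in the chosen field
    if (n <= max) { print; next }      # short enough: leave the line alone
    for (i = 1; i <= n; i += max) {    # one output line per chunk of max items
      chunk = a[i]
      for (j = i + 1; j < i + max && j <= n; j++)
        chunk = chunk "," a[j]
      $col = chunk                     # assigning a field rebuilds the record with OFS
      print
    }
  }'
}

printf '%s\n' 'a1|b|c|d|1,2,3,4|' 'a2|b|c|d|1,2,3,4,5,6,7,8,9,10|' 'a3|b|c|d|1,2|' |
  split_field 5 6
```

For the sample input this should print the same four lines as the expected output in the question.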

Related

replace specific columns on lines not starting with specific character in a text file

I have a text file that looks like this:
>long_name
AAC-TGA
>long_name2
CCTGGAA
And a list of column numbers: 2, 4, 7. Of course I can have these as a variable like:
cols="2 4 7"
I need to replace each of those columns, in the rows that don't start with >, with a single character, e.g. an N, to result in:
>long_name
ANCNTGN
>long_name2
CNTNGAN
Additional details - the file has ~200K lines. All lines that don't start with > are the same length. Column indices will never exceed the length of the non > lines.
It seems to me that some combination of sed and awk must be able to do this quickly, but I cannot for the life of me figure out how to link it all together.
E.g. I can use sed to work on all lines that don't start with a > like this (in this case replacing all spaces with N's):
sed -i.bak '/^[^>]/s/ /N/g' input.txt
And I can use AWK to replace specific columns of lines as I want to like this (I think...):
awk '$2=N'
But I am struggling to stitch this together
With GNU awk, set i/o field separators to empty string so that each character becomes a field, and you can easily update them.
awk -v cols='2 4 7' '
BEGIN {
    split(cols,f)
    FS=OFS=""
}
!/^>/ {
    for (i in f)
        $(f[i])="N"
}
1' file
Also see Save modifications in place with awk.
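On saving the result back to the same file: GNU awk 4.1 and later has an inplace extension (gawk -i inplace), but a temporary file works with any awk. Below is a rough sketch of the temporary-file route. The function name mask_in_place is hypothetical, mktemp is assumed to be available (it is common but not required by POSIX), and the masking itself uses substr so it does not depend on an empty FS.

```shell
# Hypothetical helper: mask_in_place FILE "2 4 7"
mask_in_place() {
  tmp=$(mktemp) || return 1
  awk -v cols="$2" '
    BEGIN { n = split(cols, c, " ") }   # column numbers to overwrite
    !/^>/ {
      for (i = 1; i <= n; i++)
        $0 = substr($0, 1, c[i] - 1) "N" substr($0, c[i] + 1)
    }
    { print }
  ' "$1" > "$tmp" && cat "$tmp" > "$1"  # copy back only if awk succeeded
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Copying with cat rather than mv keeps the original file's permissions and inode, at the cost of not being atomic.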
You can generate a list of replacement commands first and then pass them to sed
$ printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ printf '2, 4, 7' | sed -E 's|[^0-9]*([0-9]+)[^0-9]*|/^>/! s/./N/\1\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ sed -f <(printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g') ip.txt
>long_name
ANCNTGN
>long_name2
CNTNGAN
Can also use {} grouping
$ printf '2 4 7' | sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|'
/^>/!{s/./N/2; s/./N/4; s/./N/7; }
Using any awk in any shell on every UNIX box:
$ awk -v cols='2 4 7' '
BEGIN { split(cols,c) }
!/^>/ { for (i in c) $0=substr($0,1,c[i]-1) "N" substr($0,c[i]+1) }
1' file
>long_name
ANCNTGN
>long_name2
CNTNGAN

How to delete space from an nth location using sed command?

input:
B0128001594 2015081100001 RESPDTL
output:
B0128001594 2015081100001RESPDTL
In the above example, I need the 3rd white space to be removed.
To remove the second block of spaces use:
$ sed 's/\s\+//2' <<< "B0128001594 2015081100001 RESPDTL"
B0128001594 2015081100001RESPDTL
# ^
# removed here
To remove the second space, use:
$ sed 's/ //2' <<< "B0128001594 2015081100001 RESPDTL"
B0128001594 2015081100001 RESPDTL
# ^
# removed here
Note the usage of \s\+: \s matches a whitespace character (such as a space or tab) and \+ means one or more. Both are GNU extensions in basic regular expressions.
As andlrc suggests in the comments, use [[:space:]] instead of \s to make it POSIX.
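A portable variant of the first command might look like the sketch below. The function name remove_nth_blank_run is invented for illustration; it takes the occurrence number as its argument and relies only on the POSIX bracket class, the \{1,\} interval, and the numeric flag of the s command.

```shell
# Hypothetical helper: remove_nth_blank_run N  (reads stdin, writes stdout)
remove_nth_blank_run() {
  # [[:space:]]\{1,\} is one or more whitespace characters;
  # the trailing number tells sed which occurrence on the line to replace
  sed "s/[[:space:]]\{1,\}//$1"
}

printf '%s\n' 'B0128001594 2015081100001 RESPDTL' | remove_nth_blank_run 2
```

The argument is pasted into the sed script unchecked, so it should only ever be a positive integer.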
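How about using awk? (Note that this rebuilds the line, so any run of spaces between the first two fields becomes a single space.)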
echo "B0128001594 2015081100001 RESPDTL" | awk '{print $1,$2$3}'
You can use the following:
$ sed 's/^\([^ ]* *[^ ]*\) */\1/' input
B0128001594 2015081100001RESPDTL
Basically, it replaces the first two columns plus the spaces that follow them with just the first two columns.
So A B C becomes A BC and A B C D becomes A BC D.
Usage: awk -f f.awk file.dat
$ cat f.awk
{
match($0, /[^ ]+[ ]+[^ ]+[ ]/)
print substr($0, 1, RLENGTH-1) substr($0, RLENGTH+1)
}

Unix Command for counting number of words which contains letter combination (with repeats and letters in between)

How would you count the number of words in a text file that contain all of the letters a, b, and c? These letters may occur more than once in the word and the word may contain other letters as well. (For example, "cabby" should be counted.)
Using sample input which should return 2:
abc abb cabby
I tried both:
grep -E "[abc]" test.txt | wc -l
grep 'abcdef' testCount.txt | wc -l
both of which return 1 instead of 2.
Thanks in advance!
You can use awk and test the return value of the sub function. sub returns the number of substitutions it made, so it returns 1 when the letter was found and 0 otherwise.
$ echo "abc abb cabby" |
awk '{
    for(i=1;i<=NF;i++)
        if(sub(/a/,"",$i)>0 && sub(/b/,"",$i)>0 && sub(/c/,"",$i)>0) {
            count+=1
        }
}
END{print count+0}'
2
We require the return value to be greater than 0 for all three letters. The for loop iterates over every word of every line and increments the counter when all three letters are found in the word. Printing count+0 ensures that 0 is shown, rather than an empty line, when no word matches.
I don't think you can get around using multiple invocations of grep. Thus I would go with (GNU grep):
<file grep -ow '\w\+' | grep a | grep b | grep c
Output:
abc
cabby
The first grep puts each word on a line of its own.
Try this, it will work
sed 's/ /\n/g' test.txt |grep a |grep b|grep c
$ cat test.txt
abc abb cabby
$ sed 's/ /\n/g' test.txt |grep a |grep b|grep c
abc
cabby
hope this helps..
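Since the question asks for a count rather than the words themselves, the pipelines above can end in grep -c. The sketch below is one way to do that without GNU-only options. The function name count_words_with_abc is made up, and it treats anything separated by whitespace as a word, so punctuation stays attached to its word.

```shell
# Hypothetical helper: reads text on stdin, prints how many words contain a, b and c
count_words_with_abc() {
  # tr puts each whitespace-separated word on its own line (-s squeezes repeats),
  # then each grep keeps only the words containing one more required letter
  tr -s '[:space:]' '[\n*]' | grep a | grep b | grep -c c
}

printf 'abc abb cabby\n' | count_words_with_abc
```

For the sample line this prints 2.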

Remove all lines from file with duplicate value in field, including the first occurrence

I would like to remove all the lines in my data file that contain a value in column 2 that is repeated in column 2 in other lines.
I've sorted by the value in column 2, but can't figure out how to use uniq for just the values in one field as the values are not necessarily of the same length.
Alternately, I can remove lines with the duplicate using an awk one-liner like
awk -F"[,]" '!_[$2]++'
but this retains the line with the first incidence of the repeated value in col 2.
As an example, if my data is
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
I would like to remove ALL lines (including the first) where b occurs in the second column.
Like this:
d,e,f
h,i,j
Thanks for any advice!!
If the order is not important then the following should work:
awk -F, '
!seen[$2]++ {
    line[$2] = $0
}
END {
    for(val in seen)
        if(seen[val]==1)
            print line[val]
}' file
Output
h,i,j
d,e,f
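Solution with grep, if the repeated value (here b) is known in advance. Note that the pattern matches ,b, anywhere on the line, not only in column 2: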
grep -v -E '\b,b,\b' text.txt
Content of the file:
$ cat text.txt
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
a,n,b
b,c,f
$ grep -v -E '\b,b,\b' text.txt
d,e,f
h,i,j
a,n,b
b,c,f
Hope it helps
Some different awk:
awk -F, '
BEGIN {f=0}
FNR==NR {_[$2]++;next}
f==0 {
    f=1
    for(j in _)if(_[j]>1)delete _[j]
}
$2 in _
' file file
Explanation
The awk passes through the file twice - that's why the file name appears twice at the end. On the first pass (when FNR==NR) I count the number of times each column 2 value appears, in array _[]. At the start of the second pass, I delete all elements of _[] whose value has been seen more than once. Then, for every line of the second pass, I print the line if its second field is still in _[].
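If the data arrives on a pipe, so it cannot be read twice, and the original order matters, a single pass that buffers the lines is another option. This is only a sketch: the function name drop_repeated_keys is hypothetical, and it keeps the whole input in memory, which may not suit very large files.

```shell
# Hypothetical helper: reads comma-separated lines on stdin, prints only
# the lines whose column 2 value occurs exactly once, in input order
drop_repeated_keys() {
  awk -F, '
    { count[$2]++; line[NR] = $0; key[NR] = $2 }   # buffer every line with its key
    END {
      for (i = 1; i <= NR; i++)
        if (count[key[i]] == 1) print line[i]
    }'
}

printf '%s\n' a,b,c c,b,a d,e,f h,i,j j,b,h | drop_repeated_keys
```

For the sample data this prints d,e,f followed by h,i,j.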

Search file for characters excluding a set of characters

I have a text file with 1.3million rows and 258 columns delimited by semicolons (;). How can I search for what characters are in the file, excluding letters of the alphabet (both upper and lower case), semicolon (;), quote (') and double quote (")? Ideally the results should be in a non-duplicated list.
Use the following pipeline
# Remove the characters you want to exclude
tr -d 'A-Za-z;"'\' <file |
# One character on each line
sed 's/\(.\)/\1\
/g' |
# Remove duplicates
sort -u
Example
echo '2343abc34;ABC;;#$%"' |
tr -d 'A-Za-z;"'\' |
sed 's/\(.\)/\1\
/g' |
sort -u
$
%
2
3
4
#
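You can also use grep and pipe its output to sort and then to uniq. Note that grep -v selects whole lines that do not match, so to pull out single characters you need the -o option (supported by GNU and BSD grep) with a negated bracket expression instead.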
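A sketch of the grep route, assuming a grep that supports -o (GNU and BSD grep do; it is not in POSIX). The function name odd_chars is made up, and sort runs under LC_ALL=C only so that the output order is predictable.

```shell
# Hypothetical helper: reads text on stdin, prints each distinct character
# that is not a letter, semicolon, single quote or double quote
odd_chars() {
  # -o prints every match on its own line; the bracket expression is negated
  grep -o "[^A-Za-z;\"']" | LC_ALL=C sort -u
}

printf '%s\n' '2343abc34;ABC;;#$%"' | odd_chars
```

On a 1.3 million row file this avoids the separate tr and sed steps, though whether it is faster depends on the grep implementation.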
