Replace last 9 delimiters "," with "|" in Unix

I want to replace the last 9 "," delimiters with "|" in a file.
For example, from:
abcd,3,5,5,7,7,1,2,3,4
"ashu,pant,something",3,5,5,7,7,8,7,8,8,8
to:
abcd|3|5|5|7|7|1|2|3|4
"ashu,pant,something"|3|5|5|7|7|8|7|8|8|8
Help would be really appreciated.

Not exactly the same, but with GNU sed you can replace the second and all later occurrences:
$ echo \"ashu,pant\",3,5,5,7,7,87,8,8,8 |
sed 's/,/|/2g'
"ashu,pant"|3|5|5|7|7|87|8|8|8
Edit to match your changed requirements:
Hackish, but first reverse the lines and replace all commas with pipes, then turn the pipes back into commas starting from the 10th occurrence, and reverse the lines again:
$ echo -e \"ashu,pant\",3,5,5,7,7,87,8,8,8\\nabcd,3,5,5,7,7,1,2,3,4 |
rev |
sed 's/,/|/g; s/|/,/10g' |
rev
"ashu,pant"|3|5|5|7|7|87|8|8|8
abcd|3|5|5|7|7|1|2|3|4
You could also use GNU awk and FPAT to replace all commas outside of quotes:
$ echo -e \"ashu,pant\",3,5,5,7,7,87,8,8,8\\nabcd,3,5,5,7,7,1,2,3,4 |
awk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")";OFS="|"}{$1=$1}1'
"ashu,pant"|3|5|5|7|7|87|8|8|8
abcd|3|5|5|7|7|1|2|3|4
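If rev or GNU sed's numbered g flag is not available, the same "last N commas" idea can be sketched in plain awk by peeling fields off the end of the line. The function name replace_last_commas and its count argument are made up for this illustration, and like the rev approach it assumes the trailing fields never contain quoted commas.

```shell
# Hypothetical helper: replace the last N commas (default 9) on each input line with "|".
replace_last_commas() {
  awk -v n="${1:-9}" '{
    head = $0; tail = ""
    # /,[^,]*$/ can only match at the last comma, so each pass moves one field to tail.
    for (i = 0; i < n && match(head, /,[^,]*$/); i++) {
      tail = "|" substr(head, RSTART + 1) tail
      head = substr(head, 1, RSTART - 1)
    }
    print head tail
  }'
}

printf '%s\n' '"ashu,pant",3,5,5,7,7,87,8,8,8' 'abcd,3,5,5,7,7,1,2,3,4' | replace_last_commas 9
```

Lines with fewer than N commas simply get all of their commas replaced, because the loop stops as soon as match finds nothing.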

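Another option, which only works while every numeric field is a single digit (as in the sample) and the quoted text contains no digits:
awk '{gsub(/[[:digit:]]/," |&"); gsub(/, /,"")}1' file
Output: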
abcd|3|5|5|7|7|1|2|3|4
"ashu,pant,something"|3|5|5|7|7|8|7|8|8|8

replace specific columns on lines not starting with specific character in a text file

I have a text file that looks like this:
>long_name
AAC-TGA
>long_name2
CCTGGAA
And a list of column numbers: 2, 4, 7. Of course I can have these as a variable like:
cols="2 4 7"
I need to replace each of those columns, in the rows that don't start with >, with a single character, e.g. an N, to result in:
>long_name
ANCNTGN
>long_name2
CNTNGAN
Additional details - the file has ~200K lines. All lines that don't start with > are the same length. Line indices will never exceed the length of the non > lines.
It seems to me that some combination of sed and awk must be able to do this quickly, but I cannot for the life of me figure out how to link it all together.
E.g. I can use sed to work on all lines that don't start with a > like this (in this case replacing all spaces with N's):
sed -i.bak '/^[^>]/s/ /N/g' input.txt
And I can use AWK to replace specific columns of lines as I want to like this (I think...):
awk '$2=N'
But I am struggling to stitch this together
With GNU awk, set the input and output field separators to the empty string so that each character becomes a field, and you can easily update them:
awk -v cols='2 4 7' '
BEGIN {
split(cols,f)
FS=OFS=""
}
!/^>/ {
for (i in f)
$(f[i])="N"
}
1' file
Also see Save modifications in place with awk.
You can generate a list of replacement commands first and then pass them to sed
$ printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ printf '2, 4, 7' | sed -E 's|[^0-9]*([0-9]+)[^0-9]*|/^>/! s/./N/\1\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7
$ sed -f <(printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g') ip.txt
>long_name
ANCNTGN
>long_name2
CNTNGAN
You can also use {} grouping:
$ printf '2 4 7' | sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|'
/^>/!{s/./N/2; s/./N/4; s/./N/7; }
Using any awk in any shell on every UNIX box:
$ awk -v cols='2 4 7' '
BEGIN { split(cols,c) }
!/^>/ { for (i in c) $0=substr($0,1,c[i]-1) "N" substr($0,c[i]+1) }
1' file
>long_name
ANCNTGN
>long_name2
CNTNGAN
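Since the question edits the file with sed -i.bak, here is a rough sketch of how the portable awk version could overwrite the file through a temporary file. The function name mask_columns is hypothetical, and mktemp is assumed to be installed (it is common but not required by POSIX).

```shell
# Hypothetical helper: mask_columns FILE COL... overwrites FILE with the masked version.
mask_columns() {
  file=$1; shift
  tmp=$(mktemp) || return 1
  if awk -v cols="$*" '
       BEGIN { n = split(cols, c, " ") }
       !/^>/ { for (i = 1; i <= n; i++) $0 = substr($0, 1, c[i] - 1) "N" substr($0, c[i] + 1) }
       1' "$file" > "$tmp"
  then
    cat "$tmp" > "$file"   # rewrite in place, keeping the original file permissions
  fi
  rm -f "$tmp"
}

# Example: mask_columns input.txt 2 4 7
```

Writing back with cat instead of mv keeps the original inode and permissions, at the cost of a second copy of the data.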

Create duplicate line based on maximum number of delimiters in a field

I have a file which contains multiple fields and 2 types of delimiters. If the number of delimiters in one of the fields reaches a defined number then I want to split the field after the number is met onto the next line while replicating the first part of the line.
Is this possible in awk or sed?
Example
Input
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6,7,8,9,10|
a3|b|c|d|1,2|
Max Number = 6, to split on commas in field 5
Output
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6|
a2|b|c|d|7,8,9,10|
a3|b|c|d|1,2|
Assuming not more than one split would be required:
$ sed -E 's/^(([^|]+\|){4})(([^,]+,){5}[^,]+),(.*)/\1\3|\n\1\5/' ip.txt
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6|
a2|b|c|d|7,8,9,10|
a3|b|c|d|1,2|
-E use ERE; some sed versions use the -r option instead
^(([^|]+\|){4}) first 4 columns delimited by |
(([^,]+,){5}[^,]+) 6 columns delimited by , (without trailing ,)
, comma between 6th and 7th column
(.*) rest of line
\1\3|\n\1\5 split as required
The column and max number can be passed from shell variables too (example shown for bash)
$ col=5; max=6
$ sed -E "s/^(([^|]+\|){$((col-1))})(([^,]+,){$((max-1))}[^,]+),(.*)/\1\3|\n\1\5/" ip.txt
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6|
a2|b|c|d|7,8,9,10|
a3|b|c|d|1,2|
$ col=5; max=8
$ sed -E "s/^(([^|]+\|){$((col-1))})(([^,]+,){$((max-1))}[^,]+),(.*)/\1\3|\n\1\5/" ip.txt
a1|b|c|d|1,2,3,4|
a2|b|c|d|1,2,3,4,5,6,7,8|
a2|b|c|d|9,10|
a3|b|c|d|1,2|
awk to the rescue!
awk -F\| -v OFS=\| -v c=',' '
{n=split($5,a,c);
if(n>6)
{f=$5;
$5=a[1] c a[2] c a[3] c a[4] c a[5] c a[6];
print;
$5=f;
gsub(/([^,]+,){6}/,"",$5)}}1' file
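Both answers above are written around a single split. If a field can be long enough to need several continuation lines, a loop over the comma-separated items is one way to generalise. This is only a sketch: split_field and its two arguments (field number and maximum items per line) are names invented here.

```shell
# Hypothetical helper: split_field COL MAX reads "|"-delimited lines on stdin and
# emits one line per chunk of at most MAX comma-separated items in field COL.
split_field() {
  awk -F'|' -v OFS='|' -v col="$1" -v max="$2" '{
    n = split($col, a, ",")
    if (n <= max) { print; next }          # short enough: pass through unchanged
    for (i = 1; i <= n; i += max) {
      chunk = a[i]
      for (j = i + 1; j < i + max && j <= n; j++) chunk = chunk "," a[j]
      $col = chunk                          # other fields are repeated as-is
      print
    }
  }'
}

printf '%s\n' 'a1|b|c|d|1,2,3,4|' 'a2|b|c|d|1,2,3,4,5,6,7,8,9,10|' 'a3|b|c|d|1,2|' | split_field 5 6
```

With a maximum of 3 instead of 6, the a2 line would come out as four lines (1,2,3 then 4,5,6 then 7,8,9 then 10).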

Getting the last x digits from output with grep command

I need help getting the last 16 digits from the output I get with this command:
cat q5data.txt | grep -o '[0-9]*[0-9]\{16\}'
The output I get is :
6420029454020029
26787889786973463
92272417810036027222591368318424
1147142436072964
And I'd want only the last 16 digits of the numbers above, so it would look something like this:
6420029454020029
6787889786973463
7222591368318424
1147142436072964
So yeah, the question is, how would I get the last 16 digits?
q5data contains this:
0111102.82575525572371251FriThuSat32169716436971243.1415 foo100001$$$3.14153
foo`3.1415Green100010blah2.8
2.85720948213811501Purple`WedTueBLACK1869228491762178BLACK$$3.14100001Feb010000
taoblahfoopiGreen010111
VOIDchiOrangeSatNILLVOIDBLACK$$$Sat3.14155378825854705118Mar$WHITEAug`Tue
4421929582063064
2.8$$$$BLACKSun$"blah$ThublahJun2057411253659033Orange$$Sun$$fubar'
BLACKSun8061215743158569Jul'010101`2.8MayFri$$'blah
100001$3.141533.14153taoBLACKWHITE3.141532.8'foo"chi`BLACK$$$3300209361826966
5976364681345632YellowFri"JanWHITEWedWHITE3652470302503667WHITE
1237496282374608WHITEpiNILLVOID110111WHITEApr'$$$2.83536505910579946111010
54891762211716313.14$$RedWedtaoMonFri110010$$3068508931421361$PurpleNILLWHITE9242959892278294Sep
000110BlueOct2582940799974379
phifoo$
Purple3.1415Green '
3.14BLACKTuepiYellowWHITEchi35798399298233973.14153.1415WHITEpitao$SunBlue010110
NULLBLACKTue1650665049652872`2.8$'$$$NULL3.14SatGreen$$3.141533.14153GreenVOIDJul"
chichifubarWedpiBLACK3.14153BLACKpiWHITEThu$ BLACK
blah2.8fubar4411479881441554$$`BLACKWHITE1101113.14SepWHITEJanThuGreen
$$WHITE'"3675572769992033fooBlueNULL100000'
BLACK 3.14WHITEDecfubarOrangeMay NILLWHITE2570850288634750 101011$$$Mon
Tue" 3.143.14phiSat7665425103246257MayphiTue'0010110101112.8BLACK$fubar"
0358649831711525100010'FriJunThu"3.14SunGreenfubarMonWHITEVOID$$$VOID1877369637528056Jan$010010
GreenTue000111ThuBLACKApr011010
Jun6244216458497289`PurpleAug$$$2685357800265115''2.8taopi101100$$chiFeb
9471418620899225VOID8617331495319240NULLWHITEblah5461478451014026
6352741666667105
WHITEfooOct011010pi$$$110100BLACKBLACKTuePurpleWHITE9093492271343727SepNovchi
Orange3.144596443153024361`"'$$78253311502390510101103.14153Friphi $Mon
1385825179552755YellowBLACK001011Sep$$RedFebfubarMon010010000010fubar"Jul0110117544560082562350
3.141653642540032022chi'Orange
1253542283769081tao4876457038962098MonSunMayWHITEYellow3.14153$Orange000101blah
RedSatNILLphiVOIDWedfubarGreen chi$$piphiJul$$$111001`9540185369262601NILLVOID
7006440921851679Wed3.14152.8chiGreenThu$$Tuefoofooblahpi$$$taopi$ May 'Feb
MayNILLblah8007182476768737JantaophiThutao$'Jul AprNILLBLACK'3.14153Feb3.1415
57067714600406493.141537231229468300261Mon$`SunNILL `NULL3.14153foochi1000109494160741986074
6577869219715310JulJanBLACKfubarBLACK2.8phiGreen0091496849086433
SunBlue2355648762601053 3.1415NULL$$$BLACK100011 ThuDecJun2.83.1415phiFeb"
9173525733960126BLACK 3.14153`110001PurpleRedFebfubarVOIDfoo$$$blah9330024102534139
Jun$$VOIDVOID4099554992034342Julpi9976331355660412taoWHITEGreen$$100010NILLVOID
3.14153phiSatphi43658305924319679197159994746838phipiApr
3.1415RedblahMayfooJul100011NovtaoMon3.141533.14JanGreen$$ OctNILLfooWHITE3.1415
96027197435535111011013.14VOID3583462878046156NULL3.1415blahOrangefoo 100101taofoo3.14153"3.1415
$$Red3.14Marblah'
3797758515388131tao $$$101010NULL2268984774582096BlueBlue3.14153Oct`
74321533961822933.14153994759453326425$$Jul001111PurpleGreenTueNovJan2742714540787707Blue$$$
0010003.14blah3.14ThuWHITE$$$$blah
3997313793176662 3.141463510697622121Yellow 3.1415'Jul`3.14153NILL2.8Thuphi
3134920264311067fooNov`NULL1111119335359393623483Tue$$$GreenVOIDtaoRedTueAug$$3.141532.8Sat'
3.14153Oct100010FebJan$$3.1415pi$$'chiRed$$$NILL8614261680268364
fubarBLACKpi110001110101pichi0126011887834143GreenNILLYellow NILLfoo101000 $$$
RedTueNULLThu2.814091424413091162.8 WHITE$WHITE60620358244865230211111773156587'pi
Yellow3.1415$$$$$
"Aug3.1415VOIDBLACK0810996065354809$$$NULLfoo$$Orange6850772642048628WedBLACK
BLACKBluepi 70173555329860651869981769139132phi$$$$$$3.14Feb2.86083883638401362
6420029454020029WHITE26787889786973463.14 3.14 Mon`92272417810036027222591368318424$$$tao
fooTue"1147142436072964AprPurpleSep
Okay so, at the beginning of q5data we see 0111102. and right after this we see: 82575525572371251 (17 digits)
I'd like it to output the last 16 digits (2575525572371251)
Thank you :)
To anchor the end of the match, use \b (a GNU extension):
grep -o '[0-9]\{16\}\b' q5data.txt
This matches 16 digits that end at a word boundary, so it misses runs that are followed directly by letters.
If you want to capture digits in strings terminated with non-numerical chars you need negative lookahead (with -P option, not available in standard grep)
$ grep -Po '[0-9]{16}(?![0-9])' q5data.txt
e.g.
$ echo "12345678901234567890aaa" | grep -Po '[0-9]{16}(?![0-9])'
5678901234567890
If you want the last 16 digits from every run of 16 or more digits, then you could filter through grep twice:
grep -Eo '[0-9]{16,}' <q5data.txt | grep -Eo '.{16}$'
The first selects all runs of 16 or more digits, and the second selects the last 16 characters from each run.
Testing this on the first line of your input file gives:
$ grep -Eo '[0-9]{16,}' <<<'0111102.82575525572371251FriThuSat32169716436971243.1415 foo100001$$$3.14153' | grep -Eo '.{16}$'
2575525572371251
2169716436971243
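If the digits you want always sit at the very end of a line, anchoring with $ is enough (runs in the middle of a line are not reported):
grep -Eo '([0-9]{16})$' q5data.txt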
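For systems without grep -o or -P, the "last 16 digits of every run of 16 or more" logic can also be sketched as a single awk pass. The function name last16 is made up for this example; the loop uses match with RSTART and RLENGTH rather than interval expressions, which some older awks lack.

```shell
# Hypothetical helper: print the last 16 digits of every run of 16 or more digits.
last16() {
  awk '{
    s = $0
    while (match(s, /[0-9]+/)) {                 # next maximal run of digits
      if (RLENGTH >= 16) print substr(s, RSTART + RLENGTH - 16, 16)
      s = substr(s, RSTART + RLENGTH)            # continue after this run
    }
  }' "$@"
}

printf '%s\n' '0111102.82575525572371251FriThuSat32169716436971243.1415 foo100001$$$3.14153' | last16
```

Called as last16 q5data.txt it reads the file directly; with no arguments it reads standard input.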

Unix Command for counting number of words which contains letter combination (with repeats and letters in between)

How would you count the number of words in a text file that contain all of the letters a, b, and c? These letters may occur more than once in the word and the word may contain other letters as well. (For example, "cabby" should be counted.)
Using sample input which should return 2:
abc abb cabby
I tried both:
grep -E "[abc]" test.txt | wc -l
grep 'abcdef' testCount.txt | wc -l
both of which return 1 instead of 2.
Thanks in advance!
You can use awk and rely on the return value of the sub function: it returns the number of substitutions made, which is 1 on success and 0 otherwise.
$ echo "abc abb cabby" |
awk '{
for(i=1;i<=NF;i++)
if(sub(/a/,"",$i)>0 && sub(/b/,"",$i)>0 && sub(/c/,"",$i)>0) {
count+=1
}
}
END{print count+0}'
2
We require the return value to be greater than 0 for all three letters. The for loop iterates over every word of every line and increments the counter when all three letters are found in the word.
I don't think you can get around using multiple invocations of grep. Thus I would go with (GNU grep):
<file grep -oE '\w+' | grep a | grep b | grep c
Output:
abc
cabby
The first grep puts each word on a line of its own. Append | wc -l to get the count instead of the words.
Try this, it will work
sed 's/ /\n/g' test.txt |grep a |grep b|grep c
$ cat test.txt
abc abb cabby
$ sed 's/ /\n/g' test.txt |grep a |grep b|grep c
abc
cabby
hope this helps..
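The grep pipelines above print the matching words rather than the count the question asks for. As a sketch, the same three tests can be done per word in awk without modifying the fields; the function name count_abc_words is invented for this example.

```shell
# Hypothetical helper: count whitespace-separated words containing a, b and c.
count_abc_words() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /a/ && $i ~ /b/ && $i ~ /c/) n++
  }
  END { print n + 0 }' "$@"      # n + 0 prints 0 instead of an empty line
}

echo "abc abb cabby" | count_abc_words
```

It can also be given a file name, e.g. count_abc_words test.txt.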

Search file for characters excluding a set of characters

I have a text file with 1.3 million rows and 258 columns delimited by semicolons (;). How can I find which characters are in the file, excluding letters of the alphabet (both upper and lower case), semicolon (;), quote (') and double quote (")? Ideally the results should be in a non-duplicated list.
Use the following pipeline
# Remove the characters you want to exclude
tr -d 'A-Za-z;"'\' <file |
# One character on each line
sed 's/\(.\)/\1\
/g' |
# Remove duplicates
sort -u
Example
echo '2343abc34;ABC;;#$%"' |
tr -d 'A-Za-z;"'\' |
sed 's/\(.\)/\1\
/g' |
sort -u
$
%
2
3
4
#
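You can also use grep to pick out the characters and pipe the result to sort and then to uniq. Note that grep -v on its own only drops whole lines, so listing individual characters needs -o with a negated bracket expression.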
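A grep-based variant might look like the sketch below. It assumes a grep that supports -o (GNU and BSD grep do; it is not in POSIX), and the function name other_chars is made up. LC_ALL=C is set so that A-Za-z means exactly the ASCII letters and the sort order is byte order.

```shell
# Hypothetical helper: list each distinct character on stdin that is not an
# ASCII letter, semicolon, single quote or double quote.
other_chars() {
  LC_ALL=C grep -o "[^A-Za-z;\"']" | LC_ALL=C sort -u
}

echo '2343abc34;ABC;;#$%"' | other_chars
```

Spaces and tabs count as characters here too, so they show up as lines that look empty if the file contains them.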
