I'm manipulating lines from a .vcf file in which bread is numbered 1 through 20 in Roman numerals. I only want the lines corresponding to bread 10, so I've used
awk '/breadX/ {print}' file.vcf > Test.txt
to write the lines containing "breadX" to Test.txt. That works, but the output also includes "breadXI" through "breadXX". Is there a way to exclude those, given that "breadX" is out of order and towards the middle of the file (XIV...X...XX), and that each line carries more information after the name? I only want lines that start with bread 10, and not any of the other options. Any help would be appreciated.
Lacking a definitive data sample that shows what might follow breadX, just exclude every string in which a Roman numeral symbol (I, V, X, L, C, D, M) follows it:
$ awk '/^breadX([^IVXLCDM]|$)/' file
Sample test file:
$ cat file
breadX
breadXI
breadX2
3
Test it:
$ awk '/^breadX([^IVXLCDM]|$)/' file
Output:
breadX
breadX2
If breadX is a whole word, you can use word boundaries to limit your search (\< and \> are GNU awk extensions).
cat file
test breadXI more
hi breadX yes
cat home breadXX
awk '/\<breadX\>/' file
hi breadX yes
\< start of word
\> end of word
PS: you do not need the print, since printing is the default action when the test is true.
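If the name is the entire first column of each record, as is usual for the first column of a VCF line, an exact field comparison avoids regex boundaries altogether. This is a minimal sketch under that assumption; match_first_field is a hypothetical helper name, and the sample records are made up for illustration.

```shell
# Hypothetical helper: print lines whose first tab-separated field
# is exactly the name given as $1. Reads stdin or the files after $1.
match_first_field() {
    name=$1
    shift
    awk -F'\t' -v name="$name" '$1 == name' "$@"
}

# Illustrative input: only the breadX record is printed.
printf 'breadXI\t100\nbreadX\t200\nbreadXX\t300\n' | match_first_field breadX
```

Against the file from the question this would be called as match_first_field breadX file.vcf > Test.txt.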
I need to insert '\N' between every two consecutive commas in a line, like below:
"abc,,,,5,,,3.2,,"
to:
"abc,\N,\N,\N,5,\N,\N,3.2,\N,"
Also, the number of consecutive commas is not fixed; it may be 6, 7 or more, so I need a flexible way to handle it.
I didn't find a clear solution on Google.
You can just use the following sed command:
sed 's/,,/,\\N,/g;s/,,/,\\N,/g;'
Demo:
$ echo 'abc,,,,5,,,3.2,,' | sed 's/,,/,\\N,/g;s/,,/,\\N,/g'
abc,\N,\N,\N,5,\N,\N,3.2,\N,
Explanations:
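s/,,/,\\N,/g replaces ,, with ,\N, globally on the line. Matches cannot overlap, so the comma that ends one match cannot also start the next one, and a single pass leaves every other gap in a long run of commas unfilled. A second pass over the pattern space fills the remaining gaps, which gives the command s/,,/,\\N,/g;s/,,/,\\N,/g.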
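To see why the second pass is needed, compare one pass with two on a run of four commas. This is only an illustration; fill_once is a hypothetical wrapper name around the same substitution.

```shell
# One global pass of the substitution from the answer.
fill_once() {
    sed 's/,,/,\\N,/g'
}

# A single pass leaves a gap in the middle of the run.
printf '%s\n' 'a,,,,b' | fill_once               # prints a,\N,,\N,b

# Feeding the result through a second pass fills it.
printf '%s\n' 'a,,,,b' | fill_once | fill_once   # prints a,\N,\N,\N,b
```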
Additional notes:
To address your doubts about this approach not being flexible, I have prepared the following input file.
$ cat input_comma.txt
abc,,,,5,,,3.2,,
,,,,,,def,
1,,,,,,1.2
6commas,,,,,,
7commas,,,,,,,
As you can see, it does not matter how many successive commas are present in the input:
$ sed 's/,,/,\\N,/g;s/,,/,\\N,/g' input_comma.txt
abc,\N,\N,\N,5,\N,\N,3.2,\N,
,\N,\N,\N,\N,\N,def,
1,\N,\N,\N,\N,\N,1.2
6commas,\N,\N,\N,\N,\N,
7commas,\N,\N,\N,\N,\N,\N,
With awk a similar approach in 2 passes can be implemented in the same way:
$ echo "test,,,mmm,,,,aa,," | awk '{gsub(/\,\,/,",\\N,");gsub(/\,\,/,",\\N,")} 1'
test,\N,\N,mmm,\N,\N,\N,aa,\N,
Could you please try the following:
awk '{gsub(/\,\,/,",\\N,");gsub(/\,\,/,",\\N,")} 1' Input_file
With perl:
perl -pe '1 while s/,,/,\\N,/g'
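The "repeat until nothing changes" idea behind the perl one-liner can also be written in sed with a label and the t (branch if a substitution was made) command. The sketch below is an alternative to the two-pass commands above, not a correction of them; fill_empty_fields and the label name again are made up for illustration.

```shell
# Hypothetical helper: replace the first ",," on the line, then jump back
# to the label and try again until no ",," is left.
fill_empty_fields() {
    sed -e ':again' -e 's/,,/,\\N,/' -e 't again'
}

printf '%s\n' 'abc,,,,5,,,3.2,,' | fill_empty_fields
# prints abc,\N,\N,\N,5,\N,\N,3.2,\N,
```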
Hi all, I'm trying to replace all spaces starting from a certain part of my file. I tried to do it, but I can't make the replacement start at that part.
I tried sed "s/\s/_/g" < file.txt > file_1.txt, but then all of the spaces turn into underscores.
Contents of file.txt:
My Name
Favorite Food
Favorite Color
Time is gold
List of Dogs:
Shi ba Inu
Sibe rian Husky
Labra dor Retriever
Ger man Shep herd
Bull Doge
Be agle
chi hua hua
Bull Ter rier
Expected file_1.txt:
My Name
Favorite Food
Favorite Color
Time is gold
List of Dogs:
Shi_ba_Inu
Sibe_rian_Husky
Labra_dor_Retriever
Ger_man_Shep_herd
Bull_Doge
Be_agle
chi_hua_hua
Bull_Ter_rier
If you want the substitution to happen only after "List of Dogs", try
sed -e '1,/List of Dogs:/b' -e 's/\s/_/g'
The command b means "branch" (to the end of the script, i.e. bypass the substitution) and the address range specifies this action for the first line through the first line matching the regex.
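A small self-contained run of the branch approach, as a sketch: the sample lines are a shortened version of the file from the question, underscore_after_header is a hypothetical wrapper name, and \s is swapped for the POSIX class [[:space:]] so that the command does not depend on GNU sed.

```shell
# Lines 1 through the first "List of Dogs:" line branch past the
# substitution; every later line gets its whitespace replaced.
underscore_after_header() {
    sed -e '1,/List of Dogs:/b' -e 's/[[:space:]]/_/g'
}

printf '%s\n' 'My Name' 'List of Dogs:' 'Shi ba Inu' 'Bull Doge' |
    underscore_after_header
# prints:
# My Name
# List of Dogs:
# Shi_ba_Inu
# Bull_Doge
```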
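If you want the substitution to happen only from the line containing : onwards, use something like this:
sed -r '/:/,$ s/\s/_/g;' file.txt > file_1.txt
The substitution is restricted to the range from the first line containing : through the last line of the file ($). The range includes the matching line itself, so with the sample input List of Dogs: also becomes List_of_Dogs:.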
Given your initial input file.txt:
My Name
Favorite Food
Favorite Color
Time is gold
List of Dogs:
Shi ba Inu
Sibe rian Husky
Labra dor Retriever
Ger man Shep herd
Bull Doge
Be agle
chi hua hua
Bull Ter rier
You can try this:
$ sed '/List of Dogs/,$s/\s/_/g;s/List_of_Dogs/List of Dogs/g' file.txt
Which results:
My Name
Favorite Food
Favorite Color
Time is gold
List of Dogs:
Shi_ba_Inu
Sibe_rian_Husky
Labra_dor_Retriever
Ger_man_Shep_herd
Bull_Doge
Be_agle
chi_hua_hua
Bull_Ter_rier
Explanation:
sed commands can be separated by ;
The first command starts with an address of the form range_start,range_end. /List of Dogs/ finds the line where the range starts, and $ denotes the last line of the file as the range end.
Only within this address range is the search-and-replace command s/\s/_/g applied.
Unfortunately the range includes the header line, which turns into List_of_Dogs:, so the second command s/List_of_Dogs/List of Dogs/g is just a workaround to convert it back.
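One way to avoid the convert-it-back step is to skip the header line inside the range instead of repairing it afterwards. This is a sketch rather than the answer's own method: underscore_dogs is a hypothetical name, the braces-on-one-line form is assumed to be accepted by your sed (GNU sed accepts it), and [[:space:]] stands in for \s.

```shell
# Within the range, apply the substitution only to lines that do NOT
# match the header pattern (the ! negates the inner address).
underscore_dogs() {
    sed '/List of Dogs/,$ { /List of Dogs/! s/[[:space:]]/_/g; }'
}

printf '%s\n' 'Time is gold' 'List of Dogs:' 'Ger man Shep herd' |
    underscore_dogs
# prints:
# Time is gold
# List of Dogs:
# Ger_man_Shep_herd
```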
You have the answer and you don't know it =)
You say you want to replace the spaces, but you have not said what you want to replace them with. I suspect you want to replace them with nothing, i.e. remove them, right?
sed "s/ //g" "$original_file" > "$new_file"
or, referencing the space with \s, the following should also work
sed "s/\s//g" "$original_file" > "$new_file"
The syntax is basically
sed "s/find_this/replace_with/g" "$original_file" > "$new_file"
I hope that helps...
Keep it simple, obvious, robust, portable, etc. and just use awk:
$ awk 'found{gsub(/[[:space:]]/,"_")} /:/{found=1} {print}' file
My Name
Favorite Food
Favorite Color
Time is gold
List of Dogs:
Shi_ba_Inu
Sibe_rian_Husky
Labra_dor_Retriever
Ger_man_Shep_herd
Bull_Doge
Be_agle
chi_hua_hua
Bull_Ter_rier
I have strings in several variations:
WATSON_AJAY_AB04_DOTHING.data
WATSON_NAVNEET_CK4_DOTHING.data
WATSON_PRASHANTH_KJ56_DOTHING.data
WATSON_ABHINAV_KD323_DOTHING.data
From the strings above, how can I extract
AB04,CK4,KJ56,KD323
in Unix?
echo "$string" | cut -d'_' -f3
You could use sed or grep for this task. But since the string is so simple, I don't think you will need to.
One method is to use the cut command (a standard utility, not a bash builtin). Below is an example directly on the bash command line:
jimm#pi$ string='WATSON_AJAY_AB04_DOTHING.data'
jimm#pi$ cut -d '_' -f 3 <<< "$string"
AB04 <-- outputs the result directly
(edit: of course Lucas' answer above is also a quick 'one-liner' that does the same thing as above - he beat me to it) :)
The cut will take an _ character as the delimiter (the -d '_' part), then display the 3rd slice of the string (the -f 3 part).
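If you would rather not start an external program for a single string, the same third slice can be taken with POSIX parameter expansion. This is a sketch that assumes the string always has at least three underscore-separated parts; third_field is a hypothetical helper name.

```shell
# Strip the first two underscore-delimited parts from the front,
# then everything from the next underscore onwards.
third_field() {
    rest=${1#*_}          # AJAY_AB04_DOTHING.data
    rest=${rest#*_}       # AB04_DOTHING.data
    printf '%s\n' "${rest%%_*}"
}

third_field 'WATSON_AJAY_AB04_DOTHING.data'      # prints AB04
third_field 'WATSON_ABHINAV_KD323_DOTHING.data'  # prints KD323
```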
Or, if you want to output that 3rd slice from a list of content (using your list above), you can write a simple shell script.
First, save the lines above ('WATSON...etc') into something like text.txt. Then open up your favorite text editor and type:
#!/bin/sh
cut -d '_' -f 3 < "$1"
Save that script to some useful name like slice.sh, and make sure it is executable with something like chmod 775 slice.sh.
Then at the command line you can execute the script against your text file, and immediately get an output of those parts of the file you want (in this case the third set of text, separated by the _ character):
$ ./slice.sh text.txt
AB04
CK4
KJ56
KD323
Hope that helps! Bear in mind that the commands above may vary a bit, depending on the flavor of *nix you are using, but it should at least point you in the right direction.
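The question shows the wanted result as one comma-separated line (AB04,CK4,KJ56,KD323). Assuming that is the format actually needed, the cut output can be joined with paste; join_codes is a hypothetical helper name.

```shell
# Take the third "_"-separated field of every line, then join all
# resulting lines into a single line separated by commas.
join_codes() {
    cut -d '_' -f 3 "$@" | paste -sd, -
}

printf '%s\n' \
    'WATSON_AJAY_AB04_DOTHING.data' \
    'WATSON_NAVNEET_CK4_DOTHING.data' \
    'WATSON_PRASHANTH_KJ56_DOTHING.data' \
    'WATSON_ABHINAV_KD323_DOTHING.data' | join_codes
# prints AB04,CK4,KJ56,KD323
```

With the file from the answer above, join_codes text.txt should give the same line.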
I have a huge file and need to look at a few characters in the middle of a very long line.
Is there an easy way to show the characters from position n1 to position n2 of line number l in a file?
I think there should be some way to do it with sed, but I cannot find the corresponding option.
You'd better use awk:
awk 'NR==line_number {print substr($0,start_position,num_of_characters_to_show)}' file
For example, print 5 characters starting from the 2nd character in the line 2:
$ cat a
1234567890
abcdefghij
$ awk 'NR==2 {print substr($0,2,5)}' a
bcdef
If you really need to use sed, you can use something like:
$ sed -rn '2{s/^.{1}(.{5}).*$/\1/;p}' a
bcdef
This matches 2-1=1 characters after the beginning of the line and then captures the next 5 to print them back. All of this is done only on line 2, and -n suppresses the default printing of every line.
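The question asks for positions n1 to n2 rather than a start and a length, so here is a sketch of the awk approach wrapped in a function that does that conversion. show_range and its argument order are assumptions made for illustration; positions are 1-based and inclusive.

```shell
# Usage: show_range LINE N1 N2 [FILE...]
# Prints characters N1..N2 of line LINE, then stops reading the input.
show_range() {
    l=$1; n1=$2; n2=$3
    shift 3
    awk -v l="$l" -v n1="$n1" -v n2="$n2" \
        'NR == l { print substr($0, n1, n2 - n1 + 1); exit }' "$@"
}

printf '1234567890\nabcdefghij\n' | show_range 2 2 6   # prints bcdef
```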
The elegance of UNIX has always lain in its ability to string together relatively simple programs into pipelines to achieve complexity. You can do a sed-only solution but it's not likely to be as readable as a pipeline.
To that end, you can use a combination of sed to get a specific line and cut to get character positions on that line:
pax> echo '12345
...> abcde
...> fghij' | sed -n 2p | cut -c2-4
bcd
If you just want to use a single tool, awk can do it:
pax> echo '12345
...> abcde
...> fghij' | awk 'NR==2{print substr($0,2,3);exit}'
bcd
So can Perl:
pax> echo '12345
...> abcde
...> fghij' | perl -ne 'if($.==2){print substr($_,1,3); exit}'
In both those cases, it exits after the relevant line to avoid processing the rest of the file.
One solution using only sed inserts newline characters just before and just after the substring and uses them as markers to remove everything that is not between them:
sed -n '2 { s/.\{5\}/&\n/; s/.\{2\}/&\n/; s/[^\n]*\n//; s/\n.*//; p; q }' infile
Assuming infile like:
1234567890
abcdefghij
It yields:
cde
Note that the range is from 2 to 5, but counting starts from zero and the end is excluded (so characters 2, 3 and 4). You can work with that or do some arithmetic just before the command.
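The arithmetic mentioned above can be sketched as follows for 1-based, inclusive positions n1..n2: the first newline goes after n2 characters and the second after n1-1 characters. sed_range is a hypothetical helper name. The sketch assumes GNU sed (for \n in the replacement and inside the bracket expression) and a line that is at least n2 characters long.

```shell
# Usage: sed_range LINE N1 N2 [FILE...]
# The script is double-quoted so the shell can substitute the numbers.
sed_range() {
    l=$1; n1=$2; n2=$3
    shift 3
    sed -n "${l} { s/.\{${n2}\}/&\n/; s/.\{$((n1 - 1))\}/&\n/; s/[^\n]*\n//; s/\n.*//; p; q }" "$@"
}

printf '1234567890\nabcdefghij\n' | sed_range 2 3 5   # prints cde
```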