How to extract text from a file which appears one or more times in each line? - unix

I have a text file which has 1 or more email ids in each line. E.g.
id:123, name:test, id: 5678, name john, address:new york
id:567, name:bob
id:3643, name:meg, id: 6721, name kate, address:la
Now, the problem is id:value may appear one or more times in a single line. How do I extract all id:value pairs so that the output is,
id:123, id:5678
id:3643, id:6721
I tried egrep -o but that is putting each id:value pair in a separate line.
sed/awk should do the trick but I am a noob
Do not want to use Perl as that will require a Perl installation.
On further analysis of the data files, I am seeing inconsistent separators, i.e. not all lines are , separated. Some are even separated with : and |. Also, , is appearing within the address value field. i.e. address:52nd st, new york. Can this be done in awk using a regex expression?

If your content is in the file test.txt then the following command:
cat test.txt | sed 's/ *: */:/g' | grep -o 'id:[0-9]*'
will return:
id:123
id:5678
id:567
id:3643
id:6721
The sed command removes any spaces adjacent to the colon, yielding an output of:
id:123, name:test, id:5678, name john, address:new york
id:567, name:bob
id:3643, name:meg, id:6721, name kate, address:la
and the grep -o command finds all matches of id: followed by zero or more digits, with -o returning only the matching part of the input string.
Per the man page:
-o, --only-matching Print only the matched (non-empty) parts of a matching
line, with each such part on a separate output line.
(FYI, the grep and sed commands are using regular expressions.)
Sorry, I didn't read carefully. I see that you object to the -o output format of one value per line. Back to the drawing board...
Note: If the reason you are opposed to the -o output is to preserve line numbers, using grep -no will give the following output (where the first number is the line number):
1:id:123
1:id:5678
2:id:567
3:id:3643
3:id:6721
Maybe that helps?
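If the goal is one output line per input line, the line-number prefix from grep -no can be used to regroup the matches. The following is a minimal sketch, assuming numeric ids and a grep that supports -o; join_ids is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: print the ids of each input line on one line, joined by ", ".
join_ids() {
    sed 's/ *: */:/g' "$1" |
    grep -no 'id:[0-9]*' |
    awk -F: '
        {
            pair = $2 ":" $3                     # $1 is the line number added by -n
            if ($1 in out) { out[$1] = out[$1] ", " pair }
            else { out[$1] = pair; order[++n] = $1 }
        }
        END { for (i = 1; i <= n; i++) print out[order[i]] }'
}
```

Called as join_ids test.txt, it prints one line per input line that contains at least one id; lines without an id produce no output.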

This might work for you (GNU sed):
sed -r 's/\<id:\s*/\n/g;s/,[^\n]*//g;s/\n/, id:/g;s/^, //' file
Convert each word id: and any following spaces to a unique token (in this case \n). Delete anything following a , up to a \n. Replace each \n with the string , id: and then delete the leading , and space.

This should work:
awk -F, '{id=0;for(i=1;i<=NF;i++) if($i~/id:/) id=id?id FS $i:$i; print id}' file
$ cat file
id:123, name:test, id: 5678, name john, address:new york
id:567, name:bob
id:3643, name:meg, id: 6721, name kate, address:la
$ awk -F, '{id=0;for(i=1;i<=NF;i++) if($i~/id:/) id=id?id FS $i:$i; print id}' file
id:123, id: 5678
id:3643, id: 6721
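Splitting on commas depends on the separator being consistent. For lines that mix , : and | as separators, a sketch that ignores separators and scans each line with awk's match() may be more robust. It assumes numeric ids; extract_ids is a hypothetical name, and a word that merely ends in "id" (for example paid:5) would also match.

```shell
# Hypothetical helper: collect every id:<digits> pair on a line, whatever the separators.
extract_ids() {
    awk '{
        line = $0; out = ""
        # Repeatedly find "id", optional spaces, ":", optional spaces, digits
        while (match(line, /id *: *[0-9]+/)) {
            pair = substr(line, RSTART, RLENGTH)
            gsub(/ /, "", pair)                   # normalise "id: 5678" to "id:5678"
            out = (out == "" ? pair : out ", " pair)
            line = substr(line, RSTART + RLENGTH) # continue after this match
        }
        if (out != "") print out
    }' "$@"
}
```

Usage: extract_ids file, or pipe text into extract_ids. Lines without any id are skipped.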

perl -lne 'push @a,/id:[^,]*/g;print "@a";undef @a' your_file
Tested Below:
> cat temp
id:123, name:test, id: 5678, name john, address:new york
id:567, name:bob
id:3643, name:meg, id: 6721, name kate, address:la
> perl -lne 'push @a,/id:[^,]*/g;print "@a";undef @a' temp
id:123 id: 5678
id:3643 id: 6721

This is just a variation of an answer already given. I personally prefer the script version in a file to the command line (better control, readability).
id.txt:
id:1, name:test, id:2, name john, address:new york
id:3, name:bob
id:4, name:meg, id:5, name kate, address:la
id.awk:
{ id = ""
  for (i = 1; i <= NF; i++)
    if ($i ~ /id:/)
      id = id ? id " " $i : $i
  print id }
call: awk -f id.awk id.txt
id:1, id:2,
id:4, id:5,


Replacing the last word of a line only if match string found

I want to replace the last word of the line only if a matching string found.
Input file :
"id": 5918915,
"description": "Test Job - NA",
"revision": 5
Expected output :
"id": 5918915,
"description": "Test Job - EU",
"revision": 5
So, for lines matching description, replace the last word with given word. In this case, in line 2 replace last word NA", with EU",
I tried
sed -i '/"description"/s/.*/EU",//g' file_name
but it is not working
sed -i -r '/^[ \t]*"description":.*/s/^(.* )NA",[\t ]*$/\1EU",/' FILE
"id": 5918915,
"description": "Test Job - EU",
"revision": 5
For testing, remove the -i switch.
The amount of whitespace isn't quite clear, so I placed [ \t]* at the start and end of the line to allow for any number of blanks and tabs.
Your command:
sed -i '/"description"/s/.*/EU",//g' file_name
fails because of the extra / before the g flag. Even with that removed, it would replace the whole line with EU", rather than just the last word.
The -i option is a GNU sed feature. Check your version and read the fine manual. If your sed lacks support, you have to redirect the output to a temporary file and move it back: sed "COMMANDS" INFILE > TMPFILE; mv TMPFILE INFILE. Note that sed "COMMANDS" INFILE > INFILE will not work; it destroys INFILE immediately. It is a popular, clever, but dysfunctional idea. I had it too. :)
I got the working command
sed -i '/"description"/ s/[^ ]* *$/EU",/' file_name
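For a sed without -i, the temporary-file approach described above can be wrapped in a small function. This is a sketch only: replace_last_word is a hypothetical name, and the new word is assumed to contain no characters that are special to sed (such as / or &).

```shell
# Hypothetical helper: on lines containing "description", replace the last
# word with <new word>", and write the result back without using sed -i.
replace_last_word() {
    file=$1
    word=$2
    tmp=$(mktemp) || return 1
    # Write to a temporary file first; only replace the original if sed succeeded.
    sed '/"description"/ s/[^ ]* *$/'"$word"'",/' "$file" > "$tmp" && mv "$tmp" "$file"
}
```

Usage: replace_last_word file_name EU. Because mv replaces the file, its permissions become those of the temporary file.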

Display matched string to end of line

How to find a particular string in a file and display matched string and rest of the line?
For example- I have a line in a.txt:
This code gives ORA-12345 in my code.
So, I am finding string 'ORA-'
Output Should be:
ORA-12345 in my code
Tried using grep:
grep 'ORA-*' a.txt
but it gives whole line in the output.
# Create test data:
echo "junk ORA-12345 more stuff" > a.tst
echo "junk ORB-12345 another stuff" >> a.tst
# Actual command:
# the -o (--only-matching) flag will print only the matched result, and not the full line
cat a.tst | grep -o 'ORA-.*$' # ORA-12345 more stuff
As fedorqui pointed out you can use:
grep -o 'ORA-.*$' a.tst
An additional answer in awk:
awk '$0 ~ "ORA" {print substr($0, match($0, "ORA"))}' a.tst
From the inside out, here's what's going on:
match($0, "ORA") finds where in the line ORA appears. In this case, it happens to be position 17.
substr($0, match($0, "ORA")) then returns from position 17 to the end of the line.
$0 ~ "ORA" makes sure that the above is applied only to those lines that contain ORA.
with sed
echo "This code gives ORA-12345 in my code." | sed 's/.*ORA-/ORA-/'
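When the line is already in a shell variable, the same result can be had without any external command. The sketch below uses POSIX parameter expansion and treats the search text as a literal string rather than a regex; from_match is a hypothetical helper name.

```shell
# Hypothetical helper: print the first occurrence of a literal string
# together with the rest of the line; print nothing if it does not occur.
from_match() {
    needle=$1
    line=$2
    case $line in
        *"$needle"*)
            rest=${line#*"$needle"}     # drop everything up to and including the first match
            printf '%s\n' "$needle$rest"
            ;;
    esac
}
```

Usage: from_match 'ORA-' "$line". Like grep -o 'ORA-.*$', this starts at the first ORA- on the line, whereas the greedy .* in the sed command keeps only the text from the last ORA-.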

Counting the number of 7 character words in a file that start with tree and do not end in u or v

I'm trying to count the number of 7 character words in a file that start with tree and do not end in u or v. I know how to specify the begin with tree and end in u or v condition in cat, but I'm not sure how to match words of exactly 7 characters or enter the conditions using wc. My pathname is /users/file1.txt.
This is the valid cat command (missing the 7 character condition)
cat /users/file1.txt | grep ^tree.*[!uv]
Below is the invalid wc command (missing the 7 character condition)
wc - w /users/file1.txt | grep ^tree.*[!uv]
Do you like perl? Here is a one-liner (it prints the lines that consist of exactly one such word):
cat /users/file1.txt | perl -lne 'if (/^(tree)(.{3}$)(?<![uv])/) { print $_ }'
sed -e 's/%//g' -e 's/\btree..[^uv]\b/%/g' -e 's/[^%]//g' -e 's/%/word /g' /users/file1.txt | wc -w
Don't let anyone steal our token.
Give us a token for what we want to count; match word boundaries to count to 7, negate match character in (u,v).
Get rid of everything else.
Turn our token into a friendly word plus a space.
Count 'em.
Reut's answer is very close.
But this will get you where you need:
cat /users/file1.txt | grep -wo 'tree..[^uv]' | wc -l
-w will get exact word matches
see that I ditched the .* and specified .. instead, as the total number of characters matched is 7
I also got rid of the ^tree so you can also match words that aren't at the beginning of the line.
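A variant that defines a word as a whitespace-delimited token, rather than relying on the word boundaries used by -w, is to put each token on its own line and anchor the pattern at both ends. This is a sketch under that assumption; count_tree_words is a hypothetical name, and punctuation attached to a word counts as part of it.

```shell
# Hypothetical helper: count whitespace-delimited words of exactly 7 characters
# that start with "tree" and do not end in u or v.
count_tree_words() {
    # One token per line, then count lines matching tree + 2 characters + 1 non-u/v character.
    tr -s '[:space:]' '\n' < "$1" | grep -c '^tree..[^uv]$'
}
```

Usage: count_tree_words /users/file1.txt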
Using grep and wc:
# echo the file # filter lines # grep EXACT words # count
cat /users/file1.txt | grep ^tree.*[^u^v] | grep -o '[^\ ]\{7\}' | wc -w
Pipe walkthrough:
Echo content of the source file:
cat /users/file1.txt
Pass only lines starting with "tree" and not ending with either "u" or "v":
grep ^tree.*[^u^v]
Forward any word that is composed of 7 non-spaces (if you want only letters use [a-zA-Z] instead of [^\ ]):
grep -o '[^\ ]\{7\}'
Count the words that made it here:
wc -w
Here is one other way using pretty basic bash:
for word in $(cat
if [ 7 -eq ${#word} ]
echo $count
Or in a single line:
count=0; for word in $(cat; do if [ 7 -eq ${#word} ]; then count=$((count+1)); fi; done; echo $count
You may want to remove dots and commas from word.
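A word-by-word loop can also carry the tree and u/v conditions if the length test is replaced by a case pattern. Note that [!uv] is shell glob negation; inside a grep bracket expression the ! is a literal character and negation is written [^uv]. The sketch below is one way to write such a loop; count_words_loop is a hypothetical name and the file path is passed as an argument.

```shell
# Hypothetical helper: count 7-character words starting with "tree"
# and not ending in u or v, using only shell constructs plus cat.
count_words_loop() {
    count=0
    set -f                      # keep words such as * from being glob-expanded
    for word in $(cat "$1"); do # unquoted $(...) splits the file on whitespace
        case $word in
            tree??[!uv]) count=$((count + 1)) ;;  # tree + 2 characters + 1 character that is not u or v
        esac
    done
    set +f
    echo "$count"
}
```

Usage: count_words_loop /users/file1.txt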

Delete duplicate headers in awk

I used cat to combine several files and they all have the same headers. Is there any way I can retain the 1st occurrence of the header and delete the succeeding headers inside the concatenated file?
FirstName, LastName, Phone, Zip
FirstName, LastName, Phone, Zip
I'd do it this way:
sed '1h;2,$G;s/^\(.*\)\n\1$//;/./P;d' filename
You can do this:
cp file1 result
tail -q -n +2 file2 file3 file4 >> result
That is, start with the entire contents of file1, then append from the other files starting with line 2 of each. This way you avoid the need to try to find the extra headers and delete them later.
If you prefer, here's another formulation of the same:
head -1 file1 > result
tail -q -n +2 file1 file2 file3 file4 >> result
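The same idea, skipping line 1 of every file except the first, can also be expressed as a single awk command. A minimal sketch, assuming every input file starts with the header line; merge_keep_header is a hypothetical name.

```shell
# Hypothetical helper: concatenate files, keeping the header of the first file only.
merge_keep_header() {
    # FNR restarts at 1 for each file, NR does not; so "FNR == 1 && NR != 1"
    # is true for the first line of every file except the first one.
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```

Usage: merge_keep_header file1 file2 file3 file4 > result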
Try this:
sed -e '2,$s/FirstName, LastName, Phone, Zip//g' -e '/^$/d' Yourfile.txt
You can replace "FirstName, LastName, Phone, Zip" with whatever header you have. From the 2nd line to the end of the file, the first expression replaces the header pattern with an empty string; /^$/d then deletes the resulting blank lines.
Here is an awk version. It will skip all lines starting with FirstName except line 1
awk 'NR>1 && /^FirstName/ {next}1' file
FirstName, LastName, Phone, Zip
If the header line varies, we need a pattern to match.
awk way
awk '!a[$0];NR==1{a[$0]++}' file
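For an already concatenated file, the header can be remembered from line 1 and compared against every later line. This is a longer spelling of that approach, given as a sketch; drop_repeated_header is a hypothetical name, and the header is assumed to be the first line of the input.

```shell
# Hypothetical helper: keep the first line and drop later lines identical to it.
drop_repeated_header() {
    awk '
        NR == 1 { header = $0; print; next }   # remember and print the first line
        $0 != header                           # print any later line that differs from it
    ' "$@"
}
```

Usage: drop_repeated_header file, or pipe the concatenated output into it.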

How to grep a particular position line from the result?

I grep for a pattern in a directory, together with the 4 lines before each match. I need to further extract the top line from each result, but I can't work out how.
Please suggest regarding this.
The problem explained with example :
in a directory 'direktory'
there are multiple files with different name like 20130611 and 2013400 etc..
the data wrote in the files, which I am interested in is like this :
My name is
Name has been written above
Every instance contains the line "Name has been written above", but the text in place of "My name is" keeps changing, so I want to grep that particular line from every occurrence.
Please suggest some method to get the result.
Thanks in advance.
a@x:/tmp$ cat namefile
My name is
Name has been written above
a@x:/tmp$ cat namefile | grep -B 4 "Name has been written above" | head -1
My name is
Where "4" can be replaced by N, i.e. the number of lines the target data lies above the grepped line.
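Note that head -1 keeps only the first result. To get the line N above every occurrence, one option is to keep the last few lines in a small buffer in awk. The sketch below assumes the marker is a fixed string without backslashes; line_above is a hypothetical name.

```shell
# Hypothetical helper: for every line containing <marker>, print the line
# that lies <n> lines above it. Usage: line_above MARKER N [FILE...]
line_above() {
    marker=$1
    n=$2
    shift 2
    awk -v pat="$marker" -v n="$n" '
        # buf holds the most recent lines of the current file, indexed modulo n + 1
        index($0, pat) && FNR > n { print buf[(FNR - n) % (n + 1)] }
        { buf[FNR % (n + 1)] = $0 }
    ' "$@"
}
```

Usage: line_above "Name has been written above" 4 direktory/*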
Try something like
for file in $(ls <wherever>)
do
# Tell the user which file we're looking at
echo ""
echo "$file"
echo ""
# Output the first line of the file
head -1 "$file"
# Output the line containing <pattern> and the four
# preceding lines
<your grep command here>
done
