How to extract text from a file which appears one or more times in each line? - unix

I have a text file which has 1 or more email ids in each line. E.g.
id:123, name:test, id: 5678, name john, address:new york
id:567, name:bob
id:3643, name:meg, id: 6721, name kate, address:la
Now, the problem is id:value may appear one or more times in a single line. How do I extract all id:value pairs so that the output is,
id:123, id:5678
id:567
id:3643, id:6721
I tried egrep -o but that is putting each id:value pair in a separate line.
sed/awk should do the trick but I am a noob
Do not want to use Perl as that will require a Perl installation.
EDIT:
On further analysis of the data files, I am seeing inconsistent separators, i.e. not all lines are , separated. Some are even separated with : and |. Also, , is appearing within the address value field, e.g. address:52nd st, new york. Can this be done in awk using a regular expression?

If your content is in the file test.txt then the following command:
cat test.txt | sed 's/ *: */:/g' | grep -o 'id:[0-9]*'
will return:
id:123
id:5678
id:567
id:3643
id:6721
The sed command is to remove any spaces adjacent to the colon, yielding an output of:
id:123, name:test, id:5678, name john, address:new york
id:567, name:bob
id:3643, name:meg, id:6721, name kate, address:la
and the grep -o command finds all matches of id: followed by zero or more digits, with the -o to return only the matching part of the input string.
Per the man page:
-o, --only-matching Print only the matched (non-empty) parts of a matching
line, with each such part on a separate output line.
(FYI, the grep and sed commands are using regular expressions.)
EDIT:
Sorry, I didn't read carefully. I see that you object to the -o output format of one value per line. Back to the drawing board...
Note: If the reason you are opposed to the -o output is to preserve line numbers, using grep -no will give the following output (where the first number is the line number):
1:id:123
1:id:5678
2:id:567
3:id:3643
3:id:6721
Maybe that helps?
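The line-numbered output above can be folded back into one output line per input line. The following is only a sketch: the function name ids_per_line is made up for illustration, and it assumes a grep that supports -o together with -n (GNU and BSD grep do). Input lines without any id produce no output line.

```shell
# Hypothetical helper: join the matches that grep -no reports for the same
# input line number, separated by ", ".
ids_per_line() {
  sed 's/ *: */:/g' "$@" |
    grep -no 'id:[0-9]*' |
    awk -F: '
      # $1 is the line number added by -n, $2 ":" $3 is the matched id:value
      $1 != prev { if (NR > 1) print out; out = $2 ":" $3; prev = $1; next }
      { out = out ", " $2 ":" $3 }
      END { if (NR > 0) print out }'
}

# Demo on two of the sample lines (reads standard input when no file is given)
printf '%s\n' \
  'id:123, name:test, id: 5678, name john, address:new york' \
  'id:567, name:bob' | ids_per_line
```

With a file, the call would be ids_per_line test.txt.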

This might work for you (GNU sed):
sed -r 's/\<id:\s*/\n/g;s/,[^\n]*//g;s/\n/, id:/g;s/^, //' file
Convert the word id: and any following spaces to a unique token (in this case \n). Delete anything following a , up to a \n. Replace each \n with the string , id: and then delete the leading , at the start of the line.

This should work:
awk -F, '{id=0;for(i=1;i<=NF;i++) if($i~/id:/) id=id?id FS $i:$i; print id}' file
Test:
$ cat file
id:123, name:test, id: 5678, name john, address:new york
id:567, name:bob
id:3643, name:meg, id: 6721, name kate, address:la
$ awk -F, '{id=0;for(i=1;i<=NF;i++) if($i~/id:/) id=id?id FS $i:$i; print id}' file
id:123, id: 5678
id:567
id:3643, id: 6721
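For the inconsistent separators mentioned in the question's EDIT, splitting on , stops being reliable. One possible alternative, shown here only as a sketch, is to ignore the separators and let awk's match() walk through each line looking for id:value directly. The function name extract_ids is hypothetical, and the pattern assumes the values are purely numeric; it would also match a longer key that merely ends in "id".

```shell
# Hypothetical helper: collect every id:<digits> on a line, whatever the
# field separator is, and print them joined by ", ".
extract_ids() {
  awk '{
    rest = $0
    out = ""
    while (match(rest, /id *: *[0-9]+/)) {
      pair = substr(rest, RSTART, RLENGTH)
      gsub(/ /, "", pair)                    # "id: 5678" becomes "id:5678"
      out = (out == "" ? pair : out ", " pair)
      rest = substr(rest, RSTART + RLENGTH)  # continue after this match
    }
    if (out != "") print out
  }' "$@"
}

# Demo with mixed separators and a comma inside the address value
printf '%s\n' 'id:3643| name:meg: id: 6721, address:52nd st, new york' | extract_ids
```

Because only the id:value text is matched, commas inside other fields such as the address do not affect the result.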

perl -lne 'push @a,/id:[^,]*/g;print "@a";undef @a' your_file
Tested below:
> cat temp
id:123, name:test, id: 5678, name john, address:new york
id:567, name:bob
id:3643, name:meg, id: 6721, name kate, address:la
> perl -lne 'push @a,/id:[^,]*/g;print "@a";undef @a' temp
id:123 id: 5678
id:567
id:3643 id: 6721
>

This is just a variation of an answer already given. I personally prefer the script version in a file over the command line (better control, readability).
id.txt
id:1, name:test, id:2, name john, address:new york
id:3, name:bob
id:4, name:meg, id:5, name kate, address:la
id.awk
{
    id = ""
    for (i = 1; i <= NF; i++) {
        # fields are split on whitespace, so each match keeps its trailing comma
        if ($i ~ /id:/)
            id = id ? id " " $i : $i
    }
    print id
}
call: awk -f id.awk id.txt
output:
id:1, id:2,
id:3,
id:4, id:5,

Related

Replacing the last word of a line only if match string found

I want to replace the last word of the line only if a matching string found.
Input file :
"id": 5918915,
"description": "Test Job - NA",
"revision": 5
Expected output :
"id": 5918915,
"description": "Test Job - EU",
"revision": 5
So, for lines matching description, replace the last word with given word. In this case, in line 2 replace last word NA", with EU",
I tried
sed -i '/"description"/s/.*/EU",//g' file_name
but it is not working
sed -i -r '/^[ \t]*"description":.*/s/^(.* )NA",[\t ]*$/\1EU",/' FILE
"id": 5918915,
"description": "Test Job - EU",
"revision": 5
For testing, remove the -i switch.
The amount of whitespace isn't quite clear, so I placed [ \t]* at line start and end to allow for any number of blanks and tabs.
Your command:
sed -i '/"description"/s/.*/EU",//g' file_name
is rejected as written: the extra / before g is parsed as a flag, and GNU sed reports an unknown option to s. Without that stray slash, s/.*/EU",/ would substitute the whole line with EU",, not just the last character sequence.
The -i is an option of GNU-sed. Check your version and read the fine manual. If your sed lacks support, you have to redirect the output to a file: sed "COMMANDS" INFILE > TMPFILE; mv TMPFILE INFILE. Note that sed "COMMANDS" INFILE > INFILE will not work, but destroy the INFILE immediately; a popular, clever, but dysfunctional idea. I had it too. :)
I got the working command
sed -i '/"description"/ s/[^ ]* *$/EU",/' file_name
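For a sed without -i, the temp-file approach described above could be wrapped roughly like this. It is a sketch only: the function name replace_last_word is invented here, mktemp is assumed to be available, and the new word is assumed to contain no /, & or backslash, since it is pasted straight into the sed expression.

```shell
# Hypothetical helper: replace the last word of every "description" line
# in a file without relying on sed -i.
# Usage: replace_last_word file_name EU
replace_last_word() {
  file=$1
  word=$2
  tmp=$(mktemp) || return 1
  # Write to a temporary file first; redirecting straight back into "$file"
  # would truncate it before sed has read anything.
  if sed '/"description"/ s/[^ ]* *$/'"$word"'",/' "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Note that mv replaces the original file with the temporary one, so the result carries the temporary file's permissions rather than the original's.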

Display matched string to end of line

How to find a particular string in a file and display matched string and rest of the line?
For example- I have a line in a.txt:
This code gives ORA-12345 in my code.
So, I am finding string 'ORA-'
Output Should be:
ORA-12345 in my code
Tried using grep:
grep 'ORA-*' a.txt
but it gives whole line in the output.
# Create test data:
echo "junk ORA-12345 more stuff" > a.tst
echo "junk ORB-12345 another stuff" >> a.tst
# Actual command:
# the -o (--only-matching) flag will print only the matched result, and not the full line
cat a.tst | grep -o 'ORA-.*$' # ORA-12345 more stuff
As fedorqui pointed out you can use:
grep -o 'ORA-.*$' a.tst
An additional answer in awk:
awk '$0 ~ "ORA" {print substr($0, match($0, "ORA"))}' a.tst
From the inside out, here's what's going on:
match($0, "ORA") finds where in the line ORA appears. In this case, it happens to be position 17.
substr($0, match($0, "ORA")) then returns from position 17 to the end of the line.
$0 ~ "ORA" makes sure that the above is applied only to those lines that contain ORA.
with sed
echo "This code gives ORA-12345 in my code." | sed 's/.*ORA-/ORA-/'
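The sed command above also prints lines that do not contain ORA- at all, unchanged. A small variation, sketched here with a made-up function name from_ora, suppresses them by combining -n with the p flag. Because .* is greedy, a line with several ORA- occurrences is cut at the last one.

```shell
# Hypothetical helper: print only lines containing ORA-, starting at the
# (last) ORA- on each line.
from_ora() {
  sed -n 's/.*ORA-/ORA-/p' "$@"
}

# Demo with the same test data as the grep answer
printf '%s\n' 'junk ORA-12345 more stuff' 'junk ORB-12345 another stuff' | from_ora
```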

Counting the number of 7 character words in a file that start with tree and do not end in u or v

I'm trying to count the number of 7 character words in a file that start with tree and do not end in u or v. I know how to specify the begin with tree and end in u or v condition in cat, but I'm not sure how to identify exactly 7 words or enter the conditions using wc. My pathname is /users/file1.txt.
This is the valid cat command(missing number of 7 character words)
cat /users/file1.txt | grep ^tree.*[!uv]
Below is the invalid wc command(missing number of 7 character words)
wc - w /users/file1.txt | grep ^tree.*[!uv]
Do you like Perl? Here is a one-liner:
cat /users/file1.txt | perl -lne 'if (/^(tree)(.{3}$)(?<![uv])/) { print $_ }'
sed -e 's/%//g' -e 's/\btree..[^uv]\b/%/g' -e 's/[^%]//g' -e 's/%/word /g' /users/file1.txt | wc -w
Don't let anyone steal our token.
Give us a token for what we want to count; match word boundaries to count to 7, negate match character in (u,v).
Get rid of everything else.
Turn our token into a friendly word plus a space.
Count 'em.
Reut's answer is very close.
But this will get you where you need:
cat /users/file1.txt | grep -wo 'tree..[^uv]' | wc -l
-w will get exact word matches
see that I ditched the .* and specified .. instead, as the total number of characters matched is 7
I also got rid of the ^tree so you can also match words that aren't at the beginning of the line.
Using grep and wc:
# print the file | keep matching lines | extract 7-character words | count
cat /users/file1.txt | grep '^tree.*[^uv]$' | grep -o '[^\ ]\{7\}' | wc -w
Pipe walkthrough:
Print the content of the source file:
cat /users/file1.txt
Pass only lines starting with "tree" and not ending with either "u" or "v":
grep '^tree.*[^uv]$'
Forward any word that is composed of 7 non-spaces (if you want only letters use [a-zA-Z] instead of [^\ ]):
grep -o '[^\ ]\{7\}'
Count the words that made it here:
wc -w
Here is one other way using pretty basic bash:
count=0
for word in $(cat f.py)
do
if [ 7 -eq ${#word} ]
then
count=$((count+1))
fi
done
echo $count
Or in a single line:
count=0; for word in $(cat f.py); do if [ 7 -eq ${#word} ]; then count=$((count+1)); fi; done; echo $count
You may want to remove dots and commas from word.
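The loop above counts every 7-character word. A sketch that also applies the "starts with tree, does not end in u or v" conditions could use awk instead. The function name count_tree_words is hypothetical, and words are taken to be whitespace-separated, so attached punctuation counts as part of a word.

```shell
# Hypothetical helper: count words that are exactly 7 characters long,
# start with "tree" and do not end in u or v.
count_tree_words() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^tree..[^uv]$/)   # tree + 2 arbitrary chars + 1 char that is not u or v
        n++
  }
  END { print n + 0 }' "$@"
}

# Demo: only treeabc and treexyz qualify here
printf '%s\n' 'treeabc treeabu trees' 'xtreeab treexyz' | count_tree_words
```

For the file in the question, the call would be count_tree_words /users/file1.txt.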

Delete duplicate headers in awk

I used cat to combine several files and they all have the same headers. Is there any way I can retain the 1st occurrence of the header and delete the succeeding headers inside the concatenated file?
Thanks!
Example:
FirstName, LastName, Phone, Zip
(data)
(data)
(data)
FirstName, LastName, Phone, Zip
(data)
(data)
(data)
I'd do it this way:
sed '1h;2,$G;s/^\(.*\)\n\1$//;/./P;d' filename
You can do this:
cp file1 result
tail -q -n +2 file2 file3 file4 >> result
That is, start with the entire contents of file1, then append from the other files starting with line 2 of each. This way you avoid the need to try to find the extra headers and delete them later.
If you prefer, here's another formulation of the same:
head -1 file1 > result
tail -q -n +2 file1 file2 file3 file4 >> result
Try this:
sed -e '2,$s/FirstName, LastName, Phone, Zip//g' -e '/^$/d' Yourfile.txt
You can replace "FirstName, LastName, Phone, Zip" with whatever header you have. From the 2nd line to the end of the file, it replaces the header pattern with nothing, then deletes the resulting blank lines with /^$/d.
Here is an awk version. It will skip all lines starting with FirstName except line 1.
awk 'NR>1 && /^FirstName/ {next}1' file
FirstName, LastName, Phone, Zip
(data)
(data)
(data)
(data)
(data)
(data)
If the header line changes, we need a pattern to follow.
awk way
awk '!a[$0];NR==1{a[$0]++}' file
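If the original files are still available, the repeated headers can also be dropped while concatenating, without knowing the header text. The following is a sketch with an invented function name, merge_with_one_header; it assumes every file starts with a header line and relies on awk's FNR (line number within the current file) and NR (line number overall).

```shell
# Hypothetical helper: concatenate files, keeping only the first file's header.
# Usage: merge_with_one_header file1 file2 file3 > result
merge_with_one_header() {
  # FNR == 1 marks the first line of each file; NR != 1 spares the very first one
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```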

How to grep a particular position line from the result?

I grep for a pattern in a directory, together with the 4 lines before that pattern. I need to further grep the top line from each result, but I can't work out how to do it.
Please suggest regarding this.
The problem explained with example :
in a directory 'direktory'
there are multiple files with different names, like 20130611 and 2013400.
The data written in the files that I am interested in looks like this:
[
My name is
.....
......
......
Name has been written above
]
In every instance, "Name has been written above" appears at the same position in the block of lines, but the value in place of "My name is" keeps changing, so I want to grep that particular line from every occurrence.
Please suggest some method to get the result.
Thanks in advance.
a@x:/tmp$ cat namefile
[
My name is
.....
......
......
Name has been written above
]
a@x:/tmp$ cat namefile | grep -B 4 "Name has been written above" | head -1
My name is
Where "4" can be replaced by N i.e. number of lines the target data lies above the grepped line
Try something like
for file in $(ls <wherever>)
do
# Tell the user which file we're looking at
echo ""
echo $file
echo ""
# Output the first line of the file
head -1 $file
# Output the line containing <pattern> and the four
# preceding lines
<your grep command here>
done
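Piping grep -B 4 into head -1 only yields the line for the first occurrence. To get the line N lines above every occurrence, one option is to let awk remember the lines it has seen. This is a sketch under assumptions: the function name line_before_match is hypothetical, the pattern is treated as a fixed string, and backslashes in it may be interpreted by awk's -v assignment.

```shell
# Hypothetical helper: for every line containing a fixed string, print the
# line that lies n lines above it in the same file.
# Usage: line_before_match 4 "Name has been written above" direktory/*
line_before_match() {
  n=$1
  pattern=$2
  shift 2
  awk -v n="$n" -v pat="$pattern" '
    { line[FNR] = $0 }                                  # remember each line by its number
    index($0, pat) && FNR > n { print line[FNR - n] }   # fixed-string search, then look back
  ' "$@"
}
```

All lines of a file are kept in memory, which is simple but may be wasteful for very large files.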
