How to list a specific string or number in a file in Unix

For example, if your file has the following lines:
1=10200|2=2343i|3=otit|5=89898|54=9546i96i|10=2459
1=10200|54=9546i96i|10=2459|2=2343i|3=otit|5=8
1=10200|5=IGY|14=897|459=122|132=1|54=9546i96i|10=2459
1=10200|2=2343i|5=0|54=9546i96i
The output should be
5=89898
5=8
5=IGY
5=0

You could use grep with the -o flag to return only the regexp matches.
Assuming you have a file.txt that you want to parse:
cat file.txt | grep -o -E "(\||^)5=[^|]*" | grep -o "5=[^|]*"
This will match anything that starts with 5= up until the first |.
By running this command on the input you provided I get:
5=89898
5=8
5=IGY
5=0
Cheers
Edit: as Walter A suggested, my previous solution did not cover all cases.
I have added an extra parsing step: first, you get all strings that match 5=... at the start of a line, or |5=..., and then you remove the |.
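As a rough check of the two-step grep above, the sketch below runs it on a small sample. The temporary file and the extra `45=` and `15=` fields are assumptions added for illustration; they are not part of the original input.

```shell
# Hypothetical sample: includes "45=" and "15=" fields that must NOT be reported.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1=10200|2=2343i|3=otit|5=89898|54=9546i96i|10=2459
45=skip|5=IGY|14=897
5=0|15=skip
EOF

# Step 1: keep "5=..." only when it starts a field (start of line or after "|").
# Step 2: strip the leading "|" that step 1 may have captured.
result=$(grep -o -E '(^|\|)5=[^|]*' "$tmp" | grep -o '5=[^|]*')
printf '%s\n' "$result"

rm -f "$tmp"
```

Anchoring on the field start is what keeps `45=skip` and `15=skip` out of the output.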

Use (^|[|]) for matching start of field (start of line or |) and remember/match string until next | or end-of-line.
sed -nr 's/.*(^|[|])(5=[^|]*).*/\2/p' file

Related

How do I use grep to find words of specified or unspecified length?

In the Unix command line (CentOS7) I have to use the grep command to find all words with:
At least n characters
At most n characters
Exactly n characters
I have searched the posts here for answers and came up with grep -E '^.{8}' /sample/dir but this only gets me the words with at least 8 characters.
Using the $ at the end returns nothing. For example:
grep -E '^.{8}$' /sample/dir
I would also like to trim the info in /sample/dir so that I only see the specific information. I tried using a pipe:
cut -f1,7 -d: | grep -E '^.{8}' /sample/dir
Depending on the order, this only gets me one or the other, not both.
I only want the usernames at the beginning of each line, not all words in each line for the entire file.
For example, if I want to find userids on my system, these should be the results:
1.
tano-ahsoka
skywalker-a
kenobi-obiwan
ahsoka-t
luke-s
leia-s
ahsoka-t
kenobi-o
grievous
I'm looking for two responses here as I have already figured out number 1.
Numbers 2 and 3 are not working for some reason.
If possible, I'd also like to apply the cut for all three outputs.
Any and all help is appreciated, thank you!
You can run one grep for extracting the words, and another for filtering based on length.
grep -oE '(\w|-)+' file | grep -Ee '^.{8,}$'
grep -oE '(\w|-)+' file | grep -Ee '^.{,8}$'
grep -oE '(\w|-)+' file | grep -Ee '^.{8}$'
Update the pattern based on requirements and maybe use -r and specify a directory instead of a file. Adding -h option may also be needed to prevent the filenames from being printed.
Depending on your implementation of grep, it might work to use:
grep -o -E '\<\w{8}\>' # exactly 8
grep -o -E '\<\w{8,}\>' # 8 or more
grep -o -E '\<\w{,8}\>' # 8 or less
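To combine the length filter with cut, one option is to let cut read the file and have grep read from the pipe instead of naming the file again. The sketch below assumes a passwd-style file; the sample entries are made up, and `{1,8}` is used in place of `{,8}` because an omitted lower bound is not portable across grep implementations.

```shell
# Hypothetical passwd-style sample; the real input would be a file such as /etc/passwd.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
tano-ahsoka:x:1001:1001::/home/tano-ahsoka:/bin/bash
luke-s:x:1002:1002::/home/luke-s:/bin/bash
grievous:x:1003:1003::/home/grievous:/bin/sh
EOF

# cut keeps field 1 (the username); grep then filters what arrives on the pipe,
# so grep gets no file argument of its own.
at_least=$(cut -d: -f1 "$tmp" | grep -E '^.{8,}$')   # 8 or more characters
at_most=$(cut -d: -f1 "$tmp" | grep -E '^.{1,8}$')   # 8 or fewer characters
exactly=$(cut -d: -f1 "$tmp" | grep -E '^.{8}$')     # exactly 8 characters

printf 'at least 8:\n%s\n' "$at_least"
printf 'at most 8:\n%s\n' "$at_most"
printf 'exactly 8:\n%s\n' "$exactly"

rm -f "$tmp"
```

The `$` anchor works here because each line that reaches grep is a bare username rather than a full record.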

How to pipe file into `egrep`?

I'm trying to get the hang of how to use egrep to count occurrences of something in a file.
If I have a file like this, called myfile:
commas and stuff
more things
Then I'm able to count the number of o's in the first line by:
head myfile | egrep -c "o" myfile
For some reason, I am not able to do this with pipes. If I have a file like this called mypipes:
| my | pipes |
I tried this:
head mypipes | egrep -c "|" mypipes
That gives me an error empty sub(expression), so I tried the following:
head mypipes | egrep -c "\|" mypipes
This gives me a value of "1", which is clearly wrong.
How do I do this correctly? A full explanation rather than a one-off solution would be phenomenal. Thanks.
From the manual:
-c, --count
Suppress normal output;
instead print a count of matching lines for each input file.
You're counting the lines, not the characters. To count the characters, print each match on its own line with -o and count those lines:
egrep -o '\|' mypipes | wc -l
Note: You either feed the input (e.g.: pipe), or specify the file you want to search in. Why do you do both?
In the example you've constructed, there is no need to pipe data to egrep because you're already telling it to read directly from a file.
If you don't provide a filename, egrep will read from STDIN instead:
head mypipes | egrep -c "\|"
As for your other question...
This gives me a value of "1", which is clearly wrong.
Is it? egrep -c counts the number of lines that match, which in this case is 1.
If you want to count occurrences, ignoring lines, this might help: Count total number of occurrences using grep
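The difference between counting matching lines and counting matches can be seen side by side in a small sketch. The two-line sample file is an assumption, and `grep -E` is used as the equivalent of egrep.

```shell
# Hypothetical two-line sample with three "|" characters on each line.
tmp=$(mktemp)
printf '%s\n' '| my | pipes |' '| more | pipes |' > "$tmp"

# -c counts LINES that contain at least one match.
lines=$(grep -E -c '\|' "$tmp")

# -o prints every match on its own line; wc -l then counts the matches.
# tr strips the padding that some wc implementations add.
chars=$(grep -E -o '\|' "$tmp" | wc -l | tr -d ' ')

echo "matching lines: $lines"   # 2
echo "pipe characters: $chars"  # 6

rm -f "$tmp"
```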

Using grep to search DNA sequence files

I am trying to use Unix's grep to search for specific sequences within files. The files are usually very large (~1 GB) and consist of 'A's, 'T's, 'C's, and 'G's. They also span many, many lines, with each line being a word of 60ish characters. The problem I am having is that when I search for a specific sequence within these files, grep returns results for patterns that occur on a single line, but not for patterns that span a line (with a line break somewhere in the middle). For example:
Using
$ grep -i -n "GACGGCT" grep3.txt
To search the file grep3.txt (I put the target 'GACGGCT's in double stars)
GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTG**GA
CGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
Returns
3:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
8:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
So, my problem here is that grep does not find the GACGGCT that spans the end of line 2 and the beginning of line 3.
How can I use grep to find target sequences that may or may not include a linebreak at any point in the string? Or how can I tell grep to ignore linebreaks in the target string? Is there a simple way to do this?
pcregrep -nM "G[\n]?A[\n]?C[\n]?G[\n]?G[\n]?C[\n]?T" grep3.txt
1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
2:CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
I assume that each line is 60 characters long. Then the command below should work:
tr '\n' ' ' < grep3.txt | sed -e 's/ //g' -e 's/.\{60\}/&^/g' | tr '^' '\n' | grep -i -n "GACGGCT"
output :
1:GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCTCCAGACCTGGCCCTCCCTGGC
2:AGGAGGAGCCTG**GACGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGCCACCAGG
4:CCAGGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
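If only the number of occurrences matters, not their line numbers, a simpler variation is to delete the newlines and count the matches. This is a sketch on the first three sample lines, not one of the answers above. Note that the whole sequence is handed to grep as one long line, which may use a lot of memory on a ~1 GB file.

```shell
# First three lines of the sample; the target also spans the break between lines 2 and 3.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
EOF

# Line-based search: only line 1 contains the target in one piece.
per_line=$(grep -c 'GACGGCT' "$tmp")

# Remove every newline, then print each match on its own line and count them.
joined=$(tr -d '\n' < "$tmp" | grep -o 'GACGGCT' | wc -l | tr -d ' ')

echo "lines matching as-is: $per_line"     # 1
echo "matches after joining lines: $joined" # 3

rm -f "$tmp"
```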

How to grep for the whole word

I am using the following command to grep stuff in subdirs
find . | xargs grep -s 's:text'
However, this also finds stuff like <s:textfield name="sdfsf"...../>
What can I do to avoid that so it just finds stuff like <s:text name="sdfsdf"/>
OR for that matter....also finds <s:text somethingElse="lkjkj" name="lkkj"
basically s:text and name should be on same line....
You want the -w option, which makes grep match only whole words.
find . | xargs grep -sw 's:text'
Use \b to match on "word boundaries", which will make your search match on whole words only.
So your grep would look something like
grep -r "\bSTRING\b"
adding color and line numbers might help too
grep --color -rn "\bSTRING\b"
From http://www.regular-expressions.info/wordboundaries.html:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a
word character.
After the last character in the string, if the last
character is a word character.
Between two characters in the string,
where one is a word character and the other is not a word character.
You can drop the xargs command by making grep search recursively. And you normally don't need the 's' flag. Hence:
grep -wr 's:text'
You could also try rg (https://github.com/BurntSushi/ripgrep):
rg -w 's:text' .
That should do it.
Use -w option for whole word match. Sample given below:
[binita@ubuntu ~]# a="abcd efg"
[binita@ubuntu ~]# echo $a
abcd efg
[binita@ubuntu ~]# echo $a | grep ab
abcd efg
[binita@ubuntu ~]# echo $a | grep -w ab
[binita@ubuntu ~]# echo $a | grep -w abcd
abcd efg
This is another way of setting the boundaries of the word. Note that it doesn't work without the quotes around it:
grep -r '\<s:text\>' .
If you just want to filter out the remainder text part, you can do this.
xargs grep -s 's:text '
This should find only s:text instances with a space after the last t. If you need to find s:text instances that only have a name element, either pipe your results to another grep expression, or use regex to filter only the elements you need.
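The effect of -w, and of piping into a second grep to require a name attribute, can be checked on a small sample. The file contents below are made up for illustration.

```shell
# Hypothetical JSP-like sample lines.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<s:text name="title"/>
<s:textfield name="user"/>
<s:text somethingElse="x" name="label"/>
<s:text id="plain"/>
EOF

plain=$(grep -c 's:text' "$tmp")                        # substring match, includes s:textfield
whole=$(grep -cw 's:text' "$tmp")                       # whole-word match only
with_name=$(grep -w 's:text' "$tmp" | grep -c 'name=')  # whole word AND name= on the same line

echo "substring: $plain, whole word: $whole, whole word with name: $with_name"
# substring: 4, whole word: 3, whole word with name: 2

rm -f "$tmp"
```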

Can you grep a file using a regular expression and only output the matching part of a line?

I have a log file which contains a number of error lines, such as:
Failed to add email@test.com to database
I can filter these lines with a single grep call:
grep -E 'Failed to add (.*) to database'
This works fine, but what I'd really like to do is have grep (or another Unix command I pass the output into) only output the email address part of the matched line.
Is this possible?
sed is fine without grep:
sed -n 's/Failed to add \(.*\) to database/\1/p' filename
You can also just pipe grep to itself :)
grep -E 'Failed to add (.*) to database' | grep -Eo "[^ ]+@[^ ]+"
Or, if "lines in interest" are the only ones with emails, just use the last grep command without the first one.
You can use sed:
grep -E 'Failed to add (.*) to database' | sed 's/Failed to add \(.*\) to database/\1/'
Recent versions of GNU grep have a -o option which does exactly what you want. (-o is for --only-matching).
This should do the job:
grep -oP '(?<=Failed to add ).+?(?= to database)'
It uses a positive look-behind assertion, followed by the match for the email address, followed by a positive look-ahead assertion. This ensures that the surrounding text is checked, but only the email address part is actually consumed (and thus returned).
The -P option enables Perl-compatible regular expressions (needed for the assertions; GNU grep only), and -o prints only the matched part of each line.
or python:
cat file | python3 -c "import re, sys; print('\n'.join(re.findall('add (.*?) to', sys.stdin.read())))"
The -r option for sed (GNU sed; -E is the portable spelling) enables extended regexps, so the group needs no backslashes:
sed -n -r 's/Failed to add (.*) to database/\1/p' filename
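The sed approach can be tried end to end with a throwaway log. The log lines below are assumptions for illustration, and the basic-regex form with `\(...\)` is used so the sketch does not depend on -r or -E being available.

```shell
# Hypothetical log sample: one unrelated line and two error lines.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
INFO startup complete
Failed to add email@test.com to database
Failed to add other@example.org to database
EOF

# -n suppresses default output; the p flag prints only lines where the
# substitution happened, leaving just the captured address.
emails=$(sed -n 's/.*Failed to add \(.*\) to database.*/\1/p' "$tmp")
printf '%s\n' "$emails"

rm -f "$tmp"
```

Lines that do not match the pattern, such as the INFO line, produce no output at all.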
If you just want to use grep and output only matching part of line
grep -E -o 'Failed to add (.*) to database'
Then maybe if you want to write it to a file
cat yourlogfile | grep -E -o 'Failed to add (.*) to database' >> outputfile
According to grep's help text, the option is: -o, --only-matching: show only the nonempty parts of lines that match.
If you want to use grep, it would be more appropriate to use egrep (equivalent to grep -E):
About egrep
Search a file for a pattern using full regular expressions.
Plain grep does not always offer the same regex functionality.
