How to pipe file into `egrep`? - unix

I'm trying to get the hang of how to use egrep to count occurrences of something in a file.
If I have a file like this, called myfile:
commas and stuff
more things
Then I'm able to count the number of o's in the first line with:
head myfile | egrep -c "o" myfile
For some reason, I am not able to do this with pipes. If I have a file like this called mypipes:
| my | pipes |
I tried this:
head mypipes | egrep -c "|" mypipes
That gives me the error "empty (sub)expression", so I tried the following:
head mypipes | egrep -c "\|" mypipes
This gives me a value of "1", which is clearly wrong.
How do I do this correctly? A full explanation rather than a one-off solution would be phenomenal. Thanks.

From the manual:
-c, --count
Suppress normal output;
instead print a count of matching lines for each input file.
You're counting the lines, not the characters. To count every occurrence, print each match on its own line with -o and count those lines:
egrep -o '\|' mypipes | wc -l
Note: You either feed the input (e.g.: pipe), or specify the file you want to search in. Why do you do both?
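The difference between the two counts can be checked on the sample file from the question. This is a minimal sketch: the temporary directory and variable names are made up for illustration, and `grep -E` is used in place of `egrep` because recent GNU grep releases print a warning for the older name.

```shell
# Recreate the one-line sample file from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '| my | pipes |\n' > "$tmpdir/mypipes"

# -c counts matching LINES: one line contains a pipe, so this is 1.
line_count=$(grep -E -c '\|' "$tmpdir/mypipes")

# -o prints every match on its own line, so wc -l counts OCCURRENCES.
occurrence_count=$(grep -E -o '\|' "$tmpdir/mypipes" | wc -l)

echo "lines: $line_count, occurrences: $occurrence_count"
rm -r "$tmpdir"
```

The backslash is what makes `|` a literal character here; unescaped, it is the alternation operator in extended regular expressions, which is why the bare `"|"` pattern was rejected.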

In the example you've constructed, there is no need to pipe data to egrep because you're already telling it to read directly from a file.
If you don't provide a filename, egrep will read from STDIN instead:
head mypipes | egrep -c "\|"
As for your other question...
This gives me a value of "1", which is clearly wrong.
Is it? egrep -c counts the number of lines that match, which in this case is 1.
If you want to count occurrences, ignoring lines, this might help: Count total number of occurrences using grep
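A small experiment makes the stdin-versus-file behaviour visible. It is only a sketch with made-up variable names: the piped text deliberately contains no pipe character, so the two counts show which input grep actually read.

```shell
tmpdir=$(mktemp -d)
printf '| my | pipes |\n' > "$tmpdir/mypipes"

# A file operand is given, so grep reads the file and ignores the piped text.
with_file=$(printf 'no pipes here\n' | grep -E -c '\|' "$tmpdir/mypipes")

# No file operand: grep reads standard input, which has no matching line.
# grep exits non-zero when nothing matches, hence the "|| true".
piped_only=$(printf 'no pipes here\n' | grep -E -c '\|' || true)

echo "with file operand: $with_file, stdin only: $piped_only"
rm -r "$tmpdir"
```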

Related

How do I use grep to find words of specified or unspecified length?

In the Unix command line (CentOS7) I have to use the grep command to find all words with:
At least n characters
At most n characters
Exactly n characters
I have searched the posts here for answers and came up with grep -E '^.{8}' /sample/dir but this only gets me the words with at least 8 characters.
Using the $ at the end returns nothing. For example:
grep -E '^.{8}$' /sample/dir
I would also like to trim the info in /sample/dir so that I only see the specific information. I tried using a pipe:
cut -f1,7 -d: | grep -E '^.{8}' /sample/dir
Depending on the order, this only gets me one or the other, not both.
I only want the usernames at the beginning of each line, not all words in each line for the entire file.
For example, if I want to find userids on my system, these should be the results:
1.
tano-ahsoka
skywalker-a
kenobi-obiwan
ahsoka-t
luke-s
leia-s
ahsoka-t
kenobi-o
grievous
I'm looking for two responses here as I have already figured out number 1.
Numbers 2 and 3 are not working for some reason.
If possible, I'd also like to apply the cut for all three outputs.
Any and all help is appreciated, thank you!
You can run one grep for extracting the words, and another for filtering based on length.
grep -oE '(\w|-)+' file | grep -Ee '^.{8,}$'
grep -oE '(\w|-)+' file | grep -Ee '^.{,8}$'
grep -oE '(\w|-)+' file | grep -Ee '^.{8}$'
Update the pattern based on requirements and maybe use -r and specify a directory instead of a file. Adding -h option may also be needed to prevent the filenames from being printed.
Depending on your implementation of grep, it might work to use:
grep -o -E '\<\w{8}\>' # exactly 8
grep -o -E '\<\w{8,}\>' # 8 or more
grep -o -E '\<\w{,8}\>' # 8 or fewer
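For the username part of the question, one possible approach is to let cut read the file and have grep filter cut's output from standard input, instead of giving grep the file name again. The sketch below assumes a passwd-style, colon-separated file; the file name `users` and its contents are invented for illustration, and only field 1 is kept because the length test applies to the username. The bound `{1,8}` is written with an explicit minimum because an omitted lower bound is not accepted by every grep.

```shell
tmpdir=$(mktemp -d)
# Hypothetical passwd-style sample; the real file in the question is not shown.
cat > "$tmpdir/users" <<'EOF'
tano-ahsoka:x:1001:1001::/home/tano-ahsoka:/bin/bash
luke-s:x:1002:1002::/home/luke-s:/bin/bash
grievous:x:1003:1003::/home/grievous:/bin/sh
kenobi-o:x:1004:1004::/home/kenobi-o:/bin/bash
EOF

# cut extracts the first field; grep has no file operand, so it reads the pipe.
at_least_8=$(cut -d: -f1 "$tmpdir/users" | grep -E '^.{8,}$')
at_most_8=$(cut -d: -f1 "$tmpdir/users" | grep -E '^.{1,8}$')
exactly_8=$(cut -d: -f1 "$tmpdir/users" | grep -E '^.{8}$')

printf 'at least 8:\n%s\n' "$at_least_8"
printf 'at most 8:\n%s\n' "$at_most_8"
printf 'exactly 8:\n%s\n' "$exactly_8"
rm -r "$tmpdir"
```

The `$` anchor works here because each line now holds only the username; against the full passwd-style line, `^.{8}$` would require the whole line to be eight characters long.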

how to list a specific string or number in a file in Unix

For example, if your file has the following lines:
1=10200|2=2343i|3=otit|5=89898|54=9546i96i|10=2459
1=10200|54=9546i96i|10=2459|2=2343i|3=otit|5=8
1=10200|5=IGY|14=897|459=122|132=1|54=9546i96i|10=2459
1=10200|2=2343i|5=0|54=9546i96i
The output should be
5=89898
5=8
5=IGY
5=0
You could use grep with the -o flag to return only the regexp matches.
Assuming you have a file.txt that you want to parse:
cat file.txt | grep -o -E "(\||^)5=[^|]*" | grep -o "5=[^|]*"
This will match anything that starts with 5= up until the first |.
By running this command on the input you provided I get:
5=89898
5=8
5=IGY
5=0
Cheers
Edit: as Walter A suggested, my previous solution did not cover all cases.
I have added an extra parsing step: first, you get all strings that match 5=... at the start of a line, or |5=..., and then you remove the |.
Use (^|[|]) for matching start of field (start of line or |) and remember/match string until next | or end-of-line.
sed -nr 's/.*(^|[|])(5=[^|]*).*/\2/p' file
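Why the start-of-field anchor matters can be seen with a field whose key merely ends in 5. In the sketch below the first line comes from the question, while the second line (with key 15) is made up to trigger the false match.

```shell
tmpdir=$(mktemp -d)
# First line is from the question; the second is an invented line with key 15.
cat > "$tmpdir/file.txt" <<'EOF'
1=10200|2=2343i|3=otit|5=89898|54=9546i96i|10=2459
1=10200|15=nope|5=yes|10=2459
EOF

# Unanchored: also picks up the tail of "15=nope".
unanchored=$(grep -o '5=[^|]*' "$tmpdir/file.txt")

# Anchored to start of line or a preceding "|"; the second grep drops the "|".
anchored=$(grep -o -E '(^|\|)5=[^|]*' "$tmpdir/file.txt" | grep -o '5=[^|]*')

printf 'unanchored:\n%s\n' "$unanchored"
printf 'anchored:\n%s\n' "$anchored"
rm -r "$tmpdir"
```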

ls and xargs to output specific file extensions

I am trying to use ls and xargs to print files with specific extensions (.bam and .vcf) without the path. The command below is close, but when I pipe the two ls commands together I get the error shown. Run separately they work fine, except that each file is printed on its own line (my actual data has hundreds of files, so printing each pair on one line would make it easier to read). Thank you :).
files in directory
1.bam
1.vcf
2.bam
2.vcf
command with error
ls /home/cmccabe/Desktop/NGS/test/R_folder/*.bam | xargs -n1 basename | ls /home/cmccabe/Desktop/NGS/test/R_folder/*.vcf | xargs -n1 basename >> /home/cmccabe/Desktop/NGS/test/log
xargs: basename: terminated by signal 13
desired output
1.bam 1.vcf
2.bam 2.vcf
You cannot pipe output into ls and have it print that with its other output. You should give the parameters to the first one and it will output everything.
ls *.a *.b *.c | xargs ...
ls isn't really doing anything for you currently, it's the shell that's listing all your files. Since you're piping ls's output around, you're actually vulnerable to dangerous file names.
basename can take multiple arguments with the -a option:
basename -a "path/to/files/"*.{bam,vcf}
To print that in two columns, you could use printf via xargs, with sort for... sorting. The -z or -0 flags throughout cause null bytes to be used as the filename separators:
basename -az "path/to/files/"*.{bam,vcf} | sort -z | xargs -0n 2 printf "%b\t%b\n"
If you're going to be doing any more processing after printing to columns, you may want to replace the %bs in the printf format with %qs. That will escape non-printable characters in the output, but might look a bit ugly to human eyes.
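If the file names are known to contain no newlines or other unusual characters, a simpler line-based variant may be enough. This is a sketch under that assumption, using a scratch directory with the four file names from the question; `paste - -` joins every two input lines with a tab.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/1.bam" "$tmpdir/1.vcf" "$tmpdir/2.bam" "$tmpdir/2.vcf"

# The shell expands both globs; basename -a strips the directory from each name.
# sort puts 1.bam next to 1.vcf, and paste joins each pair onto one line.
columns=$(basename -a "$tmpdir"/*.bam "$tmpdir"/*.vcf | sort | paste - -)

printf '%s\n' "$columns"
rm -r "$tmpdir"
```

The pairing relies on every .bam file having exactly one .vcf partner that sorts directly after it; a missing partner would shift all later pairs.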

Unix: How can I count all lines containing a string in all files in a directory and see the output for each file separately

In UNIX I can do the following:
grep -o 'string' myFile.txt | wc -l
which will count the number of lines in myFile.txt containing the string.
Or I can use :
grep -o 'string' *.txt | wc -l
which will count the number of lines in all .txt extension files in my folder containing the string.
I am looking for a way to do the count for all files in the folder but to see the output separated for each file, something like:
myFile.txt 10000
myFile2.txt 20000
myFile3.txt 30000
I hope I have made myself clear; if not, you can see a somewhat similar example in the output of:
wc -l *.txt
Why not simply use grep -c which counts matching lines? According to the GNU grep manual it's even in POSIX, so should work pretty much anywhere.
Incidentally, your use of -o makes your commands count every occurrence of the string, not every line with any occurrences:
$ cat > testfile
hello hello
goodbye
$ grep -o hello testfile
hello
hello
And you're doing a regular expression search, which may differ from a string search (see the -F flag for string searching).
Use a loop over all files, something like
for f in *.txt; do printf '%s\t' "$f"; grep 'string' "$f" | wc -l; done
But I must admit that @Yann's grep -c is neater :-). The loop can be useful for more complicated things though.
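Both variants can be compared side by side. In this sketch the two sample files are invented: `a.txt` has two matching lines but three occurrences, so the per-line and per-occurrence counts differ.

```shell
tmpdir=$(mktemp -d)
printf 'string one\nstring string\nother\n' > "$tmpdir/a.txt"
printf 'nothing\n' > "$tmpdir/b.txt"

# With several files, grep -c prints "name:count" of matching LINES per file.
# The cd happens inside the command substitution, so the caller is unaffected.
per_file_lines=$(cd "$tmpdir" && grep -c 'string' *.txt)

# A loop is needed to count OCCURRENCES per file (grep -o piped to wc -l).
per_file_occurrences=$(cd "$tmpdir" && for f in *.txt; do
  printf '%s\t%d\n' "$f" "$(grep -o 'string' "$f" | wc -l)"
done)

printf '%s\n' "$per_file_lines"
printf '%s\n' "$per_file_occurrences"
rm -r "$tmpdir"
```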

"grep | xargs grep" with search conditions on different strings

I want to grep files that contain text "wp_" but do not contain text "wp3_". E.g. I've got a file with two strings:
wp_123
wp3_123
I try $ grep -lr wp_ ~/tmp | xargs grep -vl wp3_
It outputs this file name! But if I remove the linebreak, it's working like I want, i.e. handles string "wp_123 wp3_123" correctly.
How to make it work with search conditions on different strings?
P.S. Sorry for kind of duplicate, but seems that nobody noticed my comment during last hour...
This should work
$ grep -lr 'wp_' ~/tmp | xargs grep -L 'wp3_'
The single quotes are not necessary in this case, but are a good habit to prevent pattern characters from being interpreted by the shell. In your original attempt, -vl means "print each file with at least one line that does not match". Here, -L means "print each file with no lines that match".
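The difference between -vl and -L shows up as soon as a file mixes matching and non-matching lines. In the sketch below, `both.php` mirrors the two-line file from the question and `only_wp.php` is an invented second file; `sort` is added only to make the output order predictable.

```shell
tmpdir=$(mktemp -d)
printf 'wp_123\nwp3_123\n' > "$tmpdir/both.php"
printf 'wp_456\n' > "$tmpdir/only_wp.php"

# -vl: files with at least one line that does NOT contain wp3_ (both qualify).
with_vl=$(grep -lr 'wp_' "$tmpdir" | sort | xargs grep -vl 'wp3_')

# -L: files in which NO line contains wp3_ (only only_wp.php qualifies).
with_L=$(grep -lr 'wp_' "$tmpdir" | sort | xargs grep -L 'wp3_')

printf -- '-vl lists:\n%s\n' "$with_vl"
printf -- '-L lists:\n%s\n' "$with_L"
rm -r "$tmpdir"
```

Plain `xargs` splits its input on whitespace, so this form assumes file names without spaces or newlines.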

Resources