Extract pattern from multiple files and save with file names - count

I can save the occurrence count of an extracted pattern from multiple gzipped files in a single output file, but I also want to save the input file names.
I have used the grep command to extract the pattern and wc -l to count the occurrences.
#!/bin/bash
for sample in *.gtf.gz;
do
#echo $sample
describer=$(echo ${sample} | sed 's/_sorted_astalavista.gtf.gz//')
echo $describer
zgrep -w '0,1-2^' $sample | wc -l | tee -a output2.txt
done
For example, the pattern occurs 300 times in input_file_1 and 500 times in input_file_2. I am able to save 300 and 500 in a file:
300
500
but I also want to include input file names so that output looks like:
input_file_1 300
input_file_2 500
...
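One way to get that output (a sketch adapting the loop above; zgrep -c counts matching lines, which is equivalent to piping to wc -l) is to capture the count and print it next to the describer:
count=$(zgrep -w -c '0,1-2^' "$sample")
echo "$describer $count" | tee -a output2.txt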

Related

LS - sort by date, display only file name

I'm trying to list some files, but I only want the file names, in order of file date. I've tried a few commands but they don't seem to work.
I know that using this code I can list only the file names:
ls -f *
And I know that using this command I can list the files sorted by date:
ls -ltr *
So I have tried using this command to list the file names only, sorted by file date, but it doesn't sort by date:
ls -ltr -f *
That last command simply lists the file names, but sorted by file name, not date.
Any ideas how I can do this with a simple ls command?
FYI, once I get this working my ultimate goal is to only list the most recently created 10 file names, using something like this:
ls -ltr -f * | tail -10
You could try the following command:
ls -ltr | awk '{ print $9 }' | tail -n +2
It extracts the file names from the ls -ltr command.
According to the manual for ls, the -f flag is used to:
-f do not sort, enable -aU, disable -ls --color
One way of extracting only files would be:
ls -p | grep -v /
The option -p appends a '/' to each directory name, so we can grep for lines not containing a '/'.
To extract the 10 most recently modified files, you could do the following:
ls -ptr * | grep -v / | tail -10
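If the ultimate goal is just the 10 most recently modified file names, a slightly shorter variant of the same idea (a sketch; like any approach that parses ls output, it breaks on file names containing newlines):
ls -tp | grep -v / | head -n 10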

Find files containing two separate words in text

My UNIX is quite rusty, but what I want is to search a location in UNIX for files whose text contains the two separate words "generate" and "process", with both words on the SAME LINE.
I know there are script files that contain details of the script author and its function noted at the top of the script. For example, the start of one such script contains the following:
function: generate sales overtime process
I have tried things like the following (again my UNIX is rusty)
grep -rwl . -e "generate" | "process"
But this gives errors such as "unrecognised command".
What I want is a list of Progress files like:
salesovertime1.p
salestravel1.p
salesexpenses1.p
salesexpenses2.p
If you are searching for files, then find is appropriate, and you can then filter with grep:
find . -exec grep -H generate {} \; 2> /dev/null | grep process
This recursively finds every file from the current directory, keeps the lines that contain the word "generate", and then filters those again for the lines that also contain the word "process". File names are included in the output with the -H option (GNU grep), and error messages are redirected to /dev/null.
Now if you want filenames only, you can use :
find . -exec grep -H generate {} \; 2> /dev/null | grep process | cut -f1 -d\:
If you want "generate" and "process" in the same file but on different lines, the following will do it:
grep process `find . -exec grep -H generate {} \; 2> /dev/null | cut -f1 -d\:` 2> /dev/null | cut -f1 -d\:
The find generates a file list that is then grepped against to extract the file names, with errors redirected again.
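With GNU grep, a single recursive command may also do the job (a sketch; the alternation allows the two words to appear in either order on the line):
grep -rlE 'generate.*process|process.*generate' .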

Counting all words in multiple files and outputting each count along with its filename

I am struggling with the following issue. I have a Unix directory containing ~ 60K files and I would like to be able to count all the words in each file and output a list with each count along with its corresponding filename.
This is a job for find and wc:
find . -maxdepth 1 -type f -exec wc -w {} \;
This will find all files (-type f) in the current working directory (.), without recursing into subdirectories (-maxdepth 1) and for each result will execute (-exec [...] \;) wc -w passing that filename ({}) as an argument.
wc prints the number of newlines, words, and bytes in files by default, -w specifies it should just print the word-count.
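With ~60K files, running one wc process per file is slow. A variant worth trying (a sketch): the POSIX + terminator passes many file names to each wc invocation, though wc then prints a "total" line per batch, which the grep filters out (assuming no file name itself ends in " total"):
find . -maxdepth 1 -type f -exec wc -w {} + | grep -v ' total$'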
A few more handy commands to get results from the current folder.
To find a word in all files of type java:
find . -name '*.java' | xargs grep 'word'
To find a word in all files:
find . -type f | xargs grep 'word'
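If GNU grep is available, find can be skipped entirely (a sketch; --include is a GNU extension):
grep -r --include='*.java' 'word' .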

Appending multiple files into one file

I append multiple data files into a single data file using the cat command. How can I assign that single file's contents to a new file?
I am using the command:
cat file1 file2 file3 > Newfile.txt
AnotherFile=`cat Newfile.txt`
sort $AnotherFile | uniq -c
It shows an error like "cannot open AnotherFile".
How can I assign this new file's contents to another file?
Original answer to original question
Well, the easiest way is probably cp:
cat file1 file2 file3 > Newfile.txt
cp Newfile.txt AnotherFile.txt
Failing that, you can use:
cat file1 file2 file3 > Newfile.txt
AnotherFile=$(cat Newfile.txt)
echo "$AnotherFile" > AnotherFile.txt
Revised answer to revised question
The original question had echo "$AnotherFile" as the third line; the revised question has sort $AnotherFile | uniq -c as the third line.
Assuming that sort $AnotherFile is not sorting all the contents of the files mentioned in the list created from concatenating the original files (that is, assuming that file1, file2 and file3 do not contain just lists of file names), then the objective is to sort and count the lines found in the source files.
The whole job can be done in a single command line:
cat file1 file2 file3 | tee Newfile.txt | sort | uniq -c
Or (more usually):
cat file1 file2 file3 | tee Newfile.txt | sort | uniq -c | sort -n
which lists the lines in increasing order of frequency.
If you really do want to sort the contents of the files listed in file1, file2, file3 but only list the contents of each file once, then:
cat file1 file2 file3 | tee Newfile.txt | sort -u | xargs sort | sort | uniq -c
It looks weird having three sort-related commands in a row, but there is justification for each step. The sort -u ensures each file name is listed once. The xargs sort converts a list of file names on standard input into a list of file names on the sort command line. The output of this is the sorted data from each batch of files that xargs produces. If there are so few files that xargs doesn't need to run sort more than once, then the following plain sort is redundant. However, if xargs has to run sort more than once, then the final sort has to deal with the fact that the first lines from the second batch produced by xargs sort probably come before the last lines produced by the first batch produced by xargs sort.
This becomes a judgement call based on knowledge of the data in the original files. If the files are small enough that xargs won't need to run multiple sort commands, omit the final sort. A heuristic would be "if the sum of the sizes of the source files is smaller than the maximum command line argument list, don't include the extra sort".
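For reference, you can inspect the limit behind this heuristic (a sketch; the effective limit is also reduced by the size of the environment):
getconf ARG_MAX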
You can probably do that in one go:
# Write to two files at once. Both files have a constantly varying
# content until cat is finished.
cat file1 file2 file3 | tee Newfile.txt > Anotherfile.txt
# Save the output filename, just in case you need it later
filename="Anotherfile.txt"
# This reads the contents of Newfile into a variable called AnotherText
AnotherText=`cat Newfile.txt`
# This is the same as "cat Newfile.txt"
echo "$AnotherText"
# This saves AnotherText into Anotherfile.txt
echo "$AnotherText" > Anotherfile.txt
# This too, using cp and the saved name above
cp Newfile.txt "$filename"
If you want to create the second file all in one go, this is a common pattern:
# During this process the contents of tmpfile.tmp is constantly changing
{ slow process creating text } > tmpfile.tmp
# Very quickly create a complete Anotherfile.txt
mv tmpfile.tmp Anotherfile.txt
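A concrete instance of that pattern (a sketch using a hypothetical slow sort step):
# Readers of Anotherfile.txt never see a half-written result
sort bigfile.txt > tmpfile.tmp
mv tmpfile.tmp Anotherfile.txt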
Create the file and redirect to it in append mode:
touch Newfile.txt
cat files* >> Newfile.txt
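Note that >> appends, so re-running those two lines accumulates duplicate content. If that is not wanted, a plain truncating redirect avoids both the repetition and the need for touch:
cat files* > Newfile.txt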

How to find the total number of occurrences of a text, and the files containing it, with the find command

I am trying to run a find command to find the total number of occurrences of a particular text, and another find command to get the number of files which contain this text.
What I have right now is this command.
find . -name "*.txt" | xargs grep -i "abc"
This brings up all the files which contain the text abc. I want one or two find commands that give me:
the total number of times abc appears
the total number of files which contain abc
You need to use more grep(1) options to do what you wish:
For the total number of times abc occurs, you need to take care of the case that abc is on a single line two or more times:
find . -name '*.txt' -print0 | xargs -0 grep -o -i abc | wc -l
For the total number of files that contain abc, you need to take care of the case that a single file contains abc two or more times:
find . -name '*.txt' -print0 | xargs -0 grep -l -i abc | wc -l
From the grep(1) manpage:
-l, --files-with-matches
Suppress normal output; instead print the name of each
input file from which output would normally have been
printed. The scanning will stop on the first match.
(-l is specified by POSIX.)
and
-o, --only-matching
Print only the matched (non-empty) parts of a matching
line, with each such part on a separate output line.
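Putting the two together (a sketch that captures each count in a shell variable):
matches=$(find . -name '*.txt' -print0 | xargs -0 grep -o -i abc | wc -l)
files=$(find . -name '*.txt' -print0 | xargs -0 grep -l -i abc | wc -l)
echo "abc occurs $matches times across $files files"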
