UNIX: cat large number of files - output being doubled

I need to concatenate a large number of text files from a series of directories. All the text files have the same name but some folders do not contain the file and just need to be skipped.
When I use cat ./**/File.txt > newFile.txt I get the following error /bin/cat: Argument list too long.
I tried using the ulimit command a few different ways but that did not work.
I have tried:
find . -name File.txt -exec cat {} \; > newFile.txt
find . -name File.txt -exec cat {} \+ > newFile.txt
find . -type f -name File.txt | xargs cat
and this results in the files being concatenated twice. For example, I have 3 text files named File.txt, each in a different directory, each with a different line of text:
test1
test2
test3
When I do the above commands my newFile.txt looks like:
test1
test2
test3
test1
test2
test3
I can't figure out why the output is doubled. When I use the command cat ./**/File.txt > newFile.txt on my small test set, it works fine and I end up with one file that has:
test1
test2
test3
I also tried
for a in File.txt ; do cat $a >> newFile.txt ; done
but get the message
cat: File.txt: No such file or directory
my guess is that this is because some of the directories do not contain this text file.
Is there another way to do this, or is there a reason my files are being concatenated twice?

Here's how I would do it
find . -name File.txt -exec cat {} >> output.txt \;
This searches for all occurrences of the file File.txt and appends the cat'ed output of that file to the file output.txt
However, I have tried your find command and it works too.
find . -name File.txt -exec cat {} \; > newFile.txt
I would suggest that you empty the output file newFile.txt before you try either your find or mine, as follows:
>newFile.txt
This is a handy way to empty a file's contents. (Although it should not matter to you right now, emptying a file by redirecting nothing into it works even if another process is writing to the file.)
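Putting the two together, a minimal sketch (just my combination of the truncation tip with the find command above, not a new command from the original answer):
> newFile.txt
find . -name File.txt -exec cat {} + >> newFile.txt
The -exec ... {} + form also sidesteps the "Argument list too long" error, because find itself builds argument lists that stay under the system limit.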
Hope this helps.

Related

How to cat all files whose filenames contain certain words in unix

I have a bunch of files in one directory; what I wanted to do is:
cat a-12-08.json b-12-08_others.json b-12-08-mian.json >> new.json
But there are too many files. Is there any command I can use to cat all files with "12-08" in their filename?
I found the solution below.
Here is the answer:
cat *12-08* >> new.json
You can use find to do what you want to achieve:
find . -type f -name '*12-08*' -exec sh -c 'grep "one" {} && cat {} >> /tmp/output.txt' \;
This way you cat only the files that contain the word you are looking for.
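If you do not need the content filter, the same find approach works without the grep step (my simplified variant, not from the original answer; -maxdepth 1 keeps it to the single directory mentioned in the question):
find . -maxdepth 1 -type f -name '*12-08*' -exec cat {} + >> new.json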
Use a wildcard name:
cat *12-08* >>new.json
This will work as long as there aren't so many files that you exceed the maximum length of a command line, ARG_MAX (2MB on the Linux systems I checked).
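If you want to check how close you are to that limit, or avoid it entirely, a small sketch (my addition; it assumes bash, where printf is a builtin and therefore never hits ARG_MAX itself):
getconf ARG_MAX
printf '%s\0' *12-08* | xargs -0 cat >> new.json
getconf prints the limit in bytes, and xargs -0 splits the expanded list into as many cat invocations as needed.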

How to find files that match names in a list and copy them to a directory?

I have a list of 50 names that look like this:
O8-E7
O8-F2
O8-F6
O8-F8
O8-H2
O9-A5
O9-B8
O9-D8
O9-E2
O9-F5
O9-H12
S37-A5
S37-B11
S37-B12
S37-C12
S37-D12
S37-E8
S37-G2
I want to look inside a specific directory for all the subdirectories whose name contains one of these elements.
For example, the directory Sample_S37-G2-from-Specimen-001 would be a match.
Inside those subdirectories, there is a file called accepted_hits.bam (unfortunately named the same way in all of them). I want to find these files and copy them into a single folder, with the name of the sample subdirectory that they came from.
For example, I would copy the accepted_hits.bam file from the subdirectory Sample_S37-G2-from-Specimen-001 to the new_dir as S37-G2_accepted_hits.bam
I tried using find, but it's not working and I don't really understand why.
cat sample.list | while read FILENAME; do find /path/to/sampleDirectories -name "$FILENAME" -exec cp '{}' new_dir\; done
Any ideas? Thanks!
You are looking for dirs that are exactly the same as the lines in your input.
The first improvement would be using wildcards
cat sample.list | while read FILENAME; do
    find /path/to/sampleDirectories -name "*${FILENAME}*" -exec cp '{}' new_dir \;
done
Your new problem is that now you will be matching directories, not files. What you actually want are the files named accepted_hits.bam inside those directories.
So your next try would be parsing the output of
find /path/to/sampleDirectories -name accepted_hits.bam | grep "${FILENAME}"
but you do not want to call find for each entry in sample.list.
You need to start from a single find command and pick out the relevant results from it.
A complication is that you want the matched substring from orgfile in your destination file name. Look at the grep options -o and -f; they help!
find /path/to/sampleDirectories -name accepted_hits.bam | while read orgfile; do
    matched_part=$(echo "${orgfile}" | grep -o -f sample.list)
    if [ -n "${matched_part}" ]; then
        cp "${orgfile}" "new_dir/${matched_part}_accepted_hits.bam"
    fi
done
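To see what grep -o -f contributes here, a quick illustration with a path modelled on the question (my own example):
echo "/path/to/sampleDirectories/Sample_S37-G2-from-Specimen-001/accepted_hits.bam" | grep -o -f sample.list
This prints S37-G2, which is exactly the piece that ends up in the destination file name.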
This approach will only work when your sample.list has no extra spaces. If there are spaces and you cannot change the file, you need to copy/parse sample.list into another file first.
When one of your 50 entries in sample.list is a substring of "accepted_hits.bam", you need to do some extra work.
Edit: if [ -n "${matched_part}" ] was missing the $.
Try using egrep with alternation
build a text file with single line of patterns: (pat1|pat2|pat3)
call find to list all of the regular files
use egrep to select the ones based on the patterns in the pattern file
awk 'BEGIN { printf("(") } FNR==1 {printf("%s", $0)} FNR>1 {printf("|%s", $0)} END{printf(")\n") } ' sample.list > t.sed
find /path/to/sampleDirectories -type f | egrep -f t.sed > filelist
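That stops at a list of matching paths; one possible way to finish the copy-and-rename step from that list (my sketch, reusing the t.sed pattern file, with the new_dir / _accepted_hits.bam naming taken from the question):
grep '/accepted_hits\.bam$' filelist | while read -r f; do
    name=$(echo "$f" | grep -oE -f t.sed | head -n 1)   # pull the sample id back out of the path
    cp "$f" "new_dir/${name}_accepted_hits.bam"
done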

Merging file names into one text file

I have 8 files that need to be merged into one text file, with each of the file names being on a separate line.
The output should be as follows:
file.txt:
output1/transcripts.gtf
output2/transcripts.gtf
output3/transcripts.gtf
and so on...
I have read several other suggestions and I know it should be an easy fix. I have tried dir and awk but have only gotten results that have all files on one line. I am using unix.
How about this?
ls -1 output*/*.gtf > file.txt
or, if your subdirectories are nested more deeply and you want all files whose names end in ".gtf":
find . -type f -name "*.gtf" -print | cut -b 3- > file.txt
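If your find is GNU find, -printf can strip the leading ./ directly, so the cut is not needed (a variant of the above, not from the original answer):
find . -type f -name "*.gtf" -printf '%P\n' > file.txt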

unix: how to concatenate files matched in grep

I want to concatenate the files whose name does not include "_BASE_". I thought it would be somewhere along the lines of ...
ls | grep -v _BASE_ | cat > all.txt
the cat part is what I am not getting right. Can anybody give me some idea about this?
Try this
ls | grep -v _BASE_ | xargs cat > all.txt
You can make ls skip some files with the --ignore option and then cat the rest into a file.
ls --ignore="*_BASE_*" | xargs cat > all.txt
Also you can do that without xargs:
cat $( ls --ignore="*_BASE_*" ) > all.txt
UPD:
Dale Hagglund noticed that a filename like "Some File" will appear as two filenames, "Some" and "File". To avoid that you can use the --quoting-style=WORD option, where WORD can be shell or escape.
For example, with --quoting-style=shell, Some File will print as 'Some File' and be interpreted as one filename.
Another problem is that the output file could be the same as one of the listed files; we need to ignore it too.
So the answer is:
outputFile=a.txt; ls --ignore="*sh*" --ignore="${outputFile}" --quoting-style=shell | xargs cat > ${outputFile}
If you also want files from subdirectories, find is your friend:
find . -type f ! -name '*_BASE_*' ! -path ./all.txt -exec cat {} + >> all.txt
It searches the current directory and its subdirectories, keeps only regular files (-type f), skips files matching the wildcard pattern *_BASE_*, skips all.txt itself, and runs cat on batches of files in the same manner as xargs would.

Shell script to process files

I need to write a shell script to process a huge folder nearly 20 levels deep. I have to process each and every file and check which files contain lines like
select
insert
update
By "line" I mean everything up to the next semicolon in that file.
I should get a result like this
C:/test.java select * from dual
C:/test.java select * from test
C:/test1.java select * from tester
C:/test1.java select * from dual
and so on. Right now I have a script to read all the files:
#!/bin/ksh
FILE=<FILEPATH to be traversed>
TEMPFILE=<Location of Temp file>
cd $FILE
for f in `find . ! -type d`;
do
cat $FILE/addedText.txt>>$TEMPFILE/newFile.txt
cat $f>>$TEMPFILE/newFile.txt
rm $f
cat $TEMPFILE/newFile.txt>>$f
rm $TEMPFILE/newFile.txt
done
I have very little knowledge of awk and sed to proceed further with reading each file and achieving what I want. Can anyone help me with this?
If you have GNU find/gawk:
find /path -type f -name "*.java" | while read -r FILE
do
awk -vfile="$FILE" 'BEGIN{RS=";"}
/select|update|insert/{
b=gensub(/(.*)(select|update|insert)(.*)/,"\\2\\3","g",$0)
gsub(/\n+/,"",b)
print file,b
}
' "$FILE"
done
If you are on Solaris, use nawk:
find /path -type f -name "test*file" | while read -r FILE
do
nawk -v file="$FILE" 'BEGIN{RS=";"}
/select/{ gsub(/.*select/,"select");gsub(/\n+/,"");print file,$0; }
/update/{ gsub(/.*update/,"update");gsub(/\n+/,"");print file,$0; }
/insert/{ gsub(/.*insert/,"insert");gsub(/\n+/,"");print file,$0; }
' "$FILE"
done
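The trick both versions rely on is RS=";": awk then treats everything up to each semicolon as one record, so a statement that spans several lines comes through as a single unit. A tiny demonstration on made-up input (my own example, not from the question):
printf 'int x;\nselect *\n from dual;\n' | awk 'BEGIN{RS=";"} /select/{gsub(/.*select/,"select"); gsub(/\n+/,""); print}'
This prints select * from dual on one line, which is the shape of output the question asks for.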
Note this is a simplistic case; your SQL statements might be more complicated.