Combining two big files from specific lines in Unix


I have two large files that I want to combine in one file and gzip it as well. However for the second file I want to exclude the first two lines. How can I do it? What I have done so far is:
awk 'FNR>2' /application/psmcHard_0.msOut.gz /JPT/psmcHard_0.msOut.gz > /all_data/psmcHard_0.msOut
Do you think this is the right way to do it? And how can I gzip the file?

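Your input files are gzip-compressed ('.gz'), so awk cannot process them directly. You have to decompress the second file, drop its first two lines, and recompress the result. The first file can be passed through unchanged, because gzip streams written one after another still form a valid gzip file: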
(
# Pass first file as-is (no need to unzip/rezip)
cat /application/psmcHard_0.msOut.gz
# Unzip second file, filter required lines, and re-zip
zcat /JPT/psmcHard_0.msOut.gz | awk 'FNR > 2' | gzip
) > /all_data/psmcHard_0.msOut.gz
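
The approach above relies on gzip streams decompressing as one stream when they are written back to back. A minimal sketch that wraps it in a function, assuming gzip and awk are installed; the name combine_gz is made up for illustration, and gzip -dc stands in for zcat because zcat on some systems only accepts .Z files:

```shell
# combine_gz FIRST.gz SECOND.gz OUT.gz
# Hypothetical helper: copies FIRST.gz unchanged, then appends SECOND.gz
# without its first two lines, recompressed as a second gzip member.
combine_gz() {
    {
        cat "$1"
        gzip -dc "$2" | awk 'NR > 2' | gzip
    } > "$3"
}
```

It could be called as combine_gz /application/psmcHard_0.msOut.gz /JPT/psmcHard_0.msOut.gz /all_data/psmcHard_0.msOut.gz, and gzip -dc on the result should then print the lines of both inputs in order.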

Related

Is there a way to make Unix diff -r compare only differences in filenames, but not check if any single file actually differs?

I need to compare two large directories with a lot of files in them. I tried using:
diff -r Directory1 Directory2
but the process is really slow due to the amount of files and their huge size.
So I thought about making the process faster by just comparing the content of the folders and not the actual content of the files.
Is there a way to make diff recursively check only that every subdirectory of Directory1 and Directory2 matches in name and in the names of the files it contains, without checking whether every single file in Directory1 actually matches its counterpart in Directory2?
For example, let's say I have Directory1/SubDirectory1 and Directory2/Subdirectory1.
I want to check only if Directory1/SubDirectory1.1 and Directory2/Subdirectory2.1 have the same number of files with the same filenames (let's say, file1, file2, ... fileN), but I don't care about matching every file1, file2 ... fileN of Directory1/SubDirectory1.1 to every file1, file2 ... fileN of SubDirectory2.1 to see if their content is actually the same.
Is there a way of doing this?
Edit:
I tried using:
diff <(find path1) <(find path2)
but unfortunately, diff outputs the full path for each file. The output I get is thus:
< /Volume1/.../.../Directory1/SubDirectory1.1/file1
< /Volume1/.../.../Directory1/SubDirectory1.1/file2
...
> /Volume2/.../.../Directory2/SubDirectory2.1/file1
> /Volume2/.../.../Directory2/SubDirectory2.1/file2
...
Here every single filename clearly differs, because the full paths differ.
Is there a way to force find to output paths only starting from the directory you give as argument? For example:
find -(some option I'm not aware of) /Volume1/.../.../Directory1
outputs:
/Directory1/SubDirectory1.1/file1
/Directory1/SubDirectory1.1/file2
...

A simple way:
cd /.../Directory1
find . | sort >/tmp/dir1.lst
cd /.../Directory2
find . | sort >/tmp/dir2.lst
diff /tmp/dir1.lst /tmp/dir2.lst
It will fail if your filenames contain newlines, but in many cases that isn't a concern.
If scripting this, make sure to use auto-generated temp file names, e.g. with mktemp(1), to avoid symlink attacks and other problems.
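
A minimal sketch of the scripted variant with mktemp, assuming a POSIX shell with find, sort, diff and mktemp; the function name list_diff is hypothetical. Only path names are compared, file contents are never read:

```shell
# list_diff DIR1 DIR2
# Hypothetical helper: diffs the sorted relative path listings of two trees.
# Returns diff's exit status (0 = same names, 1 = names differ).
list_diff() {
    tmp1=$(mktemp) || return 2
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 2; }
    # Subshells keep the cd from affecting the caller's working directory
    (cd "$1" && find . | sort) > "$tmp1"
    (cd "$2" && find . | sort) > "$tmp2"
    status=0
    diff "$tmp1" "$tmp2" || status=$?
    rm -f "$tmp1" "$tmp2"
    return "$status"
}
```

Like the commands above, this still assumes that no filename contains a newline.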

Nate Eldredge, thank you for your answer!
However, I was able to solve my problem creating a script named fast_diff.sh, with just a line of code, as follows:
diff <(find "$1" | sed "s|$1\/||g" | sort) <(find "$2" | sed "s|$2\/||g" | sort)
The script takes two arguments, let's say path1 and path2:
./fast_diff.sh /Volume1/.../.../Directory1 /Volume2/.../.../Directory2
Now the variable $1 is equal to "/Volume1/.../.../Directory1" and the variable $2 is equal to "/Volume2/.../.../Directory2".
The command find gives as output something like:
/Volume1/.../.../Directory1/SubDirectory1.1/file1
/Volume1/.../.../Directory1/SubDirectory1.1/file2
...
Now I pipe this output to sed, using:
sed "s|$1||g"
which replaces every occurrence of "/Volume1/.../.../Directory1" with nothing. I used | as a separator instead of / because there are many occurrences of / in the directory path.
Employing the previous line of code, though, lists all subdirectories and files starting with a slash:
/SubDirectory1.1/file1
/SubDirectory1.1/file2
...
To remove the slash, I added \/:
sed "s|$1\/||g"
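
One caveat with the sed approach is that $1 is interpreted as a regular expression, so characters such as . or [ in the directory path are not matched literally, and the top-level directory itself is still printed with its full path. A sketch of an alternative that strips the prefix with shell parameter expansion instead; the function name rel_list is hypothetical, and the directory argument is assumed to be given without a trailing slash:

```shell
# rel_list DIR
# Hypothetical helper: prints every path below DIR relative to DIR, sorted.
# The quoted "$1" inside the expansion is matched literally, not as a pattern.
rel_list() {
    find "$1" | while IFS= read -r path; do
        [ "$path" = "$1" ] && continue    # skip the top-level directory itself
        printf '%s\n' "${path#"$1"/}"
    done | sort
}
```

In bash, the one-line script could then become diff <(rel_list "$1") <(rel_list "$2").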

Move files with certain name to folder with certain name - Unix

I have been trying this for a while now; can anyone help me, please?
I want to move files with certain names, e.g.
tree.txt
apple.txt
....
To their corresponding folder
tree
apple
I tried this but it takes too much time to do it individually:
mv *tree* destination_directory/tree
because then I need to repeat this 200 times
mv *apple* destination_directory/apple
.....
Is there any way to make this faster?
I have a list.txt with all the file names.
Thank you so much,
Bine

Assuming you have the list of txt files in a file called "filewithtxts", you can read the file in a while loop and process each entry:
while IFS= read -r file
do
    dir="${file%.txt}" # Use ${file%.txt} to strip .txt from the entry, e.g. tree.txt -> tree
    mv -- *"$dir"* "destination_directory/$dir"
done < filewithtxts

UNIX how to use the base of an input file as part of an output file

I use UNIX fairly infrequently so I apologize if this seems like an easy question. I am trying to loop through subdirectories and files, then generate an output from the specific files that the loop grabs, then pipe an output to a file in another directory whose name will be identifiable from the input file. So far I have:
for file in /home/sub_directory1/samples/SSTC*/
do
samtools depth -r chr9:218026635-21994999 < $file > /home/sub_directory_2/level_2/${file}_out
done
I was hoping to generate an output from file_1_novoalign.bam in sub_directory1/samples/SSTC*/ and to send that output to /home/sub_directory_2/level_2/ as an output file called file_1_novoalign_out.bam however it doesn't work - it says 'bash: /home/sub_directory_2/level_2/file_1_novoalign.bam.out: No such file or directory'.
I would ideally like to be able to strip off the '_novoalign.bam' part of the outfile and replace with '_out.txt'. I'm sure this will be easy for a regular unix user but I have searched and can't find a quick answer and don't really have time to spend ages searching. Thanks in advance for any suggestions building on the code I have so far or any alternate suggestions are welcome.
p.s. I don't have permission to write files to the directory containing the input folders

Below is an explanation for filenames without spaces, keeping it simple.
When you want files, not directories, you should end the glob in your for-loop with * and not */.
When you only want to process files ending with _novoalign.bam, you should say so in the glob.
The easiest way to replace part of the string is with sed.
In the sed pattern, a dollar sign anchors the match to the end of the string. The complete script will be
OUTDIR=/home/sub_directory_2/level_2
for file in /home/sub_directory1/samples/SSTC/*_novoalign.bam; do
    echo "Debug: Inputfile including path: ${file}"
    # Replace the trailing _novoalign.bam with _out.txt (the dot is escaped so it matches literally)
    OUTPUTFILE=$(basename "${file}" | sed -e 's/_novoalign\.bam$/_out.txt/')
    echo "Debug: Outputfile without path: ${OUTPUTFILE}"
    samtools depth -r chr9:218026635-21994999 < "${file}" > "${OUTDIR}/${OUTPUTFILE}"
done
Note 1:
You can use parameter expansion like file=${fullfile##*/} to get the filename without path, but you will forget the syntax in one hour.
Easier to remember are basename and dirname, but you still have to do some processing.
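
For reference, a sketch of the parameter-expansion route mentioned in Note 1, which avoids starting basename and sed for every file. The function name out_name is hypothetical; the _novoalign.bam suffix comes from the question:

```shell
# out_name PATH
# Hypothetical helper: maps /some/dir/file_1_novoalign.bam to file_1_out.txt
# using only shell parameter expansion.
out_name() {
    name=${1##*/}                                      # drop everything up to the last "/"
    printf '%s\n' "${name%_novoalign.bam}_out.txt"     # swap the suffix
}
```

Inside the loop this could be used as OUTPUTFILE=$(out_name "$file").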
Note 2:
When your script first changes the directory to /home/sub_directory_2/level_2 you can skip the basename call.
When all the files in the dir are to be processed, you can use the asterisk.
When all files have at most one underscore, you can use cut.
You might want to add some error handling. When you want the STDERR from samtools in your outputfile, add 2>&1.
These will turn your script into
OUTDIR=/home/sub_directory_2/level_2
cd /home/sub_directory1/samples/SSTC || exit 1
for file in *; do
    echo "Debug: Inputfile: ${file}"
    # No basename needed: after the cd, ${file} has no directory part
    OUTPUTFILE="$(echo "${file}" | cut -d_ -f1)_out.txt"
    echo "Debug: Outputfile: ${OUTPUTFILE}"
    samtools depth -r chr9:218026635-21994999 < "${file}" > "${OUTDIR}/${OUTPUTFILE}" 2>&1
done

Converting Filename to Filename_Inode

I'm writing my first script that takes a file and moves it to another folder, except that I want to change the name of the file to filename_inode instead of just filename, in case there are any files with the same name.
I've figured out how to show this by creating the following 4 variables
inode=$(ls -i $1 | cut -c1-7) #lists the file the user types, cuts the inode from it
space="_" #used to put inbetween the filename and bname
bname=$(basename $1) #gets the basename of the file without the directory etc
bnamespaceinode=$bname$space$inode #combines the 3 values into one variable
echo "$bnamespaceinode" #prints filename_inode to the window
So the bottom echo shows filename_inode, which is what I want, except that now when I try to move the file using mv or cp I'm getting errors.
I don't think there's anything wrong with the syntax I'm using for the mv and cp commands, so I'm thinking I need to concatenate the 3 variables into a new file, or use the result of the first and then append the other 2 to that file?
I've tried both of the above but still not having any luck, any ideas?
Thanks

Without clearer examples, I guess this could work:
TARGETDIR=/my/target/directory
inode=$(ls -di -- "$1" | awk '{ print $1 }') # first field of ls -i output is the inode number
mv -- "$1" "$TARGETDIR/$(basename -- "$1")_$inode"

Unzip only limited number of files in linux

I have a zipped file containing 10,000 compressed files. Is there a Linux command/bash script to unzip only 1,000 files ? Note that all compressed files have same extension.

unzip -Z1 test.zip | head -1000 | sed 's| |\\ |g' | xargs unzip test.zip
-Z1 provides a raw list of files
sed expression encodes spaces (works everywhere, including MacOS)

You can use wildcards to select a subset of files. E.g.
Extract all contained files beginning with b:
unzip some.zip 'b*'
Extract all contained files whose name ends with y (the quotes stop the shell from expanding the pattern before unzip sees it):
unzip some.zip '*y.extension'
You can either select a wildcard pattern that is close enough, or examine the output of unzip -l some.zip closely to determine a pattern or set of patterns that will get you exactly the right number.

I did this:
unzip -l zipped_files.zip |head -1000 |cut -b 29-100 >list_of_1000_files_to_unzip.txt
I used cut to get only the filenames, first 3 columns are size etc.
Now loop over the filenames :
for files in `cat list_of_1000_files_to_unzip.txt `; do unzip zipped_files.zip $files;done

Some advice:
Run unzip so that it only lists the files, and redirect the output to a file
Truncate this file to keep only the top 1000 rows
Pass the file to unzip to extract only the specified files
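
The three steps above can be sketched as a single pipeline, assuming Info-ZIP unzip (for the -Z1 listing mode) and an xargs that supports -0. The function name unzip_first is hypothetical. Note that unzip treats the names it is given as wildcard patterns, so entries containing characters such as * or [ may need extra care:

```shell
# unzip_first N ARCHIVE DESTDIR
# Hypothetical helper: extracts only the first N entries of ARCHIVE into DESTDIR.
unzip_first() {
    unzip -Z1 "$2" |            # one entry name per line
        head -n "$1" |          # keep the first N names
        tr '\n' '\0' |          # NUL-separate so names with spaces survive xargs
        xargs -0 unzip -q "$2" -d "$3"
}
```

For the case in the question this would be called as unzip_first 1000 test.zip some_output_dir, where some_output_dir is a placeholder.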
