How to create a file containing a set of entries from a larger file in UNIX?

I'm working with a large fasta file, and want to create a file containing only the entries for a specific genus. grep alone is not working for this as it only includes the line containing the genus name, and I also want to include the protein code.
I tried
grep -i genus file.fasta > genus.txt
This did work but created a file with only the first line of each data entry, thus not including the protein sequence. I also tried
grep -iC6 genus file.fasta > genus.txt
This did not work as not all the sequences are the same length. Appreciate any help. I'm using Cygwin if it makes any difference.
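One way that handles variable-length records: every FASTA entry starts with a '>' header line, so awk can set a print flag on each header and keep printing until the next one. A minimal sketch, assuming the genus name appears in the header line (genus is a placeholder for the actual name; tolower makes the match case-insensitive like grep -i):
awk '/^>/ { p = (tolower($0) ~ /genus/) } p' file.fasta > genus.txt
This prints each matching header plus all of its sequence lines, however many there are, which avoids the fixed-context problem of grep -C.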

Related

Order one file based on the order of another file. Both files contain a Header row

I have two files that I want to compare later on. Both files contain the headings in the first row. Because they are generated by different tools, the column order does not match between the two files. I want to reorder the columns of one file to match the order of the columns of the other. I want to do this in unix. Both files use a pipe as a delimiter.
Example:
File 1:
Userid|Account|Name|Dept
jonsa1|6532158|Sally|A42
File 2:
Name|Dept|Account|Userid
Sally|B64|6532158|jonsa1
I want to reorder the columns (and data) of File 2 to match File 1:
File 2:
Userid|Account|Name|Dept
jonsa1|6532158|Sally|B64
Using csvcut from the csvkit package to reorder the columns, and sed to extract the header from the first file and turn the pipes into commas so the names are in the right format for the -c (columns) option:
$ csvcut -d '|' -c "$(sed '1{s/|/,/g;q}' file1)" file2 | csvformat -D '|'
Userid|Account|Name|Dept
jonsa1|6532158|Sally|B64
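If csvkit isn't available, plain awk can do the same reordering. A rough sketch, assuming the pipe-delimited fields contain no embedded or quoted pipes (file1 and file2 as in the example):
awk -F'|' -v OFS='|' '
    NR == FNR { if (FNR == 1) { n = NF; for (i = 1; i <= NF; i++) want[i] = $i }; next }
    FNR == 1 { for (i = 1; i <= NF; i++) pos[$i] = i }
    { for (i = 1; i <= n; i++) printf "%s%s", $(pos[want[i]]), (i < n ? OFS : ORS) }
' file1 file2
The first pass records file1's header order; the second pass maps file2's column names to their positions and prints every row (header included) in file1's order.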

Number of lines differ in text and zipped file

I zipped a few files in unix and later found the zipped files have a different number of lines than the raw files.
$ wc -l /location/filename.txt /location/filename.zip
70308 /location/filename.txt
2931 /location/filename.zip
How's this possible?
zip files are binary files, and the wc command is aimed at text files.
The compressed version of a text file may contain more or fewer newline bytes than the original, because zipping is not done line by line. And if the compressed file gave the same output as the original for every command, there would be no point in compressing and keeping the file in a different format.
From wc man page:
-l, --lines
print the newline counts
To get the matching output, you should try
$ unzip -c /location/filename.zip | wc -l # Decompress on stdout and count the lines
This would give about 3 extra lines, because unzip -c also prints an Archive: header and the entry name (unzip -p suppresses these). If you compressed a directory containing the text file instead of just the file, you may see a few more lines containing the file/directory information.
In a compression algorithm, each word or character is replaced by some binary sequence.
Let's suppose \n is replaced by 0011100, and some other character 'x' is replaced by 0001010 (the bit pattern of \n). wc then counts occurrences of 0001010 in the zip file, and that count can be anything.
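A quick demonstration you can run yourself (demo.txt and demo.zip are throwaway names; the count wc reports for the .zip varies with the compressed bytes):
$ printf 'a\nb\nc\n' > demo.txt
$ zip -q demo.zip demo.txt
$ wc -l demo.txt                # counts real newlines: prints 3 demo.txt
$ wc -l demo.zip                # counts stray 0x0A bytes in the compressed data: arbitrary
$ unzip -p demo.zip | wc -l     # decompress to stdout first, then count: prints 3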

splitting the file basis of Line Number

Can you please advise the unix command? I have a file which contains records in the below format:
333434
435435
435443
434543
343536
Now the total line count is 89380, and I want to create a separate file for every block of lines.
I am trying to split my large file into small pieces using the line numbers. For example, my file has 89380 lines and I would like to divide this into small files, each of which has 1000 lines.
Could you please advise a unix command to achieve this?
Can the unix split command be used here?
Use split
Syntax: split [options] filename prefix
Replace filename with the name of the large file you wish to split. Replace prefix with the name you wish to give the small output files. You can exclude [options], or replace it with either of the following:
-l linenumber
-b bytes
If you use the -l (a lowercase L) option, replace linenumber with the number of lines you'd like in each of the smaller files (the default is 1,000). If you use the -b option, replace bytes with the number of bytes you'd like in each of the smaller files.
The split command will give each output file it creates the name prefix with an extension tacked to the end that indicates its order. By default, the split command adds aa to the first output file, proceeding through the alphabet to zz for subsequent files. If you do not specify a prefix, most systems use x.
Example1:
split myfile
Assuming myfile is 3,000 lines long, this will output three 1000-line files: xaa, xab, and xac.
Example2:
split -l 500 myfile segment
Assuming the same 3,000-line file, this will output six 500-line files: segmentaa, segmentab, segmentac, segmentad, segmentae, and segmentaf.
Example3:
Assume myfile is a 160KB file:
split -b 40k myfile segment
This will output four 40KB files: segmentaa, segmentab, segmentac, and segmentad.
You can use the --lines switch or its short form -l
split --lines=1000 input_file_name output_file_prefix
I think you can use the sed command.
You can use sed -n '1,1000p' yourfile > outputfile to get lines 1 through 1000.
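sed by itself only extracts one range, so to emulate split for the whole file you would loop over the ranges. A rough sketch, using the placeholder names yourfile and outputfile from above:
lines=$(wc -l < yourfile)
start=1
part=1
while [ "$start" -le "$lines" ]; do
    end=$((start + 999))                          # 1000 lines per chunk
    sed -n "${start},${end}p" yourfile > "outputfile.$part"
    start=$((end + 1))
    part=$((part + 1))
done
Note that this rescans the file once per chunk, so for an 89380-line file split -l 1000 remains the better tool.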

Split files linux and then grep

I'd like to split a file and grep each piece without writing them to individual files.
I've attempted a couple variations of split and grep and no such luck; any suggestions?
Something along the lines of:
split -b SIZE filename | grep "string"
I've attempted grep/fgrep to find the string but my shell complains that the files are too large. See: use fgrep instead
There is no point in splitting the file if you plan to [linearly] search each of the pieces anyway (assuming that's the only thing you are doing with it). Consider running grep on the entire file.
If however you plan to utilize the fact that the file is split later on, then the typical way would be:
Create a temporary directory and step into it
Run split/csplit on the original file
Use a for loop over the written fragments to do your processing, as sketched below.
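A minimal sketch of that workflow (the chunk size, search string, and path are placeholders):
tmpdir=$(mktemp -d)
(
    cd "$tmpdir" || exit 1
    # -l keeps lines whole; -b could cut a matching line in half at a piece boundary
    split -l 1000000 /path/to/bigfile piece_
    for f in piece_*; do
        grep "string" "$f"
    done
)
rm -rf "$tmpdir"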

Compare two folders which have many files inside contents

Have two folders with approx. 150 java property files.
In a shell script, how to compare both folders to see if there is any new property file in either of them and what are the differences between the property files.
The output should be in a report format.
To get summary of new/missing files, and which files differ:
diff -arq folder1 folder2
-a treats all files as text, -r recursively searches subdirectories, -q reports 'briefly', only when files differ.
diff -r will do this, telling you both if any files have been added or deleted, and what's changed in the files that have been modified.
I used
diff -rqyl folder1 folder2 --exclude=node_modules
in my nodejs apps.
Could you use dircmp?
The diff command in Unix is used to find the differences between files (of all types). Since a directory is also a type of file, the differences between two directories can easily be figured out by using diff. For more options, use man diff on your unix box.
-b    Ignores trailing blanks (spaces and tabs) and treats other strings of blanks as equivalent.
-i    Ignores the case of letters. For example, `A' will compare equal to `a'.
-t    Expands <TAB> characters in output lines. Normal or -c output adds character(s) to the front of each line that may adversely affect the indentation of the original source lines and make the output lines difficult to interpret. This option will preserve the original source's indentation.
-w    Ignores all blanks (<SPACE> and <TAB> characters) and treats all other strings of blanks as equivalent. For example, `if ( a == b )' will compare equal to `if(a==b)'.
and there are many more.
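To get the requested report format, the diff output can be wrapped in a small script. A hedged sketch (compare_dirs.sh and the section titles are made up for illustration):
#!/bin/sh
# compare_dirs.sh dir1 dir2 -- report new files and differing files
dir1=$1
dir2=$2
echo "=== Files present in only one folder ==="
diff -rq "$dir1" "$dir2" | grep '^Only in'
echo
echo "=== Property files that differ ==="
diff -rq "$dir1" "$dir2" | grep '^Files'
echo
echo "=== Detailed differences ==="
diff -r "$dir1" "$dir2"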
