Number of lines differs between a text file and its zipped version - Unix

I zipped a few files in Unix and later found that the zipped files have a different number of lines than the raw files.
$ wc -l /location/filename.txt
70308 /location/filename.txt
$ wc -l /location/filename.zip
2931 /location/filename.zip
How's this possible?

Zip files are binary files, while the wc command is designed for text files.
The compressed version of a text file may contain more or fewer newline bytes than the original, because compression does not work line by line. If the compressed file gave the same output as the original for every command, there would be no point in compressing it into a different format in the first place.
From the wc man page:
-l, --lines
print the newline counts
To get the matching output, you should try
$ unzip -c /location/filename.zip | wc -l # Decompress to stdout and count the lines
This would give about 3 extra lines (if there is no directory structure involved), because unzip -c also prints a short header for each archive member. If you compressed a directory containing the text file rather than just the file, you may see a few more lines of file/directory information.
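If you want a count without those header lines, unzip -p writes only the file data to standard output (a minimal sketch, assuming the archive contains just that one text file):
$ unzip -p /location/filename.zip | wc -l # should print 70308, matching the raw file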

In a compression algorithm, each word or character is replaced by some binary sequence.
Let's suppose \n is encoded as 0011100,
while some other character 'x' happens to be encoded as 00001010, the byte value of \n.
wc then simply counts occurrences of the byte 00001010 in the zip file, so the count it reports can be anything.
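A small demo of the effect (the file names here are made up for illustration):
$ printf 'one\ntwo\nthree\n' > sample.txt
$ zip -q sample.zip sample.txt
$ wc -l sample.txt sample.zip # sample.txt reports 3; sample.zip reports however many newline bytes the compressed stream happens to contain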


UNIX split command splitting this file, but what names are resulting?

We receive a big csv file from a client (500k lines, est) that we split into smaller chunks using the split command.
You can see how we're using the command below, but my bash knowledge is a bit rusty; could someone refresh me on the ${processFile}_ bit, and how the files end up being named? I don't recall what the underscore does...
split -l 50000 $PROCESSING_CURRENT_DIR/$processFile ${processFile}_
This has nothing to do with bash; it's how the split(1) command processes its arguments to split the input.
Syntax is:
split [OPTION]... [FILE [PREFIX]]
DESCRIPTION
Output pieces of FILE to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'.
With no FILE, or when FILE is -, read standard input.
So split uses the given prefix when naming its output files: with the prefix ${processFile}_, the chunks are named ${processFile}_aa, ${processFile}_ab, and so on. The braces just delimit the variable name so the trailing underscore is not read as part of it.
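A quick demo of the naming (data.csv is a made-up name; assume it holds about 150,000 lines):
$ split -l 50000 data.csv data.csv_   # one output file per 50,000 lines
$ ls data.csv_*
data.csv_aa  data.csv_ab  data.csv_ac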

How to move files based on a list (which contains the filename and destination path) in terminal?

I have a folder that contains a lot of files, in this case images.
I need to organise these images into a directory structure.
I have a spreadsheet that contains the filenames and the corresponding path where the file should be copied to. I've saved this file as a text document named files.txt
+--------------+-----------------------+
| filename     | destination           |
+--------------+-----------------------+
| image01.jpg  | path/to/destination   |
| image02.jpg  | path/to/destination   |
+--------------+-----------------------+
I'm trying to use rsync with the --files-from flag but can't get it to work.
According to man rsync:
--include-from=FILE
This option is related to the --include option, but it specifies a FILE that contains include patterns (one per line). Blank lines in the file and lines starting with ';' or '#' are ignored. If FILE is -, the list will be read from standard input
Here's the command I'm using: rsync -a --files-from=/path/to/files.txt path/to/destinationFolder
And here's the rsync error: syntax or usage error (code 1) at /BuildRoot/Library/Caches/com.apple.xbs/Sources/rsync/rsync-52.200.1/rsync/options.c(1436) [client=2.6.9]
It's still pretty unclear to me how the files.txt document should be formatted/structured and why my command is failing.
Any help is appreciated.
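Note that --files-from expects a plain list of paths (one per line) plus both a source and a destination argument, so a two-column list can't be fed to it directly, which is likely why the command fails with a usage error. A minimal sketch of one way to drive the copies from the list above, assuming files.txt holds two whitespace-separated columns and the names contain no spaces:
while read -r file dest; do
  mkdir -p "$dest"            # create the destination directory if needed
  rsync -a "$file" "$dest/"   # copy one image into its destination
done < files.txt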

rsync exclude list from database?

I know I can exclude rsync files listed in a text file, but can I make rsync read a sqlite (or other) database as an exclude list?
Otherwise I guess I could dump the sqlite to a text file, but I would like to eliminate the extra step, since I have many files in many directories.
The man page says:
--exclude-from=FILE
This option is related to the --exclude option, but it specifies a FILE that contains exclude patterns (one per line). Blank lines in the file and lines starting with ";" or "#" are ignored. If FILE is -, the list will be read from standard input.
So just pipe the file names into rsync:
sqlite3 my.db "SELECT filename FROM t" | rsync --exclude-from=- ...
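The trick is giving - as the FILE, which, per the man page quoted above, makes rsync read the exclude list from standard input, so no intermediate dump file is needed.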

Unzip only a limited number of files in Linux

I have a zipped file containing 10,000 compressed files. Is there a Linux command/bash script to unzip only 1,000 of them? Note that all the compressed files have the same extension.
unzip -Z1 test.zip | head -1000 | sed 's| |\\ |g' | xargs unzip test.zip
-Z1 provides a raw list of the file names
the sed expression escapes spaces (works everywhere, including macOS)
You can use wildcards to select a subset of files. E.g.
Extract all contained files beginning with b (the pattern is quoted so the shell doesn't expand it):
unzip some.zip 'b*'
Extract all contained files whose names end with y:
unzip some.zip '*y.extension'
You can either select a wildcard pattern that is close enough, or examine the output of unzip -l some.zip closely to determine a pattern or set of patterns that will get you exactly the right number.
I did this:
unzip -l zipped_files.zip | head -1000 | cut -b 29-100 > list_of_1000_files_to_unzip.txt
I used cut to keep only the filenames; the first three columns are size, date, and time.
Now loop over the filenames :
for file in $(cat list_of_1000_files_to_unzip.txt); do unzip zipped_files.zip "$file"; done
Some advice:
Run unzip to list the archive's files, redirecting the output to a file
Truncate this file to keep only the top 1,000 rows
Pass the file back to unzip to extract only the specified files, as sketched below
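Putting those steps together (a sketch, assuming no file names contain spaces):
unzip -Z1 zipped_files.zip > all_files.txt    # step 1: raw list of archive members
head -n 1000 all_files.txt > first_1000.txt   # step 2: keep only the first 1,000 names
unzip zipped_files.zip $(cat first_1000.txt)  # step 3: extract just those members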

Splitting a file on the basis of line number

Could you please advise on the Unix command? I have a file which contains records in the format below:
333434
435435
435443
434543
343536
Now the total line count is 89,380, and I want to split this big file into smaller files, each of which has 1,000 lines.
Could you please advise a Unix command to achieve this?
Can the Unix split command be used here?
Use split
Syntax:
split [options] filename prefix
Replace filename with the name of the large file you wish to split. Replace prefix with the name you wish to give the small output files. You can exclude [options], or replace it with either of the following:
-l linenumber
-b bytes
If you use the -l (a lowercase L) option, replace linenumber with the number of lines you'd like in each of the smaller files (the default is 1,000). If you use the -b option, replace bytes with the number of bytes you'd like in each of the smaller files.
The split command will give each output file it creates the name prefix with an extension tacked to the end that indicates its order. By default, the split command adds aa to the first output file, proceeding through the alphabet to zz for subsequent files. If you do not specify a prefix, most systems use x.
Example 1:
Assume myfile is a 3,000-line file:
split myfile
This will output three 1000-line files: xaa, xab, and xac.
Example 2 (again assuming a 3,000-line myfile):
split -l 500 myfile segment
This will output six 500-line files: segmentaa, segmentab, segmentac, segmentad, segmentae, and segmentaf.
Example 3:
Assume myfile is a 160KB file:
split -b 40k myfile segment
This will output four 40KB files: segmentaa, segmentab, segmentac, and segmentad.
You can use the --lines switch or its short form -l
split --lines=1000 input_file_name output_file_prefix
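Applied to the 89,380-line file from the question (the name records.txt is made up for illustration), this yields 90 files: 89 full 1,000-line files plus one 380-line remainder:
split --lines=1000 records.txt records_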
I think you can also use the sed command.
You can use sed -n '1,1000p' yourfile > outputfile to get lines 1 through 1000.
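If you wanted to cut the whole file into 1,000-line pieces with sed alone, you would need a loop (a sketch; split is the simpler tool):
lines=$(wc -l < yourfile)               # total number of lines
for start in $(seq 1 1000 $lines); do
  end=$((start + 999))
  sed -n "${start},${end}p" yourfile > "chunk_${start}"  # one 1,000-line slice per pass
done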
