Count empty lines or line breaks in unix - unix

I have a file which I am loading into Hive. It has 1 header record and 2 trailer records, and I am skipping these 3 rows.
wc -l on the file gives 780112 records, but only 780108 records are getting loaded into the Hive table.
Downloading this file into Excel shows 780113 records including the header and trailer.
I am assuming there is some empty line or line break in the file that accounts for the 2 missing records, and I also don't understand why wc -l gives a different count.
How do I find it?
I have tried searching for empty lines with :g in the vi editor, but it did not find a match.

Workaround: skip the trailer records in the landing table by applying a filter condition.
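To actually locate the suspect lines, a few commands can help (a sketch; filename.txt stands in for the real file):
$ grep -c '^$' filename.txt                  # count completely empty lines
$ grep -c '^[[:space:]]*$' filename.txt      # count lines containing only whitespace
$ grep -c $'\r$' filename.txt                # count lines ending in a carriage return (CRLF endings)
$ tail -c 1 filename.txt | od -c             # show the very last byte of the file
Also note that wc -l counts newline characters, so a final line that lacks a trailing newline is not counted; that alone can explain a one-line difference between wc -l and Excel.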

Related

How to handle a file having header in between the records after removing duplicates from the file

We have a file which has been processed by a Unix command to remove duplicates. After the de-duplication, the new file has the header in between the records. Please help to solve this; thanks in advance for any input.
Unix command: sort -u >
I would do something like this:
grep "headers" >output.txt
grep -v "headers" >>output.txt
The idea is the following: first take the header lines and put them into output.txt, and afterwards take everything that is not a header and append it to that same output file.
The first command creates the output file and writes the headers into it (hence the single > redirection); the second appends the remaining lines to the already existing output file (hence the double >> redirection).
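As a concrete sketch that also folds in the original de-duplication (input.txt and output.txt are placeholder names, and "headers" stands for whatever string actually identifies the header line):
$ grep "headers" input.txt > output.txt
$ grep -v "headers" input.txt | sort -u >> output.txt
This keeps the header as the first line and applies sort -u only to the data rows, so the header can no longer end up in between the records.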

How to display the first line for a set of files (say, 10) in unix?

I don't know how to do that, guys. I only know how to get the first line of an individual file.
First I listed only the files that have ssa as part of their name. I used the command
ls | grep ssa
This command gives me 10 files; now I want to display only the first line of each of those 10 files. I don't know how to do that. Can anyone help me with that?
The head command can accept multiple input files. So if you suppress the per-file header output and limit the number of lines to 1, that should be what you are looking for:
head -qn 1 *
If you want to combine this with other commands, you have to make sure that all the file names are actually handed over as arguments to a single call of head:
ls | xargs head -qn 1
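For the specific case in the question (files whose names contain ssa), either of the following should work, assuming the file names contain no spaces:
$ head -qn 1 *ssa*
$ ls | grep ssa | xargs head -qn 1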

Loading multiple CSV files with SQLite

I'm using SQLite, and I need to load hundreds of CSV files into one table. I didn't manage to find such a thing on the web. Is it possible?
Please note that in the beginning I used Oracle, but since Oracle has a limit of 1000 columns per table and my CSV files have more than 1500 columns each, I had to find another solution. I want to try SQLite, since I can install it quickly and easily.
These CSV files were supplied with that number of columns and I can't change or split them (never mind why).
Please advise.
I ran into a similar problem and the comments to your question actually gave me the answer that finally worked for me.
Step 1: merge the multiple CSVs into a single file. Exclude the header from most of them, but keep the header from one of them at the beginning.
Step 2: load the single merged CSV into SQLite.
For step 1 I used:
$ head -1 one.csv > all_combined.csv
$ tail -n +2 -q *.csv >> all_combined.csv
The first command writes only the first line of one CSV file (you can pick any one of them); the second command writes every file starting from line 2, thereby excluding the headers. The -q option makes sure that tail never writes the file name as a header.
Make sure to put all_combined.csv in a separate folder; otherwise the *.csv glob also matches it, and tail may end up appending the file to itself.
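One simple way to avoid that (a sketch; the paths are placeholders) is to write the combined file outside the directory being globbed:
$ head -1 one.csv > ../all_combined.csv
$ tail -n +2 -q *.csv >> ../all_combined.csv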
To load into SQLite (Step 2) the answer given by Hot Licks worked for me:
sqlite> .mode csv
sqlite> .import all_combined.csv my_new_table
This assumes that my_new_table hasn't been created yet. Alternatively, you can create it beforehand and then load into it, but in that case exclude the header in Step 1.
http://www.sqlite.org/cli.html --
Use the ".import" command to import CSV (comma separated value) data into an SQLite table. The ".import" command takes two arguments which are the name of the disk file from which CSV data is to be read and the name of the SQLite table into which the CSV data is to be inserted.
Note that it is important to set the "mode" to "csv" before running the ".import" command. This is necessary to prevent the command-line shell from trying to interpret the input file text as some other format.
sqlite> .mode csv
sqlite> .import C:/work/somedata.csv tab1
There are two cases to consider: (1) Table "tab1" does not previously exist and (2) table "tab1" does already exist.
In the first case, when the table does not previously exist, the table is automatically created and the content of the first row of the input CSV file is used to determine the name of all the columns in the table. In other words, if the table does not previously exist, the first row of the CSV file is interpreted to be column names and the actual data starts on the second row of the CSV file.
For the second case, when the table already exists, every row of the CSV file, including the first row, is assumed to be actual content. If the CSV file contains an initial row of column labels, that row will be read as data and inserted into the table. To avoid this, make sure that table does not previously exist.
Note that you need to make sure that the files DO NOT have an initial line defining the field names. And, for "hundreds" of files you will probably want to prepare a script rather than typing in each file individually.
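A minimal sketch of such a script (assuming the database is database.db, the table my_new_table already exists with matching columns, and /tmp/noheader.csv is just a scratch file):
for f in *.csv; do
    tail -n +2 "$f" > /tmp/noheader.csv    # drop the header line of each file
    sqlite3 database.db ".mode csv" ".import /tmp/noheader.csv my_new_table"
done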
I didn't find a nicer way to solve this so I used find along with xargs to avoid creating a huge intermediate .csv file:
find . -type f -name '*.csv' | xargs -I% sqlite3 database.db ".mode csv" ".import % new_table" ".exit"
find prints out the file names, and the -I% parameter to xargs causes the command after it to be run once per line, with % replaced by the name of a CSV file.
You can use DB Browser for SQLite to do this pretty easily.
File > Import > Table from CSV file... and then select all the files to open them together into a single table.
I just tested this out with a dozen CSV files and got a single 1 GB table from them without any work. As long as they have the same schema, DB Browser is able to put them together. You'll want to keep the 'Column Names in first line' option checked.

Number of lines differ in text and zipped file

I zipped a few files in Unix and later found that the zipped files have a different number of lines than the raw files.
$ wc -l /location/filename.txt /location/filename.zip
70308 /location/filename.txt
2931 /location/filename.zip
How's this possible?
Zip files are binary files; the wc command is meant for text files.
The zip-compressed version of a text file may contain more or fewer newline bytes, because compression is not done line by line. (If the compressed file gave the same output as the original for every command, there would be no point in compressing it into a different format.)
From wc man page:
-l, --lines
print the newline counts
To get the matching output, you should try
$ unzip -c /location/filename.zip | wc -l # Decompress to stdout and count the lines
This would give about 3 extra lines (if there is no directory structure involved), because unzip -c also prints the archive and file names. If you compressed a directory containing the text file instead of just the file, you may see a few more lines containing file/directory information.
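If you only want the line count of the compressed text without those extra lines, unzip -p writes nothing but the file data to stdout:
$ unzip -p /location/filename.zip | wc -l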
In a compression algorithm, words/characters are replaced by binary sequences.
Let's suppose \n is encoded as 0011100,
and some other character 'x' happens to be encoded as 0001010, the bit pattern of \n.
wc then counts every byte sequence in the zip file that looks like a newline, so the count can be anything.
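You can verify that wc -l is really just counting 0x0a (newline) bytes in the compressed data (a sketch; the path is the asker's zip file):
$ od -An -tx1 /location/filename.zip | tr -s ' ' '\n' | grep -c '^0a$'
This prints the number of 0x0a bytes in the archive, which matches what wc -l reports for it.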

Selecting a range of records from a file in Unix

I have 4,930,728 records in a text file on Unix. This file is used to ingest images into Oracle WebCenter Content using the batch loader. <<EOD>> indicates the end of a record, as in the sample below.
I have two questions:
After processing 4,300,846 of 4,930,728 record(s), the batch loader fails for whatever reason. Now I want to create a new file with the records from 4,300,846 to 4,930,728. How do I achieve that?
I want to split this text file containing 4,930,728 records into multiple files, each containing a range of 1,000,000 records, e.g. file 1 contains records 1 to 1,000,000, the second file contains records 1,000,001 to 2,000,000, and so on. How do I achieve this?
filename: load_images.txt
Action = insert
DirectReleaseNewCheckinDoc=1
dUser=Biometric
dDocTitle=333_33336145454_RT.wsq
dDocType=Document
dDocAuthor=Biometric
dSecurityGroup=Biometric
dDocAccount=Biometric
xCUSTOMER_MSISDN=33333
xPIN_REF=64343439
doFileCopy=1
fParentGUID=2CBC11DF728D39AEF91734C58AE5E4A5
fApplication=framework
primaryFile=647229_234343145454_RT.wsq
primaryFile:path=/ecmmigration_new/3339_2347333145454_RT.wsq
xComments=Biometric Migration from table OWCWEWW_MIG_3007
<<EOD>>
Answer #1:
head -n 4930728 myfile.txt | tail -n $(echo "4930728 - 4300846" | bc)
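Assuming, as Answer #1 does, that each line is one record, a simpler equivalent is tail with a starting line number (remaining.txt is a placeholder output name):
$ tail -n +4300847 myfile.txt > remaining.txt
The +4300847 means "start output at line 4,300,847", i.e. the first unprocessed record.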
Answer #2 - to split the file into chunks of 1,000,000 lines:
split -l 1000000 myfile.txt ### It will create files like xaa, xab and so on
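Both answers work on line counts, but in this file a record spans many lines and ends with <<EOD>>, so line-based splitting only lines up with record boundaries if every record has exactly the same number of lines. A sketch that splits by record instead, using awk (the chunk_N.txt output names are placeholders):
awk -v per=1000000 '
    { buf = buf $0 ORS }                      # accumulate the lines of the current record
    /^<<EOD>>$/ {                             # record terminator reached
        n++
        file = sprintf("chunk_%d.txt", int((n - 1) / per) + 1)
        printf "%s", buf > file
        buf = ""
    }
' load_images.txt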

Resources