Selecting a range of records from a file in Unix

I have 4,930,728 records in a text file on Unix. The file is used to ingest images into Oracle WebCenter Content using BatchLoader. <<EOD>> indicates the end of a record, as in the sample below.
I have two questions:
1. After processing 4,300,846 of 4,930,728 record(s), BatchLoader fails for whatever reason. Now I want to create a new file containing records 4,300,846 to 4,930,728. How do I achieve that?
2. I want to split this text file of 4,930,728 records into multiple files of 1,000,000 records each, e.g. the first file contains records 1 to 1,000,000, the second file contains records 1,000,001 to 2,000,000, and so on. How do I achieve this?
filename: load_images.txt
Action = insert
DirectReleaseNewCheckinDoc=1
dUser=Biometric
dDocTitle=333_33336145454_RT.wsq
dDocType=Document
dDocAuthor=Biometric
dSecurityGroup=Biometric
dDocAccount=Biometric
xCUSTOMER_MSISDN=33333
xPIN_REF=64343439
doFileCopy=1
fParentGUID=2CBC11DF728D39AEF91734C58AE5E4A5
fApplication=framework
primaryFile=647229_234343145454_RT.wsq
primaryFile:path=/ecmmigration_new/3339_2347333145454_RT.wsq
xComments=Biometric Migration from table OWCWEWW_MIG_3007
<<EOD>>

Answer #1:
head -n 4930728 myfile.txt | tail -n $(echo "4930728 - 4300846" | bc)
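Note that this works on lines, while each record in load_images.txt spans several lines ending in <<EOD>>. If the cut has to happen on record boundaries instead, a sketch using GNU awk (which supports a multi-character record separator) could look like the following; it assumes each record is immediately followed by its <<EOD>> line, and remaining.txt is just a placeholder output name:
# Print records 4,300,846 through 4,930,728, re-appending the <<EOD>> terminator.
awk 'BEGIN { RS = "<<EOD>>\n" }
     NR >= 4300846 && NR <= 4930728 { printf "%s<<EOD>>\n", $0 }' load_images.txt > remaining.txt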
Answer #2 - to split the file into chunks of 1,000,000 lines:
split -l 1000000 myfile.txt   # it will create files named xaa, xab, and so on
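If the chunks also have to break on record boundaries rather than on arbitrary lines, a similar GNU awk sketch can write 1,000,000 <<EOD>>-terminated records per output file (chunk_NN.txt is just a placeholder naming scheme):
# Writes chunk_01.txt, chunk_02.txt, ... with 1,000,000 records each.
awk 'BEGIN { RS = "<<EOD>>\n" }
     { out = sprintf("chunk_%02d.txt", int((NR - 1) / 1000000) + 1)
       printf "%s<<EOD>>\n", $0 > out }' load_images.txt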

Related

Unix command for below file

I have a CSV file like below
05032020
Col1|col2|col3|col4|col5
Infosys
Tcs
Wipro
Accenture
Deloitte
I want the record count, skipping the date and header lines.
Expected output: a record count of 5, with line numbers included.
cat FF_Json_to_CSV_MAY03.txt
05032020
requestId|accountBranch|accountNumber|guaranteeGuarantor|accountPriority|accountRelationType|accountType|updatedDate|updatedBy
0000000001|5BW|52206|GG1|02|999|CHECKING|20200503|BTCHLCE
0000000001|55F|80992|GG2|02|1999|IRA|20200503|0QLC
0000000001|55F|24977|CG|01|3999|CERTIFICAT|20200503|SRIKANTH
0000000002|5HJ|03349|PG|01|777|SAVINGS|20200503|BTCHLCE
0000000002|5M8|999158|GG3|01|900|CORPORATE|20200503|BTCHLCE
0000000002|5LL|49345|PG|01|999|CORPORATE|20200503|BTCHLCE
0000000002|5HY|15786|PG|01|999|CORPORATE|20200503|BTCHLCE
0000000003|55F|34956|CG|01|999|CORPORATE|20200503|SRIKANTH
0000000003|5BY|14399|GG10|03|10|MONEY MARK|20200503|BTCHLCE
0000000003|5PE|32100|PG|04|999|JOINT|20200503|BTCHLCE
0000000003|5LB|07888|GG25|02|999|BROKERAGE|20200503|BTCHLCE
0000000004|55F|36334|CG|02|999|JOINT|20200503|BTCHLCE
0000000005|55F|06739|GG9|02|999|SAVINGS|20200503|BTCHLCE
0000000005|5CP|39676|PG|01|999|SAVINGS|20200503|BTCHLCE
0000000006|55V|62452|CG|01|10|CORPORATE|20200503|SRIKANTH
0000000007|55V|H9889|CG|01|999|SAVINGS|20200503|BTCHLCE
0000000007|5L2|03595|PG|02|999|CORPORATE|20200503|BTCHLCE
0000000007|55V|C1909|GG8|01|10|JOINT|20200503|BTCHLCE
I need the line numbers, starting from the first record (0000000001).
There are two ways to solve your issue:
Count only the records you want to count
Count all records and remove the ones you don't want to count
From your example, it's not possible to know how to do it, but let me give you some ideas:
Imagine that your file starts with 3 header lines, then you can do something like:
wc -l inputfile | awk '{print $1-3}'
Imagine that the lines you want to count all start with a number and a dot, then you can do something like:
grep "[0-9]*\." inputfile | wc -l

Count empty lines or line breaks in unix

I have a file which I am loading into Hive. It has 1 header record and 2 trailer records, and I am skipping these 3 rows.
wc -l on the file gives 780112 records, but only 780108 records are getting loaded into the Hive table.
Downloading this file into Excel shows 780113 records, including header and trailers.
I am assuming there are some empty lines or line breaks in the file that account for the missing records, and I wonder why wc -l gives a different count.
How do I find them?
I have tried searching for empty lines using :g in the vi editor, but it did not give a match.
Workaround: skip the trailer in the landing table by applying a filter condition.
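A few checks that may help locate the discrepancy; inputfile below is just a placeholder for the actual file name:
# Count completely empty lines.
grep -c '^$' inputfile
# Count lines containing only whitespace (a plain :g/^$/ search in vi misses these).
grep -c '^[[:space:]]*$' inputfile
# wc -l counts newline characters, so a final line without a trailing newline is
# not counted; this prints 0 if the file does not end with a newline.
tail -c 1 inputfile | wc -l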

Loading multiple CSV files with SQLite

I'm using SQLite, and I need to load hundreds of CSV files into one table. I couldn't find anything about this on the web. Is it possible?
Please note that I originally used Oracle, but since Oracle has a 1,000-column limit per table and my CSV files have more than 1,500 columns each, I had to find another solution. I want to try SQLite, since I can install it quickly and easily.
These CSV files were supplied with that many columns and I can't change or split them (never mind why).
Please advise.
I ran into a similar problem, and the comments on your question actually gave me the answer that finally worked for me.
Step 1: merge the multiple CSVs into a single file. Exclude the header from all of them except one, and keep that single header at the beginning.
Step 2: Load the single merged csv into SQLite.
For step 1 I used:
$ head -1 one.csv > all_combined.csv
$ tail -n +2 -q *.csv >> all_combined.csv
The first command writes only the first line of one CSV file (you can choose whichever file you like); the second command writes every file starting from line 2, therefore excluding the headers. The -q option makes sure that tail never prints the file name as a header.
Make sure to write all_combined.csv to a separate folder; otherwise the *.csv glob will match it and the file will be appended to itself!
To load into SQLite (Step 2) the answer given by Hot Licks worked for me:
sqlite> .mode csv
sqlite> .import all_combined.csv my_new_table
This assumes that my_new_table hasn't been created yet. Alternatively, you can create it beforehand and then load, but in that case exclude the header in Step 1.
http://www.sqlite.org/cli.html --
Use the ".import" command to import CSV (comma separated value) data into an SQLite table. The ".import" command takes two arguments which are the name of the disk file from which CSV data is to be read and the name of the SQLite table into which the CSV data is to be inserted.
Note that it is important to set the "mode" to "csv" before running the ".import" command. This is necessary to prevent the command-line shell from trying to interpret the input file text as some other format.
sqlite> .mode csv
sqlite> .import C:/work/somedata.csv tab1
There are two cases to consider: (1) Table "tab1" does not previously exist and (2) table "tab1" does already exist.
In the first case, when the table does not previously exist, the table is automatically created and the content of the first row of the input CSV file is used to determine the name of all the columns in the table. In other words, if the table does not previously exist, the first row of the CSV file is interpreted to be column names and the actual data starts on the second row of the CSV file.
For the second case, when the table already exists, every row of the CSV file, including the first row, is assumed to be actual content. If the CSV file contains an initial row of column labels, that row will be read as data and inserted into the table. To avoid this, make sure that table does not previously exist.
Note that you need to make sure that the files DO NOT have an initial line defining the field names. And, for "hundreds" of files you will probably want to prepare a script rather than typing in each file individually.
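A minimal sketch of such a script, assuming the table (my_new_table) already exists, the files have no header rows, and the file names contain no spaces:
# Import every CSV in the current directory into the same table.
for f in *.csv; do
    sqlite3 database.db ".mode csv" ".import $f my_new_table"
done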
I didn't find a nicer way to solve this so I used find along with xargs to avoid creating a huge intermediate .csv file:
find . -type f -name '*.csv' | xargs -I% sqlite3 database.db ".mode csv" ".import % new_table" ".exit"
find prints out the file names and the -I% parameter to xargs causes the command after it to be run once for each line, with % replaced by a name of a csv file.
You can use DB Browser for SQLite to do this pretty easily.
File > Import > Table from CSV file... and then select all the files to open them together into a single table.
I just tested this out with a dozen CSV files and got a single 1 GB table from them without any work. As long as they have the same schema, DB Browser is able to put them together. You'll want to keep the 'Column Names in first line' option checked.

Subsetting a file into multiple files based on a value in the last two positions of the record in the file using powershell

I want to subset a file into multiple txt files based on the value in the last two positions of each record, using PowerShell. The source file comes from an IBM z/OS machine and has no file extension. What I currently do is use an awk command to subset it based on the value in the last two positions of each record, like below:
awk '{ print > ("file" substr($0, length($0)-1, 2) ".txt") }' RAW
The file name is RAW, and the command creates one file per distinct value in the last two positions of a record. So if AA is the value in the last two positions of a record, I get a file named fileAA.txt. How can I achieve this in PowerShell?
Thanks
You could try something like:
Get-Content RAW | %{ $fn="file"+$_.Substring($_.length-2)+".txt"; $_ | Out-File $fn -Append; }

To replace the first character of the last line of a unix file with the file name

We need a shell script that retrieves all .txt files in the current directory and, for each file, checks whether it is empty or contains data (which I believe can be done with the wc command).
If it is empty, ignore it. Otherwise, in our situation every non-empty .txt file in this directory contains a large amount of data, and the last line of the file looks like this:
Z|11|21||||||||||
That is, the last line has the character Z, then |, then an integer, then |, then another integer, then some number of | symbols.
If the file is not empty, we simply assume it has this format. The data before the last line is garbled and not needed, but there will be at least one line before the last line, i.e. a non-empty file is guaranteed to have at least two lines.
We need code that, for each non-empty file, replaces the 'Z' in the last line with the file name ('filename.txt') and writes the new data into another file, say tempfile. The last line thus becomes:
filename.txt|11|21|||||||
The rest of the line stays the same. From tempfile, the last line, i.e. filename.txt|int|int|||||, is taken and merged into a finalfile. The contents of tempfile are then cleared to receive data from the next .txt file in the same directory. In the end, finalfile holds the edited last lines of all non-empty .txt files in that directory.
Eg: file1.txt has data as
....
....
....
Z|1|1|||||
and file2.txt has data as
....
....
....
Z|2|34|||||
After running the script, new data of file1.txt becomes
.....
.....
.....
file1.txt|1|1||||||
This will be written into a new file, say temp.txt, which is initially empty. From there the last line is merged into a file final.txt. So the data in final.txt is:
file1.txt|1|1||||||
After this merging, the data in temp.txt is cleared
New data of file2.txt becomes
...
...
...
file2.txt|2|34||||||
This will be written into the same file temp.txt. From there the last line is merged into the same file final.txt.
So, the data in final.txt is
file1.txt|1|1||||||
file2.txt|2|34||||||
After processing all N files that were found to be of type txt, non-empty, and within the same directory, the data in final.txt becomes:
file1.txt|1|1||||||
file2.txt|2|34||||||
file3.txt|8|3||||||
.......
.......
.......
fileN.txt|22|3|||||
For some of the steps, I already know the commands.
For finding text files in a directory:
find <directory> -type f -name "*.txt"
For taking the last line and appending it to another file:
tail -1 file.txt >> destination.txt
You can use sed to replace the 'Z' character. You'll be in a loop, so you can use the file name you already have there. The snippet below just replaces the Z and then echoes the resulting line.
Good luck.
#!/bin/bash
filename=test.txt
line=$(tail -1 "$filename" | sed "s/Z/$filename/")
echo "$line"
Edit:
Did you run your find command first and look at the output? Each line starts with ./. That will break sed, since sed uses / as a delimiter, and it also does not match your problem statement, which has no extra "/" before the file name. You said current directory, but the find command you gave will traverse ALL subdirectories. Try keeping it simple and using ls.
# `2>/dev/null` sends stderr to /dev/null instead of the screen. This stops
# us from getting the "no files found" error and treating it as a file name!
for filename in `ls *.txt 2>/dev/null` ; do
... stuff ...
done
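Putting the pieces together, a sketch of the full workflow described in the question might look like this; it stages the edited last line of each file in temp.txt and appends it to final.txt, as described, without modifying the original files:
#!/bin/bash
# For every non-empty .txt file in the current directory, replace the leading Z
# of its last line with the file name, stage it in temp.txt, and append it to final.txt.
> final.txt
for filename in *.txt; do
    case "$filename" in temp.txt|final.txt) continue ;; esac   # skip our own output files
    [ -s "$filename" ] || continue                             # skip empty (or missing) files
    tail -1 "$filename" | sed "s/^Z/$filename/" > temp.txt
    cat temp.txt >> final.txt
    > temp.txt                                                  # clear temp.txt for the next file
done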
