Loading multiple CSV files with SQLite

I'm using SQLite, and I need to load hundreds of CSV files into one table. I didn't manage to find a way to do this on the web. Is it possible?
Please note that in the beginning I used Oracle, but since Oracle has a limit of 1,000 columns per table and my CSV files have more than 1,500 columns each, I had to find another solution. I want to try SQLite, since I can install it quickly and easily.
These CSV files were supplied with that many columns and I can't change or split them (never mind why).
Please advise.

I ran into a similar problem, and the comments on your question actually gave me the answer that finally worked for me.
Step 1: merge the multiple CSVs into a single file. Exclude the header from all of them except one, and keep that single header at the beginning of the merged file.
Step 2: Load the single merged csv into SQLite.
For step 1 I used:
$ head -1 one.csv > all_combined.csv
$ tail -n +2 -q *.csv >> all_combined.csv
The first command writes only the first line of that CSV file (you can pick whichever file you like); the second command appends every file starting from line 2, thereby excluding the headers. The -q option makes sure that tail never writes the file name as a header.
Make sure to put all_combined.csv in a separate folder; otherwise the *.csv glob will match it as well and the merged file will be appended to itself!
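One way to sidestep that, as a rough sketch (the merged/ directory name is just an example), is to write the combined file somewhere the *.csv glob can't see it:
$ mkdir -p merged                                   # keep the output out of the *.csv glob
$ head -1 one.csv > merged/all_combined.csv         # header from one file
$ tail -n +2 -q *.csv >> merged/all_combined.csv    # data rows from every file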
To load into SQLite (Step 2) the answer given by Hot Licks worked for me:
sqlite> .mode csv
sqlite> .import all_combined.csv my_new_table
This assumes that my_new_table hasn't been created yet. Alternatively, you can create it beforehand and then load, but in that case exclude the header in Step 1.
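If you would rather create the table yourself first (for example to control column types), a minimal sketch could look like the one below; the column names are placeholders, and as noted above the header line should then be excluded in Step 1:
sqlite> CREATE TABLE my_new_table (col1 TEXT, col2 TEXT, col3 TEXT);
sqlite> .mode csv
sqlite> .import all_combined.csv my_new_table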

http://www.sqlite.org/cli.html --
Use the ".import" command to import CSV (comma separated value) data into an SQLite table. The ".import" command takes two arguments which are the name of the disk file from which CSV data is to be read and the name of the SQLite table into which the CSV data is to be inserted.
Note that it is important to set the "mode" to "csv" before running the ".import" command. This is necessary to prevent the command-line shell from trying to interpret the input file text as some other format.
sqlite> .mode csv
sqlite> .import C:/work/somedata.csv tab1
There are two cases to consider: (1) Table "tab1" does not previously exist and (2) table "tab1" does already exist.
In the first case, when the table does not previously exist, the table is automatically created and the content of the first row of the input CSV file is used to determine the name of all the columns in the table. In other words, if the table does not previously exist, the first row of the CSV file is interpreted to be column names and the actual data starts on the second row of the CSV file.
For the second case, when the table already exists, every row of the CSV file, including the first row, is assumed to be actual content. If the CSV file contains an initial row of column labels, that row will be read as data and inserted into the table. To avoid this, make sure that table does not previously exist.
Note that when importing into an existing table you need to make sure that the files DO NOT have an initial line defining the field names. And for "hundreds" of files you will probably want to prepare a script rather than typing each import command individually.
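A rough sketch of such a script, assuming the table does not exist yet, every file shares the same header, and the file names contain no spaces (database.db and my_table are made-up names): let the first file create the table from its header, then strip the header from each remaining file before importing it.
#!/bin/sh
set -- *.csv                                             # positional parameters = all CSV files
sqlite3 database.db ".mode csv" ".import $1 my_table"    # first file creates the table from its header row
shift
for f in "$@"; do
    tail -n +2 "$f" > /tmp/noheader.csv                  # drop the header line
    sqlite3 database.db ".mode csv" ".import /tmp/noheader.csv my_table"
done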

I didn't find a nicer way to solve this, so I used find along with xargs to avoid creating a huge intermediate .csv file:
find . -type f -name '*.csv' | xargs -I% sqlite3 database.db ".mode csv" ".import % new_table" ".exit"
find prints out the file names, and the -I% parameter to xargs causes the command after it to be run once per line, with % replaced by the name of a CSV file. One caveat: once the first file has created the table, the header row of each subsequent file will be imported as data, so strip the headers beforehand (or delete those rows afterwards) if your files contain them.

You can use DB Browser for SQLite to do this pretty easily.
File > Import > Table from CSV file... and then select all the files to open them together into a single table.
I just tested this out with a dozen CSV files and got a single 1 GB table from them without any work. As long as they have the same schema, DB Browser is able to put them together. You'll want to keep the 'Column Names in first line' option checked.

Related

UNIX split command is splitting this file, but what are the resulting names?

We receive a big csv file from a client (roughly 500k lines) that we split into smaller chunks using the split command.
You can see how we're using the command below, but my bash knowledge is a bit rusty. Could someone refresh me on the ${processFile}_ bit and explain how the files end up being named? I don't recall what the underscore does...
split -l 50000 $PROCESSING_CURRENT_DIR/$processFile ${processFile}_
This doesn't have anything to do with bash; it's how the split(1) command processes its arguments to split the input.
Syntax is:
split [OPTION]... [FILE [PREFIX]]
DESCRIPTION
Output pieces of FILE to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'.
With no FILE, or when FILE is -, read standard input.
So split takes the given prefix (${processFile}_ in your case) and names the output files by appending aa, ab, ac, and so on; the trailing underscore is simply part of the prefix, there to separate the original file name from the generated suffix.
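As a quick illustration with made-up names, splitting a 120,000-line file called bigfile with the prefix bigfile_ produces three pieces:
$ split -l 50000 bigfile bigfile_
$ ls bigfile_*
bigfile_aa  bigfile_ab  bigfile_ac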

unix compare lists of file names

I believe similar questions have been answered on SO before, but I can't find any that match my particular situation, though I am sure many others have faced this scenario.
In an FTP session on Red Hat I have produced a list of file names that reside on the server currently. The list contains the file names and only the file names. Call this file1. Perhaps it contains something like:
513569430_EDIP000754535900_MFC_20190618032554.txt
blah.txt
duh.txt
Then I have downloaded the files and produced a list of successfully downloaded files. As well, this list contains the file names and only the file names. Call this file2. Perhaps it contains something like:
loadFile.dat
513569430_EDIP000754535900_MFC_20190618032554.txt
localoutfile.log
Now I want to loop through the names in file1 and check whether they exist in file2. If a name exists in both, I will go back to the FTP server and delete that file from the server.
I have looked at while loops, comm, and the test command, but I just can't seem to crack it. I expect there are many ways to achieve this task. Any suggestions or working references?
My area of trouble is really not the looping itself but rather comparing the contents of the two files.
comm -1 -2 file1 file2 returns just the lines that appear in both files. This can be used as the basis of a batch command file for sftp.
From the comments on the question, it seems that the line endings differ between the two files. This can be fixed in various ways, the simplest probably being with tr. comm understands - as a file name to mean "read from stdin".
For example:
tr -d '\r' < file1 | comm -1 -2 - file2
If file1 or file2 are not sorted, this must be corrected for comm to operate properly. With bash, this could be:
comm -1 -2 <( sort file1 | tr -d '\r' ) <( sort file2 )
With shells that don't understand the <( ... ) syntax, temporary files may be used explicitly.
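To tie this back to the deletion step, one possible sketch (the host is only an example, and it assumes file names without spaces) is to turn the common names into an sftp batch file:
comm -12 <( sort file1 | tr -d '\r' ) <( sort file2 ) | sed 's/^/rm /' > delete.batch   # one "rm <name>" per common file
sftp -b delete.batch user@ftp.example.com                                               # user@ftp.example.com is a placeholder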
Thank you for the advice, @jhnc.
After giving this some deeper consideration and conversation, I realized that I don't even need to do this comparison. After I download the files, I just need to produce the list of successful downloads. Then I can go back and delete from the server based on that list of successful downloads.
However, I am still interested in knowing how to handle the '\r\n' vs '\n' line-ending situation.

rsync exclude list from database?

I know I can make rsync exclude files listed in a text file, but can I make rsync read a sqlite (or other) database as an exclude list?
Otherwise I guess I could dump the sqlite database to a text file, but I would like to eliminate that extra step, since I have many files in many directories.
The man page says:
--exclude-from=FILE
This option is related to the --exclude option, but it specifies a FILE that contains exclude patterns (one per line). Blank lines in the file and lines starting with ";" or "#" are ignored. If FILE is -, the list will be read from standard input.
So just pipe the file names into rsync:
sqlite3 my.db "SELECT filename FROM t" | rsync --exclude-from=- ...
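For instance, a hedged sketch with made-up source and destination (the database and table names follow the example above):
sqlite3 my.db "SELECT filename FROM t" | rsync -av --exclude-from=- /source/dir/ user@backup:/dest/dir/   # paths and host are placeholders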

How to delete the first row from a .csv file?

I have 500 csv files, and the first row of each file contains the name of that file. I wish to exclude the file name from my data.
I have tried this but it is not working:
temp = list.files(pattern="*.csv")
myfiles = lapply(temp, read.delim)
myfiles = myfiles[-1, ]
This is clearly an R question. However, I thought I would suggest a Unix approach. Unix will be much faster than R for this task, and IMO it is the more natural tool. If you are on Windows you'll have to install Cygwin. That may be a headache; however, with only minimal knowledge, Unix is a very powerful tool. There are essentially two approaches to your problem:
First Approach
You can modify each file in place so that the first row is removed. This means that your original .csv files will no longer exist in their original form.
sed -i 1d *.csv
Second Approach
The first approach is destructive. If you want to keep the original files, remove the -i flag from the command above. You will also need a for loop so you can name each of the new files.
for f in *.csv; do sed 1d "$f" > "new_$f"; done
A for loop in Unix is kinda like an R for loop, except do and done replace { and }.
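As a quick sanity check, assuming the loop above produced new_one.csv from one.csv, the new file should report exactly one line fewer:
$ wc -l one.csv new_one.csv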

Subsetting a file into multiple files based on a value in the last two positions of each record, using PowerShell

I want to split a file into multiple txt files based on the value in the last two positions of each record, using PowerShell. The source file comes from an IBM z/OS machine and it does not have a file extension. What I currently do is use an awk command to split it based on the value in the last two positions of each record, like below:
awk '{print > "file.txt" substr($0,length-2,2) }' RAW
The file name is RAW, and the command creates multiple files depending on the distinct values in the last two positions of each record. So if AA is the value in the last two positions of a record, I get an output file like fileAA.txt. How can I achieve this in PowerShell?
Thanks
You could try something like:
Get-Content RAW | %{ $fn="file"+$_.Substring($_.length-2)+".txt"; $_ | Out-File $fn -Append; }
