Loading a UCS-2LE file in Netezza - ODBC

I have multiple 30 GB / 1-billion-record files which I need to load into Netezza. I am connecting using pyodbc and running the following commands.
create temp table tbl1(id bigint, dt varchar(12), ctype varchar(20), name varchar(100)) distribute on (id)
insert into tbl1
select * from external 'C:\projects\tmp.CSV'
using (RemoteSource 'ODBC' Delimiter '|' SkipRows 1 MaxErrors 10 QuotedValue DOUBLE)
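(For reference, a minimal sketch of how these statements might be submitted through pyodbc; the DSN and credentials are placeholders, and autocommit is an assumption.)
import pyodbc

# DSN, credentials, and the autocommit setting are placeholders/assumptions
conn = pyodbc.connect("DSN=NZSQL;UID=user;PWD=password", autocommit=True)
cur = conn.cursor()
cur.execute(
    "create temp table tbl1(id bigint, dt varchar(12), ctype varchar(20), "
    "name varchar(100)) distribute on (id)"
)
cur.execute(
    "insert into tbl1 "
    "select * from external 'C:\\projects\\tmp.CSV' "
    "using (RemoteSource 'ODBC' Delimiter '|' SkipRows 1 MaxErrors 10 QuotedValue DOUBLE)"
)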
Here's a snippet from the nzlog file
Found bad records
bad #: input row #(byte offset to last char examined) [field #, declaration] diagnostic,
"text consumed"[last char examined]
----------------------------------------------------------------------------------------
1: 2(0) [1, INT8] contents of field, ""[0x00<NUL>]
2: 3(0) [1, INT8] contents of field, ""[0x00<NUL>]
and the nzbad file has "NUL" between every character.
I created a new file with the first 2 million rows. Then I ran iconv on it:
iconv -f UCS-2LE -t UTF-8 tmp.CSV > tmp_utf.CSV
The new file loads perfectly with no errors using the same commands. Is there any way for me to load the files without the iconv transformation? It is taking a really long time to run iconv.

UCS-2LE is not supported by Netezza. I hope for your sake that UTF-8 is enough for the data you have (no ancient languages or the like?).
You need to focus on doing the conversion faster by:
searching the internet for a more CPU-efficient implementation of (or alternative to) iconv
converting multiple files in parallel (your number of CPU cores minus one is probably the useful maximum; see the sketch below). You may need to split the original files before you do it. The Netezza loader prefers relatively large files though, so you may want to put them back together while loading for extra speed in that step :)
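As a rough sketch of the parallel idea in Python (the directory, file pattern, output naming, and worker count are assumptions):
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert(path):
    # threads are enough here: the CPU work happens inside the iconv child process
    out = path.with_name(path.stem + "_utf8.csv")
    with open(out, "wb") as dst:
        subprocess.run(["iconv", "-f", "UCS-2LE", "-t", "UTF-8", str(path)],
                       stdout=dst, check=True)
    return out

files = sorted(Path("C:/projects").glob("*.CSV"))   # assumed location/pattern
with ThreadPoolExecutor(max_workers=3) as pool:     # roughly cores minus one
    for converted in pool.map(convert, files):
        print("done:", converted)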

Related

Vroom/fread won't read LARGE .csv file - cannot memory map it

I have a .csv file that is 112 GB in size, but neither vroom nor data.table::fread will open it. Even if I ask to read in 10 rows or just a couple of columns, it complains with mapping error: Cannot allocate memory.
df<-data.table::fread("FINAL_data_Bus.csv", select = c(1:2),nrows=10)
System errno 22 unmapping file: Invalid argument
Error in data.table::fread("FINAL_data_Bus.csv", select = c(1:2), nrows = 10) :
Opened 112.3GB (120565605488 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.
read.csv on the other hand will read the ten rows happily.
Why won't vroom or fread read it using the usual altrep, even for 10 rows?
This matter has been discussed by the main creator of data.table package at https://github.com/Rdatatable/data.table/issues/3526. See the comment by Matt Dowle himself at https://github.com/Rdatatable/data.table/issues/3526#issuecomment-488364641. From what I understand, the gist of the matter is that to read even 10 lines from a huge csv file with fread, the entire file needs to be memory mapped. So fread cannot be used on its own in case your csv file is too big for your machine. Please correct me if I'm wrong.
Also, I haven't been able to use vroom with big more-than-RAM csv files. Any pointers towards this end will be appreciated.
For me, the most convenient way to check out a huge (gzipped) csv file is by using a small command line tool csvtk from https://bioinf.shenwei.me/csvtk/
e.g., check dimensions with
csvtk dim BigFile.csv.gz
and, check out head with top 100 rows
csvtk head -n100 BigFile.csv.gz
get a better view of the above with
csvtk head -n100 BigFile.csv.gz | csvtk pretty | less -SN
Here I've used less command available with "Gnu On Windows" at https://github.com/bmatzelle/gow
A word of caution - many people suggest using command
wc -l BigFile.csv
to check the number of lines in a big csv file. In most cases, it will be equal to the number of rows. But if the big csv file contains newline characters within a cell (to use a spreadsheet term), the above command will not give the number of rows; in such cases the number of lines differs from the number of rows. So it is advisable to use csvtk dim or csvtk nrow. Other csv command line tools like xsv and miller will also show correct results.
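If you would rather check this from Python, a short sketch with the csv module (file name assumed) counts rows rather than raw lines, since quoted cells with embedded newlines are handled correctly:
import csv

with open("BigFile.csv", newline="", encoding="utf-8") as f:
    n_rows = sum(1 for _ in csv.reader(f)) - 1   # subtract the header row
print(n_rows)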
Another word of caution - the shortcut fread(cmd="head -n 10 BigFile.csv") is not advisable for previewing the top few lines if some columns contain significant leading zeros, such as 0301 or 0542, since without a column specification fread will interpret them as integers and drop the leading zeros. For example, in some databases that I have to analyse, a leading zero in a particular column means that it is a Revenue Receipt. So it is better to preview a big csv file with a command line tool like csvtk, miller, or xsv together with less -SN, which shows the file "as is" without any potentially wrong interpretation.
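Similarly, if you preview from Python instead, reading everything as text keeps values like 0301 intact (a sketch; the file name is an assumption):
import pandas as pd

# read only the first 100 rows and force every column to string,
# so leading zeros are not dropped
preview = pd.read_csv("BigFile.csv", nrows=100, dtype=str)
print(preview)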
PS1: Even spreadsheets like MS Excel and LibreOffice Calc lose leading zeroes in csv files by default. LibreOffice Calc actually shows leading zeroes in the preview window but loses them when you load the file! I'm yet to find a spreadsheet that does not lose leading zeroes in csv files by default.
PS2: I've posted my approach to querying very large csv files at https://stackoverflow.com/a/68693819/8079808
EDIT:
VROOM does have difficulty when dealing with huge files since it needs to store the index in memory as well as any data you read from the file. See development thread https://github.com/r-lib/vroom/issues/203

How to read in more than 250,000 characters XML CLOB field from Oracle into R or SAS?

I need to read in this XML CLOB column from an Oracle table. I tried a simple read like below:
xmlbefore <- dbGetQuery(conn, "select ID, XML_TXT from XML_table")
But I can only read in about 225,000 characters. When I compare with the sample XML file, it only read in maybe 2/3 or 3/4 of the entire field.
I assume R has a limit of maybe 225,000 characters, and SAS even less, only about 1,000 characters.
How can I read in the entire field with all characters (I think it is about 250,000-270,000)?
SAS dataset variables have a 32k character limit and macro variables 64k. LUA variables in SAS, however, have no limit (other than memory), so you can read your entire XML file into a single variable in one go.
PROC LUA is available from SAS 9.4M3 (check &sysvlong for details). If you have an earlier version of SAS, you can still process your XML by parsing it a single character at a time (RECFM=N).

Import BLOBs from a CSV to an SQLite table

How should I put UUIDs into a CSV file in order for the SQLite .import command to load them into a table as 128-bit BLOBs?
As far as I know, the only ways to generate a blob from the sqlite3 shell are using the zeroblob(), randomblob() and readfile() SQL functions, CASTing a value, or writing a base-16 blob literal (X'1234ABCD').
If your UUIDs are already represented as big-endian 128-bit binary numbers in the CSV file, you might be able to do something like UPDATE table SET uuid = CAST(uuid AS BLOB); after the import. If they're a textual representation like 123e4567-e89b-12d3-a456-426655440000, you could write a user-defined function to do the conversion and use it in a similar post-import UPDATE.
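A minimal sketch of such a user-defined function using Python's sqlite3 module (the database file, table, and column names are assumptions):
import sqlite3
import uuid

def uuid_text_to_blob(value):
    # '123e4567-e89b-12d3-a456-426655440000' -> 16-byte big-endian blob
    return None if value is None else uuid.UUID(value).bytes

con = sqlite3.connect("database.db")
con.create_function("uuid_blob", 1, uuid_text_to_blob)
con.execute("UPDATE mytable SET uuid = uuid_blob(uuid)")
con.commit()
con.close()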
SQLite is unable to import BLOBs from CSV.
The solution is to convert CSV to an SQL-statement file and execute it:
sqlite3 database.db < database.sql
If you pipe from an application, chunks of about 100,000 rows per sqlite3 process instance work best.
If you try to pipe many gigabytes at once, sqlite3 will crash with an Out of memory error.
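A small sketch of that conversion in Python, emitting X'...' blob literals for the UUID column (the file names, table name, and two-column layout are assumptions):
import csv
import uuid

with open("data.csv", newline="", encoding="utf-8") as src, \
     open("database.sql", "w", encoding="utf-8") as dst:
    dst.write("BEGIN;\n")
    for row in csv.reader(src):
        blob = uuid.UUID(row[0]).hex.upper()   # 32 hex digits = 128 bits
        name = row[1].replace("'", "''")       # naive escaping for the text column
        dst.write(f"INSERT INTO mytable VALUES (X'{blob}', '{name}');\n")
    dst.write("COMMIT;\n")
The resulting database.sql can then be fed to sqlite3 database.db < database.sql as above.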

Easiest way to recast variables in SQLite

New to databasing, so please let me know if I'm going about this entirely wrong.
I want to use databases to store large datasets (I use R to analyze data, and it cannot load datasets larger than available RAM), and I'm using SQLite-Manager in Firefox to import .csv files. 99% of the time I use reals, but I would like to avoid all the clicking needed to manually cast each of 100 columns as REAL (the default in SQLite-Manager is TEXT).
Is there a way I can quickly/easily cast all columns as REAL? Thanks!
Why don't you make a script to be interpreted by the SQLite shell?
Run sqlite my_db < script.txt with script.txt containing the following:
CREATE TABLE foo(
col1 REAL,
col2 REAL,
[...] generate those lines with a decent text editor (or a small script; see the sketch below)
);
.separator ;
.import 'my/csv/file.csv' foo
.q
Note that dot-commands of the SQLite shell are available using “.help”. Imports are rudimentary and won't work if you have double quotes (remove them). Only the , is interpreted as a separator, you cannot escape it. If needed you can use a multicharacter separator.
Also be sure that file.csv is UTF8-encoded.
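To avoid the manual editing, a small sketch like this could generate the column list from the CSV header (the file and table names are assumptions; column names with spaces would still need quoting):
import csv

with open("my/csv/file.csv", newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))

columns = ",\n".join(f"  {name} REAL" for name in header)
print(f"CREATE TABLE foo(\n{columns}\n);")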

Fixing Unicode Byte Sequences

Sometimes when copying stuff into PostgreSQL I get errors that there are invalid byte sequences.
Is there an easy way, using either vim or other utilities, to detect byte sequences that cause errors such as invalid byte sequence for encoding "UTF8": 0xde70 and whatnot, and possibly an easy way to do the conversion?
Edit:
What my workflow is:
Dumped sqlite3 database (from trac)
Trying to replay it in postgresql
Perhaps there's an easier way?
More Edit:
Also tried these:
Running enca to detect encoding of the file
Told me it was ASCII
Tried iconv to convert from ASCII to UTF8. Got an error
What did work was deleting the couple of erroneous lines that it complained about. But that didn't really solve the real problem.
Based on one short sentence, it sounds like you have text in one encoding (e.g. ANSI/ASCII) and you are telling PostgreSQL that it's actually in another encoding (Unicode UTF8). All the different tools you would be using: PostgreSQL, Bash, some programming language, another programming language, other data from somewhere else, the text editor, the IDE, etc., all have default encodings which may be different, and some step of the way, the proper conversions are not being done. I would check the flow of data where it crosses these kinds of boundaries, to ensure that either the encodings line up, or the encodings are properly detected and the text is properly converted.
If you know the encoding of the dump file, you can convert it to utf-8 by using recode. For example, if it is encoded in latin-1:
recode latin-1..utf-8 < dump_file > new_dump_file
If you are not sure about the encoding, you should see how sqlite was configured, or maybe try some trial-and-error.
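If you want to locate the offending bytes first, a short Python scan (dump file name assumed) reports every line that fails to decode as UTF-8:
with open("dump_file", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # show where in the line the bad bytes sit
            print(f"line {lineno}, byte {err.start}: {raw[err.start:err.end]!r}")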
I figured it out. It wasn't really an encoding issue.
SQLite's output escaped strings differently than Postgres expects. There were some cases where 'asdf\xd\foo' was output. I believe the '\x' was causing Postgres to expect the following characters to be a Unicode escape.
The solution is to dump each table individually in CSV mode in sqlite3.
First
sqlite3 db/trac.db .schema | psql
Now, this does the trick for the most part to copy the data back in
for table in `sqlite3 db/trac.db .schema | grep TABLE | sed 's/.*TABLE \(.*\) (/\1/'`
do
echo ".mode csv\nselect * from $table;" | sqlite3 db/trac.db | psql -c "copy $table from stdin with csv"
done
Yeah, kind of a hack, but it works.
