We are currently extracting several Teradata .TPT files that we will upload to AWS S3, but the files are coming out ANSI-encoded.
I need them to be encoded as UTF-8.
You must specify the character set in your TPT script. At the top add:
USING CHARACTER SET UTF8
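The clause goes right before the DEFINE JOB statement. A minimal sketch, with a hypothetical job name (the schema is the one defined below):

USING CHARACTER SET UTF8
DEFINE JOB export_some_job
DESCRIPTION 'Export with UTF-8 output'
(
  DEFINE SCHEMA s_some_export ( ... );
  /* operators, APPLY statement, etc. */
);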
The tricky part is that UTF8 here is budgeted at up to 3 bytes per character, so in your DEFINE SCHEMA you must triple the size of each field.
For example if your schema looks like:
DEFINE SCHEMA s_some_export
(
status VARCHAR(20),
userid VARCHAR(20),
firstname VARCHAR(64)
);
You'll have to triple the values to accommodate your UTF8 characters:
DEFINE SCHEMA s_some_export
(
status VARCHAR(60),
userid VARCHAR(60),
firstname VARCHAR(192)
);
Sometimes, because I'm lazy, I define my TPT with USING CHARACTER SET UTF16 so that I only need to double each field size (the math is easier). BUT it means I have to convert the output to UTF-8 after extraction. On Linux this is just:
iconv -f UTF-16LE -t UTF-8 myoutputfile.csv > myoutputfile.utf8.csv
Some caveats:
If your table's field is defined as CHAR with CHARACTER SET LATIN then you may run into column-size issues with your schema.
Dates and timestamps can get weird since they don't need to be expanded, so defining them as VARCHAR in your schema can get you into trouble. You may have to fuss around a bit here. My suggestion would be to change the view from which you are selecting the data for your TPT to CAST(yourdate AS VARCHAR(10)) AS yourdate, and then use VARCHAR(30) in your schema so you don't have to think about the field types while defining it. This means extra CPU overhead in your extraction, but unless you are running tight on resources I think it's worth it. I'm also very lazy that way and always happy to just get the damned TPT to extract data without much debugging.
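A minimal sketch of that approach in Teradata SQL, with hypothetical view and table names:

REPLACE VIEW v_some_export AS
SELECT
    status,
    userid,
    firstname,
    CAST(yourdate AS VARCHAR(10)) AS yourdate  -- now plain text, sized like any other string
FROM some_table;

With that in place the schema entry is just yourdate VARCHAR(30), with no special casing for dates.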
Related
I'm working on a Java Swing application that needs to drag Unicode strings into a JTable. Is it possible to store Unicode data in an SQLite database? If so, which SQLite supports Unicode? I need a free SQLite, not a premium one.
SQLite always stores text data as Unicode, using the Unicode encoding specified when the database was created. The database driver itself takes care to return the data as the Unicode string in the encoding used by your language/platform.
If you have conversion problems, either your application tried to store an ASCII string without converting it to Unicode, or you tried to read one value and force a conversion on it.
SQLite uses a kind of dynamic typing, where each value is stored using a specific storage class. A column's declared type only specifies its affinity, i.e. how values are treated. For example:
A column with NUMERIC affinity may contain values of all five storage classes. When text data is inserted into a NUMERIC column, its storage class is converted to INTEGER or REAL if the text is a well-formed number.
There are five storage classes: NULL, INTEGER, REAL, TEXT, and BLOB. TEXT stores string data using the Unicode encoding specified for the database (UTF-8, UTF-16BE, or UTF-16LE).
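A quick sketch in the sqlite3 shell that shows both points (the table is just a scratch example):

PRAGMA encoding;               -- reports UTF-8, UTF-16le or UTF-16be
CREATE TABLE t(n NUMERIC);
INSERT INTO t VALUES ('42');   -- text inserted into a NUMERIC column...
SELECT n, typeof(n) FROM t;    -- ...is stored as 42 | integer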
What specific problem are you facing, or is this a general question?
SQLite always uses Unicode strings.
sqlite3 doesn't fully support Unicode out of the box. There is a wrapper class called CppSQLite3 which fully supports Unicode.
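To put a finer point on "doesn't fully support": the storage itself is always Unicode, but without the ICU extension the built-in upper()/lower() and case-insensitive LIKE only fold ASCII letters. For example:

SELECT upper('straße');   -- 'STRAßE': the ß is left alone
SELECT 'Ä' LIKE 'ä';      -- 0: case-insensitive matching is ASCII-only without ICU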
Here's a crazy one: the same external table definition works fine in one database, but fails in another. Not schema - database. Two databases, both on the same OS, on different servers. In addition, it's failing on the 2nd date field, though both date fields are defined the same. The NLS settings are the same on both servers, though I thought the date mask should override those anyway. Here's the definition:
-- access parameters
-- http://docs.oracle.com/cd/E11882_01/server.112/e16536/et_params.htm
CREATE TABLE ext_tab (
FIELD1 VARCHAR2(30),
FIELD2_DATE DATE,
FIELD3 VARCHAR2(4),
FIELD4 VARCHAR2(6),
FIELD5_DATE DATE
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY DIR_DATADIR
ACCESS PARAMETERS
( RECORDS DELIMITED BY NEWLINE
NOBADFILE
NODISCARDFILE
LOGFILE 'LOGFILE_LOG'
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"' AND '"'
LRTRIM
MISSING FIELD VALUES ARE NULL
REJECT ROWS WITH ALL NULL FIELDS
(
FIELD1 CHAR(30),
FIELD2_DATE CHAR(8) date_format DATE mask 'YYYYMMDD',
FIELD3 CHAR(4),
FIELD4 CHAR(6),
FIELD5_DATE CHAR(8) date_format DATE mask 'YYYYMMDD'
)
)
LOCATION ('Sample_Input_csv.csv')
)
REJECT LIMIT UNLIMITED
NOPARALLEL;
Here's sample data:
TOTEA01217611,20121122,TOTE,847759,20121122
And, the log error:
KUP-04021: field formatting error for field FIELD5_DATE
KUP-04026: field too long for datatype
Anyone have an answer for this madness?
Apparently, the input file was corrupt in some way, perhaps FTP'd in binary mode instead of ASCII.
What we did:
- pulled another file from the first server to the second server and tested - this one worked fine!
- deleted the contents of this second file, and cut and pasted the exact text from the first file directly into this second file
- ran the test again - it worked!
Everything, as far as we could tell, was identical between the two files. To rule out something to do with the filename, we then renamed this file to the original file's name, but it still worked. We then re-FTP'd the original file, and it worked this time as well. So, again, the only thing we can think of is that some non-printing characters were in the file.
We didn't have a hex editor available to check, but for anyone coming across this same thing, viewing the contents as hex (e.g. with od -c or xxd) would be one way to make sure there is nothing odd in the file.
I have an SQLite3 table that has typeless columns, as in this example:
CREATE TABLE foo(
Timestamp INT NOT NULL,
SensorID,
Value,
PRIMARY KEY(Timestamp, SensorID)
);
I have specific reasons not to declare the type of the columns SensorID and Value.
When inserting rows with numeric SensorID and Value fields, I notice that they are being written as plain text into the .db file.
When I change the CREATE TABLE statement to...
CREATE TABLE foo(
Timestamp INT NOT NULL,
SensorID INT,
Value REAL,
PRIMARY KEY(Timestamp, SensorID)
);
...then the values seem to be written in some binary format to the .db file.
Since I need to write several million rows to the database, I have concerns about the file size this data produces, and so I would like to avoid storing values in plain-text form.
Can I force SQLite to use a binary representation in its database file without using explicitly typed columns?
Note: Rows are currently written with PHP::PDO using prepared statements.
The example in section 3.4 of the SQLite docs about types demonstrates inserting a number into a column without an explicitly declared type, and it ends up stored as an int. The trick is leaving out the quotes around the number: quoting it would make it a string (which, in the case of a typed column, would be coerced back into a number).
Section 2 on the page linked above also provides a lot of info about the type conversions taking place.
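A sketch against the foo table above showing how the literal's form decides the storage class when the column has no declared type:

INSERT INTO foo(Timestamp, SensorID, Value) VALUES (1, 17, 42.5);       -- stored as INTEGER and REAL
INSERT INTO foo(Timestamp, SensorID, Value) VALUES (2, '17', '42.5');   -- stored as TEXT: no affinity to coerce it
SELECT SensorID, typeof(SensorID), typeof(Value) FROM foo;

Since the rows come in through PDO prepared statements, note that PDO binds parameters as strings by default; binding numeric values explicitly (e.g. with PDO::PARAM_INT for integers) avoids the TEXT storage class.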
I'm familiar with how type affinity works in SQLite: You can declare column types as anything you want, and all that matters is whether the type name contains "INT", "CHAR", "FLOA", etc. But is there a commonly-used convention on what type names to use?
For example, if you have an integer column, is it better to distinguish between TINYINT, SMALLINT, MEDIUMINT, and BIGINT, or just declare everything as INTEGER?
So far, I've been using the following:
INTEGER
REAL
CHAR(n) -- for strings with a known fixed width
VARCHAR(n) -- for strings with a known maximum width
TEXT -- for all other strings
BLOB
BOOLEAN
DATE -- string in "YYYY-MM-DD" format
TIME -- string in "HH:MM:SS" format
TIMESTAMP -- string in "YYYY-MM-DD HH:MM:SS" format
(Note that the last three are contrary to the type affinity.)
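As a sketch of how those names map onto affinities under the substring rules above (the table is hypothetical):

CREATE TABLE example(
  id INTEGER PRIMARY KEY,  -- INTEGER affinity
  name VARCHAR(80),        -- TEXT affinity (name contains 'CHAR')
  price REAL,              -- REAL affinity
  flag BOOLEAN,            -- NUMERIC affinity (no matching substring)
  created DATE             -- NUMERIC affinity, even though it will hold 'YYYY-MM-DD' text
);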
I would recommend not using self-defined types. I observed in version 3.5.6 that types not already defined could sometimes cause an INSERT command to be refused, maybe 1 out of 1000. I don't know whether this has been addressed since.
In any case, there is no sizing advantage in typing a column TINYINT or SMALLINT. The only advantage would be outside SQLite, for either parsing your column types with another program or to satisfy your personal need for tidiness. So I strongly recommend using the base types defined by SQLite and sticking to those.
Since SQLite is typeless, use whatever types make it easier for you to see what the schema looks like. Or you can match the types to your codebase.
I'm going to go with Kevin on this one. In short, knock yourself out. Make up brand new areas of mathematics if it suits your schema. Use the classnames of your ORM. Or name every type (except the PRIMARY KEY INTEGER ones) for ex-girlfriends. In the end SQLite is more about how you access and use the data.
I have a classic ASP page that gets POSTed to. The data gets POSTed as UTF-8 (I can see this in Fiddler). I then open an ADODB connection to a database and store the data in a VARCHAR field. If the data can be represented by 8859-1 (e.g. iñtërnâtiônàlizætiøn) it is stored correctly in the varchar field. If I try strings that can't be mapped to 8859-1 (e.g. Здравствуйте!) I get ????????????!. This all makes sense, as the varchar field cannot hold Unicode. I also understand that using an nvarchar field should enable me to store these strings.
My question is this. What settings in SQL Server or in the ADODB object control how the strings are converted from UTF-8 to 8859-1? Does VBScript (ASP) send the strings to ADODB.Connection.Execute as UTF-8 (or what I think it is actually doing - UTF-16) and the database itself handles the conversion? Is this controlled by the collation of the database (SQL_Latin1_General_CP1_CI_AS in this case)?
If you switch to using NVARCHAR instead then you'll need to remember to use the N prefix in your SQL commands whenever you use a Unicode string, like so:
INSERT INTO SOME_TABLE (someField) VALUES (N'Some Unicode Text')
SELECT * FROM SOME_TABLE WHERE someField=N'Some Unicode Text'
If you don't do this then the strings won't get treated as Unicode, and your data will be silently converted to Latin1 (or whatever the default character set for the relevant database/table/field is), even if that field is an NVARCHAR.
You are correct.
VBScript and ADODB only know strings as Unicode (or UTF-16, as it's sometimes referred to).
It's the database's collation settings that determine how the VARCHAR fields are encoded.
In SQL_Latin1_General_CP1_CI_AS, it's really the CP1 part that determines the code page to use. In this case it's a legacy reference to Windows-1252, which is a superset of ISO-8859-1.
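A small T-SQL sketch of that lossy conversion, assuming the same collation (the temp table is hypothetical):

CREATE TABLE #demo(
  v VARCHAR(50),   -- code page comes from the collation (Windows-1252 here)
  n NVARCHAR(50)   -- UTF-16; no code page involved
);
INSERT INTO #demo VALUES (N'Здравствуйте!', N'Здравствуйте!');
SELECT v, n FROM #demo;  -- v: '????????????!', n: 'Здравствуйте!'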