Declaring a field to use Unicode UTF-16 - Teradata

I am trying to use the Unicode UTF-16 character set, but I am unsure how to do this. By default, when I use the Unicode character set it uses UTF-8, which turns foreign characters (Spanish, Arabic, etc.) into ?. I am currently using Teradata 14.
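The answers further down this page point at the session character set rather than the column definition: columns that need to hold this data are declared CHARACTER SET UNICODE, and the client's session character set (for example, the keyword in the .NET Data Provider connection string) decides whether UTF8 or UTF16 is used between client and server. A minimal sketch along those lines, with a made-up table and connection string of my own, drawn from the UTF16 suggestion below:

-- Hypothetical table: store international text in a UNICODE column
CREATE MULTISET TABLE mydb.customer_names
     (
      cust_id    INTEGER NOT NULL,
      name_intl  VARCHAR(200) CHARACTER SET UNICODE CASESPECIFIC)
PRIMARY INDEX ( cust_id );

-- Client side: request UTF16 for the session, e.g. in the .NET Data Provider
-- connection string (the same keyword the bounty post below uses for UTF8):
--   Data Source=mytdsystem;User Id=...;Password=...;Session Character Set=UTF16;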

Related

How can I use QSettings to write UTF-8 characters into [section] and [name] of *.ini file properly?

My code snippet is here:
QSettings setting("xxx.ini", QSettings::Format::IniFormat);
setting.setIniCodec(QTextCodec::codecForName("UTF-8"));
setting.beginGroup(u8"运动控制器");
setting.setValue(u8"运动控制器", u8"运动控制器");
setting.endGroup();
But what is written looks like this:
[%U8FD0%U52A8%U63A7%U5236%U5668]
%U8FD0%U52A8%U63A7%U5236%U5668=运动控制器
So it seems I did set the encoding correctly (at least partly), but what should I do to get the section and key names written as text instead of these percent-sign codes?
The environment is Qt 5.12.11 with Visual Studio 2019.
Unfortunately, this is hard-coded behavior in QSettings that you simply cannot change.
In section and key names, Unicode characters <= U+00FF (other than a..z, A..Z, 0..9, _, -, or .) are encoded in %XX hex format, and higher characters are encoded in %UXXXX format. The codec specified in setIniCodec() has no effect on this behavior.
Key values are written in the specified codec, in this case UTF-8.
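The escaping is symmetric, though: QSettings applies the same decoding when it reads the file back, so a lookup with the original Chinese key should still return the stored value even though the .ini file itself looks mangled. A minimal round-trip sketch (same file name as above; the QCoreApplication scaffolding is only there to make it standalone):

#include <QCoreApplication>
#include <QSettings>
#include <QTextCodec>
#include <QDebug>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    QSettings setting("xxx.ini", QSettings::Format::IniFormat);
    setting.setIniCodec(QTextCodec::codecForName("UTF-8"));

    setting.beginGroup(u8"运动控制器");                    // section name is written as %UXXXX escapes
    setting.setValue(u8"运动控制器", u8"运动控制器");      // value is written as UTF-8
    qDebug() << setting.value(u8"运动控制器").toString();  // should read back "运动控制器"
    setting.endGroup();
    return 0;
}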

Why does one AL32UTF8 character not display the I-acute, yet another one displays the tilde-N?

My Oracle 11g is configured with AL32UTF8
NLS_CHARACTERSET AL32UTF8
Why does the tilde-N display as tilde-N in the second record, but the acute-I and the K do not display correctly in the first record?
Additional Information:
The hex code for the acute-I is CD.
When I take the HEX code from the dump and convert it using UNISTR(), the character displays with the accent.
select unistr('\0052\0045\0059\004B\004A\0041\0056\00CD\004B') as hex_to_unicode
from dual;
This is probably an issue with whatever client you are using to display the results, rather than with your database. What are you using?
You can check if the database results are correct using the DUMP function. If the value in your table has the correct byte sequence for your database character set, you're good.
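For example (the table and column names here are stand-ins for your own):

-- Format 1016 shows the stored bytes in hex along with the character set name
select mycol,
       dump(mycol, 1016) as stored_bytes
from   mytable;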
Edit:
OK, I'm pretty sure your data is bad. You're talking about LATIN CAPITAL LETTER I WITH ACUTE, which is Unicode code point U+00CD. That is not the same as the single byte 0xCD. Your database character set is AL32UTF8, which uses UTF-8 encoding, and the correct UTF-8 encoding of U+00CD is the two-byte sequence 0xC3 0x8D.
What you have stored is the byte sequence 0xCD 0x4B, which is not valid UTF-8.
The Oracle UNISTR function takes the code point in UCS-2 encoding, which is roughly the same as UTF-16, not UTF-8.
Demonstration here: http://sqlfiddle.com/#!4/7e9d1f/1
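If the fiddle link is unavailable, a minimal check you can run locally makes the same point (assuming an AL32UTF8 database; TO_CHAR converts UNISTR's national-character-set result into the database character set before dumping it):

-- Should report something like: Typ=1 Len=2 CharacterSet=AL32UTF8: c3,8d
select dump(to_char(unistr('\00CD')), 1016) as i_acute_bytes
from   dual;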

How to specify encoding while creating file?

I am using an R script to create and append to a file, but I need the file to be saved in ANSI encoding, even though some characters are in Unicode format. How do I ensure ANSI encoding?
newfile <- "/home/user/abc.ttl"
file.create(newfile)
text3 <- readLines("/home/user/init.ttl")
sprintf("readlines %d", length(text3))
for (k in 1:length(text3))
{
  cat(text3[[k]], file = newfile, sep = "\n", append = TRUE)
}
Encoding can be tricky: you need to detect the encoding on input, and then convert it before writing. Here it sounds like your input file init.ttl is encoded as UTF-8, and you need it converted to ASCII. This means you will probably lose some untranslatable characters, since there is no mapping from UTF-8 to ASCII outside the lower 128 code points (7-bit ASCII). (Within that range, UTF-8 and ASCII encode characters identically.)
So here is how to do it. You will have to modify your code accordingly to test since you did not supply the elements needed for a reproducible example.
Make sure that your input file is actually UTF-8 and that you are reading it as UTF-8. You can do this by adding encoding = "UTF-8" to the third line of your code, as an argument to readLines(). Note that you may not be able to set the system locale to UTF-8 on a Windows platform, but the file will still be read as UTF-8, even though extended characters may not display properly.
Use iconv() to convert the text from UTF-8 to ASCII. iconv() is vectorised so it works on the whole set of text. You can do this using
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
Note here that the sub = "" argument prevents the default behaviour of converting the entire character element to NA if it encounters any untranslatable characters. (These include the seemingly innocent but actually subtly evil things such as "smart quotes".)
Now when you write the file using cat() the output should be ASCII.
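Putting those pieces together with the file names from the question (the paths are only placeholders), a minimal sketch might look like this; writeLines() here replaces the explicit cat() loop from the question:

# Read the source as UTF-8, strip anything with no ASCII equivalent, write the result
text3 <- readLines("/home/user/init.ttl", encoding = "UTF-8")
text3 <- iconv(text3, "UTF-8", "ASCII", sub = "")
writeLines(text3, "/home/user/abc.ttl")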

Teradata SQLA: Row size or Sort Key size overflow

When doing a select of all columns from a table consisting of 86 columns in SQLA, I always get the error Row size or Sort Key size overflow. The only way to avoid this error is to trim down the number of columns in the select, but that is a workaround rather than a solution. There has to be a way to select all columns from this table in one select statement.
Bounty
I am adding this bounty because I cannot hack my way past this issue any longer. There has to be a solution to this. Right now, I am selecting from a table with Unicode columns, and I am assuming this is what causes the row size to exceed capacity. When I remove Session Character Set=UTF8 from my connection string, I get the error The string contains an untranslatable character. I am using the .NET Data Provider 14.0.0.1. Is there a way to increase the size?
Update
Rob, you never cease to impress! Your suggestion of using UTF16 works. It even works in SQLA after I update my ODBC config. I think my problem all along has been my lack of understanding of ASCII, Latin, UTF8, and UTF16.
We also have an 80-column table that consists of all Latin columns, a few of which are varchar(1000). I get the same error in SQLA when selecting from it with UTF8 or UTF16, but I can select from it just fine after switching my character set to ASCII or Latin mode in my ODBC config.
Rob, can you provide insight as to what's happening here? My theory is that, because the table is in the Latin set, using UTF8 or UTF16 causes a conversion to a larger number of bytes per character, which results in the error, especially for the varchar(1000) columns. If I use Latin as my session character set, no conversion is done and I get the strings in their native encoding. As for the issue in question, UTF8 fails because the encoding cannot be "downgraded"?
Per request, here is the DDL of the table in question:
CREATE MULTISET TABLE mydb.mytable ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
FIELD1 VARCHAR(214) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
FIELD2 VARCHAR(30) CHARACTER SET UNICODE CASESPECIFIC,
FIELD3 VARCHAR(60) CHARACTER SET UNICODE CASESPECIFIC NOT NULL,
FIELD4 VARCHAR(4000) CHARACTER SET UNICODE CASESPECIFIC,
FIELD5 VARCHAR(900) CHARACTER SET UNICODE CASESPECIFIC,
FIELD6 VARCHAR(900) CHARACTER SET UNICODE CASESPECIFIC,
FIELD7 VARCHAR(900) CHARACTER SET UNICODE CASESPECIFIC,
FIELD8 VARCHAR(900) CHARACTER SET UNICODE CASESPECIFIC,
FIELD9 VARCHAR(900) CHARACTER SET UNICODE CASESPECIFIC,
FIELD10 VARCHAR(900) CHARACTER SET UNICODE CASESPECIFIC,
FIELD11 VARCHAR(3600) CHARACTER SET UNICODE CASESPECIFIC,
FIELD12 VARCHAR(3600) CHARACTER SET UNICODE CASESPECIFIC,
FIELD13 VARCHAR(3600) CHARACTER SET UNICODE CASESPECIFIC,
FIELD14 VARCHAR(3600) CHARACTER SET UNICODE CASESPECIFIC)
PRIMARY INDEX ( FIELD1 );
Without seeing your table definition, have you considered using UTF16 instead of UTF8 for your SESSION CHARSET?
Some more research on your error message found this post suggesting that UTF16 may afford you the ability to return records that UTF8 otherwise will not.
Edit:
If you recall from the link that I shared above, for a given VARCHAR(n) the bytes to store would be as follows:
LATIN: n bytes
UTF8: n*3 bytes
UTF16: n*2 bytes
This would mean that a VARCHAR(4000) UNICODE field in a UTF8 session should require 12KB. If you have to deal with UNICODE data consistently, it may be to your advantage to leave or change your default session character set to UTF16. In my experience I have not had to work with UNICODE data, so I couldn't tell you what pitfalls changing your character set may introduce for LATIN data elsewhere in your database(s).
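As a rough back-of-the-envelope check against the DDL above (assuming the response row limit on this release is somewhere around 64 KB), those multipliers explain why UTF16 succeeds where UTF8 fails:

UNICODE characters declared: 30 + 60 + 4000 + (6 x 900) + (4 x 3600) = 23,890
UTF8 session:  23,890 x 3 = 71,670 bytes, plus 214 for the LATIN column = 71,884 bytes -> over the limit
UTF16 session: 23,890 x 2 = 47,780 bytes, plus 214 for the LATIN column = 47,994 bytes -> fits
LATIN session: smaller still, but UNICODE data with no LATIN equivalent raises the "untranslatable character" error instead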
Hope this helps.

Printing ASCII value of BB (HEX) in Unix

When I am trying to paste the character » (right double angle quotes) in Unix from my Notepad, it's converting to /273. The corresponding Hex value is BB and the Decimal value is 187.
My actual requirement is to have this character as the file delimiter when I export a .dat file from a database table. So, this character was put in as the delimiter after each column name. But, while copy-pasting, it's getting converted to /273.
Any idea about how to fix this? I am on Solaris (SunOS 5.10).
Thanks,
Visakh
ASCII only defines character codes up to 127 (0x7F); everything after that belongs to some other encoding, such as ISO-8859-1 or UTF-8. Make sure your locale is set to the encoding you are trying to use: the locale command will report your current locale settings, and the locale(5) and environ(5) man pages cover how to set them. A much more in-depth introduction to the whole character-encoding concept can be found in Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
The character code 0xBB is shown as » in the ISO-8859-1 character chart, so that's probably the character set you want; the locale would then be something like en_US.ISO8859-1 for that character set with US/English messages, date formats, currency settings, etc.
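If copy-pasting keeps mangling the character, you can also generate the byte directly from its octal value instead of relying on the clipboard. A small sketch (the locale name and the test string are just examples):

# Work in an ISO-8859-1 locale so the 0xBB byte is treated as a single » character
LC_ALL=en_US.ISO8859-1; export LC_ALL

# 0xBB is octal 273; printf emits the raw byte even when pasting does not
DELIM=$(printf '\273')
printf '%s\n' "col1${DELIM}col2${DELIM}col3" | od -c   # od should show 273 between the fields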
