Writing a Unicode file using DBMS_XSLPROCESSOR (from an AL16UTF16 DB) - Oracle 11g

My Oracle DB is in the AL16UTF16 character set. I need to generate a Unicode text file which will be imported into another DB that uses the AL32UTF8 encoding. For this I use PL/SQL code that calls the DBMS_XSLPROCESSOR.CLOB2FILE procedure.
In my code:
First, I use an NCLOB to store the Unicode lines, which contain Chinese characters.
Then I call the procedure as: DBMS_XSLPROCESSOR.CLOB2FILE(v_file, DIRE, fileName, 873)
where v_file is the NCLOB variable that contains the file and 873 is the Oracle character set ID for AL32UTF8.
However, when I check the text file I find ¿¿ instead of the Chinese characters. Could you help me resolve this, or suggest another procedure besides DBMS_XSLPROCESSOR.CLOB2FILE that allows writing a large file with Chinese characters extracted from a non-Unicode DB?
Many Thanks
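The inverted question marks usually mean the text was converted through a character set that cannot represent the Chinese characters somewhere along the way. If switching APIs is an option, one workaround is to write the NCLOB yourself with UTL_FILE's NCHAR routines, which write the file contents in UTF-8 regardless of the database character set. This is only a minimal sketch, assuming the directory object DIRE from the question, a hypothetical file name, and lines in the NCLOB shorter than the 32767-byte line limit:

DECLARE
  v_file  NCLOB;                          -- assumed to be already populated with the Chinese text
  v_out   UTL_FILE.FILE_TYPE;
  v_len   PLS_INTEGER;
  v_pos   PLS_INTEGER := 1;
  c_amt   CONSTANT PLS_INTEGER := 8000;   -- characters written per chunk
BEGIN
  -- 'unicode_out.txt' is a hypothetical file name; 'DIRE' is the directory object from the question.
  v_out := UTL_FILE.FOPEN_NCHAR('DIRE', 'unicode_out.txt', 'w', 32767);
  v_len := DBMS_LOB.GETLENGTH(v_file);
  -- Write the NCLOB in chunks; the NCHAR routines emit UTF-8 bytes to the file.
  WHILE v_pos <= v_len LOOP
    UTL_FILE.PUT_NCHAR(v_out, DBMS_LOB.SUBSTR(v_file, c_amt, v_pos));
    v_pos := v_pos + c_amt;
  END LOOP;
  UTL_FILE.FCLOSE(v_out);
END;
/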

Related

Teradata query returns bad characters in string column but exporting to CSV from assistant console works

I am using the DBI package in R to connect to Teradata this way:
library(DBI)
library(teradatasql)
query <- "
SELECT sku, description
FROM sku_table
WHERE sku = '12345'
"
dbconn <- DBI::dbConnect(
  teradatasql::TeradataDriver(),
  host = teradataHostName, database = teradataDBName,
  user = teradataUserName, password = teradataPassword
)
dbFetch(dbSendQuery(dbconn, query), -1)
It returns a result as follows:
SKU DESCRIPTION
12345 18V MAXâ×¢ Collated Drywall Screwgun
Notice the bad characters â×¢ above. This is supposed to be superscript TM for trademarked.
When I use SQL assistant to run the query, and export the query results manually to a CSV file, it works fine as in the DESCRIPTION column has correct encoding.
Any idea what is going on and how I can fix this problem? Obviously, I don't want a manual step of exporting to CSV and re-reading results back into R data frame, and into memory.
The Teradata SQL Driver for R (teradatasql package) only supports the UTF8 session character set, and does not support using the ASCII session character set with a client-side character set for encoding and decoding.
If you have stored non-LATIN characters in a CHARACTER SET LATIN column in the database, and are using a client-side character set to encode and decode those characters for the "good" case, that will not work with the teradatasql package.
On the other hand, if you used the UTF8 or UTF16 session character set to store Unicode characters into a CHARACTER SET UNICODE column in the database, then you will be able to retrieve those characters successfully using the teradatasql package.
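One way to see which case applies is to check the column's declared character set in the data dictionary. A minimal sketch, assuming you have SELECT access to DBC.ColumnsV; the database name below is hypothetical, while the table and column names are taken from the question:

-- CharType 1 = LATIN, 2 = UNICODE (0 = not a character column)
SELECT ColumnName, CharType
FROM   DBC.ColumnsV
WHERE  DatabaseName = 'your_database'   -- hypothetical; the database that holds sku_table
  AND  TableName    = 'sku_table'
  AND  ColumnName   = 'description';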

Read "application/octet-stream; charset=binary" file in Informatica

I am trying to load a file in Informatica that has the file type 'application/octet-stream; charset=binary', and I kept the code page 'MS Windows Latin 1 (ANSI), superset of Latin 1'.
Original data
1^\MI^\IN^\123^\Y^^
After replacing the hex character (0x1C, shown as ^\) using the tr command:
1|MI|IN|123|Y
After file processing, the hex characters ^\ are added at the end of each line, as below. How do I deal with these extra characters in Informatica?
"1|MI|IN|123|Y^\^\^\^\^\"

Sqlite3 .recover replaces accents with ?

I have a corrupted SQLite file. If I try to open it, I get this error:
Error: database disk image is malformed
I tried to run .recover, like this:
sqlite3 corrupted.db ".recover" | sqlite3 recovered.db
I was able to open recovered.db, and almost everything is there, but the accented characters are replaced with ??, for example:
Original: Pes jí bagetu
Restored: Pes j?? bagetu
It is unlikely, but possible, that the original file is not utf8 encoded.
If I run .dump and produce an SQL file with INSERT statements, I don't experience this issue: the accented characters are displayed correctly in dump.sql. But in that case less than half of the database was exported, so I prefer .recover.
What am I doing wrong? The correct characters are there, as the dump shows, but for some reason they are lost during the export.
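One way to narrow this down is to compare the raw bytes that .recover wrote with what the dump shows, and to check the text encoding recorded in the file header. A rough sketch to run in the sqlite3 shell against recovered.db; the table and column names are hypothetical:

PRAGMA encoding;                         -- the text encoding declared in the database header (e.g. UTF-8)
SELECT hex(sentence)                     -- 'í' should appear as C3AD in UTF-8; 3F3F means the bytes really are '??'
FROM   phrases
WHERE  sentence LIKE '%bagetu%';

If the recovered rows already contain 3F3F, the characters were lost during the recovery itself rather than in whatever tool displays them.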

How to read in more than 250,000 characters XML CLOB field from Oracle into R or SAS?

I need to read in this XML CLOB column from an Oracle table. I tried a simple read like below:
xmlbefore <- dbGetQuery(conn, "select ID, XML_TXT from XML_table")
But I can only read in about 225,000 characters. When I compare with the sample XML file, it only reads in maybe 2/3 or 3/4 of the entire field.
I assume R has a limitation of maybe 225,000 characters, and SAS has even less, perhaps only about 1,000 characters.
How can I read in the entire field with all characters (I think it is about 250,000-270,000)?
SAS data set variables have a 32k character limit, and macro variables 64k. Lua variables in SAS, however, have no limit (other than memory), so you can read your entire XML file into a single variable in one go.
PROC LUA is available from SAS 9.4M3 (check &sysvlong for details). If you have an earlier version of SAS, you can still process your XML by parsing it a single character at a time (RECFM=N).
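Another option is to split the CLOB into chunks on the Oracle side so that each fetched value stays under the client's string limit, and then concatenate the chunks per ID. A hedged sketch against the table from the question; 2,000 characters per chunk keeps each piece comfortably under the 4,000-byte limit for VARCHAR2 values in SQL, even with multibyte data:

SELECT t.id,
       LEVEL AS chunk_no,
       DBMS_LOB.SUBSTR(t.xml_txt, 2000, (LEVEL - 1) * 2000 + 1) AS xml_chunk
FROM   xml_table t
CONNECT BY (LEVEL - 1) * 2000 < DBMS_LOB.GETLENGTH(t.xml_txt)
       AND PRIOR t.id = t.id
       AND PRIOR SYS_GUID() IS NOT NULL   -- forces a new value each level so CONNECT BY does not cycle
ORDER BY t.id, chunk_no;

In R (or SAS), the result can then be reassembled per ID by concatenating xml_chunk in chunk_no order.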

Why does one AL32UTF8 character not display as I-acute, yet another one displays as tilde-N?

My Oracle 11g is configured with AL32UTF8
NLS_CHARACTERSET AL32UTF8
Why does the tilde-N display as tilde-N in the second record, but the acute-I and K do not display as acute-I and K in the first record?
Additional Information:
The hex code for the acute-I in the dump is CD.
When I take the HEX code from the dump and convert it using UNISTR(), the character displays with the accent.
select unistr('\0052\0045\0059\004B\004A\0041\0056\00CD\004B') as hex_to_unicode
from dual;
This is probably more an issue with whatever client you are using to display the results than with your database. What are you using?
You can check whether the database results are correct using the DUMP function. If the value in your table has the correct byte sequence for your database character set, you're good.
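For example, a minimal sketch (the table and column names here are hypothetical):

-- 1016 = dump in hexadecimal (16) plus the character set name (1000)
SELECT name,
       DUMP(name, 1016) AS name_dump
FROM   your_table;

If the dump shows a well-formed UTF-8 byte sequence for character set AL32UTF8, the problem is on the client side.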
Edit:
OK, I'm pretty sure your data is bad. You're talking about
LATIN CAPITAL LETTER I WITH ACUTE, which is Unicode code point U+00CD. That is not the same as byte 0xCD. You're using database character set AL32UTF8, which uses UTF-8 encoding. The correct UTF-8 encoding for the U+00CD character is the two-byte sequence 0xC38D.
What you have is UTF-8 byte sequence 0xCD4B, which I'm pretty sure is invalid.
The Oracle UNISTR function takes the code point in UCS-2 encoding, which is roughly the same as UTF-16, not UTF-8.
Demonstration here: http://sqlfiddle.com/#!4/7e9d1f/1
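As a quick check along the same lines (a sketch; TO_CHAR is used here to convert UNISTR's national-character-set result into the AL32UTF8 database character set):

-- Expected to show the two-byte UTF-8 encoding c3,8d for U+00CD,
-- rather than the single byte cd seen in the bad data.
SELECT DUMP(TO_CHAR(UNISTR('\00CD')), 1016) AS i_acute_bytes
FROM   dual;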
