Read UTF-8 XML with MSXML 4.0 - asp-classic

I have a problem with classic ASP / VBScript trying to read a UTF-8 encoded XML file with MSXML. The file is encoded correctly; I can see that with all other tools.
Constructed XML example:
<?xml version="1.0" encoding="UTF-8"?>
<itshop>
<Product Name="Backup gewünscht" />
</itshop>
If I try to do this in ASP...
Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set ts = fso.OpenTextFile("input.xml", FOR_READING)
XML = ts.ReadAll
ts.Close
Set ts = nothing
Set fso = Nothing
Set myXML = Server.CreateObject("Msxml2.DOMDocument.4.0")
myXML.loadXML(XML)
Set DocElement = myXML.documentElement
Set ProductNodes = DocElement.selectNodes("//Product")
Response.Write ProductNodes(0).getAttribute("Name")
' ...
... and Name contains special characters (German umlauts, to be specific), the bytes of the umlaut's two-byte code get re-encoded, so I end up with two totally crappy nonsense characters. What should be "ü" becomes "Ã¼", which is FOUR bytes in my output, not two (correct UTF-8) or one (ISO-8859-#).
What am I doing wrong? Why is MSXML thinking that the input is ISO-8859-# so that it tries to convert it to UTF-8?

Set ts = fso.OpenTextFile("input.xml", FOR_READING, False, True)
The last parameter is the "Unicode" flag.
OpenTextFile() has the following signature:
object.OpenTextFile(filename[, iomode[, create[, format]]])
where "format" is defined as
Optional. One of three Tristate values used to indicate the format of
the opened file. If omitted, the file
is opened as ASCII.
And Tristate is defined as:
TristateUseDefault   -2   Opens the file using the system default.
TristateTrue         -1   Opens the file as Unicode.
TristateFalse         0   Opens the file as ASCII.
And -1 happens to be the numerical value of True.
Anyway, better is:
Set myXML = Server.CreateObject("Msxml2.DOMDocument.4.0")
myXML.load("input.xml")
Why should you use a TextStream object to read in a file that MSXML can read perfectly well on its own?
The TextStream object also has no notion of the actual file encoding. The docs say "Unicode", but there is more than one way of encoding Unicode. The load() method of the MSXML object will be able to deal with all of them.
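To see where the four bytes in the question come from, here is a minimal sketch in Python (used purely for illustration, not the thread's VBScript). It assumes the TextStream decoded the file with a single-byte ANSI code page, taken here to be Windows-1252, and that the page output was then written as UTF-8; both are assumptions about the asker's server, not something the question confirms.

```python
# Reproduce the double encoding described in the question.
original = "\u00fc"                      # the umlaut "ü"
utf8_bytes = original.encode("utf-8")    # the two bytes stored in the XML file

# Reading those bytes with a single-byte code page (assumed: Windows-1252)
# turns each byte into a character of its own.
misread = utf8_bytes.decode("cp1252")

# Writing the misread string out as UTF-8 encodes both characters again.
double_encoded = misread.encode("utf-8")

print(utf8_bytes, ascii(misread), double_encoded)
```

Under those assumptions the two bytes become two characters and then four bytes, which matches the symptom. Letting MSXML load the file itself avoids the intermediate decoding step.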

Related

Write less than greater than to text file in asp vbscript

I am having a very hard time trying to write out an XML file from ASP VBScript to a text file using the Scripting.FileSystemObject. The issue is the less-than and greater-than characters. In order to add these characters to variables in the code I need to use &lt ; &gt ;. This causes a problem when writing the text. The results look like this:
&lt;copyright&gt;request copyright&lt;/copyright&gt;
&lt;lastBuildDate&gt;10/26/2012&lt;/lastBuildDate&gt;
proper format should be as such
<copyright>request copyright</copyright>
<lastBuildDate>10/26/2012</lastBuildDate>
Is there some sort of trick to converting those segments while writing the text file, or do I need to do something a bit more extravagant?
Thanks in advance!
When writing in the TextStream, you could just surround your variables with two calls to Replace
TextStream.Write Replace(Replace(myString, "&lt;", "<"), "&gt;", ">")
This way the variables aren't altered, but the written out data uses the right characters.
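The same idea as a small Python sketch (for illustration only; the function name `unescape_angle_brackets` is hypothetical). It converts just the two entities mentioned in the question and leaves everything else, including `&amp;`, untouched.

```python
def unescape_angle_brackets(text):
    """Turn the entities &lt; and &gt; back into literal angle brackets."""
    return text.replace("&lt;", "<").replace("&gt;", ">")


escaped = "&lt;copyright&gt;request copyright&lt;/copyright&gt;"
print(unescape_angle_brackets(escaped))
```

If the text can also contain `&amp;`, it would need to be converted last so that a sequence such as `&amp;lt;` is not unescaped twice.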
Try this:
Dim objStream
Set objStream = CreateObject("ADODB.Stream")
objStream.CharSet = "utf-8"
objStream.Open
objStream.WriteText "testdata"
objStream.SaveToFile "C:\test.txt", 2

FSO OpenTextFile with french characters

Using ASP's file system object (FSO), I'm trying to read a txt file with OpenTextFile that contains French characters (e.g. e and a with accents). Those characters come out wrong.
I tried specifying the format to TristateTrue to open the file as Unicode but to no avail.
I've been reading about using the ADO Stream object instead but I hoped there would be a way with FSO. Does anyone have any ideas?
Most likely the file is saved in UTF-8 encoding. The FileSystemObject does not handle UTF-8.
Either have the file saved as Unicode or use the ADODB.Stream object. The ADODB.Stream has a LoadFromFile method and does support UTF-8.
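Dim s
Dim stream : Set stream = CreateObject("ADODB.Stream")
stream.Type = 2 ' adTypeText
stream.CharSet = "UTF-8"
stream.Open ' the stream must be open before LoadFromFile
stream.LoadFromFile Server.MapPath("yourfile.txt")
s = stream.ReadText ' ADODB.Stream has ReadText, not ReadAll
stream.Close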
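If it is unclear which encoding the file was saved in, one rough check is to look for a byte order mark at the start of the file. The sketch below is in Python for illustration; `sniff_bom` is a hypothetical helper, and a file without a BOM (common for UTF-8) cannot be identified this way.

```python
import codecs


def sniff_bom(data):
    """Guess a codec name from a leading byte order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # UTF-8 with BOM
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"      # what the FSO documentation calls "Unicode"
    return "unknown"         # no BOM: could be UTF-8, an ANSI code page, ...


sample_utf8 = codecs.BOM_UTF8 + "\u00e9".encode("utf-8")
sample_utf16 = "\u00e9".encode("utf-16")   # Python prepends a BOM here

print(sniff_bom(sample_utf8), sniff_bom(sample_utf16), sniff_bom(b"plain"))
```

A result of "utf-16" suggests the TristateTrue flag should work; "utf-8-sig" or "unknown" points toward ADODB.Stream with an explicit CharSet.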

Fix Special Characters in String

I've got a program that in a nutshell reads values from a SQL database and writes them to a tab-delimited text file.
The issue is that some of the values in the database have special characters (TM, dash, ellipsis, etc.). When written to the text file, the formatting is lost and they come across as junk such as "â„¢" or "â€“".
When the value is viewed in the immediate window, before it is written to the txt file, everything looks fine. My guess is that this is an issue of encoding. But, I'm not real sure how to proceed, where to look, or what to look for.
Is this ASCII or UTF-8? If it's one of those, how do I correct it before it's written to the text file?
Here's how I build the text file (where feedStr is a StringBuilder)
objReader = New StreamWriter(filePath)
objReader.Write(feedStr)
objReader.Close()
The default encoding for StreamWriter is UTF8 (with no byte order mark). Your result file is ok, the question is what do you open it in afterwards? If you open it in a UTF8 capable text editor, the characters should look the way you want.
You can also write the text file in another encoding, for example iso-8859-1 (latin1)
objReader = New StreamWriter(filePath, false, Encoding.GetEncoding("iso-8859-1"))
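One caveat about the choice of target encoding, sketched in Python for illustration: the trade mark sign, en dash and ellipsis from the question do not exist in ISO-8859-1, but they do exist in Windows-1252. The variable names below are made up for the example.

```python
text = "\u2122 \u2013 \u2026"   # trade mark sign, en dash, ellipsis

# UTF-8 output opened in a Windows-1252 viewer: three junk characters each.
seen_as_cp1252 = text.encode("utf-8").decode("cp1252")

# ISO-8859-1 has no slot for these characters, so they are lost.
as_latin1 = text.encode("iso-8859-1", errors="replace")

# Windows-1252 maps each of them to a single byte.
as_cp1252 = text.encode("cp1252")

print(ascii(seen_as_cp1252), as_latin1, as_cp1252)
```

So if the consuming program expects a single-byte encoding, code page 1252 is probably the safer target for this particular data than iso-8859-1.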

ASP Readline non-standard Line Endings

I'm using the ASP Classic ReadLine() function of the File System Object.
All has been working great until someone made their import file on a Mac in TextEdit.
The line endings aren't the same, and ReadLine() reads in the entire file, not just 1 line at a time.
Is there a standard way of handling this? Some sort of page directive, or setting on the File System Object?
I guess that I could read in the entire file, and split on vbLF, then for each item, replace vbCR with "", then process the lines, one at a time, but that seems a bit kludgy.
I have searched all over for a solution to this issue, but the solutions are all along the lines of "don't save the file with Mac[sic] line endings."
Anyone have a better way of dealing with this problem?
There is no way to change the behaviour of ReadLine; it will only recognize CRLF as a line terminator. Hence the only simple solution is the one you have already described.
Edit
Actually there is another library that ought to be available out of the box on an ASP server that might offer some help. That is the ADODB library.
The ADODB.Stream object has a LineSeparator property that can be assigned 10 or 13 to override the default CRLF it would normally use. The documentation is patchy because it doesn't describe how this can be used with ReadText. You can get the ReadText method to return the next line from the stream by passing -2 as its parameter.
Take a look at this example:-
Dim sLine
Dim oStreamIn : Set oStreamIn = CreateObject("ADODB.Stream")
oStreamIn.Type = 2 '' # Text
oStreamIn.Open
oStreamIn.CharSet = "Windows-1252"
oStreamIn.LoadFromFile "C:\temp\test.txt"
oStreamIn.LineSeparator = 10 '' # Linefeed
Do Until oStreamIn.EOS
sLine = oStreamIn.ReadText(-2)
'' # Do stuff with sLine
Loop
oStreamIn.Close
Note that by default the CharSet is Unicode, so you will need to assign the correct CharSet being used by the file if it's not Unicode. I use the word "Unicode" in the sense that the documentation does, which actually means UTF-16. One advantage here is that ADODB.Stream can handle UTF-8, unlike the Scripting library.
BTW, I thought Macs used a CR for line endings? It's the Unix file format that uses LFs, isn't it?
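For completeness, the read-everything-and-split approach from the question can be made independent of the line-ending style by normalizing first. A minimal Python sketch for illustration (`split_lines` is a hypothetical name):

```python
def split_lines(text):
    """Split text on CRLF, lone CR, or lone LF."""
    # Replace CRLF first so it is not counted as two separate line breaks.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.split("\n")


mixed = "a\rb\r\nc\nd"
print(split_lines(mixed))
```

Note that a trailing line break yields an empty final element, which the caller may want to drop.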

Character Support Issue - How to Translate Higher ASCII Characters to Lower ASCII Characters

So I have an ASP.Net (vb.net) application. It has a textbox and the user is pasting text from Microsoft Word into it. So things like the long dash (charcode 150) are coming through as input. Other examples would be the smart quotes or accented characters. In my app I'm encoding them in xml and passing that to the database as an xml parameter to a sql stored procedure. It gets inserted in the database just as the user entered it.
The problem is the app that reads this data doesn't like these characters. So I need to translate them into the lower ascii (7bit I think) character set. How do I do that? How do I determine what encoding they are in so I can do something like the following. And would just requesting the ASCII equivalent translate them intelligently or do I have to write some code for that?
Also maybe it might be easier to solve this problem in the web page to begin with. When you copy the selection of characters from Word it puts several formats in the clipboard. The straight text one is the one I want. Is there a way to have the html textbox get that text when the user pastes into it? Do I have to set the encoding of the web page somehow?
System.Text.Encoding.ASCII.GetString(System.Text.Encoding.GetEncoding(1251).GetBytes(text))
Code from the app that encodes the input into xml:
Protected Function RequestStringItem( _
        ByVal strName As System.String) As System.String
    Dim strValue As System.String
    strValue = Me.Request.Item(strName)
    If Not (strValue Is Nothing) Then
        RequestStringItem = strValue.Trim()
    Else
        RequestStringItem = ""
    End If
End Function
' I get the input from the textboxes into an array like this
m_arrInsertDesc(intIndex) = RequestStringItem("txtInsertDesc" & strValue)
m_arrInsertFolder(intIndex) = RequestInt32Item("cboInsertFolder" & strValue)
' create xml file for inserts
strmInsertList = New System.IO.MemoryStream()
wrtInsertList = New System.Xml.XmlTextWriter(strmInsertList, System.Text.Encoding.Unicode)
' start document and add root element
wrtInsertList.WriteStartDocument()
wrtInsertList.WriteStartElement("Root")
' cycle through inserts
For intIndex = 0 To m_intInsertCount - 1
    ' if there is an insert description
    If m_arrInsertDesc(intIndex).Length > 0 Then
        ' if the insert description is of the appropriate length
        If m_arrInsertDesc(intIndex).Length <= 96 Then
            ' add element to xml
            wrtInsertList.WriteStartElement("Insert")
            wrtInsertList.WriteAttributeString("insertdesc", m_arrInsertDesc(intIndex))
            wrtInsertList.WriteAttributeString("insertfolder", m_arrInsertFolder(intIndex).ToString())
            wrtInsertList.WriteEndElement()
        ' if insert description is too long
        Else
            m_strError = "ERROR: INSERT DESCRIPTION TOO LONG"
            Exit Function
        End If
    End If
Next
' close root element and document
wrtInsertList.WriteEndElement()
wrtInsertList.WriteEndDocument()
wrtInsertList.Close()
' when I add the xml as a parameter to the stored procedure I do this
cmdAddRequest.Parameters.Add("#insert_list", OdbcType.NText).Value = System.Text.Encoding.Unicode.GetString(strmInsertList.ToArray())
How big is the range of these input characters? 256? (each char fits into a single byte). If that's true, it wouldn't be hard to implement a 256 value lookup table. I haven't toyed with BASIC in years, but basically you'd DIM an array of 256 bytes and fill in the array with translated values, i.e. the 'a'th byte would get 'a' (since it's OK as is) but the 150'th byte would get a hyphen.
This seems to work for long dash to short dash and smart quotes to regular quotes, since my HTML pages have the following content type. But it converts all the accented characters to question marks, which is not what the text version of the clipboard has. So I'm closer; I just think I have the target encoding wrong.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
System.Text.Encoding.ASCII.GetString(System.Text.Encoding.GetEncoding("iso-8859-1").GetBytes(m_arrFolderDesc(intIndex)))
Edit: Found the correct target encoding for my purposes which is 1252.
System.Text.Encoding.GetEncoding(1252).GetString(System.Text.Encoding.GetEncoding("iso-8859-1").GetBytes(m_arrFolderDesc(intIndex)))
If you convert to a non-unicode character set, you will lose some characters in the process. If the legacy app reading the data doesn't need to do any string transformations, you might want to consider using UTF-7, and converting it back once it gets back into the unicode world - this will preserve all special characters.
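The lookup-table idea suggested earlier in the thread can be sketched as follows, in Python for illustration rather than VB.NET. The table `WORD_PUNCTUATION` and the function `to_ascii` are hypothetical names, the table covers only a handful of common Word characters, and stripping accents via Unicode decomposition is an addition of this sketch, not something the answers above propose.

```python
import unicodedata

# Typographic characters Word tends to insert, mapped to plain ASCII.
WORD_PUNCTUATION = {
    0x2013: "-",     # en dash (byte 150 in Windows-1252)
    0x2014: "-",     # em dash
    0x2018: "'",     # left single quote
    0x2019: "'",     # right single quote
    0x201C: '"',     # left double quote
    0x201D: '"',     # right double quote
    0x2026: "...",   # ellipsis
}


def to_ascii(text):
    """Best-effort conversion of Word-style text to 7-bit ASCII."""
    text = text.translate(WORD_PUNCTUATION)
    # Split accented letters into base letter + combining mark, drop the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII becomes "?".
    return stripped.encode("ascii", errors="replace").decode("ascii")


print(to_ascii("\u201ccaf\u00e9\u201d \u2013 na\u00efve\u2026"))
```

This is lossy by design: characters with no ASCII counterpart still end up as question marks, so it only makes sense when the consuming application truly cannot accept anything beyond 7-bit ASCII.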