ASP ReadLine non-standard Line Endings - asp-classic

I'm using the ASP Classic ReadLine() function of the File System Object.
All has been working great until someone made their import file on a Mac in TextEdit.
The line endings aren't the same, and ReadLine() reads in the entire file, not just 1 line at a time.
Is there a standard way of handling this? Some sort of page directive, or setting on the File System Object?
I guess that I could read in the entire file, and split on vbLF, then for each item, replace vbCR with "", then process the lines, one at a time, but that seems a bit kludgy.
I have searched all over for a solution to this issue, but the solutions are all along the lines of "don't save the file with Mac[sic] line endings."
Anyone have a better way of dealing with this problem?

There is no way to change the behaviour of ReadLine; it will only recognize CRLF as a line terminator. Hence the only simple solution is the one you have already described.
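For reference, a minimal sketch of that read-all-and-split approach (the file path is just illustrative; it normalises CRLF and lone CR endings to LF before splitting):
Dim fso, ts, content, lines, sLine
Set fso = CreateObject("Scripting.FileSystemObject")
Set ts = fso.OpenTextFile("C:\temp\test.txt", 1) ' 1 = ForReading
content = ts.ReadAll
ts.Close
content = Replace(content, vbCrLf, vbLf) ' Windows CRLF -> LF
content = Replace(content, vbCr, vbLf)   ' classic Mac CR -> LF
lines = Split(content, vbLf)
For Each sLine In lines
    ' Process sLine here
Next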
Edit
Actually there is another library that ought to be available out of the box on an ASP server that might offer some help. That is the ADODB library.
The ADODB.Stream object has a LineSeparator property that can be assigned 10 or 13 to override the default CRLF it would normally use. The documentation is patchy, though: it doesn't describe how this can be used with ReadText. You can get the ReadText method to return the next line from the stream by passing -2 as its parameter.
Take a look at this example:-
Dim sLine
Dim oStreamIn : Set oStreamIn = CreateObject("ADODB.Stream")
oStreamIn.Type = 2 ' adTypeText
oStreamIn.Open
oStreamIn.CharSet = "Windows-1252"
oStreamIn.LoadFromFile "C:\temp\test.txt"
oStreamIn.LineSeparator = 10 ' adLF (linefeed)
Do Until oStreamIn.EOS
    sLine = oStreamIn.ReadText(-2) ' adReadLine: read up to the next LineSeparator
    ' Do stuff with sLine
Loop
oStreamIn.Close
Note that by default the CharSet is Unicode, so you will need to assign the correct CharSet being used by the file if it's not Unicode. I use the word "Unicode" in the sense that the documentation does, which actually means UTF-16. One advantage here is that ADODB.Stream can handle UTF-8, unlike the Scripting library.
BTW, I thought Macs used a CR for line endings? It's the Unix file format that uses LFs, isn't it?

Related

Using com.opencsv.CSVReader on Windows stops reading lines prematurely

I have two files that are identical except for the line ending codes. The one that uses the newline (Linux/Unix) character works (reads all 550 rows of data), and the one that uses carriage return and line feed (Windows) stops returning lines after reading 269 lines. In both cases the data is read correctly up to the point where they stop.
If I run dos2unix on the file that fails, the resulting file works.
I would like to be able to read CSV files regardless of their origin. If I could at least detect that the file is in the wrong format before reading part of the data, that would be helpful.
Even if I could tell at any time in the middle of reading the file that it was not going to work, I could output an error.
My current state of reading half the file and terminating with no error is dangerous.
The problem is that under the covers openCSV uses a BufferedReader, which reads a line from the stream until it gets to the system's line.separator.
If you know beforehand what the line separator of the file is, then in your application just do a System.setProperty("line.separator", newLine), where newLine is either "\n" or "\r\n" based on the file you are about to parse. Or you can pass that in as a parameter.
If you want to detect the file's line separator automatically, create a method that takes the file you want, creates a BufferedReader, and reads a single line. If the last character is a '\r', then your system uses "\n" but you want to set it to "\r\n". Otherwise, if line.contains("\n") returns true, you are on a system that uses "\r\n" and you want to set it to "\n". Otherwise the system and the file you are reading have compatible line separators.
Just note that if you do change the system line separator, be sure to set it back after processing the file, in case your program processes multiple files.

Reading text files in Ada: Get_Line "reads" the byte-order mark as well

I'm trying to read a file line-by-line in Ada; it's an XML text file. I'm following the instructions here:
http://rosettacode.org/wiki/Read_a_file_line_by_line#Ada
However there's a problem that annoys me: the "Get_Line" function seems to be unaware of byte-order marks and reads them as part of the text itself, which means that when I read the lines, the first one will always start with some extra bytes that should not be there.
While removing the extra bytes manually from the string is no big deal, it seems strange to me that a function dedicated to text input/output is unaware of BOMs. There must be a way to read a text file in Ada without having to worry about this... is there?
Ada.Text_IO is specified to handle ISO-8859-1 encoded text, so ignoring a UTF-8 feature is the proper thing to do.
If Ada.Wide_Text_IO and Ada.Wide_Wide_Text_IO also pass the byte-order mark through when asked to read UTF-8 encoded text, then you should consider reporting it as a bug to GCC - but as there are quite a lot of implementation-defined details in the text I/O packages in Ada, you should be ready for a "won't fix" answer.
One possibility is using the stream attributes and making a UTF_8 file-type to handle the BOM reading-and-discarding.

Handle UTF-8 characters in Unix

I was trying to find a solution to my problem, and after looking at the forums I couldn't, so I'll explain my problem here.
We receive a CSV file from a client with some special characters, encoded as unknown-8bit. We convert this CSV file to XML using an awk script. With the XML file we make an API call to our system using UTF-8 as the default encoding. The response is an error with the following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as below:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road", because when I use UTF-8 encoding the "’" character comes through as a 0x92 byte.
Does anybody know how I can handle this?
Thanks in advance,
Sandra
Find out the actual encoding first; best would be to ask the sender.
If you cannot do so, and also for sanity-checking, the Unix command file is very useful for that (its man page shows more options).
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could just discard all non-ASCII characters, or replace them when encoding, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step, do what you wanted to do, in comforting knowledge that you now have valid UTF-8.
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know -- it would be whatever your current set-up is using) -- I would guess Windows-1252 which uses 0x92 for a curvy right single quote.

Restrict user input to characters in IBM System i 00280 code page

We need to restrict user input in a classic ASP web site to the characters allowed by the 00280 code page of IBM System i.
Is there a way to do it in a sane way besides having a (JavaScript|VBScript) function checking every character of an input string against a string of allowed characters?
A basic classic ASP function I thought of:
Function CheckInput(text, replacement)
    Dim output : output = ""
    Dim haystack : haystack = "abcd.. " ' Insert here the allowed characters.
    Dim i : i = 0
    For i = 1 To Len(text)
        Dim needle : needle = Mid(text, i, 1)
        If InStr(haystack, needle) = 0 Then
            needle = replacement
        End If
        output = output & needle
    Next
    CheckInput = output
End Function
Would - in my function - a RegExp be overkill?
The short answer to your first question is: no. To your second question: RegEx might not help you here, because not all RegEx implementations in browsers will support the characters you need to test, and neither does the VBScript version of RegEx.
Even using the code approach you are proposing would need some very careful thought. In order to be able to place the set of characters you want to support in a string literal, the codepage that you save the ASP file in would need to be one that covers all the characters needed, or alternatively you would need to use ChrW to build a string containing those characters from their code points.
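For illustration, a rough sketch of building such a haystack with ChrW - the code points listed here are only placeholders, not the actual repertoire of code page 00280:
Function BuildAllowedChars()
    ' Placeholder code points - substitute the characters actually allowed by IBM code page 00280.
    Dim codePoints, cp, haystack
    haystack = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ."
    codePoints = Array(224, 232, 233, 236, 242, 249, 163) ' à è é ì ò ù £ as examples
    For Each cp In codePoints
        haystack = haystack & ChrW(cp)
    Next
    BuildAllowedChars = haystack
End Function
The resulting string could then be used as the haystack in a function like CheckInput above.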
One slightly simpler approach would be to use JavaScript and have the page charset and codepage set to UTF-8. This would allow you to create a string literal containing any set of characters.
Since it is generally not considered secure to rely on browser validation, you should consider changing your IBM i (formerly OS/400) application interface to accept UCS-2 data, and perform any necessary validation and conversion at the server side.

How to add encoding information to the response stream in ASP.NET?

I have the following piece of code:
public void ProcessRequest (HttpContext context)
{
    context.Response.ContentType = "text/rtf; charset=UTF-8";
    context.Response.Charset = "UTF-8";
    context.Response.ContentEncoding = System.Text.Encoding.UTF8;
    context.Response.AddHeader("Content-disposition", "attachment;filename=lista_obecnosci.csv");
    context.Response.Write("ąęćżźńółĄŚŻŹĆŃŁÓĘ");
}
When I try to open the generated csv file, I get the following behavior:
In Notepad2 - everything is fine.
In Word - a conversion wizard opens and asks to convert the text. It suggests UTF-8, which is somewhat OK.
In Excel - I get a real mess. None of those Polish characters can be displayed.
I wanted to write those special encoding-information characters in front of my string, i.e.
context.Response.Write((char)0xef);
context.Response.Write((char)0xbb);
context.Response.Write((char)0xbf);
but that won't do any good. The response stream is treating that as normal data and converts it to something different.
I'd appreciate help on this one.
I ran into the same problem, and this was my solution:
context.Response.BinaryWrite(System.Text.Encoding.UTF8.GetPreamble());
context.Response.Write("ąęćżźńółĄŚŻŹĆŃŁÓĘ");
What you call "encoding-information" is actually a BOM. I suspect each of those "characters" is getting encoded separately. To write the BOM manually, you have to write it as three bytes, not three characters. I'm not familiar with the .NET I/O classes, but there should be a method available to you that takes a byte or byte[] parameter and writes them directly to the file.
By the way, the UTF-8 BOM is optional; in fact, its use is discouraged by the Unicode Consortium. If you don't have a specific reason for using it, save yourself some hassle and leave it out.
EDIT: I just remembered you can also write the actual BOM character, '\uFEFF', and let the encoder handle it:
context.Response.Write('\uFEFF');
I think the problem is with Excel, based on "Microsoft Excel mangles Diacritics in .csv files". To prove this, copy your sample output string of ąęćżźńółĄŚŻŹĆŃŁÓĘ, paste it into a test file using your favorite editor, and save it as a UTF-8 encoded .csv file. Open it in Excel and you will see the same issues.
The answer from Alan Moore translated to VB:
Context.Response.Write(ChrW(&HFEFF)) ' the BOM character U+FEFF
