After finally getting my XmlReader to work correctly on a project at work, I am now getting parsing errors when creating new Reader objects for certain XML files. The one that keeps occurring is an error that appears to be triggered by a hyphen (-). This baffles me, because if I manually go in and replace that one character with something else (like an underscore), the file reads fine, even though there are hyphens elsewhere in the document that I left unchanged.
So, unless there is an explanation and fix for this (maybe some XmlReaderSettings? I have yet to use any, so I don't know what they are capable of), what is the best method to cycle through every character and replace the ones that will not parse correctly?
This program will run automatically once per day on a daily-added XML file, and run time is not an issue.
Edit: Error Message:
System.Xml.XmlException: An error occurred while parsing EntityName. Line 2896, position 89.
Code:
FN = Path.GetFileName(file1) ' GetFileName already returns a String
xmlFile = XmlReader.Create(Path.Combine(My.Settings.Local_Meter_Path, FN), New XmlReaderSettings())
ds.ReadXml(xmlFile) ' the XmlException above is thrown here
Dim dt As DataTable = ds.Tables(13)
Dim filecreatedate As String = IO.File.GetLastWriteTime(file1).ToString() ' GetLastWriteTime returns a Date
If the problem occurs at ONLY ONE HYPHEN in the entire file, even though the file contains more hyphens, the problem may be related to:
1) The hyphen is really not a hyphen but a control character, or is accompanied by a hidden control character.
2) The line has other interesting things in it, like an ampersand ("&"), which can cause problems in strings. Are you sure the problem is the hyphen?
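Before sanitizing anything, it may be worth dumping the raw character codes around the spot the exception reports (line 2896, position 89). A small diagnostic sketch, assuming a placeholder file name:
Imports System
Imports System.IO

Module InspectChar
    Sub Main()
        ' "meter.xml" is a placeholder for the actual file path
        Dim lines() As String = File.ReadAllLines("meter.xml")
        Dim badLine As String = lines(2896 - 1) ' XmlException line numbers are 1-based
        ' Dump a few characters around position 89 with their code points
        For i As Integer = Math.Max(0, 88 - 5) To Math.Min(badLine.Length - 1, 88 + 5)
            Console.WriteLine("pos {0}: '{1}' = U+{2:X4}", i + 1, badLine(i), AscW(badLine(i)))
        Next
    End Sub
End Module
If the "hyphen" turns out to be U+002D after all, look for a nearby bare ampersand instead; that is the classic trigger for the EntityName error.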
I am facing a problem with the fwrite function from the data.table package in R. It appends in the wrong way, and I end up with something like:
user ref version status Type DataExtraction
user1 2.02E+11 1 Pending 1 No
user2 2.02E+11 1 Saved 2 No"user3" 2.01806E+11 1 Saved NB No
I am using the function as follows:
library(data.table)
fwrite(Save, "~/Downloads/Register.csv", append = TRUE, sep = ",", quote = TRUE)
Reproducible example:
fwrite(data.table(user = "user3",
                  ref = "204094093",
                  version = "2",
                  status = "Pending",
                  Type = "1",
                  DataExtraction = "No"),
       "~/Downloads/test.csv", sep = ",", append = FALSE)
fwrite(data.table(user = "user3",
                  ref = "204094093",
                  version = "2",
                  status = "Pending",
                  Type = "1",
                  DataExtraction = "No"),
       "~/Downloads/test.csv", sep = ",", append = TRUE)
I'm not sure if this isolates the problem, but it seems that if I manually change something in the .csv file (for instance, rename DataExtraction to Extraction), the wrong-way appending occurs.
Does someone know what is going wrong?
Thanks!
When I run your example code I have no problems with the behavior; the file comes out as desired. Based on your comments about manually changing what is in the file, and what the undesired output looks like, here is what I believe is happening.
When fwrite() (and many other similar IO functions) writes to a file, each line ends with a line break (in R, this is generally represented as \n). This is desired, so that subsequent lines of data indeed appear on subsequent lines of the file. Generally this also means that when you open the file in a text editor, there will be a blank line at the very end, since this reflects the line break ending the last line that was written (different editors handle this differently, though).
So, I suspect that when you go in and manually edit the file in your editor, you are somehow losing that last line break. That means that when you write again with append = TRUE, there is no line break at the end of the file, and therefore you get the undesired behavior of two lines of data on a single line of the file.
The solution would be to find a way to prevent your manual editing from deleting that last line break character. Barring that, you can write a single line break character to the file from R, e.g. with the cat() function.
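For example, a one-line sketch (using the Register.csv path from your question) that restores the missing line break so the next append starts on a fresh line:
# Append a bare line break; the next fwrite(..., append = TRUE) then starts a new line
cat("\n", file = "~/Downloads/Register.csv", append = TRUE)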
I've been using asd = readcsv(filename) to read a CSV file in Julia.
The first row of the CSV file contains strings which describe the column contents; the rest of the data is a mix of integers and floats. readcsv reads the numbers just fine, but only reads the first four and a half string entries.
After that, it renders "". If I ask the REPL to display asd[1,:], it tells me it is a 1x65 Array{Any,2}.
The fifth column in the first row of the CSV file (this seems to be the entry it chokes on) is APP #1 bias voltage [V], but asd[1,5] is just APP . So it looks to me as though readcsv has choked on the "#" character.
I tried using the quotes=false keyword argument in readcsv, but it didn't help.
I used to use xlsread in Matlab and it worked fine.
Has anybody out there seen this sort of thing before?
The comment character in Julia is #, and this applies when reading delimited text files as well.
Luckily, the readcsv() and readdlm() functions have an optional argument to help in these situations.
You should try readcsv(filename; comment_char = '/').
Of course, this assumes that you don't have any / characters in your file. If you do, then you'll have to change that '/' to some character that doesn't appear in your data.
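A minimal sketch of the fix (file name is hypothetical; this assumes a pre-1.0 Julia where readcsv lives in Base and forwards keywords to readdlm — on current Julia you'd use DelimitedFiles.readdlm instead):
# '/' replaces '#' as the comment character, so "APP #1 bias voltage [V]" is kept intact
asd = readcsv("data.csv"; comment_char = '/')
header = asd[1, :]   # full first row, all 65 column labels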
I have a flat file that I need to take and insert a carriage return into every 410 characters. I know this sounds weird, but for whatever reason my work was given several huge flat files from a clearinghouse, and I need to parse them out.
There is nothing that separates what is supposed to be each new line, but each record is exactly 410 characters, so I can't even search for anything specific and then do it.
There are 21 files total, each about 12-13mb.
I have asked for a CSV file, and they are unable to provide that.
I am trying to see if Notepad++ will do a character count so that I can just hit Enter after every 410th character.
Also I am trying to see if I can do this in Java.
Any help you all can provide would be appreciated.
In Notepad++ you can search for the regular expression (.{410}) and replace it with \1\r.
Notepad++ has swallowed characters on me when doing regex-based search and replace operations in large files, so I would try this on one file first, then remove all the carriage returns again and compare the result's size to the original size, just to make sure that nothing got swallowed during the replace operation.
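Since you also mention Java: here is a minimal sketch, with hypothetical input/output file names, assuming each file fits in memory (at 12-13 MB they comfortably do) and that the records are plain single-byte text:
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FixedWidthSplitter {
    public static void main(String[] args) throws IOException {
        // Read the whole flat file into one string
        String text = new String(Files.readAllBytes(Paths.get("input.flat")), "US-ASCII");
        // Write one 410-character record per line
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("output.txt"))) {
            for (int i = 0; i < text.length(); i += 410) {
                out.write(text, i, Math.min(410, text.length() - i));
                out.newLine();
            }
        }
    }
}
Running this once per input file gives you line-delimited output you can then parse normally; the Math.min guard just keeps a short final record from throwing an exception.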
For an XML document containing unescaped special characters, I have seen several options for working around the problem. What is the fastest/smallest method to either ignore the invalid characters or replace them with the correct format?
The data is going into a database, and the column that the potentially funny characters end up in (location address) is the least important one.
I'm getting the EntityName parsing error at the dataset.ReadXml command.
Here is my code:
FN = Path.GetFileName(file1) ' GetFileName already returns a String
xmlFile = XmlReader.Create(Path.Combine(My.Settings.Local_Meter_Path, FN), New XmlReaderSettings())
ds.ReadXml(xmlFile) ' the EntityName exception is thrown here
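One sketch of the "replace them with the correct format" route, assuming the usual culprit behind the EntityName error is a raw ampersand in the data (the file name and the overwrite-in-place choice are placeholders, not from the question):
Imports System.IO
Imports System.Text.RegularExpressions

Module SanitizeXml
    Sub Main()
        Dim xmlPath As String = "meter.xml" ' hypothetical path
        Dim text As String = File.ReadAllText(xmlPath)
        ' Escape any & that does not already start a valid entity or character reference
        text = Regex.Replace(text, "&(?!amp;|lt;|gt;|quot;|apos;|#\d+;|#x[0-9A-Fa-f]+;)", "&amp;")
        File.WriteAllText(xmlPath, text)
    End Sub
End Module
Run this over the file before XmlReader.Create and the reader should no longer see malformed entities; since the affected column (location address) is low-priority, a blunt rewrite like this costs little.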
I'm using the ASP Classic ReadLine() function of the File System Object.
All has been working great until someone made their import file on a Mac in TextEdit.
The line endings aren't the same, and ReadLine() reads in the entire file, not just 1 line at a time.
Is there a standard way of handling this? Some sort of page directive, or setting on the File System Object?
I guess that I could read in the entire file, and split on vbLF, then for each item, replace vbCR with "", then process the lines, one at a time, but that seems a bit kludgy.
I have searched all over for a solution to this issue, but the solutions are all along the lines of "don't save the file with Mac[sic] line endings."
Anyone have a better way of dealing with this problem?
There is no way to change the behaviour of ReadLine; it will only recognize CRLF as a line terminator. Hence the only simple solution is the one you have already described.
Edit
Actually there is another library that ought to be available out of the box on an ASP server that might offer some help. That is the ADODB library.
The ADODB.Stream object has a LineSeparator property that can be assigned 10 or 13 to override the default CRLF it would normally use. The documentation is patchy here: it doesn't describe how this interacts with ReadText, but you can get the ReadText method to return the next line from the stream by passing -2 (adReadLine) as its parameter.
Take a look at this example:
Dim sLine
Dim oStreamIn : Set oStreamIn = CreateObject("ADODB.Stream")
oStreamIn.Type = 2                 ' adTypeText
oStreamIn.Open
oStreamIn.CharSet = "Windows-1252"
oStreamIn.LoadFromFile "C:\temp\test.txt"
oStreamIn.LineSeparator = 10       ' adLF (line feed)
Do Until oStreamIn.EOS
    sLine = oStreamIn.ReadText(-2) ' -2 = adReadLine
    ' Do stuff with sLine
Loop
oStreamIn.Close
Note that by default the CharSet is Unicode, so you will need to assign the correct CharSet for the file if it is not Unicode. I use the word "Unicode" in the sense that the documentation does, which actually means UTF-16. One advantage here is that the ADODB Stream can handle UTF-8, unlike the Scripting library.
BTW, I thought Macs used a CR for line endings? It's the Unix file format that uses LFs, isn't it?
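Classic Mac OS did use a bare CR; Unix and OS X (including TextEdit) use a bare LF, which is why LineSeparator = 10 fits this case. If you can't be sure which variant a given import file will use, an untested sketch like this (same ADODB.Stream object, my addition) can sniff the first line break and pick the matching separator:
Dim oProbe : Set oProbe = CreateObject("ADODB.Stream")
oProbe.Type = 2                    ' adTypeText
oProbe.Open
oProbe.CharSet = "Windows-1252"
oProbe.LoadFromFile "C:\temp\test.txt"
Dim sChunk : sChunk = oProbe.ReadText(512) ' peek at the first 512 characters
oProbe.Close

Dim lSep : lSep = -1               ' adCRLF, the default
If InStr(sChunk, vbCrLf) > 0 Then
    lSep = -1                      ' adCRLF (Windows); test this first, since LF alone would also match
ElseIf InStr(sChunk, vbLf) > 0 Then
    lSep = 10                      ' adLF (Unix / OS X)
ElseIf InStr(sChunk, vbCr) > 0 Then
    lSep = 13                      ' adCR (classic Mac OS)
End If
' Then assign lSep to oStreamIn.LineSeparator before the read loop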