For an XML document containing escape characters, I have seen several options for working around them. What is the fastest/smallest method to either ignore the invalid characters or replace them with the correctly escaped form?
The data is going into a database, and the column receiving the potentially problematic characters (location address) is the least important one.
I'm getting the EntityName parsing error at the dataset.ReadXml command.
Here is my code:
FN = Path.GetFileName(file1).ToString()
xmlFile = XmlReader.Create(Path.Combine(My.Settings.Local_Meter_Path, FN), New XmlReaderSettings())
ds.ReadXml(xmlFile)
I'm having problems with editing a JSON file and saving the results in a usable form.
My starting point is: modify-json-file-using-some-condition-and-create-new-json-file-in-r
In fact I want to do something even simpler and it still doesn't work! I'm using the jsonlite package.
An equivalent sample would look like this ...
$Apples
$Apples$Origin
$Apples$Origin$Id
[1] 2615
$Apples$Origin$season
[1] "Fall"
$Oranges
$Oranges$Origin
$Oranges$Origin$Id
[1] 2615
$Oranges$Origin$airportLabel
[1] "Orange airport"
$Oranges$Shipping
$Oranges$Shipping$ShipperId
[1] 123
$Oranges$Shipping$ShipperLabel
[1] "Brighter Orange"
I read the file, make some changes and save the resulting file back to disk. Nothing could be simpler, right?
json_list = read_json(path = "../documents/dummy.json")
json_list$Apples$Origin$Id = 1234
json_list$Oranges$Origin$Id = 4567
json_list$Oranges$Shipping$ShipperLabel = "Suntan Blue"
json_modified <- toJSON(json_list, pretty = TRUE)
write_json(json_modified, path = "../documents/dummy_new.json")
json_list appears as character format under the RStudio file type column.
json_modified appears as json format under the RStudio file type column.
Why this difference?
Now, if I use the original file it works, but the modified file fails. The JSON format checks out and I can't see any errors.
The real file is bigger than the example above but the method I've used is the same.
Am I doing something wrong in the way I edit or save the file?
I'm really new to JSON and this is really frustrating!
Any ideas?
Thanks
In the absence of reproducible data, I can diagnose at least one potential problem.
Background
Within the jsonlite package, there exist functions that are mutual inverses:
jsonlite::fromJSON() converts from raw text (in JSON format) to R objects.
jsonlite::toJSON() converts from R objects to raw text (in JSON format).
Now this raw text (txt) might be
a JSON string, URL or file
As for jsonlite::read_json() and jsonlite::write_json(), they are also a pair of mutual inverses, which are like the former pair
except [that] they explicitly distinguish between path and literal input, and do not simplify by default.
That is, the latter are simply designed to handle file(path)s rather than strings of raw text.
So toJSON(fromJSON(txt = ...)) should return unchanged the text passed to txt, just as write_json(read_json(path = ...)) should write a file identical to that passed to path.
In short, toJSON() belongs with fromJSON(), while write_json() belongs with read_json().
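A tiny round trip illustrates the pairing (a sketch using a toy JSON string and a tempfile() path, both invented for illustration):
library(jsonlite)

txt <- '{"Apples": {"Origin": {"Id": 2615, "season": "Fall"}}}'

# Text pair: fromJSON() parses JSON text into R objects;
# toJSON() renders R objects back into JSON text.
obj  <- fromJSON(txt)
txt2 <- toJSON(obj, auto_unbox = TRUE)

# File pair: write_json() writes R objects out as a JSON file;
# read_json() reads a JSON file back into R objects.
path <- tempfile(fileext = ".json")
write_json(obj, path, auto_unbox = TRUE)
obj2 <- read_json(path)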
The Problem
However, you have added a spurious step by mingling toJSON() with read_json() and write_json():
json_list = read_json(...)
# ...
json_modified <- toJSON(json_list, ...) # SPURIOUS STEP
# ...
write_json(json_modified, ...)
You see, write_json() already converts "to JSON", so toJSON() is wholly unnecessary. Indeed, toJSON() actually sabotages the process, since its textual return value is passed on (in json_modified) to write_json(), which expects (a structure of) R objects rather than text.
The Fix
Once you're done modifying json_list, just go straight to writing it:
json_list = read_json(path = "../documents/dummy.json")
json_list$Apples$Origin$Id = 1234
# Further modifications...
write_json(json_list, path = "../documents/dummy_new.json", pretty = TRUE)
I am trying to compare a string (in memory) to the contents of a file to see if they are the same. Boring details on motivation are below the question if anyone cares.
My confusion is that when I hash file contents, I get a different result than when I hash the string.
library(readr)
library(digest)
# write the string to the file
the_string <- "here is some stuff"
the_file <- "fake.txt"
readr::write_lines(the_string, the_file)
# both of these functions (predictably) give the same hash
tools::md5sum(the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398"
digest(file = the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398"
# now read it back to a string and get something different
back_to_a_string <- readr::read_file(the_file)
# "here is some stuff\n"
digest(back_to_a_string)
# "03ed1c8a2b997277100399bef6f88939"
# add a newline because that's what write_lines did
orig_with_newline <- paste0(the_string, "\n")
# "here is some stuff\n"
digest(orig_with_newline)
# "03ed1c8a2b997277100399bef6f88939"
What I want to do is just digest(orig_with_newline) == digest(file = the_file) to see if they're the same (they are) but that returns FALSE because, as shown, the hashes are different.
Obviously I could either read the file back into a string with read_file or write the string to a temp file, but both of those seem a bit silly and hacky. I suppose both are actually fine solutions; I really just want to understand why this is happening so that I can better understand how the hashing works.
Boring details on motivation
The situation is that I have a function that will write a string to a file, but if the file already exists then it will error unless the user has explicitly passed .overwrite = TRUE. However, if the file exists, I would like to check whether the string about to be written to the file is in fact the same thing that's already in the file. If this is the case, then I will skip the error (and the write). This code could be called in a loop and it will be obnoxious for the user to continually see this error that they are about to overwrite a file with the same thing that's already in it.
Short answer: I think you need to set serialize=FALSE. Supposing that the file doesn't contain the extra newline (see below),
digest(the_string,serialize=FALSE) == digest(file=the_file) ## TRUE
(serialize has no effect on the file= version of the command)
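To see what serialize is doing, compare the two calls directly (a minimal sketch; the tempfile() is only for illustration):
library(digest)

s <- "here is some stuff\n"

digest(s)                     # hashes R's *serialization* of the character vector
digest(s, serialize = FALSE)  # hashes the raw characters themselves

# Only the serialize = FALSE hash matches hashing the file's bytes:
f <- tempfile()
cat(s, file = f)              # cat() writes the string as-is, no added newline
digest(s, serialize = FALSE) == digest(file = f)  # TRUE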
Dealing with newlines
If you read ?write_lines, it only says
sep: The line separator ... [information about defaults for different OSes]
To me, this seems ambiguous as to whether the separator will be added after the last line or not. (You don't expect a "comma-separated list" to end with a comma ...)
On the other hand, ?base::writeLines is a little more explicit,
sep: character string. A string to be written to the connection
after each line of text.
If you dig down into the source code of readr you can see that it uses
output << na << sep;
for each line of output, i.e. it's behaving the same way as writeLines.
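You can confirm the trailing separator quickly (a sketch):
f <- tempfile()
readr::write_lines("here is some stuff", f)
readr::read_file(f)  # "here is some stuff\n" -- sep is appended after the final line too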
If you really just want to write the string to the file with no added nonsense, I suggest cat():
identical(the_string, { cat(the_string,file=the_file); readr::read_file(the_file) }) ## TRUE
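Putting the pieces together for your motivating use case, the overwrite guard might look something like this (a sketch; the function name is invented, and the .overwrite flag follows your description):
write_if_changed <- function(the_string, path, .overwrite = FALSE) {
  if (file.exists(path) && !.overwrite) {
    # Skip the error (and the write) when the content is already identical.
    same <- digest::digest(the_string, serialize = FALSE) ==
            digest::digest(file = path)
    if (same) return(invisible(FALSE))
    stop("File already exists; pass .overwrite = TRUE to replace it.")
  }
  cat(the_string, file = path)  # cat() adds no trailing newline
  invisible(TRUE)
}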
After finally getting my XmlReader to work correctly on a project at work, I am now getting certain parsing errors when trying to create new Reader objects for certain XML files. For instance, the one that keeps occurring is an error trying to parse a hyphen (-). This slightly baffles me because when I manually go in and replace that character with something else (like an underscore), the file reads fine - even when there are hyphens elsewhere in the document that are not changed.
So, unless there is an explanation that fixes this (maybe some XmlReaderSettings? I have yet to use any, so I don't know what they are capable of), what is the best syntax/method to cycle through every character and replace the ones that will not parse correctly?
This program will run automatically once per day on a daily-added XML file, and the length of the run-time is not an issue.
Edit: Error Message:
System.Xml.XmlException: An error occurred while parsing EntityName. Line 2896, position 89.
Code:
FN = Path.GetFileName(file1).ToString()
xmlFile = XmlReader.Create(Path.Combine(My.Settings.Local_Meter_Path, FN), New XmlReaderSettings())
ds.ReadXml(xmlFile)
Dim dt As DataTable = ds.Tables(13)
Dim filecreatedate As String = IO.File.GetLastWriteTime(file1)
If the problem occurs with ONLY ONE HYPHEN in the entire file, even though the file contains more hyphens, the problem may be related to:
1) The hyphen is really not a hyphen but a control character, or it is accompanied by a hidden control character.
2) The line may contain other interesting things, like an ampersand ("&"), which must be escaped as &amp; in XML and which produces exactly this "parsing EntityName" error when left bare. Are you sure the problem is the hyphen?
I have a large dataset in a dbf file and would like to export it to a csv file.
Thanks to SO, I already managed to do that part smoothly.
However, when I try to import it into R (the environment I work in), it combines some characters together, making some rows much longer than they should be and consequently breaking the whole database. In the end, whenever I import the exported csv file I get only half of the db.
I think the main problem is with quotes in string fields, but specifying quote="" in R didn't help (and it usually does).
I've searched for any question on how to deal with quotes when exporting from Visual FoxPro, but couldn't find an answer. I wanted to test this, but my computer throws an error stating that I don't have enough memory to complete the operation (probably due to the large db).
Any help will be highly appreciated. I've been stuck on this problem of exporting from the dbf into R for long enough, have searched everything I could, and am desperately looking for a simple way to import a large dbf into my R environment without any bugs.
(In R I checked whether the imported file has problems, and indeed most of the columns have much longer nchar values than they should, while the number of rows has halved. Reading the db with read.csv("file.csv", quote="") didn't help. Reading with data.table::fread() returns the error
Expected sep (',') but '0' ends field 88 on line 77980:
But according to verbose=TRUE this function reads the right number of rows, whereas read.csv imports only about 1.5 million:
Count of eol after first data row: 2811729 Subtracted 1 for last eol
and any trailing empty lines, leaving 2811728 data rows)
When exporting with TYPE DELIMITED you have some control on the VFP side as to how the export formats the output file.
To change the field delimiter (the character wrapped around each field) from quotes to, say, a pipe character you can do:
copy to myfile.csv type delimited with "|"
so that will produce something like:
|A001|,|Company 1 Ltd.|,|"Moorfields"|
You can also change the separator from a comma to another character:
copy to myfile.csv type delimited with "|" with character "#"
giving
|A001|#|Company 1 Ltd.|#|"Moorfields"|
That may help in parsing on the R side.
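For instance, on the R side you could then tell read.table about both characters (a sketch, assuming the export above):
dat <- read.table("myfile.csv", sep = "#", quote = "|",
                  stringsAsFactors = FALSE)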
There are three ways to delimit a string in VFP: the normal single and double quote characters, plus square brackets [ ]. So to strip quotes out of character fields myfield1 and myfield2 in your DBF file you could do this in the Command Window:
close all
use myfile
copy to mybackupfile
select myfile
replace all myfield1 with chrtran(myfield1,["'],"")
replace all myfield2 with chrtran(myfield2,["'],"")
and repeat for other fields and tables.
You might have to write code to do the export, rather than simply using the COPY TO ... DELIMITED command.
SELECT thedbf
mfld_cnt = AFIELDS(mflds)
fh = FCREATE(m.filename)  && FOPEN() only opens an existing file; FCREATE() creates a new one
SCAN
FOR aa = 1 TO mfld_cnt
mcurfld = 'thedbf.' + mflds[aa, 1]
mvalue = &mcurfld
** Or you can use:
mvalue = EVAL(mcurfld)
** manipulate the contents of mvalue, possibly based on the field type
DO CASE
CASE mflds[aa, 2] = 'D'
mvalue = DTOC(mvalue)
CASE mflds[aa, 2] $ 'CM'
** Replace characters that are giving you problems in R
mvalue = STRTRAN(mvalue, ["], '')
OTHERWISE
** Etc.
ENDCASE
= FWRITE(fh, mvalue)
IF aa # mfld_cnt
= FWRITE(fh, [,])
ENDIF
ENDFOR
= FWRITE(fh, CHR(13) + CHR(10))
ENDSCAN
= FCLOSE(fh)
Note that I'm using [ ] characters to delimit strings that include commas and quotation marks. That helps readability.
* Create a comma-delimited file with no quotes around the character fields:
copy to myfile.csv type delimited with ""  && the delimiter is "" (2 double quotes), i.e. nothing
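With the quotes stripped at export time, disabling quoting on the R side should then behave (a sketch):
dat <- read.csv("myfile.csv", quote = "", stringsAsFactors = FALSE)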
I am studying mathematical computation and I am completely stuck on this task! I don't even know how to go about starting it!
Write a program in Fortran that can parse a single line of well-formed HTML or XML markup, taking input on a single line (guaranteed not to exceed 80 characters in total) like
<tag>lots of lovely text</tag>
where
tag might be anything from 1 to 37 ASCII characters and will not contain spaces
text could contain spaces and be anything from 1 to 73 characters in length
so that the program outputs one of two lines:
tag : text if the two occurrences of tag match inside <...> and
syntax error if anything else is input.
Any help is hugely appreciated!
There are a number of intrinsic functions for working with strings that may be helpful.
result = index(string, substring) - returns the position of the first occurrence of substring within string, counting from one. (Fortran 77)
result = scan(string, set) - scans a string for any of the characters in a set of characters. (Fortran 95)
result = verify(string, set) - returns the position of the first character in string that is not in set, or zero if every character is present. (Fortran 95)
There are a few user-contributed string tokenization functions on the Fortran Wiki that might be helpful:
delim, strtok, and find_field. Also, FLIBS includes some string manipulation and tokenization routines that might be useful as examples.
Finally, there are a number of existing open-source XML parsers written in Fortran: xmlf90 and xml-fortran. Looking at the source code for these libraries should be helpful.