XML schema in R

I am quite a newbie with XML. I have been using the XML package in R to parse XML content into R objects. I have to deal with nearly 1 TB of XML data, and it took me around 5 hours to parse just 2.4 GB. I know that an XML schema is used to generate XML. Is there a better way than xmlParse to convert the XML into data, or some way to use the schemas to read the XML and pull the values back out as raw data?
I now have 5 XML schemas along with the XML data (so I gather it is fairly complex XML). The namespaces are:
xmlns:nxce="http://tfm.faa.gov/tfms/NasXCoreElements"
xmlns:mmd="http://tfm.faa.gov/tfms/MessageMetaData"
xmlns:nxcm="http://tfm.faa.gov/tfms/NasXCommonMessages"
xmlns:idr="http://tfm.faa.gov/tfms/TFMS_IDRS"
xmlns:xis="http://tfm.faa.gov/tfms/TFMS_XIS"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://tfm.faa.gov/tfms/TFMS_XIS
sample data: http://www.fly.faa.gov/ASDI/asdidocs/asdi_sample_data.zip
I want to extract all the flightManagementInformation data using SAX.
Thanks in advance.

Using schemas won't improve the performance of XML loading: they tell you something about the expected structure of the parsed XML, but have nothing to do with the parsing process itself.
You need to use a different parser, if one is available in R (as suggested by Martin), or convert the XML data into something that R can handle more easily using some other language.
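One option that stays within the XML package is xmlEventParse(), which runs a SAX-style, event-driven parse and never builds the whole tree in memory. A rough sketch (the file name is illustrative, and the element name is taken from the question; the real element may carry a namespace prefix):
library(XML)

records <- list()

handlers <- list(
  startElement = function(name, attrs) {
    # collect the attributes of every flightManagementInformation element
    if (name == "flightManagementInformation") {
      records[[length(records) + 1]] <<- attrs
    }
  }
)

invisible(xmlEventParse("asdi_sample.xml", handlers = handlers))
Because the document is streamed, memory use stays roughly constant regardless of file size; the trade-off is that you assemble the output (here, a list of attribute vectors) yourself.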

Related

How to read a BLOB with qt-type compression?

I have files (about 100k of them, to be specific) containing data from weather radars; each file is one radar image. It is a mosaic of data from several radars, forming a reflectivity map over the whole country.
The files have the extension .cmax and I need to convert them to something more useful (e.g. an array of reflectivities) for further use.
I have asked the data provider how to read those files. They responded:
The standard product format in our system (.cmax) is the internal format of the company that provides us with the software. It consists of an XML part and a binary part. It can be read as a stream of bytes: first parse the initial bytes as XML, then treat the rest (the BLOBs) as binary data compressed with the "qt" method. You need to unpack them using a library that supports this compression mode. In general, you have to work a little, but it can be done in virtually any programming language.
The main issue is with the binary part of the data. I have tried decompressing it with zlib (which is what googling "qt compression" turns up) and reading it as binary data in C++. Neither worked. It also doesn't seem reasonable to me to pull in Qt just to read that data.
The file begins with those lines:
<product version="5.44.5" datetime="2017-01-01T18:00:00" datatype="dBZ" type="cmax" name="CMAX" owner="">
<data time="18:00:00" date="2017-01-01">
Then there are radar specifications and image details (active radars, min and max reflectivity, etc.). The XML part ends with:
</product>
<!-- END XML -->
<BLOB blobid="0" size="79617" compression="qt">(here are lots of binary data)</BLOB>
I'm looking for a way (a tool?) to convert that binary data; for example, it could be the library the provider mentioned.
Looking at the details, this is most likely the Leonardo (Selex/Gematronik) Rainbow5 format. zlib is the right library for decompression, but there are some tricks to it. A Python reader is implemented in the wradlib library (https://github.com/wradlib); maybe you can adapt that code. Disclaimer: I'm one of the wradlib devs.
Did you try simply using the qUncompress() function? https://doc.qt.io/qt-5/qbytearray.html#qUncompress
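If you would rather stay out of Qt, the documented behaviour of qCompress() is that its output is an ordinary zlib (RFC 1950) stream prefixed with a 4-byte big-endian length field, so stripping that prefix and inflating the remainder should recover the raw bytes. A hedged sketch in R, with an illustrative file name and offset handling that may need adjusting for real .cmax files:
# read the whole .cmax file as raw bytes
raw_bytes <- readBin("radar_image.cmax", what = "raw", n = file.size("radar_image.cmax"))

# locate the first BLOB payload: everything between the '>' closing the
# <BLOB ...> tag and the matching </BLOB> (a newline right after the tag may
# also need to be skipped)
open_end  <- grepRaw(">", raw_bytes, offset = grepRaw("<BLOB", raw_bytes, fixed = TRUE), fixed = TRUE)
close_tag <- grepRaw("</BLOB>", raw_bytes, fixed = TRUE)
payload   <- raw_bytes[(open_end + 5):(close_tag - 1)]   # +5 skips '>' plus the 4-byte length prefix

# inflate; memDecompress() is assumed here to accept a plain zlib stream
reflectivity_bytes <- memDecompress(payload, type = "gzip")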

Retrieving data from large xml file using node path in R

I am new to XML, and many of the XML examples I have found do not look like my file. I want to extract data from a large XML file using R (a dummy XML file is below). I know that even though R has memory limitations, extracting specific nodes from a large XML file is possible with xmlEventParse() from the R XML package, provided the path to the target nodes is specified properly. My final output, as a data frame, should have columns reflecting the nodes N9:Shareholder, N5:IdentifierElement and N2:NameElement. Thanks for your help.
XML code (dummy file values): FOO LIMITED, 120801, Companies Register
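A rough sketch of the xmlEventParse() branch approach mentioned above, assuming each output row lives under an N9:Shareholder element (node names are taken from the question, the file name is illustrative, and the namespace prefixes and nesting may need adjusting for the real file):
library(XML)

rows <- list()

shareholder_branch <- function(node) {
  # 'node' is the fully built Shareholder subtree; pull the two child values
  rows[[length(rows) + 1]] <<- data.frame(
    identifier = xmlValue(node[["IdentifierElement"]]),
    name       = xmlValue(node[["NameElement"]]),
    stringsAsFactors = FALSE
  )
}

invisible(xmlEventParse("big_file.xml",
                        handlers = list(),
                        branches = list(Shareholder = shareholder_branch)))

result <- do.call(rbind, rows)
Only the branch element is ever turned into a tree, so the rest of the file streams through with constant memory.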

Error while parsing a very large (10 GB) XML file in R, using the XML package

Context
I'm currently working on a project involving OSM (OpenStreetMap) data. In order to manipulate geographic objects, I have to convert the data (an OSM XML file) into an R object. The osmar package lets me do this, but it fails to parse the raw XML data.
The error
Error in paste(file, collapse = "\n") : result would exceed 2^31-1 bytes
The code
require(osmar)
osmar_obj <- get_osm("anything", source = osmsource_file("my filename"))
Inside the get_osm function, the code calls ret <- xmlParse(raw), which triggers the error after a few seconds.
The question
How am I supposed to read a large XML file (here 10 GB), given that I have 64 GB of memory?
Thanks a lot!
This is the solution I came up with, even though it is not 100% satisfying:
1. Transform the .osm file by removing every newline (but the last) in your shell.
2. Run the exact same code as before, skipping the paste() that is no longer needed (since you just did the equivalent in the shell).
3. Profit :)
Obviously, I'm not very happy with it, because modifying the data file in the shell is more of a trick than an actual solution :(
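In code, the workaround amounts to something like this sketch (file names are illustrative; it assumes a Unix shell with tr available, and that pointing xmlParse() at the flattened file on disk, rather than pasting its lines into one R string, sidesteps the 2^31-1 byte limit):
library(XML)

# 1. flatten the file in the shell by dropping the newlines
system("tr -d '\\n' < my_filename.osm > my_filename_flat.osm")

# 2. parse the flattened file straight from disk instead of rebuilding the
#    document text with paste() inside get_osm()
doc <- xmlParse("my_filename_flat.osm")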

read.sas7bdat unable to read compressed file

I am trying to read a .sas7bdat file in R. When I use the command
library(sas7bdat)
read.sas7bdat("filename")
I get the following error:
Error in read.sas7bdat("county2.sas7bdat") : file contains compressed data
I do not have experience with SAS, so any help will be highly appreciated.
Thanks!
According to the sas7bdat vignette [vignette('sas7bdat')], COMPRESS=BINARY (or COMPRESS=YES) is not currently supported as of 2013 (and this was the vignette active on 6/16/2014 when I wrote this). COMPRESS=CHAR is supported.
These are basically internal compression routines, intended to make file sizes smaller. They're not nearly as good as gzip or similar, but SAS applies them transparently while you write SAS programs. Obviously they change the file format significantly, hence the lack of support so far.
If you have SAS, you need to write these to an uncompressed dataset.
options compress=no;
libname lib '//drive/path/to/files';
data lib.want;
set lib.have;
run;
That's the simplest way (of many), assuming you have a libname defined as lib as above, and that you change have and want to the correct names (have should usually be the filename without its extension; want can be anything logical made of A-Z or underscores and 32 or fewer characters).
If you don't have SAS, you'll have to ask your data provider to make the data available uncompressed, or in a different format. If you're getting this from a PUDS somewhere on the web, you might post where you're getting it from and there might be a way to help you identify an uncompressed source.
This admittedly is not a pure R solution, but in many situations (e.g. if you aren't on a PC and don't have the ability to write the SAS file yourself) the other solutions posted are not workable.
Fortunately, Python has a module (https://pypi.python.org/pypi/sas7bdat) which supports reading compressed SAS data sets - it's certainly better using this than needing to acquire SAS if you don't already have it. Once you extract the file and save it to text via Python, you can then access it in R.
from sas7bdat import SAS7BDAT
import pandas as pd

InFileName = "myfile.sas7bdat"
OutFileName = "myfile.txt"

with SAS7BDAT(InFileName) as f:
    df = f.to_data_frame()

df.to_csv(path_or_buf = OutFileName, sep = "\t", encoding = 'utf-8', index = False)
The haven package can read compressed SAS-files:
library(haven)
df <- read_sas("sasfile.sas7bdat")
But only SAS files that were compressed using compress=char, not compress=binary.
So haven will be able to read this SAS file:
data output.compressed_data_char (compress=char);
set inputdata;
run;
But not this SAS file:
data output.compressed_data_binary (compress=binary);
set inputdata;
run;
https://cran.r-project.org/package=haven
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001002773.htm
"RevoScaleR" is a good package to read SAS data sets (compressed or uncompressed).You can use rxImport function of this package. Below is the example
Importing library
library(RevoScaleR)
Reading data
R_df_name <- rxImport("fake_path/file_name.sas7bdat")
The speed of this function is far better than haven/sas7bdat/sas7bdat.parso. I hope this helps anyone who struggles to read SAS data sets in R.
Cheers!
I found R to be the easiest for this kind of challenge, especially with compressed sas7bdat files; three simple lines:
library(haven)
data <- read_sas("yourfile.sas7bdat")
and then write it out to CSV:
write.csv(data,"data.csv")

Can SAS still read or create the combination of {.dat fixed-column ascii data file, .sas syntax file}, or is it obsolete?

In the past I have used the excellent SAScii package in R to read in this type of data: {.dat fixed-column data file + the corresponding .sas "syntax" file}. I want to be quite precise about that because there is no end of ambiguity surrounding phrases like "SAS file". These .dat files contain only integers, and the .sas files specify both the way to parse the columns and the way the integers represent the values in the actual data (this feature is sometimes called the "codebook".) I have found very good data in that format (i.e. in the form of the pair of files {.dat, .sas}) from places like Minnesota Population Center's IPUMS https://usa.ipums.org/usa/, and built up a lot of tools to analyze it using R and SAScii.
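For concreteness, reading such a pair with SAScii looks roughly like this (the file names are illustrative):
library(SAScii)

# the fixed-column data file and the .sas script that describes its columns
dat_file   <- "ipums_extract.dat"
sas_script <- "ipums_extract.sas"

x <- read.SAScii(dat_file, sas_script)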
Now I have access to SAS itself, but would still like to re-use some of my tools and techniques. However, I can find no reference in SAS to data like that ({fixed-column data in .dat, syntax file in .sas}). Has that format been entirely superseded within SAS (perhaps by the SAS7BDAT format)? Or perhaps the {.dat, .sas} format was never used within SAS? The reason I ask is that, now that I have access to SAS and so much data in SAS7BDAT format, I would like to be able to export some of it in {.dat, .sas} format for use with my own tools.
Thanks very much, and cheers - Ed
I don't think this is something built into SAS. You could, however, write such a program pretty easily.
First off, Chris Hemedinger has written something that basically does this (it creates datalines rather than a .dat file, but that shouldn't be too hard to modify if you know .NET, or you could adapt the R module to accept datalines). That is discussed and available here. The title of the post is "Turn your data set into a data step program". This is roughly equivalent to the SQL Server task that creates "Create Table" code out of a table. This would only work in Enterprise Guide, although you should be able to do roughly the same thing in a standalone .NET program.
Second, you can easily write something like this in Base SAS. Creating the datalines is easy; there are numerous ways to write the result out to a file.
For a CSV, for example, you can do this.
ods csv file="c:\temp\mydata.csv";
proc print data=mydata;
run;
ods csv close;
If you're going to write a flat file, you might as well generate the input/output .sas first, since it can be almost the same code. You can query dictionary.columns to generate both the input and the output code: create a table with the variable names, lengths, and formats for each variable, then process it in a data step, advancing the start position by the length of each variable (so each variable begins right after the previous one ends). If you need formats for your R project, proc format cntlout=<datasetname> will generate a dataset containing those formatted-value translations, and you can write that out in whatever format you need as well.
