Huge XML parsing/converting using R or RDotnet

I have an XML file of 780 GB (yes, really: a 5 GB pcap file that was converted to XML).
The name of the XML file is tmp.xml.
I am trying to run this operation in RStudio:
require("XML")
xmlfile <<- xmlRoot(xmlParse("tmp.xml"))
When I try to do this in R I get errors (memory allocation failure, R session aborted, etc.).
Is there any benefit to using RDotnet instead of regular R?
Is there any way to do this in R at all?
Do you know of another strong tool to convert this huge XML file to CSV or some easier format?
Thank you!
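A DOM parse like xmlParse() has to hold the entire tree in RAM, which can never work for a 780 GB file. The XML package's event-driven (SAX) parser streams the file and fires a callback per element, so memory use stays roughly constant. A minimal sketch, assuming the pcap-derived XML has one repeating element per packet - the element name "packet" and its attributes here are placeholders to adapt to your file:

```r
library(XML)  # install.packages("XML") if needed

# Tiny stand-in for the real tmp.xml -- a 780 GB file would be
# streamed exactly the same way, one element at a time.
writeLines(c('<capture>',
             '<packet len="60" proto="tcp"/>',
             '<packet len="1500" proto="udp"/>',
             '</capture>'), "tmp_demo.xml")

out <- file("tmp_demo.csv", "w")
writeLines("len,proto", out)       # column names are assumptions
handler <- function(name, attrs, ...) {
  if (name == "packet") {          # "packet" is a guessed element name
    writeLines(paste(attrs[c("len", "proto")], collapse = ","), out)
  }
}
# xmlEventParse() never builds the full tree; it only calls the
# handlers as elements are encountered, then discards them.
invisible(xmlEventParse("tmp_demo.xml",
                        handlers = list(startElement = handler)))
close(out)
```

Each row is appended to the CSV as it is seen, so the output can be arbitrarily large without exhausting memory.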

Related

Getting Data in and out of Rhipe [R + Hadoop]

I have been trying out the rhipe and RHadoop (rmr, rhdfs, rhbase, etc.) series of packages.
With both rhipe and rmr I can ingest/read data stored in CSV or text files. Both of them more or less support creating new file formats, but I find rmr has more support for it, or at least more resources to get started. This matters when one plans to do some processing on raw data stored in HDFS and finally wants to store it back to HDFS in a format recognizable by other Hadoop components like Hive, Impala, etc. Both packages can write in their own native format, recognizable only by the package itself; rmr additionally supports a few other formats.
For reference related to rmr have a look into: https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/getting-data-in-and-out.md
However, for rhipe I could not find any such document, and the various approaches I tried all failed.
So my question is: how can I write back into text (for example; any other recognizable format would also work) after reading a file stored in HDFS and running rhwatch in rhipe?
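Not a confirmed answer, but a sketch of the shape such a job might take. The rhwatch()/rhfmt()/rhcollect() names are from the Rhipe API, but whether your Rhipe build exposes a "text" output format through rhfmt() is an assumption to verify (e.g. against rhoptions()), and the HDFS paths are placeholders:

```r
library(Rhipe)
rhinit()  # connect the R session to the Hadoop cluster

# Assumed layout: input text lives in /tmp/in on HDFS, results go to
# /tmp/out. The map expression just upper-cases each line as a stand-in
# for real processing; rhcollect() emits a key/value pair per record.
map <- expression({
  lapply(seq_along(map.values), function(i) {
    rhcollect(map.keys[[i]], toupper(map.values[[i]]))
  })
})
result <- rhwatch(map = map,
                  input  = rhfmt("text", folder = "/tmp/in"),
                  output = rhfmt("text", folder = "/tmp/out"),
                  readback = FALSE)
```

If "text" is not among the formats your installation lists, the fallback is to read the native sequence-file output back with rhread() and write it out yourself.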

importing compressed csv into 'h2o' using r

The 'h2o' package is a fun machine-learning Java tool that is accessible via R; the R package for accessing it is called "h2o".
One of the input avenues is to tell 'h2o' where a CSV file is and let it upload the raw CSV. It can be more effective to just point at a folder and tell 'h2o' to import "everything in it" using the h2o.importFolder command.
Is there a way to point at a folder of gzip or bzip CSV files and get 'h2o' to import them?
According to this link (here), h2o can import compressed files; I just don't see a way to specify this for the importFolder approach.
Also, is it faster or slower to import the compressed form? If another program produces the output, does compressing it save time in the h2o import step, or is raw text better? Guidelines and performance best practices are appreciated.
as always, comments, suggestions, and feedback are solicited.
I took the advice of @screechOwl and asked on the 0xdata.atlassian.net board for h2o, and was given a clear answer, supplied by user "cliff":
Hi, yes H2O - when importing a folder - takes all the files in the folder; it unzips gzip'd or zip'd files as needed, and parses them all into one large CSV. All the files have to be compatible in the CSV sense - same number and kind of columns.
H2O does not currently handle bzip files.
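Since gzip'd files in a folder are picked up and decompressed automatically, the call itself can stay simple. A sketch, assuming a local h2o instance; the directory and the regex passed to the `pattern` argument of h2o.importFolder (which filters file names) are placeholders:

```r
library(h2o)  # needs a local Java runtime; h2o.init() starts the backend JVM
h2o.init()

# Import every gzip'd CSV in the folder; h2o unzips during parsing and
# combines the compatible files into one frame. Path and pattern are
# assumptions -- adapt them to your layout.
frame <- h2o.importFolder(path = "/data/logs", pattern = ".*\\.csv\\.gz$")
```

Note the answer above: bzip2 files would have to be recompressed as gzip (or decompressed) before import.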

R integration with Tableau

I am facing difficulty integrating R with Tableau.
Initially, when I created a calculated field, Tableau asked for the Rserve package in R and would not allow me to drag the field to the worksheet. I have installed this package, but it still shows an error saying:
"Error occurred while communicating with the Rserve service. Tableau is unable to connect to the service. Verify that the server is running and that you have access privileges."
Any inputs? Thank you.
You need to start Rserve. If you have successfully installed the Rserve package, simply run this (in RGui, RStudio, or wherever you run R scripts):
> library(Rserve)
> Rserve()
You can test your connection to Rserve in Tableau under Help > Settings and Performance > Manage R Connection.
As of Tableau 9, you can use *.rdata files with Tableau. Tableau 9 will read the first item stored in the *.rdata file. Just open an *.rdata file under "Statistical Files" in the Tableau intro screen.
To do this:
save(myDataframe, file = "Myfile.rdata")
This will save the file with the dataframe stored in it. You can save as many items as you want, but Tableau will only read the first. It can read vectors and variables as well, if they are in the first item. Note that rdata files compress data quite a bit; I recently compressed 900 MB down to 25 MB. However, Tableau will still need to decompress it to use it, so be careful about memory.
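A minimal end-to-end illustration of the save/load round trip; the object and file names are placeholders. Note that `file =` must be named - passing the filename positionally is what caused the error in the original snippet:

```r
# Save a data frame the way Tableau 9 expects: the FIRST object stored
# in the .rdata file is the one Tableau will read.
myDataframe <- data.frame(x = 1:3, y = c("a", "b", "c"))
save(myDataframe, file = "Myfile.rdata")

# load() restores the saved objects into the workspace and returns
# their names, in the order they were stored.
rm(myDataframe)
restored <- load("Myfile.rdata")
```

If you save several objects in one call, e.g. save(a, b, file = ...), only `a` would be visible to Tableau.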

Reading LabVIEW TDMS files with R

As part of a transition from MATLAB to R, I am trying to figure out how to read TDMS files created with National Instruments LabVIEW using R. TDMS is a fairly complex binary file format (http://www.ni.com/white-paper/5696/en/).
Add-ins exist for Excel and OpenOffice (http://www.ni.com/white-paper/3727/en/), and I could build something in LabVIEW to do the conversion, but I am looking for a solution that would let me read the TDMS files directly into R. This would allow us to test out the use of R for certain data processing requirements without changing what we do earlier in the data acquisition process. Having a simple process would also reduce the barriers to others trying out R for this purpose.
Does anyone have any experience with reading TDMS files directly into R, that they could share?
This is far from supporting all TDMS specifications, but I started a port of the Python npTDMS package to R here: https://github.com/msuefishlab/tdmsreader and it has been tested in the context of a Shiny app here
You don't say if you need to automate the reading of these files using R, or just convert the data manually. I'm assuming you or your colleagues don't have any access to LabVIEW yourselves otherwise you could just create a LabVIEW tool to do the conversion (and build it as a standalone application or DLL, if you have the professional development system or app builder - you could run the built app from your R code by passing parameters on a command line).
The document on your first link refers to (a) add-ins for OpenOffice Calc and for Excel, which should work for a manual conversion and which you might be able to automate using those programs' respective macro languages, and (b) a C DLL for reading TDMS - would it be possible for you to use one of those?
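If you do go the route of a standalone LabVIEW-built converter, driving it from R is a one-liner with system2(). A sketch, where "tdms2csv" and its command-line arguments are entirely hypothetical - substitute whatever tool you actually build:

```r
# Invoke an external TDMS-to-CSV converter from R. system2() returns
# the tool's exit status (0 = success by convention), so the result
# can be checked before reading the CSV back in.
convert_tdms <- function(infile, outfile, tool = "tdms2csv") {
  system2(tool, args = c(shQuote(infile), shQuote(outfile)))
}

# With the real tool on your PATH:
# status <- convert_tdms("run01.tdms", "run01.csv")
# if (status == 0) data <- read.csv("run01.csv")
```

This keeps the LabVIEW dependency isolated to one machine that has the run-time installed, while the analysis stays in R.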

Apache log file format analysis by R

I was trying to analyze web log files in R. I am comfortable dealing with dates and bytes - wherever numeric data is present - but I fail to deal with the strings.
From the log file (a log file in CSV format), I want to find a particular user (with the help of the IP and agent fields) and their total time spent on the web page.
There are numerous libraries for this kind of analysis, although I could find none in R. A Google search for "parse apache logfile" yielded a library in Perl, and "python parse apache logfile" yields the Scratchy library. Both rely on regular expressions to parse the contents of the file.
From here there are two ways to deal with the Apache logfile:
Call Perl or Python from R, either using a direct link or using a system call (this is simpler).
Take the idea from the Perl or Python library and use it to implement R versions of the functions. This will take a lot of time.
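That said, the regular-expression approach those libraries use works in base R too, with no Perl or Python round trip. A sketch for one line in Apache's "combined" log format (the field names assigned at the end are my own labels):

```r
# Parse one Apache "combined" log line with a single regex; each
# capture group pulls out one field.
line <- '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "-" "Mozilla/4.08"'
pattern <- paste0(
  '^(\\S+) \\S+ (\\S+) \\[([^]]+)\\] ',        # IP, user, timestamp
  '"([A-Z]+) (\\S+)[^"]*" (\\d{3}) (\\d+|-)',  # method, path, status, bytes
  ' "([^"]*)" "([^"]*)"'                       # referer, user agent
)
m <- regmatches(line, regexec(pattern, line))[[1]]
fields <- setNames(as.list(m[-1]),              # m[1] is the full match
                   c("ip", "user", "time", "method", "path",
                     "status", "bytes", "referer", "agent"))
```

Applied over readLines() on the raw access log, this yields one record per request, which can then be aggregated per IP/agent pair.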
You refer to a CSV file, but I think the libraries above work with the original Apache log text file, so I'd use those rather than your CSV file.
In addition, this SO post mentions an answer by @doug where he states that he has created some functions to visualize Apache logfile data parsed by Python. Maybe you could send him a message or mail and see if he is willing to share the code.
Logfile analysis in R is an interesting topic we had before, you can find our discussion right here. Maybe this discussion might also help you to adjust to the SO etiquette in order to get better feedback (not to take anything away from yours, Paul).
