Importing a compressed CSV into 'h2o' using R

The 'h2o' package is a fun machine-learning Java tool that is accessible via R; the R package for accessing it is called "h2o".
One of the input avenues is to tell 'h2o' where a CSV file is and let it upload the raw CSV. It can be more effective to simply point at a folder and tell 'h2o' to import everything in it using the h2o.importFolder command.
Is there a way to point 'h2o' at a folder of gzip or bzip compressed CSV files and get it to import them?
According to this link (here), h2o can import compressed files; I just don't see how to specify this for the importFolder approach.
Is it faster or slower to import the compressed form? If another program produces the output files, do I save time in the h2o import step if they are compressed, or if they are raw text? Guidelines and performance best practices are appreciated.
As always, comments, suggestions, and feedback are solicited.

I took the advice of @screechOwl and asked on the 0xdata.atlassian.net board for h2o, and was given a clear answer, supplied by user "cliff":
Hi, yes: when importing a folder, H2O takes all the files in the folder; it unzips gzip'd or zip'd files as needed and parses them all into one large CSV. All the files have to be compatible in the CSV sense: same number and kind of columns.
H2O does not currently handle bzip files.
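In R that looks roughly like the sketch below (mine, not part of cliff's answer); the folder path "data/csv_gz" and the file pattern are placeholders for wherever the gzipped CSVs live.

library(h2o)
h2o.init()

# Point h2o at the folder; gzip'd/zip'd CSVs are decompressed during import
# and parsed into a single H2OFrame.
frame <- h2o.importFolder(path = "data/csv_gz", pattern = "\\.csv\\.gz$")
dim(frame)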

Related

How to read in-memory .gz file in R

I have an API data pull that returns a .gz file, which R recognizes as "raw" data. I have not been able to find any R package that can decompress it in memory from the saved file; I've tried fread(), rawToChar() and unzip().
The structure of the file I need to unzip is below, specifically req$content.
Thanks for your responses. I ended up saving the variables as .gz files in a folder and decompressing them using foreach and fread() from data.table. This solution worked better because the data set is really large (>80 GB), so it would not have made sense to do it in memory. Apologies for not being able to give a fully reproducible example, as the data is purchased and cannot be shared on a public forum.
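For a pull small enough to keep in memory, something like the sketch below should work (my sketch, not tested against the asker's data); it assumes req$content is a raw vector of gzip-compressed CSV text, and the folder name "gz_folder" in the second part is a placeholder.

library(data.table)
library(foreach)

# In-memory route: decompress the raw gzip vector and parse the text directly.
csv_text <- rawToChar(memDecompress(req$content, type = "gzip"))
dt <- fread(text = csv_text)

# On-disk route (what the asker ended up doing): loop over saved .gz files.
# fread() can read .gz files directly (recent data.table versions, or with R.utils installed).
files <- list.files("gz_folder", pattern = "\\.gz$", full.names = TRUE)
big_dt <- foreach(f = files, .combine = rbind) %do% fread(f)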

Why is read_excel very slow when the Excel file being read from R is also open in Excel?

The environment is:
R: 3.6.1
readxl version: ‘1.3.1’
When I close Excel, read_excel takes a second or two, but when I have the file open in Excel, read_excel in R can take a few minutes.
I wonder why that is?
Some programs, like Excel, place access restrictions (locks) on files while they are open. This prevents accidental conflicts from external changes to the file while it is open.
I don't know specifically why it would affect other tools reading the file, or why the effect manifests as slower reads rather than a complete inability to read. Maybe Excel is monitoring access to the file and comparing it to the content it has loaded.
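One workaround (my sketch, not from the answer above) is to read from a temporary copy, so R never touches the file Excel is holding; this assumes the OS still allows the locked file to be copied, and "report.xlsx" is just a placeholder path.

library(readxl)

# Copy the possibly locked workbook to a temp file and read the copy instead.
read_excel_copy <- function(path, ...) {
  tmp <- tempfile(fileext = ".xlsx")
  file.copy(path, tmp, overwrite = TRUE)
  on.exit(unlink(tmp))
  read_excel(tmp, ...)
}

df <- read_excel_copy("report.xlsx")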

Is there an R package to import VSAM files as a tibble or data frame?

I am looking for ways to process VSAM files with R and export them as CSV.
I have been searching the web and have not been able to find any methods of using R to read VSAM files.
A little more information would be of use. How are you going to get the data from the VSAM files? Are you reading directly from an IBM system? What access method will you be using? What is the structure of the file you are reading? If you want it put into a data.frame, is it something like a CSV file already? Any other particulars would be helpful.

Is it possible to download software using R?

I am writing a user-friendly function to import Access tables using R. I have found that most of the steps will have to be done outside of R, but I want to keep most of this within the script if possible. The first step is to download a database driver from Microsoft: https://www.microsoft.com/en-US/download/details.aspx?id=13255.
I am wondering if it is possible to download software from inside R, and what function/package I can use? I have looked into download.file but this seems to be for downloading information files rather than software.
Edit: I have tried
install_url("https://download.microsoft.com/download/2/4/3/24375141-E08D-4803-AB0E-10F2E3A07AAA/AccessDatabaseEngine_X64.exe")
But I get an error:
Downloading package from url: https://download.microsoft.com/download/2/4/3/24375141-E08D-4803-AB0E-10F2E3A07AAA/AccessDatabaseEngine_X64.exe
Installation failed: Don't know how to decompress files with extension exe
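install_url() expects an R package archive, which is why it refuses the .exe. One way around it (my sketch, not from the thread) is to download the installer as a binary with download.file() and then launch it; the destination path is a placeholder.

# Download the installer in binary mode, then launch it (Windows).
url  <- "https://download.microsoft.com/download/2/4/3/24375141-E08D-4803-AB0E-10F2E3A07AAA/AccessDatabaseEngine_X64.exe"
dest <- file.path(tempdir(), "AccessDatabaseEngine_X64.exe")

download.file(url, dest, mode = "wb")  # mode = "wb" is required for binaries on Windows
shell.exec(dest)                       # opens the installer; or use system2(dest, wait = TRUE)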

Huge XML-Parsing/converting using R or RDotnet

I have an XML file of 780 GB (yes, yes, indeed: a 5 GB pcap file that was converted to XML).
The name of the XML file is tmp.xml.
I am trying to run this operation in RStudio:
require("XML")
xmlfile <<- xmlRoot(xmlParse("tmp.xml"))
When I try to do this with R, I get errors (memory allocation failure, R session aborted, etc.).
Is there any benefit to using RDotnet instead of regular R?
Is there any way to use R to do this?
Do you know of another robust tool to convert this huge XML to CSV or an easier format?
Thank you!
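One option (my sketch, not an answer from the thread) is event-driven, SAX-style parsing with XML::xmlEventParse(), which streams the file instead of building the whole tree in memory; the element name "packet" and its attributes are assumptions about what the pcap-to-XML conversion produced.

library(XML)

out <- file("tmp.csv", open = "w")

# Write one CSV row per <packet> element as the parser streams through the file,
# so only the current element is ever held in memory.
handlers <- list(
  startElement = function(name, attrs, ...) {
    if (name == "packet") {
      writeLines(paste(attrs, collapse = ","), out)
    }
  }
)

xmlEventParse("tmp.xml", handlers = handlers)
close(out)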
