I'm trying to import a GTFS realtime data file into R using the ProtoBuf package, but can't get it to work. This is what I've tried, but I think it's way off track.
library(RProtoBuf)
setwd("c:\\temp\\")
proto <- readProtoFiles("seq")
The gtfsway package reads the data from the site, like this question, but the authors say the package is outdated.
Related
I have couple of parquet files in HDFS that I'd like to read into R and some data in R I'd like to write into HDFS and store in parquet file format. I'd like to use arrow library, because I believe it's the R equivalent of pyarrow and pyarrow is awesome.
The problem is, nowhere in the R arrow docs can I find information about working with HDFS and also in general not much information about how to use the library properly.
I am basically looking for the R equivalent of:
from pyarrow import fs
filesystem = fs.HadoopFileSystem(host = 'my_host', port = 0, kerb_ticket = 'my_ticket')
Disclosure:
I know how to use odbc to read and write my data. While reading is fine (but slow), inserting larger amounts of data into impala/hive this way is pure awful (slow, often fails, and impala isn't really built to digest data this way).
I know I could probably use pyarrow to work with hdfs, but would like to avoid installing python in my docker image just for this purpose.
The bindings for this are not currently implemented in R; there is a ticket open here on the project JIRA, which at time of writing is still marked "Unresolved": https://issues.apache.org/jira/browse/ARROW-6981. I'll comment on the JIRA ticket to mention that there is user interest in implementing these bindings.
I'm experimenting with GitHub and I created a little package for my colleagues to use. They install it with the devtools package and install_github() function directly in R. I also have some example data and a R-Markdown file that shows the usage of all functions in the package and can be published via GitHub Pages.
I would like to know what would be the best practice to enable others to use this example data to learn the package.
I can think of two different options:
Host the data in a separate directory which is not part of the installation and tell people to download it manually or use something like the download.file() function from R at the beginning of the example script to download all data that could be packed into a .zip.
Make the data part of the package installation, however this would require the data to be fairly small which is difficult in my particular case (data is 10MB).
Ideally the examples in the R-documentation (.Rd files in the man folder) could also use the same examples as in the markdown file. also in this case, option (2) seems to be favorable.
Could anybody give me some advice what would be the best way to go, sort of the "industry standard" if there is any.
I am looking for ways to process VSAM files with R and export as a csv.
I have been searching the web and have not been able to find any methods of using R to read VSAM files.
A little more information would be of use. How are you going to get the data from the VSAM files? Are you reading directly from an IBM system? What access method will you be using? What is the structure of the file you are reading since since if you want it to be put in a data.frame, is it something like a CSV file already?. So any other particulars would be helpful.
I am writing a user-friendly function to import Access tables using R. I have found that most of the steps will have to be done outside of R, but I want to keep most of this within the script if possible. The first step is to download a Database driver from microsoft: https://www.microsoft.com/en-US/download/details.aspx?id=13255.
I am wondering if it is possible to download software from inside R, and what function/package I can use? I have looked into download.file but this seems to be for downloading information files rather than software.
Edit: I have tried
install_url(https://download.microsoft.com/download/2/4/3/24375141-E08D-
4803-AB0E-10F2E3A07AAA/AccessDatabaseEngine_X64.exe)
But I get an error:
Downloading package from url: https://download.microsoft.com/download/2/4/3/24375141-E08D-4803-AB0E-10F2E3A07AAA/AccessDatabaseEngine_X64.exe
Installation failed: Don't know how to decompress files with extension exe
The 'h2o' package is a fun ML java tool that is accessible via R. The R package for accessing 'h2o' is called "h2o".
One of the input avenues is to tell 'h2o' where a csv file is and let 'h2o' upload the raw CSV. It can be more effective to just point out the folder and tell 'h2o' to import "everything in it" using the h2o.importFolder command.
Is there a way to point out a folder of "gzip" or "bzip" csv files and get 'h2o' to import them?
According to this link (here) the h2o can import compressed files. I just don't see the way to specify this for the importFolder approach.
Is it faster or slower to import the compressed form? If I have another program that makes output does it save me time in the h2o import process speed if they are compressed? If they are raw text? Guidelines and performance best practices are appreciated.
as always, comments, suggestions, and feedback are solicited.
I took the advice of #screechOwl and asked on the 0xdata.atlassian.net board for h2o and was given a clear answer:
It was supplied by user "cliff".
Hi, yes H2O - when importing a folder - takes all the files in the folder; it unzips gzip'd or zip'd files as needed, and parses them all into one large CSV. All the files have to be compatible in the CSV sense - same number and kind of columns.
H2O does not currently handle bzip files.