Can we write BSON files in R without MongoDB - r

I am trying to explore alternate serialization/un-serialization in R.
Is it possible to read/write data in the BSON file format in R without actually creating or using a MongoDB database?
I browsed the rmongodb package description, and it appeared that the package might require one to use a MongoDB database for reading/writing BSON files.
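For what it's worth, rmongodb's in-memory BSON helpers appear not to need a server connection at all. A minimal sketch (untested; assuming mongo.bson.from.list and mongo.bson.to.list behave as documented):

library(rmongodb)

x <- list(name = "iris", rows = 150L, species = c("setosa", "versicolor"))

# Build a BSON object purely in memory -- no mongo.create() or server involved
b <- mongo.bson.from.list(x)
print(b)

# Convert it back to an R list
y <- mongo.bson.to.list(b)
str(y)

What I have not found is a documented way to dump such an object to a .bson file and read it back, which is the part I am unsure about.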

Related

How to connect to HDFS from R and read/write parquets using arrow?

I have a couple of parquet files in HDFS that I'd like to read into R, and some data in R that I'd like to write to HDFS and store in parquet file format. I'd like to use the arrow library, because I believe it's the R equivalent of pyarrow, and pyarrow is awesome.
The problem is that nowhere in the R arrow docs can I find information about working with HDFS, and in general there is not much information about how to use the library properly.
I am basically looking for the R equivalent of:
from pyarrow import fs
filesystem = fs.HadoopFileSystem(host = 'my_host', port = 0, kerb_ticket = 'my_ticket')
Disclosure:
I know how to use odbc to read and write my data. While reading is fine (but slow), inserting larger amounts of data into impala/hive this way is downright awful (slow, it often fails, and impala isn't really built to digest data this way).
I know I could probably use pyarrow to work with hdfs, but would like to avoid installing python in my docker image just for this purpose.
The bindings for this are not currently implemented in R; there is an open ticket on the project JIRA, which at the time of writing is still marked "Unresolved": https://issues.apache.org/jira/browse/ARROW-6981. I'll comment on the JIRA ticket to mention that there is user interest in implementing these bindings.
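In the meantime, reading and writing Parquet on a local (or locally mounted) filesystem does work from R; it is only the HDFS filesystem bindings that are missing. A rough sketch with the arrow package:

library(arrow)

# Write a data frame to a local Parquet file and read it back
write_parquet(mtcars, "mtcars.parquet")
df <- read_parquet("mtcars.parquet")
str(df)

If HDFS happens to be mounted on the local filesystem (for example via an NFS gateway), this can serve as a stopgap until the HDFS bindings land.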

Importing and querying a mongodb bson file from external harddrive

I am new to MongoDB. I have a BSON file (collect.bson) on my external hard drive; it is very large, about 200 GB, and I want to run a query on it from my terminal. Do I need to create a database first in order to do that? Given that it is such a large file, I don't know how much space it will consume. I installed mongodb from my terminal, and I was curious how I can proceed to extract the attributes and columns into a CSV or into R. Please suggest.
Thanks
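Not a full answer, but one workable route is to restore the dump into a local mongod first and then pull it into R in manageable pieces with mongolite. A sketch (database, collection, field, and path names below are hypothetical):

# First, from the terminal (the restored data will need additional disk space
# on the target drive):
#   mongorestore --db mydb --collection collect /path/to/collect.bson

library(mongolite)

m <- mongo(collection = "collect", db = "mydb",
           url = "mongodb://localhost:27017")

# Query only what you need rather than loading all 200 GB at once,
# then write the result out as CSV for R or other tools
chunk <- m$find(query = '{"year": 2020}', limit = 100000)
write.csv(chunk, "collect_subset.csv", row.names = FALSE)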

How to write a table reference working with rjdbc?

I am working with really big data in R. My data is in Hive and I am using RJDBC. I am thinking of using a table reference in R because it's impossible to load the table into R, even with just a 10% sample. I am using the tbl function from dplyr.
transaction <- tbl(conn,"transaction")
R gave me an error message:
the dbplyr package is required to communicate with the database
backends.
I am using a remote computer and it's impossible to install packages on this R version.
Any other solutions to solve the problem?
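If installing dbplyr really is not an option, one workaround is to skip the lazy table reference and push the filtering or aggregation into Hive yourself with plain DBI calls, which RJDBC connections already support. A sketch (table and column names are illustrative):

library(DBI)

# Pull only a limited or pre-aggregated result instead of the whole table
sample_tx <- dbGetQuery(conn, "SELECT * FROM transaction LIMIT 100000")

summary_tx <- dbGetQuery(conn,
  "SELECT status, COUNT(*) AS n FROM transaction GROUP BY status")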

How to access data stored in hbase from spark in R

I need to get data stored in HBase into R for analysis, but I need to do it through Spark because the data does not fit in memory.
Does anybody know how to access data in hbase through Spark in R?
I've searched both the web and SO, but no joy. I've found pages that explain how to access data in HBase from R, but they don't do it through Spark. And all the pages I've seen explaining how to work with data in R and Spark (with sparklyr) give examples with the iris dataset :(
Any help is much appreciated!
One option seems to be to install rhbase, get the data from HBase and save it to CSV first, then use SparkR to read the data from the CSV file and proceed with the analysis: blogs.wandisco.com/2014/08/19/experiences-r-big-data/
Is there a better way? One that does not require saving the data to a csv file?
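To illustrate the CSV route with sparklyr rather than SparkR (a sketch; the export path and table name are hypothetical):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")   # or "local" for testing

# Read the CSV exported by rhbase straight into Spark, not into R memory
hbase_tbl <- spark_read_csv(sc, name = "hbase_export",
                            path = "hdfs:///tmp/hbase_export.csv")

# Analysis runs in Spark; collect() only small results back into R
hbase_tbl %>% count()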

Getting Data in and out of Rhipe [R + Hadoop]

I was trying out the rhipe and RHadoop [rmr, rhdfs, rhbase, etc.] series of packages.
In both packages [rhipe and rmr] I can ingest/read data stored in CSV or text files. Both of them support the creation of new file formats to some extent, but I find rmr has more support for it, or at least more resources to get started. This matters when one plans to do some processing on raw data stored in HDFS and finally wants to store the result back in HDFS in a format recognizable by other Hadoop components like Hive, Impala, etc. Both packages can write in their own native format, which is recognizable only by the package itself. The rmr package supports a few other formats.
For reference related to rmr have a look into: https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/getting-data-in-and-out.md
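For reference, the text output route from that rmr2 document looks roughly like the sketch below (assuming the format helpers it describes; paths are illustrative):

library(rmr2)

# Read text lines from HDFS and write plain text back,
# so other Hadoop components (Hive, Impala) can read the output
out <- mapreduce(
  input         = "/user/me/raw_input",
  output        = "/user/me/text_output",
  input.format  = make.input.format("text"),
  output.format = make.output.format("text"),
  map           = function(k, v) keyval(NULL, v)
)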
However, for rhipe I did not find any such document, and the various approaches I tried failed.
So my question is: how can I write back to text [for example; any other recognizable format will also work] after reading a file stored in HDFS and running rhwatch in rhipe?
