How to read parquet file from AWS S3 bucket using R without downloading it locally?

I'm able to download the parquet file from an AWS S3 bucket to local storage and then read from it (see the code below). But is there any way to read the parquet file directly from S3, without storing it locally?
library(aws.s3)   # save_object()
library(arrow)    # read_parquet()
save_object("Financial_Sample.parquet", file = "Financial_Sample.parquet", bucket = 'my-bucket')
df <- read_parquet("Financial_Sample.parquet")

Take a look at the arrow package: https://arrow.apache.org/docs/r/index.html
It can read directly from S3 and even filter before reading using dplyr verbs.
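A minimal sketch, assuming the bucket and file name from the question, that arrow was built with S3 support, and that credentials come from the usual AWS environment variables:

```r
library(arrow)
library(dplyr)

# Read the object straight from S3 into a data frame (nothing written to disk)
df <- read_parquet("s3://my-bucket/Financial_Sample.parquet")

# Or open it lazily and push filters down before collecting into R
ds <- open_dataset("s3://my-bucket/", format = "parquet")
small_df <- ds %>%
  filter(Segment == "Government") %>%   # hypothetical column/value
  collect()
```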

Related

How does the InputFile function in R work?

I am developing an R Shiny app, and the app will receive video from the user and upload it to an AWS S3 bucket. I am not clear on how this video is uploaded if I use R Connect to deploy the app. Does it go through https or http? I know it will be saved to the R Shiny server and then uploaded to the S3 bucket, but is there a way to save the video directly to the S3 bucket?
From my own research, the caveat I have found is that you must use the writeBin() function on your uploaded file, saving it to a temporary directory, before saving it to the aws.s3 bucket.
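A minimal server-side sketch of that approach, assuming a fileInput named "video" and a bucket called "my-bucket" (both hypothetical names), with the uploaded bytes written to a temporary file before put_object() sends it to S3:

```r
library(shiny)
library(aws.s3)

server <- function(input, output, session) {
  observeEvent(input$video, {
    # Shiny has already stored the upload on the server; copy the raw bytes to a temp file
    tmp <- tempfile(fileext = ".mp4")   # extension is an assumption
    writeBin(readBin(input$video$datapath, "raw",
                     n = file.size(input$video$datapath)), tmp)

    # Push the temp file to S3 (credentials via the usual AWS environment variables)
    put_object(file = tmp, object = input$video$name, bucket = "my-bucket")
  })
}
```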

Is it possible to write parquet files to local storage from h2o on hadoop?

I'm working with h2o (latest version 3.26.0.10) on a Hadoop cluster. I've read in a parquet file from HDFS and have performed some manipulation on it, built a model, etc.
I've stored some important results in an H2OFrame that I wish to export to local storage, instead of HDFS. Is there a way to export this file as a parquet?
I tried using h2o.exportFile (documentation here: http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.exportFile.html), but all the examples are for writing .csv. I tried using a file path with a .parquet extension, and that didn't work: it wrote a file, but I think it was essentially a .csv, since it was identical in size to the .csv output.
example: h2o.exportFile(iris_hf, path = "/path/on/h2o/server/filesystem/iris.parquet")
On a related note, if I were to export my H2OFrame to HDFS instead of local storage, would it be possible to write that in parquet format? I could at least then move that to local storage.
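Not an exact answer, but one possible workaround sketch, assuming the H2OFrame (called results_hf here, a placeholder name) is small enough to fit in R's memory: pull it into an R data.frame and write parquet locally with the arrow package instead of h2o.exportFile:

```r
library(h2o)
library(arrow)

# Pull the H2OFrame down into an R data.frame (only viable if it fits in memory),
# then write it as parquet on the local filesystem with arrow.
results_df <- as.data.frame(results_hf)   # results_hf is a placeholder name
write_parquet(results_df, "/path/on/local/filesystem/results.parquet")
```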

Reading CSV files from a public S3 instance is slow for the RShiny app in my EC2 instance

I am building a web app that I am currently hosting on an EC2 instance. It is an RShiny app, and I have installed Shiny Server on the instance. Although I am considering switching over to a database (which would require learning how to set up a database and connect it to my code), currently the app requires loading 6 CSV files that I have saved in an S3 bucket. The files are public, so I can read them in.
Here are two different ways to read the CSV files into the app:
library(RCurl)
library(readr)

my.df <- read_csv('mycsvfile.csv')

myy.df <- read.csv(
  textConnection(
    getURL("https://s3.amazonaws.com/mybucket/mycsvfile.csv")),
  sep = ",",
  header = TRUE,
  stringsAsFactors = FALSE)
The first method uses readr's read_csv function. I use this on my local machine; however, since I don't have the CSV files on my EC2 instance, I cannot use it for my hosted app. The second method uses RCurl to grab the files from the S3 bucket.
My issue is this - reading locally takes <1 second, whereas grabbing from S3 takes ~7-10 seconds. For 6 CSV files, this issue adds up quickly. Other than creating a database on my EC2 instance, is there a quicker way for me to read the CSV files into my hosted app?
Thanks!
EDIT - I'm currently scp'ing the files into the instance. I'm probably going to set up a git repo that links between the instance and my local machine to get this to work better.
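A hedged sketch of two things that often help, assuming the bucket and file name from the question: read the public objects straight from their HTTPS URLs with readr (no textConnection/getURL needed), and do it at the top level of global.R / app.R so the download happens once per R process rather than once per session:

```r
library(readr)

# Public objects can be read straight from their URLs; readr handles the HTTP fetch.
base_url   <- "https://s3.amazonaws.com/mybucket/"   # bucket from the question
file_names <- c("mycsvfile.csv")                     # add the other five file names here

# Run at the top level (outside server()), so it happens once per R process, not per session
csv_list <- lapply(paste0(base_url, file_names), read_csv)
```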

using sparklyr in RStudio, can I upload a LOCAL csv file to a spark cluster

I'm pretty new to cluster computing, so not sure if this is even possible.
I am successfully creating a spark_context in RStudio (using sparklyr) to connect to our local Spark cluster. Using copy_to I can upload data frames from R to Spark, but I am trying to load a locally stored CSV file directly into the Spark cluster using spark_read_csv, without importing it into the R environment first (it's a big 5 GB file). It's not working (even when prefixing the location with file:///), and it seems that it can only read files that are ALREADY stored on the cluster.
How do I upload a local file directly to Spark without loading it into R first?
Any tips appreciated.
You cannot. The file has to be reachable from each machine in your cluster, either as a local copy or placed on a distributed file system / object storage.
You can load files into Spark by using the spark_read_csv() method, as long as you pass a path that Spark can resolve. Note: it is not necessary to load the data into the R environment first.
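A minimal sketch of the "make it reachable, then read it" approach, assuming a YARN-backed cluster and a hypothetical HDFS path (copied there beforehand with something like hdfs dfs -put):

```r
library(sparklyr)

sc <- spark_connect(master = "yarn-client")   # hypothetical cluster config

# Once the 5 GB file is on storage every executor can reach (HDFS here),
# it can be read into Spark without ever entering the R session:
big_tbl <- spark_read_csv(
  sc,
  name   = "big_csv",
  path   = "hdfs:///data/big.csv",   # hypothetical path
  memory = FALSE                     # don't cache the whole table in cluster memory
)
```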

How to export a graph png file in R to Amazon S3 cloud storage

In a VM server environment, my R script needs to export a png file directly to Amazon S3. Does anyone know if there is a way to do this directly in R (as opposed to a separate listener process that picks up new files and moves them to S3 storage)?
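A minimal sketch using the aws.s3 package, assuming credentials come from the usual AWS environment variables and that "my-bucket" is a hypothetical target bucket: write the png to a temp file, then put_object() it:

```r
library(aws.s3)

# Render the plot to a temporary png, then push that file to S3
tmp <- tempfile(fileext = ".png")
png(tmp, width = 800, height = 600)
plot(cars)   # placeholder plot
dev.off()

put_object(file = tmp, object = "plots/cars.png", bucket = "my-bucket")
```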
