I'm able to download a parquet file from an AWS S3 bucket to local storage and then read it (see the code below). Is there any way to read the parquet file directly from S3, without storing it locally first?
save_object("Financial_Sample.parquet", file = "Financial_Sample.parquet", bucket = 'my-bucket')
df <- read_parquet("Financial_Sample.parquet")

Take a look at the arrow package.
It can read directly from S3 and can even filter the data before reading it, using some dplyr verbs.
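A minimal sketch of what this could look like, assuming the installed arrow build includes S3 support (`arrow_with_s3()` returns `TRUE`) and AWS credentials are available in the environment. The helper names `read_whole` and `read_filtered` and the `Country` column are hypothetical, chosen only for illustration:

```r
library(arrow)
library(dplyr)

# Read a whole parquet file into a data frame.
# `uri` may be a local path or an "s3://bucket/key" URI.
read_whole <- function(uri) {
  read_parquet(uri)
}

# Open the file lazily, filter with dplyr verbs, and only then pull rows into R.
# `Country` is a hypothetical column name; `!!` injects the value of `country`.
read_filtered <- function(uri, country) {
  open_dataset(uri) %>%
    filter(Country == !!country) %>%
    collect()
}
```

With the bucket from the question, the call would be `read_whole("s3://my-bucket/Financial_Sample.parquet")`. `read_filtered()` applies the filter while scanning, rather than after the whole file has been loaded into R.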


How does the fileInput function in R Shiny work?

I am developing an R Shiny app that receives a video from the user and uploads it to an AWS S3 bucket. I am not clear about how the video is uploaded if I deploy the app with R Connect. Does it go over HTTPS or HTTP? I know the video is saved to the R Shiny server first and then uploaded to the S3 bucket, but is there a way to save it directly to the S3 bucket?
From my own research, the caveat I have found is that you must call writeBin() on the uploaded file to save it in a temporary directory before putting it in the S3 bucket with aws.s3.
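For context, `fileInput()` itself stores each upload in a temporary file on the Shiny server and exposes that file's path as `datapath`. The sketch below passes this path straight to `aws.s3::put_object()`, as an alternative to an explicit writeBin() step. It assumes AWS credentials are configured on the server. The input id `video`, the bucket name, and the 500 MB limit are hypothetical:

```r
library(shiny)
library(aws.s3)

# Shiny rejects uploads larger than 5 MB by default; raise the limit for video.
options(shiny.maxRequestSize = 500 * 1024^2)

ui <- fluidPage(
  fileInput("video", "Upload a video", accept = "video/*")
)

server <- function(input, output, session) {
  observeEvent(input$video, {
    # input$video$datapath is the temporary file Shiny wrote on the server.
    put_object(
      file   = input$video$datapath,
      object = input$video$name,  # S3 key; here simply the original file name
      bucket = "my-bucket"        # hypothetical bucket name
    )
    showNotification("Uploaded to S3")
  })
}

# To run the app: shinyApp(ui, server)
```

In this sketch the file still passes through the Shiny server's temporary directory. It only skips a second, manual copy.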

Is it possible to write parquet files to local storage from h2o on hadoop?

I'm working with h2o (latest version) on a Hadoop cluster. I've read in a parquet file from HDFS, performed some manipulation on it, built a model, etc.
I've stored some important results in an H2OFrame that I wish to export to local storage, instead of HDFS. Is there a way to export this frame as a parquet file?
I tried using h2o.exportFile, but all the examples in its documentation are for writing .csv. I tried using a file path with .parquet as the extension and that didn't work. It wrote a file, but I think it was basically a .csv, as it was the same size as the .csv.
example: h2o.exportFile(iris_hf, path = "/path/on/h2o/server/filesystem/iris.parquet")
On a related note, if I were to export my H2OFrame to HDFS instead of local storage, would it be possible to write that in parquet format? I could at least then move that to local storage.

Reading CSV files from a public S3 instance is slow for the RShiny app in my EC2 instance

I am building a web app that I am currently hosting on an EC2 instance. It is an RShiny app, and I have installed ShinyServer on my instance. Although I am considering switching over to a database (which would require me to learn how to set one up and connect it to my code), currently the app requires loading 6 CSV files that I have saved in an S3 bucket. The files are public, so I can read them in.
Here are two different ways to read the CSV files into the app:
my.df <- read_csv('mycsvfile.csv')
myy.df <- read.csv(
header = TRUE,
stringsAsFactors = FALSE)
The first method uses readr's read_csv function. I use this on my local machine; however, since I don't have the CSV files on my EC2 instance, I cannot use it for my hosted app. The second method uses RCurl to grab the files from the S3 bucket.
My issue is this - reading locally takes <1 second, whereas grabbing from S3 takes ~7-10 seconds. For 6 CSV files, this issue adds up quickly. Other than creating a database on my EC2 instance, is there a quicker way for me to read the CSV files into my hosted app?
EDIT - I'm currently scp'ing the files into the instance. I'm probably going to set up a git repo that links between the instance and my local machine to get this to work better.
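Another way to get the files onto the instance is to have the app download each one once and read the local copy afterwards. A minimal sketch, not a benchmarked solution: `cached_read_csv` and `cache_dir` are hypothetical names, and the default cache lives in `tempdir()`, so it lasts only for the current R session unless a persistent directory is passed instead.

```r
# Download a CSV once, then serve later reads from the local copy.
cached_read_csv <- function(url, cache_dir = file.path(tempdir(), "csv_cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  local_path <- file.path(cache_dir, basename(url))
  if (!file.exists(local_path)) {
    # Only the first call pays the network cost.
    download.file(url, destfile = local_path, mode = "wb", quiet = TRUE)
  }
  read.csv(local_path, header = TRUE, stringsAsFactors = FALSE)
}
```

The cache key is just the file name, so two URLs ending in the same name would collide. That is acceptable for a fixed set of 6 files but worth changing otherwise.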

Using sparklyr in RStudio, can I upload a LOCAL csv file to a Spark cluster?

I'm pretty new to cluster computing, so not sure if this is even possible.
I am successfully creating a spark_context in RStudio (using sparklyr) to connect to our local Spark cluster. Using copy_to I can upload data frames from R to Spark, but I am trying to upload a locally stored CSV file directly to the Spark cluster using spark_read_csv, without importing it into the R environment first (it's a big 5GB file). It's not working (even when prefixing the location with file:///), and it seems that it can only read files that are ALREADY stored in the cluster.
How do I upload a local file directly to Spark without loading it into R first?
Any tips appreciated.
You cannot. The file has to be reachable from each machine in your cluster, either as a local copy or placed on a distributed file system / object storage.
You can load files from local storage into Spark by using the spark_read_csv() function. Make sure the path is passed correctly.
Note: It is not necessary to load the data into the R environment first.
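The two answers fit together as follows: spark_read_csv() is executed by Spark, so the data never passes through the R session, but the path must be readable from the machines in the cluster. A minimal sketch under that assumption; `read_shared_csv` is a hypothetical wrapper name:

```r
library(sparklyr)

# Read a CSV that every Spark node can reach at `path`
# (a shared mount, HDFS, or object storage).
read_shared_csv <- function(sc, path, name = "big_csv") {
  spark_read_csv(
    sc,
    name = name,          # name of the resulting Spark table
    path = path,
    header = TRUE,
    infer_schema = TRUE,
    memory = FALSE        # do not eagerly cache the table in Spark memory
  )
}
```

Usage would look like `sc <- spark_connect(master = "spark://master-host:7077")` followed by `read_shared_csv(sc, "file:///shared/data/big_file.csv")`, where both the master URL and the file path are placeholders.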

How to export a graph png file in R to Amazon S3 cloud storage

In a VM server environment, my R script needs to export a png file directly to Amazon S3. Does anyone know if there is a way to do this directly in R? (As opposed to a separate listener process that picks up new files and moves them to S3 storage.)
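One possible approach from within R is to render the plot to a temporary png file and hand that file to `aws.s3::put_object()`. A minimal sketch, assuming the aws.s3 package is installed and AWS credentials are configured. The function name `save_plot_to_s3`, its arguments, and the image size are hypothetical:

```r
# Render a plot to a temporary png, then upload it.
# `plot_fn` is a function that draws the plot; `uploader` defaults to
# aws.s3::put_object and can be swapped out, e.g. for a dry run.
save_plot_to_s3 <- function(plot_fn, object, bucket,
                            uploader = aws.s3::put_object) {
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp), add = TRUE)  # remove the temporary file afterwards

  png(tmp, width = 800, height = 600)
  tryCatch(plot_fn(), finally = dev.off())  # always close the device

  uploader(file = tmp, object = object, bucket = bucket)
}
```

A call could look like `save_plot_to_s3(function() plot(1:10), "plots/example.png", "my-bucket")`. The file still touches local temporary storage briefly, but no separate listener process is needed.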
