Read large csv file from S3 into R

I need to load a 3 GB csv file with about 18 million rows and 7 columns from S3 into R (or RStudio). My code for reading data from S3 usually works like this:
library("aws.s3")
obj <- get_object("s3://myBucketName/aFolder/fileName.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
Now, with the file being much bigger than usual, I receive an error
> csvcharobj <- rawToChar(obj)
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68
Reading this post, I understand that the vector is too long, but how would I subset the data in this case? Any other suggestions on how to deal with larger files read from S3?

Building on Hugh's comment in the OP, here is an answer for those wishing to load regular-size csv's from S3.
At least as of May 1, 2019, there is an s3read_using() function that allows you to read the object directly out of your bucket.
Thus
data <-
aws.s3::s3read_using(read.csv, object = "s3://your_bucketname/your_object_name.csv.gz")
will do the trick. However, if you want your work to run faster and cleaner, I prefer this (note it needs the data.table and janitor packages, plus magrittr or dplyr for the pipe):
data <-
  aws.s3::s3read_using(data.table::fread, object = "s3://your_bucketname/your_object_name.csv.gz") %>%
  janitor::clean_names()
Previously the more verbose method below was required:
library(aws.s3)
data <-
save_object("s3://myBucketName/directoryName/fileName.csv") %>%
data.table::fread()
It works for files up to at least 305 MB.
A better alternative to filling up your working directory with a copy of every csv you load:
data <-
  save_object("s3://myBucketName/directoryName/fileName.csv",
              file = tempfile(fileext = ".csv")) %>%
  fread()
If you are curious about where the tempfile ends up, Sys.getenv() can give some insight - see TMPDIR, TEMP, or TMP. More information can be found in the base R tempfile docs.
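For example, a quick check with base R (nothing here is specific to aws.s3):
# Environment variables that influence where temp files go
Sys.getenv(c("TMPDIR", "TEMP", "TMP"))
# The per-session temporary directory R is actually using
tempdir()
# A candidate temp file path inside that directory
tempfile(fileext = ".csv")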

You can use AWS Athena to query your S3 files in place and pull only selected records into R. How to run R with Athena is explained in detail here:
https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/
Hope it helps.
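As a rough sketch of the approach in that blog post - hedged, since the driver class, JDBC URL, jar path, staging bucket, and table/column names below are placeholders and depend on your Athena JDBC driver version and region:
library(RJDBC)
# Load the Athena JDBC driver (download the jar from AWS first)
drv <- JDBC(driverClass = "com.amazonaws.athena.jdbc.AthenaDriver",
            classPath = "/path/to/AthenaJDBC41.jar")
con <- dbConnect(drv,
                 "jdbc:awsathena://athena.us-east-1.amazonaws.com:443/",
                 s3_staging_dir = "s3://my-athena-query-results/",
                 user = Sys.getenv("AWS_ACCESS_KEY_ID"),
                 password = Sys.getenv("AWS_SECRET_ACCESS_KEY"))
# Only the filtered subset is ever transferred into R
data <- dbGetQuery(con, "SELECT col1, col2 FROM mydb.my_csv_table WHERE col3 > 100")
dbDisconnect(con)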

If you are on Spark or similar, another workaround would be to
- read/load the csv into a Spark table/DataFrame and
- continue processing it with R Server / sparklyr (see the sketch below).
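A minimal sparklyr sketch, assuming a working Spark installation with S3 credentials configured for the cluster; the bucket, object, and column names are placeholders:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")   # or your cluster master
# The data stays in Spark; only what you collect() comes back into R
csv_tbl <- spark_read_csv(sc, name = "bigdata",
                          path = "s3a://myBucketName/aFolder/fileName.csv")
subset_df <- csv_tbl %>% filter(my_column > 100) %>% collect()
spark_disconnect(sc)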

Related

Saving multiple files as .Rdata

I was wondering if there is a way of using the save() option to save multiple files of data. I wanted to save all these files in the form of .Rdata, but wasn't sure how to approach this without using save() multiple times. I am new to R.
Checking directory
As someone already mentioned above, you can pass multiple objects to save() and run it that way. Here I used the datasets included in the 'datasets' package within R. First check your working directory to see where it's getting saved:
getwd()
Then see where that is from the output:
[1] "C:/Users/DELL/Dropbox/My PC (DESKTOP-SUOCLVS)/Desktop/Research Tools/R Directory"
Creating basic rdata file
Then go ahead and run the code:
df1 <- iris
df2 <- mtcars
save(df1, df2,
     file = "mydata.rdata")
You'll see that it's now saved in the directory.
Other file types
If you mean saving multiple objects to files of different types, that is a bit more of an issue, as something like a csv or SPSS file isn't as easy to coerce. One option is to use the mapply() function. I've also used R datasets here as an example:
library(tidyverse)
myList <- list(diamonds = diamonds,
               cars = cars)
mapply(write.csv, myList, file = paste0(names(myList), '.csv'))
Which you can now see in the directory.

What is a good trick for loading a db file into R that exhausts memory?

I have a .txt file that is roughly 28 GB and 100 million rows. Originally we were using Python with a mix of pandas and SQLite3 to load this data into a .db file that we could query on. However, my team is more familiar with R, so we want to load the .db file into R instead, but that runs into a memory limit. Is there a workaround for this error? Is it possible to partially load some of the data into R?
library(RSQLite)
filename <- "DB NAME.db"
sqlite.driver <- dbDriver("SQLite")
db <- dbConnect(sqlite.driver,
dbname = filename)
## Some operations
dbListTables(db)
mytable <- dbReadTable(db,"TABLE NAME")
Error: vector memory exhausted (limit reached?)
You might want to try the vroom package to load the .txt file you mention, which allows you to select which columns to load, thus conserving memory.
filename <- "DB_NAME.txt"
db <- vroom::vroom(filename, col_select = c(col_1,col_2))
Have a look at the vroom vignette document for more information about how memory is conserved.
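Alternatively, since the data is already in a SQLite .db file, you can partially load it by querying only the rows and columns you need, rather than pulling the whole table with dbReadTable(). A minimal sketch reusing the RSQLite connection from the question (table and column names are placeholders):
library(RSQLite)
db <- dbConnect(SQLite(), dbname = "DB NAME.db")
# Only the selected columns and the limited, filtered rows are brought into R
mytable_part <- dbGetQuery(db, "SELECT col_1, col_2 FROM my_table WHERE col_1 > 0 LIMIT 1000000")
dbDisconnect(db)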

Reading csv into Rstudio from google cloud storage

I want to read csv file from google cloud storage with a function similar to
read.csv.
I used library googleCloudStorageR and I can't find a function for that. I don't want to download it, I just want to read it in environment like a data frame.
If you download a .csv file, googleCloudStorageR will by default parse it into a data.frame for you - you can turn off that behaviour by specifying saveToDisk:
# will make a data.frame
gcs_get_object("mtcars.csv")
# save to disk as a CSV
gcs_get_object("mtcars.csv", saveToDisk = "mtcars.csv")
You can specify your own parse function by supplying it via parseFunction
## default gives a warning about missing column name.
## custom parse function to suppress warning
f <- function(object){
suppressWarnings(httr::content(object, encoding = "UTF-8"))
}
## get mtcars csv with custom parse function.
gcs_get_object("mtcars.csv", parseFunction = f)
I’ve tried running a sample csv file with the as.data.frame() function.
In order to run this code snippet, make sure you install the package (install.packages("data.table")) and load it with library("data.table").
Also be sure that you wrap fread() inside the as.data.frame() call in order to read the file from its location.
Here is the code snippet I ran and managed to display the data frame for my data set:
library("data.table")
MyData <- as.data.frame(fread(file = "$FILE_PATH", header = TRUE, sep = ','))
print(MyData)
Reading Data with TensorFlow:
There is one other way you can read a csv from your cloud storage, via the cloudml/TensorFlow tooling for R. I would assume you are accessing this data from a bucket? Firstly, you would need to install the "readr" and "cloudml" packages for these functionalities to work. Then you would use gs_data_dir("gs://your-bucket-name") along with the file path file.path(data_dir, "something.csv"). You would then read the data from that path with read_csv(file.path(data_dir, "something.csv")). If you want it formatted as a data frame it should look something like this.
library("data.table")
library(cloudml)
library(readr)
data_dir <- gs_data_dir("gs://your-bucket-name")
MyData <- as.data.frame(read_csv(file.path(data_dir, "something.csv")))
print(MyData)
Make sure you have properly authenticated access to your storage. More information is in this link.

How to upload an R data frame into Google Drive?

I am using the googledrive package from CRAN. But its drive_upload() function lets you upload a local file and not a data frame. Can anybody help with this?
Just save the data frame in question to a local file. The most basic options would be saving to a CSV or saving an RData file.
Example:
test <- data.frame(a = 1)
save(test, file = "test.RData")
rm(test)
load("test.RData")
exists("test")
Since it has been clarified that using a temporary file is not an option, we could use a file connection instead.
test <- data.frame(a = 1)
tempFileCon <- file()
write.csv(test, file = tempFileCon)
And now we have the file connection in memory that we can pass to other functions. Caveat - use the literal object name to address it, not quotation marks as you would with actual file names.
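To illustrate the caveat, a small sketch reusing the connection created above (the anonymous file connection has to be rewound before reading back what was written):
seek(tempFileCon, 0)        # rewind the anonymous file connection to the beginning
read.csv(tempFileCon)       # works: pass the connection object itself
# read.csv("tempFileCon")   # would fail: R looks for an actual file named "tempFileCon"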
Unfortunately I can find no way to push the data frame up directly, but to document the basics this question touches on for others: the following code writes a local .csv and then bounces it up through tidyverse/googledrive to express itself as a Google Sheet.
write_csv(iris, 'df_iris.csv')
drive_upload('df_iris.csv', type='spreadsheet')
You can achieve this using gs_add_row() from the googlesheets package. This API accepts data frames directly as an input parameter and uploads the data to the specified Google Sheet. Local files are not required.
From the help section of ?gs_add_row:
"If input is two-dimensional, internally we call gs_add_row once per input row."
This can be done in two ways. As mentioned by others, a local file can be created and then uploaded. It is also possible to create a new spreadsheet in your Drive directly. This spreadsheet will be created in the main folder of your Drive; if you want it stored somewhere else, you can move it after creation.
# install the packages
install.packages(c("googledrive", "googlesheets4"))
# load the libraries
library(googledrive)
library(googlesheets4)
## With local storage
# Locally store the file
write.csv(x = iris, file = "iris.csv")
# Upload the file
drive_upload(media = "iris.csv", type='spreadsheet')
## Direct storage
# Create an empty spreadsheet. It is stored as an object with a sheet_id and drive_id
ss <- gs4_create(name = "my_spreadsheet", sheets = "Sheet 1")
# Put the data.frame in the spreadsheet and provide the sheet_id so it can be found
sheet_write(data=iris, ss = ss, sheet ="Sheet 1")
# Move your spreadsheet to the desired location
drive_mv(file = ss, path = "my_creations/awesome location/")

Import excel from Azure blob using R

I have the basic setup done following the link below:
http://htmlpreview.github.io/?https://github.com/Microsoft/AzureSMR/blob/master/inst/doc/tutorial.html
There is a method 'azureGetBlob' which allows you to retrieve objects from the containers. However, it seems to only allow "raw" and "text" formats, which is not very useful for Excel. I've tested the connections etc.; I can retrieve .txt / .csv files but not .xlsx files.
Does anyone know any workaround for this?
Thanks
Does anyone know any workaround for this?
Blob storage has no notion of a file type - a blob is just a name, and the extension only means something to the operating system. If we want to open an Excel file in R, we can use a third-party package such as readxl.
Workaround:
You could use the Get Blob API to download the blob to a local path and then use readxl to read the file. More demo code is available from this link.
# install
install.packages("readxl")
# Loading
library("readxl")
# xls files
my_data <- read_excel("my_file.xls")
# xlsx files
my_data <- read_excel("my_file.xlsx")
Solved with the following code. Basically, read the file as raw bytes, write them to disk, then read the file into R:
library(AzureSMR)  # for azureGetBlob(); 'sc' is the Azure context set up earlier
library(xlsx)      # read.xlsx() with a sheetIndex argument comes from the xlsx package
excel_bytes <- azureGetBlob(sc, storageAccount = "accountname", container = "containername", blob = blob_name, type = "raw")
q <- tempfile()
f <- file(q, 'wb')
writeBin(excel_bytes, f)
close(f)
result <- read.xlsx(q, sheetIndex = sheetIndex)
unlink(q)
