R Converting large CSV files to HDFS - r

I am currently using R to carry out analysis.
I have a large number of CSV files all with the same headers that I would like to process using R. I had originally read each files sequentially into R and row binded them together before carrying out the analysis together.
The number of files that need to be read in is growing and so keeping them all in memory to carry out manipulations to the data is becoming infeasible.
I can combine all of the CSV files together without using R and thus not keeping it in memory. This leaves a huge CSV file would converting it to HDFS make sense in order to be able to carry out the relevant analysis? And in addition to this...or would be make more sense to carry out the analysis on each csv file separately and then combine it at the end?
I am thinking that perhaps a distributed file system and using a cluster of machines on amazon to carry out the analysis efficiently.
Looking at rmr here, it converts data to HDFS but apparently its not amazing for really big data...how would one convert the csv in a way that would allow efficient analysis?

You can build a composite csv file into the hdfs. First, you can create an empty hdfs folder first. Then, you pull each csv file separately into the hdfs folder. In the end, you will be able to treat the folder as a single hdfs file.
In order to pull the files into the hdfs, you can either use a terminal for loop, the rhdfs package, or load your files in-memory and user to.dfs (although I don't recommend you the last option). Remember to take the header off from the files.
Using rmr2, I advise you to first convert the csv into the native hdfs format, then perform your analysis on it. You should be able to deal with big data volumes.

HDFS is a file system, not a file format. HDFS actually doesn't handle small files well, as it usually has a default block size of 64MB, which means any file from 1B to 63MB will take 64MB of space.
Hadoop is best to work on HUGE files! So it would be best for you to concatenate all your small files into one giant file on HDFS that your Hadoop tool should have a better time handling.
hdfs dfs -cat myfiles/*.csv | hdfs dfs -put - myfiles_together.csv

Related

In R and Sparklyr, writing a table to .CSV (spark_write_csv) yields many files, not one single file. Why? And can I change that?

Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part". Here's a screenshot:
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitions are done for computational efficiency. If you have partitions, multiple workers/executors can write the table on each partition. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only for writing tables but also for parallel computations.
For more details on partitioning, you can check this link.
Suppose I want to save table as a single file with the path path/to/table.csv. I would do this as follows
table %>% sdf_repartition(partitions=1)
spark_write_csv(table, path/to/table.csv,...)
You can check full details of sdf_repartition in the official documentation.
Data will be divided into multiple partitions. When you save the dataframe to CSV, you will get file from each partition. Before calling spark_write_csv method you need to bring all the data to single partition to get single file.
You can use a method called as coalese to achieve this.
coalesce(df, 1)

Partially read really large csv.gz in R using vroom

I have a csv.gz file that (from what I've been told) before compression was 70GB in size. My machine has 50GB of RAM, so anyway I will never be able to open it as a whole in R.
I can load for example the first 10m rows as follows:
library(vroom)
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7)
For what I have to do, it is fine to load 10m rows at the time, do my operations, and continue with the next 10m rows. I could do this in a loop.
I was therefore trying the skip argument.
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7, skip = 10^7)
This results in an error:
Error: The size of the connection buffer (131072) was not large enough
to fit a complete line:
* Increase it by setting `Sys.setenv("VROOM_CONNECTION_SIZE")`
I increased this with Sys.setenv("VROOM_CONNECTION_SIZE" = 131072*1000), however, the error persists.
Is there a solution to this?
Edit: I found out that random access to a gzip compressed csv (csv.gz) is not possible. We have to start from top. Probably the easiest is to decompress and save, then skip should work.
I haven't been able to figure out vroom solution for very large more-than-RAM (gzipped) csv files. However, the following approach has worked well for me and I'd be grateful to know about approaches with better querying speed while also saving disk space.
Use split sub-command inxsv from https://github.com/BurntSushi/xsv to split the large csv file into comfortably-within-RAM chunks of say, 10^5, lines and save them in a folder.
Read all chunks using data.table::fread one-by-one (to avoid low-memory error) using a for loop and save all of them into a folder as compressed parquet files using arrow package which saves space and prepares the large table for fast querying. For even faster operations, it is advisable to re-save the parquet files partitioned by the fields by which you need to frequently filter.
Now you can use arrow::open_dataset and query that multi-file parquet folder using dplyr commands. It takes minimum disk space and gives the fastest results in my experience.
I use data.table::fread with explicit definition of column classes of each field for fastest and most reliable parsing of csv files. readr::read_csv has also been accurate but slower. However, auto-assignment of column classes by read_csv as well as the ways in which you can custom-define column classes by read_csv is actually the best - so less human-time but more machine-time - which means that it may be faster overall depending on scenario. Other csv parsers have thrown errors for the kind of csv files that I work with and waste time.
You may now delete the folder containing chunked csv files to save space, unless you want to experiment loop-reading them with other csv parsers.
Other previously successfully approaches: Loop read all csv chunks as mentioned above and save them into:
a folder using disk.frame package. Then that folder may be queried using dplyr or data.table commands explained in the documentation. It has facility to save in compressed fst files which saves space, though not as much as parquet files.
a table in DuckDB database which allows querying with SQL or dplyr commands. Using database-tables approach won't save you disk space. But DuckDB also allows querying partitioned/un-partitioned parquet files (which saves disk space) with SQL commands.
EDIT: - Improved Method Below
I experimented a little and found a much better way to do the above operations. Using the code below, the large (compressed) csv file will be chunked automatically within R environment (no need to use any external tool like xsv) and all chunks will be written in parquet format in a folder ready for querying.
library(readr)
library(arrow)
fyl <- "...path_to_big_data_file.csv.gz"
pqFolder <- "...path_to_folder_where_chunked_parquet_files_are_to_be_saved"
f <- function(x, pos){
write_parquet(x,
file.path(pqFolder, paste0(pos, ".parquet")),
compression = "gzip",
compression_level = 9)
}
read_csv_chunked(
fyl,
col_types = list(Column1="f", Column2="c", Column3="T", ...), # all column specifications
callback = SideEffectChunkCallback$new(f),
chunk_size = 10^6)
If, instead of parquet, you want to use -
disk.frame, the callback function may be used to create chunked compressed fst files for dplyr or data.table style querying.
DuckDB, the callback function may be used to append the chunks into a database table for SQL or dplyr style querying.
By judiciously choosing the chunk_size parameter of readr::read_csv_chunked command, the computer should never run out of RAM while running queries.
PS: I use gzip compression for parquet files since they can then be previewed with ParquetViewer from https://github.com/mukunku/ParquetViewer. Otherwise, zstd (not currently supported by ParquetViewer) decompresses faster and hence improves reading speed.
EDIT 2:
I got a csv file which was really big for my machine: 20 GB gzipped and expands to about 83 GB, whereas my home laptop has only 16 GB. Turns out that the read_csv_chunked method I mentioned in earlier EDIT fails to complete. It always stops working after some time and does not create all parquet chunks. Using my previous method of splitting the csv file with xsv and then looping over them creating parquet chunks worked. To be fair, I must mention it took multiple attempts this way too and I had programmed a check to create only additional parquet chunks when running the program on successive attempts.
EDIT 3:
VROOM does have difficulty when dealing with huge files since it needs to store the index in memory as well as any data you read from the file. See development thread https://github.com/r-lib/vroom/issues/203
EDIT 4:
Additional tip: The chunked parquet files created by the above mentioned method may be very conveniently queried using SQL with DuckDB method mentioned at
https://duckdb.org/docs/data/parquet
and
https://duckdb.org/2021/06/25/querying-parquet.html
DuckDB method is significant because R Arrow method currently suffers from a very serious limitation which is mentioned in the official documentation page https://arrow.apache.org/docs/r/articles/dataset.html.
Specifically, and I quote: "In the current release, arrow supports the dplyr verbs mutate(), transmute(), select(), rename(), relocate(), filter(), and arrange(). Aggregation is not yet supported, so before you call summarise() or other verbs with aggregate functions, use collect() to pull the selected subset of the data into an in-memory R data frame."
The problem is that if you use collect() on a very big dataset, the RAM usage spikes and the system crashes. Whereas, using SQL statements to do the same aggregation job on the same big-dataset with DuckDB does not cause RAM usage spikes and does not cause system crash. So until Arrow fixes itself for aggregation queries for big-data, SQL from DuckDB provides a nice solution to querying big datasets in chunked parquet format.

Import .rds file to h2o frame directly

I have a large .rds file saved and I trying to directly import .rds file to h2o frame using some functionality, because it is not feasible for me to read that file in R enviornment and then use as.h2o function to convert.
I am looking for some fast and efficient way to deal with it.
My attempts:
I have tried to read that file and then convert it into h2o frame. But, it is way much time consuming process.
I tried saving file in .csv format and using h2o.import() with parse=T.
Due to memory constraint I was not able to save complete dataframe.
Please suggest me any efficient way to do it.
Any suggestions would be highly appreciated.
The native read/write functionality in R is not very efficient, so I'd recommend using data.table for that. Both options below make use of data.table in some way.
First, I'd recommend trying the following: Once you install the data.table package, and load the h2o library, set options("h2o.use.data.table"=TRUE). What that will do is make sure that as.h2o() uses data.table underneath for the conversion from an R data.frame to an H2O Frame. Something to note about how as.h2o() works -- it writes the file from R to disk and then reads it back again into H2O using h2o.importFile(), H2O's parallel file-reader.
There is another option, which is effectively the same thing, though your RAM doesn't need to store two copies of the data at once (one in R and one in H2O), so it might be more efficient if you are really strapped for resources.
Save the file as a CSV or a zipped CSV. If you are having issues saving the data frame to disk as a CSV, then you should make sure you're using an efficient file writer like data.table::fwrite(). Once you have the file on disk, read it directly into H2O using h2o.importFile().

R and zipped files

I have about ~1000 tar.gz files (about 2 GB/file compressed) each containing bunch of large .tsv (tab separated) files e.g. 1.tsv, 2.tsv, 3.tsv, 4.tsv etc.
I want to work in R on a subset of the .tsv files (say 1.tsv, 2.tsv) without extracting the .tar.gz files, in order to save space/time.
I tried looking around but couldn't find a library or a routine to stream the tar.gz files through memory and extracting data from them on the fly. In other languages there are ways of doing this efficiently. I would be surprised if one couldn't do this in R
Does anyone know of a way to accomplish this in R? Any help is greatly appreciated! Note: Unzipping/untarring the file is not an option. I want to extract relevant fields and save them in a data.frame without extracting the file

Hadoop InputFormat for Excel

I need to create a map-reducing program which reads an Excel file from HDFS and does some analysis on it. From there store the output in the format of excel file. I know that TextInputFormat is used to read a .txt file from HDFS but which method or which inputformat should I have to use?
Generally, hadoop is overkill for this scenario, but some relevant solutions
parse the file externally and convert to an hadoop compatible format
read the complete file as a single record see this answer
use two chained jobs. the 1st like in 2, reads the file in bulk, and emits each record as input for the next job.

Resources