Sparklyr - How to change the parquet data types - R

Is there a way to change data types of columns when reading parquet files?
I'm using the spark_read_parquet function from sparklyr, but it doesn't have the columns argument (which spark_read_csv has) to change them.
In csv files, I would do something like:
data_tbl <- spark_read_csv(sc, "data", path, infer_schema = FALSE, columns = list_with_data_types)
How could I do something similar with parquet files?

Specifying data types only makes sense when reading a data format that does not carry built-in metadata about variable types. That is the case with csv or fwf files, which, at most, have variable names in the first row, so the read functions for those formats offer that option.
This sort of functionality does not make sense for data formats that store variable types themselves, such as Parquet (or .rds and .RData files in R).
So in this case you should (a minimal sketch follows the steps):
a) read the Parquet file into Spark
b) make the necessary data transformations
c) save the transformed data into a Parquet file, overwriting the previous file
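A rough sketch of those steps with sparklyr and dplyr; the paths, the column name my_col, and the target type integer are placeholders rather than anything taken from the question:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# a) read the Parquet file into Spark
data_tbl <- spark_read_parquet(sc, "data", "path/to/data.parquet")

# b) cast columns to the desired types
data_tbl <- data_tbl %>%
  mutate(my_col = as.integer(my_col))

# c) write the transformed data back out as Parquet
spark_write_parquet(data_tbl, "path/to/data_recast.parquet", mode = "overwrite")
Writing to a new path rather than literally overwriting the file that is still being read avoids Spark reading and writing the same location at once; you can swap the files afterwards.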

Related

is there a library that can generate csv files given a data dictionary and data model in some format

Is there a library in any language that can generate .csv files for each entity of a data model, such that the generated data complies with a data dictionary?
For example:
data dictionary is specified in a csv file with these column names - field,regex,description
data model is specified in another csv file with these column names - entity,field
faker comes very close; however, it needs some programming to work with a data model. If there is a wrapper around faker, I suppose that might work well.

Create parquet file directory from CSV file in R

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI, but I recently came across arrow and would like to explore it more.
The problem is that the flat files I typically work with are large enough that they cannot be read into R without help. So I would ideally prefer a way to make the conversion without actually needing to read the dataset into R in the first place.
Any help you can provide would be much appreciated!
arrow::open_dataset() can work on a directory of files and query them without reading everything into memory. If you do want to rewrite the data into multiple files, potentially partitioned by one or more columns in the data, you can pass the Dataset object to write_dataset().
One (temporary) caveat: as of {arrow} 3.0.0, open_dataset() only accepts a directory, not a single file path. We plan to accept a single file path or list of discrete file paths in the next release (see issue), but for now if you need to read only a single file that is in a directory with other non-data files, you'll need to move/symlink it into a new directory and open that.
You can do it this way:
library(arrow)
library(dplyr)

csv_file <- "obs.csv"
dest <- "obs_parquet/"

sch <- arrow::schema(checklist_id = float32(),
                     species_code = string())

csv_stream <- open_dataset(csv_file, format = "csv",
                           schema = sch, skip_rows = 1)

write_dataset(csv_stream, dest, format = "parquet",
              max_rows_per_file = 1000000L,
              hive_style = TRUE,
              existing_data_behavior = "overwrite")
In my case (a 56 GB csv file), I ran into a really weird situation with the resulting parquet tables, so double-check them for any funky new rows that didn't exist in the original csv. I filed a bug report about it:
https://issues.apache.org/jira/browse/ARROW-17432
If you experience the same issue, use the Python Arrow library to convert the csv to parquet and then load it into R. The code is also in the Jira ticket.

import multiple csv from web into one data frame

I want to read several csv files from the web and save the data into one data frame.
If the files were on my computer this would be very easy, as I have seen, but I don't always want to download the files.
The example:
"https://www.football-data.co.uk/mmz4281/1819/F1.csv",
"https://www.football-data.co.uk/mmz4281/1718/F1.csv",
"https://www.football-data.co.uk/mmz4281/1617/F1.csv",
"https://www.football-data.co.uk/mmz4281/1516/F1.csv",
"https://www.football-data.co.uk/mmz4281/1415/F1.csv",
"https://www.football-data.co.uk/mmz4281/1314/F1.csv",
"https://www.football-data.co.uk/mmz4281/1213/F1.csv",
"https://www.football-data.co.uk/mmz4281/1112/F1.csv",
"https://www.football-data.co.uk/mmz4281/1011/F1.csv"
These are the CSV files. Maybe it's possible with a function or a loop, but I don't know how.
Maybe you can help me.
Greetings
Reading files from the web is just as easy as reading them from your file system; you can just pass a URL instead of a file-path to readr::read_csv() (you tagged your question with readr so I assume you want to use that).
Assuming your files are in a vector:
files <- c("https://www.football-data.co.uk/mmz4281/1819/F1.csv",
"https://www.football-data.co.uk/mmz4281/1718/F1.csv",
"https://www.football-data.co.uk/mmz4281/1617/F1.csv",
"https://www.football-data.co.uk/mmz4281/1516/F1.csv",
"https://www.football-data.co.uk/mmz4281/1415/F1.csv",
"https://www.football-data.co.uk/mmz4281/1314/F1.csv",
"https://www.football-data.co.uk/mmz4281/1213/F1.csv",
"https://www.football-data.co.uk/mmz4281/1112/F1.csv",
"https://www.football-data.co.uk/mmz4281/1011/F1.csv")
You can use readr::read_csv to read a specific file, and combine them into one data-frame with purrr::map_dfr:
df <- purrr::map_dfr(files, readr::read_csv)
This iterates over the contents of files, applies readr::read_csv to each of those elements, and combines them into one data frame, rowwise (hence dfr).
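If you also want to record which file (season) each row came from, one small extension is to name the vector and use map_dfr's .id argument; the column name season is just an illustration, and the names are taken from the season codes in the URLs:
# name each URL by its season code, e.g. "1819", "1718", ...
names(files) <- basename(dirname(files))

df <- purrr::map_dfr(files, readr::read_csv, .id = "season")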

Read partitioned parquet directory (all files) in one R dataframe with apache arrow

How do I read a partitioned parquet file into R with arrow (without any Spark)?
The situation
created parquet files with a Spark pipeline and saved them on S3
read them with RStudio/RShiny, with one column as index, to do further analysis
The parquet file structure
The parquet files created by my Spark pipeline consist of several parts
tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc
How do I read this component_mapping.parquet into R?
What I tried
install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet")
but this fails with the error
IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory
It works if I just read one file from the directory
install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")
but I need to load them all in order to query the data
What I found in the documentation
In the apache arrow documentation
https://arrow.apache.org/docs/r/reference/read_parquet.html and
https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html
I found that there are some properties for the read_parquet() command, but I can't get it working and cannot find any examples.
read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)
How do I set the properties correctly to read the full directory?
# presumably one of these methods
$read_dictionary(column_index)
or
$set_read_dictionary(column_index, read_dict)
Help would be very much appreciated
As #neal-richardson alluded to in his answer, more work has been done on this, and with the current arrow package (I'm running 4.0.0 currently) this is possible.
I noticed your files used snappy compression, which requires a special build flag before installation. (Installation documentation here: https://arrow.apache.org/docs/r/articles/install.html)
Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
install.packages("arrow",force = TRUE)
The Dataset API implements the functionality you are looking for, with multi-file datasets. While the documentation does not yet include a wide variety of examples, it does provide a clear starting point. https://arrow.apache.org/docs/r/reference/Dataset.html
The example below shows a minimal example of reading a multi-file dataset from a given directory and converting it to an in-memory R data frame. The API also supports filtering criteria and selecting a subset of columns, though I'm still trying to figure out the syntax myself.
library(arrow)
## Define the dataset
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Create a scanner
SO <- Scanner$create(DS)
## Load it as an Arrow Table in memory
AT <- SO$ToTable()
## Convert it to an R data frame
DF <- as.data.frame(AT)
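For the filtering and column selection mentioned above, the Dataset also works with dplyr verbs; a minimal sketch, where the column names component_id and value are placeholders rather than columns from the original data:
library(arrow)
library(dplyr)

DS <- arrow::open_dataset(sources = "/path/to/directory")
DF <- DS %>%
  select(component_id, value) %>%   # keep only the columns you need
  filter(value > 0) %>%             # the filter is pushed down to the files
  collect()                         # materialise the result as an R data frame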
Solution for: Read partitioned parquet files from local file system into R dataframe with arrow
As I would like to avoid using any Spark or Python on the RShiny server, I can't use the other libraries like sparklyr, SparkR, or reticulate and dplyr as described e.g. in How do I read a Parquet in R and convert it to an R DataFrame?
I have now solved my task with your proposal, using arrow together with lapply and rbindlist:
my_df <- data.table::rbindlist(
  lapply(Sys.glob("component_mapping.parquet/part-*.parquet"),
         arrow::read_parquet))
Looking forward to the Apache Arrow dataset functionality becoming available.
Thanks
Reading a directory of files is not something you can achieve by setting an option to the (single) file reader. If memory isn't a problem, today you can lapply/map over the directory listing and rbind/bind_rows into a single data.frame. There's probably a purrr function that does this cleanly. In that iteration over the files, you also can select/filter on each if you only need a known subset of the data.
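A minimal sketch of that lapply/map-and-bind approach, using the directory from the question and assuming all part files share the same schema:
files <- list.files("component_mapping.parquet",
                    pattern = "\\.parquet$", full.names = TRUE)

# read each part file into a data frame and bind the rows together
my_df <- purrr::map_dfr(files, arrow::read_parquet)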
In the Arrow project, we're actively developing a multi-file dataset API that will let you do what you're trying to do, as well as push down row and column selection to the individual files and much more. Stay tuned.
Solution for: Read partitioned parquet files from S3 into R dataframe using arrow
As it took me a very long time to figure out a solution and I was not able to find anything on the web, I would like to share this solution for reading partitioned parquet files from S3.
library(arrow)
library(aws.s3)
library(data.table)

bucket <- "mybucket"
prefix <- "my_prefix"

# use the aws.s3 library to list all "part-" files (Key) of one parquet folder in the bucket for the given prefix pattern
files <- rbindlist(get_bucket(bucket = bucket, prefix = prefix))$Key

# apply aws.s3::s3read_using to each file, using arrow::read_parquet to decode the parquet format
data <- lapply(files, function(x) {
  s3read_using(FUN = arrow::read_parquet, object = x, bucket = bucket)
})

# concatenate all data together into one data.frame
data <- do.call(rbind, data)
What a mess but it works.
#neal-richardson is there a way to use arrow directly to read from S3? I couldn't find anything about it in the documentation for R.
I am working on this package to make this easier: https://github.com/mkparkin/Rinvent
Right now it can read parquet files or delta files from local storage, AWS S3, or Azure Blob.
# read parquet from local storage with a where condition on the partition
readparquetR(pathtoread = "C:/users/...", add_part_names = F, sample = F, where = "sku=1 & store=1", partition = "2022")

# read local delta files
readparquetR(pathtoread = "C:/users/...", format = "delta")

# read delta files from Azure Blob storage
your_connection <- AzureStor::storage_container(AzureStor::storage_endpoint(your_link, key = your_key), "your_container")
readparquetR(pathtoread = "blobpath/subdirectory/", filelocation = "azure", format = "delta", containerconnection = your_connection)

Replacing data in .Rdata file

Is there a way I could replace the table in a .Rdata file with another one? I can edit it with the edit(x) command, but it would take an enormous amount of time to do this manually; besides, I haven't found a way to add rows to it.
I think you need to read a few 'intro to R' guides.
A .Rdata file is generally a saved session and can contain any number of 'things' saved in it: scalars, vectors, data.frames, lists, functions, etc. I assume you have a data file that has been read into R as a data.frame and saved within a .Rdata file. You can load the .Rdata file with load("....Rdata"). Then you can 'replace' your table (a data frame) by loading another one over top of it, if that's what you want to do: assuming it's called dat, run dat <- read.csv("new_data.csv", ...), and then save the session again with save.image("....Rdata"). I've assumed a lot of things there, though...
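Putting that together as a minimal sketch; the object name dat and the file names are assumptions, not details from the question:
# load the saved session; its objects appear in the workspace
load("mydata.Rdata")

# replace the data frame with a new one...
dat <- read.csv("new_data.csv")

# ...or append rows to the existing one instead
# dat <- rbind(dat, read.csv("extra_rows.csv"))

# save the updated workspace back to the file
save.image("mydata.Rdata")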
