write multiple parquet files by chunk_size - r

I see the chunk_size argument in arrow::write_parquet(), but it doesn't seem to behave as expected. I would expect the code below to generate 3 separate parquet files, but only one is created, even though nrow > chunk_size.
library(arrow)
# .parquet dir and file path
td <- tempdir()
tf <- tempfile("", td, ".parquet")
on.exit(unlink(tf))
# dataframe with 3e6 rows
n <- 3e6
df <- data.frame(x = rnorm(n))
# write with chunk_size 1e6, and view directory
write_parquet(df, tf, chunk_size = 1e6)
list.files(td)
Returns one file instead of 3:
[1] "25ff74854ba6.parquet"
# read parquet and show all rows are there
nrow(read_parquet(tf))
Returns:
[1] 3000000
We can't pass multiple file name arguments to write_parquet(), and I don't want to partition, so write_dataset() also seems inapplicable.

The chunk_size parameter refers to how much data to write to disk at once, rather than the number of files produced. The write_parquet() function is designed to write individual files, whereas, as you said, write_dataset() allows partitioned file writing. I don't believe that splitting files on any other basis is supported at the moment, though it is a possibility in future releases. If you had a specific reason for wanting 3 separate files, I'd recommend separating the data into multiple datasets first and then writing each of those via write_parquet().
(Also, I am one of the devs on the R package, and can see that this isn't entirely clear from the docs, so I'm going to open a ticket to update those - thanks for flagging this up)
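A minimal sketch of that suggestion, reusing df and td from the question (the chunk boundaries and the "part-%03d" file names are my own choices, not an arrow API): split the data frame into roughly 1e6-row pieces and write each piece with write_parquet().
library(arrow)
rows_per_file <- 1e6
chunks <- split(df, ceiling(seq_len(nrow(df)) / rows_per_file))
paths  <- file.path(td, sprintf("part-%03d.parquet", seq_along(chunks)))
invisible(Map(write_parquet, chunks, paths))
list.files(td)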

Short of an argument to write_parquet() like max_row that defaults to a reasonable number (like 1e6), we can do something like this:
library(arrow)
library(uuid)
library(glue)
library(dplyr)
write_parquet_multi <- function(df, dir_out, max_row = 1e6){
  # Only one parquet file is needed
  if(nrow(df) <= max_row){
    cat("Saving", formatC(nrow(df), big.mark = ","),
        "rows to 1 parquet file...")
    write_parquet(
      df,
      glue("{dir_out}/{UUIDgenerate(use.time = FALSE)}.parquet"))
    cat("done.\n")
  }
  # Multiple parquet files are needed
  if(nrow(df) > max_row){
    count = ceiling(nrow(df)/max_row)
    start = seq(1, count*max_row, max_row)
    end = c(seq(max_row, nrow(df), max_row), nrow(df))
    uuids = UUIDgenerate(n = count, use.time = FALSE)
    cat("Saving", formatC(nrow(df), big.mark = ","),
        "rows to", count, "parquet files...")
    for(j in 1:count){
      write_parquet(
        dplyr::slice(df, start[j]:end[j]),
        glue("{dir_out}/{uuids[j]}.parquet"))
    }
    cat("done.\n")
  }
}
# .parquet dir and file path
td <- tempdir()
tf <- tempfile("", td, ".parquet")
on.exit(unlink(tf))
# dataframe with 3e6 rows
n <- 3e6
df <- data.frame(x = rnorm(n))
# write parquet multi
write_parquet_multi(df, td)
list.files(td)
This returns:
[1] "7a1292f0-cf1e-4cae-b3c1-fe29dc4a1949.parquet"
[2] "a61ac509-34bb-4aac-97fd-07f9f6b374f3.parquet"
[3] "eb5a3f95-77bf-4606-bf36-c8de4843f44a.parquet"

Related

read only the first and last row of several files at once with R

I have many .csv files of different sizes. I select some of them that match a condition (those matching my id in the example). They are ordered by date and can be huge. I need to know the minimum and maximum dates of these files.
I can read all of the selected files, keeping only the column date.hour, and then easily find the minimum and maximum of all the date values.
But it would be a lot faster, since I repeat this for thousands of ids, if I could read only the first and last rows of my files.
Does anyone have an idea of how to solve this?
The code below works well, but I would like to improve it.
Function to read several files at once:
read.tables.simple <- function(file.names, ...) {
  require(plyr)
  ldply(file.names, function(fn) data.frame(read.table(fn, ...)))
}
Reading the files and selecting the minimum and maximum dates for all of these:
diri <- dir()
dat <- read.tables.simple(diri[1], header = TRUE, sep = ";", colClasses = "character")
colclass <- rep("NULL", ncol(dat))
x <- which(colnames(dat) == "date.hour")
colclass[x] <- "character"
x <- grep("id", diri)
dat <- read.tables.simple(diri[x], header = TRUE, sep = ";", colClasses = colclass)
datmin <- min(dat$date.hour)
datmax <- max(dat$date.hour)
In general, read.table is very slow. If you use read_tsv, read_csv or read_delim from the readr library, it will already be much, much faster.
If you are on Linux/Mac OS, you can also read only the first or last parts by setting up a pipe, which will be more or less instant, no matter how large your file is. Let's assume you have no column headers:
library(readr)
read_last <- function(file) {
  read_tsv(pipe(paste('tail -n 1', file)), col_names = FALSE)
}
# Readr can already read only a select number of lines, use `n_max`
first <- read_tsv(file, n_max=1, col_names=FALSE)
If you want to go for parallelism, you can even read files in parallel; see e.g. library(parallel) and ?mclapply.
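A rough sketch of that parallel idea (Linux/macOS only, since mclapply relies on forking; the tab separator, absence of headers, and the date sitting in the first column are assumptions carried over from read_last() above):
library(parallel)
library(readr)
files <- dir(pattern = "\\.csv$", full.names = TRUE)
file_ranges <- mclapply(files, function(f) {
  first <- read_tsv(f, n_max = 1, col_names = FALSE)
  last  <- read_last(f)
  # assuming the date sits in the first column of each file
  range(c(first[[1]], last[[1]]))
}, mc.cores = 4)  # adjust mc.cores to your machine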
The following function will read the first two lines of your csv (the header row and the first data row), then seek to the end of the file and read the last line. It will then stick these three lines together to read them as a two-row csv in memory, from which it returns the column date.hour. This gives your minimum and maximum values, since the times are arranged in order.
You need to tell the function the maximum line length. It's OK if you over-estimate this, but make sure the number is less than a third of your file size.
read_head_tail <- function(file_path, line_length = 100)
{
  con <- file(file_path)
  open(con)
  seek(con, where = 0)
  first <- suppressWarnings(readChar(con, nchars = 2 * line_length))
  first <- strsplit(first, "\n")[[1]][1:2]
  seek(con, where = file.info(file_path)$size - line_length)
  last <- suppressWarnings(readChar(con, nchars = line_length))
  last <- strsplit(last, "\n")[[1]]
  last <- last[length(last)]
  close(con)
  csv <- paste(paste0(first, collapse = "\n"), last, sep = "\n")
  df <- read.csv(text = csv, stringsAsFactors = FALSE)[-1]
  return(df$date.hour)
}
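For example (the file name and line_length are placeholders, and read.csv above assumes comma-separated data), since each file is sorted by date.hour the two returned values are already the minimum and maximum:
times <- read_head_tail("id123_2019-01.csv", line_length = 200)
min(times)   # earliest date.hour in the file
max(times)   # latest date.hour in the file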

Select CSV files and read in pairs

I am comparing two csv files at a time. Each of my files ends with a number, like cars_file2.csv, Lorries_file3.csv, computers_file4.csv, phones_file5.csv. I have about 70 files per folder, and I compare them in consecutive pairs: cars_file2.csv with Lorries_file3.csv, then Lorries_file3.csv with computers_file4.csv, so the pattern of numbers is 2,3, 3,4, 4,5, and so on. Is there a smart way to handle this instead of manually coming back to change the file names in the code below, perhaps by using the last number in each csv name to read them in order? NOTE: the files share the same suffix _file:
library(daff)
setwd("path")
# Load csvs to compare into data frames
x_original <- read.csv("cars_file2.csv", strip.white=TRUE, stringsAsFactors = FALSE)
x_changed <- read.csv("Lorries_file3.csv", strip.white=TRUE, stringsAsFactors = FALSE)
render(diff_data(x_original,x_changed ,ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE))
My intention is to compare each pair of csv files and record field additions, deletions, and modifications.
You may want to load all files at once and do your comparison with a full list of files.
This may help:
library(purrr)  # for compact()
library(daff)   # for diff_data() and render()

# your path
path <- "insert your path"

# get folders in this path
dir_data <- as.list(list.dirs(path))

# get all filenames
dir_data <- lapply(dir_data, function(x){
  # list of folders
  files <- list.files(x)
  files <- paste(x, files, sep = "/")
  # only .csv files
  files <- files[substring(files, nchar(files) - 3, nchar(files)) %in% ".csv"]
  # remove possible errors
  files <- files[!is.na(files)]
  # save if there are files
  if(length(files) >= 1){
    return(files)
  }
})

# delete NULL-values
dir_data <- compact(dir_data)

# make it a named vector
dir_data <- unique(unlist(dir_data))
names(dir_data) <- sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(dir_data))
names(dir_data) <- as.numeric(substring(names(dir_data), nchar(names(dir_data)), nchar(names(dir_data))))

# remove entries whose name could not be turned into a number (NA)
dir_data <- dir_data[!is.na(names(dir_data))]

# make it a list again
dir_data <- as.list(dir_data)

# load data
data_upload <- lapply(dir_data, function(x){
  if(file.exists(x)){
    data <- read.csv(x, header = T, sep = ";")
  }else{
    data <- "file not found"
  }
  return(data)
})

# setup for comparison
diffs <- lapply(as.character(sort(as.numeric(names(data_upload)))), function(x){
  # check if the second dataset exists
  if(as.character(as.numeric(x) + 1) %in% names(data_upload)){
    # first dataset
    print(data_upload[[x]])
    # second dataset
    print(data_upload[[as.character(as.numeric(x) + 1)]])
    # do your operations here
    comparison <- render(diff_data(data_upload[[x]],
                                   data_upload[[as.character(as.numeric(x) + 1)]],
                                   ignore_whitespace = T, count_like_a_spreadsheet = F))
    numbers <- c(x, as.numeric(x) + 1)
    # save both the comparison data and the numbers of the datasets
    return(list(comparison, numbers))
  }
})

# you can find the differences here
diffs
This script loads all csv-files in a folder and its sub-folders and puts them into a list by their numbers. In case there are no doubles, this will work. If you have doubles, you will have to adjust the part where the vector is named so that you can index the full names of the files afterwards.
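One possible adjustment, sketched under the assumption that every file name ends in "_file<number>.csv": capture the whole trailing number instead of only the last character, which also keeps multi-digit numbers intact and avoids colliding names.
# replaces the two naming lines above
paths <- unique(unlist(dir_data))
names(paths) <- sub(".*_file([0-9]+)\\.csv$", "\\1", basename(paths))
dir_data <- as.list(paths)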
A simple for-loop using paste will read in the pairs:
for (i in 1:70) { # assuming the last pair is cars_file70.csv and Lorries_file71.csv
  x_original <- read.csv(paste0("cars_file", i, ".csv"), strip.white=TRUE, stringsAsFactors = FALSE)
  x_changed <- read.csv(paste0("Lorries_file", i+1, ".csv"), strip.white=TRUE, stringsAsFactors = FALSE)
  render(diff_data(x_original, x_changed, ignore_whitespace=TRUE, count_like_a_spreadsheet = FALSE))
}
For simplicity I used 2 .csv files.
csv_1
1,2,4
csv_2
1,8,10
Load all the .csv files from folder,
files <- dir("Your folder path", pattern = '\\.csv', full.names = TRUE)
tables <- lapply(files, read.csv)
# create an empty list to store the comparison output
diff <- list()
Loop through all loaded files and compare:
for (pos in 1:length(tables)) {
  if (pos != length(tables)) { # ignore the last one
    # save comparison output
    diff[[pos]] <- diff_data(as.data.frame(tables[pos]), as.data.frame(tables[pos + 1]),
                             ignore_whitespace=TRUE, count_like_a_spreadsheet = FALSE)
  }
}
Comparison output stored in diff:
[[1]]
Daff Comparison: ‘as.data.frame(tables[pos])’ vs. ‘as.data.frame(tables[pos + 1])’
+++ +++ --- ---
## X1 X8 X10 X2 X4

r - rollapply across a multiple file database

I have a large database that I've split into multiple files. Each file is saved in the same directory, and there is a numerical sequence in the naming scheme so the order of the database is maintained. I've done this to reduce the time and memory it takes to load and manipulate the database. I would like to start analyzing the database in sequence, which I intend to accomplish using a rollapply-like function. I am having a problem when I want the window to span two files at once, which is where I need help. Here is a dummy dataset that will create five CSV files with a similar naming scheme to my database:
library(readr)
val <- c(1,2,3,4,5)
df_1 <- data.frame(val)
write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)
Keep in mind that this database is huge, and causes memory and time issues on my current machine. The solution MUST have a component that "forgets". This means recurrently joining the files, or loading them all at once to the R environment is not an option. When a new file is loaded, the last file must be removed from the R environment. I can have at maximum three files loaded at once. For example files 1-3 can be loaded, and then file 1 needs to be removed before file 4 is loaded.
The output can be a single list of all files - the combination of files 1-5 in a single list.
For the sake of simplicity, let's say I want to use a window of 2, and I want to calculate the mean of this window. I'm imagining something like this (see below), but this may be a failed approach, and I'm open to anything.
appreciated_function <- function(x){
  # Your greatly appreciated function
}
rollapply(df, 2, appreciated_function, by.column = FALSE, align = "left")
Suppose the window width is k. Iterate through all the files; for each one (except the last), read that file plus the first k-1 rows of the next, run rollapply on that, and append the result to what we have so far. Alternately, if the output is too large, we could write out each result instead of appending it.
At the bottom we check that it gives the expected result.
library(readr)
library(zoo)
val <- c(1,2,3,4,5)
df_1 <- data.frame(val)
write_csv(df_1, "1_database.csv", col_names = TRUE)
write_csv(df_1, "2_database.csv", col_names = TRUE)
write_csv(df_1, "3_database.csv", col_names = TRUE)
write_csv(df_1, "4_database.csv", col_names = TRUE)
write_csv(df_1, "5_database.csv", col_names = TRUE)
d <- dir(pattern = "database.csv$")
k <- 2
r <- NULL
for(i in seq_along(d)) {
  Next <- if (i != length(d)) read_csv(d[i+1], n_max = k-1)
  DF <- rbind(read_csv(d[i]), Next)
  r0 <- rollapply(DF, k, sum, align = "left")
  # if output too large replace next statement with one to write out r0
  r <- rbind(r, r0)
}
# check
r2 <- rollapply(data.frame(val = sequence(rep(5, 5))), k, sum, align = "left")
identical(r, r2)
## [1] TRUE

Mean values from multiple csv to data frame

After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load the file
2) check the number of rows and exclude the file if it has fewer than 3 registrations
3) calculate the mean value of all measurements (=rows) for column 2
4) calculate the mean value of all measurements (=rows) for column 4
5) output the filename timestamp, the mean of column 2 and the mean of column 4 to a data frame
I have written the following function
moist.each.mean <- function() {
  library("tcltk")
  directory <- tk_choose.dir("", "Choose folder for Humidity data files")
  setwd(directory)
  filelist <- list.files(path = directory)
  filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
  mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
  for(i in 1:length(filelist)){
    file.in[[i]] <- read.csv(filelist[i], header=F)
    if (nrow(file.in[[i]]<3)){
      print("discard")
    } else {
      newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
      mdf <- rbind(mdf, newrow)
    }
  }
  names(mdf) <- c("timestamp", "humidity", "temp")
}
but I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest using (l)apply... Here's my take:
getMeans <- function(fpath, runfct,
                     target_cols = c(2),
                     sep = ",",
                     dec = ".",
                     header = T,
                     min_obs_threshold = 3){
  f <- list.files(fpath)
  fcsv <- f[grepl("\\.csv", f)]
  fcsv <- paste0(fpath, fcsv)
  csv_list <- lapply(fcsv, read.table, sep = sep,
                     dec = dec, header = header)
  csv_rows <- sapply(csv_list, nrow)
  rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
  lapply(rel_csv_list, function(x) colMeans(x[, target_cols, drop = FALSE]))
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
  file.in = read.csv(f, header=TRUE, stringsAsFactors=FALSE)
  # If the file has less than 3 rows of data, just report it
  # (the list element won't be a data frame and is dropped later)
  if (nrow(file.in) < 3) {
    print(paste("Discard", f))
  # Otherwise, capture file timestamp and summarise data frame
  } else {
    data.frame(timestamp=substr(f, 7, 22),
               humidity=round(mean(file.in$V2),1),
               temp=round(mean(file.in$V4),1))
  }
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
           humidity=round(mean(file.in[[i]]$V2),1),
           temp=round(mean(file.in[[i]]$V4),1))
Thanks for the suggestions using lapply. This is definitely valuable, as it saves a whole lot of code! Meanwhile, I also managed to fix my original code:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)), nrow = 1, ncol =
3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
Only I did not get it to work as a function, because then I would only have one row in mdf, containing the last file's data. Somehow it did not add rows but overwrote row 1 with each iteration. But using it without a function wrapper worked fine...
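For what it's worth, a sketch of the same loop wrapped in a function (untested against your files): the likely reason the function version only kept one row is that its last expression was the names()<- assignment, so mdf itself was never returned. Building mdf inside the function and returning it explicitly should behave like the bare loop above.
moist.each.mean <- function(directory) {
  filelist   <- list.files(path = directory, full.names = TRUE)
  filetitles <- regmatches(basename(filelist), regexpr("[0-9].*[0-9]", basename(filelist)))
  mdf <- data.frame()
  for (i in seq_along(filelist)) {
    file.in <- read.csv(filelist[i], header = FALSE, skipNul = TRUE)
    if (nrow(file.in) >= 3) {
      mdf <- rbind(mdf, data.frame(timestamp = filetitles[[i]],
                                   humidity  = round(mean(file.in$V2, na.rm = TRUE), 1),
                                   temp      = round(mean(file.in$V4, na.rm = TRUE), 1)))
    }
  }
  mdf  # return the accumulated data frame
}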

Add selection criteria to read.table

Let's take the following simplified version of a dataset that I import using read.table:
a<-as.data.frame(c("M","M","F","F","F"))
b<-as.data.frame(c(25,22,33,17,18))
df<-cbind(a,b)
colnames(df)<-c("Sex","Age")
In reality my dataset is extremely large and I'm only interested in a small proportion of the data, i.e. the data concerning females aged 18 or under. In the example above this would be just the last 2 observations.
My question is: can I import just these observations directly, without importing the rest of the data and then using subset to refine my database? My computer's capacity is limited, so I have been using scan to import my data in chunks, but this is extremely time consuming.
Is there a better solution?
Some approaches that might work:
1 - Use a package like ff that can help you with RAM issues.
2 - Use other tools/languages to clean your data before loading it into R.
3 - If your file is not too big (i.e., you can load it without crashing), then you could save it to a .RData file and read from this file (instead of calling read.table):
# save each txt file once...
save.rdata = function(filepath, filebin) {
  dataset = read.table(filepath)
  save(dataset, file = paste(filebin, ".RData", sep = ""))
}
# then read from the .RData
get.dataset = function(filebin) {
  load(paste(filebin, ".RData", sep = ""))
  return(dataset)
}
This is much faster than reading from a txt file, but I'm not sure if it applies to your case.
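Hypothetical usage of the two helpers above ("bigfile.txt" is a placeholder; add header = TRUE inside save.rdata if your text file has a header row, so the Sex/Age columns keep their names):
save.rdata("bigfile.txt", "bigfile")   # one-off conversion to bigfile.RData
dataset <- get.dataset("bigfile")      # fast reload on later runs
females_u18 <- subset(dataset, Sex == "F" & Age <= 18)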
There should be several ways to do this. Here is one using SQL.
library(sqldf)
result = sqldf("select * from df where Sex='F' AND Age<=18")
> result
Sex Age
1 F 17
2 F 18
There is also a read.csv.sql function that you can use with the above statement to filter while reading, so you avoid loading the whole text file!
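A sketch of that, applied to the filter from the question ("mydata.csv" is a placeholder; note the quoting gotcha illustrated in the answer below if your character fields are quoted):
library(sqldf)
females_u18 <- read.csv.sql("mydata.csv",
                            sql = "select * from file where Sex = 'F' AND Age <= 18")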
This is almost the same as #Drew75's answer, but I'm including it to illustrate some gotchas with SQLite:
# example: large-ish data.frame
df <- data.frame(Sex = sample(c("M","F"), 1e6, replace = T),
                 Age = sample(18:75, 1e6, replace = T))
write.csv(df, "myData.csv", quote = F, row.names = F)  # note: non-quoted strings

library(sqldf)
myData <- read.csv.sql(file = "myData.csv",  # looks for char M (no quotes)
                       sql = "select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 500127

write.csv(df, "myData.csv", row.names = F)   # quoted strings...
myData <- read.csv.sql(file = "myData.csv",  # this fails
                       sql = "select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 0

myData <- read.csv.sql(file = "myData.csv",  # need quotes in the char literal
                       sql = "select * from file where Sex='\"M\"'", eol = "\n")
nrow(myData)
# [1] 500127
