Poor performance Arrow Parquet multiple files

Poor performance Arrow Parquet multiple files - r

After watching the mind-blowing webinar at Rstudio conference here I was pumped enough to dump an entire SQL server table to parquet files. The result was 2886 files, (78 entities over 37 months) with around 700 millons rows in total.
Doing a basic select returned all rows in less than 15 seconds! (Just out of this world result!!) At the webinar Neal Richardson from Ursa Labs was showcasing the Ny-Taxi dataset with 2 billions rows under 4 seconds.
I felt it was time to do something more daring like basic mean, sd, mode over a year worth of data, but that took a minute per month, so I was sitting 12.4 minutes waiting for a reply from R.
What is the issue? My badly written R-query? or simply too many files or granularity (decimal values?)??
Any ideas??
PS: I did not want to put a Jira-case in apache-arrow board as I see google search does not retrieve answers from there.

My guess (without actually looking at the data or profiling the query) is two things:
You're right, the decimal type is going to require some work in converting to an R type because R doesn't have a decimal type, so that will be slower than just reading in an int32 or float64 type.
You're still reading in ~350 million rows of data to your R session, and that's going to take some time. In the example query on the arrow package vignette, more data is filtered out (and the filtering is very fast).

Related

hvplot taking hours to render image

I'm working with Gaia astrometric data from the data release 3 and saw hvplot/datashader as the go-to for visualizing large data due to very fast render times and interactivity. In every example I'm seeing, it's taking a few seconds to render an image from hundreds of millions of data points on the slow end. However, when I try to employ the same code for my data, it takes hours for any image to render at all.
For context, I'm running this code on a very large research computer cluster with hundreds of gigs of RAM, a hundred or so cores, and terabytes of storage at my disposal, computing power should not be an issue here. Additionally, I've converted the data I need to a series of parquet files that are being read into a dask dataframe with glob. My code is as follows:
...
import dask.dataframe as dd
import hvplot.dask
import glob
df=dd.read_parquet(glob.glob(r'myfiles/*'),engine='fastparquet')
df=df.astype('float32')
df=df[['col1','col2']]
df.hvplot.scatter(x='col1',y='col2',rasterize=True,cmap=cc.fire)
...
does anybody have any ideas what could be the issue here? Any help would be appreciated
Edit: I've got the rendering times below an hour now by turning the data into a smaller number of higher memory files (3386 -> 175)

Hard to debug without access to the data, but one quick optimization you can implement is to avoid loading all the data and select the specific columns of interest:
df=dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet', columns=['col1','col2'])
Unless crucial, I'd also avoid doing .astype. It shouldn't be a bottleneck, but the gains from this float32 might not be relevant if memory isn't a constraint.

time taken to read a large CSV file in Julia

I have a large CSV file - almost 28 million rows and 57 columns - 8.21GB - the data is of different types - integers, strings, floats - but nothing unusual.
When I load it in Python/Pandas it takes 161 seconds, using the following code.
df = pd.read_csv("file.csv", header=0, low_memory=False)
In Julia, it takes a little longer - over an hour. UPDATE: I am not sure why, but when I ran the code this morning (twice to check), it took around 702 and 681 seconds. This much better than an hour, but it is still way slower than Python.
My Julia code is also pretty simple:
df = CSV.File("file.csv") |> DataFrame
Am I doing something wrong? Is there something I can do to speed it up? Or is this just the price you pay to play with Julia?

From the CSV.jl documentation:
In some cases, sinks may make copies of incoming data for their own safety; by calling CSV.read(file, DataFrame), no copies of the parsed CSV.File will be made, and the DataFrame will take direct ownership of the CSV.File's columns, which is more efficient than doing CSV.File(file) |> DataFrame which will result in an extra copy of each column being made.
so you could try
CSV.read("file.csv", DataFrame)

How can I generate a fully-formatted Excel document as a final output?

I have built a script that extracts data from numerous backend systems and I want the output to be formatted like this:
Which package is the best and/or easiest to do this with?
I am aware of the xlsx package, but am also aware that there are others available so would like to know which is best in terms of ease and/or simplicity to achieve my desired output.
A little more detail:
If I run the report across seven days, then the resulting data frame is 168 rows deep (1 row represents 1 hour, 168 hours per week). I want each date (00:00 - 23:00) to be broken out into day-long blocks, as per the image I have provided.
(Also note that I am in London, England, and as such am currently in timezone UTC+1, which means that right now, the hourly breakdown for each date will range from 01:00 - 00:00 on the next day because our backend systems run on the UTC timezone and that is fine.)
At present, I copy and paste (transpose) the values across manually, but want to be able to fully automate the process so that I can run the script (function), and have the resulting output looking like the image.
This is what the current final output looks like:

Try the package openxlsx. The package offers a lot of custom formatting for .xslx documents and is actively developed / fairly responsive to githhub issues. The vignettes on the cran website are particularly useful.

High-scale signal processing in R

I have high-dimensional data, for brain signals, that I would like to explore using R.
Since I am a data scientist I really do not work with Matlab, but R and Python. Unfortunately, the team I am working with is using Matlab to record the signals. Therefore, I have several questions for those of you who are interested in data science.
The Matlab files, recorded data, are single objects with the following dimensions:
1000*32*6000
1000: denotes the sampling rate of the signal.
32: denotes the number of channels.
6000: denotes the time in seconds, so that is 1 hour and 40 minutes long.
The questions/challenges I am facing:
I converted the "mat" files I have into CSV files, so I can use them in R.
However, CSV files are 2 dimensional files with the dimensions: 1000*192000.
the CSV files are rather large, about 1.3 gigabytes. Is there a
better way to convert "mat" files into something compatible with R,
and smaller in size? I have tried "R.matlab" with readMat, but it is
not compatible with the 7th version of Matlab; so I tried to save as V6 version, but it says "Error: cannot allocate vector of size 5.7 Gb"
the time it takes to read the CSV file is rather long! It takes
about 9 minutes to load the data. That is using "fread" since the
base R function read.csv takes forever. Is there a better way to
read files faster?
Once I read the data into R, it is 1000*192000; while it is actually
1000*32*6000. Is there a way to have multidimensional object in R,
where accessing signals and time frames at a given time becomes
easier. like dataset[1007,2], which would be the time frame of the
1007 second and channel 2. The reason I want to access it this way
is to compare time frames easily and plot them against each other.
Any answer to any question would be appreciated.

This is a good reference for reading large CSV files: https://rpubs.com/msundar/large_data_analysis A key takeaway is to assign the datatype for each column that you are reading versus having the read function decide based on the content.

Trimming a huge (3.5 GB) csv file to read into R

So I've got a data file (semicolon separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's county level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.
So my question is this, given that I want all the counties, but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?
Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?
Any ideas would be greatly appreciated.
Update:
Constraints
Needs to use my machine, so no EC2 instances
As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
As you can see below, the data contains mixed types, which I need to operate on later
Data
The data is 3.5GB, with about 8.5 million rows and 17 columns
A couple thousand rows (~2k) are malformed, with only one column instead of 17
These are entirely unimportant and can be dropped
I only need ~100,000 rows out of this file (See below)
Data example:
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
NC [Malformed row]
[8.5 Mill rows]
I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:
County; State; Year; Quarter; Segment; GDP; ...
Ada County;NC;2009;4;FIRE;80.1; ...
Ada County;NC;2010;1;FIRE;82.5; ...
[~200,000 rows]
Results:
After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation.
I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only columns I want.
It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket.
## Open a connection separately to hold the cursor position
file.in <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')
line <- readLines(file.in, n=1)
line.split <- strsplit(line, ';')
# Stitching together only the columns we want
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
## Use a loop to read in the rest of the lines
line <- readLines(file.in, n=1)
while (length(line)) {
line.split <- strsplit(line, ';')
if (length(line.split[[1]]) > 1) {
if (line.split[[1]][3] == '2009') {
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
}
}
line<- readLines(file.in, n=1)
}
close(file.in)
close(file.out)
Failings by Approach:
sqldf
This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes.
MapReduce
To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.
bigmemory
This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.table. If I need to design large data sets for the future though, I'd consider only using numbers just to keep this option alive.
scan
Scan seemed to have similar type issues as big memory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.

My try with readLines. This piece of a code creates csv with selected years.
file_in <- file("in.csv","r")
file_out <- file("out.csv","a")
x <- readLines(file_in, n=1)
writeLines(x, file_out) # copy headers
B <- 300000 # depends how large is one pack
while(length(x)) {
ind <- grep("^[^;]*;[^;]*; 20(09|10)", x)
if (length(ind)) writeLines(x[ind], file_out)
x <- readLines(file_in, n=B)
}
close(file_in)
close(file_out)

I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this, including:
mapReduce (pure R)
RHIPE (which uses Hadoop); see example 6.2.2 in the documentation for an example of subsetting files
Alternatively, R provides several packages to deal with large data that go outside memory (onto disk). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.

Is there a similar way to read in files a piece at a time in R?
Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read data in a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Although if you feel like firing up a big memory instance on Amazon's EC2 you can get up to 64GB of RAM. That should hold your file plus plenty of room to manipulate the data.
If you need more speed, then Shane's recommendation to use Map Reduce is a very good one. However if you go the route of using a big memory instance on EC2 you should look at the multicore package for using all cores on a machine.
If you find yourself wanting to read many gigs of delimited data into R you should at least research the sqldf package which allows you to import directly into sqldf from R and then operate on the data from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.

There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
http://colbycol.r-forge.r-project.org/
It passes any arguments along to read.table, so the combination should let you subset pretty tightly.

The ff package is a transparent way to deal with huge files.
You may see the package website and/or a presentation about it.
I hope this helps

What about using readr and the read_*_chunked family?
So in your case:
testfile.csv
County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5
lol
Ada County;NC;2013;1;FIRE;Financial;Banks;82.5
Actual code
require(readr)
f <- function(x, pos) subset(x, Year %in% c(2009, 2010))
read_csv2_chunked("testfile.csv", DataFrameCallback$new(f), chunk_size = 1)
This applies f to each chunk, remembering the col-names and combining the filtered results in the end. See ?callback which is the source of this example.
This results in:
# A tibble: 2 × 8
County State Year Quarter Segment `Sub-Segment` `Sub-Sub-Segment` GDP
* <chr> <chr> <int> <int> <chr> <chr> <chr> <dbl>
1 Ada County NC 2009 4 FIRE Financial Banks 801
2 Ada County NC 2010 1 FIRE Financial Banks 825
You can even increase chunk_size but in this example there are only 4 lines.

You could import data to SQLite database and then use RSQLite to select subsets.

Have you consisered bigmemory ?
Check out this and this.

Perhaps you can migrate to MySQL or PostgreSQL to prevent youself from MS Access limitations.
It is quite easy to connect R to these systems with a DBI (available on CRAN) based database connector.

scan() has both an nlines argument and a skip argument. Is there some reason you can just use that to read in a chunk of lines a time, checking the date to see if it's appropriate? If the input file is ordered by date, you can store an index that tells you what your skip and nlines should be that would speed up the process in the future.

These days, 3.5GB just isn't really that big, I can get access to a machine with 244GB RAM (r3.8xlarge) on the Amazon cloud for $2.80/hour. How many hours will it take you to figure out how to solve the problem using big-data type solutions? How much is your time worth? Yes it will take you an hour or two to figure out how to use AWS - but you can learn the basics on a free tier, upload the data and read the first 10k lines into R to check it worked and then you can fire up a big memory instance like r3.8xlarge and read it all in! Just my 2c.

Now, 2017, I would suggest to go for spark and sparkR.
the syntax can be written in a simple rather dplyr-similar way
it fits quite well to small memory (small in the sense of 2017)
However, it may be an intimidating experience to get started...

I would go for a DB and then make some queries to extract the samples you need via DBI
Please avoid importing a 3,5 GB csv file into SQLite. Or at least double check that your HUGE db fits into SQLite limits, http://www.sqlite.org/limits.html
It's a damn big DB you have. I would go for MySQL if you need speed. But be prepared to wait a lot of hours for the import to finish. Unless you have some unconventional hardware or you are writing from the future...
Amazon's EC2 could be a good solution also for instantiating a server running R and MySQL.
my two humble pennies worth.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex