R: write in feather format to byte vector

In R, how can a data.frame be written to an in-memory raw byte vector in the feather format?
The arrow package has a write_feather function whose destination can be a BufferedOutputStream, but the documentation doesn't describe how to create such a stream or access its underlying buffer.
Beyond that, most other packages assume use of a local file system rather than in-memory storage.
Thank you in advance for your consideration and response.

BufferOutputStream$create() is how you create one. You can pass that to write_feather(). If you want a raw R vector back, you can use write_to_raw(), which wraps that. See https://arrow.apache.org/docs/r/reference/write_to_raw.html for docs; there's a link there to the source if you want to see exactly what it's doing, in case you want to do something slightly differently.
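A minimal sketch of both routes (the finish() method on the stream is assumed from the underlying C++ API; check the write_to_raw() source linked above for the exact mechanics):
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Route 1: write_to_raw() returns a raw vector directly;
# format = "file" corresponds to the Feather V2 / IPC file format
bytes <- write_to_raw(df, format = "file")

# Route 2: the manual version via an in-memory stream
sink <- BufferOutputStream$create()
write_feather(df, sink)
buf <- sink$finish()  # assumed: finish() hands back the underlying Buffer
bytes2 <- as.raw(buf)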

Related

AudioBuffer's "getChannelData()" equivalent for MediaStream (or MediaStreamAudioSourceNode)?

I use AudioContext's decodeAudioData on an mp3 file, which gives me an AudioBuffer. With this audio buffer, I go on to draw a waveform of this mp3 file on a canvas using the data returned by getChannelData().
Now I want to use the same code to draw a waveform of the audio data of a MediaStream, which means I need the same kind of input/data. I know a MediaStream contains real-time information, but there must be a way to access each new chunk of data from the MediaStream as
a Float32Array containing the PCM data
which is what the AudioBuffer's getChannelData returns.
I tried to wrap the MediaStream with a MediaStreamAudioSourceNode and feed it into an AnalyserNode to use getFloatFrequencyData() (which returns a Float32Array), but I can tell the data is different from the data I get from getChannelData(). Maybe it isn't "PCM" data? How can I get "PCM" data?
Hopefully this is clear. Thanks for the help!
First, a note that AnalyserNode only samples the data occasionally and won't process all of it. I think that matches your scenario well, but just know that if you need all the data (e.g. if you're buffering up the audio), you will need to use a ScriptProcessorNode instead today.
Presuming you just want samples of the data, you can use AnalyserNode, but you should call getFloatTimeDomainData(), not getFloatFrequencyData(). That will give you the PCM data (getFloatFrequencyData() gives you the FFT of the PCM data).
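A minimal sketch of that approach (variable names are illustrative; stream is your MediaStream):
const audioCtx = new AudioContext();
const source = audioCtx.createMediaStreamSource(stream);
const analyser = audioCtx.createAnalyser();
source.connect(analyser);

const samples = new Float32Array(analyser.fftSize);

function draw() {
  analyser.getFloatTimeDomainData(samples);  // raw PCM samples in [-1, 1]
  // ... render `samples` to the canvas, as with getChannelData() ...
  requestAnimationFrame(draw);
}
draw();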
Alternatively, create a MediaStreamDestination with the AudioContext and then a new MediaRecorder from its stream:
var dest = audioCtx.createMediaStreamDestination();
var options = {mimeType: 'audio/webm;codecs=pcm'};
mediaRecorder = new MediaRecorder(dest.stream, options);

Read HDF5 data with numpy axis order with Julia HDF5

I have an HDF5 file containing arrays that are saved with Python/numpy. When I read them into Julia using HDF5.jl, the axes are in the reverse of the order in which they appear in Python. To reduce the mental gymnastics involved in moving between the Python and Julia codebases, I reverse the axis order when I read the data into Julia. I have written my own function to do this:
function reversedims(ary::AbstractArray)
    # pass the reversed dimension order directly; wrapping it in [ ] would
    # create a one-element vector holding a range on modern Julia
    permutedims(ary, ndims(ary):-1:1)
end
data = HDF5.read(someh5file, somekey) |> reversedims
This is not ideal because (1) I always have to import reversedims to use this; (2) I have to remember to do this for each Array I read. I am wondering if it is possible to either:
instruct HDF5.jl to read in the arrays with a numpy-style axis order, either through a keyword argument or some kind of global configuration parameter
use a built-in single-argument function to reverse the axes
The best approach would be to create an H5py.jl package, modeled on MAT.jl (which reads and writes .mat files created by Matlab). See also https://github.com/timholy/HDF5.jl/issues/180.
It looks to me like permutedims! does what you're looking for; however, it does make an array copy. If you can rewrite the HDF5 files in Python, numpy.asfortranarray claims to return your data stored in column-major format, though the numpy internals docs seem to suggest that only the strides are altered, not the data, so I don't know whether the HDF5 file output would be any different.
Edit: Sorry, I just saw you are already using permutedims in your function. I couldn't find anything else on the Julia side, but I would still try numpy.asfortranarray and see if that helps.
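If you want to test that suggestion, a sketch of the Python side (file and dataset names are made up):
import h5py
import numpy as np

arr = np.arange(12).reshape(3, 4)
with h5py.File("data.h5", "w") as f:
    # asfortranarray switches the in-memory layout to column-major;
    # whether h5py then writes the dataset any differently is exactly
    # the open question above
    f["mykey"] = np.asfortranarray(arr)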

Which functions should I use to work with an XDF file on HDFS?

I have an .xdf file on an HDFS cluster which is around 10 GB and has nearly 70 columns. I want to read it into an R object so that I can perform some transformations and manipulation. I tried Googling and came across two functions:
rxReadXdf
rxXdfToDataFrame
Could anyone tell me the preferred function for this, as I want to read the data and perform the transformation in parallel on each node of the cluster?
Also, if I read and perform the transformation in chunks, do I have to merge the output of each chunk?
Thanks for your help in advance.
Cheers,
Amit
Note that rxReadXdf and rxXdfToDataFrame have different arguments and do slightly different things:
rxReadXdf has a numRows argument, so use this if you want to read the top 1000 (say) rows of the dataset
rxXdfToDataFrame supports rxTransforms, so use this if you want to manipulate your data in addition to reading it
rxXdfToDataFrame also has the maxRowsByCols argument, which is another way of capping the size of the input
So in your case, you want to use rxXdfToDataFrame since you're transforming the data in addition to reading it. rxReadXdf is a bit faster in the local compute context if you just want to read the data (no transforms). This is probably also true for HDFS, but I haven’t checked this.
However, are you sure that you want to read the data into a data frame? You can use rxDataStep to run (almost) arbitrary R code on an xdf file, while still leaving your data in that format. See the linked documentation page for how to use the transforms arguments.
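A hedged sketch of the rxDataStep route (the paths, column name, and transform below are made up for illustration):
# run in an HDFS-backed compute context so the step is distributed
hdfs <- RxHdfsFileSystem()
inXdf  <- RxXdfData("/data/input.xdf",  fileSystem = hdfs)
outXdf <- RxXdfData("/data/output.xdf", fileSystem = hdfs)

rxDataStep(inData = inXdf, outFile = outXdf,
           transforms = list(newcol = somecol * 2),  # hypothetical transform
           overwrite = TRUE)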

pytables: how to fill in a table row with binary data

I have a bunch of binary data in N-byte chunks, where each chunk corresponds exactly to one row of a PyTables table.
Right now I am parsing each chunk into fields, writing them to the various fields in the table row, and appending them to the table.
But this seems a little silly since PyTables is going to convert my structured data back into a flat binary form for inclusion in an HDF5 file.
If I need to optimize the CPU time necessary to do this (my data comes in large bursts), is there a more efficient way to load the data into PyTables directly?
PyTables does not currently expose a 'raw' dump mechanism like you describe. However, you can fake it by using UInt8Atom and UInt8Col. You would do something like:
import tables as tb

f = tb.open_file('my_file.h5', 'w')
# N is your chunk size in bytes; myrow is one N-byte chunk as a uint8 sequence
mytable = f.create_table('/', 'mytable', {'mycol': tb.UInt8Col(shape=(N,))})
mytable.append([myrow])  # append() expects a sequence of rows
f.close()
This would likely get you the fastest I/O performance. However, you will miss out on the meaning of the various fields that are part of this binary chunk.
Arguably, raw dumping of the chunks/rows is not what you want to do anyway, which is why it is not explicitly supported. Internally, HDF5 and PyTables handle many kinds of conversion for you, including but not limited to endianness and other platform-specific details. By managing the data types for you, they keep the resulting HDF5 file and data set cross-platform. When you dump raw bytes in the manner you describe, you short-circuit one of the main advantages of using HDF5/PyTables, and you have a high probability that the resulting file will look like garbage on anything but the original system that produced it.
So in summary, you should be converting the chunks to the appropriate data types in memory and then writing them out. Yes, this takes more processing power, time, etc. But in addition to being the right thing to do, it will ultimately save you huge headaches down the road.
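For the convert-first approach, one way to keep the parsing cheap is a numpy structured dtype matching the chunk layout (the field names and layout below are invented for illustration):
import numpy as np
import tables as tb

# the dtype must mirror the binary layout of one chunk
dt = np.dtype([('field1', '<i4'), ('field2', '<f8')])
rows = np.frombuffer(chunk_buffer, dtype=dt)  # chunk_buffer: your raw bytes

f = tb.open_file('my_file.h5', 'a')
tbl = f.create_table('/', 'parsed', dt)  # PyTables accepts a numpy dtype here
tbl.append(rows)  # structured arrays append directly, one chunk per row
f.close()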

faster than scan() with Rcpp?

Reading ~5x10^6 numeric values into R from a text file is relatively slow on my machine (a few seconds, and I read several such files), even with scan(..., what="numeric", nmax=5000) or similar tricks. Could it be worthwhile to try an Rcpp wrapper for this sort of task (e.g. Armadillo has a few utilities to read text files)?
Or would I likely be wasting my time for little to no gain in performance because of an expected interface overhead? I'm not sure what's currently limiting the speed (intrinsic machine performance, or something else?). It's a task that I repeat many times a day, typically, and the file format is always the same: 1000 columns, around 5000 rows.
Here's a sample file to play with, if needed.
nr <- 5000
nc <- 1000
m <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
cat(m[1, -1], "\n", file = "test.txt")  # first line is shorter
write.table(m[-1, ], file = "test.txt", append = TRUE,
            row.names = FALSE, col.names = FALSE)
Update: I tried read.csv.sql and also load("test.txt", arma::raw_ascii) using Armadillo and both were slower than the scan solution.
I highly recommend checking out fread in the latest version of data.table. The version on CRAN (1.8.6) doesn't have fread yet (at the time of this post) so you should be able to get it if you install from the latest source at R-forge. See here.
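Once installed, usage is a one-liner (a sketch; fread auto-detects the separator, but check the result against your file):
library(data.table)
system.time(DT <- fread("test.txt"))
m <- as.matrix(DT)  # if you need a matrix rather than a data.table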
Please bear in mind that I'm not an R expert, but maybe the concept applies here too: reading binary data is usually much faster than reading text files. If your source files don't change frequently (e.g. you are running varied versions of your script/program on the same data), read them via scan() once and store them in a binary format (the R Data Import/Export manual has a chapter on exporting binary files).
From there on you can modify your program to read the binary input.
Regarding Rcpp: scan() and friends are likely to call a native implementation (like fscanf()), so writing your own file-read functions via Rcpp may not provide a huge performance gain. You can still try it, though (and optimize for your particular data).
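A sketch of that caching pattern in R (file names assumed; it also assumes rectangular input, so the sample file's short first row would need padding):
# one-time conversion: parse the text once, cache a binary copy
m <- matrix(scan("test.txt", what = numeric()), ncol = 1000, byrow = TRUE)
save(m, file = "test.RData")  # binary (and compressed) by default

# every later run: load the binary cache instead of re-parsing
load("test.RData")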
Salut Baptiste,
Data Input/Output is a huge topic, so big that R comes with its own manual on data input/output.
R's basic functions can be slow because they are so very generic. If you know your format, you can easily write yourself a faster import adapter. If you know your dimensions too, it is even easier as you need only one memory allocation.
Edit: As a first approximation, I would write a C++ ten-liner. Open a file, read a line, break it into tokens, and assign to a vector<vector<double>> or something like that. Even if you use push_back() on individual vector elements, you should be competitive with scan(), methinks.
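Roughly what that ten-liner might look like (a sketch of the approach, not the csv reader class mentioned below):
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::vector<double>> read_matrix(const std::string& path) {
    std::ifstream in(path);
    std::vector<std::vector<double>> rows;
    std::string line;
    while (std::getline(in, line)) {        // one line per row
        std::istringstream ss(line);
        std::vector<double> row;
        double x;
        while (ss >> x) row.push_back(x);   // whitespace-separated tokens
        rows.push_back(std::move(row));
    }
    return rows;
}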
I once had a neat little csv reader class in C++ based on code by Brian Kernighan himself. Fairly generic (for csv files), fairly powerful.
You can then squeeze performance as you see fit.
Further edit: This SO question has a number of pointers for the csv reading case, and references to the Kernighan and Plauger book.
Yes, you almost certainly can create something that goes faster than read.csv/scan. However, for high performance file reading there are some existing tricks that already let you go much faster, so anything you do would be competing against those.
As Mathias alluded to, if your files don't change very often, then you can cache them by calling save, then restore them with load. (Make sure to use ascii = FALSE, since reading the binary files will be quicker.)
Secondly, as Gabor mentioned, you can often get a substantial performance boost by reading your file into a database and then from that database into R.
Thirdly, you can use the HadoopStreaming package to use Hadoop's file reading capabilities.
For more thoughts on these techniques, see Quickly reading very large tables as dataframes in R.
