Streaming data in Julia

Currently, is there a good way to read data in Julia in a streaming fashion?
For example, let's say I have a CSV file that is too big to fit in memory. Are there currently built in functions or a library that facilitates working with this?
I'm aware of the prototype DataStream functionality in DataFrames, but that's not currently exposed via a public API.

The eachline function turns an IO source into an iterator of lines. That should allow you to read a file one line at a time. From there, the readcsv and readdlm functions can parse each line if you wrap it in an IOBuffer.
for ln in eachline(open("file.csv"))
    data = readcsv(IOBuffer(ln))
    # do something with this data
end
It's still pretty do-it-yourself, but there aren't that many steps, so it's not too bad.

Related

In R and Sparklyr, writing a table to .CSV (spark_write_csv) yields many files, not one single file. Why? And can I change that?

Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part".
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitions are done for computational efficiency. If you have partitions, multiple workers/executors can write the table on each partition. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only for writing tables but also for parallel computations.
For more details on partitioning, you can check this link.
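As a quick sanity check, sparklyr can report how many partitions the table currently has; the count should match the number of part files being written (sdf_num_partitions is assumed to be available in your version of sparklyr):
sdf_num_partitions(table)  # e.g. 10 -- one part-*.csv file is written per partition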
Suppose I want to save table as a single file at the path path/to/table.csv. I would do this as follows:
table <- table %>% sdf_repartition(partitions = 1)
spark_write_csv(table, "path/to/table.csv", ...)
You can check full details of sdf_repartition in the official documentation.
Data will be divided into multiple partitions, and when you save the dataframe to CSV you will get one file per partition. Before calling spark_write_csv you need to bring all the data into a single partition to get a single file.
You can use a method called coalesce to achieve this; in sparklyr it is exposed as sdf_coalesce.
df <- sdf_coalesce(df, 1)
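Putting the pieces together for the original example, a rough sketch might look like this (d1 and the output path are taken from the question, and sdf_coalesce from above; note that Spark still writes a folder, now containing a single part-*.csv file, rather than a bare d1.csv):
d1_single <- sdf_coalesce(d1, 1)     # collapse to one partition
spark_write_csv(d1_single, "C:/d1")  # writes a folder holding a single part-*.csv file
# the lone part-*.csv inside C:/d1 can then be renamed or moved as needed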

"filename.rdata" file Exploring and Converting to CSV

I'm no R programmer (I only started learning it because of this problem); I'm using Python. For a forecasting task I was given a dataset, signalList.rdata, of a phenomenon called partial discharge.
I tried some commands to load, open, and view it, but hardly got a glimpse:
my_data <- get(load('C:/Users/Zack-PC/Desktop/Study/Data Sets/pdCluster/signalList.Rdata'))
Since I lack deep knowledge of R, I want to convert it into a CSV file, or any format that I can deal with in Python,
or explore it and copy-paste manually.
So I'm asking for any solution, whether using R, Python, or any other tool, to get at what's in the .rdata file.
Have you managed to load the data successfully into your working environment?
If so, write.csv is the function you are looking for.
If not,
setwd("C:/Users/Zack-PC/Desktop/Study/Data Sets/pdCluster/")
signalList <- get(load("signalList.Rdata"))
write.csv(signalList, "signalList.csv")
should do the trick (note that load() returns the name of the loaded object, so get() is needed to fetch the object itself).
If you would like to remove signalList from your workspace,
rm(signalList)
will accomplish this.
Note: changing your working directory isn't necessary; it just makes the code easier to read, I feel. You may also specify another path for saving your CSV within the second argument of write.csv.
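If the loaded object turns out not to be a plain data frame (the name signalList suggests it may be a list), it is worth inspecting it before exporting. A minimal sketch, reusing the get(load(...)) call from the question:
my_data <- get(load("signalList.Rdata"))
class(my_data)               # what kind of object is it?
str(my_data, max.level = 1)  # top-level structure without flooding the console
# if it turns out to be a list of equal-length vectors, flatten it before writing:
# write.csv(as.data.frame(my_data), "signalList.csv", row.names = FALSE)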

Multiple procedures in IDL program

I've written a procedure in IDL which performs some calculations on data and outputs an array of values. The calculations take about 2 minutes to run.
I need to then perform analysis on these results, and ideally I would like not to have to perform the initial calculations each time I want to perform some different analysis.
Is the best way to achieve this to save the output from the calculation to a data file and then read this in from a different program? Or is there a less cumbersome way to go about this?
Thanks in advance for any help
Yes, saving to a file is the easiest way to save the results from your first program for later use in the second (assuming you quit IDL between the two). There are many ways to save the data, depending on its type and your preferences.
Easiest Way:
An IDL .sav file created by the SAVE command can store any kind of data, IDL variables, and even the whole state of your IDL session. Unfortunately, it only works with IDL (no other languages), and it may need to be re-generated if you upgrade your IDL version. You read these files with RESTORE, which even remembers the names of the variables.
my_variable = 'Some data here.'
SAVE, my_variable, FILENAME='myfile.sav' ; save variable(s)
... IDL opened and closed here ...
RESTORE, 'myfile.sav' ; read variable(s) from file
print, my_variable
Some data here.
Most Portable Way:
For simple tabular data, CSV has the advantage of being highly portable and human readable. However, it's also slow, since numbers are stored in ASCII. Use WRITE_CSV to write, and READ_CSV to read.
Most Portable Binary Formats:
For complex data that needs to be read by multiple languages, consider the HDF5 or NetCDF libraries. Both of these are binary formats that can store most types of IDL-supported data. Note that NetCDF is actually an easier-to-use subset of HDF5.
Simplest Binary Format:
Another option for tabular data is a simple binary dump. Use WRITEU to write to a normal file opened for writing. Use READU to read from a normal file open for reading.
Assuming that your data calculations will only change very rarely, then, yes, your best solution is to just save the calculations to an output file, and then read them back into your analysis program. You don't say what kind of data this is, so it's hard to give a more specific answer. Assuming that you have a two-dimensional array of data, you could just write the results as a "flat" binary file:
pro perform_calculations
  ...
  ; assume mydata is a float array of dimensions [m,n]
  openw, 1, 'results.dat'
  writeu, 1, mydata
  close, 1
end
Then, in either the same file or preferably a different .pro file:
pro perform_analysis
  mydata = fltarr(m, n)   ; m and n must match the dimensions used when writing
  openr, 1, 'results.dat'
  readu, 1, mydata
  close, 1
  ...
end
Hope this helps.
Saving is a good way to do it, but if you run in the same session and your second program won't mess up the data from the first one, you can just call one and then pass the result to the second one.
pro do_calculations, result1, result2, result3
  result1 = 1
  result2 = 1.
  result3 = result1 / result2
  return
end
pro use_calculations, result1, result2, result3, result4
  result4 = result1 - result2 + result3
  return
end
Then
IDL> do_calculations,result1,result2,result3
IDL> use_calculations,result1,result2,result3,result4
If you edit use_calculations, you can go again by:
IDL> use_calculations,result1,result2,result3,result4
Because the earlier results will stay in memory unless use_calculations does something bad to them.
You could also set up the second procedure to check to see if it has valid results from the first one and call it as needed.

R passing data frame to another program using system()

I have a data frame that I pass to another program using system(). In the current setup, I first write the contents of the dataframe to a text file, then have the system() command look for the created text file.
df1 <- runif(20)
write(df1, file="file1.txt")
system("myprogram file1.txt")
I have 2 questions:
1) Is there a way to pass a dataframe directly without writing the text file?
2) If not, is there are way to pass the data in memory as a text formatted entity without writing the file to disk?
Thanks for any suggestions.
You can write to anything R calls a connection, and that includes network sockets.
So process A can write to the network and process B can read it, without any file on disk involved; see help(connections), which even has a working example in its "Examples" section.
Your general topic here is serialization, and R does that for you. You can also pass data that way to other programs using tools that encode metadata about your data structure -- as for example Google's Protocol Buffers (supported in R by the RProtoBuf package).
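A rough sketch of the socket route, assuming the receiving program is already listening on an agreed port (the port number here is made up; see help(connections) for the details):
con <- socketConnection(host = "localhost", port = 6011, blocking = TRUE, open = "w")
write.table(df1, con, row.names = FALSE)  # stream the data frame over the socket
close(con)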
I spent quite a while and couldn't understand the accepted answer. But I figured out a workaround.
df1 <- runif(20)
# write.table() prints to standard output, so capture that output as the character vector expected by `input`
system("myprogram /dev/stdin", input = capture.output(write.table(df1)))
However, according to documentation, the input argument will actually be redirected to a temp file, which I suppose will involve some i/o.

Is there a way to read and write in-memory files in R?

I am trying to use R to analyze large DNA sequence files (fastq files, several gigabytes each), but the standard R interface to these files (ShortRead) has to read the entire file at once. This doesn't fit in memory, so it causes an error. Is there any way that I can read a few (thousand) lines at a time, stuff them into an in-memory file, and then use ShortRead to read from that in-memory file?
I'm looking for something like Perl's IO::Scalar, for R.
I don’t know much about R, but have you had a look at the mmap package?
It looks like ShortRead is soon to add a "FastqStreamer" class that does what I want.
Well, I don't know about readFastq accepting something other than a file...
But if it can, or for other functions that can, you can use the R function pipe() to open a Unix connection, and then do this with a combination of the Unix commands head and tail and some pipes.
For example, to get lines 91 to 100, you use this:
head -n 100 file.txt | tail -n 10
So you can just read the file in chunks.
If you have to, you can always use these unix utilities to create a temporary file, then read that in with ShortRead. It's a pain, but if it can only take a file, at least it works.
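A rough sketch of the pipe() idea from R (the file name and chunk boundaries are made up; it assumes a Unix-like system with head and tail available):
con <- pipe("head -n 8000 reads.fastq | tail -n 4000", open = "r")
chunk <- readLines(con)  # lines 4001-8000, i.e. fastq records 1001-2000
close(con)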
Incidentally, the answer to how to do an in-memory file in R in general (like Perl's IO::Scalar) is the textConnection function. Sadly, though, the ShortRead package cannot handle textConnection objects as inputs, so while the idea I expressed in the question, reading a file in small chunks into in-memory files which are then parsed bit by bit, is certainly possible for many applications, it does not work for my particular application, since ShortRead does not like textConnections. So the solution is the FastqStreamer class described above.
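For reference, a minimal sketch of textConnection as an in-memory file (the lines here are made up; this shows the general technique rather than a ShortRead-specific workaround):
fake_lines <- c("@read1", "ACGTACGT", "+", "IIIIIIII")
con <- textConnection(fake_lines)
readLines(con)  # reads the "file" straight from memory
close(con)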
