Passing data from Fortran to R

I am currently doing a large amount of data analysis in Fortran. I have been using R to plot most of my results, as Fortran is ill-suited for visualization. Up until now, the data sets have been two-dimensional and rather small, so I've gotten away with routines that write the data-to-be-plotted and various plot parameters to a .CSV file, and then use a system call to run an R script that reads the file and generates the required plot.
However, I find myself now dealing with somewhat larger 3D data sets, and I do not know if I can feasibly continue in this manner (notably, sending and properly reading in a 3D array via .CSV is rather more difficult, and takes up a lot of excess memory which is a problem given the size of the data sets).
Does anyone know any efficient way of sending data from Fortran to R? The only utility I found for this (RFortran) is Windows-only, and my work computer is a Mac. I know that R possesses a rudimentary Fortran interface, but I am calling R from Fortran, not vice versa, and moreover, given the number of plot parameters I am sending (axis labels, plot titles, axis units and limits, etc., many of which are optional and have default values in the current routines I'm using), I am not sure that it has the features I require.

I would go for writing NetCDF files from Fortran. These files can contain large amounts of multi-dimensional data. There are also good bindings for creating NetCDF files from within Fortran (it is used a lot in climate models). In addition, R has excellent support for working with NetCDF files through the ncdf4 package (the successor of the older ncdf package). It is, for example, very easy to read only a small portion of the data cube into memory (only some timesteps, or only a geographic region). Finally, NetCDF works across all platforms.
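As an illustration, here is a minimal R sketch of reading only a slab of a 3-D variable with ncdf4; the file name "data.nc" and the variable name "temp" are assumptions, not something from the question:
library(ncdf4)                          # read/write NetCDF files from R
nc <- nc_open("data.nc")
slab <- ncvar_get(nc, "temp",
                  start = c(1, 1, 10),  # x, y, time offsets (1-based)
                  count = c(50, 50, 1)) # read a 50 x 50 slice of a single timestep
nc_close(nc)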
In terms of workflow, I would let the Fortran program generate NetCDF files plus some graphics parameters in a separate file (data.nc and data.plt, for example), and then call R as a post-processing step. In this way you do not need to directly interface R and Fortran. Managing the entire workflow could be done by a separate script (e.g. in Python), which calls the Fortran model, gathers the resulting NetCDF/.plt files and creates the plots.

So, it turns out that sending arrays via unformatted files between Fortran and R is trivially easy. Both are column-major, so one needs to do no more than write the array to an unformatted file, write the array's shape and size information to another, and then read the data directly into an array of the proper size and shape in R.
Sample code for an n-dimensional array of integers, a, with dimension i having size s(i).
Fortran side (access must be set to "stream", otherwise extra record-marker bytes are inserted around every write):
open(unit = 1, file="testheader.dat", form="unformatted", access="stream", status="unknown")
open(unit = 2, file="testdata.dat", form="unformatted", access="stream", status="unknown")
write(1) n              ! number of dimensions
do i=1,n
write(1) s(i)           ! size of dimension i
enddo
write(2) a              ! the array itself, written in column-major order
R side (be sure the endianness matches what the Fortran side wrote — the example below assumes big-endian data — or this will fail miserably):
testheader = file("testheader.dat", "rb")
testdata = file("testdata.dat", "rb")
dims <- readBin(testheader, integer(), endian="big")           # number of dimensions
sizes <- readBin(testheader, integer(), n=dims, endian="big")  # size of each dimension
dim(sizes) <- c(dims)
a <- readBin(testdata, integer(), n=prod(sizes), endian="big") # read every element at once
dim(a) <- sizes                                                # reshape; both languages are column-major
You can put the header and data in the same file if you want.
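For example, the read side of a single-file variant might look like this (a sketch; the file name is hypothetical and it assumes the Fortran program writes n, s(1:n) and a consecutively to one stream file):
con <- file("testcombined.dat", "rb")                     # hypothetical combined file
dims <- readBin(con, integer(), endian="big")             # number of dimensions
sizes <- readBin(con, integer(), n=dims, endian="big")    # size of each dimension
a <- readBin(con, integer(), n=prod(sizes), endian="big") # the array data follows the header
dim(a) <- sizes
close(con)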

Related

Export composite RGB image with spatial information, R

I am processing hundreds of 4-band images in R and need help on what is probably a very simple task. As part of the processing, I need to export a single band RGB composite that maintains the spatial information of the original GeoTiff. In other software I've exported a .jgw file, but I need to be able to do this in R. These images will be used as basemaps and fed into another mapping interface. I have searched and searched and can only find how to plotRGB() and how to writeRaster(). plotRGB() loses the spatial information and writeRaster() produces a multi-band image.
Any ideas? There is a built-in example raster in R that can be used:
library(raster)
library(rgdal)
r <- raster(system.file("external/test.grd", package="raster"))
x <- RGB(r)
plotRGB(x) #Is there a way to output this where it will maintain spatial information?
writeRaster(x, filename="file.tif") #This produces a 3-band tiff, not a composite
The writeRaster function can take an options argument to pass creation options to the underlying GDAL library (the available GeoTIFF options are documented in GDAL's GeoTIFF driver documentation). The option TFW=YES writes out a .tfw world file, which appears to be the same thing as a .jgw file.
Now, "composite RGB" isn't standard terminology in the TIFF world; it seems to be specific to "ArcMap" and friends, so it's hard to tell what's really meant by this, but you can generate what one would normally think of as a "standard" RGB TIFF format by specifying that the datatype for the color components be 1-byte unsigned integers (datatype="INT1U"), so the following may do what you want:
writeRaster(RGB(r), filename="file2.tif", datatype="INT1U",
options="TFW=YES", format="GTiff")
As far as I can tell, unrecognized or misspelled options values don't generate any error messages, so you need to be careful they're all spelled correctly.
Just noting an update to the process using the terra package. The process is very similar, but some parameters differ.
r <- rast(system.file("ex/logo.tif", package="terra"))
# a little forced as RGB is already assigned in this image...
RGB(r) <- 1:3
# export as geotiff -- again force due to input file example...
writeRaster(x = r, filename = "rgb2.tif", datatype = "INT1U", filetype = "GTiff")
I've been using this with NAIP imagery successfully.

Storing a single long character string with minimum disk usage in R

I want to use R to store a DNA sequence with minimum disk usage. A DNA sequence is a very long (typically tens of millions of characters) character string composed of "A", "C", "G" and "T".
Suppose "abc.fa" is a text file on the disk contains 43 million characters, I have tried the following different approaches.
(1) Without using R, I use the gzip command on Linux to compress the file "abc.fa"; the resulting file "abc.fa.gz" occupies about 13 MB of disk space.
(2) Using the Biostrings package of R.
dat <- readDNAStringSet("abc.fa")
writeXStringSet(dat, file="abc.comp.fa", compress=TRUE)
The output file abc.comp.fa also occupies about 13 MB of disk space.
(3) Using the save function of R to store the sequence as an R character string.
dat <- readDNAStringSet("abc.fa")
dat <- as.character(dat)
save(dat, file="abc.chara.fa", compress="xz")
The output file abc.chara.fa occupies about 9 MB of disk space.
I am wondering if there are more efficient approaches to store this kind of sequence with even smaller disk usage in R.
Thanks.
EDIT:
I did some research. Both save and saveRDS come with three different possible compression algorithms, as you already know. What could be more interesting for you is the compression_level argument of save. It is an integer from 1 to 9, by default set to 6 for gzip compression and to 9 for bzip2 or xz compression. saveRDS only offers the default level for each of the three compression algorithms.
Higher compression levels come at a cost in read and write times. I previously suggested saveRDS since you need to save a single object. In any case, if responsiveness is not a concern (the object is fairly small), I suggest you test the three algorithms with compression_level = 9 and see which one best fits your needs.
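A rough sketch of that comparison (the output file names are just placeholders):
library(Biostrings)
dat <- as.character(readDNAStringSet("abc.fa"))
save(dat, file="abc.gz.rda",  compress="gzip",  compression_level=9)
save(dat, file="abc.bz2.rda", compress="bzip2", compression_level=9)
save(dat, file="abc.xz.rda",  compress="xz",    compression_level=9)
file.size("abc.gz.rda", "abc.bz2.rda", "abc.xz.rda")   # compare the on-disk sizes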
EDIT 2:
As far as I know, the structure of the string should not affect the size of the object, but I have a hypothesis. Your data has only four values, namely A, C, T, G. As plain text, each character is stored in a full byte (8 bits), which allows a far wider range of symbols than you need. A two-bit representation, where 00, 01, 10 and 11 encode the four bases, would be enough, saving the otherwise unused space. You could check how your data is represented and consider such a conversion.
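To illustrate the idea, here is a rough sketch of such a 2-bit packing in base R (purely illustrative; dedicated formats such as 2bit do this more carefully, and the sequence length must be stored separately to undo the padding):
pack_dna <- function(seq) {
  code <- c(A=0L, C=1L, G=2L, T=3L)                 # 2-bit code per base
  vals <- unname(code[strsplit(seq, "")[[1]]])
  pad  <- (4 - length(vals) %% 4) %% 4              # pad so each byte holds 4 bases
  m <- matrix(c(vals, rep(0L, pad)), nrow=4)
  as.raw(m[1,] + 4L*m[2,] + 16L*m[3,] + 64L*m[4,])  # pack 4 bases into 1 byte
}
writeBin(pack_dna("ACGTACGTTTGA"), "abc.2bit.bin")  # 2 bits per base on disk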

Either unformatted I/O is giving absurd values, or I'm reading them incorrectly in R

I have a problem with unformatted data and I don't know where, so I will post my entire workflow.
I'm integrating my own code into an existing climate model, written in Fortran, to generate a custom variable from the model output. I have been successful in getting sensible and readable formatted output (values up to the thousands), but when I try to write unformatted output, the values I read back are absurd (on the scale of 1E10).
Would anyone be able to take a look at my process and see where I might be going wrong?
I'm unable to make a functional replication of the entire code used to output the data; however, the relevant snippet is:
c write customvar to file [UNFORMATTED]
open (unit=10,file="~/output_test_u",form="unformatted")
write (10)customvar
close(10)
c write customvar to file [FORMATTED]
c open (unit=10,file="~/output_test_f")
c write (10,*)customvar
c close(10)
The model was run twice, once with the FORMATTED code commented out and once with the UNFORMATTED code commented out, although I now realise I could have run it once if I'd used different unit numbers. Either way, different runs should not produce different values.
The files produced are available here;
unformatted(9kb)
formatted (31kb)
In order to interpret these files, I am using R. The following code is what I used to read each file, and shape them into comparable matrices.
##Read in FORMATTED data
formatted <- scan(file="output_test_f",what="numeric")
formatted <- (matrix(formatted,ncol=64,byrow=T))
formatted <- apply(formatted,1:2,as.numeric)
##Read in UNFORMATTED data
to.read <- file("output_test_u","rb")
unformatted <- readBin(to.read,integer(),n=10000)
close(to.read)
unformatted <- unformatted[c(-1,-2050)] #to remove padding
unformatted <- matrix(unformatted,ncol=64,byrow=T)
unformatted <- apply(unformatted,1:2,as.numeric)
In order to check that the general structure of the data is the same between the two files, I checked that zero and non-zero values were in the same positions in each matrix (each value represents a grid square, and zeros represent where there was sea) using;
as.logical(unformatted)-as.logical(formatted)
and an array of zeros was returned, indicating that it is the just the values which are different between the two, and not the way I've shaped them.
To see how the values relate to each other, I plotted the formatted values against the unformatted ones (with all zero values removed).
They clearly have some sort of relationship, so the inflation of the values is not random.
I am completely stumped as to why the unformatted data values are so inflated. Is there an error in the way I'm reading and interpreting the file? Is there some underlying way that fortran writes unformatted data that alters the values?
The usual method that Fortran uses to write unformatted sequential files is:
A leading record marker, usually four bytes, with the length of the following record
The actual data
A trailing record marker, the same number of bytes as the leading record marker, with the same information (used for BACKSPACE)
The usual number of bytes in the record marker is four bytes, but eight bytes have also been sighted (e.g. very old versions of gfortran for 64-bit systems).
If you don't want to deal with these complications, just use stream access. On the Fortran side, open the file with
OPEN(unit=10,file="foo.dat",form="unformatted",access="stream")
This will give you a stream-oriented I/O model like C's binary streams.
Otherwise, you would have to look at your compiler's documentation to see how exactly unformatted I/O is implemented, and take care of the record markers from the R side. A word of caution here: Different compilers have different methods of dealing with very long records of more than 2^31 bytes, even if they have four-byte record markers.
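For reference, reading a single sequential-access record from R would look roughly like this (a sketch assuming four-byte record markers; adjust what and size to match the Fortran declaration of customvar):
con <- file("output_test_u", "rb")
reclen <- readBin(con, integer(), size=4)               # leading marker: record length in bytes
vals   <- readBin(con, numeric(), size=4, n=reclen/4)   # payload, here assumed to be 4-byte reals
stopifnot(readBin(con, integer(), size=4) == reclen)    # trailing marker repeats the length
close(con)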
Following on from the comments of @Stibu and @IanH, I experimented with the R code and found that the source of the error was incorrect handling of the byte size in R. Explicitly specifying a byte size of 4, i.e.
unformatted <- readBin(to.read, integer(), size=4, n=10000)
allows the data to be perfectly read in.

Running the Mahout k-means clustering command without converting the input file to vectors

I have a dataset (300 MB) on which I wish to run k-means clustering using Mahout. The data is in the form of a CSV file that contains only numerical values. Is it still necessary to convert the file to vectorized format for the Mahout k-means command? If not, how can I run k-means directly on my CSV file without converting it to vector format?
If your data is 300 MB, the answer is: don't use Mahout at all.
Really ONLY EVER use Mahout when your data no longer fits into memory. MapReduce is expensive; you only want to use it when you can't solve the problem without it.
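For a sense of scale, here is a sketch of doing the same thing in memory with R (the file name and the number of clusters are made up; any single-machine tool would do):
dat <- as.matrix(read.csv("data.csv"))   # ~300 MB of numeric data fits comfortably in RAM
km  <- kmeans(dat, centers=10, nstart=5)
table(km$cluster)                        # rows assigned to each cluster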

pytables: how to fill in a table row with binary data

I have a bunch of binary data in N-byte chunks, where each chunk corresponds exactly to one row of a PyTables table.
Right now I am parsing each chunk into fields, writing them to the various fields in the table row, and appending them to the table.
But this seems a little silly since PyTables is going to convert my structured data back into a flat binary form for inclusion in an HDF5 file.
If I need to optimize the CPU time necessary to do this (my data comes in large bursts), is there a more efficient way to load the data into PyTables directly?
PyTables does not currently expose a 'raw' dump mechanism like you describe. However, you can fake it by using UInt8Atom and UInt8Col. You would do something like:
import tables as tb

f = tb.open_file('my_file.h5', 'w')
# one row = one chunk: a single column of N unsigned bytes (N = chunk size)
mytable = f.create_table('/', 'mytable', {'mycol': tb.UInt8Col(shape=(N,))})
mytable.append(myrow)   # myrow: a sequence of rows, each holding an N-byte chunk
f.close()
This would likely get you the fastest I/O performance. However, you will miss out on the meaning of the various fields that are part of this binary chunk.
Arguably, raw dumping of the chunks/rows is not what you want to do anyway, which is why it is not explicitly supported. Internally, HDF5 and PyTables handle many kinds of conversion for you, including, but not limited to, endianness and other platform-specific details. By managing the data types for you, they keep the resulting HDF5 file and data set portable across platforms. When you dump raw bytes in the manner you describe, you short-circuit one of the main advantages of using HDF5/PyTables. If you do, there is a high probability that the resulting file will look like garbage on anything but the original system that produced it.
So in summary, you should convert the chunks to the appropriate data types in memory and then write them out. Yes, this takes more processing power and time, but in addition to being the right thing to do, it will ultimately save you huge headaches down the road.
