How to deal with HDF5 files in R?

I have a file in HDF5 format. I know that it is supposed to be a matrix, and I want to read that matrix into R so that I can study it. I see that there is an h5r package that is supposed to help with this, but I cannot find any simple, readable tutorial. Is such a tutorial available online? Specifically, how do you read an HDF5 object with this package, and how do you actually extract the matrix?
UPDATE
I found a package, rhdf5, which is not part of CRAN but is part of Bioconductor. The interface is relatively easy to understand, and the documentation and example code are quite clear. I could use it without problems. My problem, it turns out, was the input file: the matrix that I wanted to read was actually stored in the HDF5 file as a Python pickle, so every time I tried to open it and access it through R I got a segmentation fault. I figured out how to save the matrix from within Python as a TSV file instead, and now that problem is solved.

The rhdf5 package works really well, although it is not on CRAN. Install it from Bioconductor:
# as of 2020-09-08, these are the updated instructions per
# https://bioconductor.org/install/
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.11")
And to use it:
library(rhdf5)
List the objects within the file to find the data group you want to read:
h5ls("path/to/file.h5")
Read the HDF5 data:
mydata <- h5read("path/to/file.h5", "/mygroup/mydata")
And inspect the structure:
str(mydata)
(Note that multidimensional arrays may appear transposed.) You can also read groups, which become named lists in R.
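As a minimal round trip (assuming rhdf5 is installed from Bioconductor; the temp-file path and the dataset name "mymatrix" are arbitrary choices for this sketch):

```r
library(rhdf5)

# write a matrix to a fresh HDF5 file, then read it back
fname <- tempfile(fileext = ".h5")
h5createFile(fname)

m <- matrix(1:12, nrow = 3)
h5write(m, fname, "mymatrix")     # dataset name is arbitrary

m2 <- h5read(fname, "mymatrix")
stopifnot(identical(dim(m), dim(m2)))

h5closeAll()                      # close any open HDF5 handles
```

h5ls(fname) would list "mymatrix" alongside its dimensions, which is handy when you don't know the layout of a file in advance.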

You could also use h5, a package which I recently published on CRAN.
Compared to rhdf5 it has the following features:
S4 object model to directly interact with HDF5 objects like files, groups, datasets and attributes.
Simpler syntax, with R-like subsetting operators for datasets, supporting commands like
readdata <- dataset[1:3, 1:3]
dataset[1:3, 1:3] <- matrix(1:9, nrow = 3)
Support for NA values for all data types
200+ test cases with code coverage of 80%+.
To save a matrix you could use:
library(h5)
testmat <- matrix(rnorm(120), ncol = 3)
# Create HDF5 File
file <- h5file("test.h5")
# Save matrix to file in group 'testgroup' and datasetname 'testmat'
file["testgroup", "testmat"] <- testmat
# Close file
h5close(file)
... and read the entire matrix back into R:
file <- h5file("test.h5")
testmat_in <- file["testgroup", "testmat"][]
h5close(file)
See also h5 on
CRAN: http://cran.r-project.org/web/packages/h5/index.html
Github: https://github.com/mannau/h5

I used the rgdal package to read HDF5 files. Be aware that the binary version of rgdal probably does not support HDF5; in that case, you need to build gdal from source with HDF5 support before building rgdal from source.
Alternatively, try converting the files from HDF5 to netCDF. Once they are in netCDF, you can use the excellent ncdf package to access the data. The conversion could, I think, be done with the cdo tool.
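A sketch of that conversion driven from R (this assumes the cdo command-line tool is installed and was built with HDF5 support; the filenames are placeholders, and whether a given HDF5 file can be copied this way depends on how it was written):

```r
# convert HDF5 to netCDF via the external cdo tool (placeholder filenames)
system("cdo -f nc copy input.h5 output.nc")

# then read the converted file with the ncdf package
library(ncdf)
nc <- open.ncdf("output.nc")
```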

The ncdf4 package, an interface to netCDF-4, can also be used to read hdf5 files (netCDF-4 is compatible with netCDF-3, but it uses hdf5 as the storage layer).
In the developer's words:
The HDF Group says:
NetCDF-4 combines the netCDF-3 and HDF5 data models, taking the desirable characteristics of each, while taking advantage of their separate strengths
Unidata says:
The netCDF-4 format implements and expands the netCDF-3 data model by using an enhanced version of HDF5 as the storage layer.
In practice, ncdf4 provides a simple interface, and migrating code from using older hdf5 and ncdf packages to a single ncdf4 package has made our code less buggy and easier to write (some of my trials and workarounds are documented in my previous answer).
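A short sketch of that workflow (the file path is a placeholder, and this only works when the file is a valid netCDF-4/HDF5 file; plain HDF5 files that do not follow the netCDF-4 conventions may still fail to open):

```r
library(ncdf4)

nc <- nc_open("path/to/file.h5")   # opens netCDF-4 files, which use HDF5 storage
print(names(nc$var))               # list the available variables

# read the first variable into an ordinary R array
x <- ncvar_get(nc, names(nc$var)[1])
str(x)

nc_close(nc)
```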

Related

How to write an effective loop to access datasets inside 1000's of h5 files in R [duplicate]


How to access/open arbitrary formatted (.pdf etc) documents from R's console?

http://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-package-vignettes states:
"...In addition to the help files in Rd format, R packages allow the inclusion of documents in arbitrary other formats. The standard location for these is subdirectory inst/doc of a source package, the contents will be copied to subdirectory doc when the package is installed. Pointers from package help indices to the installed documents are automatically created. Documents in inst/doc can be in arbitrary format, however we strongly recommend providing them in PDF format, so users on almost all platforms can easily read them..."
I used the roxygen package. It produced .Rd files for me. I produced a .pdf of my package via "R CMD Rd2pdf causfinder/" from the Windows command line (or during the build/install process via "roxygenize("causfinder"); build("causfinder"); install("causfinder")").
I wanted to add some supplementary .pdf help files (created outside of R, e.g. from Word via Save As .pdf) to my package, beyond the one produced via the above techniques. These supplementary .pdf files include the detailed mathematical theory and many examples of my package's functions, illustrating their usage via various plots, graphs, etc. I called it TheoryOfcausfinder.pdf.
I wanted to add this supplementary .pdf file to my package. As directed by R's manual above, I put TheoryOfcausfinder.pdf into the inst\doc folder in my R working directory. Upon the build/install process, I obtained causfinder/doc/index.html and causfinder/doc/TheoryOfcausfinder.pdf in my R library location.
The content of index.html:
"...Vignettes from package 'causfinder': The package contains no vignette meta-information.
Other files in the doc directory: TheoryOfcausfinder.pdf..."
I want future users of causfinder to be able to easily access/open this supplementary TheoryOfcausfinder.pdf. (I will add a note in the functions' help documents that such a file exists.)
Is there a way to open/access TheoryOfcausfinder.pdf in R's library location from (within) R's console?
(Important: by the way, since I am a novice with Sweave and knitr, I do not want to go down that path! I am looking for a solution outside Sweave and knitr.)
Any help will be greatly appreciated.
I don't know if you have found another solution yet, but I am posting some possible ideas below.
source http://www.r-bloggers.com/show-me-the-pdf-already/
# under Unix types
pdf <- getOption("pdfviewer", default='')
f <- system.file("doc", "TheoryOfcausfinder.pdf", package = "causfinder")
system2(pdf, args = f)
source http://www.r-bloggers.com/show-me-the-pdf-already/
# under MS Windows
f <- system.file("doc", "TheoryOfcausfinder.pdf", package = "causfinder")
shell.exec(normalizePath(f))
source: Stack Overflow question 33791818, "Opening PDF within RStudio using file.show"
# under OS X
f <- system.file("doc", "TheoryOfcausfinder.pdf", package = "causfinder")
system2('open', args = f, wait = FALSE)
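The three platform-specific snippets above can be folded into one helper that dispatches on the current platform. (open_pkg_pdf is a hypothetical name, not part of any package; the xdg-open fallback for Unix is an assumption on my part.)

```r
# Hypothetical helper: open a PDF shipped in a package's doc/ directory.
open_pkg_pdf <- function(file, package) {
  f <- system.file("doc", file, package = package)
  if (f == "") stop("PDF not found in the package's doc directory")
  if (.Platform$OS.type == "windows") {
    shell.exec(normalizePath(f))
  } else if (Sys.info()[["sysname"]] == "Darwin") {
    system2("open", args = shQuote(f), wait = FALSE)        # OS X
  } else {
    # other Unix: fall back to xdg-open if no pdfviewer option is set
    viewer <- getOption("pdfviewer", default = "xdg-open")
    system2(viewer, args = shQuote(f), wait = FALSE)
  }
  invisible(f)
}

# e.g. open_pkg_pdf("TheoryOfcausfinder.pdf", "causfinder")
```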

Convert Stata 13 .dta file to CSV without using stata [duplicate]

Is there a way to read a Stata version 13 dataset file in R?
I have tried to do the following:
> library(foreign)
> data = read.dta("TEAdataSTATA.dta")
However, I got an error:
Error in read.dta("TEAdataSTATA.dta") :
not a Stata version 5-12 .dta file
Could someone point out if there is a way to fix this?
There is a new package to import Stata 13 files into a data.frame in R.
Install the package and read a Stata 13 dataset with read.dta13():
install.packages("readstata13")
library(readstata13)
dat <- read.dta13("TEAdataSTATA.dta")
Update: as of version 0.8, readstata13 also imports files from Stata 6 to 14.
More about the package: https://github.com/sjewo/readstata13
There's a new package called haven, by Hadley Wickham, which can load Stata 13 .dta files (as well as SAS and SPSS files):
library(haven) # haven package now available on cran
df <- read_dta('c:/somefile.dta')
See: https://github.com/hadley/haven
If you have Stata 13, you can load the file there and save it in Stata 12 format using the command saveold (see help saveold). Afterwards, take it to R.
If you have Stata 10-12, you can use the user-written command use13 (by Sergiy Radyakin) to load the file and save it from there; then take it to R. You can install use13 by running ssc install use13.
Details can be found at http://radyakin.org/transfer/use13/use13.htm
Other alternatives, still with Stata, involve exporting the Stata format to something else that R will read, e.g. text-based files. See help export within Stata.
Update
Starting Stata 14, saveold has a version() option, allowing one to save in Stata .dta formats as old as Stata 11.
In the meantime, the savespss command became a member of the SSC archive and can be installed in Stata with: findit savespss
The homepage http://www.radyakin.org/transfer/savespss/savespss.htm continues to work, but the program should now be installed from the SSC, not from the beta location.
I am not familiar with the current state of R programs regarding their ability to read other file formats, but if someone does not have Stata installed and R cannot read a specific version of Stata's .dta files, pandas in Python can now do the vast majority of such conversions.
Basically, the data from the .dta file are first loaded using the pandas.read_stata function. As of version 0.23.0, the supported encodings and formats can be found in a related answer of mine.
Then one can either save the data as a CSV file and import it using standard R functions, or instead use the pandas.DataFrame.to_feather function, which exports the data using a serialization format built on Apache Arrow. The latter has extensive support in R, as it was conceived to promote interoperability with pandas.
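On the R side, reading such a Feather file could look like the sketch below (the filename is a placeholder; the arrow package provides read_feather, and the older feather package works similarly):

```r
# assumes the Python side ran something like: df.to_feather("data.feather")
library(arrow)

df <- read_feather("data.feather")   # returns a data frame (tibble)
str(df)
```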
I had the same problem. I tried read.dta13 and read.dta, but nothing worked. Then I tried the easiest and least expected option: MS Excel! It opened marvelously. I saved it as a .csv and used it in R. Hope this helps!

Read Stata 13 file in R


How to extract variable names from a netCDF file in R?

I am writing a function in R to extract some air quality modelling data from netCDF files. I have the Package "ncdf" installed.
In order to allow other users or myself to choose which variables to extract from a netCDF file, I would like to extract the names of all variables in the file, so that I can present them in a simple list rather than print.ncdf() the whole file, which gives too much information. Is there any way of doing this?
I tried unlist() on the var field of the ncdf object, but it seemed to return the contents as well...
I googled and searched Stack Overflow but didn't find an answer, so your help is very much appreciated.
Many thanks in advance.
If your ncdf object is called nc, then quite simply:
names(nc$var)
For example, using a downloaded dataset (since you didn't provide one):
nc <- open.ncdf("20130128-ABOM-L4HRfnd-AUS-v01-fv01_0-RAMSSA_09km.nc")
names(nc$var)
[1] "analysed_sst" "analysis_error" "sea_ice_fraction" "mask"
It is now 2016, and the ncdf package is deprecated. The same code as in SE user plannapus' answer is now:
library(ncdf4)
netcdf.file <- "flux.nc"
nc = ncdf4::nc_open(netcdf.file)
variables = names(nc[['var']])
#print(nc)
A note from the documentation:
Package: ncdf
Title: Interface to Unidata netCDF Data Files
Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>
Version: 1.6.9
Author: David Pierce <dpierce@ucsd.edu>
Description: This package is deprecated and will be removed from CRAN
in early 2016: use 'RNetCDF' or 'ncdf4' instead. The newer package
"ncdf4" is designed to work with the netcdf library version 4, and
supports features such as compression and chunking. Unfortunately, for
various reasons the ncdf4 package must have a different API than the
ncdf package.
A note from the home page of the maintainer:
Package ncdf4 -- use this for new code
The "ncdf4" package is designed to work with the netcdf library, version 4.
It includes the ability to use compression and chunking,
which seem to be some of the most anticipated benefits of the version 4
library. Note that the API of ncdf4 has to be different
from the API of ncdf, unfortunately. New code should use ncdf4, not ncdf.
http://cirrus.ucsd.edu/~pierce/ncdf/