Extracting text between double quotation marks in R [duplicate] - r

I am fairly new to R, but the more I use it, the more I see how powerful it really is compared with SAS or SPSS. One of the major benefits, as I see it, is the ability to get and analyze data from the web. I imagine this is possible (and maybe even straightforward), but I am looking to parse JSON data that is publicly available on the web. I am not a programmer by any stretch, so any help and instruction you can provide will be greatly appreciated. Even if you point me to a basic working example, I can probably work through it.

RJSONIO from Omegahat is another package which provides facilities for reading and writing data in JSON format.
rjson does not use S4/S3 methods and so is not readily extensible, but it is still useful. Unfortunately, it does not use vectorized operations and so is too slow for non-trivial data. Similarly, reading JSON data into R is somewhat slow and so does not scale to large data, should this be an issue.
Update (new Package 2013-12-03):
jsonlite: This package is a fork of the RJSONIO package. It builds on the parser from RJSONIO but implements a different mapping between R objects and JSON strings. The C code in this package is mostly from the RJSONIO package; the R code has been rewritten from scratch. In addition to drop-in replacements for fromJSON and toJSON, the package has functions to serialize objects. Furthermore, the package contains a lot of unit tests to make sure that all edge cases are encoded and decoded consistently for use with dynamic data in systems and applications.

The jsonlite package is easy to use and tries to convert json into data frames.
Example:
library(jsonlite)
# Stack Exchange API endpoint that lists Stack Overflow badges
url <- 'https://api.stackexchange.com/2.2/badges?order=desc&sort=rank&site=stackoverflow'
# read url and convert to data.frame
document <- fromJSON(txt=url)
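To try the same call without a network connection, fromJSON() also accepts a JSON string. The following is a minimal sketch, assuming jsonlite is installed; the sample records and the names json_text and badges are made up for illustration.

```r
library(jsonlite)

# Hypothetical sample data: a JSON array of records that share the same fields
json_text <- '[
  {"name": "gold",   "rank": 1, "awarded": 120},
  {"name": "silver", "rank": 2, "awarded": 450},
  {"name": "bronze", "rank": 3, "awarded": 990}
]'

# An array of flat objects is simplified into a data.frame, one row per object
badges <- fromJSON(json_text)
str(badges)

# toJSON() performs the reverse mapping, from data.frame back to JSON text
toJSON(badges, pretty = TRUE)
```

An API response is often an object that wraps such an array, in which case the data frame is one element of the returned list rather than the result itself.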

Here is the missing example
library(rjson)
url <- 'http://someurl/data.json'
document <- fromJSON(file=url, method='C')

The fromJSON() functions in RJSONIO, rjson and jsonlite don't return a simple 2D data.frame for complex nested JSON objects.
To overcome this you can use tidyjson. It takes in JSON and always returns a data.frame. At the time of writing it was not available on CRAN; you could get it here: https://github.com/sailthru/tidyjson
Update: tidyjson is now available on CRAN; you can install it directly using install.packages("tidyjson")
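For moderately nested data, jsonlite itself offers a flatten() function that may be enough. This is a different technique from tidyjson, shown here only as a sketch; the sample records are hypothetical.

```r
library(jsonlite)

# Hypothetical nested records: "author" is an object inside each element
json_text <- '[
  {"title": "Book A", "author": {"first": "Ann", "last": "Lee"}},
  {"title": "Book B", "author": {"first": "Bob", "last": "Ray"}}
]'

nested <- fromJSON(json_text)
# "author" arrives as a data.frame stored inside a single column
ncol(nested)

# flatten() promotes the nested columns to top-level ones,
# named with the parent column as a prefix
flat <- jsonlite::flatten(nested)
names(flat)
```

The call is written as jsonlite::flatten() because other packages, such as purrr, export a function with the same name.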

For the record, rjson and RJSONIO do change the file type, but they don't really parse per se. For instance, I receive ugly MongoDB data in JSON format, convert it with rjson or RJSONIO, then use unlist and tons of manual correction to actually parse it into a usable matrix.

Try the code below, which uses RJSONIO, in the console:
library(RJSONIO)
library(RCurl)
json_file = getURL("https://raw.githubusercontent.com/isrini/SI_IS607/master/books.json")
json_file2 = RJSONIO::fromJSON(json_file)
head(json_file2)

Related

.json file is too large to be opened in R with rjson

I have a 5.1 GB json file that I would like to read in R using rjson. I want afterwards to construct a dataframe from it, however it won't load because the size is too large.
Is there any way to work around it?
Thank you for your help =)
Nina, I would recommend using the jsonlite package instead of rjson.
library(jsonlite)
your_json <- "your_path.json"
unpacked_json <- jsonlite::stream_in(textConnection(readLines(your_json, n = 100000)), verbose = FALSE)
Here readLines() reads only the first 100000 lines, so the amount of data loaded at once stays manageable. Note that stream_in() expects newline-delimited JSON, with one record per line. For more information, I would also recommend doing some research on this topic:
https://community.rstudio.com/t/how-to-read-large-json-file-in-r/13486
Reading a huge json file in R, issues
I know that documentation can be hard to get through, and I don't enjoy reading it myself, but I highly recommend familiarizing yourself with the jsonlite documentation and vignettes. Here's the CRAN link: https://cran.r-project.org/web/packages/jsonlite/index.html
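If the file is newline-delimited JSON, stream_in() can also process it page by page through a handler function, so the whole file never has to sit in memory. The sketch below assumes that layout; the temporary file, the column names and the running totals are stand-ins for the real data and the real per-page work.

```r
library(jsonlite)

# Write a small NDJSON file to stand in for the large one (hypothetical data)
path <- tempfile(fileext = ".json")
stream_out(data.frame(id = 1:1000, value = (1:1000) / 10), file(path), verbose = FALSE)

# Running totals updated by the handler
total_rows <- 0
value_sum <- 0

# The handler receives one page (a data.frame of up to `pagesize` rows) at a time
stream_in(
  file(path),
  handler = function(page) {
    total_rows <<- total_rows + nrow(page)
    value_sum <<- value_sum + sum(page$value)
  },
  pagesize = 250,
  verbose = FALSE
)

total_rows
value_sum
```

This does not help with a file that holds a single large JSON array or object; such a file would first need to be converted to one record per line.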

Data acquisition using QuantTools with R

I am using the QuantTools package in R.
When get_finam_data() is used, how can I obtain a list of the symbols that can be downloaded?
You need to look in the package internals to get the list.
Just download data for an arbitrary symbol, so that the list is fetched from the Finam server and saved for later use.
Keep in mind that this is not documented, so it may change in future versions.
get_finam_data( 'GAZP', Sys.Date() )
QuantTools:::finam_downloader_env$instruments_info
I suppose there is no way to get it from the QuantTools package. You can get it from https://www.finam.ru/profile/moex-akcii/mosenrg/export/?market=1 by hand or use external web sources.

How to write an effective loop to access datasets inside 1000's of h5 files in R [duplicate]

I have a file in hdf5 format. I know that it is supposed to be a matrix, but I want to read that matrix in R so that I can study it. I see that there is an h5r package that is supposed to help with this, but I do not see any tutorial that is simple to read and understand. Is such a tutorial available online? Specifically, how do you read an hdf5 object with this package, and how do you actually extract the matrix?
UPDATE
I found a package, rhdf5, which is not part of CRAN but is part of Bioconductor. The interface is relatively easy to understand, and the documentation and example code are quite clear. I could use it without problems. My problem, it seems, was the input file. The matrix that I wanted to read was actually stored in the hdf5 file as a Python pickle, so every time I tried to open it and access it through R I got a segmentation fault. I did figure out how to save the matrix from within Python as a tsv file, and now that problem is solved.
The rhdf5 package works really well, although it is not on CRAN. Install it from Bioconductor:
# as of 2020-09-08, these are the updated instructions per
# https://bioconductor.org/install/
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# install rhdf5 itself from Bioconductor
BiocManager::install("rhdf5")
And to use it:
library(rhdf5)
List the objects within the file to find the data group you want to read:
h5ls("path/to/file.h5")
Read the HDF5 data:
mydata <- h5read("path/to/file.h5", "/mygroup/mydata")
And inspect the structure:
str(mydata)
(Note that multidimensional arrays may appear transposed). Also you can read groups, which will be named lists in R.
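When the same dataset has to be read from many files, the h5read() call can be wrapped in lapply(). This is a sketch only: it assumes rhdf5 is installed and that every file stores its matrix under the same name, and it first creates three small files to stand in for the real ones. The dataset name "mydata" and the directory are hypothetical.

```r
library(rhdf5)

# Create three small HDF5 files to stand in for the real ones (hypothetical data)
h5_dir <- tempfile("h5demo")
dir.create(h5_dir)
for (i in 1:3) {
  f <- file.path(h5_dir, sprintf("file%d.h5", i))
  h5createFile(f)
  h5write(matrix(i, nrow = 2, ncol = 3), f, "mydata")
}

# Read the same dataset from every file into a named list
files <- list.files(h5_dir, pattern = "\\.h5$", full.names = TRUE)
datasets <- lapply(files, function(f) h5read(f, "mydata"))
names(datasets) <- basename(files)

# Release any HDF5 handles that are still open
h5closeAll()

str(datasets)
```

For thousands of files it is usually better to reduce each dataset inside the function (for example, to a summary or a subset) than to keep every full matrix in the list.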
You could also use h5, a package which I recently published on CRAN.
Compared to rhdf5 it has the following features:
S4 object model to directly interact with HDF5 objects like files, groups, datasets and attributes.
Simpler syntax, implemented R-like subsetting operators for datasets supporting commands like
readdata <- dataset[1:3, 1:3]
dataset[1:3, 1:3] <- matrix(1:9, nrow = 3)
Supported NA values for all data types
200+ Test cases with a code coverage of 80%+.
To save a matrix you could use:
library(h5)
testmat <- matrix(rnorm(120), ncol = 3)
# Create HDF5 File
file <- h5file("test.h5")
# Save matrix to file in group 'testgroup' and datasetname 'testmat'
file["testgroup", "testmat"] <- testmat
# Close file
h5close(file)
... and read the entire matrix back into R:
file <- h5file("test.h5")
testmat_in <- file["testgroup", "testmat"][]
h5close(file)
See also h5 on
CRAN: http://cran.r-project.org/web/packages/h5/index.html
Github: https://github.com/mannau/h5
I used the rgdal package to read HDF5 files. Be aware that the binary version of rgdal probably does not support hdf5. In that case, you need to build gdal from source with HDF5 support before building rgdal from source.
Alternatively, try converting the files from hdf5 to netcdf. Once they are in netcdf, you can use the excellent ncdf package to access the data. I think the conversion could be done with the cdo tool.
The ncdf4 package, an interface to netCDF-4, can also be used to read hdf5 files (netCDF-4 is compatible with netCDF-3, but it uses hdf5 as the storage layer).
In the developer's words:
the HDF group says:
NetCDF-4 combines the netCDF-3 and HDF5 data models, taking the desirable characteristics of each, while taking advantage of their separate strengths
Unidata says:
The netCDF-4 format implements and expands the netCDF-3 data model by using an enhanced version of HDF5 as the storage layer.
In practice, ncdf4 provides a simple interface, and migrating code from using older hdf5 and ncdf packages to a single ncdf4 package has made our code less buggy and easier to write (some of my trials and workarounds are documented in my previous answer).

How to extract variable names from a netCDF file in R?

I am writing a function in R to extract some air quality modelling data from netCDF files. I have the Package "ncdf" installed.
To allow other users or myself to choose which variables to extract from a netCDF file, I would like to get the names of all variables in the file, so that I can present them in a simple list rather than calling print.ncdf() on the file, which gives too much information. Is there any way of doing it?
I tried applying unlist() to the var field of the ncdf object, but it seemed to return the contents as well...
I googled and searched Stack Overflow but didn't seem to find an answer, so your help is very much appreciated.
Many thanks in advance.
If your ncdf object is called nc, then quite simply:
names(nc$var)
For example, using an example dataset (since you didn't provide one):
nc <- open.ncdf("20130128-ABOM-L4HRfnd-AUS-v01-fv01_0-RAMSSA_09km.nc")
names(nc$var)
[1] "analysed_sst" "analysis_error" "sea_ice_fraction" "mask"
It is now 2016, and the ncdf package is deprecated.
The equivalent of SE user plannapus' answer, written for ncdf4, is now:
library(ncdf4)
netcdf.file <- "flux.nc"
nc = ncdf4::nc_open(netcdf.file)
variables = names(nc[['var']])
#print(nc)
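The same approach can be tried end to end without an existing file. The sketch below assumes ncdf4 is installed; it builds a tiny netCDF file first, and the variable names, units and dimension are invented for illustration.

```r
library(ncdf4)

# Build a small netCDF file to stand in for real model output (hypothetical names)
path <- tempfile(fileext = ".nc")
time_dim <- ncdim_def("time", units = "hours", vals = 1:4)
vars <- list(
  ncvar_def("no2", units = "ug/m3", dim = time_dim),
  ncvar_def("pm10", units = "ug/m3", dim = time_dim)
)
nc_close(nc_create(path, vars))

# Open the file and list the variables without printing the full header
nc <- nc_open(path)
var_names <- names(nc$var)                         # variable names only
var_units <- sapply(nc$var, function(v) v$units)   # one attribute per variable
nc_close(nc)

var_names
var_units
```

Pulling a single field such as units out of each entry of nc$var gives a compact overview to show users before they pick what to extract.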
A note from the documentation:
Package: ncdf
Title: Interface to Unidata netCDF Data Files
Maintainer: Brian Ripley <ripley#stats.ox.ac.uk>
Version: 1.6.9
Author: David Pierce <dpierce#ucsd.edu>
Description: This is deprecated and will be removed
from CRAN in early 2016: use 'RNetCDF' or 'ncdf4' instead.
Newer package "ncdf4" is designed to work with the netcdf library
version 4, and supports features such as compression and
chunking. Unfortunately, for various reasons the ncdf4 package must have
a different API than the ncdf package.
A note from the home page of the maintainer:
Package ncdf4 -- use this for new code
The "ncdf4" package is designed to work with the netcdf library, version 4.
It includes the ability to use compression and chunking,
which seem to be some of the most anticipated benefits of the version 4
library. Note that the API of ncdf4 has to be different
from the API of ncdf, unfortunately. New code should use ncdf4, not ncdf.
http://cirrus.ucsd.edu/~pierce/ncdf/
