Parallel IO in R with rhdf5

I have a large amount of data in R data frames that I would like to write to an HDF5 file in parallel. In my initial experiments, the HDF5 file gets corrupted, presumably because parallel IO is not enabled in rhdf5. Is it possible to do parallel IO from R with rhdf5?
Here is an example of what I am trying to do:
library(parallel)
library(Rmpi)
library(rhdf5)
nwrites = 10000
db_name = 'testing_parallel_io.h5'
if (file.exists(db_name)) unlink(db_name)
h5createFile(db_name)
write_data = function(index, db_name){
  suppressPackageStartupMessages(require(rhdf5))
  nr = 1000
  nc = 10
  df = as.data.frame(matrix(rnorm(nr*nc), nr, nc))
  group_name = sprintf('group%05d', index)
  dataset_name = sprintf('%s/A', group_name)
  h5createGroup(db_name, group_name)
  h5write(df, db_name, dataset_name)
  return(0)
}
cl = makeCluster(detectCores(), type='MPI')
res = parSapply(cl, 1:nwrites, write_data, db_name)
stopCluster(cl)
mpi.quit()
When I run this I get all sorts of errors from hdf5 like this:
HDF5-DIAG: Error detected in HDF5 (1.8.7) thread 0:
  #000: H5D.c line 170 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 431 in H5D_create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1640 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1884 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: H5Gtraverse.c line 905 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 799 in H5G_traverse_real(): component not found
    major: Symbol table
    minor: Object not found

Related

Cannot write int16 data type using R's rhdf5 package

In R I would like to write a matrix of integers into an HDF5 file ".h5" as an int16 data type. To do so I am using the rhdf5 package. The documentation says that you should set one of the supported H5 data types when creating the dataset. However, even when setting up the int16 data type the result is always int32. Is it possible to store the data as int16 or uint16?
library(rhdf5)
m <- matrix(1,5,5)
outFile <- "test.h5"
h5createFile(outFile)
h5createDataset(file=outFile, "m", dims=dim(m), H5type="H5T_NATIVE_INT16")
h5write(m, file=outFile, name="m")
H5close()
h5ls(outFile)
The result always shows an int32 datatype.
Using another library (hdf5r), as I did not find a way to do this with rhdf5:
library(hdf5r)
m <- matrix(1L,5L,5L)
outFile <- h5file("test.h5")
createDataSet(outFile, "m", m, dtype=h5types$H5T_NATIVE_INT16)
print(outFile)
print(outFile[["m"]])
h5close(outFile)
For the first print (the file):
Class: H5File
Filename: D:\Travail\Projets\SWM\swm.gps\test.h5
Access type: H5F_ACC_RDWR
Listing:
name obj_type dataset.dims dataset.type_class
m H5I_DATASET 5 x 5 H5T_INTEGER
Here we see it displays H5T_INTEGER as the datatype for the dataset m
and for the second (the dataset):
Class: H5D
Dataset: /m
Filename: D:\Travail\Projets\SWM\swm.gps\test.h5
Access type: H5F_ACC_RDWR
Datatype: H5T_STD_I16LE
Space: Type=Simple Dims=5 x 5 Maxdims=Inf x Inf
Chunk: 64 x 64
We can see that it has the right datatype H5T_STD_I16LE
The code you provided works as expected; it's a limitation of the h5ls() function in rhdf5 that it doesn't report a more detailed data type. As @r2evans points out, it's technically true that it's an integer, you just want to know a bit more detail than that.
If we run your code and use the h5ls command-line tool distributed by the HDF5 Group, we get more information:
library(rhdf5)
m <- matrix(1,5,5)
outFile <- tempfile(fileext = ".h5")
h5createFile(outFile)
h5createDataset(file=outFile, "m", dims=dim(m), H5type="H5T_NATIVE_INT16")
h5write(m, file=outFile, name="m")
system2("h5ls", args = list("-v", outFile))
## Opened "/tmp/RtmpFclmR3/file299e79c4c206.h5" with sec2 driver.
## m                        Dataset {5/5, 5/5}
##     Attribute: rhdf5-NA.OK {1}
##         Type:      native int
##     Location:  1:800
##     Links:     1
##     Chunks:    {5, 5} 50 bytes
##     Storage:   50 logical bytes, 14 allocated bytes, 357.14% utilization
##     Filter-0:  shuffle-2 OPT {2}
##     Filter-1:  deflate-1 OPT {6}
##     Type:      native short
Here the most important part is the final line, which confirms the datatype is "native short", a.k.a. native int16.

NetCDF - What are 'phony_dim_0', 'phony_dim_1', 'phony_dim_2'?

I am very new to using NetCDF files and I am at the exploratory stage, trying to understand what this file contains. I am using the netCDF4 Python library. I am trying to find out what 'phony_dim_0', 'phony_dim_1', and 'phony_dim_2' mean and contain. I am thinking they could be 'lat', 'lon', and/or 'time'?
Loading nc file
ds = nc.Dataset('my_file.nc')
type(ds)
>>> netCDF4._netCDF4.Dataset
Printing Keys
print(ds.dimensions.keys())
>>> dict_keys(['phony_dim_0', 'phony_dim_1', 'phony_dim_2'])
Extract what is in this key?
ds.dimensions['phony_dim_0']
>>> <class 'netCDF4._netCDF4.Dimension'>: name = 'phony_dim_0', size = 1179
Error:
for c in ds.dimensions['phony_dim_0']:
    print(c)  # Want to see what is in this?
# Errors: TypeError: 'netCDF4._netCDF4.Dimension' object is not iterable

Unable to create a webshot in R [TypeError: undefined is not an object]

I would like to create a webshot of a flexdashboard html file. I managed to create the dashboard and generate its html file. However, I am not able to create a png webshot, as it returns an error.
This is the code I am using:
rmarkdown::render(paste0(script.basedir, "dashboard.RMD"),
                  output_file = name_html,
                  output_dir = '.')
webshot(name_html,
        file = name_png,
        delay = 15,
        vwidth = 1920,
        vheight = 1080)
This is the error I get:
PHANTOM ERROR: TypeError: undefined is not an object (evaluating 'url.toLowerCase')
TRACE:
-> phantomjs://platform/utils.js: 112 (in function cleanUrl)
-> phantomjs://platform/casper.js: 1460 (in function open)
-> phantomjs://platform/casper.js: 1949 (in function _step)
-> phantomjs://platform/casper.js: 1615 (in function runStep)
-> phantomjs://platform/casper.js: 406 (in function checkStep)
Error in webshot(name_html, file = name_png, delay = 15, vwidth = 1920, :
webshot.js returned failure value: 1
Could anybody help me with this error, please?
Thanks for any suggestions!

Error running Python code from R with the rPithon package

I would like to run this Python code from R:
>>> import nlmpy
>>> nlm = nlmpy.mpd(nRow=50, nCol=50, h=0.75)
>>> nlmpy.exportASCIIGrid("raster.asc", nlm)
nlmpy is a Python package for building neutral landscape models; the example comes from the package's website.
To run this Python code from R, I'm trying to use the rPithon package. However, I obtain this error message:
if (pithon.available())
{
    nRow <- 50
    nCol <- 50
    h <- 0.75
    # nlmpy.py contains the definition of the mpd function
    pithon.load("C:/Users/Anaconda2/Lib/site-packages/nlmpy/nlmpy.py")
    pithon.call("mpd", nRow, nCol, h)
} else {
    print("Unable to execute python")
}
Error in pithon.get("_r_call_return", instance.name = instname) :
  Couldn't retrieve variable: Traceback (most recent call last):
  File "C:/Users/Documents/R/win-library/3.3/rPithon/pythonwrapperscript.py", line 110, in <module>
    reallyReallyLongAndUnnecessaryPrefix.data = json.dumps([eval(reallyReallyLongAndUnnecessaryPrefix.argData)])
  File "C:\Users\ANACON~1\lib\json\__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\ANACON~1\lib\json\encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\ANACON~1\lib\json\encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\ANACON~1\lib\json\encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: array([[ 0.36534654,  0.31962481,  0.44229946, ...,  0.11513079,
        0.07156331,  0.00286971],
       [ 0.41534291,  0.41333479,  0.48118995, ...,  0.19203674,
        0.04192771,  0.03679473],
       [ 0.5188
Is this error caused by a syntax issue in my code? I work with the Anaconda 4.2.0 platform for Windows, which uses Python 2.7.
I haven't used the nlmpy package, hence I am not sure what your expected output would be. However, this code successfully communicates between R and Python.
There are two files:
nlmpyInR.R
command ="python"
path2script="path_to_your_pythoncode/nlmpyInPython.py"
nRow <-50
nCol <-50
h <- 0.75
# Build up args in a vector
args = c(nRow, nCol, h)
# Add path to script as first arg
allArgs = c(path2script, args)
Routput = system2(command, args=allArgs, stdout=TRUE)
#The command would be python nlmpyInPython.py 50 50 0.75
print(paste("The Output is:\n", Routput))
nlmpyInPython.py
import sys
import nlmpy

# Get the arguments from the command-line call;
# sys.argv values are strings, so convert them to numbers
nRow = int(sys.argv[1])
nCol = int(sys.argv[2])
h = float(sys.argv[3])

nlm = nlmpy.mpd(nRow, nCol, h)
pythonOutput = nlmpy.exportASCIIGrid("raster.asc", nlm)
# Whatever you print will get stored in R's output variable
print(pythonOutput)
The cause of the error you're getting is hinted at by the "is not JSON serializable" line. Your R code calls the mpd function with certain arguments, and that function itself will execute correctly. The rPithon library will then try to send the return value of the function back to R, and to do this it will try to create a JSON object that describes the return value.
This works well for integers, floating point values, arrays, etc., but not every kind of Python object can be converted to such a JSON representation. Because rPithon can't convert the return value of mpd this way, an error is generated.
You can still use rPithon to call the mpd function though. The following code creates a new Python function that performs two steps: first it calls the mpd function with the specified parameters, and then it exports the result to a file, whose filename is also an argument. Using rPithon, the new function is then called from R. Because myFunction doesn't return anything, representing the return value in JSON format will not be a problem.
library("rPithon")
pythonCode = paste("import nlmpy.nlmpy as nlmpy",
                   "",
                   "def myFunction(nRow, nCol, h, fileName):",
                   "    nlm = nlmpy.mpd(nRow, nCol, h)",
                   "    nlmpy.exportASCIIGrid(fileName, nlm)",
                   sep = "\n")
pithon.exec(pythonCode)
nRow <- 50
nCol <- 50
h <- 0.75
pithon.call("myFunction", nRow, nCol, h, "outputraster.asc")
Here, the Python code is defined as an R string and executed using pithon.exec. You could also put that Python code in a separate file and use pithon.load to process the code so that the myFunction function is known, as in the sketch below.
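A minimal sketch of that alternative (assuming the function definition above is saved to a file, here hypothetically named myFunction.py):
library("rPithon")
# myFunction.py is assumed to contain the same def myFunction(nRow, nCol, h, fileName) as above
pithon.load("myFunction.py")
pithon.call("myFunction", 50, 50, 0.75, "outputraster.asc")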

TensorFlow: how to insert a custom input into an existing graph?

I have downloaded a TensorFlow GraphDef that implements a VGG16 ConvNet, which I use like this:
Pl['images'] = tf.placeholder(tf.float32,
                              [None, 448, 448, 3],
                              name="images")  # batch x width x height x channels

with open("tensorflow-vgg16/vgg16.tfmodel", mode='rb') as f:
    fileContent = f.read()

graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
tf.import_graph_def(graph_def, input_map={"images": Pl['images']})
Besides, I have image features that are homogeneous to (i.e. have the same shape as) the output of "import/pool5/".
How can I tell my graph that I don't want to use its input "images", but the tensor "import/pool5/" as input instead?
Thanks!
EDIT
OK I realize I haven't been very clear. Here is the situation:
I am trying to use this implementation of ROI pooling, using a pre-trained VGG16, which I have in the GraphDef format. So here is what I do:
First of all, I load the model:
tf.reset_default_graph()
with open("tensorflow-vgg16/vgg16.tfmodel", mode='rb') as f:
    fileContent = f.read()

graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
graph = tf.get_default_graph()
Then, I create my placeholders:
images = tf.placeholder(tf.float32,
                        [None, 448, 448, 3],
                        name="images")  # batch x width x height x channels
boxes = tf.placeholder(tf.float32,
                       [None, 5],  # 5 = [batch_id, x1, y1, x2, y2]
                       name="boxes")
And I define the output of the first part of the graph to be conv5_3/Relu:
tf.import_graph_def(graph_def,
                    input_map={'images': images})
out_tensor = graph.get_tensor_by_name("import/conv5_3/Relu:0")
So out_tensor has shape [None, 14, 14, 512].
Then, I do the ROI pooling:
[out_pool, argmax] = module.roi_pool(out_tensor,
                                     boxes,
                                     7, 7, 1.0/1)
With out_pool.shape = N_Boxes_in_batch x 7 x 7 x 512, which is homogeneous to pool5. I would then like to feed out_pool as an input to the op that comes just after pool5, so it would look like:
tf.import_graph_def(graph.as_graph_def(),
                    input_map={'import/pool5': out_pool})
But it doesn't work; I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-89-527398d7344b> in <module>()
5
6 tf.import_graph_def(graph.as_graph_def(),
----> 7 input_map={'import/pool5':out_pool})
8
9 final_out = graph.get_tensor_by_name("import/Relu_1:0")
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/importer.py in import_graph_def(graph_def, input_map, return_elements, name, op_dict)
333 # NOTE(mrry): If the graph contains a cycle, the full shape information
334 # may not be available for this op's inputs.
--> 335 ops.set_shapes_for_outputs(op)
336
337 # Apply device functions for this op.
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py in set_shapes_for_outputs(op)
1610 raise RuntimeError("No shape function registered for standard op: %s"
1611 % op.type)
-> 1612 shapes = shape_func(op)
1613 if len(op.outputs) != len(shapes):
1614 raise RuntimeError(
/home/hbenyounes/vqa/roi_pooling_op_grad.py in _roi_pool_shape(op)
13 channels = dims_data[3]
14 print(op.inputs[1].name, op.inputs[1].get_shape())
---> 15 dims_rois = op.inputs[1].get_shape().as_list()
16 num_rois = dims_rois[0]
17
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_shape.py in as_list(self)
745 A list of integers or None for each dimension.
746 """
--> 747 return [dim.value for dim in self._dims]
748
749 def as_proto(self):
TypeError: 'NoneType' object is not iterable
Any clue?
It is usually very convenient to use tf.train.export_meta_graph to store the whole MetaGraph. Then, upon restoring, you can use tf.train.import_meta_graph, because it turns out that it passes all additional arguments to the underlying import_scoped_meta_graph, which has the input_map argument and uses it when it gets to its own invocation of import_graph_def.
It is not documented, and took me waaaay toooo much time to find it, but it works!
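A minimal sketch of that approach, assuming a TF1-style runtime; the toy model, the file name my_model.meta, and the tensor name 'images:0' are hypothetical stand-ins for your own graph and input:
import tensorflow as tf

# Build a toy graph and export the whole MetaGraph to disk
images = tf.placeholder(tf.float32, [None, 448, 448, 3], name='images')
out = tf.reduce_mean(images, name='out')  # stand-in for the real network
tf.train.export_meta_graph('my_model.meta')

# Restore it later, remapping the original input to a new tensor;
# import_meta_graph forwards input_map down to import_graph_def
tf.reset_default_graph()
my_input = tf.placeholder(tf.float32, [None, 448, 448, 3], name='my_input')
tf.train.import_meta_graph('my_model.meta', input_map={'images:0': my_input})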
What I would do is something along those lines:
- First retrieve the names of the tensors representing the weights and biases of the 3 fully connected layers coming after pool5 in VGG16. To do that I would inspect [n.name for n in graph.as_graph_def().node]. (They probably look something like import/locali/weight:0, import/locali/bias:0, etc.)
- Put them in a Python list:
weights_names = ["import/local1/weight:0", "import/local2/weight:0", "import/local3/weight:0"]
biases_names = ["import/local1/bias:0", "import/local2/bias:0", "import/local3/bias:0"]
- Define a function that looks something like:
def pool5_tofcX(input_tensor, layer_number=3):
    flatten = tf.reshape(input_tensor, (-1, 7*7*512))
    tmp = flatten
    for i in range(layer_number):
        tmp = tf.matmul(tmp, graph.get_tensor_by_name(weights_names[i]))
        tmp = tf.nn.bias_add(tmp, graph.get_tensor_by_name(biases_names[i]))
        tmp = tf.nn.relu(tmp)
    return tmp
Then define the tensor using the function:
wanted_output = pool5_tofcX(out_pool)
Then you are done!
Jonan Georgiev provided an excellent answer here. The same approach was also described with little fanfare at the end of this git issue: https://github.com/tensorflow/tensorflow/issues/3389
Below is a copy/paste runnable example of using this approach to switch out a placeholder for a tf.data.Dataset get_next tensor.
import tensorflow as tf
my_placeholder = tf.placeholder(dtype=tf.float32, shape=1, name='my_placeholder')
my_op = tf.square(my_placeholder, name='my_op')
# Save the graph to memory
graph_def = tf.get_default_graph().as_graph_def()
print('----- my_op before any remapping -----')
print([n for n in graph_def.node if n.name == 'my_op'])
tf.reset_default_graph()
ds = tf.data.Dataset.from_tensors(1.0)
next_tensor = tf.data.make_one_shot_iterator(ds).get_next(name='my_next_tensor')
# Restore the graph with a custom input mapping
tf.graph_util.import_graph_def(graph_def, input_map={'my_placeholder': next_tensor}, name='')
print('----- my_op after remapping -----')
print([n for n in tf.get_default_graph().as_graph_def().node if n.name == 'my_op'])
Output, where we can clearly see that the input to the square operation has changed.
----- my_op before any remapping -----
[name: "my_op"
op: "Square"
input: "my_placeholder"
attr {
  key: "T"
  value {
    type: DT_FLOAT
  }
}
]
----- my_op after remapping -----
[name: "my_op"
op: "Square"
input: "my_next_tensor"
attr {
  key: "T"
  value {
    type: DT_FLOAT
  }
}
]
