Cannot write int16 data type using R's rhdf5 package

In R I would like to write a matrix of integers into an HDF5 (".h5") file as an int16 data type. To do so I am using the rhdf5 package. The documentation says that you should set one of the supported H5 data types when creating the dataset. However, even when I set the int16 data type, the result is always int32. Is it possible to store the data as int16 or uint16?
library(rhdf5)
m <- matrix(1, 5, 5)
outFile <- "test.h5"
h5createFile(outFile)
h5createDataset(file = outFile, "m", dims = dim(m), H5type = "H5T_NATIVE_INT16")
h5write(m, file = outFile, name = "m")
H5close()
h5ls(outFile)
The result is:

Using another library, hdf5r, since I did not find a way to do this with rhdf5:
library(hdf5r)
m <- matrix(1L,5L,5L)
outFile <- h5file("test.h5")
createDataSet(outFile, "m", m, dtype=h5types$H5T_NATIVE_INT16)
print(outFile)
print(outFile[["m"]])
h5close(outFile)
For the first print (the file):
Class: H5File
Filename: D:\Travail\Projets\SWM\swm.gps\test.h5
Access type: H5F_ACC_RDWR
Listing:
name obj_type dataset.dims dataset.type_class
m H5I_DATASET 5 x 5 H5T_INTEGER
Here we see it displays H5T_INTEGER as the datatype for the dataset m.
And for the second print (the dataset):
Class: H5D
Dataset: /m
Filename: D:\Travail\Projets\SWM\swm.gps\test.h5
Access type: H5F_ACC_RDWR
Datatype: H5T_STD_I16LE
Space: Type=Simple Dims=5 x 5 Maxdims=Inf x Inf
Chunk: 64 x 64
We can see that it has the right datatype, H5T_STD_I16LE.

The code you provided works as expected; it's a limitation of the h5ls() function in rhdf5 that it doesn't report a more detailed data type. As @r2evans points out, it's technically true that it's an integer; you just want a bit more detail than that.
If we run your code and use the h5ls tool distributed by the HDF5 Group, we get more information:
library(rhdf5)
m <- matrix(1, 5, 5)
outFile <- tempfile(fileext = ".h5")
h5createFile(outFile)
h5createDataset(file = outFile, "m", dims = dim(m), H5type = "H5T_NATIVE_INT16")
h5write(m, file = outFile, name = "m")
system2("h5ls", args = list("-v", outFile))
## Opened "/tmp/RtmpFclmR3/file299e79c4c206.h5" with sec2 driver.
## m Dataset {5/5, 5/5}
## Attribute: rhdf5-NA.OK {1}
## Type: native int
## Location: 1:800
## Links: 1
## Chunks: {5, 5} 50 bytes
## Storage: 50 logical bytes, 14 allocated bytes, 357.14% utilization
## Filter-0: shuffle-2 OPT {2}
## Filter-1: deflate-1 OPT {6}
## Type: native short
Here the most important part is the final line, which confirms the datatype is "native short", a.k.a. native int16.
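If you want to confirm this without leaving R, one option is rhdf5's low-level interface. This is a minimal sketch, assuming your version of rhdf5 exposes the H5Dget_type()/H5Tget_size() bindings (check the package reference manual):
library(rhdf5)
fid <- H5Fopen(outFile)   # the file created above
did <- H5Dopen(fid, "m")
tid <- H5Dget_type(did)   # on-disk datatype of the dataset
H5Tget_size(tid)          # expect 2 bytes for a 16-bit integer
H5Dclose(did)
H5Fclose(fid)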

Copy a huge file with Julia Mmap

I have a big file (75 GB) memory-mapped in an array d that I want to copy into another array m. Because I do not have 75 GB of RAM available, I did:
for (i,v) in enumerate(d)
    m[i] = v
end
in order to copy the file value by value. But I get a copy rate of ~2 MB/s on an SSD, where I expect at least 50 MB/s for both read and write.
How could I optimize this copy rate?
=== [edit] ===
Following the comments, I changed my code to the following, which sped up the write rate to 15 MB/s:
function copydcimg(m::Array{UInt16,4}, d::Dcimg)
    m .= d
    Mmap.sync!(m)
end
copydcimg(m,d)
At this point, I think I should optimize the Dcimg code. This binary file is made of frames spaced by a timestamp. Here is the code I use to access the frames:
module dcimg

using Mmap
using TOML

struct Dcimg <: AbstractArray{UInt16,4} # struct allowing access to a dcimg file
    filename::String # filename of the dcimg
    header::Int      # header size in bytes
    clock::Int       # clock size in bytes
    x::Int
    y::Int
    z::Int
    t::Int
    m                # linear memory map
    Dcimg(filename, header, clock, x, y, z, t) =
        new(filename, header, clock, x, y, z, t,
            Mmap.mmap(open(filename), Array{UInt16,3},
                      (x*y + clock÷sizeof(UInt16), z, t), header)
        )
end

# the following functions allow a Dcimg to be accessed like an Array
Base.size(D::Dcimg) = (D.x, D.y, D.z, D.t)
# skip the clock
Base.getindex(D::Dcimg, i::Int) =
    D.m[i + (i ÷ (D.x*D.y))*D.clock÷sizeof(UInt16)]
Base.getindex(D::Dcimg, x::Int, y::Int, z::Int, t::Int) =
    D[x + D.x*((y-1) + D.y*((z-1) + D.z*(t-1)))]

# allows the size to be parsed automatically
function Dcimg(pathtag)
    p = TOML.parsefile(pathtag * ".toml")
    return Dcimg(pathtag * ".dcimg",
        # ...
    )
end

export Dcimg, getframe

end
I got it! The solution was to copy the file chunk by chunk, let's say frame by frame (around 1024×720 UInt16 values). This way I reached 300 MB/s, which I didn't even know was possible in a single thread. Here is the code.
In module dcimg, I added methods to access the file frame by frame:
# get frame number n (starting from 1)
getframe(D::Dcimg, n::Int) =
    reshape(D.m[
        D.x*D.y*(n-1)+1 + (n-1)*D.clock÷sizeof(UInt16) : # cosmetic line break
        D.x*D.y*n + (n-1)*D.clock÷sizeof(UInt16)
    ], D.x, D.y)
# get the frame for layer z, time t (starting from 1); since frames are
# numbered from 1, the frame number is z + D.z*(t-1)
getframe(D::Dcimg, z::Int, t::Int) =
    getframe(D, z + D.z*(t-1))
Then I iterate over the frames within a loop:
function copyframes(m::Array{UInt16,4}, d::Dcimg)
    N = d.z*d.t
    F = d.x*d.y
    for i in 1:N
        m[(i-1)*F+1 : i*F] = getframe(d, i)
    end
end
copyframes(m,d)
Thanks all in comments for leading me to this.
===== edit =====
for further reading, you might look at:
dd: How to calculate optimal blocksize?
http://blog.tdg5.com/tuning-dd-block-size/
which give hints about the optimal block size to copy at a time.
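If you want to experiment with the block size yourself, here is a minimal sketch (current Julia syntax; plain in-RAM arrays stand in for the memory maps, and all names are mine, not from the post):
# compare copy throughput for a few chunk sizes
function copy_chunked!(dst::AbstractVector, src::AbstractVector, chunk::Int)
    n = length(src)
    for lo in 1:chunk:n
        hi = min(lo + chunk - 1, n)
        @views dst[lo:hi] .= src[lo:hi] # one bulk copy per chunk
    end
    return dst
end

src = rand(UInt16, 2^24)
dst = similar(src)
for chunk in (2^10, 2^16, 2^20)
    t = @elapsed copy_chunked!(dst, src, chunk)
    println("chunk=$chunk: ", sizeof(src) / t / 2^20, " MiB/s")
end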

In Lua, how to insert numbers as 32 bits to front of a binary sequence?

I was new to Lua when I began to use OpenResty, and I want to output an image and its x,y coordinates together as one binary sequence to the clients, which looks like: x_int32_bits y_int32_bits image_raw_data. At the client, I know the first 32 bits are x, the second 32 are y, and the rest is the image raw data. I have some questions:
How do I convert a number to 32 binary bits in Lua?
How do I merge two 32-bit values into one 64-bit sequence?
How do I insert the 64 bits at the front of the image raw data? And what is the fastest way?
file:read("*a") returns a string; is that string a raw byte sequence, or something like "000001110000001..."?
What I'm thinking of is shown below; I don't know how to convert 32 bits into the same string format as the file:read("*a") result.
@EgorSkriptunoff thank you, you opened a window for me. I wrote some new code; would you take a look? I also have another question: is the string merge method .. inefficient and expensive, especially when one of the strings is very large? Is there an alternative way to merge the byte strings? (See the sketch after the new code below.)
NEW CODE, WRITTEN UNDER @EgorSkriptunoff's GUIDANCE:
local ffi = require("ffi") -- LuaJIT FFI, available in OpenResty

function _M.number_to_int32_bytes(num)
    return ffi.string(ffi.new("int32_t[1]", num), 4)
end

local x, y = unpack(v)
local file, err = io.open(image_path, "rb")
if nil ~= file then
    local image_raw_data = file:read("*a")
    if nil == image_raw_data then
        ngx.log(ngx.ERR, "read file error:", err)
    else
        -- Is the .. method inefficient and expensive? Because the image raw data may be large,
        -- will .. copy all the data to a new string? Is there an alternative way to merge the byte strings?
        output = utils.number_to_int32_bytes(x) .. utils.number_to_int32_bytes(y) .. image_raw_data
        ngx.print(output)
        ngx.flush(true)
    end
    file:close()
end
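On the .. question: each .. over large strings builds a new intermediate string. A common workaround is to collect the pieces in a table instead; here is a minimal sketch (assuming the same x, y, image_raw_data and utils as above). Note that ffi.string over an int32_t emits the machine's native byte order, usually little-endian:
local parts = {
    utils.number_to_int32_bytes(x),
    utils.number_to_int32_bytes(y),
    image_raw_data,
}
ngx.print(parts)  -- ngx.print accepts a table of strings and joins it in C
-- outside OpenResty: local output = table.concat(parts)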
OLD CODE:
function table_merge(t1, t2)
    for k, v in ipairs(t2) do
        table.insert(t1, v)
    end
    return t1
end

function numToBits(num, bits)
    -- returns a table of bits
    local t = {} -- will contain the bits
    for b = bits, 1, -1 do
        rest = math.fmod(num, 2)
        t[b] = rest
        num = (num - rest) / 2
    end
    if num == 0 then return t else return {'Not enough bits to represent this number'} end
end

-- Need to insert x, y as 32 bits each in front of the image binary sequence
function output()
    local x, y = 1, 3
    local file, err = io.open("/storage/images/1.png", "rb")
    if nil ~= file then
        local d = file:read("*a") ------- type(d) is string, why?
        if nil == d then
            ngx.log(ngx.ERR, "read file error:", err)
        else
            -- WHAT I'M THINKING -----------------
            -- Convert x, y to bit tables, then merge them into one
            data = table_merge(numToBits(x, 32), numToBits(y, 32))
            -- Convert data from a bit table to a string
            data = convert_binary_table_to_string(data) -- HOW TO DO THAT? --------
            -- Insert the x, y data in front of the image data; is data .. d inefficient?
            data = data .. d
            -------------------------------------------
            ngx.print(data)
            ngx.flush(true)
        end
        file:close()
    end
end

Tensorflow: how to insert custom input to an existing graph?

I have downloaded a TensorFlow GraphDef that implements a VGG16 ConvNet, which I use like this:
Pl['images'] = tf.placeholder(tf.float32,
                              [None, 448, 448, 3],
                              name="images")  # batch x width x height x channels

with open("tensorflow-vgg16/vgg16.tfmodel", mode='rb') as f:
    fileContent = f.read()

graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
tf.import_graph_def(graph_def, input_map={"images": Pl['images']})
Besides, I have image features that have the same shape as the output of "import/pool5/".
How can I tell my graph that I don't want to use its input "images", but the tensor "import/pool5/" as input instead?
Thanks!
EDIT
OK I realize I haven't been very clear. Here is the situation:
I am trying to use this implementation of ROI pooling, using a pre-trained VGG16, which I have in the GraphDef format. So here is what I do:
First of all, I load the model:
tf.reset_default_graph()
with open("tensorflow-vgg16/vgg16.tfmodel", mode='rb') as f:
    fileContent = f.read()

graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
graph = tf.get_default_graph()
Then, I create my placeholders
images = tf.placeholder(tf.float32,
                        [None, 448, 448, 3],
                        name="images")  # batch x width x height x channels

boxes = tf.placeholder(tf.float32,
                       [None, 5],  # 5 = [batch_id, x1, y1, x2, y2]
                       name="boxes")
And I define the output of the first part of the graph to be conv5_3/Relu
tf.import_graph_def(graph_def,
                    input_map={'images': images})
out_tensor = graph.get_tensor_by_name("import/conv5_3/Relu:0")
So, out_tensor is of shape [None,14,14,512]
Then, I do the ROI pooling:
[out_pool, argmax] = module.roi_pool(out_tensor,
                                     boxes,
                                     7, 7, 1.0/1)
With out_pool.shape = N_Boxes_in_batch x 7 x 7 x 512, which matches the shape of pool5. I would then like to feed out_pool as input to the op that comes just after pool5, so it would look like:
tf.import_graph_def(graph.as_graph_def(),
                    input_map={'import/pool5': out_pool})
But it doesn't work; I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-89-527398d7344b> in <module>()
5
6 tf.import_graph_def(graph.as_graph_def(),
----> 7 input_map={'import/pool5':out_pool})
8
9 final_out = graph.get_tensor_by_name("import/Relu_1:0")
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/importer.py in import_graph_def(graph_def, input_map, return_elements, name, op_dict)
333 # NOTE(mrry): If the graph contains a cycle, the full shape information
334 # may not be available for this op's inputs.
--> 335 ops.set_shapes_for_outputs(op)
336
337 # Apply device functions for this op.
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py in set_shapes_for_outputs(op)
1610 raise RuntimeError("No shape function registered for standard op: %s"
1611 % op.type)
-> 1612 shapes = shape_func(op)
1613 if len(op.outputs) != len(shapes):
1614 raise RuntimeError(
/home/hbenyounes/vqa/roi_pooling_op_grad.py in _roi_pool_shape(op)
13 channels = dims_data[3]
14 print(op.inputs[1].name, op.inputs[1].get_shape())
---> 15 dims_rois = op.inputs[1].get_shape().as_list()
16 num_rois = dims_rois[0]
17
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_shape.py in as_list(self)
745 A list of integers or None for each dimension.
746 """
--> 747 return [dim.value for dim in self._dims]
748
749 def as_proto(self):
TypeError: 'NoneType' object is not iterable
Any clue?
It is usually very convenient to use tf.train.export_meta_graph to store the whole MetaGraph. Then, upon restoring, you can use tf.train.import_meta_graph, because it turns out that it passes all additional arguments to the underlying import_scoped_meta_graph, which has the input_map argument and utilizes it when it gets to its own invocation of import_graph_def.
It is not documented, and took me waaaay toooo much time to find it, but it works!
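Here is a minimal sketch of that approach (TF1-style API; the graph and the tensor names are mine, not from the question): export the MetaGraph, then re-import it with an input_map that swaps in your own tensor.
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 7, 7, 512], name="pool5_in")
y = tf.identity(x, name="head")

# Save the whole MetaGraph (graph structure plus collections).
tf.train.export_meta_graph(filename="model.meta")

tf.reset_default_graph()
my_features = tf.zeros([1, 7, 7, 512], name="my_features")

# import_meta_graph forwards input_map down to import_graph_def,
# so my_features gets wired in wherever "pool5_in" was consumed.
tf.train.import_meta_graph("model.meta", input_map={"pool5_in": my_features})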
What I would do is something along these lines:
-First, retrieve the names of the tensors representing the weights and biases of the 3 fully connected layers coming after pool5 in VGG16.
To do that, I would inspect [n.name for n in graph.as_graph_def().node].
(They probably look something like import/locali/weight:0, import/locali/bias:0, etc.)
-Put them in a Python list:
weights_names = ["import/local1/weight:0", "import/local2/weight:0", "import/local3/weight:0"]
biases_names = ["import/local1/bias:0", "import/local2/bias:0", "import/local3/bias:0"]
-Define a function that looks something like:
def pool5_tofcX(input_tensor, layer_number=3):
    flatten = tf.reshape(input_tensor, (-1, 7*7*512))
    tmp = flatten
    for i in range(layer_number):
        tmp = tf.matmul(tmp, graph.get_tensor_by_name(weights_names[i]))
        tmp = tf.nn.bias_add(tmp, graph.get_tensor_by_name(biases_names[i]))
        tmp = tf.nn.relu(tmp)
    return tmp
Then define the tensor using the function:
wanted_output=pool5_tofcX(out_pool)
Then you are done!
Jonan Georgiev provided an excellent answer here. The same approach was also described with little fanfare at the end of this git issue: https://github.com/tensorflow/tensorflow/issues/3389
Below is a copy/paste runnable example of using this approach to switch out a placeholder for a tf.data.Dataset get_next tensor.
import tensorflow as tf
my_placeholder = tf.placeholder(dtype=tf.float32, shape=1, name='my_placeholder')
my_op = tf.square(my_placeholder, name='my_op')
# Save the graph to memory
graph_def = tf.get_default_graph().as_graph_def()
print('----- my_op before any remapping -----')
print([n for n in graph_def.node if n.name == 'my_op'])
tf.reset_default_graph()
ds = tf.data.Dataset.from_tensors(1.0)
next_tensor = tf.data.make_one_shot_iterator(ds).get_next(name='my_next_tensor')
# Restore the graph with a custom input mapping
tf.graph_util.import_graph_def(graph_def, input_map={'my_placeholder': next_tensor}, name='')
print('----- my_op after remapping -----')
print([n for n in tf.get_default_graph().as_graph_def().node if n.name == 'my_op'])
Output, where we can clearly see that the input to the square operation has changed:
----- my_op before any remapping -----
[name: "my_op"
op: "Square"
input: "my_placeholder"
attr {
  key: "T"
  value {
    type: DT_FLOAT
  }
}
]
----- my_op after remapping -----
[name: "my_op"
op: "Square"
input: "my_next_tensor"
attr {
  key: "T"
  value {
    type: DT_FLOAT
  }
}
]

replace() fails for large strings

I have the following code:
cd(joinpath(homedir(),"Desktop"))
using HDF5
using JLD
# read contents of a file
t = readall("sourceFile")
# remove unnecessary characters
t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "")
# convert string into Float64 array (approximately ~140 columns)
data = readdlm(IOBuffer(t), ' ', char(10))
# save array on the hard drive
save("data.jld", "data", data)
This works fine when I test it with a sourceFile that has 10^4 or fewer lines. However, when sourceFile has around 5*10^6 lines it fails at t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "") with the following message
This question is old and based on an older version of Julia. However, it would be useful to check whether this works on a recent version. I recently tested this on the latest 0.5 version of Julia, and the code above seems to work correctly with 5*10^6 lines of 600 characters each. The entire operation takes about 5 GB of peak memory on my laptop.
julia> t=[randstring(600) for i=1:5*10^6];
julia> writecsv("/Users/aviks/tmp/long.csv", t)
julia> t=readstring("/Users/aviks/tmp/long.csv");
julia> length(t)
3005000000
julia> @time t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "");
43.599660 seconds (137 allocations: 3.358 GB, 0.85% gc time)
(PS: Note that readall is now deprecated in favour of readstring).
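(A further note for readers on Julia 1.x: replace now takes the pattern and replacement as a Pair, so the call above would become:)
t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))" => "")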

How do you use matrices in Nimrod?

I found this project on GitHub; it was the only search result returned for "nimrod matrix". I took the bare bones of it and changed it a little bit so that it compiled without errors, then added the last two lines to build a simple matrix and output a value, but the "getter" function isn't working for some reason. I adapted the instructions for adding properties found here, but something isn't right.
Here is my code so far. I'd like to use the GNU Scientific Library from within Nimrod, and I figured that this was the first logical step.
type
  TMatrix*[T] = object
    transposed: bool
    dataRows: int
    dataCols: int
    data: seq[T]

proc index[T](x: TMatrix[T], r, c: int): int {.inline.} =
  if r < 0 or r > (x.rows()-1):
    raise newException(EInvalidIndex, "matrix index out of range")
  if c < 0 or c > (x.cols()-1):
    raise newException(EInvalidIndex, "matrix index out of range")
  result = if x.transposed: c*x.dataCols+r else: r*x.dataCols+c

proc rows*[T](x: TMatrix[T]): int {.inline.} =
  ## Returns the number of rows in the matrix `x`.
  result = if x.transposed: x.dataCols else: x.dataRows

proc cols*[T](x: TMatrix[T]): int {.inline.} =
  ## Returns the number of columns in the matrix `x`.
  result = if x.transposed: x.dataRows else: x.dataCols

proc matrix*[T](rows, cols: int, d: openarray[T]): TMatrix[T] =
  ## Constructor. Initializes the matrix by allocating memory
  ## for the data and setting the number of rows and columns
  ## and sets the data to the values specified in `d`.
  result.dataRows = rows
  result.dataCols = cols
  newSeq(result.data, rows*cols)
  if len(d) > 0:
    if len(d) < (rows*cols):
      raise newException(EInvalidIndex, "insufficient data supplied in matrix constructor")
    for i in countup(0, rows*cols-1):
      result.data[i] = d[i]

proc `[][]`*[T](x: TMatrix[T], r, c: int): T =
  ## Element access. Returns the element at row `r` column `c`.
  result = x.data[x.index(r, c)]

proc `[][]=`*[T](x: var TMatrix[T], r, c: int, a: T) =
  ## Sets the value of the element at row `r` column `c` to
  ## the value supplied in `a`.
  x.data[x.index(r, c)] = a

var m = matrix(2, 2, [1, 2, 3, 4])
echo($m[0][0])
This is the error I get:
c:\program files (x86)\nimrod\config\nimrod.cfg(36, 11) Hint: added path: 'C:\Users\H127\.babel\libs\' [Path]
Hint: used config file 'C:\Program Files (x86)\Nimrod\config\nimrod.cfg' [Conf]
Hint: system [Processing]
Hint: mat [Processing]
mat.nim(48, 9) Error: type mismatch: got (TMatrix[int], int literal(0))
but expected one of:
system.[](a: array[Idx, T], x: TSlice[Idx]): seq[T]
system.[](a: array[Idx, T], x: TSlice[int]): seq[T]
system.[](s: string, x: TSlice[int]): string
system.[](s: seq[T], x: TSlice[int]): seq[T]
Thank you guys!
I'd like to first point out that the matrix library you refer to is three years old. For a programming language in development that's a lot of time, and indeed it doesn't compile any more with the current Nimrod git version:
$ nimrod c matrix
...
private/tmp/n/matrix/matrix.nim(97, 8) Error: ']' expected
It fails on the double array accessor, whose syntax seems to have changed. I guess your attempt to create a double [][] accessor is problematic; it could be ambiguous: are you accessing the double array accessor of the object, or are you accessing the nested array returned by the first brackets? I had to change the proc to the following:
proc `[]`*[T](x: TMatrix[T], r,c: int): T =
After that change you also need to change the way to access the matrix. Here's what I got:
for x in 0 .. <2:
  for y in 0 .. <2:
    echo "x: ", x, " y: ", y, " = ", m[x, y]
Basically, instead of specifying two bracket accesses you pass all the parameters inside a single bracket. That code generates:
x: 0 y: 0 = 1
x: 0 y: 1 = 2
x: 1 y: 0 = 3
x: 1 y: 1 = 4
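Presumably the setter needs the same treatment. Here is a minimal sketch of the matching change (untested against your exact compiler version):
proc `[]=`*[T](x: var TMatrix[T], r, c: int, a: T) =
  ## Sets the element at row `r`, column `c` to `a`.
  x.data[x.index(r, c)] = a
With that, assignment also takes both indices inside a single bracket: m[0, 1] = 5.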
With regards to finding software for Nimrod, I recommend using Nimble, Nimrod's package manager. Once you have it installed you can search the available and maintained packages. The command nimble search math shows two potential packages: linagl and extmath. Not sure if they are what you are looking for, but at least they seem fresher.
