Detectron2 DiffusionInst: Python process OOM-killed during training

I am trying to run DiffusionInst, which is built on Detectron2 (source code: https://github.com/chenhaoxing/DiffusionInst). During training, my Python process is always killed somewhere between iteration 10000 and 20000, which is not enough iterations to train DiffusionInst.
The only code I rewrote is the dataloader, to adapt it to my own dataset.
My new dataloader code:
class DiffusionInstDatasetMapper:
    """
    A callable which takes a dataset dict in Detectron2 Dataset format,
    and map it into a format used by DiffusionInst.
    The callable currently does the following:
    1. Read the image from "file_name"
    2. Applies geometric transforms to the image and annotation
    3. Find and applies suitable cropping to the image and annotation
    4. Prepare image and annotation to Tensors
    """

    def __init__(self, cfg, is_train=True):
        if cfg.INPUT.CROP.ENABLED and is_train:
            self.crop_gen = [
                # T.ResizeShortestEdge([400, 500, 600], sample_style="choice"),
                T.RandomCrop(cfg.INPUT.CROP.TYPE, cfg.INPUT.CROP.SIZE),
            ]
        else:
            self.crop_gen = None
        self.tfm_gens = build_transform_gen(cfg, is_train)
        logging.getLogger(__name__).info(
            "Full TransformGens used in training: {}, crop: {}".format(str(self.tfm_gens), str(self.crop_gen))
        )
        self.img_format = cfg.INPUT.FORMAT
        self.is_train = is_train

    def __call__(self, dataset_dict):
        """
        Args:
            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.
        Returns:
            dict: a format that builtin models in detectron2 accept
        """
        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below
        # image = utils.read_image(dataset_dict["file_name"], format=self.img_format)
        ## crop roi
        '''lst = dataset_dict['file_name'].split('-')
        image = sitk.ReadImage('-'.join(lst[:-2]))
        image = sitk.GetArrayFromImage(image)
        above, below = int(lst[-2]), int(lst[-1])
        image = image[:, above:below, :]'''
        ## no crop roi
        image = sitk.ReadImage(dataset_dict["file_name"], sitk.sitkFloat32)
        image = sitk.GetArrayFromImage(image)
        # print('**********************',image.shape,'************************')
        image = (image - image.min()) / (image.max() - image.min()) * 255
        #print(image.dtype)
        image = image.transpose(1, 2, 0).astype(np.uint8)
        image = np.repeat(image, 3, axis=2)
        #print(image.dtype)
        utils.check_image_size(dataset_dict, image)
        #origshape = image.shape
        if self.crop_gen is None:
            image, transforms = T.apply_transform_gens(self.tfm_gens, image)
        else:
            image, transforms = T.apply_transform_gens(
                self.tfm_gens + self.crop_gen, image
            )
        #print('orig', origshape, '\t\tresized', image.shape)
        image_shape = image.shape[:2]  # h, w
        # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,
        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.
        # Therefore it's important to use torch.Tensor.
        dataset_dict["image"] = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))
        del image
        gc.collect()
        if not self.is_train:
            # USER: Modify this if you want to keep them for some reason.
            dataset_dict.pop("annotations", None)
            return dataset_dict
        if "annotations" in dataset_dict:
            # USER: Modify this if you want to keep them for some reason.
            # import pdb;pdb.set_trace()
            for anno in dataset_dict["annotations"]:
                # anno.pop("segmentation", None)
                anno.pop("keypoints", None)
            # USER: Implement additional transformations if you have other types of data
            annos = [
                utils.transform_instance_annotations(obj, transforms, image_shape)
                for obj in dataset_dict.pop("annotations")
                if obj.get("iscrowd", 0) == 0
            ]
            instances = utils.annotations_to_instances(annos, image_shape, mask_format="bitmask")
            dataset_dict["instances"] = utils.filter_empty_instances(instances)
            del instances
            gc.collect()
        return dataset_dict
And the information about the oom-killer:
[2599547.303018] python invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=995
[2599547.303084] [<ffffffff8119bfae>] oom_kill_process+0x1fe/0x3c0
[2599547.303133] Task in /kubepods/burstable/podd09a5032-8b07-11ed-bb60-ac1f6b9ec91e/8b4a8d5c2c1a082f93b1610173beb70bbc19fb1a1c2e28150d2d912ed9b95b10 killed as a result of limit of /kubepods/burstable/podd09a5032-8b07-11ed-bb60-ac1f6b9ec91e
[2599547.305957] Memory cgroup out of memory: Kill process 1041771 (python) score 1198 or sacrifice child
[2599547.307810] Killed process 1041771 (python) total-vm:36436532kB, anon-rss:10288264kB, file-rss:104888kB
[2599718.702250] python invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=995
[2599718.702299] [<ffffffff8119bfae>] oom_kill_process+0x1fe/0x3c0
[2599718.702333] Task in /kubepods/burstable/podd09a5032-8b07-11ed-bb60-ac1f6b9ec91e/8b4a8d5c2c1a082f93b1610173beb70bbc19fb1a1c2e28150d2d912ed9b95b10 killed as a result of limit of /kubepods/burstable/podd09a5032-8b07-11ed-bb60-ac1f6b9ec91e
I set IMS_PER_BATCH to 1 and used a dataset that contains only one image, but the OOM problem still occurred.
What should I do to prevent the OOM kill?
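One way to narrow this down might be to check whether host memory keeps growing when the data mapper is called on its own, without the model. The following is only a diagnostic sketch, not a fix: `peak_rss_mb`, `log_rss_growth` and `fake_mapper` are hypothetical names, the file name is a placeholder, and the reading assumes the Linux convention that `ru_maxrss` is reported in kilobytes.

```python
import resource


def peak_rss_mb():
    """Peak resident set size of this process in MB (assumes Linux, where ru_maxrss is in kB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_rss_growth(mapper, sample, iterations=1000, every=100):
    """Call mapper(sample) repeatedly and record the peak RSS every `every` calls."""
    readings = []
    for i in range(1, iterations + 1):
        mapper(sample)
        if i % every == 0:
            readings.append((i, peak_rss_mb()))
    return readings


def fake_mapper(sample):
    # Hypothetical stand-in: replace with an instance of the real dataset mapper.
    return dict(sample)


# "example_image" is a placeholder, not a real file from the dataset.
readings = log_rss_growth(fake_mapper, {"file_name": "example_image"}, iterations=500, every=100)
for step, mb in readings:
    print(f"iteration {step}: peak RSS {mb:.1f} MB")
```

If the reading keeps climbing with the mapper alone, the growth probably comes from data loading; if it stays flat, the training loop or the dataloader workers would be the next place to look.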

Related

Copy a huge file with Julia Mmap

I have a big file (75 GB) memory-mapped into an array d that I want to copy into another array m. Because I do not have 75 GB of RAM available, I did:
for (i,v) in enumerate(d)
    m[i] = v
end
to copy the file value by value. But I get a copy rate of ~2 MB/s on an SSD where I expect at least 50 MB/s for both reads and writes.
How can I improve this copy rate?
=== [edit] ===
According to the comments, I changed my code to the following, which sped up the write rate to 15MB/s
function copydcimg(m::Array{UInt16,4}, d::Dcimg)
    m .= d
    Mmap.sync!(m)
end
copydcimg(m,d)
At this point, I think I should optimize the Dcimg code. This binary file is made of frames spaced by a timestamp. Here is the code I use to access the frames:
module dcimg
using Mmap
using TOML
struct Dcimg <: AbstractArray{UInt16,4} # struct allowing to access dcimg file
    filename::String # filename of the dcimg
    header::Int # header size in bytes
    clock::Int # clock size in bytes
    x::Int
    y::Int
    z::Int
    t::Int
    m # linear memory map
    Dcimg(filename, header, clock, x, y, z, t) =
        new(filename, header, clock, x, y, z, t,
            Mmap.mmap(open(filename), Array{UInt16, 3},
                      (x*y+clock÷sizeof(UInt16), z, t), header)
        )
end
# the following functions allow accessing a DCIMG like an Array
Base.size(D::Dcimg) = (D.x, D.y, D.z, D.t)
# skip clock: (i-1) ÷ (x*y) is the number of complete frames before element i
Base.getindex(D::Dcimg, i::Int) =
    D.m[i + ((i-1) ÷ (D.x*D.y))*D.clock÷sizeof(UInt16)]
Base.getindex(D::Dcimg, x::Int, y::Int, z::Int, t::Int) =
    D[x + D.x*((y-1) + D.y*((z-1) + D.z*(t-1)))]
# allowing to automatically parse size
function Dcimg(pathtag)
    p = TOML.parsefile(pathtag * ".toml")
    return Dcimg(pathtag * ".dcimg",
        # ...
    )
end
export Dcimg, getframe
end
I got it! The solution was to copy the file chunk by chunk, say frame by frame (around 1024×720 UInt16). This way I reached 300 MB/s, which I didn't even know was possible in a single thread. Here is the code.
In module dcimg, I added methods to access the file frame by frame:
# get frame number n (starting from 1)
getframe(D::Dcimg,n::Int) =
    reshape(D.m[
        D.x*D.y*(n-1)+1 + (n-1)*D.clock÷sizeof(UInt16) : # cosmetic line break
        D.x*D.y*n + (n-1)*D.clock÷sizeof(UInt16)
    ], D.x, D.y)
# get frame for layer z, time t (both starting from 1)
getframe(D::Dcimg,z::Int,t::Int) =
    getframe(D, z + D.z*(t-1))
Iterating over the frames within a loop:
function copyframes(m::Array{UInt16,4}, d::Dcimg)
    N = d.z*d.t
    F = d.x*d.y
    for i in 1:N
        m[(i-1)*F+1:i*F] = getframe(d, i)
    end
end
copyframes(m,d)
Thanks all in comments for leading me to this.
===== edit =====
For further reading, you might look at:
dd: How to calculate optimal blocksize?
http://blog.tdg5.com/tuning-dd-block-size/
which give hints about the optimal block size to copy at a time.
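The effect of block size can also be measured directly. The sketch below is written in Python rather than Julia, purely as an illustration, and is not the code used above: `copy_in_blocks` is a hypothetical helper, and the 4 MiB random file is a small stand-in for the real data, so the absolute timings say little about a 75 GB copy.

```python
import os
import tempfile
import time


def copy_in_blocks(src_path, dst_path, block_size):
    """Copy src_path to dst_path, reading block_size bytes at a time; return elapsed seconds."""
    start = time.perf_counter()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
    return time.perf_counter() - start


timings = {}
copied_ok = True
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "src.bin")
    dst_path = os.path.join(tmp, "dst.bin")
    payload = os.urandom(4 * 1024 * 1024)  # synthetic stand-in data
    with open(src_path, "wb") as f:
        f.write(payload)
    for block_size in (4 * 1024, 64 * 1024, 1024 * 1024):
        timings[block_size] = copy_in_blocks(src_path, dst_path, block_size)
        # Verify that every block size produces an identical copy
        with open(dst_path, "rb") as f:
            copied_ok = copied_ok and f.read() == payload

for block_size, seconds in timings.items():
    print(f"{block_size:>8} bytes per block: {seconds:.4f} s")
```

On a real file the operating system's page cache can distort repeated measurements, so the numbers are best treated as a rough comparison between block sizes.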

record portions of large audio on click of a button using pyaudio

I want to cut a large audio file into segments and store them in WAV format using pyaudio. Basically, I need to listen to the audio, cut the file from the starting point to wherever I choose, then start recording again and cut another portion, but I am not sure how to do this with pyaudio. Should I be looking at a different library?
I am new to Python, so any help would be appreciated.
This is the code I have experimented with:
import pyaudio
import wave
import time

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 2
RATE = 44100
WAVE_OUTPUT_FILENAME = "output.wav"

wf = wave.open("A001017001_Edited.wav", 'rb')
p = pyaudio.PyAudio()
# input=True allows stream.read(); output=True allows stream.write() for playback
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                output=True,
                frames_per_buffer=CHUNK)
check = True
while check:
    start = input("Do you wish to start recording?,then press ENTER")
    if start == "":  # input() returns an empty string when only ENTER is pressed
        try:
            stream.start_stream()
            t_start = time.time()  # playback start time
            kdata = wf.readframes(CHUNK)
            while len(kdata) > 0:
                stream.write(kdata)
                kdata = wf.readframes(CHUNK)
        except KeyboardInterrupt:
            t_stop = time.time()
            RECORD_SECONDS = t_stop - t_start  # time elapsed since playback started
            frames = []
            for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
                data = stream.read(CHUNK)
                frames.append(data)
            print(int(RATE / CHUNK * RECORD_SECONDS))
            print("stopped recording")
            stream.stop_stream()
            # separate handle for the output file so the source file stays open
            out = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
            out.setnchannels(CHANNELS)
            out.setsampwidth(p.get_sample_size(FORMAT))
            out.setframerate(RATE)
            out.writeframes(b''.join(frames))
            out.close()
    #compare if the whole audio is listened
    #or not and
    #if yes return false
stream.close()
p.terminate()
wf.close()
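If the cut points are known as times in seconds, the cutting step itself may not need pyaudio at all, since the standard `wave` module can seek and copy frames. A minimal sketch under that assumption: `cut_wav` is a hypothetical helper, and the silent 2-second file is a synthetic stand-in for the real recording.

```python
import os
import tempfile
import wave


def cut_wav(src_path, dst_path, start_sec, end_sec):
    """Copy the audio between start_sec and end_sec into a new WAV file.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp both cut points to the length of the file
        start_frame = min(int(start_sec * rate), total)
        end_frame = min(int(end_sec * rate), total)
        n_frames = max(end_frame - start_frame, 0)
        src.setpos(start_frame)
        frames = src.readframes(n_frames)
        with wave.open(dst_path, "wb") as dst:
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return n_frames


# Demo on a synthetic 2-second silent mono file (stand-in for a real recording)
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "source.wav")
dst_path = os.path.join(tmp_dir, "segment.wav")
with wave.open(src_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 16000)

written = cut_wav(src_path, dst_path, 0.5, 1.5)
with wave.open(dst_path, "rb") as r:
    segment_frames = r.getnframes()
print(written, segment_frames)
```

The start and end times could come from timestamps taken while the file is being played back, which would leave pyaudio responsible for playback only.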

how to prepare image data for using in torch

I want to prepare my own image data for training in torch.
I tried to find a good source for this but could not find one.
The references I found all point to data that has already been prepared in .lua or .t7 formats.
Can you please explain the procedure for preparing raw image data for torch (training, validation and test sets)?
Thanks
You may try to write your own data loader class. Store your image paths in a table and read each image using
require 'image'
YOUR_RGB_FILE_PATH = "/home/username/image.png"
img = image.load(YOUR_RGB_FILE_PATH, 3)
Write your Lua code in an iTorch notebook; it helps you debug quickly.
If you do not know how to start, you can refer to the project here, written with Lua Torch.
require 'io'
require 'torch'
require 'image'
------------------------------ Parameters ---------------------------------
file_name = '.../train.txt'
save_name = '.../train.t7'
num_images = 10000*3
num_channels = 3
width = 51
height = 51
---------------------------------------------------------------------------
file = io.open(file_name, 'rb')
data = torch.Tensor(num_images, num_channels, width, height):byte()
label = torch.Tensor(num_images):byte()
counter = 1
for line in file:lines() do
    print(counter)
    image_name, image_label = line:split(' ')[1], line:split(' ')[2]
    data[counter] = image.load(image_name, num_channels, 'byte')
    label[counter] = image_label
    counter = counter + 1
end
torch.save(save_name, {data = data, label = label})

Tensorflow: 6 layer CNN: OOM (use 10Gb GPU memory)

I am using the following code to run a 6-layer CNN with 2 FC layers on top (on a Tesla K80 GPU).
Somehow it consumes the entire 10 GB of memory and dies with an out-of-memory error. I know that I can reduce the batch_size and run it again, but I also want to run with 15 or 20 CNN layers. What is wrong with the following code, and why does it take all the memory? How should I run the code for a 15-layer CNN?
Code:
import os
import tensorflow as tf
import model

with tf.Graph().as_default() as g_train:
    filenames = tf.train.match_filenames_once(FLAGS.train_dir+'*.tfrecords')
    filename_queue = tf.train.string_input_producer(filenames, shuffle=True, num_epochs=FLAGS.num_epochs)
    feats,labels = get_batch_input(filename_queue, batch_size=FLAGS.batch_size)
    ### feats size=(batch_size, 100, 50)
    logits = model.inference(feats, FLAGS.batch_size)
    loss = model.loss(logits, labels, feats)
    tvars = tf.trainable_variables()
    global_step = tf.Variable(0, name='global_step', trainable=False)
    # Add to the Graph operations that train the model.
    train_op = model.training(loss, tvars, global_step, FLAGS.learning_rate, FLAGS.clip_gradients)
    # Add the Op to compare the logits to the labels during evaluation.
    eval_correct = model.evaluation(logits, labels, feats)
    summary_op = tf.merge_all_summaries()
    saver = tf.train.Saver(tf.all_variables(), max_to_keep=15)
    # The op for initializing the variables.
    init_op = tf.initialize_all_variables()
    sess = tf.Session()
    sess.run(init_op)
    summary_writer = tf.train.SummaryWriter(FLAGS.model_dir,
                                            graph=sess.graph)
    # Start input enqueue threads.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        step = 0
        while not coord.should_stop():
            _, loss_value = sess.run([train_op, loss])
            if step % 100 == 0:
                print('Step %d: loss = %.2f' % (step, loss_value))
                # Update the events file.
                summary_str = sess.run(summary_op)
                summary_writer.add_summary(summary_str, step)
            if (step == 0) or (step + 1) % 1000 == 0 or (step + 1) == FLAGS.max_steps:
                ckpt_model = os.path.join(FLAGS.model_dir, 'model.ckpt')
                saver.save(sess, ckpt_model, global_step=step)
                #saver.save(sess, FLAGS.model_dir, global_step=step)
            step += 1
    except tf.errors.OutOfRangeError:
        print('Done training for %d epochs, %d steps.' % (FLAGS.num_epochs, step))
    finally:
        coord.join(threads)
        sess.close()
###################### File model.py ####################
def conv2d(x, W, b, strides=1):
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1],
                     padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

def maxpool2d(x, k=2, s=2):
    # MaxPool2D wrapper
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, s, s, 1],
                          padding='SAME')

def inference(feats, batch_size):
    #feats size (batch_size,100,50,1) #batch_size=256
    conv1_w = tf.get_variable("conv1_w", [filter_size, filter_size, 1, 256], initializer=tf.uniform_unit_scaling_initializer())
    conv1_b = tf.get_variable("conv1_b", [256])
    conv1 = conv2d(feats, conv1_w, conv1_b, 2)
    conv1 = maxpool2d(conv1, k=2, s=2)
    ### This was replicated for 6 layers and the 2 FC connected layers are added
    return logits

def training(loss, train_vars, global_step, learning_rate, clip_gradients):
    # Add a scalar summary for the snapshot loss.
    tf.scalar_summary(loss.op.name, loss)
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, train_vars, aggregation_method=1), clip_gradients)
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.apply_gradients(zip(grads, train_vars), global_step=global_step)
    return train_op
I am not sure what the model Python library is. If it is something you wrote and you can change the optimizer settings, I would suggest the following, which I use in my own code:
train_step = tf.train.AdamOptimizer(learning_rate).minimize(cost, aggregation_method = tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)
By default the aggregation_method is ADD_N, but changing it to EXPERIMENTAL_ACCUMULATE_N or EXPERIMENTAL_TREE can save a lot of memory. The main memory hog in these programs is that TensorFlow must keep the output values of every neuron so that it can compute the gradients. Changing the aggregation_method helps a lot in my experience.
Also, BTW, I don't think there is anything wrong with your code. I can run out of memory on small conv-nets as well.
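To put a rough number on the stored activations, here is a back-of-the-envelope sketch for the first layer only. It assumes float32 values, the batch size of 256 and the 100×50×1 input mentioned in the code comments, and the stride-2 convolution with 256 filters followed by 2×2 pooling; `same_padding_out` and `activation_bytes` are hypothetical helper names.

```python
import math


def same_padding_out(size, stride):
    """Spatial output size for padding='SAME': ceil(size / stride)."""
    return math.ceil(size / stride)


def activation_bytes(batch, height, width, channels, bytes_per_value=4):
    """Memory needed to hold one activation tensor (float32 by default)."""
    return batch * height * width * channels * bytes_per_value


batch, height, width = 256, 100, 50

# conv1: 256 filters, stride 2, padding='SAME'
h1, w1 = same_padding_out(height, 2), same_padding_out(width, 2)
conv1_bytes = activation_bytes(batch, h1, w1, 256)

# 2x2 max pooling with stride 2, padding='SAME'
h2, w2 = same_padding_out(h1, 2), same_padding_out(w1, 2)
pool1_bytes = activation_bytes(batch, h2, w2, 256)

print(f"conv1 output {h1}x{w1}x256: {conv1_bytes / 2**20:.2f} MiB")
print(f"pool1 output {h2}x{w2}x256: {pool1_bytes / 2**20:.2f} MiB")
```

Under these assumptions the first convolution alone keeps a few hundred MiB alive for the backward pass, before counting the pre-activation tensors, the gradients, or the remaining layers, so a smaller batch size or fewer filters in the early layers reduces the footprint roughly in proportion.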

Tensorflow : how to insert custom input to existing graph?

I have downloaded a TensorFlow GraphDef that implements a VGG16 ConvNet, which I load like this:
Pl['images'] = tf.placeholder(tf.float32,
                              [None, 448, 448, 3],
                              name="images") #batch x width x height x channels
with open("tensorflow-vgg16/vgg16.tfmodel", mode='rb') as f:
    fileContent = f.read()
graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
tf.import_graph_def(graph_def, input_map={"images": Pl['images']})
Besides, I have image features that have the same shape as the output of "import/pool5/".
How can I tell my graph that I don't want to use its input "images", but the tensor "import/pool5/" as input?
Thanks!
EDIT
OK I realize I haven't been very clear. Here is the situation:
I am trying to use this implementation of ROI pooling, using a pre-trained VGG16, which I have in the GraphDef format. So here is what I do:
First of all, I load the model:
tf.reset_default_graph()
with open("tensorflow-vgg16/vgg16.tfmodel",
          mode='rb') as f:
    fileContent = f.read()
graph_def = tf.GraphDef()
graph_def.ParseFromString(fileContent)
graph = tf.get_default_graph()
Then, I create my placeholders
images = tf.placeholder(tf.float32,
[None, 448, 448, 3],
name="images") #batch x width x height x channels
boxes = tf.placeholder(tf.float32,
[None,5], # 5 = [batch_id,x1,y1,x2,y2]
name = "boxes")
And I define the output of the first part of the graph to be conv5_3/Relu
tf.import_graph_def(graph_def,
input_map={'images':images})
out_tensor = graph.get_tensor_by_name("import/conv5_3/Relu:0")
So, out_tensor is of shape [None,14,14,512]
Then, I do the ROI pooling:
[out_pool,argmax] = module.roi_pool(out_tensor,
boxes,
7,7,1.0/1)
With out_pool.shape = N_Boxes_in_batch x 7 x 7 x 512, which is homogeneous to pool5. I would then like to feed out_pool as an input to the op that comes just after pool5, so it would look like
tf.import_graph_def(graph.as_graph_def(),
input_map={'import/pool5':out_pool})
But it doesn't work, I have this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-89-527398d7344b> in <module>()
5
6 tf.import_graph_def(graph.as_graph_def(),
----> 7 input_map={'import/pool5':out_pool})
8
9 final_out = graph.get_tensor_by_name("import/Relu_1:0")
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/importer.py in import_graph_def(graph_def, input_map, return_elements, name, op_dict)
333 # NOTE(mrry): If the graph contains a cycle, the full shape information
334 # may not be available for this op's inputs.
--> 335 ops.set_shapes_for_outputs(op)
336
337 # Apply device functions for this op.
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py in set_shapes_for_outputs(op)
1610 raise RuntimeError("No shape function registered for standard op: %s"
1611 % op.type)
-> 1612 shapes = shape_func(op)
1613 if len(op.outputs) != len(shapes):
1614 raise RuntimeError(
/home/hbenyounes/vqa/roi_pooling_op_grad.py in _roi_pool_shape(op)
13 channels = dims_data[3]
14 print(op.inputs[1].name, op.inputs[1].get_shape())
---> 15 dims_rois = op.inputs[1].get_shape().as_list()
16 num_rois = dims_rois[0]
17
/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_shape.py in as_list(self)
745 A list of integers or None for each dimension.
746 """
--> 747 return [dim.value for dim in self._dims]
748
749 def as_proto(self):
TypeError: 'NoneType' object is not iterable
Any clue ?
It is usually very convenient to use tf.train.export_meta_graph to store the whole MetaGraph. Then, upon restoring, you can use tf.train.import_meta_graph, because it passes all additional arguments to the underlying import_scoped_meta_graph, which has an input_map argument and uses it when it reaches its own invocation of import_graph_def.
It is not documented, and took me waaaay toooo much time to find it, but it works!
What I would do is something along those lines:
- First, retrieve the names of the tensors representing the weights and biases of the 3 fully connected layers that come after pool5 in VGG16.
To do that I would inspect [n.name for n in graph.as_graph_def().node].
(They probably look something like import/locali/weight:0, import/locali/bias:0, etc.)
- Put them in Python lists:
weights_names=["import/local1/weight:0" ,"import/local2/weight:0" ,"import/local3/weight:0"]
biases_names=["import/local1/bias:0" ,"import/local2/bias:0" ,"import/local3/bias:0"]
- Define a function that looks something like:
def pool5_tofcX(input_tensor, layer_number=3):
    flatten=tf.reshape(input_tensor,(-1,7*7*512))
    tmp=flatten
    for i in range(layer_number):
        tmp=tf.matmul(tmp, graph.get_tensor_by_name(weights_names[i]))
        tmp=tf.nn.bias_add(tmp, graph.get_tensor_by_name(biases_names[i]))
        tmp=tf.nn.relu(tmp)
    return tmp
Then define the tensor using the function:
wanted_output=pool5_tofcX(out_pool)
Then you are done!
Jonan Georgiev provided an excellent answer here. The same approach was also described with little fanfare at the end of this git issue: https://github.com/tensorflow/tensorflow/issues/3389
Below is a copy/paste runnable example of using this approach to switch out a placeholder for a tf.data.Dataset get_next tensor.
import tensorflow as tf
my_placeholder = tf.placeholder(dtype=tf.float32, shape=1, name='my_placeholder')
my_op = tf.square(my_placeholder, name='my_op')
# Save the graph to memory
graph_def = tf.get_default_graph().as_graph_def()
print('----- my_op before any remapping -----')
print([n for n in graph_def.node if n.name == 'my_op'])
tf.reset_default_graph()
ds = tf.data.Dataset.from_tensors(1.0)
next_tensor = tf.data.make_one_shot_iterator(ds).get_next(name='my_next_tensor')
# Restore the graph with a custom input mapping
tf.graph_util.import_graph_def(graph_def, input_map={'my_placeholder': next_tensor}, name='')
print('----- my_op after remapping -----')
print([n for n in tf.get_default_graph().as_graph_def().node if n.name == 'my_op'])
Output, where we can clearly see that the input to the square operation has changed.
----- my_op before any remapping -----
[name: "my_op"
op: "Square"
input: "my_placeholder"
attr {
key: "T"
value {
type: DT_FLOAT
}
}
]
----- my_op after remapping -----
[name: "my_op"
op: "Square"
input: "my_next_tensor"
attr {
key: "T"
value {
type: DT_FLOAT
}
}
]
