Extracting feature vector from Images Tensorflow OOM - out-of-memory

I have used pretrained network weights that I have downloaded from Caffe zoo to build a feature extractor (VGG-16) in tensorflow.
I have therefore redefined the architecture of the network in TF with the imported weights as constants and added an extra fully connected layer with tf.Variables to train a linear SVM by SGD on Hinge loss cost.
My initial training set is composed of 100000 32x32x3 images in the form of a numpy array.
I therefore had to resize them to 224x224x3, which is the input size of VGG, but the resized set does not fit into memory.
So I removed unnecessary examples and narrowed the set down to 10000 images of size 224x224x3. That is awful but still acceptable, since only the support vectors matter, yet even then I still get an OOM error from TF while training.
That should not be the case, as the only important representation is the one from the penultimate layer, of size 4096, which is easily manageable, and the parameters to backprop on are only 4096 weights plus 1 bias.
So what I can do is first transform all my images to features with TF network with only constants to form a 10000x4096 dataset and then train a second tensorflow model.
Or I could recalculate the features for each batch inside the next_batch method. Or I could use the panoply of buffers/queue runners that TF provides, but that is a bit scary as I am not really familiar with them.
I do not like these methods; I think there should be something more elegant (without too many queues if possible).
What would be the most Tensorflow-ic method to deal with this?

If I understand your question correctly, 100K images do not fit in memory at all, while 10K images do fit, but then the network itself OOMs. That sounds very reasonable: 10K images alone, assuming 4 bytes per pixel per channel, occupy 5.6GiB (or 1.4GiB if you somehow only spend 1 byte per pixel per channel). So even if the dataset happens to fit in memory, once you add your model, which will occupy a couple more GiB, you will OOM.
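The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch; the helper name dataset_gib is illustrative and not part of any library.

```python
# Back-of-the-envelope memory estimate for a dense image array.
GIB = 1024 ** 3

def dataset_gib(n_images, height, width, channels, bytes_per_value):
    """Size in GiB of an n_images x height x width x channels array."""
    return n_images * height * width * channels * bytes_per_value / GIB

float32_gib = dataset_gib(10_000, 224, 224, 3, 4)   # 4 bytes per value
uint8_gib = dataset_gib(10_000, 224, 224, 3, 1)     # 1 byte per value
original_gib = dataset_gib(100_000, 32, 32, 3, 1)   # the original 32x32 images, 1 byte per value

print(f"10K upscaled, 4 bytes/value: {float32_gib:.1f} GiB")
print(f"10K upscaled, 1 byte/value:  {uint8_gib:.1f} GiB")
print(f"100K original, 1 byte/value: {original_gib:.2f} GiB")
```

The last line shows why keeping the images at their original size until they are needed is so much cheaper than storing them upscaled.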
Now, there are several ways you can address it:
You should train using minibatches (if you do not already). With a minibatch of size 512 you will load significantly less data onto the GPU. With minibatches you also do not need to load your entire dataset into a numpy array at the beginning. Build your iterator so that it loads 512 images at a time, runs the forward and backward pass (sess.run(train...)), loads the next 512 images, and so on. This way you never need to have 10K or 100K images in memory simultaneously.
It also appears to be very wasteful to upscale the images when your originals are so much smaller. What you might consider doing is taking the convolution layers from the VGG net (the weights of conv layers do not depend on the dimensions of the input images) and training the fully connected layers on top of them from scratch. To do that, trim the VGG net after the flatten layer, run it on all the images you have to produce the output of the flatten layer for each image, then train a three-layer fully connected network on those features (this will be relatively fast compared to training the entire conv network), and plug the resulting net in after the flatten layer of the original VGG net. This might also produce better results, because the convolution layers were trained to find features in original-size images, not in blurry upscaled ones.
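The minibatch suggestion above can be sketched as a plain Python generator. This is a minimal sketch, not a definitive implementation: iterate_minibatches and load_examples are hypothetical names, and the loader here is a stand-in that would, in practice, read and resize only the requested images.

```python
import random

def iterate_minibatches(n_examples, batch_size, load_fn, shuffle=True, seed=0):
    """Yield batches produced by load_fn(indices), at most batch_size examples each."""
    indices = list(range(n_examples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_examples, batch_size):
        yield load_fn(indices[start:start + batch_size])

# Hypothetical loader: a real one would read the images with these indices,
# resize them and return (images, labels). Here it simply returns the indices.
def load_examples(batch_indices):
    return batch_indices

batches = list(iterate_minibatches(10_000, 512, load_examples))
print(len(batches), len(batches[0]), len(batches[-1]))
```

In a training loop you would not build the full list; each yielded batch would be fed to one training step and then discarded, so only one batch of images is resident at a time.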

I guess a way to do that with some queues and threads, but not too many, would be to save the training set into the TensorFlow protobuf format (in one or several files) using tf.python_io.TFRecordWriter.
Then write a method that reads and decodes a single example from the protobuf, and finally use tf.train.shuffle_batch with that method to feed BATCH_SIZE examples to the optimizer.
This way there are at most capacity (defined in shuffle_batch) tensors in memory at the same time.
This awesome tutorial from Indico explains it all.
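To illustrate why a bounded capacity keeps memory use under control, here is a plain-Python sketch of a shuffle buffer feeding fixed-size batches. It is not TensorFlow's implementation and does not use the TensorFlow API; shuffle_buffer and make_batches are hypothetical names, and the integer records stand in for decoded examples.

```python
import random

def shuffle_buffer(example_stream, capacity, seed=0):
    """Yield examples in randomized order while holding at most `capacity` of them."""
    rng = random.Random(seed)
    buffer = []
    for example in example_stream:
        buffer.append(example)
        if len(buffer) >= capacity:
            # Move a randomly chosen element to the end and emit it.
            i = rng.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            yield buffer.pop()
    # The stream is exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer

def make_batches(stream, batch_size):
    """Group a stream into lists of batch_size items (the last one may be shorter)."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

records = range(1000)  # stand-in for examples decoded one at a time from disk
batches = list(make_batches(shuffle_buffer(records, capacity=100), batch_size=32))
```

A larger capacity gives better mixing at the cost of more memory, which is the same trade-off the capacity argument controls.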

Related

Unchanging training accuracy in convolutional neural network using MXNet

I am totally new to NNs and want to classify almost 6000 images that belong to different games (gathered by IR). I used the steps introduced in the following link, but I get the same training accuracy in each round.
Some info about the NN architecture: 2 convolutional, activation, and pooling layers. Activation type: relu. The numbers of filters in the first and second layers are 30 and 70 respectively.
2 fully connected layers with 500 and 2 hidden units respectively.
http://firsttimeprogrammer.blogspot.de/2016/07/image-recognition-in-r-using.html
I had a similar problem, but for regression. After trying several things (different optimizers, varying layers and nodes, learning rates, iterations etc.), I found that the way the initial values are chosen helps a lot. For instance, I used a random normal initializer with a standard deviation of 0.2 (initializer = mx.init.normal(0.2)).
I came upon this value in the blog post linked below, which I recommend you read. [EDIT] An excerpt from it:
Weight initialization. Worry about the random initialization of the weights at the start of learning.
If you are lazy, it is usually enough to do something like 0.02 * randn(num_params). A value at this scale tends to work surprisingly well over many different problems. Of course, smaller (or larger) values are also worth trying.
If it doesn’t work well (say your neural network architecture is unusual and/or very deep), then you should initialize each weight matrix with the init_scale / sqrt(layer_width) * randn. In this case init_scale should be set to 0.1 or 1, or something like that.
Random initialization is super important for deep and recurrent nets. If you don’t get it right, then it’ll look like the network doesn’t learn anything at all. But we know that neural networks learn once the conditions are set.
Fun story: researchers believed, for many years, that SGD cannot train deep neural networks from random initializations. Every time they would try it, it wouldn’t work. Embarrassingly, they did not succeed because they used the “small random weights” for the initialization, which works great for shallow nets but simply doesn’t work for deep nets at all. When the nets are deep, the many weight matrices all multiply each other, so the effect of a suboptimal scale is amplified.
But if your net is shallow, you can afford to be less careful with the random initialization, since SGD will just find a way to fix it.
You’re now informed. Worry and care about your initialization. Try many different kinds of initialization. This effort will pay off. If the net doesn’t work at all (i.e., never “gets off the ground”), keep applying pressure to the random initialization. It’s the right thing to do.
http://yyue.blogspot.in/2015/01/a-brief-overview-of-deep-learning.html
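The two initialization recipes from the excerpt can be sketched as follows. This is a minimal illustration using only the Python standard library; simple_init and scaled_init are hypothetical names, and random.gauss plays the role of randn.

```python
import math
import random

rng = random.Random(0)

def simple_init(num_params, scale=0.02):
    """The 'lazy' recipe: scale * randn(num_params)."""
    return [scale * rng.gauss(0.0, 1.0) for _ in range(num_params)]

def scaled_init(layer_width, num_outputs, init_scale=1.0):
    """init_scale / sqrt(layer_width) * randn for a layer_width x num_outputs matrix."""
    std = init_scale / math.sqrt(layer_width)
    return [
        [std * rng.gauss(0.0, 1.0) for _ in range(num_outputs)]
        for _ in range(layer_width)
    ]

def sample_std(values):
    """Sample standard deviation, used here only to inspect the result."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

flat_weights = simple_init(10_000)
layer_weights = scaled_init(layer_width=400, num_outputs=50)

print(f"simple init std: {sample_std(flat_weights):.4f}")
print(f"scaled init std: {sample_std([w for row in layer_weights for w in row]):.4f}")
```

With layer_width=400 and init_scale=1.0 the second recipe targets a standard deviation of 1/20 = 0.05, so wider layers automatically get smaller initial weights.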

What are the minimum system requirements for analysing large datasets (30gb) in R?

I tried running the Apriori algorithm on a 30GB CSV file in which each row is a basket with up to 34 items (columns) in it. RStudio died just after execution started. I want to know the minimum system requirements: how much RAM and what CPU configuration do I need to run algorithms on large data sets?
This question cannot be answered as such. It highly depends on what you want to do with the data.
Example
If you are able to process all lines one by one, you just need a tiny bit of RAM (for example if you want to count them; I believe this also holds for the most trivial use of Apriori).
If you want to calculate the distance between all points efficiently, you will want a ton of RAM, and another few GB to store the output (I believe this is even less intense than the most extreme use of Apriori).
Conclusion
As such I would recommend:
Use whatever hardware you have to process a subset of the data. Check your memory and CPU usage, as you increase the data size (or other parameters) and extrapolate your results to see what you probably need.
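The line-by-line case mentioned above can be sketched in Python. Counting how many baskets contain each single item is the first pass of Apriori, and it only ever needs one row in memory plus the counters. The function name count_items, the sample data, and the 0.5 support threshold are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

def count_items(lines):
    """Count how many baskets contain each item, reading one row at a time."""
    counts = Counter()
    n_baskets = 0
    for row in csv.reader(lines):
        items = {cell.strip() for cell in row if cell.strip()}
        if not items:
            continue  # skip empty rows
        counts.update(items)
        n_baskets += 1
    return counts, n_baskets

# Tiny in-memory stand-in for the CSV; a real file object from
# open(path, newline="") could be passed in exactly the same way.
sample = io.StringIO("milk,bread,eggs\nbread,butter\nmilk,bread\n")
counts, n_baskets = count_items(sample)

min_support = 0.5  # assumed threshold, for illustration only
frequent = {item for item, c in counts.items() if c / n_baskets >= min_support}
print(sorted(frequent))
```

Memory use here grows with the number of distinct items, not with the number of rows; the later Apriori passes over item pairs and larger sets are where memory demands rise.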

Trying to find the cut-off point of a built-in function in R, since it currently does not run

In R, I am trying to use the Markov chain package to convert clickstream data to a Markov chain. I have 4GB of RAM, but the command fails after running for a long time, because the ongoing conversion cannot allocate more than 3969 MB (that is what the screen says). I am trying to find out up to what point the program will run. That is, if I have n nodes, up to how many nodes (obviously fewer than n) or rows (the rows might contain the same or different nodes) will the program run? I am trying to do attribution modelling using R. The conversion paths are converted from clickstream form to a Markov chain, and I am trying to find the transition matrix from that.
[Image: the function and a sample dataset, with the code for a small clickstream dataset. Here h, c, d, p are different nodes.]
The image shows the code and sample data. The function converts this data into a Markov chain object containing many things, of which I am mainly trying to get the transition matrix and the steady state. As I increase the data size (the number of different channel paths or users is not important; it is the number of different nodes that matters), the function fails because it cannot allocate more than the 4GB of RAM. I tried trial and error to find the point beyond which the function stops working, but it did not help. Is there a way to know up to what node (or row) the function will work, so that I can generate the transition matrix up to that point? I would also like to know how memory usage increases with every added node, as I believe the relationship between the two is not linear.
Please let me know if the question is not specific enough and if it might need any more details.

Why are matrices (in R) so much slower and larger than image files that contain the same data?

I am working with raw imaging mass spectrometry data. This kind of data is very similar to a traditional image file, except that rather than 3 colour channels, we have channels corresponding to the number of ions we are measuring (in my case, 300). The data is originally stored in a proprietary format, but can be exported to a .txt file as a table with the format:
x, y, z, i (intensity), m (mass)
As you can imagine, the files can be huge. A typical image might be 256 x 256 x 20, giving 1310720 pixels. If each has 300 mass channels, this gives a table with 393216000 rows and 5 columns. This is huge! And consequently won't fit into memory. Even if I select smaller subsets of the data (such as a single mass), the files are very slow to work with. By comparison, the proprietary software is able to load up and work with these files extremely quickly, for example just taking a second or two to open up a file into memory.
I hope I have made myself clear. Can anyone explain this? How can it be that two files containing essentially the exact same data can have such different sizes and speeds? How can I work with a matrix of image data much faster?
Can anyone explain this?
Yep
How can it be that two files containing essentially the exact same data can have such different sizes and speeds?
R uses doubles as the default numeric type. Thus, just the storage for your data frame is about 16 GB. The proprietary software is most likely using float as the underlying type, thus cutting the memory requirement to 8 GB.
How can I work with a matrix of image data much faster?
Buy a computer with 32 GB of RAM. Even with a 32 GB computer, think about using data.table in R with operations done by reference, because R likes to copy data frames.
Or you might want to move to Python/pandas for processing, with explicit use of dtype=float32.
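The double-versus-float estimate can be checked with the standard library alone. This sketch uses the array module only to look up the per-value sizes; the row and column counts come from the question.

```python
from array import array

rows, cols = 393_216_000, 5  # table dimensions from the question

bytes_double = rows * cols * array("d").itemsize  # 8 bytes per value
bytes_float = rows * cols * array("f").itemsize   # 4 bytes per value

print(f"double: {bytes_double / 1e9:.1f} GB")
print(f"float:  {bytes_float / 1e9:.1f} GB")
```

Halving the element size halves the footprint, which is where the rough 16 GB and 8 GB figures come from.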
UPDATE
If you want to stay with R, take a look at the bigmemory package, though I would say dealing with it is not for the faint of heart.
The answer to this question turned out to be a little esoteric and pretty specific to my data-set, but may be of interest to others. My data is very sparse - i.e. most of the values in my matrix are zero. Therefore, I was able to significantly reduce the size of my data using the Matrix package (capitalisation important), which is designed to more efficiently handle sparse matrices. To implement the package, I just inserted the line:
data <- Matrix(data)
The amount of space saved will vary depending on the sparsity of the dataset, but in my case the size went from 1.8 GB to 156 MB. A Matrix behaves just like a matrix, so there was no need to change my other code, and there was no noticeable change in speed. Sparsity is obviously something that the proprietary format could take advantage of.
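The idea behind the saving can be illustrated with a toy sketch in Python: store only the non-zero entries together with their positions. This is not how the Matrix package lays out its data internally; to_sparse and the dictionary-of-coordinates layout are assumptions chosen for brevity.

```python
def to_sparse(dense):
    """Keep only the non-zero entries of a 2-D list as {(row, col): value}."""
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != 0
    }

dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 0],
]
sparse = to_sparse(dense)

stored_dense = sum(len(row) for row in dense)  # every cell is stored
stored_sparse = len(sparse)                    # only non-zero cells are stored
print(stored_dense, stored_sparse)
```

Each stored entry also carries its coordinates, so the representation only pays off when most values are zero, which is why the saving depends on the sparsity of the dataset.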

Train SVM on a very large dataset stored on hard drive

I have a very large self-collected dataset of size [2000000 12672], where the rows are the instances and the columns are the features. This dataset occupies ~60 gigabytes on the local hard disk. I want to train a linear SVM on this dataset. The problem is that I have only 8 gigabytes of RAM, so I cannot load all the data at once. Is there any way to train the SVM on this large dataset? I generate the dataset myself, and it is currently in HDF5 format.
Thanks
Welcome to machine learning! One of the hard things about working in this space is the compute requirements. There are two main kinds of algorithms, online and offline.
Online: supports feeding in examples one at a time, each one improving the model slightly
Offline: supports feeding in the entire dataset at once, achieving higher accuracy than an online model
Many typical algorithms have both online and offline implementations, but an SVM is not one of them. To the best of my knowledge, SVMs are traditionally an offline-only algorithm. The reason for this is a lot of the fine details around "shattering" the dataset. I won't go too far into the math here, but if you read into it, it should become apparent.
It's also worth noting that the complexity of an SVM is somewhere between n^2 and n^3, meaning that even if you could load everything into memory it would take ages to actually train the model. It's very typical to test with a much smaller portion of your dataset before moving to the full dataset.
When moving to the full dataset you would have to run this on a much larger machine than your own, but AWS should have something large enough for you, though at your size of data I highly advise using something other than an SVM. At large data sizes, neural net approaches really shine, and can be trained in a more realistic amount of time.
As alluded to in the comments, there's also the concept of an out-of-core algorithm that can operate directly on objects stored on disk. The only group I know with a good offering of out-of-core algorithms is Dato. It's a commercial product, but might be your best solution here.
A stochastic gradient descent approach to SVM could help, as it scales well and avoids the n^2 problem. An implementation available in R is RSofia, which was created by a team at Google and is discussed in Large Scale Learning to Rank. In the paper, they show that, compared to a traditional SVM, the SGD approach significantly decreases the training time (this is due to (1) the pairwise learning method and (2) the fact that only a subset of the observations end up being used to train the model).
Note that RSofia is a little more bare bones than some of the other SVM packages available in R; for example, you need to do your own centering and scaling of features.
As to your memory problem, it'd be a little surprising if you needed the entire dataset - I would expect that you'd be fine reading in a sample of your data and then training your model on that. To confirm this, you could train multiple models on different samples and then estimate performance on the same holdout set - the performance should be similar across the different models.
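The stochastic gradient descent approach mentioned above can be sketched in plain Python. This is a minimal illustration of minibatch SGD on the L2-regularised hinge loss, not RSofia's algorithm; train_linear_svm_sgd, toy_batches and all hyperparameter values are assumptions. In practice the batch generator would read slices of the HDF5 file from disk instead of an in-memory list.

```python
import random

def train_linear_svm_sgd(batches, n_features, lam=0.01, epochs=20, lr=0.1):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by minibatch SGD.

    `batches` is a callable returning a fresh iterable of (X, y) minibatches,
    so the data can be re-read from disk on every epoch. Labels are -1 or +1.
    """
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for X, y in batches():
            grad_w = [lam * wj for wj in w]  # gradient of the regulariser
            grad_b = 0.0
            for xi, yi in zip(X, y):
                margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
                if margin < 1:  # only margin violators contribute to the hinge gradient
                    for j, xj in enumerate(xi):
                        grad_w[j] -= yi * xj / len(X)
                    grad_b -= yi / len(X)
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b

# Toy data: two well-separated clusters centred at (2, 2) and (-2, -2).
rng = random.Random(0)
data = []
for _ in range(200):
    label = rng.choice([-1, 1])
    point = [label * 2 + rng.uniform(-1, 1), label * 2 + rng.uniform(-1, 1)]
    data.append((point, label))

def toy_batches(batch_size=20):
    """Yield (X, y) minibatches; a real version would read chunks from disk."""
    for start in range(0, len(data), batch_size):
        chunk = data[start:start + batch_size]
        yield [x for x, _ in chunk], [y for _, y in chunk]

w, b = train_linear_svm_sgd(toy_batches, n_features=2)
accuracy = sum(
    (1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1) == y
    for x, y in data
) / len(data)
print(f"training accuracy: {accuracy:.2f}")
```

Only one minibatch is held in memory at a time, so the working set is bounded by the batch size and the number of features rather than by the 60 GB on disk.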
You don't say why you want a linear SVM, but if you can consider another model that often gives superior results, then check out the hpelm Python package. It can read an HDF5 file directly. You can find it here: https://pypi.python.org/pypi/hpelm. It trains on segmented data, which can even be pre-loaded (called async) to speed up reading from slow hard disks.
