How can I reduce CPU usage in QML 3D?

I created a QML program that loads a 3D model (an .obj file) with SceneLoader, but CPU usage climbs to about 116% on ARM (there are two cores, each at 50%-60% utilization). What should I do to reduce CPU usage? In addition, the frame rate is 60 FPS while the 3D model is loaded.

The main problem is that .obj files use the text-based Wavefront format to describe the geometry.
Loading geometry from this file format requires SceneLoader or Mesh to parse the text first and populate lists of vertices/faces/etc., which ends up being a costly operation and slows things down considerably.
You should investigate using .fbx files (Autodesk's format) with SceneLoader instead; that format stores the geometry as binary data, which is much faster to load.
I believe the ~60 FPS limit likely comes from the Qt window implementation for that particular platform. If you measure the FPS while doing non-Qt3D rendering, I suspect you will also see ~60 FPS.
In other words, the performance you are observing is expected for the way you are doing things.

Related

How should we select the chunk size in disk.frame?

I'm working with disk.frame and it's great so far.
One piece that confuses me is the chunk size. I sense that a small chunk might create too many tasks, and disk.frame might eat up time managing those tasks. On the other hand, a big chunk might be too expensive for the workers, reducing the performance benefits from parallelism.
What pieces of information can we use to make a better guess for chunk size?
This is a tough problem and I probably need better tools.
Currently, everything is done on a guess basis, but I have made a presentation on this and I will try to bring it into the docs soon.
Ideally, you want
RAM used = number of workers * RAM usage per chunk
So, if you have 6 workers (ideal for 6 CPU cores), then you would want a smaller chunk than someone with 4 workers but the same amount of total RAM.
The difficulty is in estimating "RAM usage per chunk", which is different for different operations like merge, sort, and just vanilla filtering!
This is a hard problem to solve in general, so there is no good solution for now.
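A rough back-of-the-envelope sketch of that rule of thumb, with entirely hypothetical numbers (measure the real bytes-per-row on a sample chunk, e.g. with object.size()):

# hypothetical numbers; illustrates RAM used = workers * RAM per chunk
total_ram_gb   <- 32     # RAM you are willing to dedicate to the job
n_workers      <- 6      # e.g. one worker per CPU core
safety_factor  <- 3      # head-room for operations (merge, sort) that copy data

ram_per_chunk_gb <- total_ram_gb / (n_workers * safety_factor)

bytes_per_row  <- 200    # measured on a representative sample chunk

rows_per_chunk <- floor(ram_per_chunk_gb * 1024^3 / bytes_per_row)
rows_per_chunk           # a starting point for how many rows to put in each chunk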

R raster timeseries: what's the most efficient read and write?

I have the following problem/question:
I've written an R function which smooths values from a time series. The time series is defined by a large number of single global raster files, so each pixel is a series with n timesteps (generally more than 500). Even though I have plenty of RAM, I have to rely on blockwise processing because loading the entire dataset is just too much. So far so good.
I've written (IMHO) fairly decent code which leverages parallel processing where possible. I have a processing machine which should be more than well equipped to handle this amount of data and computation. This leads me to believe that most of the time is spent reading lots of values from disk and then, after smoothing, writing lots of values back to disk.
So I've tried running the code with the files being on either a normal HDD or a normal SSD.
Against my expectations, it didn't really matter much.
Then I tried running a test function which reads a file, gets the values and writes them back to disk with the raster being on either the HDD, the SSD or a blazing fast SSD. Again, no significant difference.
I've already done a fair share of profiling to find bottlenecks, as well as a good amount of time googling for efficient solutions. There are bits of info here and there, but I decided to post this question to get a definitive answer and maybe some pointers, for me and others, on how to manage things efficiently.
So without further ado (and for people who skipped the above), here's my question:
In a setting as described above (high data volume, blockwise processing, reading and writing from/to disk), what's the most efficient (and/or fastest) way to do computation on a long raster time series which involves reading and writing values from/to disk? (especially regarding the read write aspect)
Assuming I have a fast SSD, how can I leverage the speed? Is it done automatically?
What are the influencing factors (filesize, filetype, caching) and the most efficient setting of these factors?
I know that in terms of raster, R works the fastest with .grd, but I would like to avoid this format for flexibility, compatibility and diskspace reasons.
Maybe I also have a misconception of how the raster package interacts with the files on disk. In that case, should I use different functions than getValues and writeValues?
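To make that last point concrete, the block-by-block idiom from the raster documentation that I'm referring to looks roughly like this (the smoothing step is just a placeholder):

library(raster)

process_blockwise <- function(infile, outfile) {
  r   <- raster(infile)
  out <- raster(r)                       # empty raster with the same geometry
  bs  <- blockSize(r)                    # suggested block layout for this raster
  out <- writeStart(out, outfile, overwrite = TRUE)
  for (i in seq_len(bs$n)) {
    v <- getValues(r, row = bs$row[i], nrows = bs$nrows[i])
    v <- v * 1                           # placeholder for the actual smoothing
    out <- writeValues(out, v, bs$row[i])
  }
  writeStop(out)
}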
-- Some system info and example code: --
Os: Win7 x64
CPU: Xeon E5-1650 @ 3.5 GHz
RAM: 128 GB
R-version: 3.2
Raster file format: .rst
Read/write benchmark function:
library(raster)

benchfun <- function(x){
  # x ... path to a raster file
  xr   <- raster(x)                    # open the input raster (values are read lazily)
  x2   <- raster(xr)                   # empty raster with the same extent/resolution
  xval <- getValues(xr)                # read all cell values into memory
  x2   <- setValues(x2, xval)          # put them into the new raster
  writeRaster(x2, 'testras.tif', overwrite = TRUE)   # write everything back to disk
}
If needed I can also provide a little example code for the time series processing, but for now I don't think it's needed.
Appreciate all tips!
Thanks,
Val

Extracting feature vectors from images with TensorFlow: OOM

I have used pretrained network weights that I downloaded from the Caffe model zoo to build a feature extractor (VGG-16) in TensorFlow.
I therefore redefined the architecture of the network in TF, with the imported weights as constants, and added an extra fully connected layer with tf.Variables in order to train a linear SVM by SGD on a hinge-loss cost.
My initial training set is composed of 100,000 32x32x3 images in the form of a numpy array.
I therefore had to resize them to 224x224x3, which is the input size of VGG, but that does not fit into memory.
So I removed unnecessary examples and narrowed it down to 10,000x224x224x3 images, which is awful but still acceptable since only the support vectors matter. Even then, I still get OOM from TF while training.
That should not be the case, since the only important representation is the one from the penultimate layer, of size 4096, which is easily manageable, and the weights to backprop on have size only (4096 + 1 bias).
So what I can do is first transform all my images into features with the TF network made only of constants, to form a 10,000x4096 dataset, and then train a second TensorFlow model on that.
Or recalculate all the features for each batch inside the next_batch method. Or use the panoply of buffers/queue runners that TF provides, but that is a bit scary as I am not really familiar with them.
I do not like these methods; I think there should be something more elegant (without too many queues, if possible).
What would be the most TensorFlow-ic method to deal with this?
If I understand your question correctly, 100K images do not fit in memory at all, while 10K images do, but then the network itself OOMs. That sounds very reasonable: 10K images alone, assuming they are represented using 4 bytes per pixel per channel, occupy 5.6 GiB (or 1.4 GiB if you somehow spend only 1 byte per pixel per channel), so even if the dataset happens to fit in memory, once you add your model, which will occupy a couple more GiB, you will OOM.
Now, there are several ways you can address it:
You should train using minibatches (if you do not already). With a minibatch of size 512 you will load significantly less data onto the GPU. With minibatches you also do not need to load your entire dataset into a numpy array at the beginning. Build your iterator so that it loads 512 images at a time, runs the forward and backward pass (sess.run(train...)), loads the next 512 images, and so on. This way, at no point will you need to have 10K or 100K images in memory simultaneously.
It also seems very wasteful to upscale the images when your originals are so much smaller. What you might consider doing is taking the convolution layers from the VGG net (the dimensions of the conv layers do not depend on the dimensions of the original images) and training the fully connected layers on top of them from scratch. To do that, trim the VGG net after the flatten layer, run it on all the images you have to produce the output of the flatten layer for each image, then train a three-layer fully connected network on those features (this will be relatively fast compared to training the entire conv network), and plug the resulting net in after the flatten layer of the original VGG net. This might also produce better results, because the convolution layers were trained to find features in original-size images, not blurry upscaled ones.
I guess a way to do that with some queues and threads, but not too many, would be to save the training set into TensorFlow's protobuf format (one or several files) using tf.python_io.TFRecordWriter.
Then create a method that reads and decodes a single example from the protobuf, and finally use tf.train.shuffle_batch to feed BATCH_SIZE examples to the optimizer using that method.
This way there are at most capacity (as defined in shuffle_batch) tensors in memory at the same time.
This awesome tutorial from Indico explains it all.

R code failed with: "Error: cannot allocate buffer"

Compiling an RMarkdown script overnight failed with the message:
Error: cannot allocate buffer
Execution halted
The code chunk that it died on was training a caretEnsemble list of 10 machine learning algorithms. I know it takes a fair bit of RAM and computing time, but I did previously succeed in running that same code in the console. Why did it fail in RMarkdown? I'm fairly sure that even if it ran out of free RAM, there was enough swap.
I'm running Ubuntu with 3GB RAM and 4GB swap.
I found a blog article about memory limits in R, but it only applies to Windows: http://www.r-bloggers.com/memory-limit-management-in-r/
Any ideas on solving/avoiding this problem?
One reason why it may be backing up is that knitr and RMarkdown add a layer of computing complexity on top of everything, and they take some memory. The console is the most streamlined implementation.
Also, caret is fat, slow and unapologetic about it. If the machine learning algorithm is complex, the data set is large and you have limited RAM, it can become problematic.
Some things you can do to reduce the burden:
If there are unused variables in the set, use a subset of just the ones you want and then clear the old set from memory using rm(), with the name of the data frame in the parentheses.
After removing variables, run the garbage collector; it reclaims the memory space that your removed variables and interim sets were taking up.
R has no native means of memory purging, so if a function is not written with a garbage collect and you do not do it yourself, all your past executed refuse persists in memory, making life hard.
To do this, just type gc() with nothing in the parentheses. Also clear out memory with gc() between the 10 ML runs. And if you import data with XLConnect, the Java implementation is nastily inefficient... that alone could tap out your memory; gc() every time after using it.
After setting up the training, testing and validation sets, save the testing and validation files in CSV format on the hard drive, REMOVE THEM from memory and run, you guessed it, gc(). Load them again when you need them, after the first model.
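A minimal sketch of that pattern, with hypothetical object and file names:

# hypothetical stand-ins for the real splits
test_set       <- data.frame(y = rnorm(10))
validation_set <- data.frame(y = rnorm(10))

write.csv(test_set,       "test_set.csv",       row.names = FALSE)   # park them on disk
write.csv(validation_set, "validation_set.csv", row.names = FALSE)
rm(test_set, validation_set)            # drop them from the workspace
gc()                                    # reclaim the memory they occupied
# ... train the first model here ...
test_set <- read.csv("test_set.csv")    # reload only when actually needed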
Once you have decided which of the algorithms to run, try installing their original packages separately instead of running caret: require() each by name as you get to it, and clean up after each one with detach(package:packagenamehere) and gc().
There are two reasons for this.
One, caret is a collection of other ML algorithms, and it is inherently slower than all of them in their native environments. An example: I was running a data set through random forest in caret; after 30 minutes I was less than 20% done, and it had already crashed twice at around the one-hour mark. I loaded the original independent package and had a completed analysis in about 4 minutes.
Two, if you require, detach and garbage collect, you have less in resident memory to worry about bogging you down. Otherwise you have ALL of caret's functions in memory at once... that is wasteful.
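A minimal sketch of that require/detach/gc cycle, using randomForest purely as an example:

# run one algorithm at a time from its native package, then unload it
if (require(randomForest)) {                        # load only the package needed right now
  rf_fit <- randomForest(Species ~ ., data = iris)  # small illustrative fit
  detach("package:randomForest", unload = TRUE)     # drop it from the search path when done
  gc()                                              # reclaim memory before the next algorithm
}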
There are some general things you can do to make it go better that you might not initially think of but could be useful. Depending on your code they may or may not work, or work to varying degrees, but try them and see where it gets you.
I. Use lexical scoping to your advantage. Run the whole script in a clean RStudio environment and make sure that all of the pieces and parts are living in your workspace. Then garbage collect the remnants. Then go to knitr & RMarkdown and call the pieces and parts from your existing workspace. It is available to you in Markdown under the same RStudio shell, as long as nothing was created inside a loop without saving it to the global environment.
II. In Markdown, set your code chunks up so that you cache the things that need to be calculated multiple times, so that they live somewhere ready to be called upon instead of taxing memory multiple times.
If you call a variable from a data frame and do something as simple as multiplying each observation in one column and saving it back into the same frame, you could end up with as many as 3 copies in memory. If the file is large, that is a killer. So make a clean copy, garbage collect and cache the pure frame.
Caching intuitively seems like it would waste memory, and done wrong it will, but if you rm() the unnecessary objects from the environment and gc() regularly, you will probably benefit from tactical caching.
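For what it's worth, caching a chunk is just a matter of setting cache=TRUE in the chunk header; a minimal, self-contained sketch (chunk name and contents hypothetical):

```{r cached_example, cache=TRUE}
# this expensive step runs once; later knits reuse the cached result
big_summary <- colMeans(matrix(rnorm(1e7), ncol = 10))
```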
III. If things are still getting bogged down, you can try saving results to CSV files, sending them to the hard drive, and calling them back up as needed, to move them out of memory if you do not need all of the data at one time.
I am pretty certain you can set the program up to load and unload libraries, data and results as needed. But honestly, the best thing you can do, based on my own biased experience, is move away from caret on big multi-algorithm processes.
I was getting this error when I was inadvertently running the 32-bit version of R on my 64-bit machine.

Using qCompress/qUncompress for multi-core

Is it possible to utilize multiple cores for internal Qt functions such as qCompress/qUncompress?
Thank you.
qCompress and qUncompress internally use zlib, and its algorithm is not easily parallelizable. The way others do it, pigz for example, is to cut the data into chunks that are then compressed in parallel. This requires a different file format than what zlib normally expects.
If you have control over your own data, you can split the data into multiple chunks and compress/decompress them in parallel. The number of chunks can be as large as the maximum number of threads envisioned for the decompression, but when you compress/decompress you only run QThread::idealThreadCount() threads in parallel, so each thread may process more than one chunk. The minimum chunk size has to be chosen so as not to hurt the compression ratio too much. You may need to experiment, but I'd guess that chunks below 128 kB make little sense.
