My computer (i5-6500 3.2 GHZ, 8 GB RAM) takes a long time: something like 10 minutes (havent yet measured exactly).
i currently have to
read 400 images. (*.gif format, should all be b&w, resolution of approx. 200*400 px.) (3520 images in total)
i want to "add" all images "cell-wise".
here is how im doing it at the moment: Read image with raster than turn it into matrix, then sum it.
library(rgdal)
library(raster)
library(magrittr)
oldPic <- raster("initalImage.gif") %>% as.matrix
for (pat_IND in currSide) {
newPic <- raster(pat_IND) %>% as.matrix
oldPic <- oldPic + newPic
}
This takes for ever. I used caTools::read.gif() which was even much slower. Do i have a bottle neck in my code? Is there a faster implementation?
Edit: Image Properties
i use "no dither", mono palette (b&w).
Edit2
i want to add the images pixel-wise. Lets take pic A and pic B.
A + B = C. If A(1,1) = 1 and B(1,1) = 1, C(1,1) should be 2. Its a simple matrix addition.
test image:
reading with raster takes 0.03699994 secs
reading with raster + as.matrix takes: 0.201 secs
you need to measure... without any sample image is hard to say and we can only guess. You need to take into account that loading/decoding JPG take time in milliseconds and encoding of GIF can be time consuming even 200 ms. Depends on kind of encoding. To speed up GIF encoding you can:
use single global palette + dithering
GIF is 8 bpp and JPG is 24 bpp so your encoder needs to do the transformation. That is called color quantization and is the most expensive operation while encoding which can take even ~200 ms per frame on average PC machine in well optimized C++ code. for more info see:
Effective gif/image color quantization?
To remedy this you can use single palette dedicated to dithering (like default VGA or use some WEB palette they have the same purpose) and use dithering with is much much faster. See:
simple and fast Dithering
btw if you need to preserve colors take a look at this:
Images lose quality after saving as GIF
So try to find out how to configure your encoder to force dithering instead of color quantization based on K-means or similar ....
limit encoding dictionary to less then 4096
The encoding/decoding is based on creating dictionary and encoding need to search it more than once on per pixel basis. So lovering its size to 1024 gets significant boost to speed. Of coarse you need to access to encoding code to change this unless this can be configured somehow in it... The compression will be decreased by this however and more clear codes will be present in the stream.
use multi-threading
you can fully parallelize this and encode with each core present in your system.
I strongly recommend you to measure how long it take to encode single frame of GIF. If you take advantage on both bullets #1,#2 then I estimate you can get near times around ~5 ms per frame with dithering and ~60 ms per frame with fast quantization. So with 3520 frames it would take around 17.6 or 211.2 seconds just to encode GIF so add the file memory and JPG manipulation and take into account all is heavily guessed/estimated as you did not provide sample data. And divide by number of cores if you use #3 +/- shared disc access waits.
I need to frequently read large objects, so I want the highest possible read rate. I save the objects with save(..., compress = F). Two things have me wondering if R is really saving the objects as binary, uncompressed files:
The bottleneck for both read and write rates is the CPU, running at 100%, not my disk's maximum rates.
The objects appear to be smaller on disk than in memory (e.g. 944,696,761 B on disk, 973,970,512 B in R according to object.size()).
Uncompressed files read at 75-85 MB/s, compared to ~30 MB/s for reading compressed files; this rate is the same from my external hard drive and my SSD, well under the capacity for each.
So my question is: does this make sense? Shouldn't the read/write run at the disk's limit? Is there another option that will?
Details:
The compress = F argument does, as expected, both increase the file size on disk (about threefold), and definitely does improve both write and read speed (20-fold and 2.5-fold, respectively); yet the CPU remains the bottleneck, grinding away at 100% for the duration of the read or write. This feels to me like a light compression.
For the files saved with compress = F, Nautilus's properties window reports the files' type to be "Binary (application/octet-stream)".
The read takes over than twice as long as the write (e.g. 50.5s vs. 22s for 4.3 GB). I don't know why that would be.
I have a new SSD (sequential non-cached read speed (hdparm -t) of at least 265 MB/s).
Hi in a paper about data transfer in opencl i read As the data size that we want to send to the device memory increase the bandwidth will increase, but i dont know why . can some one please explain it to me why bandwidth will increase?
Every time a kernel gets launched or gets transferred to/from the GPU, there is a short delay of several micro-seconds. Historically, this has been larger on AMD GPUs than Nvidia GPUs. Therefore, there's two components to the time it takes to send data: latency + X * Y B/s where X is the number of bytes and Y is the theoretical bandwidth. When X is small, X * Y is not much larger than latency. As X gets large, e.g. multiple megabytes, the latency component of the total time becomes such a tiny fraction of the total time that it becomes insignificant.
I am a complete beginner to gpgpu and opencl. I am unable to answer the following two questions about GPGPU in general,
a) Suppose I have a piece of code suitable to be run on a gpu (executes the exact same set of instructions on multiple data). Assume I already have my data on the gpu. Is there any way to look at the specifications of the cpu and gpu, and estimate the potential speed gains? For example, how can I estimate the speed gains (excluding time taken to transfer data to the gpu) if I ran the piece of code (running exact same set of instructions on multiple data) on AMDs R9 295X2 gpu (http://www.amd.com/en-us/products/graphics/desktop/r9/2...) instead of intel i7-4770K processor (http://ark.intel.com/products/75123)
b) Is there any way to estimate the amount of time it would take to transfer data to the gpu?
Thank you!
Thank you for the responses! Given the large number of factors influencing speed gains, trying and testing is certainly a good idea. However, I do have a question on the GFLOPS approach mentioned some responses; GFLOPS metric was what I was looking at before posting the question.
I would think that GFLOPS would be a good way to estimate potential performance gains for SIMD type operations, given that it takes into account difference in clock speeds, cores, and floating point operations per cycle. However, when I crunch numbers using GFLOPS specifications something does not seem correct.
The Good:
GFLOPS based estimate seems to match the observed speed gains for the toy kernel below. The kernel for input integer "n" computes the sum (1+2+3+...+n) in a brute force way. I feel, the kernel below for large integers has a lot of computation operations. I ran the kernel for all ints from 1000 to 60000 on gpu and cpu (sequentially on cpu, without threading), and measured the timings.
__kernel void calculate(__global int* input,__global int* output){
size_t id=get_global_id(0);
int inp_num=input[id];
int si;
int sum;
sum=0;
for(int i=0;i<=inp_num;++i)
sum+=i;
output[id]=sum;
}
GPU on my laptop:
NVS 5400M (www.nvidia.com/object/nvs_techspecs.html)
GFLOPS, single precision: 253.44 (en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units)
CPU on my Laptop:
intel i7-3720QM, 2.6 GHz
GFLOPS (assuming single precision): 83.2 (download.intel.com/support/processors/corei7/sb/core_i7-3700_m.pdf). Intel document does not specify if it is single or double
CPU Time: 3.295 sec
GPU Time: 0.184 sec
Speed gains per core: 3.295/0.184 ~18
Theoretical Estimate of Speed gains with Using all 4 cores: 18/4 ~ 4.5
Speed Gains based on FLOPS: (GPU FLOPS)/(CPU FLOPS) = (253.44/83.2) = 3.0
For the above example GLOPS based estimate seems to be consistent with those obtained from experimentation, if the intel documentation indeed specifies FLOPS for single and not double precision. I did try to search for more links for flops specification for the intel processor on my laptop. The observed speed gain also seems good, given that I have a modest GPU
The Problem:
The FLOPS based approach seems to give a much lower than expected speed gains, after factoring gpu price, when comparing AMDs R9 295X2 gpu (www.amd.com/en-us/products/graphics/desktop/r9/295x2#) with intels i7-4770K (ark.intel.com/products/75123):
AMDs FLOPS, single precision: 11.5 TFLOPS (from above mentioned link)
Intels FLOPS, single precision: (num. of cores) x (FLOPS per cycle per core) x (clock speed) = (4) x (32 (peak) (www.pcmag.com/article2/0,2817,2419798,00.asp)) x (3.5) = 448 GFLOPS
Speed Gains Based on FLOPS = (11.5 TFLOPS)/(448) ~ 26
AMD GPUs price: $1500
Intel CPUs price: $300
For every AMD R9 295X2 gpu, I can buy 5 intel i7-4770K cpus, which reduces the effective speed gains to (26/5) ~ 5. However, this estimate is not at all consistent with the 100-200x, increase in speed one would expect. The low estimate in speed gains by the GFLOPS approach makes my think that something is incorrect with my analysis, but I am not sure what?
You need to examine the kernel(s). I myself am learning CUDA, so I couldn't tell you exactly what you'd do with OpenCL.
But I would figure out roughly how many floating point operations one single instance of the kernel will perform. Then find the number of floating point operations per second each device can handle.
number of kernels to be launched * (n floating-point operations of kernel / throughput of device (FLOPS)) = time to execute
The number of kernels launched will depend on your data.
A) Normally this question is never answered. Since we are not speaking at 1.05x speed gains. When the problem is suitable, the problem is BIG enough to hide any overheads (100k WI), and the data is already in the GPU, then we are speaking of speeds of 100-300x. Normally nobody cares if it is 250x or 251x.
The estimation is difficult to make, since the platforms are completely different. Not only on clock speeds, but memory latency and caches, as well as bus speeds and processing elements.
I cannot give you a clear answer on this, other than try it and measure.
B) The time to copy the memory is completely dependent on the GPU-CPU bus speed (PCI bus). And that is the HW limit, in practice you will always have less speed than that on copying. Generally you can apply the rule of three to solve the time needed, but there is always a small driver overhead that depends on the platform and device. So, copying 100 bytes is usually very slow, but copying some MB is as fast as the bus speed.
The memory copying speed is usually not a design constrain when creating a GPGPU app. Since it can be hided in many ways (pinned memory, etc..), that nodoby will notice any speed decrease due to memory operations.
You should not make any decisions on whether the problem is suitable or not for GPU, just by looking at the time lost at memory copy. Better measures are, if the problem is suitable, and if you have enough data to make the GPU busy (otherwise it is faster to do it in CPU directly).
Potential speed gain highly depends on algorithm implementation. It's difficult to forecast performance level unless you're developing come very simple applications (like simplest image filter). In some cases, estimations can be done, using memory system performance as basis, as many algorithms are bandwidth-bound.
You can calculate transmission time by dividing data amount on GPU memory bandwidth for Device-internal operations. Look at hardware characteristics to get it, or calculate if you know memory frequency & bus width. For Host-Device operations, PCI-E bus speed is the limit usually.
If code is easy(is what lightweight- cores of gpu need) and is not memory dependent then you can approximate to :
Sample kernel:
Read two 32-bit floats from memory and
do calcs on them for 20-30 times at least.
Then write to memory once.
New: GPU
Old: CPU
Gain ratio = ((New/Old) - 1 ) *100 (%)
New= 5000 cores * 2 ALU-FPU per core * 1.0 GHz frequency = 10000 gflops
Old = 10 cores * 8 ALU-FPU per core * 4.0GHz frequency = 320 gflops
((New/Old) - 1 ) *100 ===> 3000% speed gain.
This is when code uses registers and local memory mostly. Rarely hitting global mem.
If code is hard( heavy branching + fake recursivity + non-uniformity ) only 3-5 times speed gain. it can be equal or less than CPU performance for linear code ofcourse.
When code is memory dependant, it will be 1TB/s(GPU) divided by 40GB/s(CPU).
If each iteration needs to upload data to gpu, there will be pci-e bandwidth bottlenect too.
loads are usually classified into 2 categories
bandwidth bound - more time is spent on fetches from global-memory. Even increasing cpu clock freq doesn't help. problems like sorting. bandwidth capacity is measured using GBPS
compute bound - directly proportional to cpu horse-power. problems like matrix multiplication. compute capacity is measured using GFLOPS
there is a tool clpeak which tries to programmatically measure these
its very important to classify your problem to measure its performance & choose the right device(knowing their limits)
say if you compare intel-HD-4000 & i7-3630(both on same chip) in https://github.com/krrishnarraj/clpeak/tree/master/results/Intel%28R%29_OpenCL
i7 is comparatively better at bandwidth(plus no transfer overheads)
in terms of compute, gpu is 4-5 times faster than i7
I'm running linear regression on a tiff image. Image sizes are;
ncol=6350, nrow=2077, nlayers=26
What I did before running the calculation is just read tiff image in R using
ndvi2000<-raster("img2000.tif")
Then wrote following script in R console window. Calculation process is taking very long time more than 20mins and still running. Is it normal to take long time on big image? The script of the regression is:
time<-sort(sample(97:297, nlayers(ndvi2000)))
t.lm.pred<-function(x) {if (is.na(x[1])) {NA} else{predict(lm(x~time))}}
f.pred<-calc(ndvi2000,t.lm.pred)
The number of values you have is very large, so I'm not in the least surprised that it takes very long. Simply making a list of random numbers the size of your tiff file:
x = runif(6350 * 2077 * 26)
object.size(x) / (1024 * 1024)
2616.216
That is over 2.5 Gb, and that is just to save one variable. A rule of thumb is that you need roughly three times the amount of RAM than your dataset size. So, assuming you load some more images, you'll needs more than 10-20 Gb of RAM. If you don't have enough RAM, your operating system will starting swapping memory to disk, which makes your analysis veeeery slow.
I think it will be good idea to rethink your analysis, either that or rent a 64 Gb RAM EC2 instance. You could only look at the temporal average, or spatial average. Only look at specific locations, etc, etc. Simply brute-force using all values in your data might not be best here.