torch: difference in GPU / CPU responses

I have a simple question. I am trying to understand why there is a large difference between the network responses given by the GPU (CUDA) and the CPU. Here's a minimal example:
require 'torch'
require 'nn'
require 'cunn'
require 'paths'
-- a small convnet
net = nn.Sequential()
net:add(nn.SpatialConvolution(3,16, 3,3))
net:add(nn.SpatialConvolution(16,8, 3,3))
net:add(nn.SpatialConvolution(8,1, 3,3))
-- randomize weights
local w = net:getParameters()
-- random input
x = torch.Tensor(3, 10, 10):uniform(-1,1)
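-- network on gpu (the module itself must be moved to the GPU before it is fed a CudaTensor)
net:cuda()
y = net:forward(x:cuda())
-- network on cpu
y2 = net:clone():double():forward(x)
-- check difference (typically ~10000)
print("Mean Abs. Diff:")
print(torch.abs(y2-y:double()):sum() / y2:nElement())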
Am I doing something wrong here, or is this an expected difference between CPU and GPU computation?

It turns out, even though the mean absolute difference can be large, the percentage difference is quite small (on the order of 1e-5%):
print("Mean Abs. % Diff:")
print(torch.abs(y2-y:double()):cdiv(torch.abs(y2)):sum() / y2:nElement())
Is the mean absolute difference large because CUDA handles floating-point precision differently from the CPU?
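To illustrate why a large absolute difference and a tiny relative difference can coexist, here is a minimal sketch in plain Python rather than Torch. It assumes the GPU path works in single precision while the CPU clone works in double precision, and it emulates single precision by round-tripping values through the struct module. The sample values are hypothetical, not outputs of the network above.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def mean_abs_and_rel_diff(values):
    """Compare each value with its single-precision rounding."""
    abs_diffs = [abs(v - to_float32(v)) for v in values]
    rel_diffs = [d / abs(v) for d, v in zip(abs_diffs, values)]
    n = len(values)
    return sum(abs_diffs) / n, sum(rel_diffs) / n


# Hypothetical "network outputs": one small, one very large in magnitude.
small_abs, small_rel = mean_abs_and_rel_diff([0.1])
large_abs, large_rel = mean_abs_and_rel_diff([123456789.0])

print(small_abs, small_rel)  # tiny absolute error, tiny relative error
print(large_abs, large_rel)  # absolute error of whole units, relative error still tiny
```

The relative error of a single rounding stays below about 6e-8 regardless of magnitude, so the absolute error grows in proportion to the size of the outputs.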


Creating distance matrix in R for a matrix in a higher dimensional space

I have created a Euclidean distance matrix using the dist() function in R.
My R script is below. Because the matrix would have dimensions 16809 x 16809, running the script in R gives this error message:
Error: cannot allocate vector of size 1.1 Gb
So is there any way to get rid of this error?
I haven't used parallelization in R previously. Can it be done using parallelization?
rnd.points = matrix(runif(3 * 16809), ncol = 3)
rnd.points <- rnd.points[1:5,]
ds <- dist(rnd.points)
as.matrix(ds) -> nt
As @Gopola said: dist(.) computes all pairwise distances, and hence needs O(n^2) memory. In fact, dist() is efficient and stores only half of the symmetric n x n matrix.
If I compute dist() on a computer with enough RAM, it works nicely, and indeed creates an object ds of size 1.1 Gb ... which is not so large for today's computers.
rnd.points <- matrix(runif(3 * 16809), ncol = 3)
ds <- dist(rnd.points)
Note however that your
as.matrix(ds) -> nt
is not such a good idea, as the resulting matrix nt is (almost) twice the size of ds, since nt is of course a full n x n matrix.
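The sizes mentioned above can be checked with a little arithmetic. This is a rough sketch in Python that assumes 8 bytes per stored distance (a double) and ignores the small object header. The function names are made up for illustration.

```python
def dist_object_bytes(n, bytes_per_value=8):
    """Bytes for the n*(n-1)/2 pairwise distances kept by a dist-style object."""
    return n * (n - 1) // 2 * bytes_per_value


def full_matrix_bytes(n, bytes_per_value=8):
    """Bytes for a dense n x n matrix of the same values."""
    return n * n * bytes_per_value


n = 16809
gib = 1024 ** 3
half_gib = dist_object_bytes(n) / gib
full_gib = full_matrix_bytes(n) / gib

print(f"dist object: {half_gib:.2f} GiB")
print(f"full matrix: {full_gib:.2f} GiB")
```

For n = 16809 the half-matrix comes to roughly 1.05 GiB, which matches the 1.1 Gb in the error message, and the full matrix is about twice that.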
The O/S has a hard limit on RAM addressing (smaller on a 32-bit system, larger on a 64-bit system).
The O/S also has a design-based limit on the maximum RAM a single process can allocate (and kills the process once it goes beyond that).
I had the same in-RAM constraints in Python and went beyond them.
Sure, at some cost, but it was a worthwhile piece of experience.
Python's numpy has a wonderful feature for this very scenario, seamlessly built in: numpy.memmap. The word seamlessly is intentionally emphasised, as this is of core importance for your problem re-formulation / re-design costs. There are tools available, but it will take your time to master them and to re-design your algorithm (libraries et al.) so that it can use the new tools, guess what, SEAMLESSLY. This is the hidden part of the iceberg.
Handy R tools available:
filebacked.big.matrix (from the bigmemory package), which also supports HPC cluster-wide sharing for distributed processing (thus addressing both the PSPACE and PTIME dimensions of the HPC processing challenge, unless you are unlucky enough to hit the filesystem's file-size ceiling)
ff, which allows
library(ff)
pt_coords <- ff( vmode = "double", dim = c(16809, 3), initdata = 0 )
pt_dists <- ff( vmode = "double", dim = c(16809, 16809), initdata = -1 )
and lets you work with it as simply as in matrix-like [row,column] mode to fill in the points and process their pairwise distances, et al.
See ?ffsave for further details on saving your resulting distance data
and last, but not least
mmap + indexing
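The file-backed idea behind these tools can be sketched with nothing but Python's standard library. This is an illustration of the concept, not the API of bigmemory, ff or the R mmap package. The file layout (row-major doubles), the helper names and the three sample points are all assumptions made for the example.

```python
import math
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize("d")  # bytes per stored double


def write_distance_matrix(points, path):
    """Fill an n x n file-backed matrix one row at a time."""
    n = len(points)
    with open(path, "wb") as f:
        f.truncate(n * n * ITEM)  # reserve the whole file up front
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as buf:
        for i, p in enumerate(points):
            # Only one row of distances lives in RAM at any time.
            row = [math.dist(p, q) for q in points]
            struct.pack_into(f"{n}d", buf, i * n * ITEM, *row)
        buf.flush()


def read_entry(path, n, i, j):
    """Read element [i, j] without loading the whole matrix."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        return struct.unpack_from("d", buf, (i * n + j) * ITEM)[0]


# Tiny hypothetical data set; the real case would have 16809 points.
points = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
path = os.path.join(tempfile.mkdtemp(), "dists.bin")
write_distance_matrix(points, path)
d01 = read_entry(path, len(points), 0, 1)
d20 = read_entry(path, len(points), 2, 0)
print(d01, d20)
```

The operating system pages parts of the file in and out as needed, so the matrix can exceed available RAM as long as the filesystem can hold the file.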
Parallel? No. Distributed? Yes, might help with PTIME:
As noted with filebacked.big.matrix, there is a chance to split the computational PSPACE into smaller segments for distributed processing and so reduce the PTIME. In principle, though, this is just a concurrent (re)use of available resources, not [PARALLEL] system behaviour. It is necessary to admit that a lot of marketing texts (the bad news is that even technology marketing has joined this unfair and knowingly incorrect practice) misuse the words parallel / parallelism in places where merely concurrent system behaviour is observed; there are not many real, true-PARALLEL systems.
Big matrices are doable in R well beyond the in-RAM limits; select the tools most suitable for your problem domain and harness all the HPC resources you can.
With that, Error: cannot allocate vector of size 1.1 Gb is solved.
Nothing but resources imposes limits and delays on our computing-ready tasks, so do not hesitate to make your move while computing resources are still available for your project; otherwise you will find yourself with all the re-engineered software ready, but waiting in a queue for computing resources.

gpgpu: how to estimate speed gains based on gpu and cpu specifications

I am a complete beginner to gpgpu and opencl. I am unable to answer the following two questions about GPGPU in general,
a) Suppose I have a piece of code suitable to be run on a GPU (it executes the exact same set of instructions on multiple data). Assume I already have my data on the GPU. Is there any way to look at the specifications of the CPU and GPU and estimate the potential speed gains? For example, how can I estimate the speed gains (excluding the time taken to transfer data to the GPU) if I ran the piece of code on AMD's R9 295X2 GPU instead of Intel's i7-4770K processor?
b) Is there any way to estimate the amount of time it would take to transfer data to the gpu?
Thank you!
Thank you for the responses! Given the large number of factors influencing speed gains, trying and testing is certainly a good idea. However, I do have a question on the GFLOPS approach mentioned in some responses; the GFLOPS metric was what I was looking at before posting the question.
I would think that GFLOPS would be a good way to estimate potential performance gains for SIMD type operations, given that it takes into account difference in clock speeds, cores, and floating point operations per cycle. However, when I crunch numbers using GFLOPS specifications something does not seem correct.
The Good:
The GFLOPS-based estimate seems to match the observed speed gains for the toy kernel below. For an input integer "n", the kernel computes the sum (1+2+3+...+n) by brute force, so for large integers it performs a lot of arithmetic. I ran the kernel for all ints from 1000 to 60000 on the GPU and on the CPU (sequentially on the CPU, without threading) and measured the timings.
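__kernel void calculate(__global int* input, __global int* output){
    size_t id = get_global_id(0);   // one work-item per input integer
    int inp_num = input[id];
    int sum = 0;
    // brute-force sum 1 + 2 + ... + inp_num
    // (loop body and final store reconstructed from the description above)
    for(int i = 0; i <= inp_num; ++i)
        sum += i;
    output[id] = sum;
}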
GPU on my laptop:
NVS 5400M
GFLOPS, single precision: 253.44
CPU on my laptop:
Intel i7-3720QM, 2.6 GHz
GFLOPS (assuming single precision): 83.2 (the Intel document does not specify whether it is single or double)
CPU Time: 3.295 sec
GPU Time: 0.184 sec
Speed gain over one CPU core: 3.295/0.184 ~ 18
Theoretical estimate of the speed gain if all 4 CPU cores were used: 18/4 ~ 4.5
Speed gain based on FLOPS: (GPU FLOPS)/(CPU FLOPS) = (253.44/83.2) ~ 3.0
For the above example, the GFLOPS-based estimate seems to be consistent with the one obtained from experimentation, if the Intel documentation indeed specifies FLOPS for single and not double precision. I did try to find more sources for the FLOPS specification of the Intel processor in my laptop. The observed speed gain also seems good, given that I have a modest GPU.
The Problem:
The FLOPS-based approach seems to give much lower speed gains than expected, after factoring in the GPU price, when comparing AMD's R9 295X2 GPU with Intel's i7-4770K.
AMD's FLOPS, single precision: 11.5 TFLOPS (from the link mentioned above)
Intel's FLOPS, single precision: (num. of cores) x (FLOPS per cycle per core) x (clock speed in GHz) = (4) x (32 (peak)) x (3.5) = 448 GFLOPS
Speed gain based on FLOPS = (11.5 TFLOPS)/(448 GFLOPS) ~ 26
AMD GPU's price: $1500
Intel CPU's price: $300
For every AMD R9 295X2 GPU, I can buy 5 Intel i7-4770K CPUs, which reduces the effective speed gain to (26/5) ~ 5. However, this estimate is not at all consistent with the 100-200x increase in speed one would expect. The low estimate from the GFLOPS approach makes me think that something is incorrect in my analysis, but I am not sure what.
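The arithmetic in this estimate can be reproduced in a few lines of Python. This is just a sketch. The function name is made up, and the inputs are the peak figures and prices quoted in the question, not measured values.

```python
def peak_gflops(cores, flops_per_cycle_per_core, clock_ghz):
    """Theoretical peak = cores x FLOPs per cycle per core x clock (GHz)."""
    return cores * flops_per_cycle_per_core * clock_ghz


cpu_gflops = peak_gflops(cores=4, flops_per_cycle_per_core=32, clock_ghz=3.5)
gpu_gflops = 11.5 * 1000  # 11.5 TFLOPS, as quoted in the question

raw_gain = gpu_gflops / cpu_gflops
cpus_per_gpu = 1500 // 300  # prices quoted in the question
price_adjusted_gain = raw_gain / cpus_per_gpu

print(f"CPU peak: {cpu_gflops:.0f} GFLOPS")
print(f"raw gain: {raw_gain:.1f}x, price-adjusted: {price_adjusted_gain:.1f}x")
```

Both figures are ratios of theoretical peaks, so they say nothing about how close either device gets to its peak on a given kernel.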
You need to examine the kernel(s). I myself am learning CUDA, so I couldn't tell you exactly what you'd do with OpenCL.
But I would figure out roughly how many floating point operations one single instance of the kernel will perform. Then find the number of floating point operations per second each device can handle.
number of kernels to be launched * (n floating-point operations of kernel / throughput of device (FLOPS)) = time to execute
The number of kernels launched will depend on your data.
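As a rough sketch of the formula above, assuming the device actually sustains its quoted throughput (which it rarely does), the estimate could be written as follows. The workload numbers are hypothetical.

```python
def estimated_seconds(num_instances, flops_per_instance, device_flops):
    """instances x (FLOPs per instance / device throughput in FLOP/s)."""
    return num_instances * flops_per_instance / device_flops


# Hypothetical workload: one million kernel instances of 2000 FLOPs each,
# on a device rated at 253.44 GFLOPS.
t = estimated_seconds(1_000_000, 2_000, 253.44e9)
print(f"{t * 1e3:.2f} ms")
```

This gives a lower bound on the run time, because it ignores memory traffic, branching and launch overhead.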
A) Normally this question is never answered, since we are not talking about 1.05x speed gains. When the problem is suitable, is BIG enough to hide any overheads (100k work-items), and the data is already on the GPU, then we are talking about speedups of 100-300x. Normally nobody cares whether it is 250x or 251x.
The estimation is difficult to make, since the platforms are completely different: not only in clock speeds, but also in memory latency and caches, as well as bus speeds and processing elements.
I cannot give you a clear answer on this, other than try it and measure.
B) The time to copy the memory depends entirely on the GPU-CPU bus speed (PCI bus). That is the HW limit; in practice copies will always be somewhat slower than that. Generally you can apply the rule of three to work out the time needed, but there is always a small driver overhead that depends on the platform and device. So copying 100 bytes is usually very slow relative to its size, but copying a few MB runs at close to the bus speed.
The memory copying speed is usually not a design constraint when creating a GPGPU app, since it can be hidden in many ways (pinned memory, etc.) so that nobody will notice any slowdown due to memory operations.
You should not decide whether the problem is suitable for the GPU just by looking at the time lost to memory copies. Better criteria are whether the problem itself is suitable, and whether you have enough data to keep the GPU busy (otherwise it is faster to do it on the CPU directly).
The potential speed gain depends heavily on the algorithm implementation. It's difficult to forecast the performance level unless you're developing some very simple application (like the simplest image filter). In some cases, estimates can be made using memory system performance as the basis, as many algorithms are bandwidth-bound.
For device-internal operations, you can calculate the transfer time by dividing the amount of data by the GPU memory bandwidth. Look at the hardware characteristics to get it, or calculate it if you know the memory frequency and bus width. For host-device operations, the PCI-E bus speed is usually the limit.
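A minimal sketch of that estimate follows, with a fixed per-transfer overhead added to show why tiny copies are inefficient. The 8 GB/s bus figure and the 10 microsecond overhead are placeholder assumptions, not specifications of any particular device.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s, fixed_overhead_s=0.0):
    """Time = fixed driver/launch overhead + bytes / bandwidth."""
    return fixed_overhead_s + num_bytes / bandwidth_bytes_per_s


BUS_BYTES_PER_S = 8e9  # hypothetical host-device bandwidth
OVERHEAD_S = 10e-6     # hypothetical fixed cost per transfer

small_bytes = 100
large_bytes = 64 * 1024 * 1024

small_t = transfer_seconds(small_bytes, BUS_BYTES_PER_S, OVERHEAD_S)
large_t = transfer_seconds(large_bytes, BUS_BYTES_PER_S, OVERHEAD_S)

# Fraction of the bus bandwidth each transfer actually achieves.
small_efficiency = small_bytes / small_t / BUS_BYTES_PER_S
large_efficiency = large_bytes / large_t / BUS_BYTES_PER_S
print(f"{small_efficiency:.2%} vs {large_efficiency:.2%}")
```

Under these assumptions the 100-byte copy uses a fraction of a percent of the bus, while the 64 MB copy runs at almost the full rate.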
If the code is easy (which is what the lightweight cores of a GPU need) and is not memory dependent, then you can approximate as follows:
Sample kernel:
Read two 32-bit floats from memory and
do calcs on them for 20-30 times at least.
Then write to memory once.
New: GPU
Old: CPU
Gain ratio = ((New/Old) - 1 ) *100 (%)
New = 5000 cores * 2 ALU-FPU per core * 1.0 GHz frequency = 10000 GFLOPS
Old = 10 cores * 8 ALU-FPU per core * 4.0 GHz frequency = 320 GFLOPS
((New/Old) - 1) * 100 = (31.25 - 1) * 100 = 3025%, i.e. roughly a 3000% speed gain.
This is when the code mostly uses registers and local memory, rarely hitting global memory.
If the code is hard (heavy branching + fake recursion + non-uniformity), expect only a 3-5x speed gain. It can of course be equal to or lower than CPU performance for linear code.
When the code is memory dependent, the ratio will be 1 TB/s (GPU) divided by 40 GB/s (CPU), i.e. about 25x.
If each iteration needs to upload data to the GPU, there will be a PCI-E bandwidth bottleneck too.
Loads are usually classified into 2 categories:
bandwidth bound - more time is spent on fetches from global memory. Even increasing the CPU clock frequency doesn't help. Problems like sorting. Bandwidth capacity is measured in GB/s.
compute bound - performance is directly proportional to CPU horsepower. Problems like matrix multiplication. Compute capacity is measured in GFLOPS.
There is a tool, clpeak, which tries to measure these programmatically.
It's very important to classify your problem in order to measure its performance and choose the right device (knowing their limits).
Say you compare the Intel HD 4000 and the i7-3630 (both on the same chip):
the i7 is comparatively better at bandwidth (plus no transfer overheads)
in terms of compute, the GPU is 4-5 times faster than the i7
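One way to make the classification concrete is a roofline-style comparison of the time a kernel would need at peak compute against the time it would need at peak bandwidth. The sketch below is a simplification. It assumes every byte is moved exactly once (perfect caching), and the 320 GFLOPS and 40 GB/s device figures are illustrative numbers, not measurements.

```python
def classify(flops, bytes_moved, peak_gflops, peak_gb_per_s):
    """Return the limiting resource and the resulting time lower bound."""
    compute_s = flops / (peak_gflops * 1e9)
    memory_s = bytes_moved / (peak_gb_per_s * 1e9)
    label = "compute bound" if compute_s > memory_s else "bandwidth bound"
    return label, max(compute_s, memory_s)


n = 1024
# n x n single-precision matrix multiply: ~2*n^3 FLOPs, three matrices moved.
matmul = classify(2 * n ** 3, 3 * n * n * 4, peak_gflops=320, peak_gb_per_s=40)
# Vector add of n floats: n FLOPs, two reads and one write per element.
vector_add = classify(n, 3 * n * 4, peak_gflops=320, peak_gb_per_s=40)

print(matmul[0], vector_add[0])
```

Measured peaks, for example from clpeak, would be better inputs than datasheet values.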

How to calculate peak FLOPS in GPGPU hardware?

I want to calculate the theoretical peak performance of graphics hardware. Well, actually I want to understand the calculation.
Example with an AMD Radeon HD 6670:
The AMD Accelerated Parallel Processing Programming Guide tells me in the middle of page 6-42 to take the number of Stream Cores (96), multiply it by the number of operations per cycle for each Stream Core (let's take an ADD with Single Precision, which would be 5) and multiply that by the core clock (800 MHz). That results in:
96 * 5 FLOPS * 800MHz = 384,000 MFLOPS = 384 GFLOPS
The very same document tells me on page D-4 that this particular device has a peak throughput of 768 GFLOPS, which is twice of what I just calculated. Wikipedia and the AMD homepage state the same.
So my question is: Where am I missing the factor of two?
I am not sure about AMD hardware, but I remember that NVIDIA counted MAD (multiply-add) operation as two flops. Since MADs are performed in one cycle, the theoretical performance was multiplied by two.
480 processing elements (96 stream cores * 5) * 2 operations per cycle (single addition pipeline + single multiplication pipeline per element) * 800 MHz = 768 GFLOPS
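Both figures from this thread can be reproduced with the same formula; the only difference is whether each processing element is credited with one or two floating-point operations per cycle. This is a sketch with a made-up function name, working in MFLOPS so that the arithmetic stays in integers.

```python
def peak_mflops(stream_cores, elements_per_core, flops_per_element_per_cycle,
                clock_mhz):
    """Theoretical peak in MFLOPS for the given counting convention."""
    return (stream_cores * elements_per_core
            * flops_per_element_per_cycle * clock_mhz)


# Counting one ADD per processing element per cycle (the question's calculation).
add_only = peak_mflops(96, 5, 1, 800)
# Counting a multiply-add as two operations per cycle (the datasheet figure).
with_mad = peak_mflops(96, 5, 2, 800)

print(add_only // 1000, "GFLOPS vs", with_mad // 1000, "GFLOPS")
```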
When the code has too many levels of branching, throughput drops to 1-4 shaders per compute unit, which means 6-24 of them in total. This translates to as low as 10-40 GFLOPS, which is even slower than a single CPU core.

openCL behavior --- need clarification

I am using the following parameters for my simulation on Geforce GT 220 card -
number of compute units = 6
local size = 32
global size = 32*6*256 = 49152
(everything is one dimensional)
But in the Visual Profiler, I see that Number of work groups per Compute Unit = 768, which means it is utilizing only 2 compute units. Why is that? How can I make sure all the compute units are busy? Ideally, I would expect 49152/(32*6) = 256 work groups per compute unit. I am confused by this behavior.
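The numbers in the question can be laid out explicitly. This is a small sketch of the arithmetic only. It does not model how the OpenCL runtime actually schedules work-groups.

```python
def work_group_stats(global_size, local_size, compute_units):
    """Total work-groups and the even share per compute unit."""
    groups = global_size // local_size
    return groups, groups / compute_units


groups, expected_per_cu = work_group_stats(32 * 6 * 256, 32, 6)
# Compute units implied by the profiler's 768 work-groups per unit.
implied_units = groups / 768

print(groups, expected_per_cu, implied_units)
```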
You should not care about compute units; that is only HW specific.
Just care about the local size and global size, and try to use the largest local size you can.
What is probably happening is that you specify a very small local size. Every group of local-size threads is loaded onto a compute unit, and it is not efficient to run only 32 threads. So the load thrashing slows performance, and probably leaves the compute units idle a lot of the time.
My recommendation: use a very high local size, or DO NOT specify a local size (OpenCL will select the highest one possible).

Hardware Cache Formulas (Parameter)

The image below was scanned (poorly) from Computer Systems: A Programmer's Perspective. (I apologize to the publisher). This appears on page 489.
Figure 6.26: Summary of cache parameters
I'm having a terribly difficult time understanding some of these calculations. At the moment, what is troubling me is the calculation for M, which is supposed to be the number of unique addresses: "Maximum number of unique memory addresses." What is 2^m supposed to mean? I think m is calculated as log2(M). This seems circular....
For the sake of this post, assume the following in the event you want to draw up an example: 512 sets, 8 blocks per set, 32 words per block, 8 bits per word
Update: All of the answers posted thus far have been helpful, but I still think I'm missing something. cwrea's answer provides the biggest bridge for my understanding. I feel like the answer is on the tip of my mental tongue. I know it is there but I can't identify it.
Why does M = 2^m but then m = log2(M)?
Perhaps the detail I'm missing is that for a 32-bit machine, we'd assume M = 2^32. Does this single fact allow me to solve for m? m = log2(2^32)? But then this gets me back to 32... I have to be missing something...
m & M are related to each other, not defined in terms of each other. They call M a derived quantity however since usually the processor/controller is the limiting factor in terms of the word length it uses.
On a real system they are predefined. If you have an 8-bit processor, it generally can handle 8-bit memory addresses (m = 8). Since you can represent 256 values with 8 bits, you can have a total of 256 memory addresses (M = 2^8 = 256). As you can see, we start with the little m due to the processor constraints, but you could always decide you want a memory space of size M, and use that to select a processor that can handle it based on word-size = log2(M).
Now if we take your assumptions for your example,
512 sets, 8 blocks per set, 32 words per block, 8 bits per word
I have to assume this is an 8-bit processor given the 8-bit words. At that point your described cache is larger than your address space (256 words) & therefore pretty meaningless.
You might want to check out Computer Architecture Animations & Java applets. I don't recall if any of the cache ones go into the cache structure (usually they focus on behavior), but it is a resource I saved in the past for tutoring students in architecture.
Feel free to further refine your question if it still doesn't make sense.
The two equations for M are just a relationship. They are two ways of saying the same thing. They do not indicate causality, though. I think the assumption made by the author is that the number of unique address bits is defined by the CPU designer at the start via requirements. Then the M can vary per implementation.
m is the width in bits of a memory address in your system, e.g. 32 for x86, 64 for x86-64. Block size on x86, for example, is 4K, so b=12. Block size more or less refers to the smallest chunk of data you can read from durable storage -- you read it into memory, work on that copy, then write it back at some later time. I believe tag bits are the upper t bits that are used to look up data cached locally very close to the CPU (not even in RAM). I'm not sure about the set lines part, although I can make plausible guesses that wouldn't be especially reliable.
Circular ... yes, but I think it's just stating that the two variables m and M must obey the equation. M would likely be a given or assumed quantity.
Example 1: If you wanted to use the formulas for a main memory size of M = 4GB (4,294,967,296 bytes), then m would be 32, since M = 2^32, i.e. m = log2(M). That is, it would take 32 bits to address the entire main memory.
Example 2: If your main memory size assumed were smaller, e.g. M = 16MB (16,777,216 bytes), then m would be 24, which is log2(16,777,216).
It seems you're confused by the math rather than the architectural stuff.
2^m ("2 to the m'th power") is 2 * 2... with m 2's. 2^1 = 2, 2^2 = 2 * 2 = 4, 2^3 = 2 * 2 * 2 = 8, and so on. Notably, if you have an m bit binary number, you can only represent 2^m different numbers. (is this obvious? If not, it might help to replace the 2's with 10's and think about decimal digits)
log2(x) ("logarithm base 2 of x") is the inverse function of 2^x. That is, log2(2^x) = x for all x. (This is a definition!)
You need log2(M) bits to represent M different numbers.
Note that if you start with M=2^m and take log2 of both sides, you get log2(M)=m. The table is just being very explicit.
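To see the relationships with concrete numbers, here is a small Python sketch using the example from the question (512 sets, 8 blocks per set, 32 one-byte words per block). It assumes a 32-bit address (m = 32), which the question only muses about, and it uses the textbook's usual symbol names S, E, B, s, b, t and C for the quantities not spelled out in this thread.

```python
def cache_parameters(S, E, B, m):
    """Derive the dependent cache quantities from S sets, E lines per set,
    B bytes per block and m address bits."""
    s = S.bit_length() - 1  # log2(S), exact when S is a power of two
    b = B.bit_length() - 1  # log2(B), exact when B is a power of two
    if 2 ** s != S or 2 ** b != B:
        raise ValueError("S and B must be powers of two")
    return {
        "M": 2 ** m,        # number of unique addresses
        "s": s,             # set index bits
        "b": b,             # block offset bits
        "t": m - (s + b),   # tag bits
        "C": S * E * B,     # cache capacity in bytes
    }


params = cache_parameters(S=512, E=8, B=32, m=32)
# Going back from M to m recovers the same number: m = log2(M).
m_from_M = params["M"].bit_length() - 1
print(params, m_from_M)
```

Starting from m gives M, and starting from M gives back m, which is all the two formulas in the table are saying.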
