I'm working on a project in which I have a matrix of distances between nodes that I import to cplex. I do it like this:
tuple arc{
float x;
float y;
float d;
float Ttime; //Time to travel the arc
}
tuple vehicle{
key int id;
int STdepot; //Starting Depot (1 or 2)
int MaxCars; //Maximum number of cars in a vehicle
float AvSpeed; //Average Speed of a vehicle
}
tuple cavities{
key int id;
float x;
float y;
float rate; //Consumption Rate
float iniStock; //Initial Stock to be consumed at cavity x
float deadline; //Deadline to arrive at cavity x
int ProdCons; //Production Consumed at cavity x
}
tuple CAVtype{
key int id;
int CarsCons; //Consuming cars of 12 or 20
}
tuple nodes{
key int id;
float x; //Coordinates in X
float y; //Coordinates in Y
string type;
}
setof(arc) OD = ...; //DistanceMatrix
setof(vehicle) K=...; //Vehicles
setof(cavities) C=...; //Cavities
setof(CAVtype) T=...; // Cavities Type
setof(nodes) N=...; //Nodes
float d[N][N];
float t[N][N];
execute preProcess{
cplex.tilim=300;
for(var i in N){
for(var j in N){
d[i][j] = 9999;
t[i][j] = 9999;
}
}
for(var arc in OD){
var origin = N.get(arc.x);
var destination = N.get(arc.y);
d[origin][destination] = arc.d;
t[origin][destination] = arc.Ttime;
}
}
It imports everything, but when I add the constraints, the distance matrix is not respected: the solution shows connections between nodes that have no arc between them. Also, the last constraint changes the value of q. Why does this happen, and how can I solve it?
Thanks in advance.
The objective function and the constraints are the following:
dexpr float MachineStoppage = sum(k in K,i in N,j in N) d[i][j] * x[i][j][k] +
sum(g in C,k in K) penalize *phi[g] + sum(i in N,g in C) u[i][g]; //(1)
minimize MachineStoppage;
//*******************************|Restrictions|***********************************************************
subject to{
forall (i in C, k in K) //(2)
FlowConservation:
sum(j in N: i.id!=j.id) x[<i.id>][j][k] == z[<i.id>][k];
forall (i in C, k in K) //(3)
FlowConservation2:
sum(j in N: i.id!=j.id) x[j][<i.id>][k] == z[<i.id>][k];
forall(i in N, k in K: i.type == "d" && k.STdepot!= i.id) //(5)
DepartingFromAnyDepot:
sum(j in N: i.id!=j.id) x[i][j][k] == 0;
forall(i in N)
sum(k in K) z[i][k]==1;
forall(i in N,j in N,k in K: i!=j && j.id!=0) //(8)
ArrivalTimeTracking1:
w[k][i] + t[i][j] <= w[k][j] + M*(1-x[i][j][k]);
forall(i in N,j in N,k in K: i!=j && j.id!=0) //(9)
ArrivalTimeTracking2:
w[k][i] + t[i][j] >= w[k][j]- M*(1-x[i][j][k]);
forall(k in K, g in C, i in N) //(10)
ReplenishmentDelay:
//w[k][<g.id>] <= g.deadline + phi[g];
w[k][<g.id>] <= g.deadline + phi[g];
forall(i in N, g in C, k in K) //(11)
QuantitiesToBeDeliveredToTheCavities:
q[k][g] == ((g.rate*w[k][<g.id>]) + u[i][g] + (g.ProdCons-g.iniStock));
forall(i in N,g in C,k in K) //(12)
LimitofQuantitiesToBeDelivered:
q[k][g] >= z[i][k] * g.ProdCons;
//q[k][g] >= z[<i.id>][k] * g.ProdCons;
forall(h in T, k in K) //(13)
NumberOfCarsOfEachTypeinEachVehicle:
sum(i in N,g in C) q[k][g] <= h.CarsCons*y[k][h];
/*
forall(k in K, g in C) //(14)
MaximumOfCarsinaVehicle:
sum(h in T) y[k][h] <=b;
*/
Are you sure you are not getting a relaxed solution? In the documentation, under
IDE and OPL > CPLEX Studio IDE > IDE Tutorials
have a look at the section "Relaxing infeasible models".
I'm an amateur playing with discrete math. This isn't a
homework problem though I am doing it at home.
I want to solve ax + by = c for natural numbers, with a, b and c
given and x and y to be computed. I want to find all x, y pairs
that will satisfy the equation.
This has a similar structure to Bezout's identity for integers
where there are multiple (infinite?) solution pairs. I thought
the similarity might mean that the extended Euclidean algorithm
could help here. Below are two implementations of the EEA that
seem to work; they're both adapted from code found on the net.
Could these be adapted to the task, or perhaps can someone
find a more promising avenue?
typedef long int Int;
#ifdef RECURSIVE_EEA
Int // returns the GCD of a and b and finds x and y
// such that ax + by == GCD(a,b), recursively
eea(Int a, Int b, Int &x, Int &y) {
if (0==a) {
x = 0;
y = 1;
return b;
}
Int x1; x1=0;
Int y1; y1=0;
Int gcd = eea(b%a, a, x1, y1);
x = y1 - b/a*x1;
y = x1;
return gcd;
}
#endif
#ifdef ITERATIVE_EEA
Int // returns the GCD of a and b and finds x and y
// such that ax + by == GCD(a,b), iteratively
eea(Int a, Int b, Int &x, Int &y) {
x = 0;
y = 1;
Int u; u=1;
Int v; v=0; // does this need initialising?
Int q; // quotient
Int r; // remainder
Int m;
Int n;
while (0!=a) {
q = b/a; // quotient
r = b%a; // remainder
m = x - u*q; // ?? what are the invariants?
n = y - v*q; // ?? When does this overflow?
b = a; // A candidate for the gcd - a's last nonzero value.
a = r; // a becomes the remainder - it shrinks each time.
// When a hits zero, the u and v that are written out
// are final values and the gcd is a's previous value.
x = u; // Here we have u and v shuffling values out
y = v; // via x and y. If a has gone to zero, they're final.
u = m; // ... and getting new values
v = n; // from m and n
}
return b;
}
#endif
If we slightly change the equation form:
ax + by = c
by = c - ax
y = (c - ax)/b
Then we can loop x through every value in its range (a*x <= c) and check whether a natural y exists for it. So no, the number of solutions is not infinite: it is bounded by min(c/a, c/b). Here is a small C++ example of this naive approach:
int a=123,b=321,c=987654321;
int x,y,ax;
for (x=1,ax=a;ax<=c;x++,ax+=a)
{
y = (c-ax)/b;
if (ax+(b*y)==c) { /* output the solution (x, y) here */ }
}
If you want to speed this up, iterate y as well and just check whether c-ax equals the current b*y. Something like this:
int a=123,b=321,c=987654321;
int x,y,ax,cax,by;
for (x=1,ax=a,y=(c/b),by=b*y;ax<=c;x++,ax+=a)
{
cax=c-ax;
while (by>cax){ by-=b; y--; if (!y) break; }
if (by==cax) { /* output the solution (x, y) here */ }
}
As you can see, x and y are now iterated in opposite directions in the same loop, and there is no division or multiplication inside the loop any more, so it is much faster. Here are the first few results:
method1 method2
[ 78.707 ms] | [ 21.277 ms] // time needed for computation
75044 | 75044 // found solutions
-------------------------------
75,3076776 | 75,3076776 // first few solutions in x,y order
182,3076735 | 182,3076735
289,3076694 | 289,3076694
396,3076653 | 396,3076653
503,3076612 | 503,3076612
610,3076571 | 610,3076571
717,3076530 | 717,3076530
824,3076489 | 824,3076489
931,3076448 | 931,3076448
1038,3076407 | 1038,3076407
1145,3076366 | 1145,3076366
I expect that for really huge c and small a, b the loop
while (by>cax){ by-=b; y--; if (!y) break; }
might be slower than an actual division or a GCD-based approach.
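To connect this back to the extended Euclidean algorithm from the question, here is a minimal Python sketch of the GCD-based approach. It assumes a, b and c are positive integers and that "natural" means x >= 1 and y >= 1, as in the loops above. The function names are my own, not from the original posts.

```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


def natural_solutions(a, b, c):
    """Yield every (x, y) with x >= 1, y >= 1 and a*x + b*y == c."""
    g, x0, _ = extended_gcd(a, b)
    if c % g:
        return  # no integer solutions at all
    step_x, step_y = b // g, a // g
    # Smallest positive x that solves the equation modulo b.
    x = (x0 * (c // g)) % step_x
    if x == 0:
        x = step_x
    y = (c - a * x) // b
    # Consecutive solutions differ by (+b/g, -a/g).
    while y >= 1:
        yield x, y
        x += step_x
        y -= step_y
```

For a=123, b=321, c=987654321 this walks the same solutions as the table above, but it only visits values of x that actually solve the equation instead of testing every x.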
I need to do it with recursion, but the problem is that the function takes only ONE parameter, while its definition depends on two (k and n). Also, how do I find the minimum if the function returns only one value?
The function is:
I've already tried choosing k at random, but I don't think that is a good idea.
F1(int n) {
Random random = new Random();
int k = random.Next(1,10);
if (1 <= k && k <= n){
return Math.Min(F1(k - 1) + F1(n - k) + n);
} else {
return 0;
}
}
You need a loop that traverses all k values in the range 1..n. Something like this:
int F1(int n) {
    if (n == 0)
        return ????; // what is the starting value?
    int minn = F1(0) + F1(n - 1) + n;
    for (int k = 2; k <= n; k++)
        minn = Math.Min(minn, F1(k - 1) + F1(n - k) + n);
    return minn;
}
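The plain recursion above recomputes the same F1 values many times. As a rough Python sketch, the same loop can be combined with a cache. Note that the base value F1(0) = 0 is purely my assumption, since the question does not show it. Replace it with the real starting value.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def f1(n):
    """min over k in 1..n of f1(k - 1) + f1(n - k) + n."""
    if n == 0:
        return 0  # ASSUMED base value, not given in the question
    return min(f1(k - 1) + f1(n - k) + n for k in range(1, n + 1))
```

With the cache, each f1(n) is evaluated only once, so the total work is quadratic in n instead of exponential.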
I want to compute a sum of products like this:
n*(n-1)+n*(n-1)*(n-2)+n*(n-1)*(n-2)*(n-3)+n*(n-1)*(n-2)*(n-3)*(n-4)+...+n(n-1)...(n-n)
For example, for n=5 the sum equals 320.
I have a function which computes one element:
int fac(int n, int s)
{
if (n > s)
return n*fac(n - 1, s);
return 1;
}
Recomputing the factorial for each summand is quite wasteful. Instead, I'd suggest using memoization. If you reorder
n*(n-1) + n*(n-1)*(n-2) + n*(n-1)*(n-2)*(n-3) + n*(n-1)*(n-2)*(n-3)*...*1
you get
n*(n-1)*(n-2)*(n-3)*...*1 + n*(n-1)*(n-2)*(n-3) + n*(n-1)*(n-2) + n*(n-1)
Notice how you start with the product of 1..n, then you add the product of 1..n divided by 1, then you add the product divided by 1*2 etc.
I think a much more efficient definition of your function is (in Python):
from math import prod

def f(n):
    p = prod(range(1, n + 1))  # n!
    sum_ = p
    for i in range(1, n - 1):
        p //= i  # integer division keeps the result exact
        sum_ += p
    return sum_
A recursive version of this definition is:
from math import prod

def f(n):
    def go(p, i):
        # p is the current summand; stop once n*(n-1) has been reached
        if i >= n - 1:
            return p
        return p + go(p // i, i + 1)
    return go(prod(range(1, n + 1)), 1)
Last but not least, you can also define the function without any explicit recursion by using reduce to generate the list of summands (this is a more 'functional' -- as in functional programming -- style):
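from functools import reduce
from math import prod

def f(n):
    # acc is the pair (summands so far, current product)
    summands, _ = reduce(lambda acc, i: (acc[0] + [acc[1]], acc[1] // i),
                         range(1, n),
                         ([], prod(range(1, n + 1))))
    return sum(summands)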
This style is very concise in functional programming languages such as Haskell; Haskell has a function called scanl which simplifies generating the summands so that the definition is just:
f n = sum $ scanl (/) (product [1..n]) [1..(n-2)]
Something like this?
function fac(int n, int s)
{
if (n >= s)
return n * fac(n - 1, s);
return 1;
}
int sum = 0;
int s = 4;
n = 5;
while(s > 0)
{
sum += fac(n, s);
s--;
}
print sum; //320
Loop-free version:
int fac(int n, int s)
{
if (n >= s)
return n * fac(n - 1, s);
return 1;
}
int compute(int n, int s, int sum = 0)
{
if(s > 0)
return compute(n, s - 1, sum + fac(n, s));
return sum;
}
print compute(5, 4); //320
OK, there is not much to write. I would suggest two methods if you want to solve this recursively. (Because of the recursive factorial, the complexity is a mess and the runtime will increase drastically for big numbers!)
int func(int n){
return func(n, 2);
}
int func(int n, int i){
    if (i <= n){
        // one summand n*(n-1)*...*(n-i+1), plus the remaining summands
        return n*fac(n-1, n-i) + func(n, i + 1);
    }else return 0;
}
int fac(int i,int a){
if(i>a){
return i*fac(i-1, a);
}else return 1;
}
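For comparison, the repeated factorial can be avoided altogether by keeping a running product. This is only a small Python sketch of that idea, not a translation of the code above, and the function name is made up for illustration.

```python
def falling_sum(n):
    """n*(n-1) + n*(n-1)*(n-2) + ... + n*(n-1)*...*1"""
    total = 0
    term = n
    for factor in range(n - 1, 0, -1):
        term *= factor   # extend the previous product by one factor
        total += term
    return total
```

Each summand is obtained from the previous one with a single multiplication, so the whole sum needs n - 1 multiplications.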
As we all know, the simplest algorithm to generate Fibonacci sequence is as follows:
if(n<=0) return 0;
else if(n==1) return 1;
f(n) = f(n-1) + f(n-2);
But this algorithm has some repetitive calculation. For example, if you calculate f(5), it will calculate f(4) and f(3). When you calculate f(4), it will again calculate both f(3) and f(2). Could someone give me a more time-efficient recursive algorithm?
I have read about several methods for calculating Fibonacci numbers with good time complexity. Here are some of them:
Method 1 - Dynamic Programming
The substructure here is well known, so I'll jump straight to the solution:
static int fib(int n)
{
int f[] = new int[n+2]; // 1 extra to handle case, n = 0
int i;
f[0] = 0;
f[1] = 1;
for (i = 2; i <= n; i++)
{
f[i] = f[i-1] + f[i-2];
}
return f[n];
}
A space-optimized version of above can be done as follows -
static int fib(int n)
{
int a = 0, b = 1, c;
if (n == 0)
return a;
for (int i = 2; i <= n; i++)
{
c = a + b;
a = b;
b = c;
}
return b;
}
Method 2- ( Using power of the matrix {{1,1},{1,0}} )
This is an O(n) method. It relies on the fact that if we multiply the matrix M = {{1,1},{1,0}} by itself n times (in other words, compute power(M, n)), we get the (n+1)th Fibonacci number as the element at row and column (0, 0) of the resulting matrix.
The matrix representation gives the following closed expression for the Fibonacci numbers:
$\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n = \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix}$
static int fib(int n)
{
int F[][] = new int[][]{{1,1},{1,0}};
if (n == 0)
return 0;
power(F, n-1);
return F[0][0];
}
/*multiplies 2 matrices F and M of size 2*2, and
puts the multiplication result back to F[][] */
static void multiply(int F[][], int M[][])
{
int x = F[0][0]*M[0][0] + F[0][1]*M[1][0];
int y = F[0][0]*M[0][1] + F[0][1]*M[1][1];
int z = F[1][0]*M[0][0] + F[1][1]*M[1][0];
int w = F[1][0]*M[0][1] + F[1][1]*M[1][1];
F[0][0] = x;
F[0][1] = y;
F[1][0] = z;
F[1][1] = w;
}
/*function that calculates F[][] raise to the power n and puts the
result in F[][]*/
static void power(int F[][], int n)
{
int i;
int M[][] = new int[][]{{1,1},{1,0}};
// multiply F by M = {{1,1},{1,0}} n - 1 times
for (i = 2; i <= n; i++)
multiply(F, M);
}
This can be optimized to run in O(log n) time. We can use recursive multiplication (repeated squaring) to compute power(M, n) from the previous method.
static int fib(int n)
{
int F[][] = new int[][]{{1,1},{1,0}};
if (n == 0)
return 0;
power(F, n-1);
return F[0][0];
}
static void multiply(int F[][], int M[][])
{
int x = F[0][0]*M[0][0] + F[0][1]*M[1][0];
int y = F[0][0]*M[0][1] + F[0][1]*M[1][1];
int z = F[1][0]*M[0][0] + F[1][1]*M[1][0];
int w = F[1][0]*M[0][1] + F[1][1]*M[1][1];
F[0][0] = x;
F[0][1] = y;
F[1][0] = z;
F[1][1] = w;
}
static void power(int F[][], int n)
{
if( n == 0 || n == 1)
return;
int M[][] = new int[][]{{1,1},{1,0}};
power(F, n/2);
multiply(F, F);
if (n%2 != 0)
multiply(F, M);
}
Method 3 (O(log n) Time)
Below is one more interesting recurrence formula that can be used to find nth Fibonacci Number in O(log n) time.
If n is even then k = n/2:
F(n) = [2*F(k-1) + F(k)]*F(k)
If n is odd then k = (n + 1)/2
F(n) = F(k)*F(k) + F(k-1)*F(k-1)
How does this formula work?
The formula can be derived from the above matrix equation.
$\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n = \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix}$
Taking the determinant on both sides, we get
$(-1)^n = F_{n+1} F_{n-1} - F_n^2$
Moreover, since $A^n A^m = A^{n+m}$ for any square matrix A, the following identities can be derived (they are obtained from two different coefficients of the matrix product):
$F_m F_n + F_{m-1} F_{n-1} = F_{m+n-1}$
Replacing n by n+1,
$F_m F_{n+1} + F_{m-1} F_n = F_{m+n}$
Putting m = n,
$F_{2n-1} = F_n^2 + F_{n-1}^2$
$F_{2n} = (F_{n-1} + F_{n+1}) F_n = (2 F_{n-1} + F_n) F_n$ (Source: Wiki)
To get the formula to be proved, we simply need to do the following
If n is even, we can put k = n/2
If n is odd, we can put k = (n+1)/2
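// Memo table; the snippet does not show its declaration, so the size here is an assumption
static int[] f = new int[1000];
public static int fib(int n)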
{
if (n == 0)
return 0;
if (n == 1 || n == 2)
return (f[n] = 1);
// If fib(n) is already computed
if (f[n] != 0)
return f[n];
int k = (n & 1) == 1? (n + 1) / 2
: n / 2;
// Apply the formula above.
// n & 1 is 1 if n is odd, else 0.
f[n] = (n & 1) == 1? (fib(k) * fib(k) +
fib(k - 1) * fib(k - 1))
: (2 * fib(k - 1) + fib(k))
* fib(k);
return f[n];
}
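The same doubling identities can also be used without a memo table if each call returns the pair (F(k), F(k+1)). A minimal Python sketch of this "fast doubling" variant follows. It is an alternative formulation, not the code above, and it uses F(2k) = F(k) * (2*F(k+1) - F(k)), which is the even-case formula rewritten with F(k-1) = F(k+1) - F(k).

```python
def fib_pair(n):
    """Return (F(n), F(n+1)) in O(log n) arithmetic steps."""
    if n == 0:
        return 0, 1
    a, b = fib_pair(n // 2)      # a = F(k), b = F(k+1) with k = n // 2
    c = a * (2 * b - a)          # F(2k)
    d = a * a + b * b            # F(2k+1)
    return (d, c + d) if n % 2 else (c, d)


def fib(n):
    return fib_pair(n)[0]
```

Because Python integers have arbitrary precision, this version does not overflow the way the int-based Java code does for larger n.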
Method 4 - Using a formula
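In this method, we directly implement the formula for the nth term of the Fibonacci series. Time O(1), space O(1). The result is rounded to the nearest integer, so it is only reliable while double precision can represent the value accurately.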
Fn = {[(√5 + 1)/2] ^ n} / √5
static int fib(int n) {
double phi = (1 + Math.sqrt(5)) / 2;
return (int) Math.round(Math.pow(phi, n)
/ Math.sqrt(5));
}
Reference: http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibFormula.html
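To see where the rounding trick stops being reliable, one can compare it against an exact integer computation. The following Python sketch does that. The helper names are made up for illustration, and the exact index of the first mismatch depends on the floating-point implementation, so no value is quoted here.

```python
import math


def fib_exact(n):
    """Exact F(n) using integer arithmetic."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_binet(n):
    """F(n) via the rounded closed formula, using double precision."""
    phi = (1 + math.sqrt(5)) / 2
    return round(phi ** n / math.sqrt(5))


def first_mismatch(limit=200):
    """Smallest n below limit where the formula is wrong, else None."""
    for n in range(limit):
        if fib_binet(n) != fib_exact(n):
            return n
    return None
```

Calling first_mismatch() gives the smallest n for which the double-precision formula disagrees with the exact value on your machine.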
Look here for an implementation in Erlang which uses the matrix formula. Its running time looks nicely linear in practice, because in the O(M(n) log n) bound the multiplication cost M(n) grows quickly for big numbers. It calculates fib of one million in 2s, where the result has 208988 digits. The trick is that you can compute an exponentiation in O(log n) multiplications using a (tail) recursive formula (tail meaning O(1) space when a proper compiler is used or when it is rewritten as a loop):
% compute X^N
power(X, N) when is_integer(N), N >= 0 ->
power(N, X, 1).
power(0, _, Acc) ->
Acc;
power(N, X, Acc) ->
if N rem 2 =:= 1 ->
power(N - 1, X, Acc * X);
true ->
power(N div 2, X * X, Acc)
end.
where you substitute matrices for X and Acc: X is initialized with the base matrix and Acc with the identity matrix I.
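As a rough Python sketch of the same accumulator scheme applied to 2x2 matrices: the matrices in the original post were images that did not survive, so using {{1,1},{1,0}} as the base matrix and reading F(n) from the off-diagonal entry are assumptions on my part.

```python
def mat_mul(p, q):
    """Multiply two 2x2 matrices given as nested tuples."""
    (a, b), (c, d) = p
    (e, f), (g, h) = q
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))


def mat_pow(x, n):
    """Compute x**n with O(log n) multiplications, like power/3 above."""
    acc = ((1, 0), (0, 1))          # identity matrix
    while n > 0:
        if n % 2 == 1:
            acc = mat_mul(acc, x)   # odd: peel off one factor
            n -= 1
        else:
            x = mat_mul(x, x)       # even: square the base
            n //= 2
    return acc


def fib(n):
    # ASSUMED base matrix {{1,1},{1,0}}; its n-th power holds F(n) off the diagonal
    return mat_pow(((1, 1), (1, 0)), n)[0][1]
```

The loop mirrors the two Erlang clauses: an odd exponent multiplies the accumulator, an even one squares the base.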
One simple way is to calculate it iteratively instead of recursively. This will calculate F(n) in linear time.
def fib(n):
    a, b = 0, 1
    for i in range(n):
        a, b = a + b, a
    return a
Hint: one way to get faster results is to use Binet's formula:
F(n) = (φ^n - ψ^n)/√5, where φ = (1 + √5)/2 and ψ = (1 - √5)/2
Here is a way of doing it in Python:
from decimal import Decimal
def fib(n):
    sqrt5 = Decimal(5).sqrt()  # 28 significant digits by default
    phi = (1 + sqrt5) / 2
    return round((phi**n - (1 - phi)**n) / sqrt5)
You can save your results and reuse them:
public static long[] fibs;
public long fib(int n) {
fibs = new long[n];
return internalFib(n);
}
public long internalFib(int n) {
if (n<=2) return 1;
fibs[n-1] = fibs[n-1]==0 ? internalFib(n-1) : fibs[n-1];
fibs[n-2] = fibs[n-2]==0 ? internalFib(n-2) : fibs[n-2];
return fibs[n-1]+fibs[n-2];
}
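The same "save and reuse" idea can be sketched in Python with the standard library cache decorator. This is just an illustration of memoized recursion, not a port of the Java code above.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    """Recursive Fibonacci; each value is computed only once."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)
```

The recursion is still n levels deep, so for very large n an iterative version is safer with respect to Python's recursion limit.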
F(n) = (φ^n)/√5 and round to nearest integer, where φ is the golden ratio....
φ^n can be calculated in O(lg(n)) time hence F(n) can be calculated in O(lg(n)) time.
// D Programming Language
void vFibonacci ( const ulong X, const ulong Y, const int Limit ) {
// Equivalent : if ( Limit != 10 ). Former ( Limit ^ 0xA ) is More Efficient However.
if ( Limit ^ 0xA ) {
write ( Y, " " ) ;
vFibonacci ( Y, Y + X, Limit + 1 ) ;
} ;
} ;
// Call As
// By Default the Limit is 10 Numbers
vFibonacci ( 0, 1, 0 ) ;
EDIT: I actually think Hynek Vychodil's answer is superior to mine, but I'm leaving this here just in case someone is looking for an alternate method.
I think the other methods are all valid, but not optimal. Using Binet's formula should give you the right answer in principle, but rounding to the closest integer will cause problems for large values of n. The other solutions unnecessarily recalculate the values up to n every time you call the function, so the function is not optimized for repeated calls.
In my opinion the best thing to do is to define a global array and then to add new values to the array IF needed. In Python:
import numpy
fibo = numpy.array([1, 1])
last_index = fibo.size
def fib(n):
    global fibo, last_index
    if n > 0:
        if n > last_index:
            # extend the cached array up to index n
            for i in range(last_index + 1, n + 1):
                fibo = numpy.concatenate((fibo, numpy.array([fibo[i-2] + fibo[i-3]])))
            last_index = fibo.size
        return fibo[n-1]
    else:
        print("fib called for index less than 1")
        quit()
Naturally, if you need to call fib for n>80 (approximately) then you will need arbitrary-precision integers, which are easy to use in Python.
This will execute faster, O(n)
def fibo(n):
    a, b = 0, 1
    for i in range(n):
        if i == 0:
            print(i)
        elif i == 1:
            print(i)
        else:
            temp = a
            a = b
            b += temp
            print(b)
n = int(input())
fibo(n)
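What are the best practices to consider when implementing an error function defined as
$E = \sum_{i,j,k,l,m,n=1}^{N} \left( \sum_{r=1}^{M} A^{(r)}_{ij} B^{(r)}_{kl} C^{(r)}_{mn} - \delta_{ni}\,\delta_{jk}\,\delta_{lm} \right)^2$
using an OpenCL kernel?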
A, B and C are 3D float arrays and \delta is the Kronecker delta.
Typical values for (N, M) = (2, 7) or (N, M) = (3, 23).
The naive implementation (given below) is by several orders of magnitude slower than the CPU version.
Thanks,
T.
__kernel void cl_bilinear_alg(
__global float * A,
__global float * B,
__global float * C,
__global const int M,
__global const int N,
__global float * R)
{
int index = get_global_id(0);
int N2 = N * N;
int mat_offset = index * N2 * M;
float s1, s2, err = 0.0f;
for (int i = 0; i < N; ++i)
{
for (int j = 0; j < N; ++j)
{
for (int k = 0; k < N; ++k)
{
for (int l = 0; l < N; ++l)
{
for (int m = 0; m < N; ++m)
{
for (int n = 0; n < N; ++n)
{
s1 = (n == i) * (j == k) * (l == m);
s2 = 0;
for (int r = 0; r < M; ++r)
{
s2 += A[mat_offset + r * N2 + i * N + j] *
B[mat_offset + r * N2 + k * N + l] *
C[mat_offset + r * N2 + m * N + n];
}
err += (s2 - s1) * (s2 - s1);
}
}
}
}
}
}
R[index] = err;
}
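When optimizing a kernel like this, it helps to have a slow but obviously correct reference to compare results against. Below is a pure-Python mirror of the naive kernel for a single work item. It is only a sketch: the function name and the index parameter are my own, and the flat array layout is copied from the kernel's indexing.

```python
def bilinear_error(A, B, C, N, M, index=0):
    """Error for one work item; A, B, C are flat sequences of floats."""
    N2 = N * N
    off = index * N2 * M          # same offset as mat_offset in the kernel
    err = 0.0
    for i in range(N):
        for j in range(N):
            for k in range(N):
                for l in range(N):
                    for m in range(N):
                        for n in range(N):
                            # product of the three Kronecker deltas
                            s1 = float(n == i and j == k and l == m)
                            s2 = sum(
                                A[off + r * N2 + i * N + j]
                                * B[off + r * N2 + k * N + l]
                                * C[off + r * N2 + m * N + n]
                                for r in range(M)
                            )
                            err += (s2 - s1) ** 2
    return err
```

Comparing R[index] from the GPU with bilinear_error(A, B, C, N, M, index) on a few small inputs makes it easier to confirm that later optimizations did not change the result, up to floating-point rounding.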
UPDATE
The primary target is a Geforce GTX 570, though this could change in the future.
UPDATE2
After vectorizing the code, moving bits to local memory, unrolling some loops and passing precomputed Kronecker products explicitly to the kernel the code looks as follows:
__kernel void cl_bilinear_alg(__global const float * A,
__global const float * B,
__global const float * C,
__global const int N,
__global const int M,
__global const float * kron,
__global float * R)
{
__private int index = get_global_id(0);
__private int cM = ceil(M / 4.0f);
__private int N2 = N*N;
__private int N4 = N2*N2;
__private int mat_offset = index * N2 * M;
__private float s1, s2, err = 0;
__private float4 vzero = (float4) (0.0f, 0.0f, 0.0f, 0.0f);
__local float4 va[54], vb[54], vc[54];
for (int ij = 0, k = 0; ij < N2; ++ij)
{
int r = 0;
for (; r < M / 4; r += 4, ++k)
{
int idx0 = mat_offset + N2 * r + ij;
int idx1 = mat_offset + N2 * (r + 1) + ij;
int idx2 = mat_offset + N2 * (r + 2) + ij;
int idx3 = mat_offset + N2 * (r + 3) + ij;
va[k] = (float4) (A[idx0], A[idx1], A[idx2], A[idx3]);
vb[k] = (float4) (B[idx0], B[idx1], B[idx2], B[idx3]);
vc[k] = (float4) (C[idx0], C[idx1], C[idx2], C[idx3]);
}
if (M % 4)
{
float buffa[4] = {0}, buffb[4] = {0}, buffc[4] = {0};
for (; r < M; ++r)
{
int idx = mat_offset + N2 * r + ij;
buffa[r % 4] = A[idx];
buffb[r % 4] = B[idx];
buffc[r % 4] = C[idx];
}
va[k] = vload4(0, buffa);
vb[k] = vload4(0, buffb);
vc[k++] = vload4(0, buffc);
}
}
for (int ij = 0; ij < N2; ++ij)
{
for (int kl = 0; kl < N2; ++kl)
{
for (int mn = 0; mn < N2; ++mn)
{
s1 = kron[ij * N4 + kl * N2 + mn];
s2 = 0;
for (int r = 0; r < cM; ++r)
s2 += dot(va[cM * ij + r], mad(vb[cM * kl + r], vc[cM * mn + r], vzero));
//the most expensive line
err += (s2 - s1) * (s2 - s1);
}
}
}
R[index] = err;
}
By applying these changes a 4x speed increase was observed compared to the naive implementation. Furthermore, it was revealed that the most expensive line of all is the error update, i.e.
err += (s2 - s1) * (s2 - s1);
Any suggestions?
Typically you'd want to break some of those loops up... a lot...
- the outer loops become split over multiple workgroups, which run on their own compute unit (there are around 16 compute units per GPU, not many)
- the next few loops would be split over different threads within each workgroup
If you try to run all the calculations all at the same time, they will all try to load the data into memory at the same time, and this will simply thrash horribly. GPUs have very limited memory. Sure, the global memory sounds large enough, several gigabytes, but the global GPU memory is slow. You want to get the data into the local memory, which is per compute unit, and is of the order of 32-64KB, not much more than that.
You'd typically want to somehow divide your task into very small tasks, and do the following for each workgroup:
- load a chunk of memory from global memory into local memory
- the whole workgroup of threads can participate in doing the copy, using coalesced access
- do work on this memory, like doing some sums, and so on
- write the results back to global memory
- then either iterate a bit, or simply exit and leave other workgroups to handle other bits of the work
On the CPU, the mathematical operations tend to be a major bottleneck, but on the GPU, generally the cores are mostly spinning uselessly, whilst waiting for data to gradually get to them, from global memory. Whatever you can do to optimize this process, prevent conflicting demands, and so on, will make the kernel significantly faster.