Solving DDE system - vector

I'm trying to solve a system of delay differential equations in C++. I'm new to coding, so if you have recommendations, please tell me; I would like to improve my writing! What I want to do: initialize the history array and then solve the differential equation by overwriting the history array. The problem is that I get this error message:
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 9999) >= this->size() (which is 9999)
It seems that the history arrays are indexed out of range. I put a std::cout after the second if-condition to check whether the code reaches the second for-loop, but it doesn't. Since I'm learning C++ by doing right now, the problem isn't really clear to me. I hope someone sees the error. And don't hesitate to improve my code, I would really appreciate it!
Thanks for your help!
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>
#include <fstream>
const double pi = 3.14159265358979323846;
//delay
int tau = 1;
//initial values
double x = 1.0;
double y = 1.0;
double t = 0.0;
//constants and parameters
double K = 0.25;
double lam = 0.5;
double omega = pi;
double dx, dy;
//step-size
double dt = pow(10.0, -4.0);
//number of steps
int Delta = static_cast<int>(tau/dt);
std::vector<double> hist_x((static_cast<int>(tau/dt) - 1), 0.0);
std::vector<double> hist_y((static_cast<int>(tau/dt) - 1), 0.0);
std::vector<double> t_val;
std::vector<double> x_val;
std::vector<double> y_val;
double euler(double f, double di, double time_step){
f = f + time_step * di;
return f;
}
int main()
{
std::ofstream file_x;
std::ofstream file_y;
std::ofstream file_t;
file_x.open("x_val.txt");
file_y.open("y_val.txt");
file_t.open("t_val.txt");
for(int n = 0; n < 2; n++){
if(n==0){
for(int j; j < Delta; j++){
dx = lam * x + omega * x;
dy = lam * y - omega * x;
x = euler(x, dx, dt);
y = euler(y, dy, dt);
t = t + dt;
x_val.push_back(x);
y_val.push_back(y);
t_val.push_back(t);
hist_x.at(j) = x;
hist_y.at(j) = y;
file_x<<x_val.at(j)<<std::endl;
file_y<<y_val.at(j)<<std::endl;
file_t<<t_val.at(j)<<std::endl;
}
}
if(!(n==0)){
for(int k = 0; k < Delta; k++){
//f1(x,y)
dx = lam * x + omega * x - K * ( x - hist_x.at(k) );
//f2(x,y)
dy = lam * y - omega * x - K * ( y - hist_y.at(k) );
x = euler(x, dx, dt);
y = euler(y, dy, dt);
t = t + dt;
x_val.push_back(x);
y_val.push_back(y);
t_val.push_back(t);
hist_x.at(k) = x;
hist_y.at(k) = y;
file_x<<x_val.at(k + n * Delta)<<std::endl;
file_y<<y_val.at(k + n * Delta)<<std::endl;
file_t<<t_val.at(k + n * Delta)<<std::endl;
}
}
}
file_x.close();
file_y.close();
file_t.close();
}

for(int j; j < Delta; j++){
You forgot to initialize j; you meant:
for (int j = 0; j < Delta; j++)
{
int Delta = static_cast<int>(tau/dt);
std::vector<double> hist_x((static_cast<int>(tau/dt) - 1), 0.0);
std::vector<double> hist_y((static_cast<int>(tau/dt) - 1), 0.0);
You index from 0 to Delta−1, so the vectors need Delta elements, but you allocate one fewer. Correct:
std::vector<double> hist_x(Delta, 0.0);
std::vector<double> hist_y(Delta, 0.0);
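Putting both fixes together, the loop can be sketched as follows. This is only a minimal sketch under assumptions: the `State` struct and the `integrate` function (with its `delta`, `periods` and `dt` parameters) are hypothetical names that do not appear in the question, the right-hand sides are copied unchanged from the question's code, and file output is left out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical container for the current solution values.
struct State { double x, y, t; };

// Explicit Euler for the question's system. `delta` is the number of steps
// per delay interval; the history buffers hold exactly `delta` elements,
// so the indices 0 .. delta-1 are all valid.
State integrate(int delta, int periods, double dt) {
    assert(delta > 0 && periods > 0 && dt > 0.0);
    const double pi = 3.14159265358979323846;
    const double K = 0.25, lam = 0.5, omega = pi;

    std::vector<double> hist_x(delta, 0.0);
    std::vector<double> hist_y(delta, 0.0);
    State s{1.0, 1.0, 0.0};

    for (int n = 0; n < periods; ++n) {
        for (int j = 0; j < delta; ++j) {  // j is initialized to 0
            // During the first interval the history is not filled yet,
            // so the delayed feedback term is switched off.
            const double fx = (n == 0) ? 0.0 : K * (s.x - hist_x.at(j));
            const double fy = (n == 0) ? 0.0 : K * (s.y - hist_y.at(j));
            const double dx = lam * s.x + omega * s.x - fx;
            const double dy = lam * s.y - omega * s.x - fy;
            s.x += dt * dx;
            s.y += dt * dy;
            s.t += dt;
            // Overwrite the value from one delay interval ago.
            hist_x.at(j) = s.x;
            hist_y.at(j) = s.y;
        }
    }
    return s;
}
```

With the question's values this would be called as `integrate(Delta, 2, dt)`. Merging the two branches into one loop is a design choice of the sketch, not something the original code requires.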


Explicit FDM with CUDA [closed]

I am working on a CUDA implementation of the following code. The first version is written serially and the second version uses CUDA. I am sure about the results of the serial version. I expect the second version, to which I have added CUDA functionality, to give the same result, but it seems that the kernel function does not do anything, and I get back the initial values of u and v. I know that, due to my lack of experience, the bug may be obvious, but I cannot figure it out. Also, please do not recommend using a flattened array, because it makes the indexing in the code harder for me to understand.
First version:
#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
#include <chrono>
#include <omp.h>
using namespace std;
const int M = 1024;
const int N = 1024;
const double A = 1;
const double B = 3;
const double Du = 5 * pow(10, -5);
const double Dv = 5 * pow(10, -6);
const int Max_Itr = 1000;
const double h = 1.0 / static_cast<double>(M - 1);
const double delta_t = 0.0025;
const double s1 = (Du * delta_t) / pow(h, 2);
const double s2 = (Dv * delta_t) / pow(h, 2);
int main() {
double** u=new double* [M];
double** v=new double* [M];
for (int i=0; i<M; i++){
u[i]=new double [N];
v[i]=new double [N];
}
for (int j = 0; j < M; j++) {
for (int i = 0; i < N;i++) {
u[i][j]=0.02;
v[i][j]=0.02;
}
}
for (int k = 1; k < Max_Itr; k++) {
for (int i = 1; i < N - 1; i++) {
for (int j = 1; j < M - 1; j++) {
u[i][j] = ((1 - (4 * s1)) * u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
(A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);
v[i][j] = ((1 - (4 * s2)) * v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
- (delta_t * pow(u[i][j], 2) * v[i][j]);
}
}
}
cout<<"u: "<<u[512][512]<<" v: "<<v[512][512]<<endl;
return 0;
}
Second version:
#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
using namespace std;
#define M 1024
#define N 1024
__global__ void my_kernel(double** v, double** u){
int i= blockIdx.y * blockDim.y + threadIdx.y;
int j= blockIdx.x * blockDim.x + threadIdx.x;
double A = 1;
double B = 3;
int Max_Itr = 1000;
double delta_t = 0.0025;
double Du = 5 * powf(10, -5);
double Dv = 5 * powf(10, -6);
double h = 1.0 / (M - 1);
double s1 = (Du * delta_t) / powf(h, 2);
double s2 = (Dv * delta_t) / powf(h, 2);
for (int k = 1; k < Max_Itr; k++) {
u[i][j] = ((1 - (4 * s1))
* u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
(A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);
v[i][j] = ((1 - (4 * s2))
* v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
- (delta_t * pow(u[i][j], 2) * v[i][j]);
__syncthreads();
}
}
int main() {
double** u=new double* [M];
double** v=new double* [M];
for (int i=0; i<M; i++){
u[i]=new double [N];
v[i]=new double [N];
}
dim3 blocks(32,32);
dim3 grids(M/32 +1, N/32 + 1);
for (int j = 0; j < M; j++) {
for (int i = 0; i < N;i++) {
u[i][j]=0.02;
v[i][j]=0.02;
}
}
double **u_d, **v_d;
int d_size = N * M * sizeof(double);
cudaMalloc(&u_d, d_size);
cudaMalloc(&v_d, d_size);
cudaMemcpy(u_d, u, d_size, cudaMemcpyHostToDevice);
cudaMemcpy(v_d, v, d_size, cudaMemcpyHostToDevice);
my_kernel<<<grids, blocks>>> (v_d,u_d);
cudaDeviceSynchronize();
cudaMemcpy(v, v_d, d_size, cudaMemcpyDeviceToHost);
cudaMemcpy(u, u_d, d_size, cudaMemcpyDeviceToHost);
cout<<"u: "<<u[512][512]<<" v: "<<v[512][512]<<endl;
return 0;
}
What I expect from the second version is :
u: 0.2815 v: 1.7581
Your two-dimensional array, in the first version of the program, is implemented as an array of pointers, each of which points to a separately allocated array of double values.
In your second version, you use the same pointer-to-pointer-to-double type, but you don't allocate any space for the actual data, only for the array of pointers. You also don't copy any of the data to the GPU, only the pointers, which are useless on the device anyway, since they point to host-side memory.
What is most likely happening is that your kernel attempts to access memory at an invalid address, and its execution is aborted.
If you were to properly check for errors, as @njuffa noted, you would know that this is what happened.
Now, you could avoid having to make multiple memory allocations if you were to use a single data area instead of separate allocations for each second-dimension 1D array; that is true for both the first and the second version of your program. That would not quite be array flattening. See an explanation of how to do this (C-language-style) on this page.
Note, however, that the double dereferencing, which you insist on performing in your kernel, is likely slowing it down significantly.
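The single-data-area idea can be sketched on the host side as follows. This is a minimal sketch, assuming a hypothetical helper type `Grid2D` that is not part of the question's code: all values live in one allocation, and a table of row pointers keeps the `u[i][j]` indexing style.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one contiguous data area plus a table of row pointers.
struct Grid2D {
    std::vector<double> data;   // n_rows * n_cols values in a single allocation
    std::vector<double*> rows;  // rows[i] points at the start of row i in data

    Grid2D(std::size_t n_rows, std::size_t n_cols, double init)
        : data(n_rows * n_cols, init), rows(n_rows) {
        assert(n_rows > 0 && n_cols > 0);
        for (std::size_t i = 0; i < n_rows; ++i) {
            rows[i] = data.data() + i * n_cols;
        }
    }

    // Copying would leave the row pointers aimed at the old data area.
    Grid2D(const Grid2D&) = delete;
    Grid2D& operator=(const Grid2D&) = delete;

    // Usable wherever a double** was used before: ptr()[i][j].
    double** ptr() { return rows.data(); }
};
```

For example, `Grid2D u(M, N, 0.02);` followed by `double** up = u.ptr();` keeps `up[i][j]` working as in the first version. On the GPU side the data area could then be transferred with a single copy, but the row-pointer table would still have to be rebuilt from device addresses, because host pointers are not valid on the device.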

Unexpected behaviour in Rcpp

Please note that this error was taken from a bigger context, which I obviously cannot reproduce here in full.
I have the following functions in the file fun.cpp
#include <RcppArmadilloExtensions/sample.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
arma::vec colMeans(arma::mat data){
int n_0 = data.n_rows;
arma::vec xbar(data.n_cols);
for(int i = 0; i < data.n_rows; i++){
for(int j = 0; j < data.n_cols; j++){
xbar[j] += data(i,j) /n_0;
}
}
return xbar;
}
// [[Rcpp::export]]
List PosteriorNIW(arma::mat data, arma::vec mu0, double lambda0,
double df0, arma::mat V){
// Compute posterior
int n = data.n_rows;
arma::vec xbar = colMeans(data);
double lambdan = lambda0 + n;
arma::vec mun = (lambda0 * mu0 + n * xbar) / lambdan;
arma::mat S;
S.zeros(data.n_cols, data.n_cols);
for(int i = 0; i < n; i++){
S += (arma::conv_to<arma::vec>::from(data.row(i)) - xbar) * arma::trans(arma::conv_to<arma::vec>::from(data.row(i)) - xbar);
}
arma::mat Vn = V + S + ((lambda0*n)/(lambda0 + n)) * (xbar - mu0) * arma::trans(xbar - mu0);
return List::create(_["mun"] = mun,
_["Vn"] = Vn,
_["lambdan"] = lambdan);
}
Calling now:
library(Rcpp); library(RcppArmadillo)
mu0 <- c(3,3)
V0 <- matrix(c(2.5,0.0,0.0,2.5), nrow = 2)
sourceCpp("fun.cpp")
data <- cbind(rep(5,15),rep(0,15))
PosteriorNIW(data, mu0, 1, 1, V0)
gives the expected result.
$mun
[,1]
[1,] 4.8750
[2,] 0.1875
$Vn
[,1] [,2]
[1,] 6.250 -5.6250
[2,] -5.625 10.9375
$lambdan
[1] 16
Now if I add the following functions to the file fun.cpp (again, these are taken from a bigger context, so don't bother trying to understand them, just paste them), strange things happen:
// [[Rcpp::export]]
NumericMatrix myFun(arma::mat t_dish, arma::cube data){
int l = 0;
for(int j = 0; j < data.n_rows; j++){
l++;
}
NumericMatrix Dk(l, 2);
return Dk;
}
// [[Rcpp::export]]
int myFun2(arma::cube n_cust){
arma::mat temp = n_cust.subcube(arma::span(0), arma::span(), arma::span());
int i;
for(i = 0; i < n_cust.n_cols; i++){
arma::rowvec temp2 = temp.row(i);
}
return i + 1;
}
// [[Rcpp::export]]
arma::vec myFun3(arma::mat k_tables){
arma::vec temp(k_tables.n_cols * k_tables.n_rows);
int l = 0;
if(!R_IsNA(k_tables(0,0))){
l++;
}
arma::vec temp2(l);
arma::vec tmp3 = sort(temp2);
return tmp3;
}
double myFun4(arma::vec x, double nu, arma::vec mu, arma::mat Sigma){
arma::vec product = (arma::trans(x - mu) * arma::inv(Sigma) * (x - mu));
double num = pow(1 + (1 / nu) * product[0], - ( nu + 2 ) / 2);
double den = pow(sqrt(M_PI * nu),2) * sqrt(arma::det(Sigma));
return num / den;
}
bool myFun5(NumericVector X, double z) {
return std::find(X.begin(), X.end(), z)!=X.end();
}
Calling PosteriorNIW(data, mu0, 1, 1, V0) repeatedly then starts giving different results every time. Note that there is no randomness in the functions, and that the added functions obviously should have no impact, as they are never called by the original function.
I have tried a different machine to make sure it was not a problem with my compiler, but the error keeps happening.
I know that removing those functions (even just one of them) fixes the problem, but clearly this is not a feasible solution when I am working with more functions.
I would like to know if other users are able to replicate this behavior and, if so, whether there is a fix for it.
Thank you in advance
EDIT:
The version of R is 3.3.2 and Rtools is 3.4. Both Rcpp and RcppArmadillo are up-to-date
You're not zeroing xbar in your colMeans function. If I do that:
arma::vec colMeans(arma::mat data){
int n_0 = data.n_rows;
arma::vec xbar;
xbar.zeros(data.n_cols);
for(int i = 0; i < data.n_rows; i++){
for(int j = 0; j < data.n_cols; j++){
xbar[j] += data(i,j) /n_0;
}
}
return xbar;
}
I get this every time:
> PosteriorNIW(data, mu0, 1, 1.1, V0)
$mun
[,1]
[1,] 4.8750
[2,] 0.1875
$Vn
[,1] [,2]
[1,] 6.250 -5.6250
[2,] -5.625 10.9375
$lambdan
[1] 16
Even when I add your extra block of code.
I don't know whether these vectors are documented to be initialised to zero by their constructor (in which case this might be a bug there) or not, in which case it's your bug!
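The underlying rule does not depend on Armadillo: an accumulator that is only ever updated with `+=` has to start from a known value. A minimal library-free sketch of the same column-mean computation, where the function name `col_means` and the row-major `std::vector` layout are assumptions made for illustration only:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column means of a row-major n_rows x n_cols matrix stored in `m`.
std::vector<double> col_means(const std::vector<double>& m,
                              std::size_t n_rows, std::size_t n_cols) {
    assert(n_rows > 0 && m.size() == n_rows * n_cols);
    // The accumulator is zeroed explicitly before anything is added to it.
    std::vector<double> xbar(n_cols, 0.0);
    for (std::size_t i = 0; i < n_rows; ++i) {
        for (std::size_t j = 0; j < n_cols; ++j) {
            xbar[j] += m[i * n_cols + j] / static_cast<double>(n_rows);
        }
    }
    return xbar;
}
```

If the accumulator started from whatever happened to be in memory, unrelated changes to the program (such as adding more functions) could change that memory and therefore the result, which would match the behaviour described in the question.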

RcppArmadillo on several cpu cores

I have the following RcppArmadillo function that runs fine if I execute it on one CPU core. But if I use several cores, R crashes. All the other Rcpp functions I have created so far run fine on several cores (with foreach); only RcppArmadillo seems to be problematic. Any ideas how to fix that?
cppFunction('double augmentedDickeyFullerCpp(NumericVector a, NumericVector b, double gamma, double mu, int lags) {
if (gamma < 0) {
return 0;
}
int n = a.size()-1;
int lags2 = lags + 1;
// first rows, then columns
NumericMatrix x(n-lags2,lags2);
NumericMatrix zdifflag(n-lags2+1,lags2);
NumericVector diff(n);
NumericVector zdiff(n-lags2+1);
NumericVector residuals(n+1);
residuals[0] = a[0] - gamma * b[0] - mu;
// residuals a is y and b is x
for(int i = 1; i < n+1; i++) {
residuals[i] = a[i] - gamma * b[i] - mu;
diff[i-1] = residuals[i] - residuals[i-1];
}
for(int i = 0; i < n-lags2+1; i++) {
zdifflag[0,i] = residuals[i+lags2-1];
}
for(int j = 0; j < n-lags2+1; j++) {
for(int i = 0; i < lags2; i++) {
x(j,i) = diff[j+lags2-1-i];
if (i > 0) {
zdifflag(j,i) = x(j,i);
}
}
zdiff[j] = x(j,0);
}
int length = zdifflag.nrow(), k = zdifflag.ncol();
arma::mat X(zdifflag.begin(), length, k, false); // reuses memory and avoids extra copy
arma::colvec y(zdiff.begin(), zdiff.size(), false);
arma::colvec coef = arma::solve(X, y); // fit model y ~ X
arma::colvec res = y - X*coef; // residuals
// std.errors of coefficients
//arma::colvec res = y - X*coef[0];
// sqrt(sum(residuals^2)/(length - k))
double s2 = std::inner_product(res.begin(), res.end(), res.begin(), 0.0)/(length - k);
arma::colvec std_err = arma::sqrt(s2 * arma::diagvec(arma::pinv(arma::trans(X)*X)));
return coef[0]/std_err[0];
}',depends = "RcppArmadillo", includes="#include <RcppArmadillo.h>")
I generally recommend putting the code into a small package, and having each parallel worker load the package. That is known to work, both in serial and parallel, whereas relying on cppFunction() for an ad-hoc function may be too fragile for parallel execution.

What is the meaning of "C2F(ddot)" in Scilab?

I downloaded the Scilab source code. I am interested in how conv2 works and want to translate it to C#, but I don't know what "C2F(ddot)" means or how it works. If I port "C2F(ddot)" to C or C#, how should I implement it? Here is a piece of the Scilab source code:
extern double C2F(ddot)(int *n, double *A, int *iA, double *B, int *iB);
/*--------------------------------------------------------------------------*/
void conv2_separable_R(double *R, int nR, double *C, int mC, double *A, int mA, int nA, double *Out, int mOut, int nOut, int edgM, int edgN, double *T)
{
int ai = 0, tj = 0, ci = 0, rj = 0; /*current index over A,T,C and R */
int i = 0, j = 0; /* loop variables*/
int l = 0;
int one = 1, minusone = -1;
for (i = 0; i < mOut; i++ )
{
/*Compute the 1-D conv A(i,:) and C in T */
ai = Max(0, i - edgM) ;
ci = mC - 1 - Max(0, edgM - i);
l = Min(ci + 1, mA - ai);
for (j = 0; j < nA; j++ )
{
T[j] = C2F(ddot)(&l, A + ai + mA * j, &one, C + ci - l + 1, &minusone);
}
/*1-D convolution of T and R */
for (j = 0; j < nOut; j++ )
{
rj = nR - 1 - Max(0, edgN - j);
tj = Max(0, j - edgN) ;
l = Min(rj + 1, nA - tj);
Out[i + j * mOut] = C2F(ddot)(&l, T + tj, &one, R + rj - l + 1, &minusone);
}
}
}
If I want to translate the line "T[j] = C2F(ddot)(&l, A + ai + mA * j, &one, C + ci - l + 1, &minusone);" into C or C#, how should I do it?
In Scilab's sources, C2F is a macro used to call a Fortran function from C.
It is declared in machine.h.
This code most likely calls the BLAS ddot function.
That is, the dot product of the two vectors is computed in Fortran.
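For a port, the call can be replaced by a small hand-written function. The following is a sketch of the reference BLAS ddot semantics (n elements, strides incx and incy, and for a negative stride the traversal starts at the far end of that vector); it takes its arguments by value rather than by pointer as the Fortran interface does, and it is not Scilab's own code.

```cpp
#include <cassert>

// Strided dot product following the reference BLAS ddot convention:
// returns the sum over i = 0 .. n-1 of x[ix + i*incx] * y[iy + i*incy],
// where a vector with a negative increment starts at offset (1 - n) * inc.
double ddot(int n, const double* x, int incx, const double* y, int incy) {
    assert(x != nullptr && y != nullptr);
    if (n <= 0) {
        return 0.0;
    }
    int ix = (incx < 0) ? (1 - n) * incx : 0;
    int iy = (incy < 0) ? (1 - n) * incy : 0;
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += x[ix] * y[iy];
        ix += incx;
        iy += incy;
    }
    return sum;
}
```

The line from the question would then read `T[j] = ddot(l, A + ai + mA * j, 1, C + ci - l + 1, -1);`. Because the second increment is -1, the second vector is read backwards starting at `C[ci]`, so the call sums `A[ai + mA * j + i] * C[ci - i]` for i from 0 to l-1, which is one term of the convolution. The same loop carries over to C# with arrays and explicit offsets in place of pointers.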

opencl kernel implementing a simple mathematical formula

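What are the best practices to consider when implementing an error function defined as
$$E = \sum_{i,j,k,l,m,n=1}^{N} \Big( \sum_{r=1}^{M} A_{r,ij}\, B_{r,kl}\, C_{r,mn} - \delta_{ni}\,\delta_{jk}\,\delta_{lm} \Big)^2$$
using an OpenCL kernel?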
A, B and C are 3D float arrays and \delta is the Kronecker delta.
Typical values are (N, M) = (2, 7) or (N, M) = (3, 23).
The naive implementation (given below) is several orders of magnitude slower than the CPU version.
Thanks,
T.
__kernel void cl_bilinear_alg(
__global float * A,
__global float * B,
__global float * C,
__global const int M,
__global const int N,
__global float * R)
{
int index = get_global_id(0);
int N2 = N * N;
int mat_offset = index * N2 * M;
float s1, s2, err = 0.0f;
for (int i = 0; i < N; ++i)
{
for (int j = 0; j < N; ++j)
{
for (int k = 0; k < N; ++k)
{
for (int l = 0; l < N; ++l)
{
for (int m = 0; m < N; ++m)
{
for (int n = 0; n < N; ++n)
{
s1 = (n == i) * (j == k) * (l == m);
s2 = 0;
for (int r = 0; r < M; ++r)
{
s2 += A[mat_offset + r * N2 + i * N + j] *
B[mat_offset + r * N2 + k * N + l] *
C[mat_offset + r * N2 + m * N + n];
}
err += (s2 - s1) * (s2 - s1);
}
}
}
}
}
}
R[index] = err;
}
UPDATE
The primary target is a Geforce GTX 570, though this could change in the future.
UPDATE2
After vectorizing the code, moving bits to local memory, unrolling some loops and passing precomputed Kronecker products explicitly to the kernel the code looks as follows:
__kernel void cl_bilinear_alg(__global const float * A,
__global const float * B,
__global const float * C,
__global const int N,
__global const int M,
__global const float * kron,
__global float * R)
{
__private int index = get_global_id(0);
__private int cM = ceil(M / 4.0f);
__private int N2 = N*N;
__private int N4 = N2*N2;
__private int mat_offset = index * N2 * M;
__private float s1, s2, err = 0;
__private float4 vzero = (float4) (0.0f, 0.0f, 0.0f, 0.0f);
__local float4 va[54], vb[54], vc[54];
for (int ij = 0, k = 0; ij < N2; ++ij)
{
int r = 0;
for (; r < M / 4; r += 4, ++k)
{
int idx0 = mat_offset + N2 * r + ij;
int idx1 = mat_offset + N2 * (r + 1) + ij;
int idx2 = mat_offset + N2 * (r + 2) + ij;
int idx3 = mat_offset + N2 * (r + 3) + ij;
va[k] = (float4) (A[idx0], A[idx1], A[idx2], A[idx3]);
vb[k] = (float4) (B[idx0], B[idx1], B[idx2], B[idx3]);
vc[k] = (float4) (C[idx0], C[idx1], C[idx2], C[idx3]);
}
if (M % 4)
{
float buffa[4] = {0}, buffb[4] = {0}, buffc[4] = {0};
for (; r < M; ++r)
{
int idx = mat_offset + N2 * r + ij;
buffa[r % 4] = A[idx];
buffb[r % 4] = B[idx];
buffc[r % 4] = C[idx];
}
va[k] = vload4(0, buffa);
vb[k] = vload4(0, buffb);
vc[k++] = vload4(0, buffc);
}
}
for (int ij = 0; ij < N2; ++ij)
{
for (int kl = 0; kl < N2; ++kl)
{
for (int mn = 0; mn < N2; ++mn)
{
s1 = kron[ij * N4 + kl * N2 + mn];
s2 = 0;
for (int r = 0; r < cM; ++r)
s2 += dot(va[cM * ij + r], mad(vb[cM * kl + r], vc[cM * mn + r], vzero));
//the most expensive line
err += (s2 - s1) * (s2 - s1);
}
}
}
R[index] = err;
}
By applying these changes a 4x speed increase was observed compared to the naive implementation. Furthermore, it was revealed that the most expensive line of all is the error update, i.e.
err += (s2 - s1) * (s2 - s1);
Any suggestions?
Typically you'd want to break some of those loops up... a lot:
- the outer loops are split over multiple workgroups, each of which runs on its own compute unit (there are around 16 compute units per GPU, not many)
- the next few loops are split over different threads within each workgroup
If you try to run all the calculations at the same time, they will all try to load the data into memory at the same time, and this will simply thrash horribly. GPUs have very limited memory. Sure, the global memory sounds large enough, several gigabytes, but the global GPU memory is slow. You want to get the data into the local memory, which is per compute unit and is of the order of 32-64KB, not much more than that.
You'd typically want to divide your task into very small tasks, and do the following for each workgroup:
- load a chunk of memory from global memory into local memory
  - the whole workgroup of threads can participate in the copy, using coalesced access
- do work on this memory, like doing some sums, and so on
- write the results back to global memory
- then either iterate a bit, or simply exit and leave other workgroups to handle other bits of the work
On the CPU, the mathematical operations tend to be a major bottleneck, but on the GPU the cores are generally mostly spinning uselessly whilst waiting for data to gradually get to them from global memory. Whatever you can do to optimize this process, prevent conflicting demands, and so on, will make the kernel significantly faster.
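When restructuring the kernel along these lines, a plain CPU reference is useful for checking that each variant still computes the same error. A minimal sketch, assuming the memory layout of the naive kernel in the question (element (r, i, j) at offset r*N*N + i*N + j) and handling a single (A, B, C) triple, i.e. the share of one work-item; the name `bilinear_error` is hypothetical, and the error is accumulated in double rather than float:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for one (A, B, C) triple, each holding M matrices of size N x N.
double bilinear_error(const std::vector<float>& A,
                      const std::vector<float>& B,
                      const std::vector<float>& C,
                      int N, int M) {
    const int N2 = N * N;
    const std::size_t expected = static_cast<std::size_t>(M) * N2;
    assert(N > 0 && M > 0);
    assert(A.size() == expected && B.size() == expected && C.size() == expected);

    double err = 0.0;
    for (int ij = 0; ij < N2; ++ij) {
        const int i = ij / N, j = ij % N;
        for (int kl = 0; kl < N2; ++kl) {
            const int k = kl / N, l = kl % N;
            for (int mn = 0; mn < N2; ++mn) {
                const int m = mn / N, n = mn % N;
                // Product of the three Kronecker deltas from the naive kernel.
                const double s1 = (n == i && j == k && l == m) ? 1.0 : 0.0;
                double s2 = 0.0;
                for (int r = 0; r < M; ++r) {
                    s2 += static_cast<double>(A[r * N2 + ij]) *
                          B[r * N2 + kl] * C[r * N2 + mn];
                }
                err += (s2 - s1) * (s2 - s1);
            }
        }
    }
    return err;
}
```

Because the GPU versions work in float, results should be compared against this reference with a tolerance rather than for exact equality.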
