Undefined Symbols for architecture x86_64 - xcode4

I'm getting several linker errors, all of the form "Undefined symbols for architecture x86_64".
The simplest class where this error happens is a class named Peaking that computes equalization filters.
Here is the definition of the Peaking class:
#ifndef _PEAKING_
#define _PEAKING_
#include "IPlugBase.h"
#include "IGraphics.h"
class Peaking
{
public:
Peaking()
{
a = new double[3];
b = new double[3];
previousx1=0;
previouspreviousx1=0;
}
~Peaking() {}
double* a;
double* b;
void ComputeParams(double G, double fc, double Q, double fs);
double ComputeAudio(double x);
double previouspreviousx1;
double previousx1;
private:
};
#endif
And here is the implementation:
#include <math.h>
#include "defines.hpp"
#include "Peaking.h"
void Peaking::ComputeParams(double G, double fc, double Q, double fs)
{
double K = tan((M_PI * fc)/fs);
double V0 = pow(10,(G/20.));
//Invert gain if a cut
if (V0 < 1)
V0 = 1/V0;
double a1, a2, b0, b1, b2;
////////////////////////////
// BOOST
////////////////////////////
double K2 = pow(K,2.);
double divQ = 1/Q;
double z1 = (divQ*K);
double divz2 = 1/(1 + z1 + K2);
double z3 = ((V0*divQ)*K);
double divz4 = 1/(1 + z3 + K2);
if( G > 0 )
{
b0 = (1 + z3 + K2) * divz2;
b1 = (2 * (K2 - 1)) * divz2;
b2 = (1 - z3 + K2) * divz2;
a1 = b1;
a2 = (1 - z1 + K2) * divz2;
}
////////////////////////////
// CUT
////////////////////////////
else
{
b0 = (1 + z1 + K2) * divz4;
b1 = (2 * (K2 - 1)) * divz4;
b2 = (1 - z1 + K2) * divz4;
a1 = b1;
a2 = (1 - z3 + K2) * divz4;
}
a[0] = 1;
a[1] = a1;
a[2] = a2;
b[0] = b0;
b[1] = b1;
b[2] = b2;
}
double Peaking::ComputeAudio(double x)
{
// E:\Programmation\#Audio\#EQ\dfilt_df2.gif
double srg = a[2];
double x1 = x - a[1] * previousx1 - a[2] * previouspreviousx1;
double y = x1 * b[0] + previousx1 * b[1] + previouspreviousx1 * b[2];
previouspreviousx1 = previousx1;
previousx1 = x1;
return y;
}
Here is the error list:
Undefined symbols for architecture x86_64:
"OnOff::OnOff(IPlugBase*, int, int, int, IBitmap*, IBitmap*, IChannelBlend::EBlendMethod)", referenced from:
AutoEQ::AutoEQ(IPlugInstanceInfo) in AutoEQ.o
"Peaking::ComputeAudio(double)", referenced from:
AutoEQ::traiterLearning(int) in AutoEQ.o
AutoEQ::ProcessDoubleReplacing(double**, double**, int) in AutoEQ.o
"Peaking::ComputeParams(double, double, double, double)", referenced from:
AutoEQ::traiterLearning(int) in AutoEQ.o
"FFTOoura::rdft(int, int, double*, int*, double*)", referenced from:
AutoEQ::traiterLearning(int) in AutoEQ.o
"vtable for ProgressBar", referenced from:
ProgressBar::ProgressBar(IPlugBase*, int, int, int, IBitmap, IBitmap, IChannelBlend::EBlendMethod) in AutoEQ.o
NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
"vtable for EQClass", referenced from:
EQClass::EQClass(IPlugBase*, int) in AutoEQ.o
NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
"vtable for VUmetre", referenced from:
VUmetre::VUmetre(IPlugBase*, int, int, int, IBitmap, IBitmap, IChannelBlend::EBlendMethod) in AutoEQ.o
NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
"vtable for MyButton", referenced from:
MyButton::MyButton(IPlugBase*, int, int, int, IBitmap*, IChannelBlend::EBlendMethod) in AutoEQ.o
NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
If I put the body of the function "double Peaking::ComputeAudio(double x)" inside the class definition, one error disappears.
If I do the same with the body of the virtual function "draw" of VUmetre, another error goes away. So it seems I cannot keep function bodies in the .cpp files. How come?
Thanks for any help,
Jeff

OK, I found out: some .cpp files were not listed in the "Build Phases / Compile Sources" panel.
Now it's working!
I'm relatively new to developing on OS X, hence this beginner's error.
Jeff
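For readers who hit the same linker error, here is a minimal sketch of why the missing Compile Sources entries mattered. Gain is a hypothetical class, not part of the project above, and the comments mark which part would live in a header and which in a .cpp file; both are shown in one listing only to keep the example short.

```cpp
// --- would live in Gain.h: declaration only ---
class Gain {
public:
    explicit Gain(double factor) : factor_(factor) {}
    double Apply(double x) const;  // declared here, defined elsewhere

private:
    double factor_;
};

// --- would live in Gain.cpp: out-of-line definition ---
// Every file that includes Gain.h compiles without this definition.
// The linker only finds Gain::Apply if Gain.cpp is itself compiled and
// linked into the target; otherwise each call site is reported as an
// undefined symbol.
double Gain::Apply(double x) const { return factor_ * x; }
```

This is also consistent with the observation in the question: a function body written inside the class definition is compiled as part of every file that includes the header, so it never depends on a separate .cpp file being in the build.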


Explicit FDM with CUDA [closed]

I am working on porting the following code to CUDA. The first version is written serially and the second version uses CUDA. I am confident in the results of the serial version. I expect the second version, to which I have added CUDA functionality, to give the same result, but it seems that the kernel function does not do anything: I get back the initial values of u and v. Given my lack of experience the bug may be obvious, but I cannot figure it out. Also, please do not recommend using a flattened array, because that makes the indexing in the code harder for me to understand.
First version:
#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
#include <chrono>
#include <omp.h>
using namespace std;
const int M = 1024;
const int N = 1024;
const double A = 1;
const double B = 3;
const double Du = 5 * pow(10, -5);
const double Dv = 5 * pow(10, -6);
const int Max_Itr = 1000;
const double h = 1.0 / static_cast<double>(M - 1);
const double delta_t = 0.0025;
const double s1 = (Du * delta_t) / pow(h, 2);
const double s2 = (Dv * delta_t) / pow(h, 2);
int main() {
double** u=new double* [M];
double** v=new double* [M];
for (int i=0; i<M; i++){
u[i]=new double [N];
v[i]=new double [N];
}
for (int j = 0; j < M; j++) {
for (int i = 0; i < N;i++) {
u[i][j]=0.02;
v[i][j]=0.02;
}
}
for (int k = 1; k < Max_Itr; k++) {
for (int i = 1; i < N - 1; i++) {
for (int j = 1; j < M - 1; j++) {
u[i][j] = ((1 - (4 * s1)) * u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
(A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);
v[i][j] = ((1 - (4 * s2)) * v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
- (delta_t * pow(u[i][j], 2) * v[i][j]);
}
}
}
cout<<"u: "<<u[512][512]<<" v: "<<v[512][512]<<endl;
return 0;
}
Second version:
#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
using namespace std;
#define M 1024
#define N 1024
__global__ void my_kernel(double** v, double** u){
int i= blockIdx.y * blockDim.y + threadIdx.y;
int j= blockIdx.x * blockDim.x + threadIdx.x;
double A = 1;
double B = 3;
int Max_Itr = 1000;
double delta_t = 0.0025;
double Du = 5 * powf(10, -5);
double Dv = 5 * powf(10, -6);
double h = 1.0 / (M - 1);
double s1 = (Du * delta_t) / powf(h, 2);
double s2 = (Dv * delta_t) / powf(h, 2);
for (int k = 1; k < Max_Itr; k++) {
u[i][j] = ((1 - (4 * s1))
* u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
(A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);
v[i][j] = ((1 - (4 * s2))
* v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
- (delta_t * pow(u[i][j], 2) * v[i][j]);
__syncthreads();
}
}
int main() {
double** u=new double* [M];
double** v=new double* [M];
for (int i=0; i<M; i++){
u[i]=new double [N];
v[i]=new double [N];
}
dim3 blocks(32,32);
dim3 grids(M/32 +1, N/32 + 1);
for (int j = 0; j < M; j++) {
for (int i = 0; i < N;i++) {
u[i][j]=0.02;
v[i][j]=0.02;
}
}
double **u_d, **v_d;
int d_size = N * M * sizeof(double);
cudaMalloc(&u_d, d_size);
cudaMalloc(&v_d, d_size);
cudaMemcpy(u_d, u, d_size, cudaMemcpyHostToDevice);
cudaMemcpy(v_d, v, d_size, cudaMemcpyHostToDevice);
my_kernel<<<grids, blocks>>> (v_d,u_d);
cudaDeviceSynchronize();
cudaMemcpy(v, v_d, d_size, cudaMemcpyDeviceToHost);
cudaMemcpy(u, u_d, d_size, cudaMemcpyDeviceToHost);
cout<<"u: "<<u[512][512]<<" v: "<<v[512][512]<<endl;
return 0;
}
What I expect from the second version is :
u: 0.2815 v: 1.7581
Your two-dimensional array, in the first version of the program, is implemented as an array of pointers, each of which points to a separately allocated array of double values.
In your second version you use the same pointer-to-pointer-to-double type, but you are not allocating any space for the actual data, just for the array of pointers. You are also not copying any of the data to the GPU, just the pointers, which are useless to copy anyway since they point to host-side memory.
What is most likely happening is that your kernel attempts to access memory at an invalid address, and its execution is aborted.
If you were to check for errors properly, as @njuffa noted, you would know that this is what happened.
Now, you could avoid having to make multiple memory allocations if you used a single data area instead of a separate allocation for each second-dimension 1D array; that is true for both the first and the second version of your program. That would not quite be array flattening. See an explanation of how to do this (C-language style) on this page.
Note, however, that the double dereferencing, which you insist on performing in your kernel, is likely slowing it down significantly.
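The "single data area" idea can be sketched on the host side in C++ as follows. Grid2D and its member names are hypothetical, introduced only for illustration; the sketch covers the host allocation only and is not a full fix for the CUDA program.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: one contiguous block holds all values, and a small
// table of row pointers into that block keeps the u[i][j] syntax working.
class Grid2D {
public:
    Grid2D(std::size_t rows, std::size_t cols, double init)
        : data_(rows * cols, init), row_ptrs_(rows) {
        for (std::size_t i = 0; i < rows; ++i)
            row_ptrs_[i] = data_.data() + i * cols;  // start of row i
    }

    // Copying would leave the row pointers aimed at the old block.
    Grid2D(const Grid2D&) = delete;
    Grid2D& operator=(const Grid2D&) = delete;

    double* operator[](std::size_t i) { return row_ptrs_[i]; }

    // The whole payload as one block, e.g. for a single bulk copy.
    double* data() { return data_.data(); }
    std::size_t size_bytes() const { return data_.size() * sizeof(double); }

private:
    std::vector<double> data_;
    std::vector<double*> row_ptrs_;
};
```

With this layout, data() and size_bytes() describe exactly one block that a single cudaMemcpy could transfer. The row pointer table holds host addresses, so under this sketch an equivalent table would have to be rebuilt from device addresses before a kernel could index u[i][j].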

Passing SEXP objects between C functions using PROTECT and UNPROTECT correctly

I have an issue with the Valgrind check on CRAN for a package. While I cannot reproduce the issue, I suspect I know what the issue is. A simplified example is this C++ file:
#include <Rcpp.h>
using namespace Rcpp;
// what I currently do
// call a R function with 1 argument
SEXP do_work1(SEXP x, SEXP fn, SEXP env){
SEXP R_fcall, out;
PROTECT(R_fcall = Rf_lang2(fn, x));
PROTECT(out = Rf_eval(R_fcall, env));
UNPROTECT(2);
return out;
}
bool not_ok(SEXP x, R_len_t const ex_len){
return !Rf_isReal(x) or !Rf_isVector(x) or Rf_xlength(x) != ex_len or
Rf_isNull(x);
}
// [[Rcpp::export(rng = false)]]
double use_do_worka(SEXP x, SEXP fn, SEXP env){
SEXP res = do_work1(x, fn, env);
CharacterVector what("y");
SEXP y = Rf_getAttrib(res, what);
if(not_ok(res, 2) or not_ok(y, 1))
throw std::invalid_argument("not ok");
double out = *REAL(res) + REAL(res)[1] + *REAL(y);
return out;
}
// what I could do instead?
// [[Rcpp::export(rng = false)]]
double use_do_workb(SEXP x, SEXP fn, SEXP env){
SEXP res = PROTECT(do_work1(x, fn, env)); // added PROTECT
CharacterVector what("y");
SEXP y = PROTECT(Rf_getAttrib(res, what)); // added PROTECT
if(not_ok(res, 2) or not_ok(y, 1)){
UNPROTECT(2); // added UNPROTECT
throw std::invalid_argument("not ok");
}
double out = *REAL(res) + REAL(res)[1] + *REAL(y);
UNPROTECT(2); // added UNPROTECT
return out;
}
// or maybe?
SEXP do_work2(SEXP x, SEXP fn, SEXP env){
SEXP R_fcall, out;
PROTECT(R_fcall = Rf_lang2(fn, x));
PROTECT(out = Rf_eval(R_fcall, env));
// removed UNPROTECT
return out;
}
// [[Rcpp::export(rng = false)]]
double use_do_workc(SEXP x, SEXP fn, SEXP env){
SEXP res = do_work2(x, fn, env);
CharacterVector what("y");
SEXP y = PROTECT(Rf_getAttrib(res, what)); // added PROTECT
if(not_ok(res, 2) or not_ok(y, 1)){
UNPROTECT(3); // added UNPROTECT
throw std::invalid_argument("not ok");
}
double out = *REAL(res) + REAL(res)[1] + *REAL(y);
UNPROTECT(3); // added UNPROTECT
return out;
}
/*** R
f <- function(x) {
x1 <- x[1]
x2 <- x[2]
out <- c(-400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1),
200 * (x2 - x1 * x1))
attr(out, "y") <- 100 * (x2 - x1 * x1)^2 + (1 - x1)^2
out
}
for(i in 1:1000000){
use_do_worka(1, f, .GlobalEnv)
use_do_workb(1, f, .GlobalEnv)
use_do_workc(1, f, .GlobalEnv)
}
*/
One can compile this file and run the examples using Rcpp::sourceCpp. do_work1 and use_do_worka are very similar to what I currently do. However, I suspect that res and y may be garbage collected in use_do_worka. Is this true?
If so, then use_do_workb PROTECTs both of them again. Is this a good and correct way to pass SEXP objects between C functions, that is, to re-protect the object returned from another C function?
Lastly, do_work2 and use_do_workc are very similar to use_do_workb but save one PROTECT call. However, this adds the burden that the caller has to remember to UNPROTECT, which I gather can easily lead to a bug. Is this version still valid?
Remarks
I have tried to run
R -d valgrind -e "Rcpp::sourceCpp('[name-of-file-w-above-code].cpp')"
but this does not report any issues. I have also tried R CMD check --use-valgrind on the package, but I cannot reproduce the issue seen on CRAN.
The actual error I get from CRAN is:
==4144090== Invalid read of size 8
==4144090== at 0x483EDED: memcpy@GLIBC_2.2.5 (/builddir/build/BUILD/valgrind-3.16.1/memcheck/../shared/vg_replace_strmem.c:1032)
==4144090== by 0x1736C2AC: copy (packages/tests-vg/psqn/src/../inst/include/lp.h:12)
==4144090== by 0x1736C2AC: r_worker_bfgs::grad(double const*, double*) (packages/tests-vg/psqn/src/r-api.cpp:353)
==4144090== by 0x17367754: PSQN::optim_info PSQN::bfgs<PSQN::R_reporter, PSQN::R_interrupter>(PSQN::problem&, double*, double, unsigned long, double, double, int)::{lambda(double, double*, double*, double*, double&)#4}::operator()(double, double*, double*, double*, double&) const (packages/tests-vg/psqn/src/../inst/include/psqn-bfgs.h:148)
with
packages/tests-vg/psqn/src/../inst/include/lp.h:12.
packages/tests-vg/psqn/src/r-api.cpp:353.
where I know that the vector has the right length which makes me suspect that it has been garbage collected because of the invalid read of size 8 error. The vector is created with very similar code as in my example above.

Two forcings in compiled code - R package deSolve

I am using the deSolve package to run some models that include an external forcing. To gain speed, I wrote compiled code following the package vignette (see section 6.2 in https://cran.r-project.org/web/packages/deSolve/vignettes/compiledCode.pdf). My problem is that I now want to introduce two external forcings in the compiled code. Does anyone have a working example or know how to do it?
#include <R.h>
static double parms[6];
static double forc[1];
/* A trick to keep up with the parameters and forcings */
#define b parms[0]
#define c parms[1]
#define d parms[2]
#define e parms[3]
#define f parms[4]
#define g parms[5]
#define import forc[0]
/* initializers: */
void odec(void (* odeparms)(int *, double *))
{
int N=6;
odeparms(&N, parms);
}
void forcc(void (* odeforcs)(int *, double *))
{
int N=1;
odeforcs(&N, forc);
}
/* derivative function */
void derivsc(int *neq, double *t, double *y, double *ydot,double *yout, int*ip)
{
if (ip[0] <2) error("nout should be at least 2");
ydot[0] = import - b*y[0]*y[1] + g*y[2];
ydot[1] = c*y[0]*y[1] - d*y[2]*y[1];
ydot[2] = e*y[1]*y[2] - f*y[2];
yout[0] = y[0] + y[1] + y[2];
yout[1] = import;
}
Thanks
Oh, it was easy to solve. I just defined a second forcing and updated the counts accordingly. I modified the previous example as follows:
static double parms[6];
static double forc[2];
/* A trick to keep up with the parameters and forcings */
#define b parms[0]
#define c parms[1]
#define d parms[2]
#define e parms[3]
#define f parms[4]
#define g parms[5]
#define import forc[0]
#define import2 forc[1]
/* initializers: */
void odec(void (* odeparms)(int *, double *))
{
int N=6;
odeparms(&N, parms);
}
void forcc(void (* odeforcs)(int *, double *))
{
int N=2;
odeforcs(&N, forc);
}
/* derivative function */
void derivsc(int *neq, double *t, double *y, double *ydot,double *yout, int*ip)
{
if (ip[0] <2) error("nout should be at least 2");
ydot[0] = import2 - b*y[0]*y[1] + g*y[2];
ydot[1] = c*y[0]*y[1] - d*y[2]*y[1];
ydot[2] = e*y[1]*y[2] - f*y[2];
yout[0] = y[0] + y[1] + y[2];
yout[1] = import;
}

Exposed Rcpp defined C++ function crashes R session, code snippets working before. Function exposition wrong?

I have C++ code compiled with Rcpp where I define a few functions that are then called in a function exported with the // [[Rcpp::export]] tag. The code compiles fine, but calling the exported function results in a fatal crash of my R session, which terminates immediately.
What mystifies me is that the code ran fine yesterday when I executed it up to the line VectorXd z = y_luet - kroneckerProduct(X_luet.transpose(), MatrixXd::Identity(p, p)) * r; and returned the vector z. Now neither that nor the full code shown below works.
I have also done my homework: I tested all functions individually by exporting them to R with the same technique and checking them against their slower R counterparts, obtaining numerically identical results (at greater speed).
I am wondering whether the 'define a few functions and then use them in a bigger function' approach is simply not appropriate once the tasks become a little bigger.
The data themselves are moderate by Eigen's standards: dat is a matrix with 200 rows and 2 columns, and everything else is low-dimensional, with the maximum of (rows, columns) not exceeding 12, i.e. the second-largest matrix is 12 by 1.
I am using the most recent versions of Rtools and Rcpp.
The code implements a simple iterated generalised least squares estimator, as is common in statistics and econometrics.
Edit
Here is some sample data in R format that should get a minimal working example:
params <- .963
G <- matrix(c(1,0),nrow = 2)
G_perp <- matrix(c(0,1),nrow = 2)
mat_Lambda_lu <- matrix(0.95,nrow=1)
dat <- matrix(c(0,0,0,-0.79642284,-1.36694331,-1.18267593,-1.48827199,0.12549353,3.03343410,7.36256542,0,0,0,0.11282054,0.24798861,0.32448004,-0.27283699,-1.2462477,-0.0104694,3.21067339), nrow = 10, ncol = 2)
k<-3
no_ur <- 1
maxiter <- 100 #or something small for conserving memory
mini <- TRUE
The above should be executed in an R environment for it to work. Please let me know if it doesn't work or if there are issues.
Here is the code:
// [[Rcpp::depends(RcppEigen)]]
#include <Rcpp.h>
#include <RcppEigen.h>
#include <cmath>
#include <cstdlib>
#include <Eigen/Dense>
#include <unsupported/Eigen/src/MatrixFunctions/MatrixPower.h>
#include <unsupported/Eigen/src/KroneckerProduct/KroneckerTensorProduct.h>
using namespace Rcpp;
using namespace Eigen;
//using Eigen::Map; // 'maps' rather than copies
//using Eigen::MatrixXd; // variable size matrix, double precision
//using Eigen::VectorXd; // variable size vector, double precision
MatrixXd makeXluet(MatrixXd dat, int k, int p, int T) {
MatrixXd X_luet(p*k, T - k);
for (int i = k; i > 0; i--)
{
X_luet.block((k-i)*p,0,p,T-k) = dat.block(i - 1, 0, T - k, p).transpose();
}
return X_luet;
}
MatrixXd makeRLuTilde(MatrixXd Rlu, MatrixXd LambdaLu, int k, int p, int q) {
MatrixXd RLuTilde(p*k, q);
MatrixPower<MatrixXd> Apow(LambdaLu);
for (int i = k; i > 0; i--) {
RLuTilde.block((k - i)*p, 0, p, q) = Rlu * Apow(i-1);
}
return RLuTilde;
}
VectorXd GLSEstimateFast(MatrixXd Xluet, MatrixXd Sigma_u, MatrixXd R, VectorXd z, int T, int k) {
return (R.transpose() * kroneckerProduct(Xluet * Xluet.transpose(), Sigma_u.inverse()) * R).inverse() * R.transpose() * kroneckerProduct(Xluet, Sigma_u.inverse()) * z;
}
MatrixXd ResMaker(MatrixXd Xluet, MatrixXd Yluet, VectorXd beta, const int k, const int p) {
Map<MatrixXd> A(beta.data(), p, k*p);
return Yluet - A * Xluet;
}
double GLSCriterion(MatrixXd res, MatrixXd Sigma_u, const int k, const int p, const int T) {
MatrixXd Lp = Sigma_u.inverse().llt().matrixL().transpose();
MatrixXd v = Lp * res;
Map<VectorXd> v2(v.data(), v.size());
return (1 / static_cast<double>(T)) * v2.transpose() * v2;
}
MatrixXd CovEstFast(MatrixXd res, const int T) {
return (1 / static_cast<double>(T)) * res * res.transpose();
}
double likeli_h(MatrixXd CovEstHat, const int T) {
return (-0.5)*static_cast<double>(T) * log(CovEstHat.determinant());
}
// [[Rcpp::export]]
double restricted_iterated_ml_cpp(Map<VectorXd> params, Map<MatrixXd> G, Map<MatrixXd> G_perp, Map<MatrixXd> mat_Lambda_lu, Map<MatrixXd> dat, const int k, const int no_ur, const int maxiter, bool mini) {
const int p = dat.cols();
const int T = dat.rows();
int p2 = static_cast<int>(pow(p, 2));
int iter = 0;
MatrixXd X_luet = makeXluet(dat, k, p, T);
MatrixXd Y_luet = dat.bottomRows(T-k).transpose();
Map<MatrixXd> D(params.data(), p - no_ur, no_ur);
MatrixXd R_lu = G + G_perp * D;
MatrixXd R_lu_tilde = makeRLuTilde(R_lu, mat_Lambda_lu, k, p, no_ur);
MatrixXd C = kroneckerProduct(R_lu_tilde.transpose(), MatrixXd::Identity(T - k, T - k));
MatrixXd C1 = C.topLeftCorner(no_ur*p, no_ur*p);
MatrixXd C2 = C.block(0, no_ur*p, no_ur*p, C.cols() - (no_ur*p));
MatrixPower<MatrixXd> Llupow(mat_Lambda_lu);
MatrixXd mat_cee = R_lu * Llupow(k);
Map<VectorXd> cee(mat_cee.data(), mat_cee.size());
MatrixXd R(no_ur*p + k * p2 - (no_ur * p), k*p2 - (no_ur * p));
R << static_cast<double>(-1) * C1.inverse()*C2,
MatrixXd::Identity(k*p2-no_ur*p, k*p2 - (no_ur * p));
VectorXd r(k * p2);
r << C1.inverse() * cee,
MatrixXd::Zero(k * p2 - (no_ur * p), 1);
Map<VectorXd> y_luet(Y_luet.data(), Y_luet.size());
VectorXd z = y_luet - kroneckerProduct(X_luet.transpose(), MatrixXd::Identity(p, p)) * r;
MatrixXd Sigma_u = MatrixXd::Identity(p, p);
VectorXd gamma = GLSEstimateFast(X_luet, Sigma_u, R, z, T, k);
VectorXd beta = R * gamma + r;
MatrixXd res = ResMaker(X_luet, Y_luet, beta, k, p);
double crit_old = GLSCriterion(res, Sigma_u, k, p, T);
double crit_new = crit_old;
do
{
crit_old = crit_new;
Sigma_u = CovEstFast(res, T);
gamma = GLSEstimateFast(X_luet, Sigma_u, R, z, T, k);
beta = R * gamma + r;
res = ResMaker(X_luet, Y_luet, beta, k, p);
crit_new = GLSCriterion(res, Sigma_u, k, p, T);
iter++;
} while ((iter<maxiter) && (crit_old-crit_new>0.001));
double ll = likeli_h(Sigma_u, T);
if (mini) {
ll = static_cast<double>(-1)*ll;
}
return ll;
}

Calculate the middle of two unit length 3D vectors without a square root?

With two 3D, unit length vectors, is there a way to calculate a unit length vector in-between these, without re-normalizing? (more specifically without a square root).
Currently I'm just adding them both and normalizing, but for efficiency I thought there might be some better way.
(for the purpose of this question, ignore the case when both vectors are directly opposite)
It's not an answer to the original question; I am rather trying to resolve the issues between two answers and it wouldn't fit into a comment.
The trigonometric approach is 4x slower than your original version with the square-root function on my machine (Linux, Intel Core i5). Your mileage will vary.
The asm (""); is always a code smell, as are its siblings volatile and (void) x.
Running a tight loop many-many times is a very unreliable way of benchmarking.
What to do instead?
Analyze the generated assembly code to see what the compiler actually did to your source code.
Use a profiler. I can recommend perf or Intel VTune.
If you look at the assembly code of your micro-benchmark, you will see that the compiler is very smart and figured out that v1 and v2 are not changing and eliminated as much work as it could at compile time. At runtime, no calls were made to sqrtf or to acosf and cosf. That explains why you did not see any difference between the two approaches.
Here is an edited version of your benchmark. I scrambled it a bit and guarded against division by zero with 1.0e-6f. (It doesn't change the conclusions.)
#include <stdio.h>
#include <math.h>
#ifdef USE_NORMALIZE
#warning "Using normalize"
void mid_v3_v3v3_slerp(float res[3], const float v1[3], const float v2[3])
{
float m;
float v[3] = { (v1[0] + v2[0]), (v1[1] + v2[1]), (v1[2] + v2[2]) };
m = 1.0f / sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2] + 1.0e-6f);
v[0] *= m;
v[1] *= m;
v[2] *= m;
res[0] = v[0];
res[1] = v[1];
res[2] = v[2];
}
#else
#warning "Not using normalize"
void mid_v3_v3v3_slerp(float v[3], const float v1[3], const float v2[3])
{
const float dot_product = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
const float theta = acosf(dot_product);
const float n = 1.0f / (2.0f * cosf(theta * 0.5f) + 1.0e-6f);
v[0] = (v1[0] + v2[0]) * n;
v[1] = (v1[1] + v2[1]) * n;
v[2] = (v1[2] + v2[2]) * n;
}
#endif
int main(void)
{
unsigned long long int i = 20000000;
float v1[3] = {-0.8659117221832275, 0.4995948076248169, 0.024538060650229454};
float v2[3] = {0.7000154256820679, 0.7031427621841431, -0.12477479875087738};
float v[3] = { 0.0, 0.0, 0.0 };
while (--i) {
mid_v3_v3v3_slerp( v, v1, v2);
mid_v3_v3v3_slerp(v1, v, v2);
mid_v3_v3v3_slerp(v1, v2, v );
}
printf("done %f %f %f\n", v[0], v[1], v[2]);
return 0;
}
I compiled it with gcc -ggdb3 -O3 -Wall -Wextra -fwhole-program -DUSE_NORMALIZE -march=native -static normal.c -lm and profiled the code with perf.
The trigonometric approach is 4x slower, and that is because of the expensive cosf and acosf functions.
I have tested the Intel C++ Compiler as well: icc -Ofast -Wall -Wextra -ip -xHost normal.c; the conclusion is the same, although gcc generates approximately 10% slower code (for -Ofast as well).
I wouldn't even try to implement an approximate sqrtf: It is already an intrinsic and chances are, your approximation will only be slower...
Having said all this, I don't know the answer to the original question. I thought about it, and I also suspect that there might be another way that doesn't involve the square-root function.
Interesting question in theory; in practice, I doubt that getting rid of that square-root would make any difference in speed in your application.
First, find the angle between your two vectors. From the principle of scalar projection, we know that
|a| * cos(theta) = a . b_hat
. being the dot product operator, |a| being the length of a, theta being the angle between a and b, and b_hat being the normalized form of b.
In your situation, a and b are already unit vectors, so this simplifies to:
cos(theta) = a . b
which we can rearrange to:
theta = acos(a . b)
Lay vectors A and B end to end, and complete the triangle by drawing a line from the start of the first vector to the end of the second. Since two sides are of equal length, we know the triangle is isosceles, so it's easy to determine all of the angles if you already know theta.
That line with length N is the middle vector. We can normalize it if we divide it by N.
From the law of sines, we know that
sin(theta/2)/1 = sin(180-theta)/N
Which we can rearrange to get
N = sin(180-theta) / sin(theta/2)
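Note that you'll divide by zero when calculating N if A and B are equal, so it may be useful to check for that corner case before starting. The angles here are written in degrees; if theta comes from an acos that returns radians, as in most math libraries, use pi in place of 180.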
Summary:
dot_product = a.x * b.x + a.y * b.y + a.z * b.z
theta = acos(dot_product)
N = sin(180-theta) / sin(theta/2)
middle_vector = [(a.x + b.x) / N, (a.y + b.y) / N, (a.z + b.z) / N]
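As a cross-check of the summary, here is a small C++ sketch that computes the middle vector both ways. The names Vec3, mid_normalize and mid_trig are made up for this illustration, and angles are in radians. Since sin(pi - theta) = sin(theta) = 2 sin(theta/2) cos(theta/2), N reduces to 2 cos(theta/2), which is the form used in the benchmark code in this thread.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Middle vector by adding and re-normalizing (one square root).
Vec3 mid_normalize(const Vec3& a, const Vec3& b) {
    const Vec3 s{{a[0] + b[0], a[1] + b[1], a[2] + b[2]}};
    const double len = std::sqrt(s[0] * s[0] + s[1] * s[1] + s[2] * s[2]);
    return Vec3{{s[0] / len, s[1] / len, s[2] / len}};
}

// Middle vector by the law-of-sines route from the summary above.
// Assumes a and b are unit length and not equal (theta > 0); a dot
// product pushed slightly outside [-1, 1] by rounding is not handled.
Vec3 mid_trig(const Vec3& a, const Vec3& b) {
    const double pi = std::acos(-1.0);
    const double dot = a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    const double theta = std::acos(dot);
    assert(theta > 0.0);  // equal vectors would divide by zero below
    const double n = std::sin(pi - theta) / std::sin(theta / 2.0);
    return Vec3{{(a[0] + b[0]) / n, (a[1] + b[1]) / n, (a[2] + b[2]) / n}};
}
```

Calling both functions on the same pair of unit vectors should give the same result up to rounding, which makes it clear that the trigonometric route trades one square root for an acos and two sin calls.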
Based on the answers I made some speed comparison.
Edit: with this naive benchmark, GCC optimizes out the trigonometry, yielding approximately the same speed for both methods. Read @Ali's post for a more complete explanation.
In summary, re-normalizing is approximately 4x faster.
#include <stdio.h>
#include <math.h>
/* gcc mid_v3_v3v3_slerp.c -lm -O3 -o mid_v3_v3v3_slerp_a
* gcc mid_v3_v3v3_slerp.c -lm -O3 -o mid_v3_v3v3_slerp_b -DUSE_NORMALIZE
*
* time ./mid_v3_v3v3_slerp_a
* time ./mid_v3_v3v3_slerp_b
*/
#ifdef USE_NORMALIZE
#warning "Using normalize"
void mid_v3_v3v3_slerp(float v[3], const float v1[3], const float v2[3])
{
float m;
v[0] = (v1[0] + v2[0]);
v[1] = (v1[1] + v2[1]);
v[2] = (v1[2] + v2[2]);
m = 1.0f / sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
v[0] *= m;
v[1] *= m;
v[2] *= m;
}
#else
#warning "Not using normalize"
void mid_v3_v3v3_slerp(float v[3], const float v1[3], const float v2[3])
{
const float dot_product = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
const float theta = acosf(dot_product);
const float n = 1.0f / (2.0f * cosf(theta * 0.5f));
v[0] = (v1[0] + v2[0]) * n;
v[1] = (v1[1] + v2[1]) * n;
v[2] = (v1[2] + v2[2]) * n;
}
#endif
int main(void)
{
unsigned long long int i = 10000000000;
const float v1[3] = {-0.8659117221832275, 0.4995948076248169, 0.024538060650229454};
const float v2[3] = {0.7000154256820679, 0.7031427621841431, -0.12477479875087738};
float v[3];
while (--i) {
asm (""); /* prevent compiler from optimizing the loop away */
mid_v3_v3v3_slerp(v, v1, v2);
}
printf("done %f %f %f\n", v[0], v[1], v[2]);
return 0;
}
