Should exp2 be faster than exp?

I'm mostly interested in the "exp" and "exp2" functions in C/C++, but this question is probably more related to the IEEE 754 standard than specific language features.
In a homework problem I did some 10 years ago, which tried to rank different floating-point operations by the cycles needed, the C function
double exp2 (double)
appeared to be slightly faster than
double exp (double)
Given that "double" uses a binary representation for the mantissa, I feel this result is reasonable.
Today, however, after testing the two again in several different ways, I could not see any measurable difference. So my questions are:
Should exp2 be (theoretically) faster than exp?
Should there be any measurable difference?
Has the answer changed in recent years?

On a number of platforms that don't take much care with their math library, exp2(x) is simply implemented as exp(x * log(2)), or vice versa. These implementations do not deliver good accuracy (or especially good performance), but they are fairly common. On platforms that do this, the two functions cost the same apart from one extra multiply, and whichever gets the extra multiply will be the slower of the two.
On platforms that aggressively tune the math library and try to deliver good accuracy, the two functions are very similar in performance. Generating the exponent of the result is easier with exp2, but getting a high-accuracy significand can require slightly more work; the two factors roughly even out, so the performance of the two is usually within 10-15% of each other. Speaking very broadly, exp2 is usually the faster of the two.
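To make the "one function built on the other" case concrete, here is a minimal sketch of what such a low-effort library might do. The names exp2_via_exp and exp_via_exp2 and the constant kLn2 are hypothetical, chosen only for illustration; no particular libm is claimed to look like this.

```cpp
#include <cmath>

// ln(2), written out so the sketch does not rely on the non-standard M_LN2.
constexpr double kLn2 = 0.693147180559945309417232121458176568;

// exp2 built on top of exp: 2^x = e^(x * ln 2). Costs one extra multiply.
double exp2_via_exp(double x) { return std::exp(x * kLn2); }

// exp built on top of exp2: e^x = 2^(x / ln 2). Costs one extra divide
// (or a multiply by a precomputed 1/ln 2).
double exp_via_exp2(double x) { return std::exp2(x / kLn2); }
```

The rounding error of the scaled argument is amplified by the exponential, which is one reason these wrappers lose accuracy for large |x|.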

I made some measurements; I hope some of you will find them useful.
Conditions:
Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (the server had high CPU load during the test)
Compiler version: g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Compiler options: -static -std=gnu++0x -ffast-math -Ofast -flto
The code:
#include <iostream>
#include <random>
#include <cmath>
#include <chrono>
using namespace std;
int main()
{
    double g = 1/log(2);
    mt19937 engine(1000);
    uniform_real_distribution<double> u(0, 1);
    double sum = 0;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1e7/4; ++i) // for non-parallel, `for (int i = 0; i < 1e7; ++i)`
    {
        sum += exp2(u(engine)*g); // for exp versions, sum += exp(u(engine)); for empty versions, sum += u(engine)*g;
        sum += exp2(u(engine)*g); // removed for non-parallel
        sum += exp2(u(engine)*g); // removed for non-parallel
        sum += exp2(u(engine)*g); // removed for non-parallel
    }
    auto end = std::chrono::high_resolution_clock::now();
    cout << chrono::duration_cast<chrono::nanoseconds>(end - begin).count()/1000./1000 << "ms" << "\t"
         << sum << "\t" << g << " exp2 p4" << endl;
}
Execution with:
for i in {1..100}; do ./empty.bin && ./exp2_p4.bin && ./exp_p4.bin && ./exp.bin && ./exp2.bin; done
where the file name tells whether the executable calls exp or exp2, and whether the summation is grouped by 4 (p4) or not.
Results
The table below shows the average runtime (time), the standard deviation in ms, and the fastest case.
| name | time (ms) | std (ms) | smallest (ms) |
|:-------:|:---------:|:--------:|:-------------:|
| empty | 244.7 | 26.2 | 130.9 |
| exp | 591.7 | 95.8 | 422.5 |
| exp2 | 536.5 | 85.4 | 393.7 |
| exp p4 | 612.3 | 89.6 | 433.2 |
| exp2 p4 | 557.2 | 87.6 | 396.8 |
To get the time of a single operation, divide by 1e7. I approximate the cost of the exponential by subtracting the time of the empty version (i.e. the loop and summation without calculating the exponential) from the times of the exponential versions.
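The subtraction just described can be written as a small helper. This is only a sketch of the arithmetic; the function name per_call_cost_ns is hypothetical and is not part of the benchmark code above.

```cpp
// Estimated cost of one call in nanoseconds: subtract the time of the
// empty loop from the measured time, then spread the rest over all calls.
// Times are in milliseconds, as in the table (1 ms = 1e6 ns).
double per_call_cost_ns(double total_ms, double baseline_ms, double calls) {
    return (total_ms - baseline_ms) * 1e6 / calls;
}
```

For example, per_call_cost_ns(591.7, 244.7, 1e7) uses the mean exp and empty times from the table. Given the large standard deviations, any such figure is a rough estimate.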
Conclusion
exp2 can be around 11% faster than exp on Intel Xeon with gcc even if -ffast-math is on, in agreement with the accepted answer.
The manual loop unrolling by grouping the summation into a group of four does not help.

Should exp2 be (theoretically) faster than exp?
Yes.
The only way for the x86 FPU to perform exponentiation with a non-integer power is the F2XM1 instruction, which calculates 2^x - 1.
No e^x instruction exists on x86.
C library code that relies on the x87 FPU therefore has to calculate both exp and exp2 using 2^x.
Should there be any measurable differences?
No.
The difference is only a single FPU multiplication, which is very fast on modern processors.
Has the answer changed in recent years?
Yes.
15-20 years ago the cost of a multiplication was much higher than the cost of other operations. Nowadays, multiplication is almost as cheap as addition.

Trying to understand Amdahl's Law

I am trying to answer a school assignment, but I am confused about what the question is asking.
A design optimization was applied to a computer system in order to increase the performance of a given
execution mode by a factor of 10. The optimized mode is used 50% of the time, measured as a percentage
of the execution time after the optimization has been applied.
(a) What is the global speedup value that is achieved with this optimization?
Reminder: Amdahl's law defines the global speedup as a function of the optimized fraction before the optimization is applied. As a consequence, the 50% ratio cannot be directly used to evaluate this
speedup value.
(b) What is the percentage of the original execution time that is affected by this optimization?
(c) How much should such an execution mode be optimized in order to achieve a global speedup of 5? Can a global speedup of 12 be achieved? And 11?
When trying to calculate the answer to (a), I came to 1.81 (20/11):
T' = 0.5 * T + 0.5 * T / 10 = T / 2 + ( 1 / 20 ) * T = ( 11 / 20 ) * T
Speedup = T / T' = T / ( ( 11 / 20 ) * T ) = 20 / 11 = 1.81
For me this answer makes sense, but the professor's solutions say otherwise:
(a) 5.5
(b) 91%
(c) Yes, it can, with an optimization by a factor of 25 / 3. No, because the factor can't be negative, so it is impossible. Also no, because ∞ optimization → impossible.
I can't solve the other ones because I am confused by the first one.
Why is 5.5 the correct answer?

Let's suppose a computer has two states A and B, and after whatever optimization, it spends 0 ≤ p ≤ 1 of its time in state A, and q = 1 - p of its time in state B. (So p is something like .5, or .27).
State A was sped up by a factor of X. State B was sped up by a factor of Y.
So before, it was spending p * X + q * Y units of time on work that it can now do in p + q = 1 unit of time. So its speedup is p * X + q * Y.
Applying this to the problem you were given:
p = q = .5, X=10, Y=1 (no speedup).
10 * (.5) + 1 * (.5) = 5.5
This easily generalizes.

After optimization, time = x minutes optimized mode + x minutes other = 2x.
Before optimization, time = 10x minutes unoptimized mode + x minutes other = 11x.
Speedup = 11x/2x = 5.5
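The same bookkeeping can be sketched in code, which also covers parts (b) and (c) of the assignment. All function names here are hypothetical; the formulas are the "after" weighting used in the answers above plus the standard form of Amdahl's law, S = 1 / ((1 - f) + f / k), where f is the fraction of the original time and k the local speedup factor.

```cpp
// Global speedup when the optimized mode takes fraction `p_after` of the
// run time AFTER the optimization and was sped up by `factor`.
double speedup_from_after_fraction(double p_after, double factor) {
    return p_after * factor + (1.0 - p_after);
}

// Fraction of the ORIGINAL run time spent in the mode that was optimized.
double original_fraction(double p_after, double factor) {
    return p_after * factor / speedup_from_after_fraction(p_after, factor);
}

// Amdahl's law for a mode covering `f_before` of the original run time.
double amdahl_speedup(double f_before, double factor) {
    return 1.0 / ((1.0 - f_before) + f_before / factor);
}

// Local factor needed to reach `target`; returns -1 when the target is
// above the limit 1 / (1 - f_before) and therefore unreachable.
double required_factor(double f_before, double target) {
    double remaining = 1.0 / target - (1.0 - f_before);
    return remaining > 0.0 ? f_before / remaining : -1.0;
}
```

With p_after = 0.5 and factor = 10 this gives a speedup of 5.5 and an original fraction of 10/11 (about 91%). A target of 5 needs a factor of 25/3, and a target of 12 is reported as unreachable. A target of exactly 11 sits on the limit, where floating-point rounding makes the result of required_factor unreliable; mathematically it needs an infinite factor.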

I love Amdahl's argument, incl. "improvers", so let's start from facts.
I will not answer the assignment questions directly, but will help you learn the know-why, which, to my deepest belief and after decades of joy working with the most skilled people, is the core of what education should promote.
( introducing text, decomposed )
A design optimization was applied to a COMPUTER SYSTEM ___ [Fig.1:A]
in order
to increase the performance
of a given
EXECUTION MODE_________________ [Fig.1:B]
by a FACTOR
of 10._________________________ [Fig.1:C]
Fig.1 :
BEFORE
+------------------------------------------------------------A: SYSTEM
| +----------------------------------------------------B |
| | | |
| | | |
| | | |
| +----------------------------------------------------+ |
+--:----------------------------------------------------:----+
: :
: :
: C: FACTOR ~ 10 x_________________________/
: /
AFTER : /
+--:--------/--A*
| +------B* |
| | 10x | |
| | less | |
| | time | |
| +123456+ |
+12+------+3456+
D: in smarter, optimised "EXECUTION MODE",
the 50% was duration of the said EXECUTION MODE, whereas
50% was duration of the original, not modified, part
( ... text continued, decomposed )
The optimized mode is used 50% of the TIME,__________ [FACT Fig.1:D]
measured
as
a percentage of the execution time
AFTER the optimization
has been applied.
( ... first question, decomposed )
(a) What is the global SPEEDUP value
that is achieved
with ( AFTER )
this optimization?
Remind: Amdahl’s law defines the global speedup as a function of the optimized fraction before the optimization is applied. As a consequence, the 50% ratio cannot be directly used to evaluate this speedup value.
( ... second question )
(b) What is the percentage of the original execution time that is affected by this optimization?
full-A-duration ~ 10 x duration-of-B* // == duration-of-B as was BEFORE
+ 1 x duration-of-B* // == duration-of-( A - B ) as is
// == duration-of-( A*- B*) the same
( ref: FACT [Fig.1:D] )
Since here, the classics apply.
--- Just do not forget what to compare to what, and keep in mind that one and the very same word may bear quite different actual meanings: compare the argument in the original paper by Dr. Gene M. AMDAHL ( IBM Research ) with E. BARSIS' ( Sandia Natl. Labs ) "scaled speedup" and the later speedup presented by John L. GUSTAFSON ( reversed optics, or the "opposite point of view" ). All use the same word S-P-E-E-D-U-P, yet their respective definitions differ ( and a lot ). You might like to read the very original, authentic paper by Dr. Gene M. AMDAHL to see the actual wording of the argument as archived in FAQs; the file is in section "FAQ part 20: IBM and Amdahl", where the paper is at the very bottom of that text. Alan KARP's prize ( and also its winners ) is also a delightful part of this piece of computing history :o)
( ... third, fourth and fifth questions )
(c) How much should such EXECUTION MODE ( improving just the block B-to-B* ) be optimized in order to achieve a global speedup of 5?
Can a global speedup of 12 be achieved? And 11?
( A global speedup is here not restricted to touching only B, so one can be smart in improving A-to-A* :P The professor will either accept and warmly appreciate your skills and insightful argumentation in doing this, or punish you for daring to use the crystal-clear logic of the task to the limits the text did not prohibit ;) -- [ SAFETY WARNING ] best not to use this strategy on auto-grader(s) or Artificial-"Intelligence"-powered grading bots... for obvious reasons these rigid, pre-wired or LSqE-penalised algorithms will hardly award you any extra points for innovative thinking, as thinking is "not included" there, while batteries might have been, or might not. )

Counting the number of restricted Integer partitions

Original problem:
Let N be a positive integer (actually, N <= 2000) and P - set of all possible partitions of the N, where with and . Let A be the number of partitions . Find the A.
Input: N. Output: A - the number of partitions .
What have I tried:
I think that this problem can be solved by a dynamic-programming algorithm. Let p(n,a,b) be the function that returns the number of partitions of n using only the numbers a..b. Then we can compute A with code like:
int Ans = 2; // the 1+1+...+1=N & N=N partitions
for(int a = 2; a <= N/2; a += 1){ // a - from 2 to N/2
    int b = a*2-1;
    Ans += p[N][a][b]; // add all partitions using a..b to Answer
    if(a < (a-1)*2-1){ // if a < previous b [ (a-1)*2-1 ]
        Ans -= p[N][a][(a-1)*2-1]; // then we counted number of partitions
    }                              // using numbers a..prev_b twice.
}
Next I tried to find a dynamic-programming algorithm that computes p(n,a,b) for any integers a <= b <= n. This paper (.pdf) provides the following algorithm:
, where I(n<=b) = 1 if n<=b and 0 otherwise.
Question(s):
How should I implement the algorithm from the paper? I'm new to DP problems and, as far as I can see, this problem has 3 dimensions (n, a and b), which is quite tricky for me.
How does that algorithm actually work? I know how the algorithms for computing p(n,0,b) or p(n,a,n) work, but a short explanation for p(n,a,b) would be very helpful.
Does the original problem have a simpler solution? I'm quite sure that there's another clean solution, but I haven't found it.
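For a single (a, b) pair, p(n,a,b) can be computed without a 3D table. The sketch below is not the algorithm from the paper; it is the standard "coin change" style count, where the allowed parts a..b play the role of coins. The function name count_partitions is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Number of partitions of n whose parts all lie in [a, b].
// ways[t] holds the number of partitions of t using the parts seen so far.
// Note: 64-bit counts can overflow for large n with a wide [a, b] range.
std::uint64_t count_partitions(int n, int a, int b) {
    if (n < 0 || a < 1 || a > b) return 0;
    std::vector<std::uint64_t> ways(n + 1, 0);
    ways[0] = 1;  // the empty partition
    for (int part = a; part <= b; ++part) {
        // Ascending totals let the same part be used any number of times.
        for (int total = part; total <= n; ++total) {
            ways[total] += ways[total - part];
        }
    }
    return ways[n];
}
```

Each call needs O(n) memory and O(n * (b - a + 1)) time. For example, count_partitions(6, 2, 3) counts 2+2+2 and 3+3.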

I calculated all of A(1)-A(600) in 23 seconds with a memoization approach (top-down dynamic programming). The 3D table requires 1.7 GB of memory.
For reference: A(50) = 278, A(200) = 465202, A(600) = 38860513616
N=2000 requires too large a table for a 32-bit environment, and a map-based approach was too slow.
I can make a 2D table of reasonable size, but this approach requires zeroing the table at every iteration of the outer loop, which is slow again.
A(1000) = 107292471486730 in 131 sec. I think that long arithmetic might be needed for larger values to avoid Int64 overflow.

How to get a sum array from array

I'm new to OpenCL.
I know how to sum a 1D array. But my question is how to get an array of running sums from a 1D array in OpenCL.
int a[1000];
int b[1000];
.... //save data to a
for(int i = 0; i < 1000; i++){
    int sum = 0;
    for(int j = 0; j < i; j++){
        sum += a[j];
    }
    b[i] = sum;
}
Any suggestion is welcome.

As others have mentioned, what you want is a parallel prefix sum (your loop computes the exclusive variant, where b[i] does not include a[i]). If you're allowed to use OpenCL 2, it has work-group functions for this. They should have been there from the start because of how often scans are used, so now we have everybody implementing it themselves, often poorly in one way or another.
See Parallel Prefix Sum (Scan) with CUDA for the typical algorithms for teaching this.
At the size you mention (1000 elements) it really makes no sense to use multiple compute units, so you will attack it with a single compute unit: just repeat the loop a few times. With a work-group size of 64-256 you'll have the sums of that many elements very quickly. Building on the work-group functions to get generic reduction functions for any size is left as an exercise for the reader.
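One of the typical teaching algorithms is the Hillis-Steele scan. The sketch below simulates it on the CPU in plain C++ rather than as an OpenCL kernel, so it only illustrates the data flow: every iteration of the inner loop is independent and would be one work-item, and each pass of the outer loop would end with a barrier. The function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Inclusive prefix sum in about log2(n) passes (Hillis-Steele scheme).
// In pass k every element adds the value `stride` = 2^k positions to its left.
std::vector<int> inclusive_scan_hillis_steele(std::vector<int> data) {
    std::vector<int> next(data.size());
    for (std::size_t stride = 1; stride < data.size(); stride *= 2) {
        // On a GPU each i is a separate work-item; double buffering keeps
        // the reads of this pass from seeing the writes of the same pass.
        for (std::size_t i = 0; i < data.size(); ++i) {
            next[i] = (i >= stride) ? data[i] + data[i - stride] : data[i];
        }
        data.swap(next);
    }
    return data;
}
```

The result is the inclusive scan; shifting it right by one element and putting 0 in front gives the exclusive form computed by the loop in the question.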

This is a sequential problem. Expressed in another way
b[0] = 0
b[1] = b[0] + a[0]
b[2] = b[1] + a[1]
...
b[999] = b[998] + a[998]
Therefore, having multiple threads won't help you at all.
The best way of doing that is to use a single CPU, and not OpenCL/CUDA/OpenMP...
This problem is completely different from a reduction, where every step can be divided into 2 smaller steps that can be run in parallel.

Sum all elements in a quadword vector in ARM assembly with NEON

I'm rather new to assembly, and although the ARM information center is often helpful, sometimes the instructions can be a little confusing to a newbie. Basically, what I need to do is sum 4 float values in a quadword register and store the result in a single-precision register. I think the instruction VPADD can do what I need, but I'm not quite sure.

You might try this (it's not in ASM, but you should be able to convert it easily):
float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
return vget_lane_f32(vpadd_f32(r, r), 0);
In ASM it would probably be only VADD and VPADD.
I'm not sure if this is the only method to do this (or the most optimal one), but I haven't figured out or found a better one...
PS. I'm new to NEON too

It seems that you want to get the sum of a certain length of array, and not only four float values.
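In that case, your code will work, but it is far from optimized:
many, many pipeline interlocks
an unnecessary 32-bit addition per iteration
Assuming the length of the array is a multiple of 8 and at least 16:
    vldmia      pSrc!, {q0-q1}      @ load the first 8 floats into two accumulators
    sub         count, count, #8
loop:
    pld         [pSrc, #32]         @ prefetch upcoming data
    vldmia      pSrc!, {q2-q3}      @ next 8 floats (q4-q7 are callee-saved, so avoid them)
    subs        count, count, #8
    vadd.f32    q0, q0, q2
    vadd.f32    q1, q1, q3
    bgt         loop
    vadd.f32    q0, q0, q1          @ merge the two accumulators
    vpadd.f32   d0, d0, d1          @ pairwise add: 4 partial sums -> 2
    vadd.f32    s0, s0, s1          @ final scalar sum in s0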
pld - while being an ARM instruction and not NEON - is crucial for performance. It drastically increases cache hit rate.
I hope the rest of the code above is self explanatory.
You will notice that this version is many times faster than your initial one.

Here is the code in ASM:
vpadd.f32 d1,d6,d7 @ q3 is the register that needs all of its contents summed
vadd.f32 s1,s2,s3 @ now we add the contents of d1 together (the sum)
vadd.f32 s0,s0,s1 @ sum += s1;
I may have forgotten to mention that in C the code would look like this:
float sum = 1.0f;
sum += number1 * number2;
I have omitted the multiplication from this little piece of asm code.

Arbitrary-precision arithmetic Explanation

I'm trying to learn C and have come across the inability to work with REALLY big numbers (i.e., 100 digits, 1000 digits, etc.). I am aware that there exist libraries to do this, but I want to attempt to implement it myself.
I just want to know if anyone has or can provide a very detailed, dumbed down explanation of arbitrary-precision arithmetic.

It's all a matter of adequate storage and algorithms to treat numbers as smaller parts. Let's assume you have a compiler in which an int can only be 0 through 99 and you want to handle numbers up to 999999 (we'll only worry about positive numbers here to keep it simple).
You do that by giving each number three ints and using the same rules you (should have) learned back in primary school for addition, subtraction and the other basic operations.
In an arbitrary precision library, there's no fixed limit on the number of base types used to represent our numbers, just whatever memory can hold.
Addition for example: 123456 + 78:
12 34 56
      78
-- -- --
12 35 34
Working from the least significant end:
initial carry = 0.
56 + 78 + 0 carry = 134 = 34 with 1 carry
34 + 00 + 1 carry = 35 = 35 with 0 carry
12 + 00 + 0 carry = 12 = 12 with 0 carry
This is, in fact, how addition generally works at the bit level inside your CPU.
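The carry loop above can be sketched as follows, assuming the base-100 "digits" from the example are kept in a vector with the least significant one first. The function name add_base100 and the storage layout are choices made for this illustration only.

```cpp
#include <cstddef>
#include <vector>

// Adds two non-negative numbers stored as base-100 limbs (each 0..99),
// least significant limb first, e.g. 123456 is {56, 34, 12}.
std::vector<int> add_base100(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> sum;
    int carry = 0;
    // Keep going while either number has limbs left or a carry is pending.
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry != 0; ++i) {
        int total = carry;
        if (i < a.size()) total += a[i];
        if (i < b.size()) total += b[i];
        sum.push_back(total % 100);  // the part that fits in this limb
        carry = total / 100;         // 0 or 1, passed to the next limb
    }
    return sum;
}
```

Calling add_base100({56, 34, 12}, {78}) walks through exactly the three steps listed above.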
Subtraction is similar (using subtraction of the base type and borrow instead of carry), multiplication can be done with repeated additions (very slow) or cross-products (faster) and division is trickier but can be done by shifting and subtraction of the numbers involved (the long division you would have learned as a kid).
I've actually written libraries to do this sort of stuff using the maximum powers of ten that can be fit into an integer when squared (to prevent overflow when multiplying two ints together, such as a 16-bit int being limited to 0 through 99 to generate 9,801 (<32,768) when squared, or 32-bit int using 0 through 9,999 to generate 99,980,001 (<2,147,483,648)) which greatly eased the algorithms.
Some tricks to watch out for.
1/ When adding or multiplying numbers, pre-allocate the maximum space needed then reduce later if you find it's too much. For example, adding two 100-"digit" (where digit is an int) numbers will never give you more than 101 digits. Multiply a 12-digit number by a 3 digit number will never generate more than 15 digits (add the digit counts).
2/ For added speed, normalise (reduce the storage required for) the numbers only if absolutely necessary - my library had this as a separate call so the user can decide between speed and storage concerns.
3/ Addition of a positive and negative number is subtraction, and subtracting a negative number is the same as adding the equivalent positive. You can save quite a bit of code by having the add and subtract methods call each other after adjusting signs.
4/ Avoid subtracting big numbers from small ones since you invariably end up with numbers like:
         10
         11-
-- -- -- --
99 99 99 99 (and you still have a borrow).
Instead, subtract 10 from 11, then negate it:
11
10-
--
 1 (then negate to get -1).
Here are the comments (turned into text) from one of the libraries I had to do this for. The code itself is, unfortunately, copyrighted, but you may be able to pick out enough information to handle the four basic operations. Assume in the following that -a and -b represent negative numbers and a and b are zero or positive numbers.
For addition, if signs are different, use subtraction of the negation:
-a + b becomes b - a
a + -b becomes a - b
For subtraction, if signs are different, use addition of the negation:
a - -b becomes a + b
-a - b becomes -(a + b)
Also special handling to ensure we're subtracting small numbers from large:
small - big becomes -(big - small)
Multiplication uses entry-level math as follows:
475(a) x 32(b) = 475 x (30 + 2)
               = 475 x 30 + 475 x 2
               = 4750 x 3 + 475 x 2
               = 4750 + 4750 + 4750 + 475 + 475
The way in which this is achieved involves extracting each of the digits of 32 one at a time (backwards) then using add to calculate a value to be added to the result (initially zero).
ShiftLeft and ShiftRight operations are used to quickly multiply or divide a LongInt by the wrap value (10 for "real" math). In the example above, we add 475 to zero 2 times (the last digit of 32) to get 950 (result = 0 + 950 = 950).
Then we left shift 475 to get 4750 and right shift 32 to get 3. Add 4750 to zero 3 times to get 14250 then add to result of 950 to get 15200.
Left shift 4750 to get 47500, right shift 3 to get 0. Since the right shifted 32 is now zero, we're finished and, in fact 475 x 32 does equal 15200.
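The walkthrough above can be sketched in a few lines. Here ordinary 64-bit integers stand in for the library's LongInt type, and multiplying or dividing by 10 stands in for its ShiftLeft and ShiftRight; the function name is hypothetical.

```cpp
#include <cstdint>

// Multiplies a by b using only addition and decimal shifts.
// Assumes the true product fits in 64 bits.
std::uint64_t multiply_shift_add(std::uint64_t a, std::uint64_t b) {
    std::uint64_t result = 0;
    while (b != 0) {
        std::uint64_t digit = b % 10;  // last decimal digit of b
        for (std::uint64_t k = 0; k < digit; ++k) {
            result += a;               // add a to the result `digit` times
        }
        a *= 10;                       // "shift left": 475 -> 4750
        b /= 10;                       // "shift right": 32 -> 3
    }
    return result;
}
```

For 475 x 32 the loop runs twice: it adds 475 twice, then 4750 three times.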
Division is also tricky but based on early arithmetic (the "gazinta" method for "goes into"). Consider the following long division for 12345 / 27:
       457
     +-------
  27 | 12345    27 is larger than 1 or 12 so we first use 123.
       108      27 goes into 123 4 times, 4 x 27 = 108, 123 - 108 = 15.
       ---
        154     Bring down 4.
        135     27 goes into 154 5 times, 5 x 27 = 135, 154 - 135 = 19.
        ---
         195    Bring down 5.
         189    27 goes into 195 7 times, 7 x 27 = 189, 195 - 189 = 6.
         ---
           6    Nothing more to bring down, so stop.
Therefore 12345 / 27 is 457 with remainder 6. Verify:
457 x 27 + 6
= 12339 + 6
= 12345
This is implemented by using a draw-down variable (initially zero) to bring down the segments of 12345 one at a time until it's greater than or equal to 27.
Then we simply subtract 27 from that until we get below 27 - the number of subtractions is the segment added to the top line.
When there are no more segments to bring down, we have our result.
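A minimal sketch of this draw-down scheme, assuming the dividend is given as a string of decimal digits and the divisor is small enough for a machine integer (a full library would use its own big-number type for both). The function name long_divide is hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Long division by repeated subtraction. Returns {quotient, remainder}.
// Preconditions: dividend contains only '0'..'9', divisor is non-zero.
std::pair<std::string, std::uint32_t> long_divide(const std::string& dividend,
                                                  std::uint32_t divisor) {
    std::string quotient;
    std::uint64_t draw_down = 0;
    for (char ch : dividend) {
        // Bring down the next digit.
        draw_down = draw_down * 10 + static_cast<std::uint64_t>(ch - '0');
        // Count how many times the divisor can be subtracted (at most 9).
        int count = 0;
        while (draw_down >= divisor) {
            draw_down -= divisor;
            ++count;
        }
        // Skip leading zeros of the quotient.
        if (!quotient.empty() || count != 0) {
            quotient.push_back(static_cast<char>('0' + count));
        }
    }
    if (quotient.empty()) quotient = "0";
    return {quotient, static_cast<std::uint32_t>(draw_down)};
}
```

Calling long_divide("12345", 27) follows the same steps as the worked example: the draw-down values are 1, 12, 123, 154 and 195.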
Keep in mind these are pretty basic algorithms. There are far better ways to do complex arithmetic if your numbers are going to be particularly large. You can look into something like GNU Multiple Precision Arithmetic Library - it's substantially better and faster than my own libraries.
It does have the rather unfortunate misfeature in that it will simply exit if it runs out of memory (a rather fatal flaw for a general purpose library in my opinion) but, if you can look past that, it's pretty good at what it does.
If you cannot use it for licensing reasons (or because you don't want your application just exiting for no apparent reason), you could at least get the algorithms from there for integrating into your own code.
I've also found that the bods over at MPIR (a fork of GMP) are more amenable to discussions on potential changes - they seem a more developer-friendly bunch.

While re-inventing the wheel is extremely good for your personal edification and learning, it's also an extremely large task. I don't want to dissuade you, as it's an important exercise and one that I've done myself, but you should be aware that there are subtle and complex issues at work that larger packages address.
For example, multiplication. Naively, you might think of the 'schoolboy' method, i.e. write one number above the other, then do long multiplication as you learned in school. Example:
    123
  x  34
  -----
    492
+  3690
-------
   4182
but this method is extremely slow (O(n^2), n being the number of digits). Instead, for large operands, modern bignum packages use either a fast Fourier transform or a number-theoretic transform to turn this into an essentially O(n ln(n)) operation.
And this is just for integers. When you get into more complicated functions on some type of real representation of number (log, sqrt, exp, etc.) things get even more complicated.
If you'd like some theoretical background, I highly recommend reading the first chapter of Yap's book, "Fundamental Problems of Algorithmic Algebra". As already mentioned, the gmp bignum library is an excellent library. For real numbers, I've used MPFR and liked it.

Don't reinvent the wheel: it might turn out to be square!
Use a third party library, such as GNU MP, that is tried and tested.

You do it in basically the same way you do with pencil and paper...
The number is represented in a buffer (array) able to take on an arbitrary size (which means using malloc and realloc) as needed
you implement basic arithmetic as much as possible using language-supported structures, and deal with carries and moving the radix point manually
you scour numerical analysis texts to find efficient algorithms for dealing with more complex functions
you only implement as much as you need.
Typically you will use as your basic unit of computation
bytes containing 0-99 or 0-255
16-bit words containing either 0-9999 or 0-65535
32-bit words containing...
...
as dictated by your architecture.
The choice of binary or decimal base depends on your desires for maximum space efficiency, human readability, and the presence or absence of Binary Coded Decimal (BCD) math support on your chip.

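You can do it with high-school-level mathematics, though more advanced algorithms are used in reality. So, for example, to add two 1024-byte numbers:
unsigned char first[1024], second[1024], result[1025];
unsigned char carry = 0;
unsigned int sum = 0;
/* first[] and second[] hold the numbers, least significant byte first */
for(size_t i = 0; i < 1024; i++)
{
    sum = first[i] + second[i] + carry;
    result[i] = sum & 0xFF; /* the low 8 bits are this byte of the result */
    carry = sum >> 8;       /* anything above 255 carries into the next byte */
}
result[1024] = carry;       /* the extra place holds the final carry */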
result will have to be bigger by one place in case of addition to take care of maximum values. Look at this :
9
+
9
----
18
TTMath is a great library if you want to learn. It is built using C++. The above example was silly one, but this is how addition and subtraction is done in general!
A good reference about the subject is Computational complexity of mathematical operations. It tells you how much space is required for each operation you want to implement. For example, If you have two N-digit numbers, then you need 2N digits to store the result of multiplication.
As Mitch said, it is by far not an easy task to implement! I recommend you take a look at TTMath if you know C++.

One of the ultimate references (IMHO) is Knuth's TAOCP Volume II. It explains lots of algorithms for representing numbers and arithmetic operations on these representations.
@Book{Knuth:taocp:2,
author = {Knuth, Donald E.},
title = {The Art of Computer Programming},
volume = {2: Seminumerical Algorithms, second edition},
year = {1981},
publisher = {\Range{Addison}{Wesley}},
isbn = {0-201-03822-6},
}

Assuming that you wish to write a big integer code yourself, this can be surprisingly simple to do, spoken as someone who did it recently (though in MATLAB.) Here are a few of the tricks I used:
I stored each individual decimal digit as a double number. This makes many operations simple, especially output. While it does take up more storage than you might wish, memory is cheap here, and it makes multiplication very efficient if you can convolve a pair of vectors efficiently. Alternatively, you can store several decimal digits in a double, but beware then that convolution to do the multiplication can cause numerical problems on very large numbers.
Store a sign bit separately.
Addition of two numbers is mainly a matter of adding the digits, then check for a carry at each step.
Multiplication of a pair of numbers is best done as convolution followed by a carry step, at least if you have a fast convolution code on tap.
Even when you store the numbers as a string of individual decimal digits, division (also mod/rem ops) can be done to gain roughly 13 decimal digits at a time in the result. This is much more efficient than a divide that works on only 1 decimal digit at a time.
To compute an integer power of an integer, compute the binary representation of the exponent. Then use repeated squaring operations to compute the powers as needed.
Many operations (factoring, primality tests, etc.) will benefit from a powermod operation. That is, when you compute mod(a^p,N), reduce the result mod N at each step of the exponentiation where p has been expressed in a binary form. Do not compute a^p first, and then try to reduce it mod N.
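The last two points combine into the usual square-and-multiply loop. The sketch below uses 64-bit machine integers instead of big integers, so it assumes the modulus is below 2^32 to keep every product from overflowing; a big-integer version would follow the same structure. The function name pow_mod is hypothetical.

```cpp
#include <cstdint>

// Computes (base^exponent) mod modulus by repeated squaring, reducing
// after every multiplication so intermediate values stay small.
// Assumes 0 < modulus < 2^32 so that products fit in 64 bits.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exponent, std::uint64_t modulus) {
    std::uint64_t result = 1 % modulus;  // handles modulus == 1
    base %= modulus;
    while (exponent != 0) {
        if (exponent & 1) {
            result = result * base % modulus;  // this bit of the exponent is set
        }
        base = base * base % modulus;          // square for the next bit
        exponent >>= 1;
    }
    return result;
}
```

The loop runs once per bit of the exponent, so even a 64-bit exponent needs at most 64 squarings.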

Here's a simple ( naive ) example I did in PHP.
I implemented "Add" and "Multiply" and used that for an exponent example.
http://adevsoft.com/simple-php-arbitrary-precision-integer-big-num-example/
Code snip
// Add two big integers
function ba($a, $b)
{
    if( $a === "0" ) return $b;
    else if( $b === "0") return $a;
    $aa = str_split(strrev(strlen($a)>1?ltrim($a,"0"):$a), 9);
    $bb = str_split(strrev(strlen($b)>1?ltrim($b,"0"):$b), 9);
    $rr = Array();
    $maxC = max(Array(count($aa), count($bb)));
    $aa = array_pad(array_map("strrev", $aa),$maxC+1,"0");
    $bb = array_pad(array_map("strrev", $bb),$maxC+1,"0");
    for( $i=0; $i<=$maxC; $i++ )
    {
        $t = str_pad((string) ($aa[$i] + $bb[$i]), 9, "0", STR_PAD_LEFT);
        if( strlen($t) > 9 )
        {
            $aa[$i+1] = ba($aa[$i+1], substr($t,0,1));
            $t = substr($t, 1);
        }
        array_unshift($rr, $t);
    }
    return implode($rr);
}
