What is the maximum stack size limit for a 32-bit process on HP-UX B.11.31, i.e. the largest stack that can be allocated to a 32-bit process on HP-UX?
MAX STACK SIZE LIMIT
Default
32-bit: 0x800000 (8MB)
64-bit: 0x10000000 (256MB)
Allowed values
32-bit minimum: 0x40000 (256KB)
32-bit maximum: 0x17F00000 (383MB)
64-bit minimum: 0x40000 (256KB)
64-bit maximum: 0x80000000 (2GB)
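For a quick sanity check from inside a running process, here is a minimal Python sketch (assuming a Python interpreter is available on the box) that asks the OS for the stack limit the process actually sees; on HP-UX the hard limit is governed by the maxssiz (32-bit) and maxssiz_64bit (64-bit) kernel tunables listed above:

import resource

# Query the soft and hard stack-size limits for the current process.
# On HP-UX the hard limit reflects the maxssiz (32-bit) or maxssiz_64bit
# (64-bit) kernel tunable; resource.RLIM_INFINITY means "unlimited".
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("soft limit:", soft, "bytes")
print("hard limit:", hard, "bytes")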
Let's say ssthresh == 12 units and congwin at time 0 == 1 unit. I know congwin grows exponentially (powers of 2) until ssthresh is reached or surpassed, but consider the moment when congwin == 8: will it keep growing exponentially until 16 is reached, or will it grow until reaching 12 and from there continue in a linear ascent?
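For what it's worth, here is a toy per-ACK simulation sketching the RFC 5681 rules (my own toy model, not any particular TCP implementation). Under this model, slow start applies only while congwin < ssthresh, so the window climbs to 12 and then continues linearly instead of overshooting to 16:

# Toy model: units are segments (MSS), one cumulative ACK per segment, no loss.
ssthresh = 12
congwin = 1.0

for rtt in range(1, 8):
    for _ in range(int(congwin)):        # one ACK per segment sent this RTT
        if congwin < ssthresh:
            congwin += 1.0               # slow start: doubles per RTT
        else:
            congwin += 1.0 / congwin     # congestion avoidance: ~+1 per RTT
    print("after RTT", rtt, "congwin =", round(congwin, 2))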
Background
I'm self-taught in machine learning and have recently started delving into the Julia machine learning ecosystem.
Coming from a Python background, with some TensorFlow and OpenCV/skimage experience, I want to benchmark Julia ML libraries (Flux/JuliaImages) against their Python counterparts to see how fast (or slow) they really perform CV (or any other) tasks, and to decide whether I should shift to using Julia.
I know how to get the time taken to execute a function in Python using the timeit module, like this:
import timeit

# Loading an image using OpenCV
s = """\
img = cv2.imread('sample_image.png', 1)
"""
setup = """\
import cv2
"""
# print the time taken in ms, rounded to 2 digits
print(str(round(timeit.timeit(stmt=s, setup=setup, number=1) * 1000, 2)) + " ms")
How does one compare the execution time of a function performing the same task in Julia, using the appropriate library (in this case, JuliaImages)?
Does Julia provide any function/macro to time/benchmark code?
using BenchmarkTools is the recommended way to benchmark Julia functions. Unless you are timing something that takes quite a while, use either the @benchmark or the less verbose @btime macro exported from it. Because the machinery behind these macros evaluates the target function many times, they are unsuited to long-running code; @time is useful for benchmarking things that run slowly (e.g. where disk access or very time-consuming calculations are involved).
It is important to use @btime or @benchmark correctly; this avoids misleading results. Usually, you are benchmarking a function that takes one or more arguments. When benchmarking, all arguments should be external variables (without the benchmark macro):
x = 1
f(x)
# do not use f(1)
The function will be evaluated many times. To prevent the function arguments from being re-evaluated every time the function is evaluated, we must mark each argument by prefixing a $ to the name of each variable that is used as an argument. The benchmarking macros use this to indicate that the variable should be evaluated (resolved) once, at the start of the benchmarking process, and the result then reused directly as is:
julia> using BenchmarkTools
julia> a = 1/2;
julia> b = 1/4;
julia> c = 1/8;
julia> a, b, c
(0.5, 0.25, 0.125)
julia> function sum_cosines(x, y, z)
           return cos(x) + cos(y) + cos(z)
       end;
julia> @btime sum_cosines($a, $b, $c);  # the `;` suppresses printing the returned value
  11.899 ns (0 allocations: 0 bytes)    # calling the function takes ~12 ns (nanoseconds)
                                        # and the function does not allocate any memory
# if we omit the '$', what we see is misleading
julia> @btime sum_cosines(a, b, c);     # the function appears more than twice as slow
  28.441 ns (1 allocation: 16 bytes)    # the function appears to be allocating memory
# @benchmark can be used the same way that @btime is used
julia> @benchmark sum_cosines($a, $b, $c)  # do not use a ';' here
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.111 ns (0.00% GC)
median time: 12.213 ns (0.00% GC)
mean time: 12.500 ns (0.00% GC)
maximum time: 39.741 ns (0.00% GC)
--------------
samples: 1500
evals/sample: 999
While there are parameters that can be adjusted, the default values usually work well. For additional information about BenchmarkTools for experienced users, see the manual.
Julia provides two macros for timing/benchmarking code at runtime. These are:
@time
@benchmark : external, installed via Pkg.add("BenchmarkTools")
Using BenchmarkTools' @benchmark is very easy and would be helpful to you in comparing the speed of the two languages.
Example of using @benchmark against the Python benchmark you provided:
using Images, FileIO, BenchmarkTools
@benchmark img = load("sample_image.png")
Output:
BenchmarkTools.Trial:
memory estimate: 3.39 MiB
allocs estimate: 322
--------------
minimum time: 76.631 ms (0.00% GC)
median time: 105.579 ms (0.00% GC)
mean time: 110.319 ms (0.41% GC)
maximum time: 209.470 ms (0.00% GC)
--------------
samples: 46
evals/sample: 1
Now, to compare the mean times, put the number of samples (46) as the number argument in your Python timeit code and divide by that same number to get the mean execution time:
print(str(round((timeit.timeit(stmt = s, setup = setup, number = 46)/46)*1000, 2)) + " ms")
You can follow this process for benchmarking any function in both Julia and Python.
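As an aside, if you want more than just the mean on the Python side, timeit.repeat returns the individual run times, so you can report minimum/median/mean the way @benchmark does (a sketch reusing the s and setup strings from the question):

import statistics
import timeit

# 46 independent runs of one execution each, mirroring samples = 46, evals/sample = 1
times = timeit.repeat(stmt=s, setup=setup, repeat=46, number=1)
print("min:   ", round(min(times) * 1000, 2), "ms")
print("median:", round(statistics.median(times) * 1000, 2), "ms")
print("mean:  ", round(statistics.mean(times) * 1000, 2), "ms")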
I hope this clears up your doubt.
Note: from a statistical point of view, @benchmark is much better than @time.
I want to calculate the maximum network throughput on a 1G Ethernet link. I understand how to estimate the maximum rate in packets/sec for 64-byte frames:
IFG 12 bytes
MAC Preamble 8 bytes
MAC DA 6 bytes
MAC SA 6 bytes
MAC type 2 bytes
Payload 46 bytes
FCS 4 bytes
Total Frame size -> 84 bytes
Now for a 1G link we get:
1,000,000,000 bits/sec ÷ (84 bytes/frame × 8 bits/byte) ≈ 1,488,095 frames/sec
As I understand it, this is the data-link layer performance, correct?
But how do I calculate the throughput in megabits per second for different packet sizes, i.e. 64, 128, ..., 1518? Also, how do I calculate UDP/TCP throughput, since I have to consider the header overhead?
Thanks.
Max throughput over Ethernet = (Payload_size / (Payload_size + 38)) * Link bitrate
I.e. if you send 50 bytes of payload data, max throughput would be (50 / 88) * 1,000,000,000 for a 1G link, or about 568 Mbit/s. If you send 1000 bytes of payload, max throughput is (1000/1038) * 1,000,000,000 = 963 Mbit/s.
IP+UDP adds 28 bytes of headers, so if you're looking for data throughput over UDP, you should use this formula:
Max throughput over UDP = (Payload_size / (Payload_size + 66)) * Link bitrate
And IP+TCP adds 40 bytes of headers, so that would be:
Max throughput over TCP = (Payload_size / (Payload_size + 78)) * Link bitrate
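Putting the three formulas into a small Python helper (a sketch; the overhead constants are exactly the ones derived above, and the payload sizes in the loop are just examples):

# Per-frame wire overhead: 12 IFG + 8 preamble/SFD + 14 Ethernet header + 4 FCS = 38 bytes.
# IP+UDP adds 28 bytes on top of that; IP+TCP adds 40 bytes.
OVERHEAD = {"ethernet": 38, "udp": 38 + 28, "tcp": 38 + 40}

def max_throughput(payload_bytes, proto="ethernet", link_bps=1_000_000_000):
    """Optimistic payload throughput in bits/sec (no header options, no loss)."""
    return payload_bytes / (payload_bytes + OVERHEAD[proto]) * link_bps

for payload in (50, 1000, 1460):
    for proto in ("ethernet", "udp", "tcp"):
        print(payload, "bytes over", proto, "->",
              round(max_throughput(payload, proto) / 1e6, 1), "Mbit/s")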
Note that these are optimistic calculations. I.e. in reality, you might have extra options in the header data that increases the size of the headers, lowering payload throughput. You could also have packet loss that causes performance to drop.
Check out the Wikipedia article on the Ethernet frame, particularly the "Maximum throughput" section:
http://en.wikipedia.org/wiki/Ethernet_frame
How do runif and sample differ in terms of the probability distribution they use? I know that runif gives fractional numbers and sample gives whole numbers, but what I am interested in is whether sample also uses a uniform probability distribution.
Consider the following code and output:
> set.seed(1)
> round(runif(10,1,100))
[1] 27 38 58 91 21 90 95 66 63 7
> set.seed(1)
> sample(1:100, 10, replace=TRUE)
[1] 27 38 58 91 21 90 95 67 63 7
This strongly suggests that when asked to do the same thing, the two functions give pretty much the same output (though, interestingly, it is round that gives the near-identical output rather than floor or ceiling). The main differences are in the defaults; if you don't change those defaults, then both give something called a uniform distribution (though sample would be considered a discrete uniform, and by default it samples without replacement).
Edit
The more correct comparison is:
> set.seed(1)
> ceiling(runif(10,0,100))
[1] 27 38 58 91 21 90 95 67 63 7
instead of using round.
We can even step that up a notch:
> set.seed(1)
> tmp1 <- sample(1:100, 1000, replace=TRUE)
> set.seed(1)
> tmp2 <- ceiling(runif(1000,0,100))
> all.equal(tmp1,tmp2)
[1] TRUE
Of course if the probs argument to sample is used (with not all values equal), then it will no longer be uniform.
sample samples from a fixed set of inputs, and if a length-1 numeric n is passed as the first argument, it samples from 1:n and returns integer output(s).
runif, on the other hand, returns a sample from a real-valued range.
> sample(c(1,2,3), 1)
[1] 2
> runif(1, 1, 3)
[1] 1.448551
sample() runs faster than ceiling(runif()).
This is useful to know if you are doing many simulations or bootstrapping.
A crude time-trial script that tests four equivalent expressions:
n <- 100    # sample size
m <- 10000  # simulations
system.time(sample(n, size = n*m, replace = TRUE))  # faster than ceiling/runif
system.time(ceiling(runif(n*m, 0, n)))
system.time(ceiling(n * runif(n*m)))
system.time(floor(runif(n*m, 1, n+1)))
The proportional time advantage increases with n and m, but watch that you don't fill memory!
By the way, don't use round() to convert a uniformly distributed continuous value to a uniformly distributed integer, since the terminal values get selected only half as often as they should.
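To see that endpoint effect concretely, here is a quick check (sketched in Python for brevity; the same experiment is a one-liner in R). round() maps only [1, 1.5) to 1 and only (4.5, 5] to 5, so the end values get half-width bins:

import random
from collections import Counter

# Counts for round(uniform(1, 5)): expect 1 and 5 to appear about half
# as often as 2, 3, and 4, because their bins are only 0.5 wide.
counts = Counter(round(random.uniform(1, 5)) for _ in range(100_000))
for value in sorted(counts):
    print(value, counts[value])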
Where can I find information about the error ranges of the trigonometric function instructions on x86 processors, such as fsincos?
What you ask is rarely an interesting question, and most likely you really want to know something different. So let me answer different questions first:
How do I calculate a trigonometric function to a certain accuracy?
Just use a longer datatype. With x86, if you need the result to double accuracy, do an 80-bit extended-precision calculation and you are on the safe side.
How do I get platform-independent accuracy?
You need a specialized software solution for this, like MPFR.
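For illustration, here is what that looks like with mpmath, a pure-Python arbitrary-precision library in the same spirit as MPFR (a sketch; any MPFR binding would serve equally well):

import mpmath

mpmath.mp.dps = 50                    # work with 50 significant decimal digits
print(mpmath.sin(mpmath.mpf(1) / 3)) # accurate at the chosen precision,
                                     # independent of the platform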
That said, let me come back to your original question. Short answer: for small operands it should typically be within 1 ulp; for larger operands it gets worse. The only way to find out for sure is to test it yourself, like this guy did. There is no reliable information from the processor vendors.
For Intel CPUs the accuracy of the built-in transcendental instructions is documented in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, section 8.3.10 Transcendental Instruction Accuracy:
With the Pentium processor and later IA-32 processors, the worst case error on transcendental functions is less than 1 ulp when rounding to the nearest (even) and less than 1.5 ulps when rounding in other modes.
It should be noted that the error bound of 1 ulp applies to the 80-bit extended-precision format, as all transcendental function instructions deliver extended-precision results. The issue noted by Stephen Canon in an earlier comment, regarding a loss of accuracy relative to a mathematical reference for the trigonometric function instructions FSIN, FCOS, FSINCOS, and FPTAN due to argument reduction with a 66-bit machine PI, is acknowledged by Intel. Guidance is provided as follows:
Regardless of the target precision (single, double, or double-extended), it is safe to reduce the argument to a value smaller in absolute value than about 3π/4 for FSIN, and smaller than about 3π/8 for FCOS, FSINCOS, and FPTAN. [...] For example, accuracy measurements show that the double-extended precision result of FSIN will not have errors larger than 0.72 ulp for |x| < 2.82 [...]
Likewise, the double-extended precision result of FCOS will not have errors larger than 0.82 ulp for |x| < 1.31 [...]
It is further acknowledged that the error bound of 1 ulp for the logarithmic function instructions FYL2X and FYL2XP1 only holds when y = 1 (this was not clear in some of Intel's older documentation):
The instructions FYL2X and FYL2XP1 are two operand instructions and are guaranteed to be within 1 ulp only when y equals 1. When y is not equal to 1, the maximum ulp error is always within 1.35
Using a multi-precision library, it is straightforward to put Intel's claims to a test. To collect the following data, I used Richard Brent's MP library as a reference and ran 2^31 random test cases in the intervals indicated:
Intel Xeon CPU E3-1270 v2 "IvyBridge", Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
2xm1 [-1,1] max. ulp = 0.898306 at x = -1.8920e-001 (BFFC C1BED062 C071D472)
sin [-2.82,+2.82] max. ulp = 0.706783 at x = 5.1323e-001 (3FFE 8362D6B1 FC93DFA0)
cos [-1.41,+1.41] max. ulp = 0.821634 at x = -1.3201e+000 (BFFF A8F8486E 591A59D7)
tan [-1.41,+1.41] max. ulp = 0.990388 at x = 1.3179e+000 (3FFF A8B0CAB9 0039C790)
atan [-1,1] max. ulp = 0.747328 at x = 1.2252e-002 (3FF8 C8BB9E06 B9EB4DF8), y = 3.9204e-001 (3FFD C8B8DC94 AA6655B4)
yl2x [0.5,2.0] max. ulp = 0.994396 at x = 1.0218e+000 (3FFF 82C95B56 8A70EB2D), y = 1.0000e+000 (3FFF 80000000 00000000)
yl2x [1.0,1.2] max. ulp = 1.202769 at x = 1.0915e+000 (3FFF 8BB70F1B C5F7E103), y = -9.8934e-001 (BFFE FD453A23 AC926478)
yl2xp1 [-0.7,1.44] max. ulp = 0.990469 at x = 2.1709e-002 (3FF9 B1D61A98 BF349080), y = 1.0000e+000 (3FFF 80000000 00000000)
yl2xp1 [-1, 1] max. ulp = 1.206979 at x = 9.1169e-002 (3FFB BAB69127 C1D5C158), y = -9.9281e-001 (BFFE FE28A91F 132F0C35)
While such non-exhaustive testing cannot prove error bounds, the maximum errors found appear to confirm Intel's documentation.
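The same methodology can be sketched in Python with mpmath as the multi-precision reference. Note the caveats: this measures the C library's double-precision sin rather than the x87 FSIN instruction itself, the ulp_error helper is my own, and a hundred thousand samples prove nothing; it only shows the shape of such a test:

import math
import random
import mpmath

mpmath.mp.prec = 113  # reference precision far above double's 53 bits

def ulp_error(x):
    """ulp distance between math.sin(x) and a high-precision reference."""
    ref = mpmath.sin(mpmath.mpf(x))
    return abs(mpmath.mpf(math.sin(x)) - ref) / math.ulp(float(ref))  # Python 3.9+

worst = max(ulp_error(random.uniform(-2.82, 2.82)) for _ in range(100_000))
print("max ulp error observed:", float(worst))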
I do not have any modern AMD processors to test, but I do have test data for an old 32-bit Athlon CPU. Full disclosure: I designed the algorithms for the transcendental function instructions used in 32-bit Athlon processors. My accuracy target was less than 1 ulp for all the instructions; however, the same caveat about argument reduction by a 66-bit machine PI for the trigonometric functions, already mentioned above, applies.
Athlon XP-2100 "Palomino", x86 Family 6 Model 6 Stepping 2, AuthenticAMD
2xm1 [-1,1] max. ulp = 0.720006 at x = 5.6271e-001 (3FFE 900D9E90 A533535D)
sin [-2.82, +2.82] max. ulp = 0.663069 at x = -2.8200e+000 (C000 B47A7BB2 305631FE)
cos [-1.41, +1.41] max. ulp = 0.671089 at x = -1.3189e+000 (BFFF A8D0CF9E DC0BCA43)
tan [-1.41, +1.41] max. ulp = 0.783821 at x = -1.3225e+000 (BFFF A947067E E3F4C39C)
atan [-1,1] max. ulp = 0.665893 at x = 5.5333e-001 (3FFE 8DA6B606 C58B206A) y = 5.5169e-001 (3FFE 8D3B9DC8 5EA87546)
yl2x [0.4,2.5] max. ulp = 0.716276 at x = 6.9826e-001 (3FFE B2C128C3 0EF1EC00) y = -1.2062e-001 (BFFB F7064049 BC362838)
yl2xp1 [-1,4] max. ulp = 0.691403 at x = 1.9090e-001 (3FFC C37C0397 F8184934) y = -2.4796e-001 (BFFC FDE93CA9 980BF78C)
The AMD64 Architecture Programmer’s Manual, Vol. 1, in section 6.4.5.1 Accuracy of Transcendental Results, documents the error bounds as follows:
x87 computations are carried out in double-extended-precision format, so that the transcendental functions provide results accurate to within one unit in the last place (ulp) for each of the floating-point data types.
You can read the Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 1, section 8.3.10 on Transcendental Instruction Accuracy. There is a precise formula, but also the more accessible statement:
With the Pentium processor and later IA-32 processors, the worst case error on transcendental functions is less than 1 ulp when rounding to the nearest (even) and less than 1.5 ulps when rounding in other modes.