I am trying to talk to cloud services in F# and put timeouts around these operations.
This is now multiple iterations later, but I still cannot figure out how to use Async.
let tokenSource = new CancellationTokenSource()
let task = Async.StartAsTask(async {return (getS3Meta keyMeta)}, TaskCreationOptions.None, tokenSource.Token)
let metaMaybe = Async.AwaitTask(task, 700) |> Async.RunSynchronously
To support the timeout we introduced:
type Timeout = Timeout

type Microsoft.FSharp.Control.Async with
    static member AwaitTask (t : Task<'T>, timeout : int) =
        async {
            use cts = new CancellationTokenSource()
            use timer = Task.Delay (timeout, cts.Token)
            let! completed = Async.AwaitTask <| Task.WhenAny(t, timer)
            if completed = (t :> Task) then
                cts.Cancel ()
                let! result = Async.AwaitTask t
                return Ok result
            else
                return Error Timeout
        }
Whatever I do the task is always in WaitingForActivation.
What is the recommended way of putting a timeout on an IO operation? It can block the main execution; I do not need async code here. If I must use async, what is the primitive for starting a task with a timeout?
The currently working code (we have not run into issues with this, even though we do not understand the tradeoffs of the different Async workflows):
let doSomeIoAsync =
    async {
        return (anotherAsyncWokFlowIoInCsharpLib someParam)
    }

let task = Async.StartAsTask doSomeIoAsync
match task.Wait(500) with
| true -> Success task.Result
| false -> Error Timeout
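For reference, a minimal sketch of one Async-native primitive that covers this: Async.StartChild takes an optional millisecond timeout and raises System.TimeoutException if the child has not finished in time (getWithTimeout is an illustrative name, not a library function; the Ok/Error Timeout shapes mirror the types above; note that a synchronous call inside the child cannot itself be interrupted, the timeout only abandons the wait):

let getWithTimeout timeoutMs work =
    async {
        let! child = Async.StartChild(work, timeoutMs)
        try
            let! result = child
            return Ok result
        with :? System.TimeoutException ->
            return Error Timeout
    }

// Blocking usage, mirroring the snippets above:
let metaMaybe =
    getWithTimeout 700 (async { return getS3Meta keyMeta })
    |> Async.RunSynchronously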
Some fsi:
> let ioFn () = Thread.Sleep(5000); System.Console.WriteLine 100; 100;;
Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val ioFn : unit -> int
> ioFn();;
100
Real: 00:00:05.005, CPU: 00:00:00.001, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 100
> let taskAsync = async { return ioFn() };;
Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val taskAsync : Async<int>
> let t = Async.StartAsTask taskAsync ;;
Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val t : Tasks.Task<int>
> t.Wait(500) ; t.Status;;
Real: 00:00:00.505, CPU: 00:00:00.001, GC gen0: 0, gen1: 0, gen2: 0
val it : Tasks.TaskStatus = WaitingForActivation
> 100
- t.Wait(500) ; t.Status;;
Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val it : Tasks.TaskStatus = RanToCompletion
> t.Result;;
Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
val it : int = 100
I think it is a bit misleading that WaitingForActivation is used like this. Maybe I am misinterpreting it and it really means WaitingForCompletion.
As far as performance goes, the first couple of requests are extremely slow compared to the rest:
Error: WaitingForActivation
Elapsed Time: 296
Error: WaitingForActivation
Elapsed Time: 254
Error: WaitingForActivation
Elapsed Time: 255
Error: WaitingForActivation
Elapsed Time: 254
Error: WaitingForActivation
Elapsed Time: 254
OK: RanToCompletion
Elapsed Time: 249
OK: RanToCompletion
Elapsed Time: 147
OK: RanToCompletion
Elapsed Time: 246
OK: RanToCompletion
Elapsed Time: 146
OK: RanToCompletion
Elapsed Time: 150
OK: RanToCompletion
Elapsed Time: 142
OK: RanToCompletion
Elapsed Time: 143
OK: RanToCompletion
Elapsed Time: 139
OK: RanToCompletion
Elapsed Time: 139
OK: RanToCompletion
Elapsed Time: 141
Not sure how to get rid of the slow start.
I am attempting to write a program that executes Monte Carlo simulations using OpenCL, and I have run into an issue involving exponentials. When the value of the variable steps becomes large, approximately 20000, the calculation of the exponent fails unexpectedly and the program quits with "Abort Trap: 6". This seems like a bizarre error, given that steps should not affect memory allocation. I have tried setting normal, alpha, and beta to 0, but this does not resolve the problem; however, commenting out the exponent and replacing it with the constant 1 does fix it. I have run my code on an AWS GPU instance and it does not run into any issues. Does anybody have any ideas as to why this might be a problem on an integrated graphics card?
SOLUTION
Execute the kernel multiple times over smaller ranges to keep each kernel execution under 5 seconds.
Code Snippet
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
/* MWC64X random number generator; *state holds (x, carry). */
static uint MWC64X(uint2 *state) {
    enum { A = 4294883355U };
    uint x = (*state).x, c = (*state).y;
    uint res = x ^ c;
    uint hi = mul_hi(x, A);
    x = x * A + c;
    c = hi + (x < c);
    *state = (uint2)(x, c);
    return res;
}
__kernel void discreteMonteCarloKernel(...) {
    float cumulativeWalk = stockPrice;
    float currentValue = stockPrice;
    ...
    uint n = get_global_id(0);
    uint2 seed2 = (uint2)(n, seed);
    uint random1 = MWC64X(&seed2);
    uint2 seed3 = (uint2)(random1, seed);
    uint random2 = MWC64X(&seed3);
    float alpha = (interestRate - 0.5 * sigma * sigma) * dt;
    float beta = sigma * sqrt(dt);
    float u1;
    float u2;
    float a;
    float b;
    float normal;
    float exponent; /* was used below without a declaration */
    for (int j = 0; j < steps; j++) {
        random1 = MWC64X(&seed2);
        if (random1 == 0) {
            random1 = MWC64X(&seed2); /* avoid u1 == 0, which would make log(u1) blow up */
        }
        random2 = MWC64X(&seed3);
        u1 = (float)random1 / (float)0xffffffff;
        u2 = (float)random2 / (float)0xffffffff;
        a = sqrt(-2 * log(u1)); /* Box-Muller transform */
        b = 2 * M_PI * u2;
        normal = a * sin(b);
        exponent = exp(alpha + beta * normal);
        currentValue = currentValue * exponent;
        cumulativeWalk += currentValue;
        ...
    }
}
Problem Report
Exception Type: EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Application Specific Information:
abort() called
Application Specific Signatures:
Graphics hardware encountered an error and was reset: 0x00000813
Thread 0 Crashed:: Dispatch queue: opencl_runtime
0 libsystem_kernel.dylib 0x00007fffb14bad42 __pthread_kill + 10
1 libsystem_pthread.dylib 0x00007fffb15a85bf pthread_kill + 90
2 libsystem_c.dylib 0x00007fffb1420420 abort + 129
3 libGPUSupportMercury.dylib 0x00007fffa98e6fbf gpusGenerateCrashLog + 158
4 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x000000010915f13b gpusKillClientExt + 9
5 libGPUSupportMercury.dylib 0x00007fffa98e7983 gpusQueueSubmitDataBuffers + 168
6 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091aa031 IntelCLCommandBuffer::getNew(GLDQueueRec*) + 31
7 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091a9f99 intelSubmitCLCommands(GLDQueueRec*, unsigned int) + 65
8 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091b00a1 CHAL_INTEL::ChalContext::ChalFlush() + 83
9 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091aa2c3 gldFinishQueue + 43
10 com.apple.opencl 0x00007fff9ffeeb37 0x7fff9ffed000 + 6967
11 com.apple.opencl 0x00007fff9ffef000 0x7fff9ffed000 + 8192
12 com.apple.opencl 0x00007fffa000ccca 0x7fff9ffed000 + 130250
13 com.apple.opencl 0x00007fffa001029d 0x7fff9ffed000 + 144029
14 libdispatch.dylib 0x00007fffb13568fc _dispatch_client_callout + 8
15 libdispatch.dylib 0x00007fffb1357536 _dispatch_barrier_sync_f_invoke + 83
16 com.apple.opencl 0x00007fffa001011d 0x7fff9ffed000 + 143645
17 com.apple.opencl 0x00007fffa000bda6 0x7fff9ffed000 + 126374
18 com.apple.opencl 0x00007fffa00011df clEnqueueReadBuffer + 813
19 simplisticComparison 0x0000000107b953cf BinomialMultiplication::execute(int) + 1791
20 simplisticComparison 0x0000000107b9ec7f main + 767
21 libdyld.dylib 0x00007fffb138c235 start + 1
Thread 1:
0 libsystem_pthread.dylib 0x00007fffb15a50e4 start_wqthread + 0
1 ??? 0x000070000eed6b30 0 + 123145552751408
Thread 2:
0 libsystem_pthread.dylib 0x00007fffb15a50e4 start_wqthread + 0
Thread 3:
0 libsystem_pthread.dylib 0x00007fffb15a50e4 start_wqthread + 0
1 ??? 0x007865646e496d65 0 + 33888479226719589
Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x0000000000000000 rbx: 0x0000000000000006 rcx: 0x00007fff58074078 rdx: 0x0000000000000000
rdi: 0x0000000000000307 rsi: 0x0000000000000006 rbp: 0x00007fff580740a0 rsp: 0x00007fff58074078
r8: 0x0000000000000000 r9: 0x00007fffb140ba50 r10: 0x0000000008000000 r11: 0x0000000000000206
r12: 0x00007f92de80a7e0 r13: 0x00007f92e0008c00 r14: 0x00007fffba29e3c0 r15: 0x00007f92de801a00
rip: 0x00007fffb14bad42 rfl: 0x0000000000000206 cr2: 0x00007fffba280128
Logical CPU: 0
Error Code: 0x02000148
Trap Number: 133
I have a guess. The driver can crash in two ways:
1. We reference a bad buffer address. This is probably not your case.
2. We time out (exceed the TDR). A kernel has a few seconds to complete.
My money is on #2. If the larger value of steps makes the GPU run too long, the system will kill things.
I am not familiar with the guts of Apple's Intel driver, but typically there is a way to disable the TDR in extreme cases. E.g., see the Windows documentation on TDRs to get the gist. (Linux drivers have a way to disable this too.)
Normally we want to avoid running things that take super long, so it might be a good idea to decompose the workload in some way so that you naturally don't hit this kill switch. E.g., chunk the steps into smaller pieces, passing in and saving your state for parts you can't recompute, as sketched below.
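A rough sketch of that chunking idea (stepsPerLaunch, the state buffers, and the precomputed alpha/beta parameters are hypothetical names; the per-step body is the one from the snippet above):

__kernel void discreteMonteCarloChunk(__global float *currentValues,
                                      __global float *cumulativeWalks,
                                      __global uint2 *rngStates,
                                      const int stepsPerLaunch,
                                      const float alpha,
                                      const float beta) {
    uint n = get_global_id(0);
    /* Resume the walk state saved by the previous launch. */
    float currentValue = currentValues[n];
    float cumulativeWalk = cumulativeWalks[n];
    uint2 seed2 = rngStates[n];
    for (int j = 0; j < stepsPerLaunch; j++) {
        /* ... same Box-Muller / exp step body as above ... */
    }
    /* Save the state so the host can enqueue the next chunk. */
    currentValues[n] = currentValue;
    cumulativeWalks[n] = cumulativeWalk;
    rngStates[n] = seed2;
}

/* Host side: enqueue small chunks back to back; each launch then stays
   well under the watchdog limit. */
for (int done = 0; done < steps; done += stepsPerLaunch) {
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                           0, NULL, NULL);
    clFinish(queue);
}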
I am trying to have a function fire every x milliseconds without blocking the main loop. I saw some example code to do this; see below:
// Interval is how long we wait
// add const if this should never change
int interval=1000;
// Tracks the time since last event fired
unsigned long previousMillis=0;
void loop() {
    // Get snapshot of time
    unsigned long currentMillis = millis();
    Serial.print("TIMING: ");
    Serial.print(currentMillis);
    Serial.print(" - ");
    Serial.print(previousMillis);
    Serial.print(" (");
    Serial.print((unsigned long)(currentMillis - previousMillis));
    Serial.print(") >= ");
    Serial.println(interval);
    if ((unsigned long)(currentMillis - previousMillis) >= interval) {
        previousMillis = currentMillis;
        // fire your event here
    }
}
Now the following happens:
TIMING: 3076 - 2067 (1009) >= 1000
TIMING: 4080 - 3076 (1004) >= 1000
TIMING: 5084 - 4080 (1004) >= 1000
TIMING: 6087 - 5084 (1003) >= 1000
TIMING: 7091 - 6087 (1004) >= 1000
Why is currentMillis getting so much higher every loop? It looks like it is sharing a pointer or something like that, because it adds the interval value every time. I am confused!
I think the code you presented is an incomplete picture of what you have uploaded to the Arduino, since on my device I obtain the following sequence,
TIMING: 0 - 0 (0) >= 1000
TIMING: 0 - 0 (0) >= 1000
TIMING: 1 - 0 (1) >= 1000
TIMING: 32 - 0 (32) >= 1000
TIMING: 93 - 0 (93) >= 1000
TIMING: 153 - 0 (153) >= 1000
TIMING: 218 - 0 (218) >= 1000
TIMING: 283 - 0 (283) >= 1000
TIMING: 348 - 0 (348) >= 1000
TIMING: 412 - 0 (412) >= 1000
TIMING: 477 - 0 (477) >= 1000
TIMING: 541 - 0 (541) >= 1000
TIMING: 606 - 0 (606) >= 1000
TIMING: 670 - 0 (670) >= 1000
TIMING: 735 - 0 (735) >= 1000
TIMING: 799 - 0 (799) >= 1000
TIMING: 865 - 0 (865) >= 1000
TIMING: 929 - 0 (929) >= 1000
TIMING: 994 - 0 (994) >= 1000
TIMING: 1058 - 0 (1058) >= 1000
TIMING: 1127 - 1058 (69) >= 1000
TIMING: 1198 - 1058 (140) >= 1000
TIMING: 1271 - 1058 (213) >= 1000
TIMING: 1344 - 1058 (286) >= 1000
which looks correct given the code you provided.
Are you sure there isn't any sleep() call in your original source code?
(perhaps you didn't upload the updated code on the device?)
To expand on #patrick-trentin's answer, it is very likely that your code is not the only thing running on your Arduino. The code you see in your sketches is never the only code the Arduino runs: it also handles incoming Serial data for you, and if you use other modules (like SPI or networking) those bring along code that runs in ISRs (interrupt service routines), functions invoked regularly by a timer.
But the Arduino CPU cannot run code in parallel. To mimic parallel behaviour, it actually stops your main loop to run a subroutine (the ISR), which will, for example, read a byte coming in through Serial and buffer it, to then make it available to you through a nice method on the Serial object.
So the more things you do in those interrupt-based subroutines, the less often you iterate your main loop, and the less often it passes over the millis() comparison you do. A minimal sketch of the lean loop follows.
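To make that concrete, here is the lean version with the debug prints removed, so each iteration costs almost nothing (doPeriodicWork is a placeholder for whatever you want to fire):

const unsigned long interval = 1000;   // ms between events
unsigned long previousMillis = 0;

void loop() {
    unsigned long currentMillis = millis();
    if (currentMillis - previousMillis >= interval) {
        previousMillis = currentMillis;
        doPeriodicWork();   // placeholder: your event goes here
    }
}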
I have a large-ish matrix and I'd like to apply sortperm to each column of that matrix. The naive thing to do is
order = sortperm(X[:,j])
which makes a copy. That seems like a shame, so I thought I'd try a SubArray:
order = sortperm(sub(X,1:n,j))
but that was even slower. For a laugh I tried
order = sortperm(1:n,by=i->X[i,j])
but of course that was terrible. What is the fastest way to do this?
Here is some benchmark code:
getperm1(X,n,j) = sortperm(X[:,j])
getperm2(X,n,j) = sortperm(sub(X,1:n,j))
getperm3(X,n) = mapslices(sortperm, X, 1)
n = 1000000
X = rand(n, 10)
for f in [getperm1, getperm2]
    println(f)
    for it in 1:5
        gc()
        @time f(X,n,5)
    end
end

for f in [getperm3]
    println(f)
    for it in 1:5
        gc()
        @time getperm3(X,n)
    end
end
results:
getperm1
elapsed time: 0.258576164 seconds (23247944 bytes allocated)
elapsed time: 0.141448346 seconds (16000208 bytes allocated)
elapsed time: 0.137306078 seconds (16000208 bytes allocated)
elapsed time: 0.137385171 seconds (16000208 bytes allocated)
elapsed time: 0.139137529 seconds (16000208 bytes allocated)
getperm2
elapsed time: 0.433251141 seconds (11832620 bytes allocated)
elapsed time: 0.33970986 seconds (8000624 bytes allocated)
elapsed time: 0.339840795 seconds (8000624 bytes allocated)
elapsed time: 0.342436716 seconds (8000624 bytes allocated)
elapsed time: 0.342867431 seconds (8000624 bytes allocated)
getperm3
elapsed time: 1.766020534 seconds (257397404 bytes allocated, 1.55% gc time)
elapsed time: 1.43763525 seconds (240007488 bytes allocated, 1.85% gc time)
elapsed time: 1.41373546 seconds (240007488 bytes allocated, 1.82% gc time)
elapsed time: 1.42215519 seconds (240007488 bytes allocated, 1.83% gc time)
elapsed time: 1.419174037 seconds (240007488 bytes allocated, 1.83% gc time)
The mapslices version is about 10x slower than the getperm1 version, as you'd expect.
It's worth pointing out that, on my machine at least, the copy+sortperm option is not that much slower than a plain sortperm on a vector of the same length; but in principle no memory allocation is necessary, so it would be nice to avoid it.
You can beat the performance of SubArray in a few very specific cases (like taking a continuous view of an Array) with pointer magic:
function colview(X::Matrix, j::Int)
    n = size(X,1)
    offset = 1 + n*(j-1)  # the linear start position
    checkbounds(X, offset+n-1)
    pointer_to_array(pointer(X, offset), (n,))
end
getperm4(X,n,j) = sortperm(colview(X,j))
The function colview returns a full-fledged Array that shares its data with the original X. Note that this is a terrible idea, because the returned array references data that Julia is only keeping track of through X. This means that if X goes out of scope before the column "view", data access will crash with a segfault.
With results:
getperm1
elapsed time: 0.317923176 seconds (15 MB allocated)
elapsed time: 0.252215996 seconds (15 MB allocated)
elapsed time: 0.215124686 seconds (15 MB allocated)
elapsed time: 0.210062109 seconds (15 MB allocated)
elapsed time: 0.213339974 seconds (15 MB allocated)
getperm2
elapsed time: 0.509172302 seconds (7 MB allocated)
elapsed time: 0.509961218 seconds (7 MB allocated)
elapsed time: 0.506399583 seconds (7 MB allocated)
elapsed time: 0.512562736 seconds (7 MB allocated)
elapsed time: 0.506199265 seconds (7 MB allocated)
getperm4
elapsed time: 0.225968056 seconds (7 MB allocated)
elapsed time: 0.220587707 seconds (7 MB allocated)
elapsed time: 0.219854355 seconds (7 MB allocated)
elapsed time: 0.226289377 seconds (7 MB allocated)
elapsed time: 0.220391515 seconds (7 MB allocated)
I've not looked into why the performance is worse with SubArray, but it may simply be an extra pointer dereference on every memory access. It's remarkable how little the allocation actually costs you in time: getperm1's timings are more variable, but it still occasionally beats getperm4! I think this is due to some extra pointer math in Array's internal implementation with shared data. There's also some crazy caching behavior… getperm1 gets significantly faster on repeated runs.
UPDATE: Note that the relevant function in Julia v1+ is view
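For readers on Julia 1.x, a view is now the idiomatic zero-copy answer to the original question (a sketch using only Base; the @views form is equivalent macro sugar):

getperm_view(X, j) = sortperm(view(X, :, j))    # no copy of the column
# or equivalently:
getperm_views(X, j) = @views sortperm(X[:, j])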
Question: I would like to index into an array without triggering memory allocation, especially when passing the indexed elements into a function. From reading the Julia docs, I suspect the answer revolves around using the sub function, but can't quite see how...
Working Example: I build a large vector of Float64 (x) and then an index to every observation in x.
N = 10000000
x = randn(N)
inds = [1:N]
Now I time the mean function over x and x[inds] (I run mean(randn(2)) first to avoid any compiler irregularities in the timing):
@time mean(x)
@time mean(x[inds])
It's an identical calculation, but as expected the results of the timings are:
elapsed time: 0.007029772 seconds (96 bytes allocated)
elapsed time: 0.067880112 seconds (80000208 bytes allocated, 35.38% gc time)
So, is there a way around the memory allocation problem for arbitrary choices of inds (and arbitrary choice of array and function)?
Just use xs = sub(x, 1:N). Note that this is different from x = sub(x, [1:N]); on julia 0.3 the latter will fail, and on julia 0.4-pre the latter will be considerably slower than the former. On julia 0.4-pre, sub(x, 1:N) is just as fast as view:
julia> N = 10000000;
julia> x = randn(N);
julia> xs = sub(x, 1:N);
julia> using ArrayViews
julia> xv = view(x, 1:N);
julia> mean(x)
-0.0002491126429772525
julia> mean(xs)
-0.0002491126429772525
julia> mean(xv)
-0.0002491126429772525
julia> @time mean(x);
elapsed time: 0.015345806 seconds (27 kB allocated)
julia> @time mean(xs);
elapsed time: 0.013815785 seconds (96 bytes allocated)
julia> @time mean(xv);
elapsed time: 0.015871052 seconds (96 bytes allocated)
There are several reasons why sub(x, inds) is slower than sub(x, 1:N):
Each access xs[i] corresponds to x[inds[i]]; we have to look up two memory locations rather than one
If inds is not in order, you will get poor cache behavior in accessing the elements of x
It destroys the ability to use SIMD vectorization
In this case, the last is probably the most important effect. This is not a Julia limitation; the same thing would happen if you wrote the equivalent code in C, Fortran, or assembly.
Note that it's still faster to say sum(sub(x, inds)) than sum(x[inds]) (until the latter becomes the former, which it should by the time julia 0.4 is officially out). But if you have to do many operations with xs = sub(x, inds), in some circumstances it will be worth your while to make a copy, even though it allocates memory, just so you can take advantage of the optimizations possible when values are stored in contiguous memory, as in the sketch below.
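For instance (a sketch; where the break-even point lies depends on how many passes you make over the data):

julia> xc = x[inds];   # pay the allocation and copy once

julia> mean(xc)        # every later pass runs at contiguous-array speed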
EDIT: Read tholy's answer too to get a full picture!
When using an array of indices, the situation is not great right now on Julia 0.4-pre (start of Feb 2015):
julia> N = 10000000;
julia> x = randn(N);
julia> inds = [1:N];
julia> @time mean(x)
elapsed time: 0.010702729 seconds (96 bytes allocated)
elapsed time: 0.012167155 seconds (96 bytes allocated)
julia> @time mean(x[inds])
elapsed time: 0.088312275 seconds (76 MB allocated, 17.87% gc time in 1 pauses with 0 full sweep)
elapsed time: 0.073672734 seconds (76 MB allocated, 3.27% gc time in 1 pauses with 0 full sweep)
elapsed time: 0.071646757 seconds (76 MB allocated, 1.08% gc time in 1 pauses with 0 full sweep)
julia> xs = sub(x,inds); # Only works on 0.4
julia> @time mean(xs)
elapsed time: 0.057446177 seconds (96 bytes allocated)
elapsed time: 0.096983673 seconds (96 bytes allocated)
elapsed time: 0.096711312 seconds (96 bytes allocated)
julia> using ArrayViews
julia> xv = view(x, 1:N) # Note use of a range, not [1:N]!
julia> @time mean(xv)
elapsed time: 0.012919509 seconds (96 bytes allocated)
elapsed time: 0.013010655 seconds (96 bytes allocated)
elapsed time: 0.01288134 seconds (96 bytes allocated)
julia> xs = sub(x,1:N) # Works on 0.3 and 0.4
julia> @time mean(xs)
elapsed time: 0.014191482 seconds (96 bytes allocated)
elapsed time: 0.014023089 seconds (96 bytes allocated)
elapsed time: 0.01257188 seconds (96 bytes allocated)
So while we can avoid the memory allocation, we are actually still slower(!).
The issue is indexing by an array, as opposed to a range. You can't use sub for this on 0.3, but you can on 0.4.
If we can index by a range, then we can use ArrayViews.jl on 0.3, and it's built in on 0.4. This case is pretty much as good as the original mean.
I noticed that with a smaller number of indices used (instead of the whole range), the gap is much smaller and the memory allocation is low, so sub might be worth it:
N = 100000000
x = randn(N)
inds = [1:div(N,10)]
@time mean(x)
@time mean(x)
@time mean(x)
@time mean(x[inds])
@time mean(x[inds])
@time mean(x[inds])
xi = sub(x,inds)
@time mean(xi)
@time mean(xi)
@time mean(xi)
gives
elapsed time: 0.092831612 seconds (985 kB allocated)
elapsed time: 0.067694917 seconds (96 bytes allocated)
elapsed time: 0.066209038 seconds (96 bytes allocated)
elapsed time: 0.066816927 seconds (76 MB allocated, 20.62% gc time in 1 pauses with 1 full sweep)
elapsed time: 0.057211528 seconds (76 MB allocated, 19.57% gc time in 1 pauses with 0 full sweep)
elapsed time: 0.046782848 seconds (76 MB allocated, 1.81% gc time in 1 pauses with 0 full sweep)
elapsed time: 0.186084807 seconds (4 MB allocated)
elapsed time: 0.057476269 seconds (96 bytes allocated)
elapsed time: 0.05733602 seconds (96 bytes allocated)
I compiled the program with the -Criot -gl flags and, to my surprise, I get a lot of errors instead of one (in fact, I was looking to fix a 216 error). The first is with the code below, a simple hashing function. I have no idea how to fix this.
function HashStr(s : string) : integer;
var
  h : integer;
  c : char;
begin
  h := 0;
  for c in s do
    h := ord(c) + 31 * h; { This is the line of error }
  HashStr := h;
end;
How can this be out of ranges?
Easily. Say you have the string "zzzzzzzzzzz". Ord(c) will be 122, so the sequence is
H = 122 + (31 * 0)    = 122
H = 122 + (31 * 122)  = 3904
H = 122 + (31 * 3904) = 121146
which exceeds the 32767 limit for 16-bit ints. If it's a 32-bit int, it won't take many more iterations to exceed that limit either.
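A sketch of one possible fix: accumulate in int64 so that 31 * h can never overflow, and fold the value back into range at every step (if integer is 16-bit in your compiler mode, fold with mod 32767 instead):

function HashStr(s : string) : integer;
var
  h : int64;
  c : char;
begin
  h := 0;
  for c in s do
    h := (ord(c) + 31 * h) mod 2147483647; { keep h in positive 32-bit range }
  HashStr := integer(h);
end;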