CUDA functions in DLL, __declspec(dllexport) works but strange results? - qt

I've come across a weird problem with a piece of CUDA code. It's compiled into a DLL using MSVC (Visual Studio Community 2015) and nvcc on Windows 10, with CUDA 8. The application calling the DLL is being developed with Qt 5.
The application is fairly large and complicated, using Qt, CUDA, VTK, and HDF5. It all seems to work: the app runs and does what it's supposed to, but it fails in a reproducible way that doesn't seem to make any sense. The example function below seems to reproduce a similar error.
I'm compiling the DLL with:
nvcc -m64 -arch=sm_20 -o fdm1_cuda.dll -Xcompiler "/LD /D_USRDLL /D_WINDLL" fdm1_cuda.cu
This function seems to exhibit the same problem as the main code:
extern "C" __declspec(dllexport) void fdm1_funnyproblemchecker(){
cudaError_t errorcode;
float *a_host;
float *b_host;
float *a_device;
int num, i;
num=10;
a_host = (float *)malloc(sizeof(float)*num);
if( a_host) printf("Result check, allocate host memory a: success\n");
if(!a_host) printf("Result check, allocate host memory a: failed!\n");
for(i=0;i<num;i++) a_host[i] = (float)i;
for(i=0;i<num;i++) printf("%6.3f ", a_host[i]);
printf("\n");
b_host = (float *)malloc(sizeof(float)*num);
if( b_host) printf("Result check, allocate host memory b: success\n");
if(!b_host) printf("Result check, allocate host memory b: failed!\n");
errorcode = cudaSuccess;
cudaMalloc((void **) &a_device, sizeof(float)*num);
errorcode = cudaGetLastError();
printf("Result check, allocate device memory: %s\n", cudaGetErrorString(errorcode));
errorcode = cudaSuccess;
cudaMemcpy(a_device, a_host, num*sizeof(float), cudaMemcpyHostToDevice);
errorcode = cudaGetLastError();
printf("Result check, copy host to device : %s\n", cudaGetErrorString(errorcode));
errorcode = cudaSuccess;
cudaMemcpy(b_host, a_device, num*sizeof(float), cudaMemcpyDeviceToHost);
errorcode = cudaGetLastError();
printf("Result check, copy device to host : %s\n", cudaGetErrorString(errorcode));
for(i=0;i<num;i++) printf("%6.3f ", b_host[i]);
printf("\n");
fflush(stdout);
cudaFree(a_device);
free(a_host);
free(b_host);
}
Sometimes the output from this is:
Result check, allocate host memory a: success
0.000 1.000 2.000 3.000 4.000 5.000 6.000 7.000 8.000 9.000
Result check, allocate host memory b: success
Result check, allocate device memory: no error
Result check, copy host to device : no error
Result check, copy device to host : no error
0.000 1.000 2.000 3.000 4.000 5.000 6.000 7.000 8.000 9.000
If I change something elsewhere in the application that I don't think is related (changing the size of a model at runtime), I get this:
Result check, allocate host memory a: success
0.000 1.000 2.000 3.000 4.000 5.000 6.000 7.000 8.000 9.000
Result check, allocate host memory b: success
Result check, allocate device memory: no error
Result check, copy host to device : an illegal memory access was encountered
Result check, copy device to host : an illegal memory access was encountered
0.000 0.000 0.000 0.000 0.000 0.000 0.000 270355481144287188484096.000 74936693461279934656588472647680.000 0.000
So, there is a cudaMemcpy failure.
I can't tell if it is a host malloc issue, a cudaMalloc issue, or something related to it running from a DLL.
Can anyone see what I'm missing here?
I've had this application running on Linux and macOS, using dynamic libraries, without any major problems. I'm now trying to get it going under Windows.

Problem solved.
It was a kernel elsewhere in the code accessing an array out of bounds (at index -1).
My error checking with cudaGetLastError was incomplete: if I call cudaDeviceSynchronize before each cudaGetLastError, it reports the errors I was missing.
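For reference, a minimal sketch of that kind of synchronizing check (an illustration, not the original code; the kernel name and launch parameters in the usage comment are placeholders):
#include <cstdio>
#include <cuda_runtime.h>

// Synchronize before querying the error state so that asynchronous
// failures (e.g. an out-of-bounds access in a kernel launched earlier)
// are reported here rather than at some later, unrelated API call.
static void check_cuda(const char *label)
{
    cudaError_t err = cudaDeviceSynchronize();  // wait for outstanding device work
    if (err == cudaSuccess)
        err = cudaGetLastError();               // pick up any sticky launch error
    printf("%s: %s\n", label, cudaGetErrorString(err));
    fflush(stdout);
}

// Usage after a (hypothetical) kernel launch:
//   my_kernel<<<grid, block>>>(a_device, num);
//   check_cuda("after my_kernel");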
Thanks.

Related

ffplay attempt to subscribe to rtmp server failing with: RTMP_ReadPacket, failed to read RTMP packet header

I have an nginx RTMP server running from this Docker image: https://github.com/DvdGiessen/nginx-rtmp-docker.
In general I can stream to it fine with ffmpeg, and most of the time I can connect to the stream fine with ffplay as well. However, some people are unable to subscribe to the RTMP stream at all.
ffmpeg hosts with this command:
ffmpeg.exe -f gdigrab -framerate 20 -draw_mouse 1 -i desktop -c:v h264_nvenc -profile:v main -delay 0 -preset default -rc vbr -cq 36 -vf scale=1024:-2,format=yuv420p -r 20 -g 40 -y -f flv rtmp://url
ffplay subscribes with this command:
ffplay.exe -fflags nobuffer -flags low_delay -an -window_title "Screen of User" -framedrop rtmp://url
The URL matches the one the host is streaming to. What happens is that for about 30 seconds nothing happens, with the following ffplay output:
nan : 0.000 fd= 0 aq= 0KB vq= 0KB sq= 0B f=0/0
nan : 0.000 fd= 0 aq= 0KB vq= 0KB sq= 0B f=0/0
nan : 0.000 fd= 0 aq= 0KB vq= 0KB sq= 0B f=0/0
This repeats until, after a while, I get the following error:
RTMP_ReadPacket, failed to read RTMP packet header
2018/mm/dd 12:--:--:-- [web] rtmp://url: Invalid data found when processing input
I tried the NGINX server setup recommendation from https://github.com/arut/nginx-rtmp-module/issues/1039, setting worker_processes to 1, which did not change anything.
It seems like it may just be ffplay timing out, but I cannot tell why it occurs only for a few users and not more widely. If it is ffplay timing out, what can be done to fix it? It doesn't seem like an internet speed issue, as these subscribers have good connections. I cannot replicate it across different machines, only on the few that have continued to have this problem. Any and all help would be appreciated!

ROS Crashing on macOS X Sierra with JavaScript heap out of memory error

I'm running the developer edition of Realm Object Server v1.8.3 as a Mac app. I start it with start-object-server.command. It had been running fine for a number of days and everything was working really well, but ROS is now crashing within seconds of starting it.
Clearly the issue is with the JavaScript element, but I am not sure what has led to this situation, nor how best to recover from this error. I have not created any additional functions, so I am not adding any Node.js issues of my own: it's just ROS with half a dozen Realms.
The stack dump I get from the terminal session is as below. Any thoughts on recovery steps and how to prevent it happening again would be appreciated.
Last few GCs
607335 ms: Mark-sweep 1352.1 (1404.9) -> 1351.7 (1402.9) MB, 17.4 / 0.0 ms [allocation failure] [GC in old space requested].
607361 ms: Mark-sweep 1351.7 (1402.9) -> 1351.7 (1367.9) MB, 25.3 / 0.0 ms [last resort gc].
607376 ms: Mark-sweep 1351.7 (1367.9) -> 1351.6 (1367.9) MB, 15.3 / 0.0 ms [last resort gc].
JS stacktrace
Security context: 0x3eb4332cfb39
1: DoJoin(aka DoJoin) [native array.js:~129] [pc=0x1160420f24ad] (this=0x3eb433204381 ,w=0x129875f3a8b1 ,x=3,N=0x3eb4332043c1 ,J=0x3828ea25c11 ,I=0x3eb4332b46c9 )
2: Join(aka Join) [native array.js:180] [pc=0x116042067e32] (this=0x3eb433204381 ,w=0x129875f3a8b1
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: node::Abort() [/Applications/realm-mobile-platform/realm-object-server/.prefix/bin/node]
2: node::FatalException(v8::Isolate*, v8::Local, v8::Local) [/Applications/realm-mobile-platform/realm-object-server/.prefix/bin/node]
3: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/Applications/realm-mobile-platform/realm-object-server/.prefix/bin/node]
4: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/Applications/realm-mobile-platform/realm-object-server/.prefix/bin/node]
5: v8::internal::Runtime_StringBuilderJoin(int, v8::internal::Object**, v8::internal::Isolate*) [/Applications/realm-mobile-platform/realm-object-server/.prefix/bin/node]
6: 0x1160411092a7
/Applications/realm-mobile-platform/start-object-server.command: line 94: 39828 Abort trap: 6 node "$package/node_modules/.bin/realm-object-server" -c configuration.yml (wd: /Applications/realm-mobile-platform/realm-object-server/object-server)
Your ROS instance has run out of memory. To figure out why, it would be helpful to see the server's log file. Can you turn on the debug level for logging?
If you want to send a log file to Realm, it is better to open an issue for this at https://github.com/realm/realm-mobile-platform/issues.

Asterisk Getting stopped when handling more than 200 calls

I have a soft PBX set up using Asterisk, DAHDI, and libpri. Asterisk stops frequently when handling more than 200 calls, and when it does, all calls in progress are abandoned.
Server Configurations :
RAM : 32 GB
Processor : 16 core
OS : Debian Squeeze, 64-bit (installed without X)
Asterisk Version : 13.10
DAHDI (TE435/235 cards) Version : 2.11.1 (we are using two 4-port cards)
Libpri Version : 1.4.11
We have changed maxfiles to 2000 in asterisk.conf to handle 240 calls.
I am getting the error below in dmesg:
wcte43x 0000:05:00.0: Underrun detected by hardware. Latency at max of 12ms.
[406144.759396] __ratelimit: 48 callbacks suppressed
I am getting the warning below in the Asterisk log:
WARNING[4876][C-000000db] sig_analog.c: Ring/Off-hook in strange state 6 on channel 37
WARNING[4876][C-000000db] channel.c: Unexpected control subclass '2'
I am getting the message below in the messages log:
Altumivr kernel: [165794.686917] asterisk[32641] trap divide error ip:7f14375e75eb sp:7f1411b1c1a0 error:0 in res_musiconhold.so[7f14375e1000+b000]
Is there any tweak needed at the configuration level? Please assist and suggest.
The problem was with the Digium card (wcte43x). The error
"Underrun detected by hardware. Latency at max of 12ms" was resolved once the card was replaced. Thank you.

mpi + infiniband too many connections

I am running an MPI application on a cluster, using 4 nodes, each with 64 cores.
The application performs an all to all communication pattern.
Running the application as follows works fine:
$: mpirun -npernode 36 ./Application
Adding one more process per node makes the application crash:
$: mpirun -npernode 37 ./Application
--------------------------------------------------------------------------
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.
For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: laser045
Local device: qib0
Queue pair type: Reliable connected (RC)
--------------------------------------------------------------------------
[laser045:15359] *** An error occurred in MPI_Issend
[laser045:15359] *** on communicator MPI_COMM_WORLD
[laser045:15359] *** MPI_ERR_OTHER: known error not in list
[laser045:15359] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
[laser040:49950] [[53382,0],0]->[[53382,1],30] mca_oob_tcp_msg_send_handler: writev failed: Connection reset by peer (104) [sd = 163]
[laser040:49950] [[53382,0],0]->[[53382,1],21] mca_oob_tcp_msg_send_handler: writev failed: Connection reset by peer (104) [sd = 154]
--------------------------------------------------------------------------
mpirun has exited due to process rank 128 with PID 15358 on
node laser045 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[laser040:49950] 4 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
[laser040:49950] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[laser040:49950] 4 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
EDIT: added some source code for the all-to-all communication pattern:
// Send data to all other ranks
for (unsigned i = 0; i < (unsigned)size; ++i) {
    if ((unsigned)rank == i) {
        continue;
    }
    MPI_Request request;
    MPI_Issend(&data, dataSize, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &request);
    requests.push_back(request);
}

// Recv data from all other ranks
for (unsigned i = 0; i < (unsigned)size; ++i) {
    if ((unsigned)rank == i) {
        continue;
    }
    MPI_Status status;
    MPI_Recv(&recvData, recvDataSize, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
}

// Finish communication operations
for (MPI_Request &r : requests) {
    MPI_Status status;
    MPI_Wait(&r, &status);
}
Is there something I can do as a cluster user, or some advice I can give the cluster admin?
The mca_oob_tcp_msg_send_handler error lines may indicate that the node corresponding to a receiving rank died (ran out of memory or received a SIGSEGV):
http://www.open-mpi.org/faq/?category=tcp#tcp-connection-errors
The OOB (out-of-band) framework in Open MPI is used for control messages, not for your application's messages. Indeed, messages typically go through byte transfer layers (BTLs) such as self, sm, vader, openib (InfiniBand), and so on.
The output of 'ompi_info -a' is useful in that regard.
Finally, the question does not specify whether the InfiniBand hardware vendor is Mellanox, so the XRC option may not work (for instance, Intel/QLogic InfiniBand does not support it).
The error is connected to the buffer size of the MPI message queues, as discussed here:
http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc
The following environment setting solved my problem:
$ export OMPI_MCA_btl_openib_receive_queues="P,128,256,192,128:S,65536,256,192,128"

What do programs see when ZFS can't deliver uncorrupted data?

Say my program attempts a read of a byte in a file on a ZFS filesystem. ZFS can locate a copy of the necessary block, but cannot locate any copy with a valid checksum (they're all corrupted, or the only disks present have corrupted copies). What does my program see, in terms of the return value from the read, and the byte it tried to read? And is there a way to influence the behavior (under Solaris, or any other ZFS-implementing OS), that is, force failure, or force success, with potentially corrupt data?
EIO is indeed the only answer with current ZFS implementations.
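A minimal sketch of what that looks like from the program's point of view (plain POSIX C; the file path is a hypothetical placeholder for a file whose only copies fail checksum verification):
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[512];
    int fd = open("/tank/corrupted_file", O_RDONLY);  /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t n = read(fd, buf, sizeof buf);
    if (n < 0) {
        /* ZFS refuses to hand back data it cannot verify: read() returns -1,
         * errno is set to EIO, and nothing is copied into buf. */
        printf("read failed: %s (errno=%d)\n", strerror(errno), errno);
    } else {
        printf("read %zd bytes\n", n);
    }

    close(fd);
    return 0;
}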
An open ZFS "bug" asks for some way to read corrupted data:
http://bugs.opensolaris.org/bugdatabase/printableBug.do?bug_id=6186106
I believe this is already doable using the undocumented but open source zdb utility.
Have a look at http://www.cuddletech.com/blog/pivot/entry.php?id=980 for an explanation of how to dump a file's contents using zdb's -R option and the "r" flag.
Solaris 10:
# Create a test pool
[root@tesalia z]# cd /tmp
[root@tesalia tmp]# mkfile 100M zz
[root@tesalia tmp]# zpool create prueba /tmp/zz
# Fill the pool
[root@tesalia /]# dd if=/dev/zero of=/prueba/dummy_file
dd: writing to `/prueba/dummy_file': No space left on device
129537+0 records in
129536+0 records out
66322432 bytes (66 MB) copied, 1.6093 s, 41.2 MB/s
# Umount the pool
[root@tesalia /]# zpool export prueba
# Corrupt the pool on purpose
[root@tesalia /]# dd if=/dev/urandom of=/tmp/zz seek=100000 count=1 conv=notrunc
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0715209 s, 7.2 kB/s
# Mount the pool again
zpool import -d /tmp prueba
# Try to read the corrupted data
[root@tesalia tmp]# md5sum /prueba/dummy_file
md5sum: /prueba/dummy_file: I/O error
# Read the manual
[root@tesalia tmp]# man -s2 read
[...]
RETURN VALUES
Upon successful completion, read() and readv() return a
non-negative integer indicating the number of bytes actually
read. Otherwise, the functions return -1 and set errno to
indicate the error.
ERRORS
The read(), readv(), and pread() functions will fail if:
[...]
EIO A physical I/O error has occurred, [...]
You must export/import the test pool because, otherwise, the direct overwrite (the deliberate pool corruption) would be missed, since the file would still be cached in OS memory.
And no, currently ZFS will refuse to give you corrupted data. As it should.
How would returning anything but an EIO error from read() make sense outside a filesystem-specific, low-level data rescue utility?
Such a data rescue utility would need to use an OS- and filesystem-specific API other than open/read/write/close to access the file. The semantics it needs are fundamentally different from reading normal files, so it would need a specialized API.
