write a wav file to audio device using alsa

write a wav file to audio device using alsa - wav

I want to write a .wav file to audio device. I used alsa for that purpose. I took a sample file.Here I am copying the file sample data to buffer.
file.read(reinterpret_cast<char*>(&buff),wave.dataHeader.chunkSize);
int nframes = wave.dataHeader.chunkSize/2;
while( nframes > 0 ){
int r = snd_pcm_writei (playback_handle,
reinterpret_cast<const char*>(buff), nframes);
nframes = nframes - r;
if(r < 0){
fprintf (stderr, "write to audio interface failed (%s)\n",
snd_strerror (err));
exit (1);
}
buff += r * 2;
}
My files data is
File Size: 2084
FmtChunk Size: 16
Format Type: 1
Channels: 2
Sample Rate: 48000
Bytes Per Sec: 192000
Bits per sample: 16
here it is giving error
a.out: pcm.c:1250: snd_pcm_writei: Assertion `pcm' failed.
Aborted.
Here what should be the nframes value and how should i write to audio device any help thanks in advance.

Related

Why is my audio delay filter losing the delay?

I'm trying to stream audio from a QAudioInput to a QAudioOutput (Qt 5.15) but with a few seconds of artificial delay. In an effort to keep the code simple I implemented a delay filter based on a QIODevice, which sits between the input and output. The audio i/o is initialized like so:
QAudioFormat audioformat = ...;
QAudioInput *audioin = new QAudioInput(..., audioformat, this);
QAudioOutput *audioout = new QAudioOutput(..., audioformat, this);
DelayFilter *delay = new DelayFilter(this);
delay->setDelay(3.0, audioformat);
audioout->start(delay); // output reads from 'delay', writes to speakers
audioin->start(delay); // input reads from line input, writes to 'delay'
The delay filter is:
class DelayFilter : public QIODevice {
Q_OBJECT
public:
explicit DelayFilter (QObject *parent = nullptr);
void setDelay (int bytes);
void setDelay (double seconds, const QAudioFormat &format);
int delay () const { return delay_; }
protected:
qint64 readData (char *data, qint64 maxlen) override;
qint64 writeData (const char *data, qint64 len) override;
private:
int delay_; // delay length in bytes
QByteArray buffer_; // buffered data for delaying
int leadin_; // >0 = need to increase output delay, <0 = need to decrease
// debugging:
qint64 totalread_, totalwritten_;
};
And implemented like this:
DelayFilter::DelayFilter (QObject *parent)
: QIODevice(parent), delay_(0), leadin_(0), totalread_(0), totalwritten_(0)
{
open(QIODevice::ReadWrite);
}
void DelayFilter::setDelay (double seconds, const QAudioFormat &format) {
setDelay(format.bytesForFrames(qRound(seconds * format.sampleRate())));
}
void DelayFilter::setDelay (int bytes) {
bytes = std::max(0, bytes);
leadin_ += (bytes - delay_);
delay_ = bytes;
}
qint64 DelayFilter::writeData (const char *data, qint64 len) {
qint64 written = -1;
if (len >= 0) {
try {
buffer_.append(data, len);
written = len;
} catch (std::bad_alloc) {
}
}
if (written > 0) totalwritten_ += written;
//qDebug() << "wrote " << written << leadin_ << buffer_.size();
return written;
}
qint64 DelayFilter::readData (char *dest, qint64 maxlen) {
//qDebug() << "reading" << maxlen << leadin_ << buffer_.size();
qint64 w, bufpos;
for (w = 0; leadin_ > 0 && w < maxlen; --leadin_, ++w)
dest[w] = 0;
for (bufpos = 0; bufpos < buffer_.size() && leadin_ < 0; ++bufpos, ++leadin_)
;
// todo if needed: if upper limit is ok on buffered data, use a fixed size ring instead
if (leadin_ == 0) {
const char *bufdata = buffer_.constData();
for ( ; bufpos < buffer_.size() && w < maxlen; ++bufpos, ++w)
dest[w] = bufdata[bufpos];
buffer_ = buffer_.mid(bufpos);
}
totalread_ += w;
qDebug() << "read " << w << leadin_ << buffer_.size()
<< bufpos << totalwritten_ << totalread_ << (totalread_ - totalwritten_);
return w;
}
Where the fundamental idea is:
If I want to delay 3 seconds, I write out 3 seconds of silence and then start piping the data as usual.
And delay changes are handled like this (for completeness, although it's not relevant to this question because I'm seeing issues without changing delays):
If I decrease that 3 second delay to 2 seconds then I have to skip 1 second worth of input to catch up.
If I increase that 3 second delay to 4 seconds then I have to write 1 second of silence to fall behind.
That is all implemented via the leadin_ counter, which contains the number of frames I have to delay (> 0) or skip (< 0) to get to the desired delay. In my example case, it's set > 0 when the 3 second delay is configured, and that will provide 3 seconds of silence to the QAudioOutput before it starts passing along the buffered data.
The problem is that the delay is there when the app starts, but over the course of a few seconds, the delay completely disappears and there is no delay any more. I can hear that it's skipping samples here and there to catch up, with an occasional light click or pop in the audio output.
The debug printouts show some things are working:
Seemingly matched read / write timings
Smooth decrease in leadin_ to 0 over first 3 seconds
Smooth increase in total bytes read and written
Constant bytes_read - bytes_written value equal to the delay time after initial ramp up
But they also show some stuff isn't:
The buffer size is filled up by the QAudioInput and initially increases, but then begins decreasing (once leadin_ is exhausted) and stays stable at a low value. But I expect the buffer size to grow and then stay constant, equal to the delay time. The decrease means reads are happening faster than writes.
I can't make any sense of it. I added some debugging code to watch for state changes in the input / output to see if they were popping into Idle state (the output will do this to avoid buffer underruns) but they're not, they're just happily handling data with no apparent hiccups.
I expected this to work because both the input and output are using the same sample rate, and so once I initially get 3 seconds behind (or whatever delay time) I expected it to stay that way forever. I can't understand why, given that the input and output are configured at the same sample rate, the output is skipping samples and eating up the delay, and then playing smoothly again.
Am I missing some important override in my QIODevice implementation? Or is there some weird thing that Qt Multimedia does with audio buffering that is breaking this? Or am I just doing something fundamentally wrong here? Since this QIODevice-based delay is all very passive, I don't think I'm doing anything to drive the timing forward faster than it should be going, am I?
I hope this is clear; my definition of "read" and "write" kind of flip/flops above depending on context but I did my best.
Initial debug output (2nd number is leadin_, 3rd number is amount of buffered data, last number is read-written):
read 19200 268800 0 0 0 19200 19200
read 16384 252416 11520 0 11520 35584 24064
read 16384 236032 19200 0 19200 51968 32768
read 16384 219648 26880 0 26880 68352 41472
read 16384 203264 34560 0 34560 84736 50176
read 16384 186880 46080 0 46080 101120 55040
read 16384 170496 53760 0 53760 117504 63744
read 16384 154112 61440 0 61440 133888 72448
read 16384 137728 69120 0 69120 150272 81152
read 16384 121344 80640 0 80640 166656 86016
read 16384 104960 88320 0 88320 183040 94720
read 16384 88576 96000 0 96000 199424 103424
read 16384 72192 103680 0 103680 215808 112128
read 16384 55808 115200 0 115200 232192 116992
read 16384 39424 122880 0 122880 248576 125696
read 16384 23040 130560 0 130560 264960 134400
read 16384 6656 138240 0 138240 281344 143104
read 16384 0 140032 9728 149760 297728 147968
read 16384 0 131328 16384 157440 314112 156672
read 16384 0 122624 16384 165120 330496 165376
read 16384 0 113920 16384 172800 346880 174080
read 16384 0 109056 16384 184320 363264 178944
read 16384 0 100352 16384 192000 379648 187648
read 16384 0 91648 16384 199680 396032 196352
read 16384 0 82944 16384 207360 412416 205056
read 16384 0 78080 16384 218880 428800 209920
read 16384 0 69376 16384 226560 445184 218624
read 16384 0 60672 16384 234240 461568 227328
read 16384 0 51968 16384 241920 477952 236032
read 16384 0 47104 16384 253440 494336 240896
read 16384 0 38400 16384 261120 510720 249600
read 16384 0 29696 16384 268800 527104 258304
read 16384 0 20992 16384 276480 543488 267008
read 16384 0 16128 16384 288000 559872 271872
read 16384 0 7424 16384 295680 576256 280576
read 15104 0 0 15104 303360 591360 288000
read 7680 0 0 7680 311040 599040 288000
read 3840 0 0 3840 314880 602880 288000
read 3840 0 0 3840 318720 606720 288000
read 3840 0 0 3840 322560 610560 288000
read 3840 0 0 3840 326400 614400 288000
read 3840 0 0 3840 330240 618240 288000
read 3840 0 0 3840 334080 622080 288000
read 3840 0 0 3840 337920 625920 288000

esp32 pico kit - actual clock speed vs crystal vs max_speed etc

I am trying to understand how the clock speed in the esp32 works.
I am running arduino ide 1.8.12 with 1.04 board definition - on a pico kit 4 module .
The device datasheet says it has a 40mhz crystal.
getCpuFrequencyMhz() reports 240mhz
But the following code is generating 1sec interrupts with 80mgz as a base i.e., as if the clock speed is 80mhz.
If I explicitly set the clock speed to 80 mhz then getCpufrequency correctly reports it as having changed, but the code still keeps generating 1sec interrupts. If I set the clock to 40mhz then I finally see interrupts at 2 secs as expected.
I am sure that the explanation is somewhere in expressif's docs, but I have been unable to glean it. Can anyone quickly explain what is going on or point me to some resource that explains this particular circumstance?
---------------
Sample output
getCpuFrequencyMhz() = 240
interrupts so far = 1 - : - mills() = 1018
interrupts so far = 2 - : - mills() = 2018
interrupts so far = 3 - : - mills() = 3018
interrupts so far = 4 - : - mills() = 4018
interrupts so far = 5 - : - mills() = 5018
----------------Sample code
#include "esp32-hal-cpu.h"
volatile int interruptCounter;
int totalInterruptCounter;
hw_timer_t * timer = NULL;
portMUX_TYPE timerMux = portMUX_INITIALIZER_UNLOCKED;
void IRAM_ATTR onTimer() {
portENTER_CRITICAL_ISR(&timerMux);
interruptCounter++;
portEXIT_CRITICAL_ISR(&timerMux);
}
void setup() {
Serial.begin(115200);
Serial.printf("getCpuFrequencyMhz() = %d\n",getCpuFrequencyMhz()); //Get CPU clock
timer = timerBegin(0, 80, true);
timerAttachInterrupt(timer, &onTimer, true);
timerAlarmWrite(timer, 1000000, true);
timerAlarmEnable(timer);
}
void loop() {
if (interruptCounter > 0) {
portENTER_CRITICAL(&timerMux);
interruptCounter--;
portEXIT_CRITICAL(&timerMux);
totalInterruptCounter++;
Serial.printf("interrupts so far = %d - : - mills() = %d \n", totalInterruptCounter, millis());
}
}

audio decoding error using libavcodec avcodec_decode_audio4

At present, I am trying audio decoding for the first time using avcodec_decode_audio4(), which is always returning error.
Note: My overall intention is to merge multiple AMR_NB encoded files into one audio file, and eventually mux this final audio file with another video file into .mp4 container.
Here's the code:
if(avformat_open_input(&sound, "/tmp/aud0", NULL, NULL)!=0) {
printf("Error in opening input file aud0\n");
exit(1);
} else {
printf("Opened aud0 into FormatContext\n");
}
if(avformat_find_stream_info(sound, NULL)<0){
printf("Error in finding stream info\n");
avformat_close_input(&sound);
exit(1);
}
int aud_stream_idx=-1;
for(int count=0; count<sound->nb_streams; count++) {
if(sound->streams[count]->codec->codec_type == AVMEDIA_TYPE_AUDIO) {
aud_stream_idx = count;
printf("Audio stream index %d found\n", count);
break;
}
}
if(aud_stream_idx==-1) {
printf("Audio stream not found in the aud0 file\n");
avformat_close_input(&sound);
exit(1);
}
AVCodecContext *audioCC = sound->streams[aud_stream_idx]->codec;
printf("Audio codec_id=%d, codec_name=%s\n", audioCC->codec_id, audioCC->codec_name);
AVCodec *audioC = avcodec_find_decoder(audioCC->codec_id);
if (audioC == NULL) { printf("Couldn't find decoder\n");
avformat_close_input(&sound);
exit(1);
} else {
printf("Found decoder name %s\n", audioCC->codec_name);
}
if(avcodec_open2(audioCC, audioC, NULL)<0) {
printf("Couldn't open decoder\n");
avformat_close_input(&sound);
exit(1);
} else {
printf("Found decoder name %s\n", audioCC->codec_name);
printf("Found bitrate %d\n", audioCC->bit_rate);
printf("Found sample_rate %d\n", audioCC->sample_rate);
printf("Found sample_fmt %d\n", audioCC->sample_fmt); }
/* decode until eof */
av_init_packet(&avpkt);
avpkt.data=NULL;
avpkt.size=0;
if(av_read_frame(sound, &avpkt)<0) {
printf("Couldn't read encoded audio packet\n");
av_free_packet(&avpkt);
avformat_close_input(&sound);
exit(1);
} else {
printf("Read encoded audio packet\n");
printf("avpkt.pts = %d\n", avpkt.pts);
printf("avpkt.dts = %d\n", avpkt.dts);
printf("avpkt.duration = %d\n", avpkt.duration);
printf("avpkt.stream_index = %d\n", avpkt.stream_index);
printf("avpkt.data = %x\n", avpkt.data);
printf("avpkt.data[0] = %02x\n", avpkt.data[0]);
printf("avpkt.data[0]>>>3&0x0f = %02x\n", avpkt.data[0]>>3 & 0x0f);
}
fprintf(stderr, "avpkt.size=%d\n", avpkt.size);
while (avpkt.size > 0) {
int got_frame = 0;
if (!decoded_frame) {
if (!(decoded_frame = avcodec_alloc_frame())) {
fprintf(stderr, "out of memory\n");
exit(1);
} } else
avcodec_get_frame_defaults(decoded_frame);
len = avcodec_decode_audio4(audioCC, decoded_frame, &got_frame, &avpkt);
if (len < 0) {
fprintf(stderr, "Error %d while decoding\n", len);
exit(1);
}
Here is the output I see (I think demuxing is working fine):
Opened aud0 into FormatContext
Audio stream index 0 found
Audio codec_id=73728, codec_name=
Found decoder name
Found decoder name
Found bitrate 5200
Found sample_rate 8000
Found sample_fmt 3
Read encoded audio packet
avpkt.pts = 0
avpkt.dts = 0
avpkt.duration = 160
avpkt.stream_index = 0
avpkt.data = 91e00680
avpkt.data[0] = 04
avpkt.data[0]>>>3&0x0f = 00
avpkt.size=13
[amrnb # 0x7fce9300bc00] Corrupt bitstream
Error -1094995529 while decoding
This audio was created by recording voice on an Android device.
I also tried a .m4a generated by QuickTime, but no luck.
I feel I am missing some crucial step (like not initializing some CodecContext field, or not reading into AVPacket properly). In any case, if someone has any input or similar example, please let me know.

To help others understand ffmpeg error codes, this may be helpful. Error codes from ffmpeg (error.h from avutil) :
http://ffmpeg.org/doxygen/trunk/error_8h_source.html
It turns out the value you specified is :
#define AVERROR_INVALIDDATA FFERRTAG( 'I','N','D','A')
The -1094995529 becomes -0x41444E49 and when you look at those letters, in ACSII, 0x41 = 'A', 0x44 = 'D', 0x4E = 'N, and 0x49 = 'I'. Due to the macro/etc things are reversed, so ADNI becomes INDA, which you can see from the #define snippet, is the AVERROR_INVALIDDATA defined FFERRTAG( 'I','N','D','A').
The rest of the error codes are in that file and I've pasted them below here :
#define AVERROR_BSF_NOT_FOUND FFERRTAG(0xF8,'B','S','F') ///< Bitstream filter not found
#define AVERROR_BUG FFERRTAG( 'B','U','G','!') ///< Internal bug, also see AVERROR_BUG2
#define AVERROR_BUFFER_TOO_SMALL FFERRTAG( 'B','U','F','S') ///< Buffer too small
#define AVERROR_DECODER_NOT_FOUND FFERRTAG(0xF8,'D','E','C') ///< Decoder not found
#define AVERROR_DEMUXER_NOT_FOUND FFERRTAG(0xF8,'D','E','M') ///< Demuxer not found
#define AVERROR_ENCODER_NOT_FOUND FFERRTAG(0xF8,'E','N','C') ///< Encoder not found
#define AVERROR_EOF FFERRTAG( 'E','O','F',' ') ///< End of file
#define AVERROR_EXIT FFERRTAG( 'E','X','I','T') ///< Immediate exit was requested; the called function should not be restarted
#define AVERROR_EXTERNAL FFERRTAG( 'E','X','T',' ') ///< Generic error in an external library
#define AVERROR_FILTER_NOT_FOUND FFERRTAG(0xF8,'F','I','L') ///< Filter not found
#define AVERROR_INVALIDDATA FFERRTAG( 'I','N','D','A') ///< Invalid data found when processing input
#define AVERROR_MUXER_NOT_FOUND FFERRTAG(0xF8,'M','U','X') ///< Muxer not found
#define AVERROR_OPTION_NOT_FOUND FFERRTAG(0xF8,'O','P','T') ///< Option not found
#define AVERROR_PATCHWELCOME FFERRTAG( 'P','A','W','E') ///< Not yet implemented in FFmpeg, patches welcome
#define AVERROR_PROTOCOL_NOT_FOUND FFERRTAG(0xF8,'P','R','O') ///< Protocol not found
#define AVERROR_STREAM_NOT_FOUND FFERRTAG(0xF8,'S','T','R') ///< Stream not found
#define AVERROR_BUG2 FFERRTAG( 'B','U','G',' ')
#define AVERROR_UNKNOWN FFERRTAG( 'U','N','K','N') ///< Unknown error, typically from an external library
#define AVERROR_EXPERIMENTAL (-0x2bb2afa8) ///< Requested feature is flagged experimental. Set strict_std_compliance if you really want to use it.
#define AVERROR_INPUT_CHANGED (-0x636e6701) ///< Input changed between calls. Reconfiguration is required. (can be OR-ed with AVERROR_OUTPUT_CHANGED)
#define AVERROR_OUTPUT_CHANGED (-0x636e6702) ///< Output changed between calls. Reconfiguration is required. (can be OR-ed with AVERROR_INPUT_CHANGED)
#define AVERROR_HTTP_BAD_REQUEST FFERRTAG(0xF8,'4','0','0')
#define AVERROR_HTTP_UNAUTHORIZED FFERRTAG(0xF8,'4','0','1')
#define AVERROR_HTTP_FORBIDDEN FFERRTAG(0xF8,'4','0','3')
#define AVERROR_HTTP_NOT_FOUND FFERRTAG(0xF8,'4','0','4')
#define AVERROR_HTTP_OTHER_4XX FFERRTAG(0xF8,'4','X','X')
#define AVERROR_HTTP_SERVER_ERROR FFERRTAG(0xF8,'5','X','X')

first you need to call av_read_frame until you get to EOF. So you need av_read_frame in the loop.
Let me explain how decoding works:
you open formatcontext, codeccontext and decoder. that is the easy part.
in a loop you read packets form one stream. so you need
while(av_read_frame(sound, &avpkt))
in this loop you want to decode the data from the packet. but care, you need to check from which stream the data in the packet is. so its always better to check if the packet contains data from the audiostream
if (avpkt.stream_index == aud_stream_idx)
now you can decode and process the audio.
exmaple:
AVFrame * decoded_frame = av_frame_alloc();
AVPacket avpkt;
av_init_packet(&avpkt);
avpkt.data = NULL;
avpkt.size = 0;
while(av_read_frame(sound, &avpkt))
{
if (avpkt.stream_index == aud_stream_idx)
{
int got_frame;
int len = avcodec_decode_audio4(audioCC, decoded_frame, &got_frame, &avpkt);
}
}
Thats all the the magic :)
PS: to find a stream just call
av_find_best_stream(sound, AVMEDIA_TYPE_AUDIO, -1, -1, NULL, 0)
PPS: did not tested code. just codec here in window :D
PPPS: read the examples. this helped me a lot :)

Playing AAC RTP stream using ffdshow

I am trying to play an RTP stream from using a custom network source filter and ffdshow audio decoder (ffdshow-tryout stable).
The mediatype that I set on my source output stream is MEDIASUBTYPE_RAW_AAC1. Here is what I am setting:
pmt->SetType(&MEDIATYPE_Audio);
pmt->SetSubtype(&MEDIASUBTYPE_RAW_AAC1);
pmt->SetFormatType(&FORMAT_WaveFormatEx);
BYTE *AACRAW;
DWORD dwLen = sizeof(WAVEFORMATEX) + 2; //2 bytes of audio config
AACRAW = (BYTE *)::CoTaskMemAlloc(dwLen);
memset(AACRAW, 0, dwLen);
WAVEFORMATEX wfx;
wfx.wFormatTag = WAVE_FORMAT_RAW_AAC1;
wfx.nChannels = 1;
wfx.nSamplesPerSec = 16000;
wfx.nAvgBytesPerSec = 8000;
wfx.nBlockAlign = 1;
wfx.wBitsPerSample= 0;
wfx.cbSize = 2;
memcpy(AACRAW, (void *)&wfx, sizeof(WAVEFORMATEX));
vector<unsigned char>extra;
extra.push_back(0x14);
extra.push_back(0x08);
memcpy(AACRAW + sizeof(WAVEFORMATEX), extra.data(), extra.size());
pmt->SetFormat(AACRAW, dwLen);
::CoTaskMemFree(AACRAW);
And then when I receive a rtp packet here is what I forward to the ffdshow filter:
DeliverRTPAAC(pRaw + 12 + 2 + 2, nBufSize - 12 - 2 - 2, pack.timestamp);
where pRaw is the pointer to the rtp packet. Each rtp packet that I receive contains one AU.
The filters connect but does not play audio. No error output from the AAC decoder as well.
The SDP parameters from the Axis camera are:
a=rtpmap:97 mpeg4-generic/16000/1
a=fmtp:97 streamtype=5; profile-level-id=15; mode=AAC-hbr; config=1408; sizeLength=13; indexLength=3; indexDeltaLength=3; profile=1; bitrate=64000;
Can somebody help me out please?

Probably the data you are receiving is wrapped in ADTS headers and you need to strip the ADTS header to supply the decoder with raw AAC.

OpenCL Bus error

I have problem with my OpenCL code. I compile and running it on CPU (core 2 duo) Mac OS X 10.6.7. Here is the code:
#define BUFSIZE (524288) // 512 KB
#define BLOCKBYTES (32) // 32 B
__kernel void test(__global unsigned char *in,
__global unsigned char *out,
unsigned int srcOffset,
unsigned int dstOffset) {
int grId = get_group_id(0);
unsigned char msg[BUFSIZE];
srcOffset = grId * BUFSIZE;
dstOffset = grId * BLOCKBYTES;
// Copy from global to private memory
size_t i;
for (i = 0; i < BUFSIZE; i++)
msg[i] = in[ srcOffset + i ];
// Make some computation here, not complicated logic
// Copy from private to global memory
for (i = 0; i < BLOCKBYTES; i++)
out[ dstOffset + i ] = msg[i];
}
The code gave me an runtime error "Bus error". When I makes help printf in cycle which copy from global to private memory then see there the error occurs, every time in different iteration of i. When I reduce size of BUFSIZE to 262144 (256 KB) then the code runs fine. I tried to have only one work-item on one work-group. The *in points to memory area which have thousands KB of data. I suspect to limit of private memory, but then threw an error in the allocation of memory, not when copy.
Here is my OpenCL device query:
-
--------------------------------
Device Intel(R) Core(TM)2 Duo CPU P7550 # 2.26GHz
---------------------------------
CL_DEVICE_NAME: Intel(R) Core(TM)2 Duo CPU P7550 # 2.26GHz
CL_DEVICE_VENDOR: Intel
CL_DRIVER_VERSION: 1.0
CL_DEVICE_VERSION: OpenCL 1.0
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1 / 1 / 1
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1
CL_DEVICE_MAX_CLOCK_FREQUENCY: 2260 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1024 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1535 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: denorms INF-quietNaNs round-to-nearest
CL_DEVICE_IMAGE <dim> 2D_MAX_WIDTH 8192
2D_MAX_HEIGHT 8192
3D_MAX_WIDTH 2048
3D_MAX_HEIGHT 2048
3D_MAX_DEPTH 2048
CL_DEVICE_EXTENSIONS: cl_khr_fp64
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_byte_addressable_store
cl_APPLE_gl_sharing
cl_APPLE_SetMemObjectDestructor
cl_APPLE_ContextLoggingFunctions
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2

You use a variable msg with a size of 512kB. This variable should be in private memory. The private memory is not that big. This shouldn't work, as far as I know.
Why do you have the parameters srcOffsetand dstOffset? You do not use them.
I do not see more issues. Try to allocate local memory. Do you have a version of you code without optimization running? A version which just calculates in global memory?