Speaker Diarization returns "Speaker 1" for both users instead of Speaker 1 and Speaker 2 when both speakers' audio is captured on Mobile - azure-cognitive-services

I am trying to get diarization in my speech-to-text output from a mono WAV file.
I need "Speaker 1" and "Speaker 2" differentiated in a WAV file that contains two voices.
But the transcription always shows Speaker 1 for both speakers. This happens when the audio of both speakers is recorded through the Mobile interface. If one speaker records from Web and the other from the Mobile interface, we get correct diarization with Speaker 1 and Speaker 2.
I am using Azure Batch Transcription to get speech to text from the mono WAV file.
Azure Batch Transcription SDK sample

We did get a correct result from one very clear recording made with the same Mobile interface.
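For reference, diarization on mono audio has to be requested explicitly when the batch transcription job is created. Below is a minimal sketch of the request body, assuming the Speech-to-Text v3 batch REST API; the content URL and display name are placeholders, and the exact property set should be checked against the current Azure documentation.

```python
import json

def build_transcription_request(content_url):
    """Build the JSON body for creating a batch transcription job
    with diarization enabled (Speech-to-Text v3 batch REST API)."""
    return {
        "contentUrls": [content_url],     # SAS URL of the mono WAV file
        "locale": "en-US",
        "displayName": "diarization test",
        "properties": {
            # Required for speaker separation on mono audio
            "diarizationEnabled": True,
            "wordLevelTimestampsEnabled": True,
        },
    }

body = build_transcription_request("https://example.com/two-speakers.wav")
print(json.dumps(body, indent=2))
```

This body would be POSTed to the service's transcriptions endpoint with your subscription key; the resulting transcript labels each recognized phrase with a speaker number.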

Related

How to read NVDA Speech Viewer's text output stream using the Python API

I am making a desktop app which requires listening to the continuous text output stream of NVDA's Speech Viewer. How can I do this with their Python API?
# desired: an event which is triggered when a new
# text line appears in NVDA Speech Viewer
def on_new_text_line(text):
    do_something(text)
This might help:
https://github.com/prut-h7/NVDA_Speech_Broadcaster
It broadcasts the text stream via UDP, so you can simply create a receiver on that port to capture the text output. It is platform independent, so you can use any preferred language to capture the output. And suggest any changes/updates if needed!
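A receiver for such a UDP broadcast is only a few lines of standard-library Python. This is a sketch under the assumption that the broadcaster sends each speech line as a UTF-8 text datagram; the port number here is a placeholder and must match the broadcaster's configuration.

```python
import socket

def receive_speech_lines(port=50711, host="127.0.0.1"):
    """Listen for UTF-8 text datagrams (e.g. from an NVDA speech
    broadcaster) and yield each line as it arrives.
    NOTE: the port is an assumption; set it to whatever the
    broadcaster actually uses."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    try:
        while True:
            data, _addr = sock.recvfrom(4096)
            yield data.decode("utf-8", errors="replace")
    finally:
        sock.close()
```

Each yielded string can then be handed to whatever `do_something(text)` callback the app defines.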

Getting the name of the connected printer to print the POS receipt in C#

I know there is a way to get all available printers in C# using System.Drawing.Printing.PrinterSettings.InstalledPrinters. But how do I get the name of the printer which is attached to, say, a USB port? Do we need to hard-code the name of the printer? Because, if so, we'll have to change the printer name in the code whenever the application runs on a different system, which might use a different printer.
How do real POS systems do this? How could I generalize the printing functionality of the application? I am using the library ESC-POS-USB-.NET.
The API will be for standard Windows desktop printers.
The main function is graphic printing on cut sheet paper.
Most receipt printers for thermal roll paper used in POS are connected to a serial port or a network and print using ESC/POS commands, which are mainly character-oriented.
There are many ways to send those commands directly over the serial port.
However, there is also a standardized API that abstracts the functions of a POS printer.
The specification is called UnifiedPOS.
Actual implementations include OPOS, POS for .NET, JavaPOS, Windows.Devices.PointOfService, etc.
Please search based on those keywords.

Arduino switch control - three outputs

I want to play mp3 using Arduino.
Is it possible to change the speaker channel for each audio file?
1.mp3 is played on speaker A, 2.mp3 on speaker B, and 3.mp3 on speaker C.
I want them to play sequentially, not simultaneously.
Please refer to the photo.
https://ibb.co/b6zKdH1
I have configured it with a 4-channel relay module, but I do not know how to wire the part marked with a question mark.
Please refer to the photo.
https://ibb.co/TgWq7Cr

What should be the maximum audio file length (duration) sent to the Bing Speech to Text API?

I have referred to this documentation.
It mentions that when using the client libraries for speech to text, "the long audio stream (up to 10 minutes)" is supported.
Does speech to text accept audio files longer than 10 minutes?
What will happen if we pass an audio file longer than 10 minutes?
In my use case I need to pass audio files longer than 30 minutes, so what should we do in these situations?
You can split your longer audio streams programmatically using ffmpeg and pass those chunks to this client library. You can check this to programmatically divide long audio streams into time-specified chunks: https://superuser.com/questions/525210/splitting-an-audio-file-into-chunks-of-a-specified-length.
You can then combine the text from these chunks to get the entire transcript back. Not the cleanest of ways, but something that will scale.
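The splitting step can be scripted rather than done by hand. The sketch below builds an ffmpeg command using its segment muxer (stream copy, so no re-encoding); it assumes ffmpeg is on the PATH, and the chunk length and output pattern are just illustrative defaults.

```python
import subprocess

def ffmpeg_split_command(src, chunk_seconds=600, pattern="chunk_%03d.wav"):
    """Build an ffmpeg command that splits `src` into chunks of
    `chunk_seconds` each, using the segment muxer with stream copy."""
    return [
        "ffmpeg", "-i", src,
        "-f", "segment",
        "-segment_time", str(chunk_seconds),
        "-c", "copy",
        pattern,
    ]

cmd = ffmpeg_split_command("long_audio.wav", 600)
print(" ".join(cmd))
# To actually run it (requires ffmpeg installed):
# subprocess.run(cmd, check=True)
```

Each resulting `chunk_*.wav` file stays under the 10-minute limit and can be submitted to the client library in turn.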

Output words in speaker with arduino

I want to generate voice on an Arduino using code. I can generate simple tones and music on the Arduino, but I need to output words like "right", "left", etc. through an Arduino speaker. I found some methods using WAV files, but they require an external memory card reader. Is there a method using only the Arduino and a speaker?
Typical recorded sound (such as WAV files) requires much larger amounts of memory than is available on-chip on an Arduino.
It is possible to use an encoding and data rate that minimise the memory requirement, at the expense of audio quality. For example, generally acceptable speech-band audio can be obtained using non-linear (companded) 8-bit PCM at a 3 kHz sample rate. If that is differentially encoded to 4-bit samples (so that each sample is not the PCM code but the difference in level from the previous sample), you can get about 1 second of audio in 1.5 kB. You would have to do some off-line processing of the original audio to encode it in this manner before storing the resulting data in the Arduino's flash memory. You would also have to implement the necessary decoding and linearisation on the Arduino.
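The differential idea can be illustrated with a few lines of off-line processing code. This is a simplified sketch that skips the companding step and just shows the 4-bit delta encoding: each sample is stored as a clamped difference from the previous one, so two samples fit in one byte.

```python
def delta_encode(samples):
    """Encode integer samples as 4-bit signed deltas (-8..7),
    clamping when the signal moves faster than a delta can track."""
    out, prev = [], 0
    for s in samples:
        d = max(-8, min(7, s - prev))
        out.append(d)
        prev += d          # track what the decoder will reconstruct
    return out

def delta_decode(deltas):
    """Reconstruct the (approximate) samples from the deltas."""
    samples, prev = [], 0
    for d in deltas:
        prev += d
        samples.append(prev)
    return samples
```

A slowly varying signal round-trips exactly; a fast transient is slew-limited by the clamp, which is the quality cost the answer mentions. Packing pairs of deltas into bytes halves the storage versus 8-bit PCM.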
Another possibility is to use synthesised rather than recorded speech. This technique uses recorded phonemes (components of speech) rather than whole words, and you build words from these components. The results are generally somewhat robotic and unnatural (modern speech synthesis can in fact be very convincing, but not with the resources available on an Arduino - think 1980s Speak & Spell).
Although it can be rather efficient, phoneme speech synthesis requires different phoneme sets for different natural languages. For a limited vocabulary, it may be possible to encode only the subset of phonemes actually used.
You can hear a recording of the kind of speech a simple phoneme generator can produce at http://nsd.dyndns.org/speech/. That page discusses a 1980s GI-SP0256 speech chip driven by an Arduino, rather than speech generated by the Arduino itself, but it gives you an idea of what might be achieved - the GI-SP0256 managed with just 2 KB of ROM, so the Arduino could probably implement something similar directly. The difficulty perhaps is in obtaining the necessary phoneme set; you could possibly record your own and encode them as above. Each word or phrase would then simply be a list of phonemes and delays to be output.
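The "list of phonemes and delays" structure might look like the following sketch. The phoneme codes here are entirely hypothetical placeholders; a real table would use whatever identifiers your chosen phoneme set defines.

```python
# Hypothetical phoneme codes and trailing delays in milliseconds;
# a real table would come from the phoneme set you record or adopt.
WORDS = {
    "left":  [("L", 0), ("EH", 0), ("F", 0), ("T", 80)],
    "right": [("R", 0), ("AY", 0), ("T", 80)],
}

def render(word, play_phoneme, pause_ms):
    """Speak a word by emitting its phoneme sequence, pausing
    after any phoneme that carries a nonzero delay."""
    for phoneme, delay in WORDS[word]:
        play_phoneme(phoneme)
        if delay:
            pause_ms(delay)
```

On the Arduino, `play_phoneme` would replay the stored (delta-encoded) phoneme sample and `pause_ms` would be a delay call; the word table itself costs only a few bytes per word.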
The eSpeak project might be a good place to start. It is probably too large for an Arduino, and the whole text-to-speech translation is unnecessary, but it converts text to phonemes, so you could do that part off-line (on a PC), then load the phonemes and the replay code onto the Arduino. It may, of course, still be too large.