I'm seriously considering training an NMT model using some domain data, and Custom Translator seems a good option.
But I'm wondering whether the model offered by Azure is an empty model (never trained with language data) or a pretrained general model (like Bing Translator)?
If it's the former, do I need to train the model with additional general domain data to achieve ideal results?
Thanks
Custom Translator enables customers to use their own domain-specific datasets to build a translation engine that speaks their domain terminology. Customizing with general domain data would not deliver the translation quality the customer desires.
If a customer does not have a domain dataset, it would be better to use our general domain (standard) models, like Bing Translate. For more info about Custom Translator, see https://techcommunity.microsoft.com/t5/azure-ai/customize-a-translation-to-make-sense-in-a-specific-context/ba-p/2811956
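Once a custom model is trained and deployed, it can be called through the regular Translator Text v3 REST API by passing its category ID. A minimal sketch, assuming the Python requests package; the key, region, and category values are placeholders:

```python
# Sketch: calling a deployed Custom Translator model via the Translator
# Text v3 API. Key, region, and category ID below are placeholders.
import requests

endpoint = "https://api.cognitive.microsofttranslator.com/translate"
params = {
    "api-version": "3.0",
    "from": "en",
    "to": "fr",
    "category": "YOUR_CATEGORY_ID",  # omit this to use the general model
}
headers = {
    "Ocp-Apim-Subscription-Key": "YOUR_KEY",
    "Ocp-Apim-Subscription-Region": "YOUR_REGION",
    "Content-Type": "application/json",
}
body = [{"text": "The tensile strength of the alloy exceeds specification."}]

response = requests.post(endpoint, params=params, headers=headers, json=body)
print(response.json()[0]["translations"][0]["text"])
```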
I'm on the S0 Tier for Azure Cognitive Speech services and am trying to train a custom voice for Japanese TTS. My data was successfully processed. But I wasn't able to select "Statistical Parametric" or "Concatenative" as my training method. "Neural" was the only option on the list.
[Screenshot: training method options for the Japanese model]
However, I was able to use those non-neural methods for English and Chinese projects.
[Screenshot: training method options for the English/Chinese models]
Does anyone know if I could still train a Japanese non-neural voice model? If so, how?
Thank you very much in advance.
Non-neural voice training has been deprecated. The standard/non-neural training tiers (adaptive, statistical parametric, concatenative) of Custom Voice are being deprecated. The announcement was sent to all existing Speech subscriptions before 2/28/2021. During the deprecation period (3/1/2021 - 2/29/2024), existing standard tier users can continue to use the non-neural models they have already created. All new users/new Speech resources should move to the neural tier/Custom Neural Voice. After 2/29/2024, all standard/non-neural custom voices will no longer be supported.
To answer your question about why you can train non-neural voices for English and Chinese projects but not for Japanese: we allow current users to re-train non-neural voices for their existing projects, but they cannot create new non-neural voices.
For users of non-neural voices, we have a short guide to help you migrate to neural training:
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-voice#migrate-to-custom-neural-voice
Sorry for the inconvenience.
I'm trying to figure out the best approach to interfacing with Watson Conversation from a multi-language website (English/French). The input to Watson Conversation will be fed from Watson STT, so the input should be in the appropriate language. Should I set up the intents and entities in both languages? That might potentially cause issues with words that are the same (or very similar) in both languages but have different meanings. My guess is that I'll need two separate Conversation workspaces but that seems like a lot of overhead (double work when anything changes). I've thought about using the Watson Language Translator in between STT and Conversation but I would think the risk with that approach could be a reduction in accuracy. Has anyone been able to do this?
You will need to set up a separate workspace for each language, since the language is set at the workspace level.
After STT, you would then pass the text through a language detection service to determine which workspace it should be directed to.
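A minimal sketch of that flow, assuming the ibm-watson Python SDK (the Conversation service is now called Assistant); the API key and workspace IDs are placeholders:

```python
# Sketch: route STT output to the Conversation/Assistant workspace
# matching its detected language. All credentials/IDs are placeholders.
from ibm_watson import AssistantV1, LanguageTranslatorV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

WORKSPACES = {"en": "EN_WORKSPACE_ID", "fr": "FR_WORKSPACE_ID"}

authenticator = IAMAuthenticator("YOUR_API_KEY")
assistant = AssistantV1(version="2021-06-14", authenticator=authenticator)
translator = LanguageTranslatorV3(version="2018-05-01", authenticator=authenticator)

def route_message(text):
    # Identify the most likely language of the transcribed text.
    langs = translator.identify(text).get_result()["languages"]
    lang = langs[0]["language"] if langs else "en"
    # Fall back to the English workspace for unsupported languages.
    workspace_id = WORKSPACES.get(lang, WORKSPACES["en"])
    return assistant.message(workspace_id=workspace_id,
                             input={"text": text}).get_result()
```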
Which Microsoft Cognitive Service (or Azure Machine Learning service?) would be the best and least work to use to solve the problem of finding similar articles, given an article? An article is a string of text, and assume I do not have user interaction data about the articles.
Is there anything in Microsoft Cognitive Services that can solve this problem out-of-the-box? It seems I cannot use the Recommendations API since I don't have interaction/user data.
Anthony
I am not sure the Text Analytics API is a good fit for this scenario, at least not yet.
There are really two types of similarities:
1. Surface similarity (lexical) – similarity by the presence of words/characters
If we are looking for surface similarity, try fuzzy matching/lookup (SQL Server Integration Services provides a component for this), or approximate string similarity functions (Jaro-Winkler distance, Levenshtein distance), etc. This would be easier, as it would not require you to create a custom machine learning model (see the sketch after this list).
2. Semantic similarity – Similarity by meaning of words
If we are looking for semantic similarity, then you need to go for semantic clustering, word embeddings, DSSM (Deep Semantic Similarity Model), etc. This is harder to do, as it would require you to train your own machine learning model based on an annotated corpus.
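For illustration, here is a minimal sketch of the surface-similarity option using Levenshtein edit distance, in plain Python with no ML model required:

```python
# Sketch: lexical similarity via Levenshtein edit distance.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # Normalize to [0, 1]; 1.0 means identical strings.
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest

print(similarity("cognitive services", "cognitive service"))  # ~0.94
```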
Luis Cabrera | Text Analytics Program Manager | Cloud AI Platform, Microsoft
Yes, you can use the Text Analytics API.
Examples are available here: https://www.microsoft.com/cognitive-services/en-us/text-analytics-api
I would suggest you use the Text Analytics API [1] as @Narasimha suggested. You would put your strings through the Topic Detection API, and then come up with a metric (say, Similarity = count(matching topics) - count(non-matching topics)) that could order each string against the others for similarity. This would just require one API call and a little JSON parsing.
[1] https://www.microsoft.com/cognitive-services/en-us/text-analytics-api
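As a sketch of that metric, assuming you have already parsed each article's detected topics out of the API's JSON response into Python sets (the article IDs and topics below are made-up placeholders):

```python
# Sketch: rank articles by the topic-overlap metric suggested above.
def topic_similarity(topics_a: set, topics_b: set) -> int:
    matching = len(topics_a & topics_b)
    non_matching = len(topics_a ^ topics_b)  # topics in exactly one article
    return matching - non_matching

article_topics = {
    "a1": {"cloud", "pricing", "azure"},
    "a2": {"cloud", "azure", "machine learning"},
    "a3": {"soccer", "world cup"},
}
query = article_topics["a1"]
ranked = sorted((k for k in article_topics if k != "a1"),
                key=lambda k: topic_similarity(query, article_topics[k]),
                reverse=True)
print(ranked)  # ['a2', 'a3']
```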
Sentence similarity or semantic textual similarity is a measure of how similar two pieces of text are, or to what degree they express the same meaning.
This Microsoft GitHub repo for NLP provides some samples which can be run from an Azure VM or Azure ML: https://github.com/microsoft/nlp/tree/master/examples/sentence_similarity
This folder contains examples and best practices, written in Jupyter notebooks, for building sentence similarity models. The gensen and pretrained embeddings utility scripts are used to speed up the model building process in the notebooks.
The sentence similarity scores can be used in a wide variety of applications, such as search/retrieval, nearest-neighbor or kernel-based classification methods, recommendations, and ranking tasks.
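For quick experimentation outside those notebooks, here is a minimal sketch using the sentence-transformers package (an assumption on my part; the repo itself relies on its own gensen/pretrained-embedding utilities):

```python
# Sketch: semantic sentence similarity via pretrained sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small pretrained model
sentences = ["How do I reset my password?",
             "I forgot my login credentials.",
             "The weather is nice today."]
embeddings = model.encode(sentences)

# Cosine similarity matrix: higher scores mean closer meaning.
scores = util.cos_sim(embeddings, embeddings)
print(scores[0])  # first sentence compared against all three
```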
For data normalisation of standard Tin Can verbs, is it best to use verbs from the Tin Can registry (https://registry.tincanapi.com/#home/verbs), e.g.
completed: http://activitystrea.ms/schema/1.0/complete
or to use the ADL verbs, like those defined:
- in the 1.0 spec at https://github.com/adlnet/xAPI-Spec/blob/master/xAPI.md
- in this article: http://tincanapi.com/2013/06/20/deep-dive-verb/
- and listed at https://github.com/RusticiSoftware/tin-can-verbs/tree/master/verbs
e.g.
completed: http://adlnet.gov/expapi/verbs/completed
I'm confused as to why those in the registry differ from every other example I can find. Is one of these out of date?
It really depends on which "profile" you want to target with your Statements. If you are trying to stick to e-learning practices that most closely resemble SCORM or some other standard, then the ADL verbs may be the most fitting. It is a very limited set, and really only the "voided" verb is provided for by the specification. The other verbs relate to those found in 0.9 and have become the de facto set, but aren't any more "standard" than any other URI. If you are targeting statements to be used in an Activity Streams way, specifically with a social application, then you may want to stick with their set. Note that there are verbs in the Registry that are neither ADL-coined nor provided by the Activity Streams specification.
If you aren't targeting any specific profile (or existing profile) then you should use the terms that best capture the experiences which you are trying to record. And we ask that you either coin those terms at our Registry so that they are well formed and publicly available, or if you coin them under a different domain then at least get them catalogued in our Registry so others may find them. Registering a particular term in one or more registries will hopefully help keep the list of terms from exploding as people search for reusable items. This will ultimately make reporting tools more interoperable with different content providers.
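For reference, a minimal sketch of a statement using the ADL "completed" verb discussed above, expressed as a Python dict (the actor and object values are placeholders):

```python
# Sketch: a minimal xAPI statement with the ADL "completed" verb.
# Actor mbox and activity ID are placeholders.
import json

statement = {
    "actor": {"mbox": "mailto:learner@example.com", "name": "Example Learner"},
    "verb": {
        "id": "http://adlnet.gov/expapi/verbs/completed",
        "display": {"en-US": "completed"},
    },
    "object": {
        "id": "http://example.com/courses/intro-101",
        "definition": {"name": {"en-US": "Intro 101"}},
    },
}
print(json.dumps(statement, indent=2))
```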
We are building a Form Builder (why reinvent the wheel and not use an existing one is not a discussion I want to have) for my company. Development is going well and I believe we have the right strategy for making it flexible enough and robust enough.
However, the problem lies in expectations. As project leader, it is my job to make sure expectations align with deliverable functionality, in fact project success depends on it, but I am having trouble defining what the form builder should be used for. I am concerned top management thinks of it as a one-size-fits-all solution, something I disagree with. I believe there are use cases for Form Builders and then there are use cases for explicit implementations, not all data should be stored in a dynamic form builder.
My question is: Is there a rule of thumb for determining what type of data should be implemented in a dynamic Form Builder and what should not? Or maybe not one but a set of rules.
For example, a purchase request might be a good fit for the form builder, but employee registration and attendance to company training sessions might not be since you'll most likely want to have that data readily available for querying and statistics.
Which types of forms should be implemented using dynamic form builders and which should have explicit static implementations in the database?
In my experience, form builders are useful for specific cases: when you want to create interfaces for experienced users, to facilitate the management of data. When designing applications that are concerned with the user's experience, form builders are most of the time not useful.