Which natural languages does Google Cloud DLP support? - google-cloud-dlp

I'm considering using Cloud DLP to help me anonymize my data. However, I can't find any explicit mention of which languages are supported. AWS Comprehend's PII detection API only supports English, so I'm looking for an alternative.

On the detectors reference page you can find the detectors listed per country:
https://cloud.google.com/dlp/docs/infotypes-reference
For global detectors such as PHONE_NUMBER there is no information about the languages supported, but you can test the support for your language on the demo page:
https://cloud.google.com/dlp/demo/#!/
For example, if you write in Spanish "Mi teléfono es 600111222" (my phone is 600111222), it detects a PHONE_NUMBER with LIKELY likelihood, but if you write "Me puedes llamar al 600111222" (you can call me at 600111222), it detects only a GENERIC_ID with LOW likelihood.
Also, if you add the country prefix (+34600111222) to the previous examples, the likelihood of the first one increases to VERY_LIKELY, and the second one is detected as a PHONE_NUMBER with POSSIBLE likelihood.
In summary, it works with other languages and uses the context to improve the matches, but you should try some samples of your own to check the accuracy for your specific use case.
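To run the same kind of experiment outside the demo page, the sample sentences can be sent to the DLP REST method projects.content.inspect. The following is only a minimal sketch that builds the JSON request bodies; the helper name build_inspect_request and the sample list are made up for illustration, and actually sending the request (authentication, project ID, HTTP call) is left out.

```python
import json


def build_inspect_request(text, info_types=("PHONE_NUMBER",), min_likelihood="POSSIBLE"):
    """Build a JSON body for the DLP content:inspect REST method.

    Hypothetical helper: it only assembles the request, it does not call the API.
    """
    return {
        "item": {"value": text},
        "inspectConfig": {
            "infoTypes": [{"name": name} for name in info_types],
            "minLikelihood": min_likelihood,
            # Ask DLP to echo back the matched substring in each finding.
            "includeQuote": True,
        },
    }


# The Spanish sentences from the answer above, with and without country prefix.
SAMPLES = [
    "Mi teléfono es 600111222",
    "Me puedes llamar al 600111222",
    "Mi teléfono es +34600111222",
    "Me puedes llamar al +34600111222",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(json.dumps(build_inspect_request(sample), ensure_ascii=False))
```

Comparing the likelihood returned for each body is a quick way to see how much the surrounding words matter in your language.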


Cognitive Services Translation and Profanity Filtering

Issue Description
I use the Cognitive Services TranslateArray endpoint to translate my users' comments. One of the advantages of this service is that we can use ProfanityAction to mark every profane word in the destination language. I also make use of the automatic language detection, so that I do not have to identify the language of the content before sending it in.
When I get a translation back for a destination language that matches the source language, the profanity is not marked. Is there another endpoint I could or should hit, a parameter I do not know about, or a possible improvement to the service?
Corresponding Documentation
Follow the Cognitive Services protocol to hit the TranslateArray endpoint with an English sentence containing profanities and with ProfanityAction set to Marked: http://docs.microsofttranslator.com/text-translate.html#!/default/post_TranslateArray
Reproduction Steps
Send an English sentence with profanities
Translate to fr and notice the correctly marked profanities
Translate to en and notice the missing profanity tags
Expected Behavior
Profanities should be marked even if no translation occurred.
Actual Results
I obtained the unmodified sentence back.
There is nothing in the documentation that specifies what happens if the source and target language are the same. My guess is that if it sees that they match it will do nothing.
However, there is a specific API that detects profanity for any given language: Content Moderation for Text. The API docs are here.
The Text - Screen function does it all – scans the incoming text (maximum 1024 characters) for profanity, autocorrects text, and extracts Personally Identifiable Information (PII), all while matching against custom lists of terms.
Your observation that the Translator API does nothing if the source and target languages are the same is correct. Not an answer, just a clarification.
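Until a moderation API is wired in, one possible stopgap is to mark profanity yourself whenever the detected source language equals the target language. The sketch below is not the Translator or Content Moderator API; the function names, the translate callback and the term list are all hypothetical, and the profanity tag format is assumed to mirror what the Marked action produces.

```python
import re


def mark_profanity(text, terms):
    """Wrap every whole-word occurrence of a listed term in <profanity> tags."""
    if not terms:
        return text
    # Longest terms first so that longer entries win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "<profanity>" + m.group(0) + "</profanity>", text)


def translate_or_mark(text, source_lang, target_lang, translate, terms):
    """Call the translation service only when a translation is actually needed.

    `translate` is a caller-supplied function (hypothetical) that wraps the
    real service call with ProfanityAction set to Marked.
    """
    if source_lang == target_lang:
        # The service returns the text untouched in this case, so mark locally.
        return mark_profanity(text, terms)
    return translate(text, source_lang, target_lang)
```

A custom term list will never match the service's own dictionary exactly, so the marked output for same-language text may differ from what a real translation pass would flag.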

Can Google Cloud Vision generate labels in Spanish via its API?

Say that I have images and I want to generate labels for them in Spanish. Does the Google Cloud Vision API allow selecting the language in which the labels are returned?
Label Detection
Google Cloud Vision APIs do not allow configuring the result language for label detection. You will need to pass the returned labels through a different API, such as the Cloud Translation API, to get them in Spanish.
OCR (Text detection)
If you're interested in text detection in your image, Google Cloud Vision APIs support Optical Character Recognition (OCR) with automatic language detection in a broad set of languages listed here.
For TEXT_DETECTION and DOCUMENT_TEXT_DETECTION requests, you can provide the languageHints parameter in the request to get better results in cases where the language is unknown and/or not easily detectable.
languageHints[]
string
List of languages to use for TEXT_DETECTION. In most cases, an empty
value yields the best results since it enables automatic language
detection. For languages based on the Latin alphabet, setting
languageHints is not needed. In rare cases, when the language of the
text in the image is known, setting a hint will help get better
results (although it will be a significant hindrance if the hint is
wrong). Text detection returns an error if one or more of the
specified languages is not one of the supported languages.
The DetectedLanguage information is available in the response and identifies the language along with a confidence value.
Detected language for a structural component.
JSON representation
{
"languageCode": string,
"confidence": number,
}
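As a rough illustration of where languageHints goes, here is a sketch that builds the JSON body for the Vision images:annotate REST method. The helper name build_text_detection_request is made up for this example, and sending the request (credentials, HTTP call) is not shown.

```python
import base64


def build_text_detection_request(image_bytes, language_hints=None):
    """Build a JSON body for a TEXT_DETECTION call to images:annotate.

    Hypothetical helper: it only assembles the request body.
    """
    request = {
        # The REST API expects the raw image bytes base64-encoded.
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        "features": [{"type": "TEXT_DETECTION"}],
    }
    if language_hints:
        # Omit imageContext entirely to keep automatic language detection.
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}
```

Calling it without hints leaves automatic detection on, which the quoted documentation says is the better default in most cases; passing something like ["es"] only makes sense when the language is known in advance.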

Finding the number of common users between two websites

There are two Swiss (.ch) websites, let's call them A and B. A is owned by me and B by a customer.
Because of legal data protection issues B is hosted in Switzerland and not allowed to store any user information abroad. Which means that software like Google Analytics is not available on B. A is a Swiss website but hosted in a (European) cloud.
Now we would like to find out how many common users we both have over the duration of 30 days. In short:
|usersA ∩ usersB|
For the sake of simplicity: Instead of users we are perfectly happy to measure common browsers.
What would you suggest is the simplest way to solve this problem?
First of all, best regards from Zurich/Zug :) Swiss people are everywhere...
I don't think it's correct that collecting data in Switzerland (or storing it abroad) is illegal across the board. As I'm working in the financial industry I know this topic very well, and we also had to do a lot of research before we could use GA at all.
It always comes down to what data you collect and how. What you can't do, unless you obtained the user's permission up front, is store personally identifiable information. GA doesn't allow that anyway: you can't, for example, import or save email addresses in custom dimensions/metrics.
Please check https://support.google.com/adsense/answer/6156630?hl=en as general basic information about this topic.
If you store IP addresses only with IP anonymization enabled, you shouldn't run into problems as long as you declare this in your data-privacy statement. Take this approach: https://support.google.com/analytics/answer/2763052?hl=en
I'm not a lawyer and don't want to give you legal advice, but our lawyers told us that's fine. If you are really paranoid about sending data to the USA, like we have to be, you can exclude very sensitive forms from your tracking.
To go back to your basic question, if you want to find this out via Google Analytics, your key is "cross domain tracking". Check https://support.google.com/analytics/answer/1034342?hl=en for more information in this direction.
The only other workaround I can think of is to start collecting browser fingerprints yourself on both sites and then join the two collections on the fingerprints (that's not fully reliable, as your visitors will use more than one device/configuration). Personally I would go for IP anonymization, exclude very sensitive forms, make sure your data-privacy declaration contains all the necessary parts, and offer an opt-out option; then you should be on the safe side.
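The fingerprint workaround could look roughly like this: each site turns its fingerprints into keyed hashes with a secret shared between A and B, only the hashes are exchanged, and the overlap is the size of the set intersection. This is a sketch under those assumptions; the function names and the idea of a shared key are illustrative, and whether exchanging keyed hashes is acceptable legally is something to clear with your lawyers.

```python
import hashlib
import hmac


def pseudonymize(fingerprints, shared_key):
    """Map raw fingerprint strings to keyed SHA-256 digests.

    Both sites must use the same shared_key (bytes) so equal fingerprints
    produce equal digests; the raw fingerprints never leave each site.
    """
    return {
        hmac.new(shared_key, fp.encode("utf-8"), hashlib.sha256).hexdigest()
        for fp in fingerprints
    }


def count_common(digests_a, digests_b):
    """Number of browsers seen on both sites: the size of the intersection."""
    return len(digests_a & digests_b)
```

Each site would run pseudonymize over the fingerprints collected during the 30 days, and either side can then call count_common on the two digest sets.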
All the best and TGIF :)

Get current MCC/MNC from mobile phone number

I would like to know how I would go about finding the 'current' MNC for a UK mobile phone number.
I have given a collection of numbers to several companies, and they returned the "original MCC/MNC" and the "current MCC/MNC" codes; all checked out fine.
I would like to know how this was done in the first place. It's easy to find the original MCC/MNC codes, but I'm having trouble with the current MCC/MNC.
To obtain the current MCC / MNC (or NWC - Mobile Network Code) for a mobile number you can take a number of approaches. Based on your comment I'm also going to include a little background information.
Getting the original MCC / MNC
This is relatively easy as long as you have a reliable source of data. The MCC, the Mobile Country Code designated by the ITU-T, is simple to deal with because MCCs don't change all that often. MNCs are a little more tricky because they can change over time. The ITU-T also distributes these allocations and regularly publishes updates, or should I say the GSMA does.
Getting current MCC / MNC
Here you have a number of factors to consider. One of them you have already mentioned. Here are some more possibilities:
Mobile porting - Transfer of a mobile phone number from one operator to another
Roaming - The mobile phone number is currently registered on a "foreign" mobile network
Both of these factors mean that just using the mobile phone number is not an option for finding out the current MCC / MNC. It is really a question of how accurate you need the information to be. And of course how much money you want to spend finding it out.
So finally to the original question. The short answer is no, you do not have to be a member of the ITU to have access to this information. The long answer is that you need access either to the ITU publications or to one of the other sources. As I recall, the following are ways of obtaining the information you need:
The GSMA (GSM Association) regularly publishes updates to NWCs (Mobile Network Codes) in document form, together with the numbering schemes for every country using GSM networks.
Neustar (http://www.neustar.biz) provides an API which you can query for the currently registered (non-roaming) mobile phone numbers. They also provide portability information which is updated at various rates depending on country and operator. Effectively they are the root of all portability information for the GSMA.
Some mobile operators, for example Deutsche Telekom in Germany, provide an API to obtain daily updated portability information for the whole of Germany.
Companies with SS7 connectivity (basically the GSM cloud where mobile operators interoperate) can query a mobile phone number's current network registration in real time. This also includes whether the mobile phone number is roaming or not.
This information is priceless for many companies, and the GSMA rightly ensures that only companies and people who can responsibly manage this information are allowed to obtain it.
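Whichever data source you end up with, the lookup logic tends to have the same shape: the original MCC/MNC comes from the number-range allocation, and a porting record, if present, overrides it. The sketch below illustrates only that ordering. The two tables are hypothetical placeholders (the number range is the UK range reserved for fictional use and the MNC values are made up), and roaming is not covered because it needs a live network query.

```python
# Hypothetical placeholder data: number-range prefix -> (MCC, MNC) as allocated.
PREFIX_ALLOCATION = {
    "447700900": ("234", "91"),
}

# Hypothetical placeholder data: full number -> (MCC, MNC) after porting.
PORTED_NUMBERS = {
    "447700900123": ("234", "92"),
}


def original_mcc_mnc(msisdn, allocation):
    """Longest-prefix match of the number against the range allocation table."""
    for length in range(len(msisdn), 0, -1):
        hit = allocation.get(msisdn[:length])
        if hit is not None:
            return hit
    return None


def current_mcc_mnc(msisdn, allocation, ported):
    """A porting record wins; otherwise fall back to the original allocation."""
    if msisdn in ported:
        return ported[msisdn]
    return original_mcc_mnc(msisdn, allocation)
```

How fresh the result is depends entirely on how often the porting table is refreshed from whichever provider you use.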
You can use this nuget package.
Sample code:
var IsViablePhoneNumber = PhoneNumberUtil.IsViablePhoneNumber("989123456789");
var MCC_MNC = PhoneNumberUtil.GetMCCMNC("989123456789");
var Operator = PhoneNumberUtil.GetOperator("989123456789");
var Brand = PhoneNumberUtil.GetBrand("989123456789");
var OperatorStatus = PhoneNumberUtil.GetOperatorStatus(232, 10);
var OperatorType = PhoneNumberUtil.GetOperatorType(232, 10);

Travel APIs how to integrate them all?

I may start working on a project very similar to Hipmunk.com, which pulls hotel price information by calling different APIs (Expedia, Orbitz, Travelocity, hotels.com, etc.).
I did some research on this, but I am not able to find any unique hotel ID or any other field to match hotels across the different APIs. Does anyone have experience with how to match a hotel from Expedia with the same hotel from Orbitz, Travelocity, etc.?
Thanks
EDIT: Google is also doing the same thing: http://www.google.com/hotelfinder/
From what I have seen of GDS systems and these APIs, there is rarely a unique identifier shared between systems for things like hotels.
Airports, airlines and countries have unique ISO identifiers: http://www.iso-code.com/airports.2.html
I would guess you are going to have to maintain your own internal mapping to identify and disambiguate the properties.
When you get started with hotel APIs, the choice of free ones isn't really that big, see e.g. here for an overview.
The most extensive and accessible one is Expedia's EAN http://developer.ean.com/ which includes Sabre and Venere with unique IDs, but each is still structured differently.
That is, you are looking at different database tables.
You do get several identifiers, such as name, address, and coordinates, which can serve for unique identification, assuming they are free of errors. Which is an assumption.
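Because names and coordinates are rarely error-free, matching usually has to be fuzzy rather than exact. The following sketch treats two records as the same hotel when they are close together and their normalized names are similar. The record layout (name, lat, lon keys) and both thresholds are assumptions chosen for illustration, not values from any of the APIs mentioned.

```python
import math
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace anything but a-z, 0-9 and spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", name.lower()).split())


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def same_hotel(a, b, max_km=0.2, min_name_ratio=0.8):
    """Heuristic match: both geographically close and similarly named."""
    close = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return close and ratio >= min_name_ratio
```

Pairs that pass the check can be written into the internal mapping table; borderline pairs are probably best reviewed by hand, since chains often have several similarly named properties in one city.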
