How to de-identify data present in a BigQuery table - google-cloud-dlp

How can I de-identify data already present in a BigQuery table, then re-identify the same data and load it into another BQ table?
Thanks

The easiest way to do this is to use a Dataflow pipeline that calls deidentifyContent.

I created an example using the "de-identifying sensitive data" guide from the Data Loss Prevention (DLP) API: it de-identifies the data with the transformations below and then inserts the results into BigQuery. To re-identify the data afterwards, however, you need cryptographic tokens. To achieve this, you can use the cryptographic methods that Cloud DLP supports:
Deterministic encryption using AES-SIV
Format preserving encryption
Cryptographic hashing
This post mentions the keys and steps needed to perform the transformation with deterministic encryption, but you can use whichever method you prefer; a minimal sketch of the deterministic option follows.
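For illustration, here is a minimal sketch (my own, not the linked post's exact configuration) of a deidentify_config that uses deterministic encryption via crypto_deterministic_config. The wrapped key bytes and the KMS key name are placeholders you must create yourself, and the surrogate info type is what a later reidentify_content call uses to find and reverse the tokens:
# Sketch only: wrapped_key and crypto_key_name are placeholders.
crypto_key = {
    "kms_wrapped": {
        "wrapped_key": b"...",  # your data key, wrapped with Cloud KMS
        "crypto_key_name": (
            "projects/<project>/locations/global/"
            "keyRings/<key-ring>/cryptoKeys/<key>"
        ),
    }
}

deidentify_config = {
    "info_type_transformations": {
        "transformations": [{
            "primitive_transformation": {
                "crypto_deterministic_config": {
                    "crypto_key": crypto_key,
                    # Output looks like e.g. EMAIL_TOKEN(52):AbC..., which
                    # reidentify_content can match and decrypt later.
                    "surrogate_info_type": {"name": "EMAIL_TOKEN"},
                }
            }
        }]
    }
}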
As answered above, the easiest way to do this end to end is to follow the Dataflow pipeline tutorial.
Last but not least, I share with you an example of how to de-identify by replacement; the data is generated with the Faker library to simulate your BigQuery data.
from faker import Faker
from google.cloud import dlp_v2

fake = Faker()
dlp = dlp_v2.DlpServiceClient()


def create_fake_data(data_length=5):
    # Build a DLP Table item filled with fake PII to simulate BigQuery rows.
    data = []
    headers = [
        {"name": "name"}, {"name": "email"},
        {"name": "credit_card"}, {"name": "credit_card_provider"},
        {"name": "phone"},
    ]
    for _ in range(data_length):
        data.append({"values": [
            {"string_value": fake.unique.first_name()},
            {"string_value": fake.free_email()},
            {"string_value": fake.unique.credit_card_number()},
            {"string_value": fake.credit_card_provider()},
            {"string_value": fake.phone_number()},
        ]})
    return {"table": {"headers": headers, "rows": data}}


def deidentify_with_replace(item):
    # Replace every value detected by the infoTypes below with a fixed string.
    parent = "projects/julio-castor-mx"
    inspect_config = {"info_types": [
        {"name": "EMAIL_ADDRESS"},
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "PHONE_NUMBER"},
    ]}
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "replace_config": {
                            "new_value": {"string_value": "[IDENTIFIED]"},
                        }
                    }
                }
            ]
        }
    }
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": item,
        }
    )
    return response.item


if __name__ == "__main__":
    data = create_fake_data()
    deidentify_data = deidentify_with_replace(data)
Please consider that the infoTypes parameter defines the DLP detectors used for the de-identification. These detectors are listed here; to de-identify a table, use the Table object.
You can see more examples in this repository.
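Finally, to load the de-identified output into another BigQuery table, one option is to turn the returned Table item back into row dicts and stream them in with the BigQuery client. A minimal sketch, assuming a destination table "my-project.my_dataset.deidentified" (a placeholder) that already exists with matching column names:
from google.cloud import bigquery

bq = bigquery.Client()
table = deidentify_data.table  # the Table returned by deidentify_with_replace
rows = [
    {header.name: cell.string_value
     for header, cell in zip(table.headers, row.values)}
    for row in table.rows
]
# Stream the rows into the destination table (placeholder name).
errors = bq.insert_rows_json("my-project.my_dataset.deidentified", rows)
assert not errors, errors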

Related

Adobe Analytics 2.0 API endpoint to get report suite events, props, and evars

I'm having a hard time finding a way in the 2.0 API to get a list of eVars, props, and events for a given report suite. The 1.4 version has the reportSuite.getEvents() endpoint, and similar ones for eVars and props.
Please let me know if there is a way to get the same data using the 2.0 API endpoints.
The API v2.0 github docs aren't terribly useful, but the Swagger UI is a bit more helpful: it shows the endpoints and the parameters you can push to them, and you can interact with it (logging in with your OAuth creds) and see requests/responses.
The two API endpoints in particular you want are metrics and dimensions. There are a number of options you can specify, but to just get a dump of them all, the full endpoint URL for those would be:
https://analytics.adobe.io/api/[client id]/[endpoint]?rsid=[report suite id]
Where:
[client id] - The client id for your company. This should be the same value as the legacy username:companyid (the companyid part) from the v1.3/v1.4 API shared secret credentials, except that it is suffixed with "0", e.g. if your old username:companyid was "crayonviolent:foocompany", the [client id] would be "foocompany0", because.. reasons? I'm not sure what that's about, but it is what it is.
[endpoint] - Use "metrics" to get the events and "dimensions" to get the props and eVars, so you will need to make two API endpoint requests.
[rsid] - The report suite id you want to get the list of events/props/eVars from.
Example:
https://analytics.adobe.io/api/foocompany0/metrics?rsid=fooglobal
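If you are scripting this outside the browser, the two calls are straightforward. A hedged sketch in Python, assuming the 2.0 API's standard auth headers; the access token, client id, and company/report suite ids are all placeholders here:
import requests

BASE = "https://analytics.adobe.io/api/foocompany0"
headers = {
    "Authorization": "Bearer <access_token>",    # OAuth/JWT access token
    "x-api-key": "<client_id>",                  # your API client id
    "x-proxy-global-company-id": "foocompany0",  # same id as in the URL
}

# One request for events (metrics), one for props/eVars (dimensions).
metrics = requests.get(BASE + "/metrics", headers=headers,
                       params={"rsid": "fooglobal"}).json()
dimensions = requests.get(BASE + "/dimensions", headers=headers,
                          params={"rsid": "fooglobal"}).json()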
One thing to note about the responses: they aren't like the v1.3 or v1.4 methods, where you query for a list of only those specific things. Each call returns a JSON array of objects for every single metric or dimension respectively, even the native ones, calculated metrics, classifications for a given dimension, etc. AFAIK there is no baked-in way to filter the API query (in any documentation I can find, anyway), so you will have to loop through the array and select the relevant ones yourself.
I don't know what language you are using, but here is a javascript example for what I basically do:
var i, l, v, data = { prop: [], evar: [], events: [] };

// dimensionsList - the JSON object returned from dimensions API call
// for each dimension in the list..
for (i = 0, l = dimensionsList.length; i < l; i++) {
    // The .id property shows the dimension id to eval
    if (dimensionsList[i].id) {
        // the ones we care about are e.g. "variables/prop1" or "variables/evar1"
        // note that if you have classifications on a prop or eVar, there are entries
        // that look like e.g. "variables/prop1.1" so the regex is written to ignore those
        v = ('' + dimensionsList[i].id).match(/^variables\/(prop|evar)[0-9]+$/);
        // if id matches what we're looking for, push it to our data.prop or data.evar array
        v && v[1] && data[v[1]].push(dimensionsList[i]);
    }
}

// metricsList - the JSON object returned from metrics API call
// basically the same song and dance as above, but for events.
for (i = 0, l = metricsList.length; i < l; i++) {
    if (metricsList[i].id) {
        // event ids look like e.g. "metrics/event1"
        v = ('' + metricsList[i].id).match(/^metrics\/event[0-9]+$/);
        v && data.events.push(metricsList[i]);
    }
}
And then the resulting data object will have data.prop, data.evar, and data.events, each an array of the respective props/eVars/events.
Example object entry for a data.events[n]:
{
  "id": "metrics/event1",
  "title": "(e1) Some event",
  "name": "(e1) Some event",
  "type": "int",
  "extraTitleInfo": "event1",
  "category": "Conversion",
  "support": ["oberon", "dataWarehouse"],
  "allocation": true,
  "precision": 0,
  "calculated": false,
  "segmentable": true,
  "supportsDataGovernance": true,
  "polarity": "positive"
}
Example object entry for a data.evar[n]:
{
  "id": "variables/evar1",
  "title": "(v1) Some eVar",
  "name": "(v1) Some eVar",
  "type": "string",
  "category": "Conversion",
  "support": ["oberon", "dataWarehouse"],
  "pathable": false,
  "extraTitleInfo": "evar1",
  "segmentable": true,
  "reportable": ["oberon"],
  "supportsDataGovernance": true
}
Example object entry for a data.prop[n]:
{
  "id": "variables/prop1",
  "title": "(c1) Some prop",
  "name": "(c1) Some prop",
  "type": "string",
  "category": "Content",
  "support": ["oberon", "dataWarehouse"],
  "pathable": true,
  "extraTitleInfo": "prop1",
  "segmentable": true,
  "reportable": ["oberon"],
  "supportsDataGovernance": true
}

In the HERE API Geocoder, what are the different types of View?

I am moving from Google to the HERE API geocoding service. In the response JSON object there is a property View, which is an array. I was not able to find in the documentation what types of objects might be in this collection. It seems that there is a _type property for every View, and in all my testing I got only one object inside this array, always of the same type: "_type": "SearchResultsViewType". My questions are:
Is it possible for the View collection to contain more than one object? Would they be of the same "_type" or of different "_type"s? In short, besides probably ViewId, how are different View objects distinguished?
I made numerous requests to the Geocoder API, switching different parameters on and off. I have searched here and in the API documentation.
This is how the Response object looks now:
{
  "Response": {
    "MetaInfo": {
      "Timestamp": "2019-04-18T11:48:31.306+0000"
    },
    "View": [
      {
        "_type": "SearchResultsViewType",
        "ViewId": 0,
        "Result": .....

JSON path not working properly with Athena

I have a lambda function that converts my logs to this format:
{
  "events": [
    {
      "field1": "value",
      "field2": "value",
      "field3": "value"
    }, (...)
  ]
}
When I query it on S3, I get it in this format:
[
  {
    "events": [
      { (...) }
    ]
  }
]
And I'm trying to run a custom classifier for it, because the data I want is inside the objects held by 'events', not in events itself.
So I started with the simplest path I could think of that worked in my tests (https://jsonpath.curiousconcept.com/):
$.events[*]
And, sure, it worked in the tests, but when I run a crawler against the file, the table created includes only an events field with a struct inside it.
So I tried a bunch of other paths:
$[*].events
$[*].['events']
$[*].['events'].[*]
$.[*].events[*]
$.events[*].[*]
Some of these do not even make sense, and absolutely every one of them got me a schema with an events field marked as array.
Can anyone point me to a better direction to handle this issue?

"Reverse formatting" Riak search results

Let's say I have an object in the test bucket in my Riak installation with the following structure:
{
  "animals": {
    "dog": "woof",
    "cat": "miaow",
    "cow": "moo"
  }
}
When performing a search request for this object, the structure of the search results is as follows:
{
  "responseHeader": {
    "status": 0,
    "QTime": 3,
    "params": {
      "q": "animals_cow:moo",
      "q.op": "or",
      "filter": "",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": "0.353553",
    "docs": [
      {
        "id": "test",
        "index": "test",
        "fields": {
          "animals_cat": "miaow",
          "animals_cow": "moo",
          "animals_dog": "woof"
        },
        "props": {}
      }
    ]
  }
}
As you can see, the way the object is stored, the cat, cow, and dog keys are nested within animals. However, when the search results come back, none of the keys are nested; they are simply separated by _.
My question is this: is there any way provided by Riak to "reverse format" the search and return the fields of the object in the correct (nested) format? This becomes a problem when storing and returning user data that might itself contain _.
I do see that the latest version of Riak (beta release) provides a search schema, but I can't tell whether it answers my question.
What you receive back in the search result is what the object looked like after passing through the JSON analyzer. If you need the data formatted differently, you can use a custom analyzer; however, this will only affect newly put data.
For existing data, you can use the id field and issue a GET request for the original object, or use the Solr query as input to a MapReduce job.
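For the GET-by-id route, here is a small sketch using Riak's HTTP interface (the legacy /riak/<bucket>/<key> path; host, port, and path may differ on your install):
import json
import urllib.request

def fetch_original(bucket, key, host="localhost", port=8098):
    # Fetch the stored object by the key that search returned as "id".
    url = "http://%s:%d/riak/%s/%s" % (host, port, bucket, key)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

doc = fetch_original("test", "test")  # the search result "id" was "test"
print(doc["animals"]["cow"])          # "moo", nested exactly as stored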

Google Cloud Datastore runQuery returning 412 "no matching index found"

** UPDATE **
Thanks to Alfred Fuller for pointing out that I need to create a manual index for this query.
Unfortunately, using the JSON API, from a .NET application, there does not appear to be an officially supported way of doing so. In fact, there does not officially appear to be a way to do this at all from an app outside of App Engine, which is strange since the Cloud Datastore API was designed to allow access to the Datastore outside of App Engine.
The closest hack I could find was to POST the index definition using RPC to http://appengine.google.com/api/datastore/index/add. Can someone give me the raw spec for how to do this exactly (i.e. URL parameters, what exactly should the body look like, etc), perhaps using Fiddler to inspect the call made by appcfg.cmd?
** ORIGINAL QUESTION **
According to the docs, "a query can combine equality (EQUAL) filters for different properties, along with one or more inequality filters on a single property".
However, this query fails:
{
  "query": {
    "kinds": [
      {
        "name": "CodeProse.Pogo.Tests.TestPerson"
      }
    ],
    "filter": {
      "compositeFilter": {
        "operator": "and",
        "filters": [
          {
            "propertyFilter": {
              "operator": "equal",
              "property": {
                "name": "DepartmentCode"
              },
              "value": {
                "integerValue": "123"
              }
            }
          },
          {
            "propertyFilter": {
              "operator": "greaterThan",
              "property": {
                "name": "HourlyRate"
              },
              "value": {
                "doubleValue": 50
              }
            }
          },
          {
            "propertyFilter": {
              "operator": "lessThan",
              "property": {
                "name": "HourlyRate"
              },
              "value": {
                "doubleValue": 100
              }
            }
          }
        ]
      }
    }
  }
}
with the following response:
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "FAILED_PRECONDITION",
        "message": "no matching index found.",
        "locationType": "header",
        "location": "If-Match"
      }
    ],
    "code": 412,
    "message": "no matching index found."
  }
}
The JSON API does not yet support local index generation, but we've documented a process that you can follow to generate the xml definition of the index at https://developers.google.com/datastore/docs/tools/indexconfig#Datastore_Manual_index_configuration
Please give this a shot and let us know if it doesn't work.
This is a temporary solution that we hope to replace with automatic local index generation as soon as we can.
The error "no matching index found." indicates that an index needs to be added for the query to work. See the auto index generation documentation.
In this case you need an index with the properties DepartmentCode and HourlyRate (in that order).
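For reference, the corresponding index.yaml entry would look something like this sketch (the kind name is taken from the query above; adjust it to your actual kind):
indexes:
- kind: CodeProse.Pogo.Tests.TestPerson
  properties:
  - name: DepartmentCode
  - name: HourlyRate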
For gcloud-node I fixed it with these 3 links:
https://github.com/GoogleCloudPlatform/gcloud-node/issues/369
https://github.com/GoogleCloudPlatform/gcloud-node/blob/master/system-test/data/index.yaml
and, most important, this link:
https://cloud.google.com/appengine/docs/python/config/indexconfig#Python_About_index_yaml which explains how to write your index.yaml file
As explained in the last link, an index is what allows complex queries to run faster, by storing the result set of the query in an index. When you get "no matching index found", it means you tried to run a complex query involving order or filter. So to make your query work, you need to create the indexes on the Google Datastore by manually creating a config file that defines the indexes representing the query you are trying to run. Here is how you fix it:
Create an index.yaml file in a folder named, for example, indexes in your app directory, following the directives for the Python conf file: https://cloud.google.com/appengine/docs/python/config/indexconfig#Python_About_index_yaml or get inspiration from the gcloud-node tests in https://github.com/GoogleCloudPlatform/gcloud-node/blob/master/system-test/data/index.yaml
Create the indexes from the config file with this command:
gcloud preview datastore create-indexes indexes/index.yaml
see https://cloud.google.com/sdk/gcloud/reference/preview/datastore/create-indexes
Wait for the indexes to serve in your developer console under Cloud Datastore/Indexes; the interface should display "serving" once the index is built.
Once it is serving, your query should work.
For example, for this query:
var q = ds.createQuery('project')
    .filter('tags =', category)
    .order('-date');
index.yaml looks like:
indexes:
- kind: project
  ancestor: no
  properties:
  - name: tags
  - name: date
    direction: desc
Try not to order the result. After removing orderby(), it worked for me.
