Redacting collections of String efficiently with Google DLP? - google-cloud-dlp

I'm trying to port some DLP/PII de-identifying code that I wrote against the Beta V2 version of the Google DLP Java library classes. It no longer works because the Beta V2 service was retired and the Java API has changed. I've tried to refactor the code based on the new examples here. In the past, I was able to add a collection of ContentItem instances to the DLP client request, but now it appears to be limited to one item. Since the Java DLP library wraps an HTTP REST client, I want to make sure that I send data in efficient sizes. I may have hundreds of Strings that I want to send in one request. It's not really feasible to join them into one long String and then split them apart again, and I would prefer to not have to make hundreds of requests. What is the most efficient way of doing this with the new V2 API?

You can still batch items together; instead of sending multiple ContentItem instances, put all of your values into a single ContentItem that wraps a Table.
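Here is a minimal sketch of that approach with the V2 Java client, assuming one String per table row. The EMAIL_ADDRESS info type, the replace-with-info-type transformation, and the project ID are only placeholders; substitute the inspect/de-identify configuration you already had.

```java
import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.DeidentifyConfig;
import com.google.privacy.dlp.v2.DeidentifyContentRequest;
import com.google.privacy.dlp.v2.DeidentifyContentResponse;
import com.google.privacy.dlp.v2.FieldId;
import com.google.privacy.dlp.v2.InfoType;
import com.google.privacy.dlp.v2.InfoTypeTransformations;
import com.google.privacy.dlp.v2.InspectConfig;
import com.google.privacy.dlp.v2.PrimitiveTransformation;
import com.google.privacy.dlp.v2.ProjectName;
import com.google.privacy.dlp.v2.ReplaceWithInfoTypeConfig;
import com.google.privacy.dlp.v2.Table;
import com.google.privacy.dlp.v2.Value;
import java.util.List;
import java.util.stream.Collectors;

public class DlpTableBatch {

  // Batches many Strings into one request by wrapping them in a single-column Table.
  public static List<String> deidentifyAll(String projectId, List<String> inputs) throws Exception {
    // One row per input String, all in a single "value" column.
    Table.Builder table = Table.newBuilder()
        .addHeaders(FieldId.newBuilder().setName("value").build());
    for (String s : inputs) {
      table.addRows(Table.Row.newBuilder()
          .addValues(Value.newBuilder().setStringValue(s).build())
          .build());
    }
    ContentItem item = ContentItem.newBuilder().setTable(table.build()).build();

    // Placeholder inspect/de-identify settings; swap in the config you already use.
    InspectConfig inspectConfig = InspectConfig.newBuilder()
        .addInfoTypes(InfoType.newBuilder().setName("EMAIL_ADDRESS").build())
        .build();
    DeidentifyConfig deidentifyConfig = DeidentifyConfig.newBuilder()
        .setInfoTypeTransformations(InfoTypeTransformations.newBuilder()
            .addTransformations(InfoTypeTransformations.InfoTypeTransformation.newBuilder()
                .setPrimitiveTransformation(PrimitiveTransformation.newBuilder()
                    .setReplaceWithInfoTypeConfig(ReplaceWithInfoTypeConfig.getDefaultInstance())
                    .build())
                .build())
            .build())
        .build();

    DeidentifyContentRequest request = DeidentifyContentRequest.newBuilder()
        .setParent(ProjectName.of(projectId).toString())
        .setInspectConfig(inspectConfig)
        .setDeidentifyConfig(deidentifyConfig)
        .setItem(item)
        .build();

    try (DlpServiceClient dlp = DlpServiceClient.create()) {
      DeidentifyContentResponse response = dlp.deidentifyContent(request);
      // Rows come back in the same order, so unpack the single column back into Strings.
      return response.getItem().getTable().getRowsList().stream()
          .map(row -> row.getValues(0).getStringValue())
          .collect(Collectors.toList());
    }
  }
}
```

Because the response table preserves row order, the results map back to the original Strings by index, so there is no need to join and re-split anything.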

Related

Here Maps - Autocomplete Suggestion

I am looking at using the Autocomplete API from HERE Maps, specifically the Suggestion.json endpoint. My question is that, as a user keys in characters, the autocomplete API will be called on every key press, so each key press costs one API call, which will turn out to be quite expensive. Assuming I type "London", it will call the API 6 times. Is there a better way to do this? Also, is there any option to create a session token such that I get charged only for a session in which I key in multiple characters to generate a search suggestion list?
There are a few things that you can do to reduce the number of calls:
Some places have very short names (see https://en.wikipedia.org/wiki/List_of_short_place_names), so you may have to handle even single-character queries in your autosuggest. Consider building a cache of autosuggest results: say a user types "L"; make the call to the autosuggest API once and cache the result with "L" as the key, then on subsequent key presses keep extending that data structure or data store per your requirements, so that the number of hits to the API gradually decreases.
Once you have built your cache, you can decide to refresh it after every 10 calls or so. This will greatly reduce your calls to the external API.
Look up the Trie data structure; it may be a helpful fit here.
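As a rough illustration of that caching idea (not HERE-specific; fetchSuggestions is a stand-in for the real HTTP call to the autosuggest endpoint), a prefix cache might look like this. Note that filtering locally can miss results the API would have returned for the longer query, since the API only returns a top-N list:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal prefix cache: call the autosuggest backend only for prefixes it has not
// seen; longer prefixes are answered by filtering results cached for a shorter one.
public class AutosuggestCache {

  private final Map<String, List<String>> cache = new HashMap<>();

  public List<String> suggest(String prefix) {
    List<String> hit = cache.get(prefix);
    if (hit != null) {
      return hit;
    }
    // Reuse results cached for a shorter prefix, e.g. reuse "L" when the user types "Lo".
    for (int len = prefix.length() - 1; len >= 1; len--) {
      List<String> shorter = cache.get(prefix.substring(0, len));
      if (shorter != null) {
        List<String> filtered = shorter.stream()
            .filter(s -> s.toLowerCase().startsWith(prefix.toLowerCase()))
            .collect(Collectors.toList());
        cache.put(prefix, filtered);
        return filtered;
      }
    }
    // Nothing cached yet: make one real API call and remember the result.
    List<String> fetched = fetchSuggestions(prefix);
    cache.put(prefix, fetched);
    return fetched;
  }

  // Stand-in for the real HTTP call to the HERE autosuggest endpoint.
  private List<String> fetchSuggestions(String prefix) {
    return List.of();
  }
}
```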
HERE bills you based on the number of requests you make to the backend services, so there is no option to bill by session token. You can talk to your account executive about negotiating the cost of the offering, though.
You can do this pretty easily with the JavaScript API: here's an example to get started, and here's the documentation page for the autosuggest JavaScript API.

I am using Microsoft's Face API. It requires creating a DB on their server. Is it possible to use our own DB rather than creating one on their end?

I am trying to use Microsoft's Face API for facial recognition of my company's employees. I see that you need to create a database on Microsoft's servers.
Is there a way to use their APIs against our company database (without creating another DB on their server), so that any changes we make to our DB are picked up automatically?
If not, then how do you handle the changes you need? (I know that there are delete API calls as well, but won't that be cumbersome?)
I think you may have misunderstood how this API works. Whenever there is a change to your list of employees, the PersonGroup (a friendly name for the image classifier model) must be retrained in order for it to start recognizing the added faces and stop recognizing the removed ones. So even if there were a way to store the model locally (there isn't), you would still need to track adds/removes and take the additional step of retraining.
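For illustration only, here is a hedged sketch of that "change the roster, then retrain" flow against the Face API v1.0 REST surface using java.net.http. The resource endpoint, person-group ID, and environment variable are placeholders, and the paths should be verified against the current Azure documentation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of retraining a PersonGroup after its persons/faces have been updated.
public class PersonGroupSync {

  private static final String ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com";
  private static final String GROUP_ID = "employees";
  private static final String KEY = System.getenv("FACE_API_KEY");

  private final HttpClient http = HttpClient.newHttpClient();

  // After adding/removing persons or faces in the group, kick off training
  // so the classifier reflects the current employee roster.
  public void retrain() throws Exception {
    HttpRequest train = HttpRequest.newBuilder()
        .uri(URI.create(ENDPOINT + "/face/v1.0/persongroups/" + GROUP_ID + "/train"))
        .header("Ocp-Apim-Subscription-Key", KEY)
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();
    http.send(train, HttpResponse.BodyHandlers.discarding());

    // Poll the training status until it is no longer "running".
    HttpRequest status = HttpRequest.newBuilder()
        .uri(URI.create(ENDPOINT + "/face/v1.0/persongroups/" + GROUP_ID + "/training"))
        .header("Ocp-Apim-Subscription-Key", KEY)
        .GET()
        .build();
    String body;
    do {
      Thread.sleep(1000);
      body = http.send(status, HttpResponse.BodyHandlers.ofString()).body();
    } while (body.contains("\"running\""));
  }
}
```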

Loading Bulk data in Firebase

I am trying to use the set API to write an object to Firebase. The object is fairly large; the serialized JSON is 2.6 MB in size. The root node has around 90 children, and in all there are around 10,000 nodes in the JSON tree.
The set API call seems to hang and never invokes the callback.
It also seems to cause problems with the Firebase instance.
Any ideas on how to work around this?
Since this is a commonly requested feature, I'll go ahead and merge Robert and Puf's comments into an answer for others.
There are some tools available to help with big data imports, like firebase-streaming-import. What they do internally can also be engineered fairly easily for the do-it-yourselfer:
1) Get a list of keys without downloading all the data, using a GET request and shallow=true. Possibly do this recursively depending on the data structure and dynamics of the app.
2) In some sort of throttled fashion, upload the "chunks" to Firebase using PUT requests or the API's set() method.
The critical things to keep in mind here are that the number of bytes in a request and the frequency of requests will have an impact on performance for others using the application, and will also count against your bandwidth.
A good rule of thumb is that you don't want to do more than ~100 writes per second during your import, preferably fewer than 20 to maximize realtime speeds for other users, and that you should keep the data chunks in the low MBs (certainly not GBs) per chunk. Keep in mind that all of this has to go over the internet.
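Putting the two steps above together, here is a minimal sketch against the Firebase REST API. The database URL and paths are placeholders, authentication is omitted, and each entry in `chunks` is assumed to be one already-serialized child of the root node (so roughly 90 modest writes instead of one 2.6 MB write).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

// Throttled, chunked import via the Firebase REST API.
public class FirebaseChunkedImport {

  private static final String DB_URL = "https://<your-db>.firebaseio.com";
  private final HttpClient http = HttpClient.newHttpClient();

  // Step 1: list existing keys without downloading the data under them.
  public String listKeys(String path) throws Exception {
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create(DB_URL + "/" + path + ".json?shallow=true"))
        .GET()
        .build();
    return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
  }

  // Step 2: upload each chunk with its own PUT, pausing between writes
  // to stay well under ~20 writes per second.
  public void importChunks(String path, Map<String, String> chunks) throws Exception {
    for (Map.Entry<String, String> chunk : chunks.entrySet()) {
      HttpRequest req = HttpRequest.newBuilder()
          .uri(URI.create(DB_URL + "/" + path + "/" + chunk.getKey() + ".json"))
          .header("Content-Type", "application/json")
          .PUT(HttpRequest.BodyPublishers.ofString(chunk.getValue()))
          .build();
      http.send(req, HttpResponse.BodyHandlers.ofString());
      Thread.sleep(100); // crude throttle: ~10 writes per second
    }
  }
}
```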

How does the Meteor framework partition data?

From what I know, it seems that the Meteor framework stores part of the data on the client. It's clear how this works for a personal todo list, because it's small and you can just copy everything.
But how does it work for, say, a Q&A site similar to this one? The collection of questions is huge; you can't possibly copy it to the client. And you need filtering by tags and sorting by date and popularity.
How does Meteor handle such a case? How does it partition the data? Does it make sense to use Meteor for such a use case?
Have a look at the meteor docs, in particular the publish and subscribe section. Here's a short example:
Imagine your database contains one million posts. But your client only needs something like:
the top 10 posts by popularity
the posts your friends made in the last hour
the posts for the group you are in
In other words, some subset of the larger collection. In order to get to that subset, the client starts a subscription. For example: Meteor.subscribe('popularPosts'). Then on the server, there will be a corresponding publish function like: Meteor.publish('popularPosts', function(){...}).
As the client moves around the app (changes routes), different subscriptions may be started and stopped.
The subset of documents is sent to the client and cached in memory in a mongodb-like store called minimongo. The client can then retrieve the documents as needed in order to render the page.

Bulk Collection Manipulation through a REST (RESTful) API

I'd like some advice on designing a REST API which will allow clients to add/remove large numbers of objects to a collection efficiently.
Via the API, clients need to be able to add items to the collection and remove items from it, as well as manipulating existing items. In many cases the client will want to make bulk updates to the collection, e.g. adding 1000 items and deleting 500 different items. It feels like the client should be able to do this in a single transaction with the server, rather than requiring 1000 separate POST requests and 500 DELETEs.
Does anyone have any info on the best practices or conventions for achieving this?
My current thinking is that one should be able to PUT an object representing the change to the collection URI, but this seems at odds with the HTTP 1.1 RFC, which suggests that the data sent in a PUT request should be interpreted independently of the data already present at the URI. This implies that the client would have to send a complete description of the new state of the collection in one go, which may well be much larger than the change, or even more than the client knows when it makes the request.
Obviously, I'd be happy to deviate from the RFC if necessary but would prefer to do this in a conventional way if such a convention exists.
You might want to think of the change task as a resource in itself. So you're really PUT-ing a single object, which is a Bulk Data Update object. Maybe it's got a name, owner, and big blob of CSV, XML, etc. that needs to be parsed and executed. In the case of CSV you might want to also identify what type of objects are represented in the CSV data.
List jobs, add a job, view the status of a job, update a job (probably in order to start/stop it), delete a job (stopping it if it's running) etc. Those operations map easily onto a REST API design.
Once you have this in place, you can easily add different data types that your bulk data updater can handle, maybe even mixed together in the same task. There's no need to have this same API duplicated all over your app for each type of thing you want to import, in other words.
This also lends itself very easily to a background-task implementation. In that case you probably want to add fields to the individual task objects that allow the API client to specify how they want to be notified (a URL they want you to GET when it's done, or send them an e-mail, etc.).
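As a purely hypothetical sketch of the client side of that pattern (the /bulk-updates endpoint, the base URL, and the 202-plus-Location convention are assumptions, not an existing API):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Client-side view of the "change task as a resource" pattern.
public class BulkUpdateClient {

  private static final String BASE = "https://api.example.com";
  private final HttpClient http = HttpClient.newHttpClient();

  public String submit(String csvPayload) throws Exception {
    // Create a new bulk-update job; the body is the blob of changes to apply.
    HttpRequest create = HttpRequest.newBuilder()
        .uri(URI.create(BASE + "/bulk-updates"))
        .header("Content-Type", "text/csv")
        .POST(HttpRequest.BodyPublishers.ofString(csvPayload))
        .build();
    HttpResponse<String> resp = http.send(create, HttpResponse.BodyHandlers.ofString());

    // The server answers 202 Accepted with a Location header pointing at the job resource.
    return resp.headers().firstValue("Location").orElseThrow();
  }

  public String status(String jobUrl) throws Exception {
    // Poll the job resource to see whether the update has finished.
    HttpRequest get = HttpRequest.newBuilder().uri(URI.create(jobUrl)).GET().build();
    return http.send(get, HttpResponse.BodyHandlers.ofString()).body();
  }
}
```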
Yes, PUT creates/overwrites, but does not partially update.
If you need partial update semantics, use PATCH. See http://greenbytes.de/tech/webdav/draft-dusseault-http-patch-14.html.
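For example, a partial update to the collection could be sent with java.net.http; the shape of the patch document (adds and removes by id) is made up here and would be defined by your API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sends a partial update to the collection resource itself.
public class CollectionPatch {
  public static void main(String[] args) throws Exception {
    String patch = "{ \"add\": [ {\"name\": \"item-1001\"} ], \"remove\": [ 17, 42 ] }";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.example.com/items"))
        .header("Content-Type", "application/json")
        // java.net.http has no PATCH() shortcut, so set the method explicitly.
        .method("PATCH", HttpRequest.BodyPublishers.ofString(patch))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());
  }
}
```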
You should use AtomPub. It is specifically designed for managing collections via HTTP. There might even be an implementation for your language of choice.
For the POSTs, at least, it seems like you should be able to POST to a list URL and have the body of the request contain a list of new resources instead of a single new resource.
As far as I understand it, REST means REpresentational State Transfer, so you should transfer the state from client to server.
If that means too much data going back and forth, perhaps you need to change your representation. A collectionChange structure would work, with a series of deletions (by id) and additions (with embedded full XML representations), POSTed to a handling interface URL. The interface implementation can choose its own method for deletions and additions server-side.
The purest version would probably be to define the items by URL, and have the collection contain a series of URLs. The new collection can be PUT after changes by the client, followed by a series of PUTs for the items being added, and perhaps a series of deletions if you want to actually remove the items from the server rather than just remove them from that list.
You could introduce a meta-representation of existing collection elements that don't need their entire state transferred, so in some abstract code your update could look like this:
{existing elements 1-100}
{new element foo with values "bar", "baz"}
{existing element 105}
{new element foobar with values "bar", "foo"}
{existing elements 110-200}
Adding (and modifying) elements is done by defining their values, deleting elements is done by not mentioning them in the new collection, and reordering elements is done by specifying the new order (if order is stored at all).
This way you can easily represent the entire new collection without having to re-transmit the entire content. Using an If-Unmodified-Since header makes sure that your idea of the content indeed matches the server's idea (so that you don't accidentally remove elements that you simply didn't know about when the request was submitted).
The best way is:
1. Pass only an array of IDs of the objects to delete from the front-end application to the Web API.
2. Then you have two options:
2.1 Web API way: find all the collections/entities using the ID array and delete them in the API, but you also need to take care of dependent entities, such as foreign-key relational table data.
2.2 Database way: pass the IDs to the database side, find all the records in the foreign-key and primary-key tables, and delete them in that order, i.e. foreign-key table records first, then primary-key table records.
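As a sketch of the database-side option with JDBC (table and column names are invented for illustration), deleting the foreign-key rows before the primary-key rows inside one transaction might look like this:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

// Deletes a batch of records by id, child (foreign-key) rows first, then parent rows.
public class BulkDelete {

  public static void deleteByIds(Connection conn, List<Long> ids) throws Exception {
    conn.setAutoCommit(false);
    try (PreparedStatement deleteChildren =
             conn.prepareStatement("DELETE FROM order_items WHERE order_id = ?");
         PreparedStatement deleteParents =
             conn.prepareStatement("DELETE FROM orders WHERE id = ?")) {

      for (Long id : ids) {
        deleteChildren.setLong(1, id);
        deleteChildren.addBatch();
        deleteParents.setLong(1, id);
        deleteParents.addBatch();
      }
      // Children first so the foreign-key constraint is never violated.
      deleteChildren.executeBatch();
      deleteParents.executeBatch();
      conn.commit();
    } catch (Exception e) {
      conn.rollback();
      throw e;
    }
  }
}
```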
