Cloud DLP inspection scan to look for multiple infoTypes in same row - google-cloud-dlp

I have to run an inspection scan on a BigQuery table. My goal is to highlight/find a row only if it contains, say, the first_name, last_name, Phone_number and age infoTypes (all in the same row).
I'm new to Cloud DLP and have created a job trigger (with all the infoTypes I'm interested in) to scan data from a BQ table. I'm not really sure if Inspection Rulesets can help here.
Just in case my point is not clear: https://help.symantec.com/cs/DLP15.0/DLP/v54111221_v120691346/Coincidencia-con-3-columnas-en-una-condici?locale=EN_US

Cloud DLP provides a set of built-in infoType detectors, which you can specify by name; each of them is listed in the InfoType detector reference. These detectors use a variety of techniques to discover and classify each type. For example, some types require a pattern match, some may have mathematical checksums, some have special digit restrictions, and others may have specific prefixes or context around the findings.
However, there are additional options, as shown here, which you can use to further explore your use case.
"Important: Built-in infoType detectors are not a 100% accurate detection method. For example, they can't guarantee compliance with regulatory requirements. You must decide what data is sensitive and how to best protect it. Google recommends that you test your settings to make sure your configuration meets your requirements."
You can find some additional information here.
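Cloud DLP reports each finding individually rather than per row, so "all infoTypes in the same row" usually ends up as a small post-processing step over the saved findings. Below is a hedged sketch using the google-cloud-dlp Python client; the project, dataset, table, and row_id column names are placeholders, and grouping findings by row is my assumption rather than a built-in DLP feature.

# Sketch only: project/dataset/table/row_id names are placeholders.
import google.cloud.dlp_v2 as dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"

inspect_job = {
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": "my-project",
                "dataset_id": "my_dataset",
                "table_id": "customers",
            },
            # Ask DLP to echo this column back with every finding so rows can be grouped later.
            "identifying_fields": [{"name": "row_id"}],
        }
    },
    "inspect_config": {
        "info_types": [
            {"name": "FIRST_NAME"},
            {"name": "LAST_NAME"},
            {"name": "PHONE_NUMBER"},
            {"name": "AGE"},
        ],
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    },
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "my-project",
                        "dataset_id": "dlp_results",
                        "table_id": "findings",
                    }
                }
            }
        }
    ],
}

job = dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print(job.name)
# Afterwards, query dlp_results.findings and keep only the row_id values that have
# findings for all four infoTypes (e.g. GROUP BY row_id HAVING COUNT(DISTINCT info_type) = 4).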

Related

Cronofy available_periods does not do anything

I am trying to get a user's availability, but in one use case, I want to ignore their actual availability rules and even their current schedule and calendar. Basically, I am using Cronofy in this use case just to provide me with a list of times.
According to the docs https://docs.cronofy.com/developers/api/scheduling/availability/
I should be able to specify participants.members.available_periods.start and participants.members.available_periods.end. I've read and re-read these params over and over and am sure I am sending them as specified, but Cronofy still returns only times when the user is not "busy".
Am I still not understanding this param? Is there another way to ignore a user's calendar, i.e. ignore their "busy" time slots?
The intent of the participants.members.available_periods parameter is to define a specific participant's availability hours for the query. Useful in multi-person queries when one or more participants have ad-hoc shift patterns or other complicated working hours. You can choose to specify these here or use our Available Periods endpoint along with Managed Availability to have the availability query consider the latest set of Available Periods when it is run.
The Availability query isn't designed to ignore all calendar events for an individual participant but there is another way you can achieve what you're looking for.
Application Calendars are designed as drop-in replacements for synced-calendar Accounts in Cronofy. So you can create one or more of these via the API and use them in the Availability query as a stand-in for the participant.
They still support Managed Availability and can have events created in them. So if you wanted to ensure that your application doesn't double book over the events it already knows about, you can just create the events as your application books them.
I hope this helps. Our support team (support@cronofy.com) are always happy to talk through the specifics of your use case if that would be helpful.
-- UPDATE --
We've decided to support this as a first class concept in our API.
You can now pass an empty array to the participants.members.calendar_ids attribute to indicate that you don't want any calendars included within the availability query; thus, only the Availability Rules or query periods will be considered. Thanks for the question.
More information here.
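For reference, here is a hedged sketch of such a request in Python; the access token, account sub, and times are placeholders, and the field names follow the availability documentation linked above.

# Illustrative only: the token, sub, and times below are made up.
import requests

response = requests.post(
    "https://api.cronofy.com/v1/availability",
    headers={"Authorization": "Bearer YOUR_ACCESS_TOKEN"},
    json={
        "participants": [
            {
                "members": [
                    {
                        "sub": "acc_1234567890",
                        "calendar_ids": [],  # empty array: ignore this member's calendars entirely
                    }
                ],
                "required": "all",
            }
        ],
        "required_duration": {"minutes": 60},
        "available_periods": [
            {"start": "2020-07-01T09:00:00Z", "end": "2020-07-01T17:00:00Z"}
        ],
    },
)
print(response.json())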

System Design Interview - Car API

System Design Question:
You are given a dataset of a few million used cars and information about them -- miles, color, price, etc. You have to create an API endpoint in two days that allows users to query the dataset.
This was the answer I gave:
Use a relational database (let's say PostgreSQL) to house the data. Expose a GET endpoint that takes query string parameters corresponding to the attributes in the dataset, parses them and uses them to query the database. The endpoint can also track which attributes are queried the most and add indexes to those attributes to speed up the queries. I was asked how I would handle a range (e.g. "car with 50,000 <= miles <= 100,000") to which I said this can be handled by the query string parameter and translated into the SQL query by the GET endpoint.
Feedback
I was told in feedback afterwards that this answer "didn't convey a strong understanding of how to design web systems." I was hoping for some insight into where my solution may have been insufficient or weak, or where I may have overlooked something about designing web systems.
Note: I reconstructed my answer from memory so it may be clearer here than it was in the interview.
Thanks for any help!
As already discussed in the comments, the interviewer most likely wanted to hear something about SQL injection. There are several countermeasures you can take to avoid SQL injection. These include (most probably not a complete list, but it should give a hint about what to look out for; a short sketch of the first point follows the list):
Use prepared statements
Restrict access rights (in the DB as well as on the OS)
Validate the user input
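As a minimal illustration of the first point, here is a hedged Python/psycopg2 sketch; the connection string, table, and column names are invented, and any driver or ORM offers the same kind of parameter binding.

# Sketch only: connection details, table, and column names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=cars user=api")

def find_cars(min_miles, max_miles, color):
    # BAD: building SQL from raw query-string values invites SQL injection, e.g.
    #   f"SELECT * FROM cars WHERE color = '{color}' AND miles BETWEEN {min_miles} AND {max_miles}"
    # GOOD: let the driver bind the values as parameters of a prepared statement.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, color, miles, price FROM cars "
            "WHERE color = %s AND miles BETWEEN %s AND %s",
            (color, min_miles, max_miles),
        )
        return cur.fetchall()

print(find_cars(50_000, 100_000, "red"))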

R geographic address validation

I am trying to calculate physical distances between geographic locations (addresses) with the ggmap/mapdist function in R. Apart from the uncomfortable fact that Google Maps allows only 2500 queries per session, I have to cope with misspelled or otherwise imperfect "addresses". The most typical problem is that the exact address strings are padded with other information (floor, door, etc.), and it is very hard to detect any pattern in these additions that would allow applying a regular expression.
My goal is:
Check if the address string is recognizable to Google Maps;
If not, find a way to truncate it to an acceptable form, perhaps by parsing words step by step from the string.
Has anybody coped with this kind of problem?
Thanks.
There are a couple of factors running into each other here. One factor is the misspellings and other complexities related to addresses and the other is pinpointing (geocoding) a given address. Although they are related problems, each must be handled to accomplish your objectives.
There are numerous service providers out there that can do either or both with minimal cost involved. They can be found with a simple Google search. You can then investigate each to see if it matches your use case and licensing requirements.
All of that considered, you'll want to get your address list cleaned up at a minimum. Doing that will enable you to utilize any number of geocoding providers.
Depending upon the size of your list, you can get your list cleaned up and geocoded for perhaps $20.
In the interest of full disclosure, I'm the founder of SmartyStreets. We provide a web interface (to help clean up the address list) as well as an API (which can be used on a continual basis to keep addresses clean). We also geocode your list at no extra charge. Further, we don't have any licensing restrictions on the number of lookups that can be performed during a given timeframe. (We have customers that hit us hundreds of millions of times per day.) The entire process of signing up and cleaning up your list takes just a few minutes.
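To illustrate the question's fallback idea of truncating an address until the geocoder recognizes it, here is a rough Python sketch against the Google Geocoding API; the API key is a placeholder, the word-dropping strategy is only a heuristic (ggmap::geocode in R gives you a comparable status to check), and it is no substitute for proper address cleaning.

# Heuristic sketch only: the API key is a placeholder and the truncation
# strategy is a guess, not a recommended cleaning method.
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
API_KEY = "YOUR_API_KEY"

def geocode(address):
    """Return the first geocoding result, or None if the address is not recognized."""
    resp = requests.get(GEOCODE_URL, params={"address": address, "key": API_KEY}).json()
    return resp["results"][0] if resp.get("status") == "OK" else None

def geocode_with_truncation(address):
    """Drop trailing words (floor, door, etc.) until the geocoder returns a hit."""
    words = address.split()
    while words:
        result = geocode(" ".join(words))
        if result:
            return result
        words.pop()  # drop the last word and retry
    return None

print(geocode_with_truncation("Main Street 12 2nd floor door 5, Budapest"))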

Relational behavior against a NoSQL document store for ODBC support

The first assertion is that document-style NoSQL databases such as MarkLogic and MongoDB should store each piece of information in a nested/complex object.
Consider the following model
<patient>
  <patientid>1000</patientid>
  <firstname>Johnny</firstname>
  <claim>
    <claimid>1</claimid>
    <claimdate>2015-01-02</claimdate>
    <charge><amount>100</amount><code>374.3</code></charge>
    <charge><amount>200</amount><code>784.3</code></charge>
  </claim>
  <claim>
    <claimid>2</claimid>
    <claimdate>2015-02-02</claimdate>
    <charge><amount>300</amount><code>372.2</code></charge>
    <charge><amount>400</amount><code>783.1</code></charge>
  </claim>
</patient>
In the relational world this would be modeled as a patient table, claim table, and claim charge table.
Our primary desire is to feed downstream applications with this data and simultaneously perform analytics on it. Since we don't want to write a complex program for every measure, we should be able to put a tool on top of this. For example, Tableau claims to have a native connection to MarkLogic, which goes through ODBC.
When we create views using range indexes on our document model, the SQL against them in MarkLogic returns excessive repeating results. The charge amounts are also double-counted by sum functions. It does not work.
The thought is that through these index, view, and possibly fragment techniques of MarkLogic, we can define a semantic layer that resembles a relational structure.
The documentation hints that you should create one object per table, but this seems to go against the preferred document-database structure.
What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?
If the ODBC connection is always going to return bad data and not be aware of relationships, then the claims of ODBC support against NoSQL made by all of these tools are not really true.
References
https://docs.marklogic.com/guide/sql/setup
https://docs.marklogic.com/guide/sql/tableau
http://www.marklogic.com/press-releases/marklogic-and-tableau-build-connection/
https://developer.marklogic.com/learn/arch/data-model
For your question: "What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?"
The rule of thumb I use is that when I want to count "objects", I model them as separate documents. So if you want to run queries that count patients, claims, and charges, you would put them in separate documents.
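For instance, the patient model from the question could be split into one document per object, roughly like this (the URIs in the comments and the repeated patientid/claimid linking elements are just one illustrative choice):

<!-- /patients/1000.xml -->
<patient><patientid>1000</patientid><firstname>Johnny</firstname></patient>
<!-- /claims/1.xml -->
<claim><claimid>1</claimid><patientid>1000</patientid><claimdate>2015-01-02</claimdate></claim>
<!-- /charges/1-a.xml -->
<charge><claimid>1</claimid><amount>100</amount><code>374.3</code></charge>

Counting claims or charges then becomes a plain document count, and the double counting seen through the SQL views should go away, since each charge maps to exactly one row.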
That doesn't mean we're constraining MarkLogic to only relational patterns. In UML terms, a one-to-many relationship can be a composition or an aggregation. In a relational model, I have no choice but to model those as separate tables. But in a document model, I can do separate documents per object or roll them all together - the choice is usually based on how I want to query the data.
So your first assertion is partially true - in a document store, you have the option of nesting all your related data, but you don't have to. Also note that because MarkLogic is schema-agnostic, it's straightforward to transform your data as your requirements evolve (corb is a good option for this). Certain requirements may require denormalization to help searches run efficiently.
Brief example - a person can have many names (aliases, maiden name) and many addresses (different homes, work address). In a relational model, I'd need a persons table, a names table, and an addresses table. But I'd consider the names to be a composite relationship - the lifecycle of a name equals that of the person - and so I'd rather nest those names into a person document. An address OTOH has a lifecycle independent of the person, so I'd make that an address document and toss an element onto the person document for each related address. From an analytics perspective, I can now ask lots of interesting questions about persons and their names, and persons and addresses - I just can't get counts of names efficiently, because names aren't in separate documents.
I guess MarkLogic is a little atypical compared to other document stores. It works best when you don't store an entire table as one document, but one record per document. MarkLogic indexing is optimized for this approach, and handles searching across millions of documents easily that way. You will see that as soon as you store records as documents, results in Tableau will improve greatly.
Splitting documents into such small fragments also allows higher performance and a lower footprint. MarkLogic doesn't hold the data as persisted DOM trees that allow random access. Instead, it streams the data in a very efficient way and relies on index resolution to pull relevant fragments quickly.
HTH!

How to implement gapless, user-friendly IDs in NHibernate?

I'm designing an application where my Order objects need to have a sequential and user-friendly Id field. I'm avoiding the HiLo algorithm because of the rather large gaps it produces (see here). Naturally, Guid values would make my corporate users go bananas. I'm also avoiding Oracle sequences because of their major disadvantages:
(From: NHibernate POID Generators revealed)
Post insert generators, as the name suggest, assigns the id's after the entity is stored in the database. A select statement is executed against database. They have many drawbacks, and in my opinion they must be used only on brownfield projects. Those generators are what WE DO NOT SUGGEST as NH Team.
Some of the drawbacks are the following:
Unit Of Work is broken with the use of those strategies. It doesn't matter if you're using FlushMode.Commit, each Save results in an insert statement against DB. As a best practice, we should defer insertions to the commit, but using a post insert generator makes it commit on save (which is what UoW doesn't do).
Those strategies nullify batcher, you can't take the advantage of sending multiple queries at once (as it must go to database at the time of Save).
Any ideas/experience on implementing user-friendly IDs without major gaps between them?
Edit:
User-friendly Id fields are ones my corporate users can memorize and even discuss, having phone conversations about a particular Order by its code, e.g. "I'm calling to know why order #1625 was denied."
The Id doesn't need to be strictly gapless, but I am worried that my users would get confused when they see gaps like 100, 201, 305. For my older projects, I currently implement NHibernate using Oracle sequences, which occasionally lose a few values when exceptions are thrown but still keep a rather tidy order. The downside is that they break the Unit of Work, which results in an additional hit to the database for every Save command, with or without Session.Flush.
One option would be to keep a key-table that simply stores an incrementing value. This can introduce a few problems, namely possible locking issues as well as additional hits to the database.
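A hedged sketch of that key-table approach (the order_key table and its column name are invented; with NHibernate you would typically run the same increment in a short, separate transaction just before saving the Order), shown here with SQLite purely for illustration:

# Sketch only: the order_key table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect("orders.db", isolation_level=None)  # autocommit; transactions managed by hand
conn.execute("CREATE TABLE IF NOT EXISTS order_key (next_id INTEGER NOT NULL)")
if conn.execute("SELECT COUNT(*) FROM order_key").fetchone()[0] == 0:
    conn.execute("INSERT INTO order_key (next_id) VALUES (0)")

def next_order_id(conn):
    """Increment and read the counter under a write lock so concurrent callers serialize."""
    conn.execute("BEGIN IMMEDIATE")  # this lock is the contention point mentioned above
    conn.execute("UPDATE order_key SET next_id = next_id + 1")
    new_id = conn.execute("SELECT next_id FROM order_key").fetchone()[0]
    conn.execute("COMMIT")
    return new_id

print(next_order_id(conn))  # 1, 2, 3, ... gapless unless a caller rolls back after incrementing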
Another option might be to refine what you mean by "User-friendly Id". This could consist of a combination of a Date/Time and a customer-specific sequence (or including the customer id as well). Also, your order id does not necessarily have to be the actual key on the table. There is nothing to say that you can't use a surrogate key with a separate "calculated" column which represents the order id.
The bottom line is that it sounds like you want to use a surrogate key but have the benefits of a natural key. It can be very difficult to have it both ways, and a lot comes down to how you actually plan on using the data, how users interpret the data, and personal preference.
