I'm beginning the process of instrumenting a web application, and using StatsD to gather as many relevant metrics as possible. For instance, here are a few examples of the high-level metric names I'm currently using:
http.responseTime
http.status.4xx
http.status.5xx
view.renderTime
oauth.begin.facebook
oauth.complete.facebook
oauth.time.facebook
users.active
...and there are many, many more. What I'm grappling with right now is establishing a consistent hierarchy and set of naming conventions for the various metrics, so that the current ones make sense and that there are logical buckets within which to add future metrics.
My question is two fold:
What relevant metrics are you gathering that you have found indespensible?
What naming structure are you using to categorize metrics?
This is a question that has no definitive answer but here's how we do it at Datadog (we are a hosted monitoring service so we tend to obsess over these things).
1. Which metrics are indispensable? It depends on the beholder. But at a high-level, for each team, any metric that is as close to their goals as possible (which may not be the easiest to gather).
System metrics (e.g. system load, memory etc.) are trivial to gather but seldom actionable because they are too hard to reliably connect them to a probable cause.
On the other hand number of completed product tours matter to anyone tasked with making sure new users are happy from the first minute they use the product. StatsD makes this kind of stuff trivially easy to collect.
We have also found that the core set of key metrics for any teamchanges as the product evolves so there is a continuous editorial process.
Which in turn means that anyone in the company needs to be able to pick and choose which metrics matter to them. No permissions asked, no friction to get to the data.
2. Naming structure The highest level of hierarchy is the product line or the process. Our web frontend is internally called dogweb so all the metrics from that component are prefixed with dogweb.. The next level of hierarchy is the sub-component, e.g. dogweb.db., dogweb.http., etc.
The last level of hierarchy is the thing being measured (e.g. renderTime or responseTime).
The unresolved issue in graphite is the encoding of metric metadata in the metric name (and selection using *, e.g. dogweb.http.browser.*.renderTime) It's clever but can get in the way.
We ended up implementing explicit metadata in our data model, but this is not in statsd/graphite so I will leave the details out. If you want to know more, contact me directly.
Related
There's any way to list the kinds that are not being used in google's datastore by our app engine app without having to look into our code and/or logic? : )
I'm not talking about indexes, which I can list by issuing an
gcloud datastore indexes list
and then compare with the datastore-indexes.xml or index.yaml.
I tried to check datastore kinds statistics and other metadata but I could not find anything useful to help me on this matter.
Should I give up to find ways of datastore providing me useful stats and code something to keep collecting datastore statistics(like data size), during a huge period to have at least a clue of which kinds are not being used and then, only after this research, take a look into our app code to see if the kind Model was removed?
Example:
select bytes from __Stat_Kind__
Store it somewhere and keep updating for a period. If the Kind bytes size does not change than probably the kind is not being used anymore.
The idea is to do some cleaning in datastore.
I would like to find which kinds are not being used anymore, maybe for a long time or were created manually to be used once... You know, like a table in oracle that no one knows what is used for and then if we look into the statistics of that table we would see that this table was only used once 5 years ago. I'm trying to achieve the same in datastore, I want to know which kinds are not being used anymore or were used a while ago, then ask around and backup/delete it if no owner was found.
It's an interesting question.
I think you would be best-placed to audit your code and instill organizational practice that requires this documentation to be performed in future as a business|technical pre-prod requirement.
IIRC, Datastore doesn't automatically timestamp Entities and keys (rightly) aren't incremental. So there appears no intrinsic mechanism to track changes short of taking a snapshot (expensive) and comparing your in-flight and backup copies for changes (also expensive and inconclusive).
One challenge with identifying a Kind that appears to be non-changing is that it could be referenced (rarely) by another Kind and so, while it does not change, it is required.
Auditing your code and documenting it for posterity should not only provide you with a definitive answer (and identify owners) but it pays off a significant technical debt that has been incurred and avoids this and probably future problems (e.g. GDPR-like) requirements that will arise in the future.
Assuming you are referring to records being created/updated, then I can think of the following options
Via the Cloud Console (Datastore > Dashboard) - This lists all your 'Kinds' and the number of records in each Kind. Theoretically, you can take a screen shot and compare the counts so that you know which one has experienced an increase or not.
Use of Created/LastModified Date columns - I usually add these 2 columns to most of my datastore tables. If you have them, then you can have a stored function that queries them. For example, you run a query to sort all of your Kinds in descending order of creation (or last modified date) and you only pull the first record from each one. This tells you the last time a record was created or modified.
I would write a function as part of my App, put it behind a page which requires admin privilege (only app creator can run it) and then just clicking a link on my App would give me the information.
I have an aggerate data model (think a Customer entity with Widgets that belong to them as a list of embedded entities).
When I search for customers (e.g DocumentDBRepository.GetItemsAsync) That will be hydrating the customer data model along with the widgets for each. For efficiency reasons, I don’t really need the customer search to consider the widgets.
Are there any strategies for this in document dbs (such as a “LiteCustomer” entity)? I suspect not as that is just the nature of the “schema-less” data I’ve told it to store in the first place, but interested to hear thoughts.
Is this simply a ‘non issue’?
First, disclaimer: data modeling is hard. There are many nuances and a SO question can never cover entire business and everything left unsaid in both Q and A. There's no silver bullets. Regardless..
"LiteCustomer"
Perfectly fine to have such model in your client code. Your main Customer model may and will have many representations, most of them simple subsets of full model. Similarly to relational sql, select only what you need. Don't fetch data to client which you don't need.
The SQL API provides quite cool SQL tools to compose json for return documents for you.
physical storage model may differ from domain model
Consider your usage scenarios. If many scenarios happen to work with customer without widgets (or vice versa) then consider having widgets as separate document(s) in storage model.
In DocDB, the question is often not so much in querying logic but what your application expects on modification logic. Querying which is indexed is fast and every sql query can easily do transformations (though cross-doc joining is troublesome). For C(R)UD - you have less options - it's always by full document. Having too large documents will end up with higher RU costs and complex code.
Questions to consider:
How often customer changes without widget count/details changing?
How often widgets change without customer changing?
Do widgets on customer change independently or always as a set?
When do you need transactional updates on customer+widget changes?
How would queries look like? Can they be indexed?
Test.
True, changing model later is cumbersome in DocDB, but don't try to fix something before you know it's broken. If you are not sure you have an issue or not, then most likely fixing the maybe-issue is costlier than not fixing it.
If in doubt, generate loads of data and test it out.
I'm starting a new project and am interested in architecting it as microservices. I'm trying to wrap my head around it:
Say that I have an order service and a product service. Now I want to make a report service that gives me all orders that contain a product from a certain product category.
Since order's dont know about products that means that I would need to fetch all orders, loop them and fetch products for each order and then return those how match.
Is this assumption correct or is there any more efficient way of doing this with microservices?
In a microservices architecture, the procedure is to distill the use cases and the service boundaries of the application. In the question above, there are at least two service boundaries, namely one for transactions and another for reporting.
When you have two different service boundaries, the typical approach is to duplicate some data elements between them eg. whenever you make a sale, the data, should be sent to both the reporting and transactional services. One possible approach of broadcasting the data to the different boundaries is to use a message queue. Duplicating the data allows them to be evolve and operate independently and become self sufficient which is one of the goals of microservices.
A personal word of advice though, you might want to start with a monolith before going the microservices route. Microservices are generally more operationally heavy; it will be difficult to reason about its advantages during the initial application stages. It tends to work better after having developed the monolithic application since it would be easier to see what didn't work and what could be improved by a microservices-like system.
I know how Hyperloglog works but I want to understand in which real-world situations it really applies i.e. makes sense to use Hyperloglog and why? If you've used it in solving any real-world problems, please share. What I am looking for is, given the Hyperloglog's standard error, in which real-world applications is it really used today and why does it work?
("Applications for cardinality estimation", too broad? I would like to add this simply as a comment but it won't fit).
I would suggest you turn to the numerous academic research of the subject; usually academic papers contain some information of "prior research on the subject" as well as "applications for which the subject has been used". You could start with traversing the references of interest as referenced by the following article:
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, by P. Flageolet et al.
... This problem has received a great deal of attention over the past
two decades, finding an ever growing number of applications in
networking and traffic monitoring, such as the detection of worm
propagation, of network attacks (e.g., by Denial of Service), and of
link-based spam on the web [3]. For instance, a data stream over a
network consists of a sequence of packets, each packet having a
header, which contains a pair (source–destination) of addresses,
followed by a body of specific data; the number of distinct header
pairs (the cardinality of the multiset) in various time slices is an
important indication for detecting attacks and monitoring traffic, as
it records the number of distinct active flows. Indeed, worms and
viruses typically propagate by opening a large number of different
connections, and though they may well pass unnoticed amongst a huge
traffic, their activity becomes exposed once cardinalities are
measured (see the lucid exposition by Estan and Varghese in [11]).
Other applications of cardinality estimators include data mining of
massive data sets of sorts—natural language texts [4, 5], biological
data [17, 18], very large structured databases, or the internet graph,
where the authors of [22] report computational gains by a factor of
500+ attained by probabilistic cardinality estimators.
At my work, HyperLogLog is used to estimate the number of unique users or unique devices hitting different code paths in online services. For example, how many users are affected by each type of service error? How many users use each feature? There are MANY interesting questions HyperLogLog allows us to answer.
Stackoverflow might use hyperloglog to count the views of each question. Stackoverflow wants to make sure that one user can only contribute one view per item so every view is unique.
It could be implemented with set. every question would have a set that stores the usernames:
question#ID121e={username1,username2...}
For each question creating a set would take up some space and consider how many questions have been asked on this platform. The total amount of space to keep track of every view per user would be huge. But hyperloglog uses about 12 kB of memory per key no matter how many usernames are added, even 10 million views.
I am trying to calculate physical distances between geographic locations (addresses) with ggmaps/mapdist function in R. Apart from the uncomfortable fact that Google Maps allows only 2500 queries/session, I have to cope with the misspelled or other way imperfect "addresses". The most typical problem is that the exact address strings themselves are added by several other info (floor, door etc.), but it is very problematic to detect any pattern in these what would allow applying regular expression.
My goal is:
Check if the address string is recognizable to Google Maps;
If not, find a way to truncate to an acceptable form, perhaps by parsing words step by step from the string.
Have anybody coped with this kind of problem?
Thanks.
There are a couple of factors running into each other here. One factor is the misspellings and other complexities related to addresses and the other is pinpointing (geocoding) a given address. Although they are related problems, each must be handled to accomplish your objectives.
There are numerous service providers out there that can do either or both with minimal cost involved. This can be found with a simple Google search. You can then investigate each to see if they match your use case and licensing requirements.
All of that considered, you'll want to get your address list cleaned up on a minimum. Doing that will enable you to utilize any number of geocoding providers.
Depending upon the size of your list, you can get your list cleaned up and geocoded for perhaps $20.
In the interest of full disclosure, I'm the founder of SmartyStreets. We provide a web interface (to help clean up the address list) as well as an API (which can be used on a continual basis to keep addresses clean). We also geocode your list at no extra charge. Further, we don't have any licensing restrictions on the number of lookups that can be performed during a given timeframe. (We have customers that hit us hundreds of millions of times per day.) The entire process of signing up and cleaning up your list takes just a few minutes.