Non-trivial analytics on time series - OpenTSDB

I was wondering, what is the mechanism in OpenTSDB for applying more complex functions than aggregations to the data? For example, if I wanted to calculate moving averages, how would I go about it? Ideally I would like a mechanism to run (through OpenTSDB) custom aggregators at the HBase level (example), but I think this might not be very easy to implement. Alternatively, something that runs at the OpenTSDB level would be nice. I think this might be conceptually close to what issue 546/Pull 562 talk about.

You can put a graphite-web app in front of the database and use its Functions: http://graphite.readthedocs.org/en/latest/functions.html
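For example, graphite-web's movingAverage() function can be applied through its render API. A minimal sketch using Java's built-in HTTP client, where the graphite host and the metric name are placeholders:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class MovingAverageQuery {
    public static void main(String[] args) throws Exception {
        // Apply graphite-web's movingAverage() over a 10-point window.
        // "graphite.example.com" and "sys.cpu.user" are placeholders.
        String target = "movingAverage(sys.cpu.user,10)";
        String url = "http://graphite.example.com/render?format=json&target="
                + URLEncoder.encode(target, StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The response is a JSON list of {target, datapoints} objects.
        System.out.println(response.body());
    }
}
```

Any other function from the list linked above can be composed into the target expression the same way.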

Related

Java-compatible simple expression language

I am building a system in Scala for feature engineering where the end-user API receives aggregations over lists of objects/events.
For example, a client of this tool might pass it a function that, given an array of past pageviews for a specific web user, filters for the ones coming from a specific country and counts them. The output of this call will then be a number.
It can be thought of as a very simple reduce operation.
I am figuring out how to build the API for such a system. I could write a simple custom language to perform counts and filters, but I am sure that is not the best approach, especially because it will not be expressive enough unless designed with care.
Are you aware of something like an expression language that could be used to express simple functions without the need for me to build one from scratch?
The other option would be to allow end users to pass custom code to this library, which might be dangerous at runtime.
I am aware of Apache Calcite for plugging SQL into different data structures and databases. It is a good option; however, it forces me to think in the "columnar" SQL way, while here I am looking for something more row-based, similar to the map-reduce style of programming.
You mention Apache Calcite, but did you know that you can use it without writing SQL? If you use Calcite's RelBuilder class you can build algebra directly, which is very similar to the algebraic approach of MapReduce. See algebra in Calcite for more details.
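As a rough sketch of what the pageview example could look like with RelBuilder (the PAGEVIEWS table, its USER_ID and COUNTRY columns, and the country literal are hypothetical, and the table would have to be registered in the schema through one of Calcite's adapters):

```java
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.RelBuilder;

public class PageviewCount {
    public static void main(String[] args) {
        // Assumes a PAGEVIEWS table (with USER_ID and COUNTRY columns) has been
        // registered in the default schema, e.g. via one of Calcite's adapters.
        FrameworkConfig config = Frameworks.newConfigBuilder()
                .defaultSchema(Frameworks.createRootSchema(true))
                .build();
        RelBuilder builder = RelBuilder.create(config);

        RelNode plan = builder
                .scan("PAGEVIEWS")                                   // the event list
                .filter(builder.call(SqlStdOperatorTable.EQUALS,     // the "filter" step
                        builder.field("COUNTRY"),
                        builder.literal("DE")))
                .aggregate(builder.groupKey("USER_ID"),              // the "count" step
                        builder.count(false, "PAGEVIEW_COUNT"))
                .build();

        System.out.println(RelOptUtil.toString(plan));
    }
}
```

Since RelBuilder is a plain Java API, the same calls can be made directly from Scala.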

Riak-TS use case vs other TSDBs

This is a proof of concept, and I am curious about others' experiences with Riak-TS in order to evaluate it.
I am working on a mobile app where part of the use is to display graphs/charts of various data. The data relates to commercial printers, the jobs that pass through them, and pre-processing information; it gives a snapshot of various metrics but is currently only available in real time, so I am looking at a TSDB implementation for analyzing historical data.
I would use Riak-TS to collect time series data at roughly 30-60 second intervals and use the data to display:
Number of jobs printed by hour/shift/day/week/etc.
Ink usage by hour/shift/day/etc.
Various other sums/averages/series snapshots of data over a specific time span.
What are some things I should consider when deciding whether to use Riak-TS for this, and what potential drawbacks should I think about?
What level of Erlang is required to use Riak for a basic proof-of-concept set-up of this case? I am pretty comfortable with Python and JavaScript, and it looks like Riak has clients for those languages, but I probably don't have time to learn Erlang for the setup of this project.
Is there a noticeable difference between the Python, Node.js, and HTTP interfaces - is one easier to use, faster, or richer in features? I have worked with some cloud services where certain interfaces had missing/buggy/slow features, and I would like to plan on using the best one. If that is Java, C#, or Go I would be interested in that information too.
What other open source implementations outside of Riak-TS should I explore?
At first blush this sounds like a good potential use case for Riak TS. Are there drawbacks to using TS vs something else? Maybe. The one thing I would note is that you didn't say how much data you will be dealing with. Riak TS is designed to be clustered from the beginning, and the recommendation is that you start with a 5-node cluster for high-availability reasons. You can start with a single node and scale out as needed, but you lose out on some of the advantages of the TS platform by doing that.
I will also point out that TS was just open sourced not long ago and may not have all of the features of its competitors yet (but the team, and full disclosure I work for Basho, is working on frequent releases to add new features).
On to Erlang. You need to know 0 Erlang to use TS. For what you need to do there is no need to learn Erlang.
The Python client for Riak TS is excellent. I have used it and the Java client extensively. I would guess that the other clients are also quite good because they are written and maintained by the same group of engineers and client software is their specialty.
I would recommend using the client (whether it be Python, Node, Java, etc.) over the HTTP API because it will likely be easier for you and performance will be better since the clients use protocol buffers and/or TTB vs HTTP.
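To give a feel for the client API, here is a rough sketch with the Riak TS Java client; the host, the PrinterJobs table, and its columns are made up for illustration, and the Python and Node clients offer the equivalent operations:

```java
import java.util.List;

import com.basho.riak.client.api.RiakClient;
import com.basho.riak.client.api.commands.timeseries.Query;
import com.basho.riak.client.core.query.timeseries.QueryResult;
import com.basho.riak.client.core.query.timeseries.Row;

public class JobsPerHour {
    public static void main(String[] args) throws Exception {
        // 8087 is the default protocol buffers port; the host is a placeholder.
        RiakClient client = RiakClient.newClient(8087, "riak.example.com");
        try {
            // Hypothetical PrinterJobs table; the time bounds cover one hour (ms epoch).
            String sql = "SELECT * FROM PrinterJobs " +
                         "WHERE printer_id = 'press-01' " +
                         "AND time > 1451606400000 AND time < 1451610000000";
            QueryResult result = client.execute(new Query.Builder(sql).build());

            // Aggregate client-side (or push it down where TS aggregate functions exist).
            List<Row> rows = result.getRowsCopy();
            System.out.println("jobs in that hour: " + rows.size());
        } finally {
            client.shutdown();
        }
    }
}
```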
Other databases you should try? You mention TSDB in the title of this question. My experience is that TSDB is much harder to get up and running with. InfluxDB is probably the most popular time series specific database out there right now. I don't have personal experience with it but I am guessing by its popularity that it is pretty good.
Your use case sounds pretty interesting (I used to work in the printing industry) so if you have any other questions I can help with please let me know.

Modelling many to many relationships in Meteor

Hi, I am building a small app to get used to Meteor (and Mongo). Something that is bothering me is the data modelling aspect, specifically: what is the best way to model a many-to-many relationship? I have read in the Mongo docs that a doc should not be embedded in another doc if you expect it to grow while the original doc remains fairly static.
In my test app, students can register for courses. So from the Mongo perspective it makes sense to include the students as an embedded doc in the course, as each course will have a limited number of students, as opposed to the other way round, where a student could theoretically join an unlimited number of courses over time.
Then there is the Meteor aspect. I have read that a lot of Meteor's features are aimed at separate collections: DDP works at the document level, so any change in the student array would cause the entire course doc to be resent to every browser, and things like the {{#each}} Spacebars helper work with Mongo cursors but not with arrays.
Has anyone dealt with a similar situation? Could you explain what approach you took and any drawbacks you had to deal with? Thanks.
See this article: https://www.discovermeteor.com/blog/reactive-joins-in-meteor/
And test how good your possible solutions are with this https://kadira.io/
Better use the guide:
http://guide.meteor.com/data-loading.html#publishing-relations
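The publishing-relations approach in the guide assumes the normalized model: separate students and courses collections plus an enrollments join collection holding (studentId, courseId) pairs. Just to illustrate the shape of that model (Meteor's own collections API is JavaScript; the collection and field names here are placeholders), a sketch using the MongoDB Java driver:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class ManyToManySketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("school");

            MongoCollection<Document> students = db.getCollection("students");
            MongoCollection<Document> courses = db.getCollection("courses");
            MongoCollection<Document> enrollments = db.getCollection("enrollments");

            students.insertOne(new Document("_id", "s1").append("name", "Ada"));
            courses.insertOne(new Document("_id", "c1").append("title", "Databases"));

            // One small document per registration instead of a growing array inside
            // the course doc, so DDP only has to sync the new document.
            enrollments.insertOne(new Document("studentId", "s1").append("courseId", "c1"));

            // "Which students are in course c1?" - query the join collection.
            for (Document e : enrollments.find(Filters.eq("courseId", "c1"))) {
                Document student = students.find(Filters.eq("_id", e.getString("studentId"))).first();
                System.out.println(student.getString("name"));
            }
        }
    }
}
```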
The Meteor team tames (or hides!) the JavaScript monster to an amazing extent. By using their conventions, you get a ton of much-used functionality "for free", out of the box: things that are usually re-invented over and over again, such as accounts, OAuth, live data across clients, and a standard live-data protocol.
But very soon... you need features not in the box. Wow... look at all the choices. Wait a minute, this is the same monster you were fighting before Meteor!
So use the official Meteor Guide. They recommend the wisest ways to extend functionality of your app when you make these choices.
Since they know how they have "hidden the monster", they know how to keep avoiding the monster when you extend.

Getting a large number of entities from the datastore

Following up on this question, I am able to store a large number (>50k) of entities in the datastore. Now I want to access all of them in my application. I have to perform mathematical operations on them, and it always times out. One way is to use the Task Queue again, but that would be an asynchronous job. I need a way to access these 50k+ entities in my application and process them without timing out.
Part of the accepted answer to your original question may still apply, for example a manually scaled instance with a 24h deadline, or a VM instance. For a price, of course.
Some speedup may be achieved by using memcache.
Side note: depending on the size of your entities you may need to keep an eye on the instance memory usage as well.
Another possibility would be to switch to a faster instance class (with more memory as well, but also with extra costs).
But all such improvements might still not be enough. The best approach would still be to give your entity data processing algorithm some deeper thought, to make it scalable.
I'm having a hard time imagining a computation so monolithic that it can't be broken into smaller pieces that don't each need all the data at once. I'm almost certain there has to be some way of using partial computations, maybe with storing some partial results, so that you can split the problem and handle it in smaller pieces across multiple requests.
As an extreme (academic) example think about CPUs doing pretty much any super-complex computation fundamentally with just sequences of simple, short operations on a small set of registers - it's all about how to orchestrate them.
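To make the splitting concrete, a common pattern is to walk the query with a cursor, process one batch per request (or task), and persist the partial result together with the cursor so the next run resumes where the last one stopped. A rough sketch with the low-level Java Datastore API; the "Measurement" kind, its "value" property, and the way the partial sum is stored are assumptions about your data:

```java
import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.QueryResultList;

public class BatchProcessor {

    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    /** Processes one batch and returns the cursor to resume from, or null when done. */
    public String processBatch(String webSafeCursor, double runningSum) {
        FetchOptions options = FetchOptions.Builder.withLimit(500);
        if (webSafeCursor != null) {
            options.startCursor(Cursor.fromWebSafeString(webSafeCursor));
        }

        PreparedQuery query = datastore.prepare(new Query("Measurement"));
        QueryResultList<Entity> batch = query.asQueryResultList(options);

        for (Entity entity : batch) {
            runningSum += (Double) entity.getProperty("value");  // the math on one entity
        }

        // Persist the partial result so the next request/task can continue from it.
        Entity partial = new Entity("PartialResult", "sum");
        partial.setProperty("runningSum", runningSum);
        datastore.put(partial);

        return batch.isEmpty() ? null : batch.getCursor().toWebSafeString();
    }
}
```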
Here's a nice article describing a drastic reduction of the overall duration of a computation (no clue if it's anything like yours) by using a nice approach (also interesting because it's using the GAE Pipeline API).
If you post your code you might get some more specific advice.

Neo4j add algorithm to library

Is there a way to add a graph processing algorithm to the Neo4j database engine without rewriting the algorithms module? I can do it in an external process (e.g. a Ruby script or Java program that retrieves the nodes and processes them), but I'd like it to run inside the DB engine for encapsulation and availability.
Namely, can I implement a search algorithm that adds tags to nodes as I traverse them?
It depends on what you mean. Neo4j provides unmanaged extensions, which you can use to do what I think you're trying to do.
There is also the idea of server plugins, which you can explore. Which one is right for you depends on what you want to accomplish with that "external process".
Warning - the use of those APIs may be a less elegant way to go about what you want to do, as the APIs may change. You might be better off asking a different question that outlines the specifics of what you're trying to accomplish, and then asking for guidance on the best design way to do that. But the answer to your question I think is that fundamentally adding things to the database engine without rewriting the algorithms module is possible, but that doesn't necessarily mean that it's the best way to accomplish whatever you want to do.
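As an illustration of the unmanaged-extension route (with the caveat above that it may not be the best design), here is a minimal sketch against the Neo4j 3.x API: a JAX-RS resource that runs a breadth-first traversal and tags every node it passes through. The relationship type, property name, and URL paths are all made up:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.traversal.Evaluators;

@Path("/tagging")
public class TaggingSearchResource {

    private final GraphDatabaseService db;

    public TaggingSearchResource(@Context GraphDatabaseService db) {
        this.db = db;
    }

    @GET
    @Path("/from/{nodeId}")
    public Response tagReachableNodes(@PathParam("nodeId") long nodeId) {
        int tagged = 0;
        try (Transaction tx = db.beginTx()) {
            Node start = db.getNodeById(nodeId);
            // Breadth-first walk up to depth 3 over a made-up CONNECTED_TO type,
            // tagging each node as it is visited.
            for (Node node : db.traversalDescription()
                    .breadthFirst()
                    .relationships(RelationshipType.withName("CONNECTED_TO"), Direction.OUTGOING)
                    .evaluator(Evaluators.toDepth(3))
                    .traverse(start)
                    .nodes()) {
                node.setProperty("visitedByMyAlgorithm", true);
                tagged++;
            }
            tx.success();
        }
        return Response.ok("tagged " + tagged + " nodes").build();
    }
}
```

The class is then registered in the server config (dbms.unmanaged_extension_classes in 3.x) so Neo4j mounts it under the given path.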

Resources