Custom querying in Graphite

We need to collect time series information on multiple servers and business processes, and we are considering using Graphite. It seems like a good fit if we want to display the raw data. But what if we want to do BI on this data and run custom queries? Does Graphite allow that, or alternatively, can I instruct Graphite to store the data in Postgres?

Graphite definitely allows you to query your data, both graphically and as CSV or JSON. Queries in Graphite aren't done with a language like SQL. They're done with functions that apply to one metric at a time. Each metric is its own database, which is just a series of (time, value) pairs.
The most common thing you're likely to want is to summarize data over different time periods. Here's an example of the URL for a graph where the data is summarized daily over a week:
http://graphite.example.com/render/?width=586&height=308&_salt=1355992522.674&target=summarize(stats_counts.mystat.subname%2C%20'1day')&from=-7days
If you wanted to get back CSV or JSON instead of a graph, you would just add &format=csv or &format=json to the URL. And if you're looking at the data through Graphite's web interface, you'd just put the following in to view the same graph:
summarize(stats_counts.mystat.subname, '1day')
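For instance, here's a minimal sketch of pulling that same summarized series programmatically through the render API using Python's requests library; the host and metric names are just the ones from the example URL above:

```python
import requests

# Hypothetical Graphite host and metric, taken from the example URL above.
GRAPHITE_URL = "http://graphite.example.com/render/"

params = {
    "target": "summarize(stats_counts.mystat.subname, '1day')",
    "from": "-7days",
    "format": "json",  # or "csv" for comma-separated rows
}

resp = requests.get(GRAPHITE_URL, params=params)
resp.raise_for_status()

# The JSON output is a list of series, each with a "target" name and
# a list of [value, timestamp] datapoints.
for series in resp.json():
    print(series["target"])
    for value, timestamp in series["datapoints"]:
        print(timestamp, value)
```

Swapping format to "csv" returns the same data as comma-separated rows you could load into any other tool.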
Most of the querying you do will, at first, be in the Graphite Composer, which is just a web interface that lets you click on the metrics you want to add to a graph and apply the various functions to them.
As for adding the data to Postgres, you probably don't want to do that just to query it. The data isn't really structured in a way that suits a relational database.

Related

ADX Data Pagination for use in client API

I am exploring using ADX as a timeseries data store for sensor metrics. Our current solution is storing data in MSSQL and I'm testing ADX as an alternative. I was able to set up data ingestion and I can perform basic queries, and with the added aggregation functions, computing insights and statistics seems to be much faster.
As part of the solution, we have an API data access layer used by clients and our web portal to query data for display and analysis. I am currently transforming the MSSQL queries to their KQL versions and I'm hitting a stumbling block on data pagination.
We have a function to query historical data using a combination of:
a start/end date,
a device identifier,
and some paging options:
    records per page,
    current page,
    column sorting / additional filtering.
Currently this is handled in a SQL stored procedure on the back end: it gets the total number of records and pages (which is set as an output on the API so that the front end can use it in the table view), then returns a record set based on the input parameters and pagination details - quite straightforward.
Any suggestions on how to achieve effective pagination using ADX/KQL?
I found a section in the docs on pagination over stored query results, but since the queries are dynamic and based on user input, that does not sound like a viable option.
When you paginate (for example, viewing results 21-30) you need to consider whether you are taking a snapshot of the result and paging through it, or viewing live data. If you expect newly arriving rows not to affect your pagination, then stored query results are that snapshot: once you generate it, you can select specific rows from it based on your page calculation.
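As a rough sketch of that snapshot-then-page pattern, assuming the azure-kusto-data Python SDK and entirely made-up cluster, database, table, and column names (the KQL follows the stored-query-results pattern in the docs; verify the exact command options against your cluster version):

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical cluster and database names; any supported auth method works.
CLUSTER = "https://mycluster.westeurope.kusto.windows.net"
DATABASE = "Telemetry"

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
client = KustoClient(kcsb)

# 1) Take the snapshot: materialize the user's filtered query as a stored
#    query result, numbering the rows so pages can be sliced later.
#    (.set stored_query_result is a management command, hence execute_mgmt.)
snapshot = """
.set stored_query_result SensorPage <|
SensorReadings
| where DeviceId == 'dev-001'
| where Timestamp between (datetime(2023-01-01) .. datetime(2023-01-31))
| order by Timestamp asc
| extend RowNum = row_number()
"""
client.execute_mgmt(DATABASE, snapshot)

# 2) Page through the snapshot: rows 21-30 for page 3 with 10 rows per page.
page, page_size = 3, 10
first = (page - 1) * page_size + 1
last = page * page_size
page_query = f"""
stored_query_result("SensorPage")
| where RowNum between ({first} .. {last})
"""
rows = client.execute(DATABASE, page_query).primary_results[0]
for row in rows:
    print(row)
```

The total record count the front end needs should be obtainable from the same snapshot with a separate count query over stored_query_result("SensorPage").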

Reduce the number of queries in OpenTSDB

I use OpenTSDB to store my time series data. For each incoming data point, I need to fetch the 20 data points that precede it. But I have a large number of metrics, so I can't call the OpenTSDB query API that many times. How can I reduce the number of queries to OpenTSDB?
As far as I know you can't aggregate different metrics into one single result, but I would suggest two solutions:
You can put multiple metric queries in one call. If you use the HTTP API endpoint, you can do something like this:
http://otsdb:4242/api/query?start=15m-ago&m=avg:metric1{tag1=a}&m=avg:metric2{tag2=b}
You get the results for all queries with the same start (and end) dates/times. But with multiple metrics, don't forget that the call will take longer...
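For illustration, here's a small Python sketch of that batched call, mirroring the hypothetical host, metrics, and tags from the URL above:

```python
import requests

# Hypothetical OpenTSDB host; metric and tag names mirror the example URL above.
OTSDB = "http://otsdb:4242/api/query"

# Repeating the "m" parameter batches several metric queries into one call.
params = [
    ("start", "15m-ago"),
    ("m", "avg:metric1{tag1=a}"),
    ("m", "avg:metric2{tag2=b}"),
]
resp = requests.get(OTSDB, params=params)
resp.raise_for_status()

# The response is a JSON array with one entry per sub-query; "dps" holds
# the timestamp -> value pairs.
for result in resp.json():
    print(result["metric"], result["tags"], len(result["dps"]), "points")
```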
Redefine your time series. I don't know any details about your data, but if you're going to store and use data, you should also think about usage: What queries am I going to run? How often? What would be the most common access pattern? And so on...
That's also what's advised in the OpenTSDB documentation [1]:
Cardinality also affects query speed a great deal, so consider the queries you will be performing frequently and optimize your naming schema for those.
So I would suggest using tags to overcome this issue of multiple metrics. As I mentioned, I don't know your schema, but OpenTSDB is much more powerful with tags - there are many examples, and filtering options as well.
Edit 1:
Since OpenTSDB version 2.3 there is also an expression API: http://opentsdb.net/docs/build/html/api_http/query/exp.html
You should be able to handle multiple metric queries together with it (but I've never used it for any query).

OpenTSDB API query pagination

I am using OpenTSDB for my time series data.
I have a use-case on the front end in which a user can fetch data from OpenTSDB between specific dates:
http://localhost:5000/api/query?start=2014/06/04%2020:30&end=2014/09/18%2000:00&m=sum:cpu_system
My problem is that the returned data is too large - thousands of records if I fetch data for an interval of more than one day. The service call then takes a couple of minutes, which makes for a bad user experience on the front end.
I want to apply pagination on the service call so that it will take less time.
The /api/query documentation does not have any mention of pagination. The /api/search documentation does offer pagination, but does not have any mention of time ranges.
How can I query over a time range with pagination?
There is no native pagination support in queries, but you can always emulate it by splitting your time range into multiple queries so that, for example, you only ask for one day per query.
Another solution that may be feasible in some cases is to ask OpenTSDB to downsample the data. That way it returns far fewer data points, and your application has less data to download and process.
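Here's a rough Python sketch combining both ideas - splitting the question's date range into one query per day and asking OpenTSDB to downsample each day - with the endpoint and metric taken from the question and the downsample interval simply assumed:

```python
from datetime import datetime, timedelta
import requests

# Endpoint and metric from the question; downsample interval is an assumption.
OTSDB = "http://localhost:5000/api/query"

def fetch_day_by_day(metric, start, end, downsample="1h-avg"):
    """Yield one day's worth of (downsampled) data at a time."""
    day = start
    while day < end:
        next_day = min(day + timedelta(days=1), end)
        params = {
            # OpenTSDB accepts absolute times like yyyy/MM/dd-HH:mm.
            "start": day.strftime("%Y/%m/%d-%H:%M"),
            "end": next_day.strftime("%Y/%m/%d-%H:%M"),
            # The downsample spec sits between aggregator and metric,
            # e.g. sum:1h-avg:cpu_system -> one averaged point per hour.
            "m": f"sum:{downsample}:{metric}",
        }
        resp = requests.get(OTSDB, params=params)
        resp.raise_for_status()
        yield day, resp.json()
        day = next_day

# "Page 1" is the first day, "page 2" the second, and so on.
for day, results in fetch_day_by_day(
    "cpu_system",
    datetime(2014, 6, 4, 20, 30),
    datetime(2014, 9, 18, 0, 0),
):
    print(day.date(), sum(len(r["dps"]) for r in results), "points")
```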

Alternatives to DataTable

In my web application, I have a dynamic query that returns a huge amount of data into a DataTable, and this query is often re-run with different parameters, so the database is exhausted.
I want to fetch all records, with no parameters, into an object and run queries (maybe with LINQ) against that object, so the database will not be exhausted.
Which objects can be used instead of a DataTable?
This is one of my pet peeves - people who return all the data from the database.
There is absolutely no need for this unless you are doing reporting.
If you are doing reporting, then you need to increase your hardware capability so that the database can cope. This may also include tuning your database, rearranging tables, reindexing, regular rebuilding of indexes, updating statistics, archiving out old data, etc.
If you are NOT doing reporting, then start limiting how much data can be queried at any one time. Users DO NOT need to see massive quantities of data all at once. They need to see discrete amounts of data presented in a manageable and coherent way.
Another rule of thumb I like to observe is: let your database server do the work. It is made to manipulate lots of data, that is what it is good at, and it should have the power to do it. Pulling back loads of data to the client and then trying to manipulate it there is a foolish thing to do. If your client machines are more powerful than the database server, then you have bigger issues.
Never, ever do this (except for caching)!
You are trying to re-implement DB mechanisms, like:
persistent storage
index search and query strategy
replication
and so on.
Spend your time on DB optimization instead (optimal schema, indexes, queries, partitioning).

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table in my DB each time a user enters a specific zone (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users, since I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of calculating them "online" when the page that displays them is requested, you calculate them offline, either in a nightly batch process or incrementally as each log record is written.
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor on the order of the hits per session. Of course, it would increase processing costs when inserting log entries.
Another kind of aggregation is online analytical processing (OLAP), which aggregates along only some dimensions of your data and lets users aggregate the other dimensions interactively, in a browsing mode. This trades off performance, storage, and flexibility.
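As a minimal illustration of the incremental flavour of aggregation, here's a sketch using SQLite purely as a stand-in for whatever database the site actually uses, with made-up table names: each hit bumps a small per-day, per-zone rollup, and the statistics pages read only the rollup.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect("stats.db")  # illustrative; the real store is the site's DB
conn.executescript("""
CREATE TABLE IF NOT EXISTS hits (
    hit_day  TEXT,
    zone     TEXT,   -- Website / Category / Minisite / Product Image
    user_id  TEXT
);
CREATE TABLE IF NOT EXISTS daily_zone_counts (
    hit_day  TEXT,
    zone     TEXT,
    hits     INTEGER,
    PRIMARY KEY (hit_day, zone)
);
""")

def record_hit(zone: str, user_id: str) -> None:
    """Write the raw row and bump the aggregate in the same transaction."""
    today = date.today().isoformat()
    with conn:
        conn.execute("INSERT INTO hits VALUES (?, ?, ?)", (today, zone, user_id))
        # UPSERT needs SQLite >= 3.24.
        conn.execute("""
            INSERT INTO daily_zone_counts (hit_day, zone, hits)
            VALUES (?, ?, 1)
            ON CONFLICT(hit_day, zone) DO UPDATE SET hits = hits + 1
        """, (today, zone))

# The stats page now reads a handful of rollup rows instead of millions of hits.
record_hit("Product Image", "user-42")
for row in conn.execute("SELECT * FROM daily_zone_counts ORDER BY hit_day"):
    print(row)
```

A nightly batch job that rebuilds daily_zone_counts from hits with a GROUP BY would be the offline variant of the same idea.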
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database and/or denormalize the data so fewer joins are needed in the queries. Periodically export data from the transactional database to the reporting database. This will improve reporting response time, along with the aggregation ideas mentioned earlier.
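A toy sketch of that periodic export step, again with SQLite files standing in for the two servers and a hypothetical autoincrementing id column used as the export watermark:

```python
import sqlite3

# SQLite files stand in for the two databases; in practice these would be
# separate servers (or at least separate catalogs).
oltp = sqlite3.connect("transactions.db")    # takes all the INSERTs
reporting = sqlite3.connect("reporting.db")  # heavily indexed, read-mostly

oltp.execute("CREATE TABLE IF NOT EXISTS hits (id INTEGER PRIMARY KEY, hit_day TEXT, zone TEXT, user_id TEXT)")
reporting.executescript("""
CREATE TABLE IF NOT EXISTS hits_reporting (id INTEGER PRIMARY KEY, hit_day TEXT, zone TEXT, user_id TEXT);
CREATE INDEX IF NOT EXISTS idx_reporting_day_zone ON hits_reporting (hit_day, zone);
""")

def export_new_hits(last_exported_id: int) -> int:
    """Copy rows written since the last export run into the reporting DB."""
    rows = oltp.execute(
        "SELECT id, hit_day, zone, user_id FROM hits WHERE id > ? ORDER BY id",
        (last_exported_id,),
    ).fetchall()
    with reporting:
        reporting.executemany(
            "INSERT INTO hits_reporting (id, hit_day, zone, user_id) VALUES (?, ?, ?, ?)",
            rows,
        )
    # Return the new watermark so the next run picks up where this one left off.
    return rows[-1][0] if rows else last_exported_id

watermark = export_new_hits(0)
```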
Another trick to know is partitioning. Look up how it's done in the database of your choice, but basically the idea is that you tell your database to keep a table partitioned into several sub-tables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" - choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month - it depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range is not even considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it grows with your data; a partition only ever covers one day's worth).
This makes both the online queries (those issued when you hit your ASP page) and the aggregation queries you use to pre-calculate the necessary statistics much faster.
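To make range partitioning concrete, here's a rough sketch using PostgreSQL's declarative partitioning via psycopg2 (the question's stack suggests SQL Server, which achieves the same thing with partition functions and schemes, so treat this purely as an illustration of the idea; the connection string and all table names are made up):

```python
import psycopg2

# Illustrative connection string and table names.
conn = psycopg2.connect("dbname=stats user=web")
cur = conn.cursor()

# Parent table partitioned by the hit timestamp; each child holds one month.
cur.execute("""
CREATE TABLE IF NOT EXISTS hits (
    hit_time  timestamptz NOT NULL,
    zone      text NOT NULL,
    user_id   text
) PARTITION BY RANGE (hit_time);
""")
for month, next_month in [("2024-01-01", "2024-02-01"), ("2024-02-01", "2024-03-01")]:
    cur.execute(f"""
    CREATE TABLE IF NOT EXISTS hits_{month[:7].replace('-', '_')}
    PARTITION OF hits
    FOR VALUES FROM ('{month}') TO ('{next_month}');
    """)
conn.commit()

# A query constrained to one month only touches that month's partition.
cur.execute("""
SELECT zone, count(*) FROM hits
WHERE hit_time >= '2024-01-01' AND hit_time < '2024-02-01'
GROUP BY zone;
""")
print(cur.fetchall())
```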
