Multi-series query in influxdb - graphite

I am quite new to InfluxDB. I am submitting collectd metrics to it using the Graphite writer.
For disk statistics, I'd like to do a percentage transformation, and show that on a grafana dashboard.
disk.used / (disk.used + disk.free)
I was fiddling around with something like (which is obviously not working):
Select first(disk.used / (disk.used + disk.free)) from "server.test_server.disk.used" as used left join "server.test_server.disk.free" as free where ....
What is the query that I can use? Is it possible to do such transformations with influxdb? This is soo easy with graphite :(
Update: Using grafana 2.1 and influxdb 0.9

Your schema is not optimal for InfluxDB 0.9. Rather than a few measurements with many child series each, you have many measurements with only a few series each.
There are no JOINs in InfluxDB 0.9, but all series under the same measurement are automatically merged unless filtered by tags. See https://docs.influxdata.com/influxdb/v0.9/query_language/data_exploration/ and https://docs.influxdata.com/influxdb/v0.9/concepts/08_vs_09/#joins for more.
To make the best use of InfluxDB 0.9 you should encode your metadata "test_server" in tags, and your metrics as individual fields. See also Obtaining a total of two series of data from InfluxDB in Grafana
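For illustration, a rough sketch of what that tag/field schema and the percentage query could look like with the python influxdb client (the database, measurement, tag and field names are assumptions, not from the original collectd setup, and field arithmetic in SELECT may require a newer InfluxDB version than 0.9):

# Sketch: write "disk" points with the server encoded as a tag and the
# metrics as fields, then query the percentage. Assumes the python
# "influxdb" client and a database named "collectd"; all names are illustrative.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="collectd")

client.write_points([{
    "measurement": "disk",
    "tags": {"host": "test_server"},
    "fields": {"used": 120.0, "free": 380.0},
}])

# With both values as fields of the same measurement, the ratio can be
# computed in the query itself (field arithmetic; newer InfluxQL versions):
result = client.query(
    "SELECT used / (used + free) FROM disk WHERE host = 'test_server'"
)
print(list(result.get_points()))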

Yes it's possible, but can you describe the whole schema and the resulting data that you would like to have? Do a:
select * from "server.test_server.disk"
and give us the output. Or do you have it in two separate tables named server.test_server.disk.free and server.test_server.disk.used? In that case, the preferred way would be to use a single table, server.test_server.disk, which would contain both metrics ("used" and "free"). Then you would be able to do something like this:
select first(used / (used + free)) from "server.test_server.disk"

Related

How to model data in dynamodb if your access pattern includes many WHERE conditions

I am a bit confused if this is possible in DynamoDB.
I will give an example of SQL and explain how the query could be optimized, and then I will try to explain why I am confused about how to model this and how to access the same data in DynamoDB.
This is not company code. Just an example I made up based on pcpartpicker filter.
SELECT * FROM BUILDS
WHERE CPU = 'Intel' AND OVERCLOCKED = 'true'
AND Price < 3000
AND GPU='GeForce RTX 3060'
AND ...
From my understanding, SQL will first do a scan on the BUILDS table and filter it down to the builds where the CPU is Intel. From this subset, it then applies the next WHERE condition to filter OVERCLOCKED = 'true', and so on and so forth. Basically, each additional WHERE condition has a smaller number of rows to filter.
One thing we can do to speed up this query is to create an index on these columns. The main performance gain comes from avoiding the initial scan of the whole table for the first clause that the database looks at. So in the example above, instead of scanning the whole DB to find builds that use Intel, it can retrieve them quickly since the column is indexed.
How would you model this data in DynamoDB? I know you can create a bunch of secondary indexes, but instead of letting the engine apply one WHERE clause and pass the result along for the next round of filtering, it seems like you would have to do all of this yourself. For example, we would need to use our secondary indexes to find all the builds that use Intel, all that are overclocked, all under 3000, and all using a specific GPU, and then compute the intersection ourselves. Is there a better way to map out this access pattern? I am having a hard time figuring out if this is even possible.
EDIT:
I know I could also just use a normal filter, but that seems like it would be pretty expensive, since it basically brute-force searches through the table, similar to the SQL solution without indexing.
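For reference, a minimal sketch of what that filter-based approach might look like with boto3 (the table name and attribute names are made up for illustration):

# Sketch of the "normal filter" approach with boto3: a Scan reads the whole
# table and DynamoDB applies the filter afterwards, so you pay for every item
# read, not just the matches. Table and attribute names are assumptions.
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("Builds")

response = table.scan(
    FilterExpression=(
        Attr("cpu").eq("Intel")
        & Attr("overclocked").eq(True)
        & Attr("price").lt(3000)
        & Attr("gpu").eq("GeForce RTX 3060")
    )
)
builds = response["Items"]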
To see what I mean, here is the link to the pcpartpicker page: https://pcpartpicker.com/builds/
People basically select multiple filters, which makes designing for access patterns even harder.
I'd highly recommend going through the various AWS presentations on YouTube...
In particular here's a link to The Iron Triangle of Purpose - PIE Theorem chapter of the AWS re:Invent 2018: Building with AWS Databases: Match Your Workload to the Right Database (DAT301) presentation.
DynamoDB provides IE - Infinite Scale and Efficiency.
But you need P - Pattern Flexibility.
You'll need to decide if you need PI or PE.

Which DB to use for comparing courses of data by days?

I'm currently thinking about a little "BigData" project where I want to record some utilization values every 10 minutes and write them to a DB over several months or years.
I then want to analyze the data e.g. in these ways:
Which time of the day is best (in terms of a low utilization)?
What are the differences in utilization between normal weekdays and days on the weekend?
At what time does the higher utilization begin on a normal Monday?
For this I obviously need to be able to build averaged graphs, e.g. over all Mondays recorded so far.
For the first "proof of concept" I set up InfluxDB and Grafana, which works quite well for seeing the data being written to the DB, but the more I research on the internet the more I see that InfluxDB is not made for what I want to do (or cannot do it yet).
So which Database would be best to record and analyze data like that? Or is it more like a question about which tool to use to analyze the data? Which tool could that be?
InfluxDB's query language is not flexible enough for your kind of questions.
The SQL databases supported by Grafana (MySQL, Postgres, TimescaleDB, ClickHouse) seem to fit better. The choice depends on your preferences and the amount of data. For smaller datasets plain MySQL or Postgres may be enough. For higher loads consider TimescaleDB. For billions of datapoints ClickHouse is probably the better fit.
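To illustrate the SQL route, here is a minimal sketch of the weekday-vs-weekend question against Postgres/TimescaleDB via psycopg2 (table, column and connection details are assumptions):

# Sketch: average utilization per hour, split into weekday vs. weekend,
# queried from Python with psycopg2. Table/column names are made up.
import psycopg2

QUERY = """
SELECT
    CASE WHEN EXTRACT(ISODOW FROM ts) >= 6 THEN 'weekend' ELSE 'weekday' END AS day_type,
    EXTRACT(HOUR FROM ts) AS hour_of_day,
    AVG(utilization) AS avg_utilization
FROM measurements
GROUP BY day_type, hour_of_day
ORDER BY day_type, hour_of_day;
"""

with psycopg2.connect("dbname=metrics user=grafana") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for day_type, hour, avg_util in cur.fetchall():
            print(day_type, int(hour), float(avg_util))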
If you want a lightweight but scalable NoSQL timeseries solution have a look at VictoriaMetrics.

Reduce the number of queries in OpenTSDB

I use OpenTSDB to store my time series data. For each incoming data point, I need to fetch the 20 data points that came before it. But I have a large number of metrics, so I cannot call the OpenTSDB query API that many times. What can I do to reduce the number of queries to OpenTSDB?
As far as I know you can't aggregate different metrics into one single result. But I would suggest two solutions:
You can put multiple metric queries in one call. If you use the HTTP API endpoint you can do something like this:
http://otsdb:4242/api/query?start=15m-ago&m=avg:metric1{tag1=a}&m=avg:metric2{tag2=b}
You get the results for all queries with the same start (and end) dates/times. But don't forget that with multiple metrics it will take longer...
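For example, a small sketch of that multi-metric call with Python's requests library (host, metrics and tags taken from the example URL above, otherwise made up):

# Sketch: several metric sub-queries in a single OpenTSDB HTTP call by
# repeating the "m" parameter.
import requests

params = [
    ("start", "15m-ago"),
    ("m", "avg:metric1{tag1=a}"),
    ("m", "avg:metric2{tag2=b}"),
]

resp = requests.get("http://otsdb:4242/api/query", params=params)
resp.raise_for_status()

# One result object per sub-query, all sharing the same time range.
for series in resp.json():
    print(series["metric"], len(series["dps"]), "datapoints")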
Redefine your time series. I don't know any details about your data, but if you're going to store and use data you should also think about usage - what queries am I going to run? How often? What would be the most common access to the data? And so on...
That's also what's advised in the OpenTSDB documentation [1]:
Cardinality also affects query speed a great deal, so consider the queries you will be performing frequently and optimize your naming schema for those.
So, I would suggest using tags to overcome this issue of multiple metrics. As I mentioned, I don't know your schema, but OpenTSDB is much more powerful with tags - there are many examples and filtering options as well.
Edit 1:
From OpenTSDB 2.3 version there is also expression api: http://opentsdb.net/docs/build/html/api_http/query/exp.html
You should be able to handle multiple metric queries together (but I've never used that for any query).

custom querying in graphite

We need to collect time series information on multiple servers and business processes and are considering using Graphite. It seems good if we want to display the raw data. But what if we want to do BI on this data and run custom queries? Does Graphite allow that, or alternatively can I instruct Graphite to store the data in Postgres?
Graphite definitely allows you to query your data, both graphically and as csv or json. The queries in Graphite aren't done with a language like SQL. They're done with functions that apply to one metric at a time. Each metric is its own database, which is just a series of (time, value) pairs.
The most common thing you're likely to want is to summarize data over different time periods. Here's an example of what the url would look like for a graph where the data is summarized daily for a week:
http://graphite.example.com/render/?width=586&height=308&_salt=1355992522.674&target=summarize(stats_counts.mystat.subname%2C%20'1day')&from=-7days
If you wanted to get back csv or json instead of a graph, you would just add format=csv or format=json to the url. And if you're looking at the data through Graphite's web interface you'd just put in the following to view the same graph:
summarize(stats_counts.mystat.subname, '1day')
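For example, a small sketch of fetching that same summarized series as JSON with Python's requests library (host and metric path are the placeholders from above):

# Sketch: fetching the summarized series as JSON instead of a PNG.
import requests

resp = requests.get(
    "http://graphite.example.com/render",
    params={
        "target": "summarize(stats_counts.mystat.subname, '1day')",
        "from": "-7days",
        "format": "json",
    },
)
resp.raise_for_status()

# Graphite returns a list of series, each with a "target" name and a list
# of [value, timestamp] datapoints.
for series in resp.json():
    print(series["target"])
    for value, timestamp in series["datapoints"]:
        print(timestamp, value)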
Most of the querying of data you do will at first be in the graphite composer, which is just a web interface that lets you click on the metrics you want to add to the graph, and apply the various functions to them.
As for adding the data to Postgres, you're probably not going to want to do that to query it. The data isn't really structured in a way that's great for relational databases.

Storing multiple graphs in Neo4J

I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that, for any given date, generates a graph in memory. I could then run whatever stat I wanted against that graph. Much easier to understand and, in general, much more performant too -- except for the generating-the-graph part.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4j and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid-from and valid-to timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st" I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, adding temporal information to the edges; to get the structure of the graph for a given date you can then traverse only the edges that were valid on that date.
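A minimal sketch of what that could look like with the official neo4j Python driver and modern Cypher (the :Contact label, :KNOWS relationship type and the valid_from/valid_to/strength properties are assumptions; at the time of the original question you would have used the traversal API instead):

# Sketch of the time-stamped-edges approach: only edges valid at the
# reference date are traversed. Labels, types and properties are made up.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (a:Contact)-[r:KNOWS]->(b:Contact)
WHERE r.valid_from <= $as_of
  AND (r.valid_to IS NULL OR r.valid_to > $as_of)
  AND r.strength >= $min_strength
RETURN a.id AS contact, b.id AS other_contact, r.strength AS strength
"""

with driver.session() as session:
    rows = session.run(CYPHER, as_of="2011-05-01", min_strength=3)
    for row in rows:
        print(row["contact"], "->", row["other_contact"], row["strength"])

driver.close()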
Reference node has a different meaning in Neo4j.
Using category nodes per day (linking them together and also aggregating them into higher-level timespans) is a more "graphy" way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries.)
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes themselves also change (e.g. changing properties), you could either duplicate them (effectively creating different subgraphs) or attach a linked list of history nodes to each node, containing just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it in memory for the query, and closes it after you get your answer. You could also configure a proxy server and send two parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing and doing some analytics over data sets, it can definitely answer your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).
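For completeness, a rough sketch of that multi-database setup with the Neo4j Python driver (multiple user databases require Enterprise edition; database, label and credential values here are made up):

# Sketch: with Neo4j 4.x+ you can keep one graph per snapshot date and pick
# the database per session. Names and credentials are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Administrative command, run against the "system" database:
with driver.session(database="system") as session:
    session.run("CREATE DATABASE graph20110501 IF NOT EXISTS")

# Queries are then routed to that specific graph:
with driver.session(database="graph20110501") as session:
    result = session.run("MATCH (c:Contact) RETURN count(c) AS contacts")
    print(result.single()["contacts"])

driver.close()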
