OpenTSDB indexing on keys

As I've worked in my personal lab instance of OpenTSDB, I've started to wonder if it is possible to get it to index on tags as well as metric names. My understanding (correction is welcome...) is that OpenTSDB indexes only on metric names. So, suppose I have something like the following, borrowed from the docs:
tsd.hbase.rpcs{type=*,host=tsd1}
My understanding is that tsd.hbase.rpcs is indexed for searching, but that the keys (type=, host=, etc.) are not. Is that correct? If so, is there a way to have them be indexed, or some reasonable approximation of it? Thanks.

Yes, you are correct. According to the documentation, OpenTSDB creates row keys in the 'tsdb' HBase table of the form
[salt]<metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
When you run a query with a specific tagk and tagv, OpenTSDB can construct the row key and look it up directly. If the query covers a range of tagk/tagv values, it scans all the matching rows and either aggregates them or returns multiple time series, depending on your query.
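As an illustration only, here is a TypeScript-style sketch of how such a row key is assembled (the salt and the fixed UID widths are simplified away, and the uid lookups against OpenTSDB's separate UID table are assumed to have happened already):

// Illustrative sketch: assemble a simplified 'tsdb' row key.
// metricUid and the tag UIDs would come from OpenTSDB's UID table;
// real keys also carry an optional salt prefix and fixed-width UIDs.
function rowKey(metricUid: Buffer, baseTime: number, tags: Array<[Buffer, Buffer]>): Buffer {
  const ts = Buffer.alloc(4);
  ts.writeUInt32BE(baseTime, 0); // timestamp rounded down to the row's base hour
  // tag pairs are sorted by tag-key UID before being appended
  const tagBytes = tags.map(([k, v]) => Buffer.concat([k, v]));
  return Buffer.concat([metricUid, ts, ...tagBytes]);
}

A fully specified query can therefore jump straight to a key prefix, while wildcarded tags force a wider scan.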
If you are interested in asking questions about tagks, you should use the OpenTSDB search/lookup API; note that this still requires a metric name.
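For example, a lookup for every time series of that metric, regardless of host, might look like this (endpoint shape per the search/lookup docs; host, port, and values are illustrative):

http://otsdb:4242/api/search/lookup?m=tsd.hbase.rpcs{host=*}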
If you want to formulate your question around tagks only, you could consider forwarding your data to Bosun for indexing and using its API
/api/metric/{tagk}/{tagv}
Returns the metrics that are available for the specified tagk/tagv pair. For example, you can see what metrics are available for host=server01
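For example, to list the metrics reported by a given host (hypothetical value):

/api/metric/host/server01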

Related

Read all data from a table of 10k rows in a single request

Could there be problems with reading all the data from a 10k-row table in a single request?
It would be a read only request.
I would like to do it because I want to perform some queries on the array, and from the documentation I can’t find a way to do it directly with Pact.
No, there shouldn't be. Read-only queries are "free" at the moment.
You can do it in two ways:
1. Do a select query whose condition always evaluates to true, as sketched below.
2. Get all the keys (i.e. the unique ids in the table) via (keys your-table-name), and then have a separate method that returns the data for a list of ids.
But do consider using select statements to filter your data during the query, as this can be easier than doing the filtering yourself.
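A minimal Pact sketch of the first approach, assuming a table named your-table:

(select your-table (constantly true)) ; read every row
(keys your-table)                     ; or fetch just the row keys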
Pact will check arrays like any other property, but you should ask yourself: do you need to test all 10k records, or just a representative sample of them? (The answer should in most cases be the latter.)
You should also consider:
Do you need an exact match? (If so, the consumer and provider must have exactly the same data, which is not recommended.)
Can you use matchers to check the shape of the items in the array? (See the sketch below.)
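For the second point, a minimal sketch with pact-js V3 matchers (field names and example values are made up):

import { MatchersV3 } from "@pact-foundation/pact";
const { eachLike, integer, string } = MatchersV3;

// Match the shape of each record instead of 10k concrete values
const expectedBody = eachLike({
  id: integer(1),
  name: string("example"),
});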

How to model data in DynamoDB if your access pattern includes many WHERE conditions

I am a bit confused if this is possible in DynamoDB.
I will give a SQL example and explain how the query could be optimized, and then I will try to explain why I am confused about how to model and access the same data in DynamoDB.
This is not company code, just an example I made up based on pcpartpicker's filters.
SELECT * FROM BUILDS
WHERE CPU = 'Intel'
  AND OVERCLOCKED = 'true'
  AND Price < 3000
  AND GPU = 'GeForce RTX 3060'
  AND ...
From my understanding, SQL will first scan the BUILDS table and then filter out all the builds where the CPU is Intel. From this subset, it then applies the next WHERE condition to filter on OVERCLOCKED = 'true', and so on. Basically, each additional WHERE clause has a smaller number of rows left to filter.
One thing we can do to speed up this query is to create an index on these columns (e.g. CREATE INDEX idx_builds_cpu ON BUILDS (CPU)). The main performance gain comes from avoiding the initial scan of the whole table for the first clause the database evaluates. In the example above, instead of scanning the whole database to find builds that use Intel, it can retrieve them quickly because the column is indexed.
How would you model this data in DynamoDB? I know you can create a bunch of secondary indexes, but instead of the engine evaluating each WHERE clause and passing the result along to the next filter, it seems like you would have to do all of this yourself. For example, we would need to use our secondary indexes to find all the builds that use Intel, are overclocked, cost less than 3000, and use a specific GPU, and then compute the intersection ourselves. Is there a better way to map out this access pattern? I am having a hard time figuring out whether this is even possible.
EDIT:
I know I could also just use a normal filter, but that seems expensive, since it basically brute-force searches through the table, much like the SQL solution without indexing.
To see what I mean, here is the pcpartpicker page in question: https://pcpartpicker.com/builds/
People select multiple filters at once, which makes designing for access patterns even harder. (The sketch below shows why a plain filter doesn't help with cost.)
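To make that concrete, here is a sketch of such a filter with the AWS SDK for JavaScript v3 (table and attribute names are made up). DynamoDB applies FilterExpression after the items are read, so you pay read capacity for every item scanned, not just for the items returned:

import { DynamoDBClient, ScanCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});
// Scans the whole table; the filter only trims the response.
const result = await client.send(new ScanCommand({
  TableName: "Builds",
  FilterExpression: "CPU = :cpu AND Price < :price",
  ExpressionAttributeValues: {
    ":cpu": { S: "Intel" },
    ":price": { N: "3000" },
  },
}));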
I'd highly recommend going through the various AWS presentations on YouTube...
In particular, see the 'The Iron Triangle of Purpose - PIE Theorem' chapter of the AWS re:Invent 2018 presentation 'Building with AWS Databases: Match Your Workload to the Right Database (DAT301)'.
DynamoDB provides IE - Infinite Scale and Efficiency.
But you need P - Pattern Flexibility.
You'll need to decide if you need PI or PE.

Reduce the number of queries in OpenTSDB

I use OpenTSDB to store my time series data. For each incoming data point, I need to get the 20 data points that came before it. But I have a large number of metrics, and I cannot call the OpenTSDB query API too many times. What can I do to reduce the number of queries to OpenTSDB?
As far as I know, you can't aggregate different metrics into one single result, but I would suggest two solutions:
You can put multiple metric queries in one call. If you use the HTTP API endpoint, you can do something like this:
http://otsdb:4242/api/query?start=15m-ago&m=avg:metric1{tag1=a}&m=avg:metric2{tag2=b}
You get the results for all sub-queries over the same start/end dates and times. But don't forget that with multiple metrics the query will take longer...
Redefine your time series. I don't know any details about your data, but if you're going to store and use data, you should also think about usage: What queries am I going to run? How often? What would be the most common access pattern? And so on...
That's also what's advised in the OpenTSDB documentation [1]:
Cardinality also affects query speed a great deal, so consider the queries you will be performing frequently and optimize your naming schema for those.
So I would suggest using tags to overcome this issue of multiple metrics. As I mentioned, I don't know your schema, but OpenTSDB is much more powerful with tags: there are many examples and filtering options as well. For instance (see below), folding per-source metrics into one metric with a tag lets a single query cover them all.
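Illustrative names only: instead of one metric per source, such as
requests.web01, requests.web02, ...
a single metric with a host tag can be fetched (and grouped) in one query:
http://otsdb:4242/api/query?start=15m-ago&m=avg:requests{host=*}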
Edit 1:
From OpenTSDB 2.3 version there is also expression api: http://opentsdb.net/docs/build/html/api_http/query/exp.html
It should be able to handle multiple metric queries together (though I've never used it myself).

Firestore order by two fields

In Firestore, I'm wondering whether there is a way to get all documents between two heuristic1 values but order the results by a heuristic2.
I ask because, at the bottom of both of these pages,
https://firebase.google.com/docs/firestore/query-data/order-limit-data
https://firebase.google.com/docs/firestore/query-data/query-cursors
the documentation seems slightly contradictory.
What I want to be able to do is
Ref.startAt(some h1 value).endAt(some second h1 value).orderBy(h2).
I know I'd probably have to index by h1 but even then I'm not sure if there is a way to do this.
Update:
I didn't test this well enough to notice that it doesn't produce the desired ordering. The OP asked the question again and got an answer from a Firebase team member:
Because Cloud Firestore doesn't support ordering by a different field than the supplied inequality, you won't be able to sort by name directly from the query. Instead you'd need to sort client-side once you've fetched the data.
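A sketch of that client-side sort, reusing the citiesRef collection from the example below (v8-style Firebase web SDK, matching the query syntax used in this answer):

// Range-filter on population server-side, then sort by name client-side
const snapshot = await citiesRef
  .orderBy("population")
  .startAt(1000)
  .endAt(2000)
  .get();
const cities = snapshot.docs
  .map((doc) => doc.data())
  .sort((a, b) => a.name.localeCompare(b.name));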
The API supports the capability you want, although I don't see an example in the documentation that shows it.
The ordering of the query terms is important. Suppose you have a collection of cities and the fields of interest are population (h1) and name (h2). To get the cities with population in range 1000 to 2000, ordered by name, the query would be:
citiesRef.orderBy("population").orderBy("name").startAt(1000).endAt(2000)
This query requires a composite index, which you can create manually in the console. Or as the documentation there indicates, the system will help you:
Instead of defining a composite index manually, run your query in your app code to get a link for generating the required index.

Is a scan query always expensive in DynamoDB or should you use a range key

I've been playing around with Amazon DynamoDB and looking through their examples, but I think I'm still slightly confused by them. I've created the example data on a local DynamoDB instance to get used to querying data, etc. The sample data sets up three tables: 'Forum' -> 'Thread' -> 'Reply'.
Now if I'm in a specific forum, the thread table has a ForumName key I can query against to return relevant threads, but would the very top level (displaying the forums) always have to be a scan operation?
From what I can gather, the only way to "select *" in DynamoDB is to use a Scan. I assume that in this instance, where Forum is very high level and might have a relatively small number of rows, it wouldn't be that expensive. Or are you actually better off creating a hash and range key and using that to query this table? I'm not sure what the range key would be in this case; maybe just a number, with the query specifying that the value has to be > 0? Or perhaps the date it was created, with the query always using a constant date in the past?
I did try a sample query on the 'Forum' table example data using a ComparisonOperator of 'GE' (greater than or equal) with an attribute value list of 'S'=>'a', but this fails, stating that any condition on the hash key must be of type EQ. That implies I couldn't do the above, as I would always need to know my 'Name' values upfront.
Maybe I'm still struggling because I come from an RDBMS background, especially since there are so many forum examples out there. Thanks.
I think using Scan to get all the forums is fine. In this case it is efficient, because it will not return anything you don't need: all of the work the Scan does is necessary. Also, since the Scan operation is so simple, it is easier to implement and more likely to be efficient.
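A minimal sketch of that top-level listing with the AWS SDK for JavaScript v3 (the 'Forum' table name comes from the sample data):

import { DynamoDBClient, ScanCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});
// A full Scan is fine here: the Forum table is small and every item is needed.
const forums = await client.send(new ScanCommand({ TableName: "Forum" }));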
