Let me explain my use case.
I want to store some point data along with time-series data in Graphite. I have a metric, say user.12345.lastVisitedTimeInMs, and I want to update it each time the user visits our site.
So this information is not time-series data but point data.
Is it possible in Graphite to update a metric's value instead of putting another value with a new timestamp?
From the design of Graphite and its fixed-size, pre-allocated Whisper database, I would assume that you can update previously submitted data points of a metric for a given timestamp, but I have only used this flexibility to submit historical data after the fact for given timestamps.
Yes, you can definitely update a value for an existing timestamp/measurement.
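For instance, resubmitting the same metric path and timestamp through Carbon's plaintext protocol simply overwrites the stored value. A minimal Python sketch, assuming a Carbon daemon listening on the default plaintext port 2003 (the host name is just a placeholder):

    import socket
    import time

    def send_metric(path, value, timestamp, host="graphite.example.com", port=2003):
        # Carbon's plaintext protocol: one "<path> <value> <timestamp>" line per data point
        line = f"{path} {value} {timestamp}\n"
        with socket.create_connection((host, port)) as sock:
            sock.sendall(line.encode("ascii"))

    ts = int(time.time())
    send_metric("user.12345.lastVisitedTimeInMs", 1700000000000, ts)
    # Re-sending the same path/timestamp overwrites the previous value,
    # as long as the timestamp still falls inside the archive's retention window.
    send_metric("user.12345.lastVisitedTimeInMs", 1700000123456, ts)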
I am adding a filter in Tableau to switch between timezones when looking at the report.
Currently, I have a date-time field in MT (Mountain Time) and I want to have a filter where you can go back and forth between MT and CT (Central Time).
What would be the best way to accomplish this?
Would it be better practice to add a new field to my data source for the central timezone or to handle the conversion logic in Tableau?
This can be achieved in Tableau in two steps:
Create a parameter with the strings of time zones that you want to display
Create a calculated field that is connected to this parameter and has a switch case to add or subtract time from your original timestamp.
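If it helps to see that switch logic spelled out, here is a rough Python sketch of what the calculated field would compute, assuming the parameter offers the strings "MT" and "CT" and the stored timestamps are in Mountain Time:

    from datetime import datetime, timedelta

    def display_timestamp(ts_mt: datetime, tz_choice: str) -> datetime:
        # Central Time is one hour ahead of Mountain Time
        if tz_choice == "CT":
            return ts_mt + timedelta(hours=1)
        return ts_mt  # default: show the stored Mountain Time value

    print(display_timestamp(datetime(2024, 1, 15, 9, 30), "CT"))  # 2024-01-15 10:30:00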
If switching between timezones is done frequently, then it is recommended to insert a converted column in your database itself; it depends on your data size and dashboard performance.
I am wondering if anyone knows a good way to store time series data of different time resolutions in DynamoDB.
For example, I have devices that send data to DynamoDB every 30 seconds. The individual readings are stored in a Table with the unique device ID as the Hash Key and a timestamp as the Range Key.
I want to aggregate this data over various time steps (30 min, 1 hr, 1 day, etc.) using a Lambda and store the aggregates in DynamoDB as well. I then want to be able to grab data at any resolution for any particular range of time: for instance, the 48 thirty-minute aggregates for the last 24 hours, or each daily aggregate for this month last year.
I am unsure whether each new resolution should have its own table (data_30min, data_1hr, etc.), or whether a better approach would be something like making a composite hash key by combining the resolution with the device ID and storing all aggregate data in a single table.
For instance, if the device ID is abc123, all 30-minute data could be stored with the hash key abc123_30m and the 1-hour data could be stored with the hash key abc123_1h, and each would still use a timestamp as the range key.
What are some pros and cons to each of these approaches and is there a solution I am not thinking of which would be useful in this situation?
Thanks in advance.
I'm not sure if you've seen this page from the tech docs regarding Best Practices for storing time series data in DynamoDB. It talks about splitting your data into time periods such that you only have one "hot" table where you're writing and many "cold" tables that you only read from.
Regarding the partition/sort key selection, you should probably use a coarse timestamp value as the partition key and the actual timestamp as the sort key. That said, if your periods are coarse enough, or each device only produces a relatively small amount of data, then your idea of using the device ID as the hash key could work as well.
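As a sketch of how the composite-key idea from the question might look with boto3 (table and attribute names here are just placeholders), querying the last 24 hours of 30-minute aggregates for one device:

    import time
    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("device_aggregates")  # hypothetical table name

    now = int(time.time())
    response = table.query(
        # hash key combines device ID and resolution, e.g. "abc123_30m";
        # the range key "ts" is the bucket start time in epoch seconds
        KeyConditionExpression=Key("device_id").eq("abc123_30m")
        & Key("ts").between(now - 24 * 3600, now)
    )
    for item in response["Items"]:
        print(item["ts"], item.get("avg"))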
Generating pre-aggregates and storing them in DynamoDB would certainly work, though you should definitely consider having separate tables for the different granularities you want to support. Beware of mutating data: as long as all your data arrives in order and you don't need to recompute old data, storing pre-aggregated time series is fine, but if data can mutate, or if you have to account for out-of-order/late-arriving data, then things get complicated.
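One way to keep the aggregation side manageable is to have the Lambda recompute a whole bucket and overwrite the item, so re-running it for late or corrected data just replaces the old aggregate. A sketch under the same assumed table and key names as above:

    from decimal import Decimal
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("device_aggregates")  # same hypothetical table as above

    def write_bucket(device_id, resolution, bucket_start, readings):
        # readings: all raw values that fall inside this bucket (assumed non-empty)
        table.put_item(Item={
            "device_id": f"{device_id}_{resolution}",  # e.g. "abc123_30m"
            "ts": bucket_start,                        # bucket start, epoch seconds
            "count": len(readings),
            # the DynamoDB resource API wants Decimal, not float
            "avg": Decimal(str(sum(readings) / len(readings))),
            "min": Decimal(str(min(readings))),
            "max": Decimal(str(max(readings))),
        })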
You may also consider a relational database for the "hot" data (i.e. the last 7 days, or whatever period makes sense) and then run a batch process to pre-aggregate and move the data into cold, read-only DynamoDB tables, with DAX, etc.
We need to collect time-series information on multiple servers and business processes and are considering using Graphite. It seems good if we want to display the raw data, but what if we want to do BI on this data and run custom queries? Does Graphite allow that, or alternatively, can I instruct Graphite to store data in Postgres?
Graphite definitely allows you to query your data, both graphically and as CSV or JSON. The queries in Graphite aren't done with a language like SQL; they're done with functions that apply to one metric at a time. Each metric is its own database, which is just a series of (time, value) pairs.
The most common thing you're likely to want is to summarize data over different time periods. Here's an example of what the URL would look like for a graph where the data is summarized daily for a week:
http://graphite.example.com/render/?width=586&height=308&_salt=1355992522.674&target=summarize(stats_counts.mystat.subname%2C%20'1day')&from=-7days
If you wanted to get back CSV or JSON instead of a graph, you would just add format=csv or format=json to the URL. And if you're looking at the data through Graphite's web interface, you'd just put the following in to view the same graph:
summarize(stats_counts.mystat.subname, '1day')
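And if you would rather pull the numbers into your own code than look at a graph, the render endpoint returns the same series as JSON (or CSV). A small Python sketch with the requests library, reusing the example host above:

    import requests

    response = requests.get(
        "http://graphite.example.com/render/",
        params={
            "target": "summarize(stats_counts.mystat.subname, '1day')",
            "from": "-7days",
            "format": "json",   # or "csv"
        },
    )
    for series in response.json():
        print(series["target"])
        for value, timestamp in series["datapoints"]:
            print(timestamp, value)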
Most of the querying of data you do will at first be in the graphite composer, which is just a web interface that lets you click on the metrics you want to add to the graph, and apply the various functions to them.
As for adding the data to Postgres, you're probably not going to want to do that to query it. The data isn't really structured in a way that's great for relational databases.
I have an ASP.NET/VB page that receives data and processes it via a stored procedure. The code had the width set to 2 for the year's varchar, so the value was chopped and only the first two digits got inserted into the DB.
Is this info possibly retrievable from a system/IIS log file or is it lost forever?
thanks!
That data is lost forever.
Do you have data that isn't corrupted? Are the records in the database sequential, or do they have automatically incrementing fields? Do you have timestamps on the records? Do the years correspond to the date when the record was inserted or updated? Depending on your answers, you may be able to reconstruct the data. In particular, timestamps and/or auto-increment fields may let you determine an ordering between records, and if the date field is related to this ordering, you may be able to infer the year from the data in other records. It's very unlikely that any log files would be of any use.
Only if the year was part of a querystring or URL...which is unlikely, at best. If your IIS admin happened to turn on logging of POST fields, then you may be able to retrieve it from there. Very few sites that I know of, though, ever log POST data.
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store a database entry in a statistics table each time a user enters a specific zone (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users, as I have stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data ahead of time, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page loads, you compute them offline, either via a nightly batch process or incrementally when the log record is written.
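For example, a nightly job could roll yesterday's raw rows up into a small pre-aggregated table that the report pages read instead of scanning the full statistics table. A toy Python sketch using SQLite, assuming a raw hits table with zone and visited_at columns (the table and column names are made up; the rollup idea is the point):

    import sqlite3

    conn = sqlite3.connect("stats.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS daily_zone_counts (
            day    TEXT,
            zone   TEXT,
            visits INTEGER,
            PRIMARY KEY (day, zone)
        )
    """)
    # Roll up yesterday's raw hits; rerunning the job just replaces the same rows.
    conn.execute("""
        INSERT OR REPLACE INTO daily_zone_counts (day, zone, visits)
        SELECT date(visited_at), zone, COUNT(*)
        FROM hits
        WHERE date(visited_at) = date('now', '-1 day')
        GROUP BY date(visited_at), zone
    """)
    conn.commit()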
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor on the order of the number of hits per session. Of course, it would increase processing costs when inserting log entries.
Another kind of aggregation is online analytical processing (OLAP), which aggregates along only some dimensions of your data and lets users aggregate the other dimensions interactively while browsing. This trades off performance, storage, and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This, along with the aggregation ideas mentioned earlier, will improve the reporting response time.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning": choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day or each month, depending on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index still has to cover every row, so it grows with your data, whereas each date partition only holds, say, one day's worth).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.