Graphite for a multi-tenant app

I am building a system where I need to collect time series data. The system is multi-tenant, so various customers will emit their own data points and I will store them. What kind of support does Graphite have for this kind of scenario?

Graphite provides a relay (carbon-relay) which can forward metrics to other instances. This is configurable, so what you could do is specify the tenant ID in the metric name (for example: ${tenant}.metric.name) and relay to other instances based on that ID using a regular expression. This way you can shard by tenant.
See: http://graphite.readthedocs.org/en/latest/carbon-daemons.html
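As a rough sketch (the tenant names, hosts and ports here are made up; see the carbon documentation linked above for the exact options), you would set RELAY_METHOD = rules in carbon.conf and describe the routing in relay-rules.conf roughly like this:

# relay-rules.conf - route each tenant's metrics to its own carbon-cache instance
[tenant_acme]
pattern = ^acme\.
destinations = 10.0.0.11:2004

[tenant_globex]
pattern = ^globex\.
destinations = 10.0.0.12:2004

[default]
default = true
destinations = 10.0.0.10:2004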

Related

Is there an option to retrieve all unique sessions using Google's Core Reporting API?

I'm wondering if there's a possibility to fetch disaggregated data from Google, using their APIs.
Currently I'm already able to receive quite a detailed segmentation by selecting ga:source, ga:dateHourMinute, ga:country and others, but of course these are still groups of sessions.
Thanks a lot!
Not by default - there are no dimensions for sessions in the API, and not even the client ID is exposed via the API.
An easy way to obtain a session marker is to store a random number in a session-scoped custom dimension. Since a session-scoped dimension by definition stores only the last value of the session, this will give you a unique (well, not technically unique, but unique enough) value per session, which can be used in conjunction with the client ID, which you'd need to store in another custom dimension.
Of course since this will give you a lot of single rows you will be running into API limits pretty soon.
In a GA360 account you could use BigQuery - the BQ export schema includes session identifiers.
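As a rough sketch of the sending side (the property ID and dimension indexes are placeholders, and the dimensions are assumed to be configured as session-scoped and user-scoped respectively in the GA admin), a hit carrying such a session marker could be sent via the Measurement Protocol like this:

import random
import uuid

import requests

# Placeholders - real values come from your GA property and admin configuration
TRACKING_ID = "UA-XXXXXX-Y"                   # GA property ID
client_id = str(uuid.uuid4())                 # normally persisted per user/device
session_marker = str(random.getrandbits(64))  # random per-session value

requests.post(
    "https://www.google-analytics.com/collect",
    data={
        "v": "1",               # Measurement Protocol version
        "tid": TRACKING_ID,
        "cid": client_id,
        "t": "pageview",        # hit type
        "dp": "/example",       # page path
        "cd1": session_marker,  # assumed session-scoped custom dimension at index 1
        "cd2": client_id,       # assumed custom dimension holding the client ID
    },
    timeout=10,
)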

Silo data for users into particular physical location

We have a system we're designing which has to hold data for people globally, including countries with very strict data protection policies, specifically where data about their citizens must physically reside in that country.
Now, we could engineer a mechanism for silo-ing/querying the data where it must be pulled from a particular location, but as the system will be Azure-based we were hoping that Cosmos DB's partitioning feature might be an option.
Based on the information available to date, it seems possible to assign a location-specific partition for some data, but it's not very clear. Any search for partitioning in general goes on about high availability and low latency - good things - but not what I'm looking for.
To this end, can location-specific data be assigned in Cosmos DB as part of its feature set, or is this something that has to be engineered on top?
For data sovereignty, you must engineer a data access layer across multiple Cosmos DB accounts. Cosmos DB by default will replicate your data across all regions within your account (which is not what you need).
While not specifically for this scenario, you can see a description of how to build such a layer here: https://learn.microsoft.com/en-us/azure/cosmos-db/multi-region-writers
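As a rough illustration of such a layer, here is a minimal Python sketch that routes each user's data to a Cosmos DB account in the required jurisdiction (the account endpoints, keys, database/container names and the residency lookup are all placeholders):

from azure.cosmos import CosmosClient

# One Cosmos DB account per jurisdiction (placeholder endpoints and keys)
ACCOUNTS = {
    "DE": ("https://myapp-de.documents.azure.com:443/", "<key-for-de-account>"),
    "US": ("https://myapp-us.documents.azure.com:443/", "<key-for-us-account>"),
}

# One client per account, built up front
clients = {
    region: CosmosClient(endpoint, credential=key)
    for region, (endpoint, key) in ACCOUNTS.items()
}

def container_for(residency):
    # Route to the account in the jurisdiction the user's data must live in
    return (clients[residency]
            .get_database_client("appdb")
            .get_container_client("profiles"))

def save_profile(profile):
    # 'residency' is assumed to be resolved from the user's citizenship/contract;
    # the item must carry an 'id' and the container's partition key as usual
    container_for(profile["residency"]).upsert_item(profile)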

Is Xively a good fit where data is simple/infrequent, and processing of that data is done externally to it?

I'm looking to design a solution with a rather large number of Arduino devices all returning a very simple data point (let's say temperature for now so as to not release too much information). The single data point is collected only once a day and sent to a central site, from which reports can be generated.
All of the devices will have some device-specific data (a location ID and device ID, in combination unique across the entire network of devices) burnt into EEPROM. The data collected is simply that device-specific data and the temperature itself (but see question 2 below). So, a very simple payload.
Initial investigations into Xively seem to mandate every device be created within Xively itself but that's going to be a serious pain given the many hundreds we expect, even in the pilot program.
And, given that each device uploads its unique ID along with the temperature, it seems to make little sense to have to configure all of them within Xively when the data itself can be easily segregated and reported on at the back end based on the device-specific data.
So, a few questions:
1/ Is Xively a good fit for this sort of scheme? In other words, is it worth using as just a data collector from which we can access the data at the back end and make nice reports? We have no real interest (yet) in using Xively as the interface - for now, it's enough to collect the data at the central site, generate a PDF file and mail it out.
2/ Is it acceptable in Xively to define your single device (an uber-device) as "my massive cluster of Arduino nodes" and then have each node post its data as the uber-device? They seem to just refer to "device" without actually specifying any restrictions.
3/ Given that timestamp information is important to us, can Xively inject that information into its data when the API call is made to upload the data? That may remove the need for us to provide on-board clocks for the devices.
4/ Have people with Arduino experience implemented any other schemes like this (once a day upload)? The business prefers Xively so they don't have to set up their own servers to receive the data, but there may be other options with the same result.
Here we go:
Yes, this is exactly what Xively was designed for: a massive data broker, or as the buzzword-happy IoT guys like to call it, a Device Cloud - one of the simplest and easiest to use on the market today.
I'm not sure if there is any restriction on the number of datastreams a single feed can handle, but having thousands of datastreams per feed does not seem to me the smartest way of using Xively. Creating an individual feed for each physical device is the idea behind it: devices can auto-register and activate a pre-registered feed. Read the samples in the Xively tutorial; this is not difficult at all, and serial numbers can be added/created in batches from text files.
Sure, you may provide timestamp information while uploading; if you do NOT provide it, Xively will assume it is a real-time feed and will add the current upload time to the data.
It has surely been implemented before. It is important to note that Xively does not care who or what is providing data to a feed; you may share one key and feed number with thousands of devices, and they can all upload to the same feed or even to the same datastream. However, managing data uploaded this way can become very messy because of the lack of granularity and fine control.
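As a rough sketch of the upload side (written from memory of Xively's v2 REST API as it was documented at the time - the feed ID, API key, datastream name and endpoints should all be checked against the current docs), a device or gateway could push a daily reading either letting the server timestamp it or supplying its own timestamp:

import requests

API_KEY = "YOUR_XIVELY_API_KEY"         # placeholder
FEED_ID = "123456789"                   # placeholder feed ID
STREAM_ID = "loc01-dev042-temperature"  # location ID + device ID + measurement

# Option A: simple feed update - the server stamps the datapoint with the upload time
requests.put(
    "https://api.xively.com/v2/feeds/%s.json" % FEED_ID,
    headers={"X-ApiKey": API_KEY},
    json={"version": "1.0.0",
          "datastreams": [{"id": STREAM_ID, "current_value": "21.5"}]},
    timeout=10,
)

# Option B: historical upload - the device (or gateway) supplies its own timestamps
requests.post(
    "https://api.xively.com/v2/feeds/%s/datastreams/%s/datapoints.json"
    % (FEED_ID, STREAM_ID),
    headers={"X-ApiKey": API_KEY},
    json={"datapoints": [{"at": "2014-06-01T06:00:00Z", "value": "21.5"}]},
    timeout=10,
)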

Explain Query Bands in Teradata

Can anyone explain Query Bands in Teradata?
I've searched a lot regarding this, but wasn't able to find information which I can understand.
Please be a bit detailed.
Thanks!!!
Query banding in Teradata:
Query banding provides contextual workflow information.
Concept:
Scientists will often band the legs of birds with devices to track their flight paths. Monitoring and analyzing the data retrieved via the bands provides critical information about the species.
DBAs follow the same process when they need more information about a query than is readily available.
Metadata such as the name of the requesting user, the work unit and the application name is important for workload management, for tracking overall use of the data warehouse, and for query troubleshooting.
The query banding feature links these metadata details to the query in the database.
A query band can contain any number of name/value pairs, such as the initiating user's corporate ID, department and location, as well as the time the execution was initiated.
Prashanth provided a good analogy with birds and bands. Adam is asking for specific situations. I can come up with several examples of when query banding may be very useful:
Your system is used by hundreds of users via an application server, with a custom application or a reporting tool like Business Objects, Tableau or QlikView. The application server connects to Teradata with a single user ID, but the administrator would still like to know which users, departments and groups of users generate each query, in order to analyze it later in DBQL or simply to allocate proper system resources using TASM. For this, the application can be configured so that each query is "banded" with information like "AppUser:User1;Appgroup:DataScientists;QueryType:strategic02". Even though the application server uses one Teradata user and a limited number of connections to route all the queries from hundreds of users, each individual query is marked with exactly which user initiated it. You can then perform all kinds of analysis based on this information.
Suppose you have a complex ETL application and you want to track and analyze your loads - what went wrong and when. Usually you would need to log every step of your ETL process, and in the logs you must record a unique Load ID, Process ID, Step ID, etc. You do this so you can understand which specific process caused a halt or a performance degradation; without such logging it would not be possible to distinguish runs of the same steps between different executions of your ETL application. A good alternative is to switch on DBQL and embellish your queries with query band information - Load ID, Process ID, Step ID, etc. That way you have all the necessary information in DBQL without having to create additional elaborate log tables.
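For instance, an ETL step could tag its session like this before running its queries (the pair names and values are purely illustrative); DBQL will then carry the band for every statement of that session:

SET QUERY_BAND = 'LoadID=20240601;ProcessName=Sales_Daily;StepID=Stage_Load;' FOR SESSION;

-- Check what the current session is tagged with
SELECT GetQueryBand();

-- Clear the band again when the step is done
SET QUERY_BAND = NONE FOR SESSION;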
SET QUERY_BAND = 'name=value;name2=value;' FOR SESSION|TRANSACTION;
This will tag your query with name/value pairs. These can be used for workload management; for example, in TDWM you have throttles and priority management hooks that can prioritize all queries whose name2 carries the value "value". It means you can attach very rich detail to the session or transaction.
Yes, what you described can easily be done with query banding; think of it as a "wagon of key/value attributes in transit". You can access them via SQL or programmatically as session attributes, in BTEQ or JDBC for example.
Necromancing... Existing answers do a good job of explaining how query bands work, but since I could not find a complete working example, I thought I'd add one here.
Setting query bands in Teradata is already covered, so I will provide an example of how to set them from a .NET client:
// Build the query band from the current connection and apply it
private void SetQueryBands()
{
    TdQueryBand qb = Connection.QueryBand;
    qb["CustomApplicationName"] = "MyAppName";

    // Add any additional caller-supplied key/value pairs
    foreach (string key in CustomQueryBands.Keys)
    {
        qb[key] = CustomQueryBands[key];
    }

    Connection.ChangeQueryBand(qb);
}

// Usage: open the connection first, then set the query band
Connection = new TdConnection(GetConnectionString());
Connection.Open();
SetQueryBands();
More details can be found here.
To retrieve stored query band data, the GetQueryBandValue function can be used:
SELECT CollectTimestamp, QueryBand,
       GetQueryBandValue(QueryBand, 0, 'Key1') AS Value1,
       GetQueryBandValue(QueryBand, 0, 'Key2') AS Value2,
       GetQueryBandValue(QueryBand, 0, 'Key3') AS Value3
FROM dbql_data.dbqlogtbl
WHERE dateofday = DATE - 1
AND QueryBand LIKE '%somekeyorvalue%'

custom querying in graphite

We need to collect time series information on multiple servers and business processes, and are considering Graphite. It seems good if we want to display the raw data. But what if we want to do BI on this data and run custom queries? Does Graphite allow that, or alternatively can I instruct Graphite to store data in Postgres?
Graphite definitely allows you to query your data, both graphically and as csv or json. Queries in Graphite aren't done with a language like SQL; they're done with functions that apply to one metric at a time. Each metric is its own database, which is just a series of (time, value) pairs.
The most common thing you're likely to want is to summarize data over different time periods. Here's an example of what the url would look like for a graph where the data is summarized daily for a week:
http://graphite.example.com/render/?width=586&height=308&_salt=1355992522.674&target=summarize(stats_counts.mystat.subname%2C%20'1day')&from=-7days
If you wanted to get back csv or json instead of a graph, you would just add &format=csv or &format=json to the url. And if you're looking at the data through Graphite's web interface, you'd just put the following in to view the same graph:
summarize(stats_counts.mystat.subname, '1day')
Most of the querying of data you do will at first be in the graphite composer, which is just a web interface that lets you click on the metrics you want to add to the graph, and apply the various functions to them.
As for adding the data to Postgres, you're probably not going to want to do that to query it. The data isn't really structured in a way that's great for relational databases.
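As a small illustration of pulling data programmatically rather than loading it into another database (the hostname and metric name are placeholders), a script can call the render API and work with the JSON response directly:

import requests

GRAPHITE = "http://graphite.example.com"                    # placeholder host
TARGET = "summarize(stats_counts.mystat.subname, '1day')"   # same query as above

resp = requests.get(
    GRAPHITE + "/render",
    params={"target": TARGET, "from": "-7days", "format": "json"},
    timeout=10,
)
resp.raise_for_status()

# The response is a list of series, each with [value, timestamp] datapoints
for series in resp.json():
    for value, timestamp in series["datapoints"]:
        if value is not None:  # gaps come back as null
            print(series["target"], timestamp, value)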
