How to decide whether to store data under 1 column or multiple columns? - plot

I am recording the data for 3 counters and have the choice of using either of the following schemas:
Date|Sensor|Value
Date|Sensor1Value|Sensor2Value|Sensor3Value
When visualizing with either of the above schemas, the x-axis will be the date. With the 1st schema, the sensor will be the legend and the value will be the y-axis.
With the 2nd schema, each column will need to be added as a y-axis, and there will be no legend.
Which of the above 2 schemas is better suited for reporting (plotting graphs)?
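For concreteness, the two options could be written as tables roughly like this (SQL sketch; the table and column names are just illustrative):
-- Option 1: long/narrow form, one row per sensor reading
CREATE TABLE sensor_readings_long (
    record_date DATE NOT NULL,
    sensor      VARCHAR(32) NOT NULL,  -- e.g. 'Sensor1', 'Sensor2', 'Sensor3'
    value       DOUBLE PRECISION
);
-- Option 2: wide form, one column per sensor
CREATE TABLE sensor_readings_wide (
    record_date   DATE NOT NULL,
    sensor1_value DOUBLE PRECISION,
    sensor2_value DOUBLE PRECISION,
    sensor3_value DOUBLE PRECISION
);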

The best answer will depend on 3 things:
the type of visualizations you're trying to build
which visualization tool(s) you're planning to use, and
if you plan to add more sensor values in the future
Essentially, you're either going to pivot your data when storing it (the second schema, with one column per sensor value) or store it in long form and rely on the visualization tool or your database query to perform the pivot logic.
In my experience working with BI and analytics tools, it's almost always better to store data using the first model (Date | Sensor | Value). This provides the most flexibility across visualization tools, and if you need to add more sensors in the future, you won't need to modify your database table structure. If you need to convert your data into the second model, you can always build a view or temp table that uses a dynamic pivot query.
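As a sketch of that conversion (shown here as a static pivot with conditional aggregation, since dynamic pivot syntax varies by database; names follow the illustrative tables above):
CREATE VIEW sensor_readings_pivoted AS
SELECT record_date,
       MAX(CASE WHEN sensor = 'Sensor1' THEN value END) AS sensor1_value,
       MAX(CASE WHEN sensor = 'Sensor2' THEN value END) AS sensor2_value,
       MAX(CASE WHEN sensor = 'Sensor3' THEN value END) AS sensor3_value
FROM sensor_readings_long
GROUP BY record_date;
Adding a fourth sensor then only means adding one more CASE line to the view, not changing the underlying table.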

Related

Custom partitioning in ClickHouse

I have several questions about custom partitioning in ClickHouse. Background: I am trying to build a TSDB on top of ClickHouse. We need to support very large batch writes and complicated OLAP reads.
Let's assume we use the standard partition by month, and we have 20 nodes in our ClickHouse cluster. Will the data from the same month all flow to the same node, or will ClickHouse do some internal balancing and spread the data from the same month across several nodes?
If all the data from the same month is written to the same node, that will be very bad for our scenario. I will probably consider partitioning by (timestamp, tags), where tags are the different tags that define the data source. Our monitoring system will write data to the TSDB every 30 seconds. Our read pattern is usually a single-table range scan or a join of several tables on a column. Any advice on how I should customize my partitioning strategy?
Since ClickHouse does not support secondary indexes, and we will run selection queries on columns, I think I should put those important columns into the primary key, so my primary key will probably be something like (timestamp, ip, port...). Any advice on this design, or perhaps a good reason why ClickHouse does not support secondary indexes (like bitmap indexes) on other non-primary columns?
In ClickHouse, partitioning and sharding are two independent mechanisms. Partitioning by month means that data from different months will never be merged into the same file on the filesystem, and it has nothing to do with data placement between nodes (which is controlled by how exactly you set up your tables and run your INSERT INTO queries).
Partitioning by month or week usually works fine; for choosing the primary key, see the official documentation: https://clickhouse.yandex/docs/en/operations/table_engines/mergetree/#selecting-the-primary-key
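As a minimal sketch of what that could look like (ClickHouse SQL; the table and column names here are hypothetical):
CREATE TABLE metrics
(
    ts    DateTime,
    ip    String,
    port  UInt16,
    tags  String,
    value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)   -- partitioning: how data is split into parts on disk
ORDER BY (ip, port, ts);    -- primary (sorting) key: what your range scans and filters should match
Placement across the 20 nodes is a separate concern, typically handled by a Distributed table over the per-node local tables.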
As for secondary indexes, there are no fundamental issues with adding them; for example, bloom filter index support is in development: https://github.com/yandex/ClickHouse/pull/4499

Tableau---Getting count from 2 different data sources and combining into one total

I am a Tableau newbie and am trying to see if this is possible or not. I have 2 separate data sources in which the same employees are listed: one is for closed cases and the other is for open cases. These data sources have some of the same columns, but for the most part they are different.
Is it possible to aggregate the case count for each employee on the closed and open data sources into a single column? For instance, if an employee has 50 closed cases and 23 open cases, I want it to show 73 for them.
I attempted to play around with joins/unions, but these didn't work properly and duplicated the data most of the time.
I think this is a great chance to leverage blends.
I have created a workbook with the Sample Superstore Excel dataset. This dataset has three sheets. I'll use the Orders and Returns sheets to demonstrate how we can calculate the net orders using blends.
The dataset I'm using can be found here.
Start by connecting to both the Orders and Returns separately. Once done with this step you should see the two data sources at the top of your data pane.
In this example, I'll calculate the Net Returns by Category. In your case, you're after the Total Cases by Employee, so just imagine Employee in place of Category.
Next, drag Category from the Orders data source onto the view, then select the Returns data source and click the chain icon to blend on Order ID.
You will need a common column between the two tables in order to blend.
Once blended I'll go back to the primary data source (indicated by the blue check mark) and create the Net Orders calculation.
This calculation uses the dot notation - similar to what you might see in SQL - to reference our other table.
To double check that our calculation is working properly, we can drag the components of this calculation onto the view and do the math.
Of course, once you are satisfied you can remove all but your blended calculation.
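For comparison, if the open and closed cases lived in a database instead, the figure the blend produces for the original question would be roughly this (plain SQL with made-up table names, assuming the database supports FULL OUTER JOIN):
SELECT COALESCE(c.employee_id, o.employee_id) AS employee_id,
       COALESCE(c.closed_count, 0) + COALESCE(o.open_count, 0) AS total_cases
FROM (SELECT employee_id, COUNT(*) AS closed_count FROM closed_cases GROUP BY employee_id) AS c
FULL OUTER JOIN
     (SELECT employee_id, COUNT(*) AS open_count FROM open_cases GROUP BY employee_id) AS o
  ON c.employee_id = o.employee_id;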
Blending isn't ideal in most cases but you could try it. Bring in each data source separately and "join" them within your workbook pane on Employee or hopefully an Employee_id. Click the little chain once you have them both loaded and you are on a worksheet tab. Then you could sum the counts by employee. Blending sometimes presents some issues with calculated fields across the two data sources but this is what I would try first.

Adding new fields to historical tables in BigQuery

I'm getting daily exports of Google Analytics data into BigQuery and these form the basis for our main reporting dataset.
Over time I need to add new columns for additional things we use to enrich the data - say, a mapping from url to 'reporting category', for example.
This is easy to add as a new column on the processed tables (there are about 10 processing steps at the moment for all the enrichment we do).
The issue is if stakeholders then ask: can we add that new column to the historical data?
Currently I then need to rerun all the daily jobs, which is very slow and costly.
This is coming up frequently enough that I'm seriously thinking about redesigning my data pipelines to account for the fact that I often need to essentially drop and recreate ALL the data from time to time, when I need to add a new field or correct old dirty data or something.
I'm just wondering if there are better ways to:
Add a new column to an old table in BQ (I would be happy to do this by hand in the cases where I can just join the new column based on the GA [hit_key] I have defined, which is basically a row key)
(Less common) Update existing tables based on some WHERE condition.
Just wondering what the best practices are, and if anyone has had similar issues where you basically need to update a historical schema, and if there are ways to do it without just dropping and recreating everything, which is essentially what I'm currently doing.
To be clearer on my current approach: I'm taking the [ga_sessions_yyyymmdd] table and making a series of [ga_data_prepN_yyyymmdd] tables, where I either add new columns at each step or reduce the data in some way. There are now 11 of these steps, and each time I'm taking all 100 or more columns along for the ride. This is what I'm going to try to design away from, as currently 90% of the columns at each stage don't even need to be touched - they can just be joined back on at the end, maybe based on hit_key or something.
It's a little bit messy though to try and pick apart.
Adding new columns to the schema of the existing historical tables is possible, but the values for the newly added columns will be NULLs. If you do need to populate values into these columns, probably the best approach is to use an UPDATE DML statement. More details on how to try it out are here: Does BigQuery support UPDATE, DELETE, and INSERT (SQL DML) statements?
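A sketch of that approach in BigQuery standard SQL, using the hit_key join idea from the question (the dataset, table, and column names below are placeholders):
-- 1. Add the new, NULL-filled column to the historical table
--    (this can also be done through the table schema UI/API instead of DDL)
ALTER TABLE mydataset.ga_data_prep1_20180101
  ADD COLUMN reporting_category STRING;

-- 2. Backfill it from a mapping table, joining on the row key
UPDATE mydataset.ga_data_prep1_20180101 AS t
SET reporting_category = m.reporting_category
FROM mydataset.url_category_map AS m
WHERE t.hit_key = m.hit_key;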

R Models with Factors in Tableau

I'm attempting to build a model for sales in R that is then integrated into Tableau so I can look at the predictions as they relate to the actual values. The model I'm building for sales is in R, and I'm trying to integrate it into Tableau by creating a calculated field that uses the model to give the predicted value for each record using the SCRIPT_REAL function in Tableau. The records are all coming from a MySQL database connection. The issue that I'm having comes from using factors in my model (for example, month).
If I want to group all of the predictions by day of week, Tableau can't perform the calculation because it tries to aggregate each field I'm using before passing it into the model. When it tries to aggregate month, not all of the values are the same, so it instead returns a "". Obviously a prediction value then can't be reached because there is no value associated with a "". Essentially what I'm trying to do is get a prediction value for each record that I have, and then aggregate those prediction values in various ways.
Okay, now I can understand a little better what you're talking about. A twbx with dummy data (and even a dummy model, as long as it generates the same problem you're facing) would help even more, but let me try to say a couple of things.
One thing that is important to understand is that SCRIPT functions are like table calculations, i.e., they are performed only with aggregated fields, they are computed last (after all aggregations, measures and regular calculations) and you can define the level of aggregation you want.
So, if you want to display values on a daily basis, put your date field on page, go to the day level, and for the calculation partition by DAY(date_field). If you want by week, same thing.
I find table calculations (including R scripts) very useful when they are an end, i.e. the calculation is the answer. They're not so useful (or rather, not so easily manipulable) when they're a means, i.e. an intermediate step before a final calculation to get to the answer. That is mainly because the level of aggregation is based on the fields that are on the page. So, for instance, if I have multiple orders from different clients and want to assess the average order value by customer, a table calculation is great: WINDOW_AVG(SUM(order_value)) partitioned by customer. If, for some reason, I want to sum all these values, then it's tricky. I can't do it directly, as the average order value is not stored anywhere and cannot be retrieved without all the clients being on the page. So what I usually do is create the table with all customers, export it to mdb, and reconnect in Tableau.
I said all this because it might be your problem, when you say "Tableau can't perform the calculation because it tries to aggregate each field I'm using before passing it into the model". Yes, Tableau does that, and there's nothing you can do about it other than figure out a way around it. Creating an intermediate table in Tableau, exporting it, and connecting to it again in Tableau might be one answer. Performing the calculations in R, exporting the results, and then connecting to them in Tableau might be another.
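To make the average-order-value example above concrete, here is roughly the same logic in SQL (against a hypothetical order_lines table): the per-customer average is itself a derived result, so summing those averages needs yet another query on top, which is the step that's awkward to do inside Tableau without exporting first.
-- analogous to WINDOW_AVG(SUM(order_value)) partitioned by customer
SELECT customer_id,
       AVG(order_total) AS avg_order_value
FROM (
    SELECT customer_id, id_order, SUM(order_value) AS order_total
    FROM order_lines
    GROUP BY customer_id, id_order
) AS per_order
GROUP BY customer_id;
-- summing these averages across customers would mean wrapping the whole query above yet again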
But again, without actually seeing what you're trying to do, it's hard to say what you need to do

Array calculation in Tableau, maxif routine

I'm fairly new to Tableau, and I'm struggling in building some routines that could be easily implemented in Excel (though it would take forever for big sets of data).
So here is the deal, consider a dataset with the following fields:
int [id_order] -> id of the sales order (deepest level, there are only unique entries of id_order)
int [id_client] -> as I want to know who bought it
date [purchase_date] -> when the customer bought the product
What I want to know is, for each order, when was the last time (if ever) the client bought something. In other words, what is the highest purchase_date for that client that is smaller than the current purchase_date.
In Excel, the approach is simple (but again, not efficient):
{=MAX(IF(id_client=B1,IF(purchase_date<C1,purchase_date)))}
Is there a way to do this kind of calculation in Tableau?
You can do this in Tableau using table calculations. They take a little time to understand how to use well, but are very powerful and flexible. I posted a sample Tableau workbook for a similar question in an answer for SO question Find first time a condition is met
Your situation is similar, but with the extra complication that you want to repeat the analysis for each client id, so you might want to try a recursive approach using the Previous_Value() function instead of the approach used in that example - though I'm not certain that previous_value() will fit your situation.
Still, it might be helpful to download the example workbook I mentioned to get an idea how table calculations can address similar problems.
Just to register the solution, in case someone has the same doubt.
So, basically the solution I found uses a table calculation, which is not calculated until it's used on a sheet (and is only calculated in the context of that sheet). That's a little bit limiting, so what I do is create a sheet with all the fields I need (plus whatever is necessary for the table calculation), then export the data (to mdb) and connect to this new file.
So, for my example, the right table calculation is (let's name it last_order_date):
LOOKUP(MAX([purchase_date]),-1)
Explanation: the MAX() is necessary because LOOKUP (and all table calculations) does not work with raw data directly, only with aggregations. You can use SUM, AVG, MAX, ATTR, whatever suits you. As in my case there will be only 1 matching value, any of these functions will do just fine and return the same result.
The -1 indicates that I'm looking for the element immediately before the current entry (of the table, as you define it). If it were FIRST(), it would go for the first entry of the table, and LAST() would go for the last.
Now, I have to put it on a sheet. So I'll bring the fields id_client, id_order, purchase_date and last_order_date.
Then I have to define the parameters of my table calculation last_order_date (Edit Table Calculation). I'll go to Compute using and choose Advanced. Now I'll set Partitioning to id_client, and Addressing to all the rest. What will that do? It means Tableau will create a temporary table (partition) for each id_client, and the table calculation will be computed within each of those tables.
Additionally, I will sort by the field purchase_date, using Max (again because of the aggregation issue) and ascending order, to guarantee my entries are in chronological order.
Now, what will it do? For each entry it will access the table for that id_client and check which purchase_date comes immediately before the current entry (the one being assessed), which is exactly what I need.
To avoid spending Tableau processing on visualization, I often put all the fields on Detail (and leave nothing on screen) and use a bar chart (it's good because it allows me to see the data). Then I export the data to mdb and connect to it again. Unfortunately Tableau doesn't directly export to tde.
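For reference, the same per-client lookup can be written directly in SQL with a window function, assuming the data sits in a table called orders with the three fields above; it is equivalent to the LOOKUP(MAX([purchase_date]), -1) calculation partitioned by id_client and sorted by purchase_date:
SELECT id_order,
       id_client,
       purchase_date,
       LAG(purchase_date) OVER (
           PARTITION BY id_client   -- one "temporary table" per client, like the Tableau partitioning
           ORDER BY purchase_date   -- chronological order, like the Tableau sort
       ) AS last_order_date
FROM orders;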
