I have these four tables:
I am sorry for the mixed German/English naming. My predecessor started in German and I am trying to shift it to English. You can just call them:
'Place',
'Streets',
'Usage' and
'DateTime'
'DateTime' is actually just a calendar with dates from 2014 to 2025 and 96 times for each date (one every 15 minutes).
'Usage' gives me the IDs of the charging stations which get used ('Usage'[LPNumber]) and the start and end time of each usage.
With this code:
TimesOverlapping =
CALCULATE (
    COUNTROWS ( 'Usage' );
    FILTER (
        'Usage';
        'Usage'[ConnectionStart Time] < ( DateTime[Time] + TIME ( 0; 14; 59 ) )
            && 'Usage'[ConnectionEnd Time] > DateTime[Time]
            && 'Usage'[ConnectionStart Day] = DateTime[Date]
    )
)
I am counting the number of charging stations (in 'Usage') which are used, for example on 01.01.2014 from 00:00 to 00:15 (I always sum up 15 minutes), and display the number as a custom column in 'DateTime' in the row for 01.01.2014 00:00. This works totally fine.
Now my problem:
I can use a date slicer because I have all the dates in 'DateTime', but now I also need to apply a place and a street slicer to filter the values in 'DateTime'. However, I do not have any direct relationships (as you can see), and it is not possible to create any or to rewrite the tables because they have a lot of impact on other tables.
My thoughts: I already have code that passes values from 'Usage' to 'DateTime' (see above), and 'Usage' can be filtered by place and street. Wouldn't it be possible to pass only the filtered values to 'DateTime', or does some other workaround exist?
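Would reworking the calculated column into a measure, roughly like the sketch below, be the right direction? (Untested; it assumes the place and street slicers already filter 'Usage', and MIN() is my guess for picking up the current row's date and time, since a measure reacts to slicers while a calculated column does not.)
TimesOverlapping Measure =
VAR _time = MIN ( DateTime[Time] )    // time of the current 15-minute slot
VAR _date = MIN ( DateTime[Date] )    // date of the current row
RETURN
    CALCULATE (
        COUNTROWS ( 'Usage' );
        FILTER (
            'Usage';
            'Usage'[ConnectionStart Time] < ( _time + TIME ( 0; 14; 59 ) )
                && 'Usage'[ConnectionEnd Time] > _time
                && 'Usage'[ConnectionStart Day] = _date
        )
    )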
I found a similar problem in the Power BI community, but I couldn't make use of it, and I don't know if I am allowed to post links. The headline was "Use slicer to filter countrows-table".
I appreciate your help a lot! Thank you very much.
Related
I have a simple data set with training sessions for some athletes. Let's say I want to visualize how many training sessions are done on average per athlete, either in total or split by the clubs that exist. I hope the data set is somewhat self-describing.
To normalize the number of activities by the number of athletes I use two measures:
TotalSessions = COUNTA(Tab_Sessions[Session key])
AvgAthlete = AVERAGEX(VALUES(Tab_Sessions[Athlete]),[TotalSessions])
I use AvgAthlete as the value in both visuals shown below. If I filter on the clubs, the values are as expected, but with no filter applied I get some strange values.
What I guess happens is that since Athlete B doesn't do any strength training, Athlete B is not included in the normalization factor for Strength. Is there a DAX function that can solve this?
If I didn't have the training sessions as a hierarchy (Type-Intensity), it would be pretty straightforward to do some kind of workaround with a calculated column, but that won't work with hierarchical categories. The expected results, calculated in Excel, are shown below:
Data set as csv:
Session key;Club;Athlete;Type;Intensity
001;Fast runners;A;Cardio;High
002;Fast runners;A;Strength;Low
003;Fast runners;B;Cardio;Low
004;Fast runners;B;Cardio;High
005;Fast runners;B;Cardio;High
006;Brutal boxers;C;Cardio;High
007;Brutal boxers;C;Strength;High
If you specifically want to aggregate this across whatever choice you have made in your Club selection, then you can write a measure that does exactly that:
AvgAthlete =
VAR _athletes =
    CALCULATE (
        DISTINCTCOUNT ( 'Table'[Athlete] ),
        ALLEXCEPT ( 'Table', 'Table'[Club] )
    )
RETURN
    DIVIDE (
        [Sessions],
        _athletes
    )
Here we use a distinct count of values in the Athlete column, with all filters removed apart from on the Club column. This is, as far as I interpret your question, the denominator you are after.
Divide the total number of sessions by this number of athletes. Here is the result:
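As a rough sanity check of what this computes on the sample data (my own arithmetic, assuming [Sessions] is simply the plain count of session keys): with no club filter there are 3 distinct athletes, so Cardio gives 5 / 3 ≈ 1.67 and Strength gives 2 / 3 ≈ 0.67; with "Fast runners" selected the denominator is 2, so Cardio gives 4 / 2 = 2 and Strength gives 1 / 2 = 0.5.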
I have a dataset (a view) that has a numeric field "WR_EST_MHs". If that field exceeds a certain number of man hours (120 or 60, depending on two other fields' values), I need to split it out into constituent records and spread those hours over future weeks.
The OH_UG_Key and 1kMCM_Flag fields determine the threshold for splitting. For example, if the OH_UG = 1 AND 1kMCM_Flag = 'N' and the WR_EST_MHs > 120, then spread the WR_EST_MHs value over as many records as is necessary, in 120 MH increments, changing only the WRSchedDate and WRSchedDate_Key fields (advancing each by one week).
Each OH_UG / 1kMCM_Flag / WR_EST_MHs scenario is as follows:
This is an example of what I need to do:
I thought that something like this might work, but I haven't worked with levels before:
with cte as
  (Select * from "STJOF"."vfactScheduledWAWork"
  )
select WR_Key, WP_Key, WRShedDate, DistSA_Key_Hash, CrewHQ_Key_Hash, Priority_Key_Hash, JobType_Key_Hash, WRStatus_Key_Hash, PerfBy_Key, OHUG_Key, 1kMCM_Flag, WR_EST_MHs
from cte cross join table(cast(multiset(select level from dual
                                        connect by level >= WR_EST_MHs / 120
                                        ) as sys.odcinumberlist))
order by WR_Key;
I also thought this could be done with a "tally table" which I have a little experience with. I really don't know where to begin on this one.
So I would say that a "Tally Table" will work if it is applied correctly. (Or, in this case, a tally view.)
First, break the logic for the hour breakout into a function so we don't have case when expressions everywhere, like so:
CREATE OR REPLACE FUNCTION get_hour_breakout(in_ohug_key IN NUMBER, in_1kmcm_flag IN VARCHAR2, in_tot_hours IN NUMBER)
RETURN NUMBER
IS
  hours NUMBER;
BEGIN
  hours :=
    case when in_ohug_key=2 and in_1kmcm_flag='N' and in_tot_hours>60 then 60 else
      case when in_ohug_key=2 and in_1kmcm_flag='Y' and in_tot_hours>60 and in_tot_hours<=120 then 60 else
        case when in_ohug_key=2 and in_1kmcm_flag='Y' and in_tot_hours>120 then 120 else
          120
        end
      end
    end;
  RETURN(hours);
END get_hour_breakout;
This way, if the hour breakout logic changes, it can be tweaked in one place.
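As a quick sanity check (the argument values here are just made-up examples), the function can be exercised from dual:
-- expected: 60  (key 2, flag 'N', hours above 60)
-- expected: 60  (key 2, flag 'Y', hours between 60 and 120)
-- expected: 120 (anything that falls through to the default)
SELECT get_hour_breakout(2, 'N', 200) AS breakout_1,
       get_hour_breakout(2, 'Y', 90)  AS breakout_2,
       get_hour_breakout(1, 'N', 500) AS breakout_3
  FROM dual;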
Second, join to a dynamic "tally" view like so:
select wr_key,
       WP_Key,
       wrscheddate + idxkey.nnn*7 wrscheddate,
       to_char(wrscheddate + idxkey.nnn*7,'yyyymmdd') WRSchedDate_Key,
       OHUG_Key,
       kMCM_Flag,
       case
         when (wr_est_mhs - idxkey.nnn*get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)) >= get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)
           then get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)
         else wr_est_mhs - idxkey.nnn*get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs)
       end wr_est_mhs
  from yourView vwrk
 inner join (SELECT ROWNUM-1 nnn
               FROM ( SELECT 1 just_a_column
                        FROM dual
                     CONNECT BY LEVEL <= 52
                    )
            ) idxkey
    on vwrk.wr_est_mhs/get_hour_breakout(ohug_key, kmcm_flag, wr_est_mhs) > idxkey.nnn
By using CONNECT BY LEVEL we, in effect, generate a set of zero-indexed rows; then, by joining to it where the hours divided by the breakout are greater than that index, we get the right number of rows for each source row.
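In isolation, that inline tally view just produces the index values 0 through 51 (one row per potential week):
SELECT ROWNUM - 1 AS nnn
  FROM (SELECT 1 just_a_column
          FROM dual
       CONNECT BY LEVEL <= 52);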
For example, if the function returns 120 and the hours are 100 you get a single row, so it stays 1 to 1. If the function returns 120 and the hours are 500, however, you get 5 rows because 500/120=4.1666666…, which in the join gives rows 4,3,2,1,0. Then the rest is simple math to determine the number of hours per breakout.
This could also be improved by moving the function call into the lower view so it is only evaluated once per row. And the inline tally view could be made into its own view, depending on the maintainability you need to build into it.
Firebase offers split testing functionality through Firebase Remote Config, but it lacks the ability to filter retention in the Cohorts section by user properties (by any property, in fact).
In search of a solution to this problem I'm looking at BigQuery, since Firebase Analytics provides a usable way to export data to that service.
But I'm stuck with many questions, and Google has no answer or example that points me in the right direction.
General questions:
As a first step I need to aggregate data that represents the same thing the Firebase cohorts do, so I can be sure my calculation is right:
The next step should be to simply apply constraints to the queries so they match custom user properties.
Here is what I have so far:
The main problem is a big difference in the user counts. Sometimes it is off by about 100 users, but sometimes by close to 1000.
This is the approach I use:
# 1
# Count users with `user_dim.first_open_timestamp_micros`
# in the specified period (w0, i.e. week 0)
# this is the way firebase group users to cohorts
# (who started app on the same day or during the same week)
# https://support.google.com/firebase/answer/6317510
SELECT
COUNT(DISTINCT user_dim.app_info.app_instance_id) as count
FROM
(
TABLE_DATE_RANGE
(
[admob-app-id-xx:xx_IOS.app_events_],
TIMESTAMP('2016-11-20'),
TIMESTAMP('2016-11-26')
)
)
WHERE
STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
BETWEEN '2016-11-20' AND '2016-11-26'
# 2
# For each next period count events with
# same first_open_timestamp
# Here is example for one of the weeks.
# week 0 is Nov20-Nov26, week 1 is Nov27-Dec03
SELECT
COUNT(DISTINCT user_dim.app_info.app_instance_id) as count
FROM
(
TABLE_DATE_RANGE
(
[admob-app-id-xx:xx_IOS.app_events_],
TIMESTAMP('2016-11-27'),
TIMESTAMP('2016-12-03')
)
)
WHERE
STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
BETWEEN '2016-11-20' AND '2016-11-26'
# 3
# Now we have users for each week w1, w2, ... w5
# Calculate retention for each of them
# retention week 1 = w1 / w0 * 100 = 25.72181359
# rw2 = w2 / w0 * 100
# ...
# rw5 = w5 / w0 * 100
# 4
# Shift week 0 by one and repeat from step 1
BigQuery queries tips request
Any tips and directions on how to build a complex query that aggregates and calculates all the data required for this task in one step would be very much appreciated.
Here is the BigQuery Export schema, if needed.
Side questions:
Why are user_dim.device_info.device_id and user_dim.device_info.resettable_device_id always null?
user_dim.app_info.app_id is missing from the documentation (in case someone from Firebase support reads this question).
How should event_dim.timestamp_micros and event_dim.previous_timestamp_micros be used? I cannot work out their purpose.
PS
It would be good if someone from the Firebase team answered this question. Five months ago there was a mention of extending the cohorts functionality with filtering, or of showing BigQuery examples, but things are not moving. Firebase Analytics is the way to go, they said; Google Analytics is deprecated, they said.
Now I am spending my second day learning BigQuery and building my own solution on top of the existing analytics tools. I know Stack Overflow is not the place for these comments, but guys, what are you thinking? Split testing may dramatically affect the retention of my app. My app does not sell anything; funnels and events are not valuable metrics in many cases.
Any tips and directions on how to build a complex query that aggregates and calculates all the data required for this task in one step would be very much appreciated.
Yes, generic BigQuery will work fine.
Below is not the most generic version, but it can give you an idea.
In this example I am using Stack Overflow data available in the Google BigQuery Public Datasets.
The first sub-select – activities – is in most cases the only one you need to rewrite to reflect the specifics of your data.
What it does is:
a. Defines the period you want to use for the analysis.
In the example below it is a month – FORMAT_DATE('%Y-%m', ...).
But you can use year, week, day or anything else – respectively:
• By year - FORMAT_DATE('%Y', DATE(answers.creation_date)) AS period
• By week - FORMAT_DATE('%Y-%W', DATE(answers.creation_date)) AS period
• By day - FORMAT_DATE('%Y-%m-%d', DATE(answers.creation_date)) AS period
• …
b. It also "filters" only the type of events/activity you need to analyse;
for example, `WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'` looks for answers to questions tagged google-bigquery.
The rest of the sub-queries are more or less generic and can mostly be used as is.
#standardSQL
WITH activities AS (
SELECT answers.owner_user_id AS id,
FORMAT_DATE('%Y-%m', DATE(answers.creation_date)) AS period
FROM `bigquery-public-data.stackoverflow.posts_answers` AS answers
JOIN `bigquery-public-data.stackoverflow.posts_questions` AS questions
ON questions.id = answers.parent_id
WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
GROUP BY id, period
), cohorts AS (
SELECT id, MIN(period) AS cohort FROM activities GROUP BY id
), periods AS (
SELECT period, ROW_NUMBER() OVER(ORDER BY period) AS num
FROM (SELECT DISTINCT cohort AS period FROM cohorts)
), cohorts_size AS (
SELECT cohort, periods.num AS num, COUNT(DISTINCT activities.id) AS ids
FROM cohorts JOIN activities ON activities.period = cohorts.cohort AND cohorts.id = activities.id
JOIN periods ON periods.period = cohorts.cohort
GROUP BY cohort, num
), retention AS (
SELECT cohort, activities.period AS period, periods.num AS num, COUNT(DISTINCT cohorts.id) AS ids
FROM periods JOIN activities ON activities.period = periods.period
JOIN cohorts ON cohorts.id = activities.id
GROUP BY cohort, period, num
)
SELECT
CONCAT(cohorts_size.cohort, ' - ', FORMAT("%'d", cohorts_size.ids), ' users') AS cohort,
retention.num - cohorts_size.num AS period_lag,
retention.period as period_label,
ROUND(retention.ids / cohorts_size.ids * 100, 2) AS retention , retention.ids AS rids
FROM retention
JOIN cohorts_size ON cohorts_size.cohort = retention.cohort
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2015-01-01'))
ORDER BY cohort, period_lag, period_label
You can visualize the result of the above query with the tool of your choice.
Note: you can use either period_lag or period_label
See the difference in their use in the examples below:
with period_lag
with period_label
I'm trying to find the best way (best as in performance) to take a data frame of the following form and add a new column called "Season" containing the corresponding one of the four seasons of the year:
MON DAY YEAR
1 1 1 2010
2 1 1 2010
3 1 1 2010
4 1 1 2010
5 1 1 2010
6 1 1 2010
One straightforward way to do this is to create a loop conditioned on the MON and DAY columns and assign the values one by one, but I think there is a better way. I've seen suggestions in other posts for ifelse or := or apply, but most of the problems discussed there are binary, or the value can be assigned by a single given function f of the parameters.
In my situation I believe a vector containing the four season labels plus the corresponding conditions would suffice, but I don't see how to put everything together. My situation resembles more of a switch/case.
Using modulo arithmetic and the fact that arithmetic operators coerce logical values to 0/1 will be far more efficient if the number of rows is large:
d$SEASON <- with(d, c( "Winter","Spring", "Summer", "Autumn")[
1+(( (DAY>=21) + MON-1) %/% 3)%%4 ] )
The added "1" shifts the range of the %%4 operation on all the results inside the parentheses from 0:3 to 1:4. The subtracted "1" shifts the (inner) 1:12 month range back to 0:11, and the (DAY >= 21) advances the boundary months forward by one.
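To convince yourself of the arithmetic, you can run the same index expression on a few boundary dates (the dates below are just illustrative):
chk <- data.frame(MON = c(1, 3, 3, 6, 12), DAY = c(15, 20, 21, 21, 21))
chk$SEASON <- with(chk, c("Winter", "Spring", "Summer", "Autumn")[
  1 + (((DAY >= 21) + MON - 1) %/% 3) %% 4])
chk
#   MON DAY SEASON
# 1   1  15 Winter
# 2   3  20 Winter
# 3   3  21 Spring
# 4   6  21 Summer
# 5  12  21 Winter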
I'll start by giving a simple answer then I'll delve into the details.
A quick way to do this would be to check the values of MON and DAY and output the correct season. This is trivial:
f=function(m,d){
if(m==12 && d>=21) i=3
else if(m>9 || (m==9 && d>=21)) i=2
else if(m>6 || (m==6 && d>=21)) i=1
else if(m>3 || (m==3 && d>=21)) i=0
else i=3
}
This f function, given a month and a day, will return an integer corresponding to the season (it doesn't matter much whether it's an integer or a string; an integer only saves a bit of memory, but that's a technicality).
Now you want to apply it to your data.frame. No need to use a loop for this; we'll use mapply. d will be our simulated data.frame. We'll factor the output to get nice season names.
d=data.frame(MON=rep(1:12,each=30),DAY=rep(1:30,12),YEAR=2012)
d$SEA=factor(
mapply(f,d$MON,d$DAY),
levels=0:3,
labels=c("Spring","Summer","Autumn","Winter")
)
There you have it!
I realize seasons don't always change on the 21st. If you need fine-tuning, you could define a 3-dimensional array as a global variable to store the exact days. Given a season and a year, you would look up the corresponding day and replace the "21"s in the f function with the right lookups (you would obviously add a third argument for the year).
About the things you mentioned in your question:
ifelse is the "functional" way to make a conditional test. On atomic variables it's only slightly better than conditional statements, but it is vectorized, meaning that if the argument is a vector, it will loop over its elements itself. I'm not that familiar with it, but it's the way to go for an optimized solution.
mapply is derived from sapply in the "apply family" and allows you to call a function with several arguments over vectors (see ?mapply).
I don't think := is a standard operator in R, which brings me to my next point:
data.table ! It's a package that provides a new structure that extends data.frame for fast computing and typing (among other things). := is an operator in that package and allows to define new columns. In our case you could write d[,SEA:=mapply(f,MON,DAY)] if d is a data.table.
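Spelled out, the data.table version would look roughly like this (a sketch; it assumes the f function defined above and that the data.table package is installed):
library(data.table)
dt <- as.data.table(d)
dt[, SEA := factor(mapply(f, MON, DAY),
                   levels = 0:3,
                   labels = c("Spring", "Summer", "Autumn", "Winter"))]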
If you really care about performance, I can't recommend data.table enough, as it is a major improvement when you have a lot of data. I don't know whether it would really affect computation time with the solution I proposed, though.
I'm building a report using Report Builder. It uses Report Application Pascal, which is based on Delphi Object Pascal. I'm still learning this and struggling with a variable value.
I have a variable called 'duration' which contains the following script:
value := round(ReportWizardQuery['wodFinishDate'] - ReportWizardQuery['wodCreateDate']);
This gives me the result I want. It calculates the total number of days between the two dates.
What I'm trying to do then is to use the value of this 'duration' variable to find out if jobs (which are defined by the start and end date) have been completed on the same day, within 1-5 days, 6-10 days, etc.
I've created columns with these headings and placed a variable in each column in the detail band of the report. The code I have written in the variable for 'same-day' is:
if (duration = 0) then
value := 1;
Likewise for jobs completed within 1-5 days:
if (duration > 0 and < 6) then
value := 1;
But the variables are blank when the report runs. I have tried assigning the value of the 'duration' variable to the same-day variable, and it returns a weird number which is the same for each line in the report (99468080, or 10150660, etc.). This number changes each time I run the report and always seems to be 8 digits long.
Does anybody have any idea what I'm doing wrong and how I can assign the value 1 to each variable when the duration variable is 0, or between 1 and 5, etc.?
Thanks.