Expand unbalanced data to monthly panel - r

I have a data set that looks like the following that I'd like to expand to a monthly panel data set.
ID | start_date | end_date | event_type
---|------------|----------|-----------
1  | 01/01/97   | 08/01/98 | 1
2  | 02/01/97   | 10/01/97 | 1
3  | 01/01/96   | 12/01/04 | 2
Some cases last longer than others. I've figured out how to expand the data to a yearly configuration by pulling out the year from each date and then using:
year <- ddply(df, c("ID"), summarize, year = seq(startyear, endyear))
followed by:
month <- ddply(year, c("ID"), summarize, month = seq(1, 12))
The problem with this approach is that it doesn't assign the correct number to each month (i.e. January = 1), so it doesn't play well with an event data set I'd eventually like to merge it with, matching on year, ID, and month. Help would be appreciated. Here is a direct link to the data set I am trying to expand (.xls): http://db.tt/KeLRCzr9. Hopefully I've included enough information, but please let me know if anything else is needed.

You could try something more like this:
ddply(df, .(ID), transform, dt = seq.Date(as.Date(start_date, "%m/%d/%y"), as.Date(end_date, "%m/%d/%y"), by = "month"))
There will probably be a lot of warnings having to do with the row names, and I can't guarantee that this will work, since the data set you link to does not match the example you provide. For starters, I'm assuming that you cleaned up the start and end dates, since they appear in various formats in the .xls file.

ddply(df, .(ID), summarize, dt = seq.Date(start_date, end_date, by = "month"))
This assumes start_date and end_date are already Date objects. Joran got me close, though, so again, thanks for the help on that.
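For readers working in pandas rather than plyr, the same monthly expansion can be sketched as follows. This is a sketch, not part of the accepted answer; the column names mirror the example above, and the two-digit-year date format is assumed:

```python
import pandas as pd

# Toy data mirroring the question (two-digit years, month/day/year).
df = pd.DataFrame({
    "ID": [1, 2],
    "start_date": ["01/01/97", "02/01/97"],
    "end_date": ["08/01/98", "10/01/97"],
})
df["start_date"] = pd.to_datetime(df["start_date"], format="%m/%d/%y")
df["end_date"] = pd.to_datetime(df["end_date"], format="%m/%d/%y")

# One row per month between start and end, with calendar month numbers
# (January = 1) suitable for merging on year, ID, and month.
panel = df.assign(
    dt=df.apply(lambda r: list(pd.date_range(r["start_date"], r["end_date"], freq="MS")), axis=1)
).explode("dt")
panel["dt"] = pd.to_datetime(panel["dt"])
panel["year"] = panel["dt"].dt.year
panel["month"] = panel["dt"].dt.month
```

The resulting panel has one row per ID per active month, which is exactly the shape needed for the merge described above.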

Related

Datetime conversion problems

I'm currently working in Stata on a dataset in which the year and quarter are given as 'YYYY QQ' in a string. I am trying to split this into year and quarter using the year() and quarter() functions. However, I keep getting a type error and have no idea why.
Those functions require a numeric argument, and in any case that argument should be a Stata daily date. There are various better ways forward for you. One is to use the split command with the destring option.
clear
set obs 1
gen given = "2022 3"
split given, destring
rename (given?) (year quarter)
You likely need a quarterly date anyway, and the function for that is quarterly().
gen wanted = quarterly(given, "YQ")
format wanted %tq
list
+----------------------------------+
| given year quarter wanted |
|----------------------------------|
1. | 2022 3 2022 3 2022q3 |
+----------------------------------+
See help datetime for basic documentation.
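Outside Stata, the same split-and-destring idea is just string handling; a minimal Python sketch for comparison (variable names are illustrative):

```python
given = "2022 3"

# Split the 'YYYY Q' string and convert ("destring") each part to an integer.
year_str, quarter_str = given.split()
year, quarter = int(year_str), int(quarter_str)

# A formatted label analogous to Stata's %tq display.
label = f"{year}q{quarter}"
```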

Pyspark GroupBy time span

I have data with a start and end date e.g.
+---+----------+------------+
| id| start| end|
+---+----------+------------+
| 1|2021-05-01| 2022-02-01|
| 2|2021-10-01| 2021-12-01|
| 3|2021-11-01| 2022-01-01|
| 4|2021-06-01| 2021-10-01|
| 5|2022-01-01| 2022-02-01|
| 6|2021-08-01| 2021-12-01|
+---+----------+------------+
I want a count, for each month, of how many observations were "active", in order to display that in a plot. By active I mean observations whose start and end dates span the given month. The result for the example data should look like this:
[Example of a plot for the active times]
I have looked into the pyspark Window function, but I don't think that can help me with my problem. So far my only idea is to specify an extra column for each month in the data and indicate whether the observation is active in that month and work from there. But I feel like there must be a much more efficient way to do this.
You can use the sequence SQL function: sequence builds the date range from a start, an end, and an interval, and returns it as a list.
Then you can use explode to flatten the list and count.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Make sure your Spark session is set to UTC: this SQL won't work well with
# a month interval if the timezone is set to a place that has daylight saving.
spark = (SparkSession
         .builder
         .config('spark.sql.session.timeZone', 'UTC')
         ... # other config
         .getOrCreate())

df = (df.withColumn('range', F.expr('sequence(to_date(`start`), to_date(`end`), interval 1 month)'))
        .withColumn('observation', F.explode('range')))
df = df.groupby('observation').count()
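The sequence-and-explode logic can be sanity-checked on a small sample without a Spark session; here is a pandas sketch of the same idea (column names as in the question), purely for verification:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "start": ["2021-05-01", "2021-10-01"],
    "end": ["2022-02-01", "2021-12-01"],
})

# Build the month range per row (like sequence), flatten it (like explode),
# then count how many observations cover each month.
df["range"] = df.apply(
    lambda r: list(pd.date_range(r["start"], r["end"], freq="MS")), axis=1)
counts = df.explode("range").groupby("range").size().rename("count")
```

Months covered by both rows (October through December 2021) come out with a count of 2, matching what the Spark version produces.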

Make the value as 0, if rows not available in Kusto

My query uses the count function to return counts of rows summarized by day. When there are no rows from that table, I get no result at all; instead I need rows for all days with a count of zero. I tried with coalesce but it didn't work. Any help is much appreciated!
Thanks!
Here is my query:
exceptions
| where name == 'my_scheduler' and timestamp > ago(30d)
| extend day = split(tostring(timestamp + 19800s), 'T')[0]
| summarize schedulerFailed = coalesce(count(),tolong("0")) by tostring(day)
Instead of summarize you need to use make-series which will fill the gaps with a default value for you.
exceptions
| where name == 'my_scheduler' and timestamp > ago(30d)
| extend day = split(tostring(timestamp + 19800s), 'T')[0]
| make-series schedulerFailed = count() default = 0 on todatetime(day) step 1d
You might want to add from and to to make-series in order for it to also fill gaps at the beginning and the end of the 30d period.
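The gap-filling that make-series performs is the same idea as reindexing a per-day summary over the full window; a pandas sketch with made-up counts, purely to illustrate the behavior:

```python
import pandas as pd

# Suppose only two days in the window had any failures.
counts = pd.Series([3, 1], index=pd.to_datetime(["2024-01-02", "2024-01-05"]))

# Reindex over every day in the window; missing days become 0,
# which is what make-series' default value does.
full_days = pd.date_range("2024-01-01", "2024-01-07", freq="D")
filled = counts.reindex(full_days, fill_value=0)
```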

Create event/date graph in R but you don't have event # calculated

How would you count the number of events that share a date (but not a time) so as to graph the frequency in a line graph?
For instance, my input data is below. I want to make a graph with DATE on the X axis and frequency on the Y axis. I am unsure how R would calculate that there are three date1 events, two date2 events, and one date3 event. Any support is appreciated!
DATE  | TIME
------|-----
date1 | xa
date1 | xb
date1 | xc
date2 | xd
date2 | xe
date3 | xf
dplyr is your answer. Do as follows; let's say your data is a data.frame called df:
library(dplyr)
by_date <- group_by(df, DATE)
by_date <- summarise(by_date, frequency = n())
Then you will be able to graph it; if you need help with the graph, let me know.
It would help if you provided reproducible data. However, look into the count function of the plyr package.
Assuming your data frame is called list (note that this masks the base list function, so a different name is safer), you can run something like the following to get the frequency:
library(plyr)
frequency = count(list, vars = "DATE")
You can then use frequency to create your graph.
Another option is data.table
library(data.table)
setDT(df)[, .(frequency = .N) , by = DATE]
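For comparison outside R, counting events per date is a one-liner in most languages; e.g. in Python with the toy DATE values above:

```python
from collections import Counter

# One entry per event; the TIME component is simply ignored.
dates = ["date1", "date1", "date1", "date2", "date2", "date3"]
freq = Counter(dates)
```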

How to include 'Time' in Date Hierarchy in Power BI

I am working on a report in Power BI. One of the tables in my data model collects sensor data. It has the following columns:
Serial (int) i.e. 123456789
Timestamp (datetime) i.e. 12/20/2016 12:04:23 PM
Reading (decimal) i.e. 123.456
A new record is added every few minutes, with the current reading from the sensor.
Power BI automatically creates a Hierarchy for the datetime column, which includes Year, Quarter, Month and Day. So, when you add a visual to your report, you can easily drill down to each of those levels.
I would like to include the "Time" part of the data in the hierarchy, so that you can drill down one more level after "Day", and see the detailed readings during that period.
I have already set up a Date table, using the CALENDARAUTO() function, added all of the appropriate columns, and related it to my Readings table in order to summarize the data by date - which works great. But it does not include the "Time" dimension.
I have looked at the following SO questions, but they didn't help:
Time-based drilldowns in Power BI powered by Azure Data Warehouse
Creating time factors in PowerBI
I also found this article, but it was confusing:
Power BI Date & Time Dimension Toolkit
Any ideas?
Thanks!
Unfortunately, I cannot comment on the previous answer, so I have to add this as a separate answer:
Yes, there is a way to automatically generate Date and Time-Tables. Here's some example code I use in my reports:
let
    Source = List.Dates(startDate, Duration.Days(DateTime.Date(DateTime.LocalNow()) - startDate) + 1, #duration(1,0,0,0)),
    convertToTable = Table.FromList(Source, Splitter.SplitByNothing(), {"Date"}, null, ExtraValues.Error),
    calcDateKey = Table.AddColumn(convertToTable, "DateKey", each Date.ToText([Date], "YYYYMMDD")),
    yearIndex = Table.AddColumn(calcDateKey, "Year", each Date.Year([Date])),
    monthIndex = Table.AddColumn(yearIndex, "MonthIndex", each Date.Month([Date])),
    weekIndex = Table.AddColumn(monthIndex, "WeekIndex", each Date.WeekOfYear([Date])),
    DayOfWeekIndex = Table.AddColumn(weekIndex, "DayOfWeekIndex", each Date.DayOfWeek([Date], 1)),
    DayOfMonthIndex = Table.AddColumn(DayOfWeekIndex, "DayOfMonthIndex", each Date.Day([Date])),
    Weekday = Table.AddColumn(DayOfMonthIndex, "Weekday", each Date.ToText([Date], "dddd")),
    setDataType = Table.TransformColumnTypes(Weekday, {{"Date", type date}, {"DateKey", type text}, {"Year", Int64.Type}, {"MonthIndex", Int64.Type}, {"WeekIndex", Int64.Type}, {"DayOfWeekIndex", Int64.Type}, {"DayOfMonthIndex", Int64.Type}, {"Weekday", type text}})
in
    setDataType
Just paste it into an empty query. The code uses a parameter called startDate, so you want to make sure you have something similar in place.
And here's the snippet for a time-table:
let
    Source = List.Times(#time(0,0,0), 1440, #duration(0,0,1,0)),
    convertToTable = Table.FromList(Source, Splitter.SplitByNothing(), {"DayTime"}, null, ExtraValues.Error),
    createTimeKey = Table.AddColumn(convertToTable, "TimeKey", each Time.ToText([DayTime], "HHmmss")),
    hourIndex = Table.AddColumn(createTimeKey, "HourIndex", each Time.Hour([DayTime])),
    minuteIndex = Table.AddColumn(hourIndex, "MinuteIndex", each Time.Minute([DayTime])),
    setDataType = Table.TransformColumnTypes(minuteIndex, {{"DayTime", type time}, {"TimeKey", type text}, {"HourIndex", Int64.Type}, {"MinuteIndex", Int64.Type}})
in
    setDataType
If you use the DateKey and TimeKey (as suggested in the first answer) in your fact table, you can easily generate the date/time hierarchy by simply putting the time element below the date element in the visualization, like this:
[date-time-hierarchy screenshot]
You will want separate date & time tables. You don't want to put the time into the date table, because the time is repeated every day.
A Time dimension follows the same principle as a Date dimension, except that instead of a row for every day, you have a row for every minute or every second (depending on how exact you want to be; I wouldn't recommend including seconds unless you absolutely need them, as that greatly increases the number of rows and impacts performance). There is no reference to the date in the time table.
E.g.
Time | Time Text| Hour | Minute | AM/PM
---------|----------|------|--------|------
12:00 AM | 12:00 AM | 12 | 00 | AM
12:01 AM | 12:01 AM | 12 | 01 | AM
12:02 AM | 12:02 AM | 12 | 02 | AM
... | ... | ... | ... | ...
I include a time/text column since Power BI has a habit of adding a date from 1899 to time data types. You can add other columns if they'd be helpful to you too.
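The shape of such a minute-level time table (1,440 rows, no date component) can be sketched in plain Python; the column names here are illustrative, not a Power BI requirement:

```python
from datetime import time

# One row per minute of the day, with the attributes from the example table.
rows = [
    {
        "time": time(h, m),
        "time_text": time(h, m).strftime("%I:%M %p"),
        "hour": h,
        "minute": m,
        "am_pm": "AM" if h < 12 else "PM",
    }
    for h in range(24)
    for m in range(60)
]
```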
In your fact table, you'll want to split your datetime column into separate date & time columns, so that you can join the date to the date table & the time to the time table. The time will likely need to be converted to the nearest round minute or second so that every time in your data corresponds to a row in your time table.
It's worth keeping but hiding the original datetime field in your data in case you later want to calculate durations that span days.
In Power BI, you'd add the time attribute (or the hour (and minute) attribute) under the month/day attributes on your axis to make a column chart that can be drilled from year > quarter > month > day > hour > minute. Power BI doesn't care that the attributes come from different tables.
You can read more about time dimensions here: http://www.kimballgroup.com/2004/02/design-tip-51-latest-thinking-on-time-dimension-tables/
Hope this helps.
My approach was to create a new column with the following formula:
<new-column-name>=Format([<your-datetime-column>],"hh:mm:ss")
This will create a new column and now you can select it with your-datetime-column to create a drill-down effect.
I created a new custom column, set its formula to [Timestamp], and changed the type to datetime.
#"Added Custom" = Table.AddColumn(#"Added Conditional Column16", "TestTimestamp", each [Timestamp]),
#"Changed Type" = Table.TransformColumnTypes(#"Added Custom",{{"TestTimestamp", type datetime}}),
