Upsample data with mean - datetime

I am trying to upsample my datetime data and fill in the gap with a mean rather than forward or backward fill.
Sample df-
TIME VALUE
01:00 4
02:00 8
03:00 2
desired output-
TIME VALUE
01:00 4
01:30 6
02:00 8
02:30 5
03:00 2
Currently I do a straightforward resample('30min') and want to fill the NaN values
TIME VALUE
01:00 4
01:30 NaN
02:00 8
02:30 NaN
03:00 2
With the mean rather than backward or forward fill.

I figured out one way to solve the problem:
df2 = df.resample('30min', on='Time')
final = df2.interpolate(method='linear')
But I would be keen to look at other ways to do this apart from interpolation!
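One interpolation-free alternative, as a minimal sketch: average the forward-filled and backward-filled frames, so each single-row gap receives the mean of its two neighbours. The calendar date below is an assumption made for illustration, since the sample only shows times of day.

```python
import pandas as pd

# Sample data from the question; the date 2022-01-01 is a made-up placeholder.
df = pd.DataFrame({
    "TIME": pd.to_datetime(["2022-01-01 01:00", "2022-01-01 02:00", "2022-01-01 03:00"]),
    "VALUE": [4, 8, 2],
})

# Upsample to 30 minutes; the new rows hold NaN.
up = df.set_index("TIME").resample("30min").asfreq()

# Mean of the previous and next known value for every gap.
filled = (up.ffill() + up.bfill()) / 2
print(filled)
```

For gaps of exactly one row this matches linear interpolation; for longer gaps every missing row gets the same neighbour mean instead of a ramp.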


Power BI - Calculating Sum Between 2 Times with DAX

I have half-hourly consumption data and I need to calculate the sum of consumption that takes place between 2:30 AM and 5:00 AM.
I have achieved this in Excel with a SUMIF statement. How do I do this with DAX, though?
Assuming you have a table with columns similar to those in the sample table below, I've included the DAX to calculate the sum of consumption between the times given. This also assumes that you want to calculate this sum for ALL days between 2:30 AM and 5:00 AM.
Sample Table (Table)
id consumption timestamp
1 4 2022-05-28 02:00
2 4 2022-05-28 02:30
3 5 2022-05-28 03:00
4 5 2022-05-28 03:30
5 6 2022-05-28 04:00
6 6 2022-05-28 04:30
7 5 2022-05-28 05:00
8 5 2022-05-28 05:30
Solution Measure
Consumption Sum =
CALCULATE(
    SUM('Table'[consumption]),
    TIME(
        HOUR('Table'[timestamp]),
        MINUTE('Table'[timestamp]),
        SECOND('Table'[timestamp])
    ) >= TIMEVALUE("02:30:00"),
    TIME(
        HOUR('Table'[timestamp]),
        MINUTE('Table'[timestamp]),
        SECOND('Table'[timestamp])
    ) <= TIMEVALUE("05:00:00")
)
Sample Result
A similar result could be achieved using the SUMX function if that's more intuitive for you.
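As a cross-check outside Power BI, the same inclusive time-of-day filter can be sketched in pandas on the sample table. This is not part of the DAX solution; the variable names are made up for illustration.

```python
import pandas as pd

# The sample table from above: eight half-hourly readings on 2022-05-28.
table = pd.DataFrame({
    "id": range(1, 9),
    "consumption": [4, 4, 5, 5, 6, 6, 5, 5],
    "timestamp": pd.date_range("2022-05-28 02:00", periods=8, freq="30min"),
})

# between_time filters on time of day across all dates and includes
# both endpoints by default, like the >= and <= tests in the measure.
window = table.set_index("timestamp").between_time("02:30", "05:00")
consumption_sum = int(window["consumption"].sum())
print(consumption_sum)  # 4 + 5 + 5 + 6 + 6 + 5 = 31
```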

group a column by date with different formats

I have a dataset where one column has date and time values. Every date has multiple entries. The first row for every date has a date value in the form 29MAY2018_00:00:00.000000, while the rest of the rows for the same date have time values, e.g. 20:00 - 21:00. The problem is that I want to sum the values in another column for each day.
The sample data has the following format
Date A
29MAY2018_00:00:00.000000
20:00 - 21:00 0.009
21:00 - 22:00 0.003
22:00 - 23:00 0.0003
23:00 - 00:00 0
30MAY2018_00:00:00.000000
00:00 - 01:00 -0.0016
01:00 - 02:00 -0.0012
02:00 - 03:00 -0.0002
03:00 - 04:00 -0.0023
04:00 - 05:00 0
05:00 - 06:00 -0.0005
20:00 - 21:00 -0.0042
21:00 - 22:00 -0.0035
22:00 - 23:00 -0.0026
23:00 - 00:00 -0.001
I have created a new column
data$C[data$A ==0 ] <- 0
data$C[data$A < 0 ] <- -1
data$C[data$A > 0 ] <- 1
I need to sum the column 'C' for every date.
The output should be
A B
29-MAY-2019 4
30-MAY-2019 -9
31-MAY-2019 3
One option is to create a grouping column from the rows where 'Date' holds the full datetime format, take the first 'Date' of each group, convert it to Date class (with anydate from the anytime package), and sum the sign of 'A':
library(tidyverse)
library(anytime)
data %>%
group_by(grp = cumsum(str_detect(Date, "[A-Z]"))) %>%
summarise(Date = anydate(first(Date)),
B = sum(sign(A), na.rm = TRUE))
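The same grouping idea can be sketched in pandas for comparison. The frame below is an abbreviated, hand-typed excerpt of the sample data, and the names raw, grp and daily are made up for illustration; the header rows are assumed to have a missing value in 'A'.

```python
import numpy as np
import pandas as pd

# Abbreviated excerpt of the sample: header rows carry the date, A is missing there.
raw = pd.DataFrame({
    "Date": ["29MAY2018_00:00:00.000000", "20:00 - 21:00", "21:00 - 22:00", "23:00 - 00:00",
             "30MAY2018_00:00:00.000000", "00:00 - 01:00", "04:00 - 05:00", "20:00 - 21:00"],
    "A": [np.nan, 0.009, 0.003, 0.0, np.nan, -0.0016, 0.0, -0.0042],
})

# A row containing letters is a date header; the running count of headers
# labels every following row with its day.
grp = raw["Date"].str.contains(r"[A-Za-z]").cumsum().rename("grp")

daily = (
    raw.assign(sign=np.sign(raw["A"]))
       .groupby(grp)
       .agg(Date=("Date", "first"), B=("sign", "sum"))  # NaN signs are skipped
       .reset_index(drop=True)
)
# Keep only the ddMONyyyy part of the header and parse it as a date.
daily["Date"] = pd.to_datetime(daily["Date"].str[:9], format="%d%b%Y")
print(daily)
```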

time frequency in R

Good afternoon, colleagues! I have some problems with the following task: I need to plot a time-series graph using the parameter "frequency", which defines the time between two observations in my graph. The data are shown below:
date time open high low close
1 1999.04.08 11:00 1.0803 1.0817 1.0797 1.0809
2 1999.04.08 12:00 1.0808 1.0821 1.0806 1.0807
3 1999.04.08 13:00 1.0809 1.0814 1.0801 1.0813
4 1999.04.08 14:00 1.0819 1.0845 1.0815 1.0844
5 1999.04.08 15:00 1.0839 1.0857 1.0832 1.0844
6 1999.04.08 16:00 1.0842 1.0852 1.0824 1.0834
By default the frequency in this data is 1 hour, but I have two questions:
- How can I determine this frequency from the data automatically, so that it also works for other data? I tried selecting the time column and calculating frequency = time[2]-time[1], but I got an error.
- If the required frequency is 3 hours, how do I select the data at that frequency (in other words: the 1st observation, then the 4th, then the 7th, etc.)?
Thank you!
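No answer is recorded for this question. As a rough sketch of both steps in pandas rather than R, one could parse the date and time columns into timestamps, take the spacing between the first two rows, and keep every n-th row. The names quotes, stamp, step and every are made up, and only the close column of the sample is typed in.

```python
import pandas as pd

# Hand-typed excerpt of the sample data (date, time and close only).
quotes = pd.DataFrame({
    "date": ["1999.04.08"] * 6,
    "time": ["11:00", "12:00", "13:00", "14:00", "15:00", "16:00"],
    "close": [1.0809, 1.0807, 1.0813, 1.0844, 1.0844, 1.0834],
})

# Combine the text columns into real timestamps so they can be subtracted.
stamp = pd.to_datetime(quotes["date"] + " " + quotes["time"], format="%Y.%m.%d %H:%M")

# Spacing between the first two observations (assumes regular spacing).
step = stamp.iloc[1] - stamp.iloc[0]

# Number of rows that make up 3 hours, then every such row: 1st, 4th, 7th, ...
every = int(pd.Timedelta(hours=3) / step)
three_hourly = quotes.iloc[::every]
print(step, three_hourly, sep="\n")
```

Subtracting raw strings such as "12:00" - "11:00" fails in most languages, which may be the source of the error mentioned above; converting to a time type first avoids it.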

How to change the units of difftime after computing the values, and not using units="xxx" when performing the calculations

I am subtracting dates. In the following example
bb
FECHA_EFECTO_ESTADO FECHA_ANIVERSARIO_POLIZA
9 2015-11-05 09:49:00 2015-11-05
10 2015-11-05 09:51:00 2015-11-04
the columns are POSIXct date values. To create a new variable which is the difference of the other variables I could use:
library (dplyr)
bb<-aa<-mutate(bb, day1=abs(FECHA_EFECTO_ESTADO-FECHA_ANIVERSARIO_POLIZA))
bb
FECHA_EFECTO_ESTADO FECHA_ANIVERSARIO_POLIZA day1
1 2015-11-05 09:49:00 2015-11-05 9.816667 hours
2 2015-11-05 09:51:00 2015-11-04 33.850000 hours
By default, the units (days, hours, seconds) of the day1 variable depend on the size of the difference. If I want to have the difference in days, I could do:
bb<-mutate(bb, day2=abs (difftime(FECHA_EFECTO_ESTADO, FECHA_ANIVERSARIO_POLIZA, units="days" )))
bb
FECHA_EFECTO_ESTADO FECHA_ANIVERSARIO_POLIZA day1 day2
1 2015-11-05 09:49:00 2015-11-05 9.816667 hours 0.4090278 days
2 2015-11-05 09:51:00 2015-11-04 33.850000 hours 1.4104167 days
Is there a way to specify the units (days in this case) after doing the calculations? I might find further down the analysis that I would prefer to have the difference in hours, for instance, so:
How can I change the units of the day1 or day2 columns a posteriori?
Thanks
What difftime returns is an object of class "difftime".
Other functions have methods for difftime. For example, to convert to the number of hours, as a numeric:
as.numeric(difftime("2015-12-07", "2015-12-05"), units="hours")
[1] 48
Or, to get weeks:
as.numeric(difftime("2015-12-07", "2015-12-05"), units="weeks")
[1] 0.2857143
Also, you might find it useful to remember that POSIXct objects are actually numbers! They represent the number of seconds elapsed since ‘1970-01-01 00:00.00 UTC’, (assumes Gregorian Calendar).
As a result, it can often be convenient to think in a unit of time of fixed duration (i.e., not something like "month", which isn't constant) and perform calculations using it. E.g., once your time difference is in seconds, it's easy to convert to other constant time units like minutes, hours, days, or weeks. (Technically there is very fine-scale variation in some of these, but most applications don't require attention to such details.)
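The arithmetic behind that advice can be sketched in Python (standard library only, not R): take the difference once, reduce it to seconds, and divide by the length of whichever fixed unit is wanted. The dates are the ones used in the difftime examples above.

```python
from datetime import datetime, timedelta

# Same two dates as in the difftime examples.
diff = datetime(2015, 12, 7) - datetime(2015, 12, 5)

seconds = diff.total_seconds()     # 172800.0
hours = seconds / 3600             # 48.0
days = seconds / 86400             # 2.0
weeks = diff / timedelta(weeks=1)  # 2/7, about 0.2857143

print(seconds, hours, days, weeks)
```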
Now, thanks to #rbatt I have realised that, in my example, to change the units of variable day1, I can do bb$day1<- as.numeric(bb$day1, units="days")
And this will change the units to days:
FECHA_EFECTO_ESTADO FECHA_ANIVERSARIO_POLIZA day1 day2
9 2015-11-05 09:49:00 2015-11-05 9.816667 hours 0.4090278 days
10 2015-11-05 09:51:00 2015-11-04 33.850000 hours 1.4104167 days
bb$day1<- as.numeric(bb$day1, units="days")
bb
FECHA_EFECTO_ESTADO FECHA_ANIVERSARIO_POLIZA day1 day2
9 2015-11-05 09:49:00 2015-11-05 0.4090278 0.4090278 days
10 2015-11-05 09:51:00 2015-11-04 1.4104167 1.4104167 days

Combine timedelta and date column, group by time interval

I need to combine two separate columns to one datetime column.
The pandas dataframe looks as follows:
calendarid time_delta_actualdeparture actualtriptime
20140101 0 days 06:35:49.000020000 27.11666667
20140101 0 days 06:51:37.000020000 24.83333333
20140101 0 days 07:11:40.000020000 28.1
20140101 0 days 07:31:40.000020000 23.03333333
20140101 0 days 07:53:34.999980000 23.3
20140101 0 days 08:14:13.000020000 51.81666667
I would like to convert it to look like this:
calendarid actualtriptime
2014-01-01 6:30:00 mean of trip times in time interval
2014-01-01 7:00:00 mean of trip times in time interval
2014-01-01 7:30:00 mean of trip times in time interval
2014-01-01 8:00:00 mean of trip times in time interval
2014-01-01 8:30:00 mean of trip times in time interval
Essentially I would like to combine the two columns into one and then group into 30-minute time intervals, taking the mean of the actual trip time in each interval. I've unsuccessfully tried many techniques, but I am still learning python/pandas. Can anyone help me with this?
Convert your 'calendarid' column to a datetime and add the delta to get the starting times.
In [5]: df['calendarid'] = pd.to_datetime(df['calendarid'], format='%Y%m%d')
In [7]: df['calendarid'] = df['calendarid'] + df['time_delta_actualdeparture']
In [8]: df
Out[8]:
calendarid time_delta_actualdeparture actualtriptime
0 2014-01-01 06:35:49.000020 06:35:49.000020 27.116667
1 2014-01-01 06:51:37.000020 06:51:37.000020 24.833333
2 2014-01-01 07:11:40.000020 07:11:40.000020 28.100000
3 2014-01-01 07:31:40.000020 07:31:40.000020 23.033333
4 2014-01-01 07:53:34.999980 07:53:34.999980 23.300000
5 2014-01-01 08:14:13.000020 08:14:13.000020 51.816667
Then you can set your date column as the index and resample at a 30-minute frequency to get the mean over each interval.
In [19]: df.set_index('calendarid')[['actualtriptime']].resample('30min', label='right').mean()
Out[19]:
actualtriptime
calendarid
2014-01-01 07:00:00 25.975000
2014-01-01 07:30:00 28.100000
2014-01-01 08:00:00 23.166667
2014-01-01 08:30:00 51.816667
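Put together, the two steps can be sketched as a self-contained script. The frame is rebuilt by hand from the sample in the question, and the names start and means are made up for illustration.

```python
import pandas as pd

# Sample rows from the question, typed in by hand.
df = pd.DataFrame({
    "calendarid": [20140101] * 6,
    "time_delta_actualdeparture": pd.to_timedelta([
        "06:35:49.00002", "06:51:37.00002", "07:11:40.00002",
        "07:31:40.00002", "07:53:34.99998", "08:14:13.00002",
    ]),
    "actualtriptime": [27.11666667, 24.83333333, 28.1, 23.03333333, 23.3, 51.81666667],
})

# Date (parsed from the yyyymmdd integer) plus the time offset gives the departure timestamp.
start = pd.to_datetime(df["calendarid"].astype(str), format="%Y%m%d") + df["time_delta_actualdeparture"]

# Mean trip time per 30-minute bin; bins are labelled by their left edge by default.
means = df.set_index(start)["actualtriptime"].resample("30min").mean()
print(means)
```

With the default left labels the first bin is stamped 06:30, which matches the layout asked for in the question; passing label='right' shifts every stamp forward by 30 minutes, as in the output above.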
