Extract the last 35 days of data prior to the last 7 days in Teradata

I want to extract the last 35 days prior to the 7 days before the max date in the dataset. The analysis is in days, not weeks. For example: the max date is 10 March 2022, so I want the date range to be between 29 Jan and 3 March. I want to repeat this every 3 months and automate it, instead of rewriting the code every time.
Example data
date ids
23/01/2022 12345
29/01/2022 13452
30/01/2022 21345
25/01/2022 13482
24/01/2022 12245
04/02/2022 13052
06/02/2022 11145
10/02/2022 13452
06/02/2022 12945
07/02/2022 13222
09/02/2022 14345
11/02/2022 13452
12/02/2022 12245
12/02/2022 13432
13/02/2022 13455
12/02/2022 12344
03/03/2022 13452
06/02/2022 12310
08/02/2022 17893
09/03/2022 10987
10/03/2022 11346
Expected Output
date ids
29/01/2022 13452
30/01/2022 21345
04/02/2022 13052
06/02/2022 11145
10/02/2022 13452
06/02/2022 12945
07/02/2022 13222
09/02/2022 14345
11/02/2022 13452
12/02/2022 12245
12/02/2022 13432
13/02/2022 13455
12/02/2022 12344
03/03/2022 13452
06/02/2022 12310
08/02/2022 17893
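A minimal sketch of the idea in Teradata SQL, assuming a table named trx with a DATE column trx_date and a column ids (both names are placeholders), and noting that the day offsets (7, and 42 = 7 + 35) may need adjusting to reproduce the exact inclusive window in the example:
-- Teradata supports integer day arithmetic on DATE values
SELECT t.trx_date, t.ids
FROM trx t
CROSS JOIN (SELECT MAX(trx_date) AS max_dt FROM trx) m  -- anchor to the latest date
WHERE t.trx_date BETWEEN m.max_dt - 42   -- 35 days before the 7-day cutoff
                     AND m.max_dt - 7;   -- 7 days before the max date
Because the window is anchored to MAX(trx_date), rerunning the query every 3 months requires no edits; wrapping it in a Teradata macro (CREATE MACRO ... AS (...); executed with EXEC) is one way to automate it.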

Related

Using a loop to calculate bearings over a list

My goal is to apply the geosphere::bearing function to a very large data frame. Because the data frame concerns multiple individuals, I split it into a list using the split function.
I have seen lists and for loops used in the past, but I have no experience with them.
Below is a fraction of my dataset. I have split the data frame by ID into a list with 43 elements, and I have attached long and lat in WGS84 to the initial data frame.
ID Date Time Datetime Long Lat x y
10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001 -91.72272 46.35156
10_17 4/20/2017 6:00:00 4/20/2017 6:00 383409 5179885 -91.7044 46.34891
10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960 -91.72297 46.35134
10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918 -91.72298 46.35134
10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110 -91.7242 46.34506
10_24 4/24/2017 1:00:00 4/24/2017 1:00 383647.4 5180009 -91.72515 46.34738
10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872 -91.7184 46.32236
10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353 -91.65361 46.34712
10_36 4/27/2017 20:00:00 4/27/2017 20:00 382521.9 5175266 -91.66127 46.3485
10_36 4/29/2017 11:01:00 4/29/2017 11:01 383443.8 5179909 -91.70303 46.35451
10_36 4/30/2017 0:00:00 4/30/2017 0:00 383060.8 5178361 -91.6685 46.32941
10_40 4/30/2017 13:02:00 4/30/2017 13:02 383426.3 5179873 -91.70263 46.35481
10_40 5/2/2017 17:02:00 5/2/2017 17:02 383393.7 5179883 -91.67099 46.34138
10_40 5/3/2017 6:01:00 5/3/2017 6:01 382875.8 5179376 -91.66324 46.34763
10_88 5/3/2017 19:02:00 5/3/2017 19:02 383264.3 5179948 -91.73075 46.3684
10_88 5/4/2017 8:01:00 5/4/2017 8:01 378554.4 5181966 -91.70413 46.35429
10_88 5/4/2017 21:03:00 5/4/2017 21:03 379830.5 5177232 -91.66452 46.37274
I then try this function
library(geosphere)
library(sf)
library(magrittr)

dis_list <- split(data, data$ID)
answer <- lapply(dis_list, function(df) {
  start <- df[-1, c("x", "y")] %>%
    st_as_sf(coords = c("x", "y"))
  end <- df[-nrow(df), c("x", "y")] %>%
    st_as_sf(coords = c("x", "y"))
  angles <- geosphere::bearing(start, end)
  df$angles <- c(NA, angles)
  df
})
answer
answer
which gives the error
Error in .pointsToMatrix(p1) :
'list' object cannot be coerced to type 'double'
A Google search on "pass sf points to geosphere bearings" brings up this GIS SE answer that addresses the issue, which I would characterize as "how to extract numeric vectors from items that are sf-classed POINTs": https://gis.stackexchange.com/questions/416316/compute-east-west-or-north-south-orientation-of-polylines-sf-linestring-in-r (The error arises because an sf object stores its geometry as a list column, which geosphere cannot coerce to a numeric matrix.)
I needed to work with a single section first, and then apply the lessons from @Spacedman to this task:
> st_coordinates( st_as_sf(dis_list[[1]], coords = c('x', 'y')) )
X Y
1 -91.72272 46.35156
2 -91.70440 46.34891
3 -91.72297 46.35134
4 -91.72420 46.34506
5 -91.65361 46.34712
So st_coordinates will extract the POINT-classed values into a two-column matrix that can then be passed to geosphere::bearing:
dis_list <- split(data, data$ID)
answer <- lapply(dis_list, function(df) {
  start <- df[-1, c("x", "y")] %>%
    st_as_sf(coords = c("x", "y")) %>% st_coordinates()
  end1 <- df[-nrow(df), c("x", "y")] %>%
    st_as_sf(coords = c("x", "y")) %>% st_coordinates()
  angles <- geosphere::bearing(start, end1)
  df$angles <- c(NA, angles)
  df
})
answer
#------------------------
$`10_17`
ID Date Time date time Long Lat x y
1 10_17 4/18/2017 15:02:00 4/18/2017 15:02 379800.5 5181001 -91.72272 46.35156
2 10_17 4/20/2017 6:00:00 4/20/2017 6:00 383409.0 5179885 -91.70440 46.34891
3 10_17 4/21/2017 21:02:00 4/21/2017 21:02 383191.2 5177960 -91.72297 46.35134
5 10_17 4/23/2017 12:01:00 4/23/2017 12:01 378582.5 5182110 -91.72420 46.34506
8 10_17 4/26/2017 18:02:00 4/26/2017 18:02 380691.9 5179353 -91.65361 46.34712
Datetime angles
1 4/18/2017 15:02 NA
2 4/20/2017 6:00 -78.194383
3 4/21/2017 21:02 100.694352
5 4/23/2017 12:01 7.723513
8 4/26/2017 18:02 -92.387473
$`10_24`
ID Date Time date time Long Lat x y
4 10_24 4/22/2017 10:03:00 4/22/2017 10:03 383448.6 5179918 -91.72298 46.35134
6 10_24 4/24/2017 1:00:00 4/24/2017 1:00 383647.4 5180009 -91.72515 46.34738
7 10_24 4/25/2017 16:01:00 4/25/2017 16:01 383407.9 5179872 -91.71840 46.32236
Datetime angles
4 4/22/2017 10:03 NA
6 4/24/2017 1:00 20.77910
7 4/25/2017 16:01 -10.58228
$`10_36`
ID Date Time date time Long Lat x y
9 10_36 4/27/2017 20:00:00 4/27/2017 20:00 382521.9 5175266 -91.66127 46.34850
10 10_36 4/29/2017 11:01:00 4/29/2017 11:01 383443.8 5179909 -91.70303 46.35451
11 10_36 4/30/2017 0:00:00 4/30/2017 0:00 383060.8 5178361 -91.66850 46.32941
Datetime angles
9 4/27/2017 20:00 NA
10 4/29/2017 11:01 101.72602
11 4/30/2017 0:00 -43.60192
$`10_40`
ID Date Time date time Long Lat x y
12 10_40 4/30/2017 13:02:00 4/30/2017 13:02 383426.3 5179873 -91.70263 46.35481
13 10_40 5/2/2017 17:02:00 5/2/2017 17:02 383393.7 5179883 -91.67099 46.34138
14 10_40 5/3/2017 6:01:00 5/3/2017 6:01 382875.8 5179376 -91.66324 46.34763
Datetime angles
12 4/30/2017 13:02 NA
13 5/2/2017 17:02 -58.48235
14 5/3/2017 6:01 -139.34297
$`10_88`
ID Date Time date time Long Lat x y
15 10_88 5/3/2017 19:02:00 5/3/2017 19:02 383264.3 5179948 -91.73075 46.36840
16 10_88 5/4/2017 8:01:00 5/4/2017 8:01 378554.4 5181966 -91.70413 46.35429
17 10_88 5/4/2017 21:03:00 5/4/2017 21:03 379830.5 5177232 -91.66452 46.37274
Datetime angles
15 5/3/2017 19:02 NA
16 5/4/2017 8:01 -52.55217
17 5/4/2017 21:03 -123.91920
The help page for st_coordinates characterizes its function as "retrieve coordinates in matrix form".
Given that the data is already in longitude and latitude form, simply using bearing(data[, c("Long", "Lat")]) and distGeo(data[, c("Long", "Lat")]) from geosphere on the split data frames will work. There is no need to create separate start and end points.
library(geosphere)

dfs <- split(data, data$ID)
answer <- lapply(dfs, function(df) {
  df$distances <- c(distGeo(df[, c("Long", "Lat")]))
  df$bearings <- c(bearing(df[, c("Long", "Lat")]))
  df
})
answer
The sf package is useful for converting between coordinate systems, but with the data set above that step can be skipped. I find the geosphere package more straightforward and simpler to use.

How can I make specific periods for each account separately starting from their first transaction date?

This is my transaction data:
id from_id to_id amount date_trx
<fctr> <fctr> <fctr> <dbl> <date>
0 7468 5695 700.0 2005-01-04
1 6213 9379 11832.0 2005-01-08
2 7517 8170 1000.0 2005-01-10
3 6143 9845 4276.0 2005-01-12
4 6254 9640 200.0 2005-01-14
5 6669 5815 200.0 2005-01-20
6 6934 8583 49752.0 2005-01-24
7 9240 8314 19961.0 2005-01-26
8 6374 8865 1000.0 2005-01-30
9 6143 6530 13.4 2005-01-31
...
I want to build new features based on time intervals.
Let's look at this:
id from_id to_id amount date_trx
<fctr> <fctr> <fctr> <dbl> <date>
149431 5370 5735 1000.0 2007-03-24
157403 5370 7058 3679.0 2007-04-13
158831 5370 8667 12600.0 2007-04-23
162680 5370 6053 19.2 2007-04-30
167082 5370 8165 3679.0 2007-05-13
168562 5370 5656 2100.0 2007-05-23
172578 5370 5929 79.0 2007-05-31
177507 5370 6725 3679.0 2007-06-01
179167 5370 8433 200.0 2007-06-22
183499 5370 7022 100.6 2007-06-29
...
Let's say I want to calculate the amount of money transacted in, for example, week-long periods for each account.
So, starting from 2007-03-24, 5370's weekly transaction amount history is as follows:
in the 1st week(2007-03-24 - 2007-03-31): 1000.0
in the 2nd week(2007-03-31 - 2007-04-07): 0.0
in the 3rd week(2007-04-07 - 2007-04-14): 3679.0
in the 4th week(2007-04-14 - 2007-04-21): 0.0
in the 5th week(2007-04-21 - 2007-04-28): 12600.0
in the 6th week(2007-04-28 - 2007-05-05): 19.2
in the 7th week(2007-05-05 - 2007-05-12): 0.0
in the 8th week(2007-05-12 - 2007-05-19): 3679.0
in the 9th week(2007-05-19 - 2007-05-26): 2100.0
in the 10th week(2007-05-26 - 2007-06-02): 79.0 + 3679.0 = 3758.0
in the 11th week(2007-06-02 - 2007-06-09): 0.0
in the 12th week(2007-06-09 - 2007-06-16): 0.0
in the 13th week(2007-06-16 - 2007-06-23): 200.0
in the 14th week(2007-06-23 - 2007-06-30): 100.6
Here we see that the max amount 5370 transacted in a week period is 12600.0. So, now I want to see this measure as a feature, say max_of_weekly_transacted_amount.
Similarly, I want to calculate, say, the mean amount transacted in a month period for each account and store it as another feature, say mean_of_monthly_transacted_amount.
I tried the lubridate function floor_date:
# Max of weekly transaction amount
# Max of weekly transaction amount
data <- data %>%
  group_by(date_trx_week = floor_date(date_trx, "week"), from_id) %>%
  mutate(weekly_trx = sum(amount)) %>%
  group_by(from_id) %>%
  mutate(max_of_weekly_transacted_amount = max(weekly_trx)) %>%
  select(-c(date_trx_week, weekly_trx))

# Mean of monthly transaction amount
data <- data %>%
  group_by(date_trx_month = floor_date(date_trx, "month"), from_id) %>%
  mutate(monthly_trx = sum(amount)) %>%
  group_by(from_id) %>%
  mutate(mean_of_monthly_transacted_amount = mean(monthly_trx)) %>%
  select(-c(date_trx_month, monthly_trx))
The date variable date_trx in my data starts at 2005-01-01 and ends at 2010-12-31. floor_date starts week periods with 2005-01-02 - 2005-01-09, continues with 2005-01-09 - 2005-01-16, and so on. It starts month periods with 2005-01-01 - 2005-02-01, continues with 2005-02-01 - 2005-03-01, and so on. And it uses the same periods for every account.
But I want to make the periods specific to each account, based on its first transaction date. For from_id = 5370 the first transaction date is 2007-03-24, so its week periods would be 2007-03-24 - 2007-03-31, 2007-03-31 - 2007-04-07, and so on, and its month periods would be 2007-03-24 - 2007-04-24, 2007-04-24 - 2007-05-24, and so on.
For another account the periods would be different. So, how can I achieve this? How can I make specific periods for each account separately, starting from their first transaction date?
The following is all in base R. First, the custom weeks: this takes the difference in days between each transaction and the account's first transaction, then integer-divides by 7. Hopefully someone can use this approach to build you a more efficient data.table solution.
df$Week <- unsplit(
  tapply(df$date_trx, df$from_id, function(x) as.numeric(x - x[1]) %/% 7),
  df$from_id
)
Then, if you want to take, say, the mean by ID by week, you can
aggregate(amount ~ from_id + Week, df, mean)
from_id Week amount
1 5370 0 1000.0
2 6143 0 4276.0
3 6213 0 11832.0
4 6254 0 200.0
5 6374 0 1000.0
6 6669 0 200.0
7 6934 0 49752.0
8 7468 0 700.0
9 7517 0 1000.0
10 9240 0 19961.0
11 5370 2 3679.0
12 6143 2 13.4
13 5370 4 12600.0
14 5370 5 19.2
15 5370 7 3679.0
16 5370 8 2100.0
17 5370 9 1879.0
18 5370 12 200.0
19 5370 13 100.6
Of course you can replace mean with max or any other function.
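The question also asks for account-specific month periods. A minimal sketch of the same idea for months, assuming date_trx is a Date column and using lubridate's interval division (the Month numbering, like Week above, starts at 0):
library(lubridate)
# Whole months elapsed since each account's first transaction
df$Month <- unsplit(
  tapply(df$date_trx, df$from_id,
         function(x) interval(x[1], x) %/% months(1)),
  df$from_id
)
aggregate(amount ~ from_id + Month, df, mean)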

How can I filter specifically for certain months if the days are not the same in each year?

This is probably a very simple question that has been asked already, but...
I have a data frame that I constructed from a CSV file generated in Excel. The observations are not homogeneously sampled, i.e., they are for "On Peak" times of electricity usage, which means they exclude different days each year. I have 20 years of data (1993-2012) and am running both non-robust and robust LOESS to extract seasonal and linear trends.
After the decomposition has been done, I want to focus only on the observations from June through September.
How can I create a new data frame of just those results?
Sorry about the formatting, too.
Date MaxLoad TMAX
1 1993-01-02 2321 118.6667
2 1993-01-04 2692 148.0000
3 1993-01-05 2539 176.0000
4 1993-01-06 2545 172.3333
5 1993-01-07 2517 177.6667
6 1993-01-08 2438 157.3333
7 1993-01-09 2302 152.0000
8 1993-01-11 2553 144.3333
9 1993-01-12 2666 146.3333
10 1993-01-13 2472 177.6667
As Joran notes, you don't need anything other than base R:
## Reproducible data
df <- data.frame(
  Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by = "month"),
  MaxLoad = floor(runif(25, 2000, 3000)),
  TMAX = runif(25, 100, 200)
)
## One option
df[months(df$Date) %in% month.name[6:9], ]
# Date MaxLoad TMAX
# 4 2009-06-15 2160 188.4607
# 5 2009-07-15 2151 164.3946
# 6 2009-08-15 2694 110.4399
# 7 2009-09-15 2460 150.4076
# 16 2010-06-15 2638 178.8341
# 17 2010-07-15 2246 131.3283
# 18 2010-08-15 2483 112.2635
# 19 2010-09-15 2174 160.9724
## Another option: strftime() will be more generally useful than months(),
## whose output is locale-dependent (matching against month.name can fail
## in non-English locales)
df[as.numeric(strftime(df$Date, "%m")) %in% 6:9, ]

Intraday high/low clustering

I am attempting to perform a study on the clustering of high/low points based on time. I obtained daily highs and lows by applying to.daily to the intraday data, and merged the daily series back onto the intraday series using:
intraday.merge <- merge(intraday,daily)
intraday.merge <- na.locf(intraday.merge)
intraday.merge <- intraday.merge["T08:30:00/T16:30:00"] # remove record at 00:00:00
Next, I tried to obtain the records where the high == daily high or the low == daily low using:
intradayhi <- intraday.merge[intraday.merge$High == intraday.merge$Daily.High]
intradaylo <- intraday.merge[intraday.merge$Low == intraday.merge$Daily.Low]
Resulting data resembles the following:
Open High Low Close Volume Daily.Open Daily.High Daily.Low Daily.Close Daily.Volume
2012-06-19 08:45:00 258.9 259.1 258.5 258.7 1424 258.9 259.1 257.7 258.7 31523
2012-06-20 13:30:00 260.8 260.9 260.6 260.6 1616 260.4 260.9 259.2 260.8 35358
2012-06-21 08:40:00 260.7 260.8 260.4 260.5 493 260.7 260.8 257.4 258.3 31360
2012-06-22 12:10:00 255.9 256.2 255.9 256.1 626 254.5 256.2 253.9 255.3 50515
2012-06-22 12:15:00 256.1 256.2 255.9 255.9 779 254.5 256.2 253.9 255.3 50515
2012-06-25 11:55:00 254.5 254.7 254.4 254.6 1589 253.8 254.7 251.5 253.9 65621
2012-06-26 08:45:00 253.4 254.2 253.2 253.7 5849 253.8 254.2 252.4 253.1 70635
2012-06-27 11:25:00 255.6 256.0 255.5 255.9 973 251.8 256.0 251.8 255.2 53335
2012-06-28 09:00:00 257.0 257.3 256.9 257.1 601 255.3 257.3 255.0 255.1 23978
2012-06-29 13:45:00 253.0 253.4 253.0 253.4 451 247.3 253.4 246.9 253.4 52539
There are duplicated results in the subset; how do I keep only the first record of each day? I would then be able to plot the count of records for periods in the day.
Also, are there alternative methods to get the results I want? Thanks in advance.
Edit:
Sample output should look like this; the count could be either the first result per day or aggregated (more than one occurrence in that day):
Time Count
08:40:00 60
08:45:00 54
08:50:00 60
...
14:00:00 20
14:05:00 12
14:10:00 30
You can get the first observation of each day (where x is one of the subsets above, e.g. intradayhi) via:
y <- apply.daily(x, first)
Then you can simply count the rows by hour and minute:
z <- aggregate(1:NROW(y), by = list(Time = format(index(y), "%H:%M")), length)
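For the aggregated variant, counting every matching bar rather than just the first per day, a minimal sketch with base table(), assuming intradayhi is the xts subset built above:
# Count all matching records per intraday time bucket
counts <- table(format(index(intradayhi), "%H:%M"))
z <- data.frame(Time = names(counts), Count = as.integer(counts))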

Aggregating daily data using quantmod 'to.weekly' function creates weekly data ending on Monday not Friday

I am trying to aggregate daily share price data (close only) to weekly share price data using the to.weekly function in quantmod. The xts object foo holds daily share prices for a stock, starting on Monday 3 January 2011 and ending on 20 September 2011. To aggregate this daily data I used:
tmp <- to.weekly(foo)
The above approach succeeds, in that tmp now holds a series of weekly OHLC data points, as per the quantmod docs. The problem is that the series begins on Monday 3 January 2011 and each subsequent week also begins on a Monday, e.g. Monday 10 January, Monday 17 January, and so on. I had expected the weeks to default to ending on Friday, so that the weekly series started on Friday 7 January and ended on Friday 16 September.
I have experimented with adjusting the start and end of the data and with passing 'endof' or 'startof' to the indexAt parameter, but I cannot get it to return weeks ending on Friday.
I am grateful for any insights received.
(Sorry, I could not find a way to attach a dput file, so the data appears below.)
foo:
2011-01-03 2802
2011-01-04 2841
2011-01-05 2883
2011-01-06 2948
2011-01-07 2993
2011-01-10 2993
2011-01-11 3000
2011-01-12 3000
2011-01-13 3025
2011-01-14 2970
2011-01-17 2954
2011-01-18 2976
2011-01-19 2992
2011-01-20 2966
2011-01-21 2940
2011-01-24 2969
2011-01-25 2996
2011-01-26 2982
2011-01-27 3035
2011-01-28 3075
2011-01-31 3020
tmp:
foo.Open foo.High foo.Low foo.Close
2011-01-03 2802 2802 2802 2802
2011-01-10 2841 2993 2841 2993
2011-01-17 3000 3025 2954 2954
2011-01-24 2976 2992 2940 2969
2011-01-31 2996 3075 2982 3020
I've come up with something yielding only Close values; perhaps it can be hacked further to return an OHLC series.
Assuming that foo is an xts object, first we create a logical vector marking Fridays:
fridays = as.POSIXlt(time(foo))$wday == 5
Then we take the row positions of those Fridays and prepend 0:
indx <- c(0, which(fridays))
And use period.apply:
period.apply(foo, INDEX=indx, FUN=last)
Result:
[,1]
2011-01-07 2993
2011-01-14 2970
2011-01-21 2940
2011-01-28 3075
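Pushing the "hacked further" idea one step toward OHLC: period.apply accepts a function returning a vector, so one sketch (foo and indx are as above; the column naming is illustrative, not guaranteed) is:
# One OHLC row per Friday-to-Friday week from the single close series
ohlc <- period.apply(foo, INDEX = indx, FUN = function(w)
  c(Open = as.numeric(first(w)), High = max(w),
    Low = min(w), Close = as.numeric(last(w))))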
For Fridays (with occasional Thursdays due to market closures), use:
tmp = to.weekly(foo, indexAt = "endof")
For Mondays (with occasional Tuesdays due to market closures), use:
tmp = to.weekly(foo, indexAt = "startof")
Or you can create a custom vector of Dates that contains the date to be associated with each week. For instance, to force every week to be associated with Friday regardless of market closures:
customIdx = seq(from = as.Date("2011-01-07"), by = 7, length.out = nrow(tmp))
index(tmp) = customIdx
