Fill Lagged Values Down in R

I am trying to use a combination of conditional lagging and then filling values down by group. In my data, I have old_price and new_price. The new_price must always be lower than old_price. Whenever new_price is greater than old_price, I would like to lag back to the most recent value where new_price was less than old_price. In the case of Raleigh, rows 2 and 3 should lag back to 36.00. Row 4 should not lag back since new_price is already lower than old_price. When I have tried using lag, it has been applying it to row 2 (where the price is 52), but then leaving row 3 as 54.00. I would like row 3 to also lag from row 1, or from row 2 once it has the correct value.
Here is my data:
city sku month year old_price new_price
Raleigh 001 Dec 2021 45.00 36.00
Raleigh 001 Jan 2022 45.00 52.00
Raleigh 001 Feb 2022 45.00 54.00
Raleigh 001 Mar 2022 45.00 37.00
Austin 002 Dec 2021 37.50 30.00
Austin 002 Jan 2022 37.50 32.00
Austin 002 Feb 2022 37.50 48.00
Desired output:
city sku month year old_price new_price
Raleigh 001 Dec 2021 45.00 36.00
Raleigh 001 Jan 2022 45.00 36.00
Raleigh 001 Feb 2022 45.00 36.00
Raleigh 001 Mar 2022 45.00 37.00
Austin 002 Dec 2021 37.50 30.00
Austin 002 Jan 2022 37.50 32.00
Austin 002 Feb 2022 37.50 32.00

One approach is to convert values where new_price > old_price to NA and then fill down within each city.
library(dplyr)
library(tidyr)
df %>%
  group_by(city) %>%   # fill only within each city
  mutate(new_price = if_else(new_price > old_price, NA_real_, new_price)) %>%
  fill(new_price) %>%
  ungroup()
Output:
city sku month year old_price new_price
1 Raleigh 1 Dec 2021 45.0 36
2 Raleigh 1 Jan 2022 45.0 36
3 Raleigh 1 Feb 2022 45.0 36
4 Raleigh 1 Mar 2022 45.0 37
5 Austin 2 Dec 2021 37.5 30
6 Austin 2 Jan 2022 37.5 32
7 Austin 2 Feb 2022 37.5 32
Data:
df <- read.table(textConnection("city sku month year old_price new_price
Raleigh 001 Dec 2021 45.00 36.00
Raleigh 001 Jan 2022 45.00 52.00
Raleigh 001 Feb 2022 45.00 54.00
Raleigh 001 Mar 2022 45.00 37.00
Austin 002 Dec 2021 37.50 30.00
Austin 002 Jan 2022 37.50 32.00
Austin 002 Feb 2022 37.50 48.00"), header = TRUE)

Related

Rolling data for a 12-month period

I want to show the last 12 months, and each of those months should show the sum of the preceding 12 months: January 2022 shows the sum of January 2021 through January 2022, February 2022 the sum of February 2021 through February 2022, and so on.
(Screenshots of the current data and the expected result are not reproduced here.)
I'm new to Kusto; it seems I need to use pivot with the prev function, but the month periods are a bit confusing.
If you know for sure that you have data for every month, the following will do the trick; if not, the solution gets a bit more complicated.
The idea is to create an accumulated-sum column and then match each month's accumulated sum with that of the same month in the previous year.
The difference between the two is the sum of the last 12 months.
// Data sample generation. Not part of the solution.
let t = materialize(range i from 1 to 10000 step 1 | extend dt = ago(365d*5*rand()) | summarize val = count() by year = getyear(dt), month = getmonth(dt));
// Solution starts here.
t
| order by year asc, month asc
| extend cumsum_val = row_cumsum(val) - val, prev_year = year - 1
| as t2
| join kind=inner t2 on $left.prev_year == $right.year and $left.month == $right.month
| project year, month = format_datetime(make_datetime(year,month,1),'MM') , last_12_cumsum_val = cumsum_val - cumsum_val1
| evaluate pivot(month, any(last_12_cumsum_val), year)
| order by year asc
(Output: a table with one row per year and columns 01–12 holding the trailing-12-month sums; the exact values depend on the randomly generated sample.)
Another option is to follow the sliding-window-aggregations sample from the Kusto documentation:
let t = materialize(range i from 1 to 10000 step 1 | extend dt = ago(365d*5*rand()) | summarize val = count() by year = getyear(dt), month = getmonth(dt) | extend Date = make_datetime(year, month, 1));
let window_months = 12;
t
| extend _bin = startofmonth(Date)
| extend _range = range(1, window_months, 1)
| mv-expand _range to typeof(long)
| extend end_bin = datetime_add("month", _range, Date)
| extend end_month = format_datetime(end_bin, "MM"), end_year = datetime_part("year", end_bin)
| summarize sum(val), count() by end_year, end_month
| where count_ == 12
| evaluate pivot(end_month, take_any(sum_val), end_year)
| order by end_year asc
(Output: a table with one row per end_year and columns 01–12 holding the 12-month sums; the exact values depend on the randomly generated sample.)

Confused on percent difference calculations in R using dplyr::mutate

I'm attempting to find the percent differences of state characteristics (using an index created with factor analysis) between the years 2012 and 2017. However, some states begin with a score of -0.617 (2012) and end with -1.25 (2017), which produces a positive percent difference rather than a negative one.
The only other thing I've tried is subtracting 1 from the fraction factor1/lag(factor1). Below is the code I'm currently working with:
STFACTOR %>%
  dplyr::select(FIPSst, Geography, Year, factor1) %>%
  filter(Year %in% c(2012, 2017)) %>%
  group_by(Geography) %>%
  mutate(pct_change = (factor1 / lag(factor1) - 1) * 100)
These are the variants I tried and the results from each:
mutate(pct_change = (1-factor1/lag(factor1)) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.
mutate(pct_change = (factor1/lag(factor1)-1) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 102.
I would expect the final result to look like this:
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.
mutate(pct_change = (factor1-lag(factor1))/lag(abs(factor1)) * 100)
Above is the final solution to the problem: subtract the old number from the new one before dividing by the absolute value of the old number.
We can use
mutate(pct_change =(factor1 - lag(factor1))/abs(lag(factor1)) * 100)
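As a quick sanity check, here is a minimal base-R sketch (with made-up values mirroring the Alabama and Colorado rows above) contrasting the naive ratio formula with the absolute-value denominator:

```r
old <- c(1.82, -0.617)   # 2012 scores
new <- c(0.945, -1.25)   # 2017 scores

# Naive percent change: the sign flips when the baseline is negative.
naive <- (new / old - 1) * 100

# Dividing the raw difference by abs(old) preserves the direction of change:
# both states declined, so both values come out negative.
fixed <- (new - old) / abs(old) * 100

round(naive, 1)  # second element is wrongly positive
round(fixed, 1)  # both negative
```

The key point is that the denominator only rescales the difference; taking its absolute value leaves the sign of `new - old` intact.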

Converting Dates to Julian Date

I am currently trying to do Theil-Sen trend estimates with a number of time series. How should I convert the Date variable so that it can be used with the mblm package? The dates are currently stored as strings like 'Apr 1981'. I want to use monthly medians in this assessment. See the attached data.frame.
Thanks!
mo yr doc Date
04 1981 2.800 Apr 1981
05 1982 2.700 May 1982
10 1999 0.500 Oct 1999
05 2000 2.400 May 2000
06 2000 1.200 Jun 2000
07 2000 0.950 Jul 2000
08 2000 0.700 Aug 2000
09 2000 0.750 Sep 2000
10 2000 0.600 Oct 2000
11 2000 0.785 Nov 2000
12 2000 0.660 Dec 2000
01 2001 0.710 Jan 2001
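No answer survives with this question, but as a sketch: in base R you can parse 'Apr 1981' into a Date pinned to the first of the month and then take the Julian day count, giving a numeric time variable that mblm can regress on. This assumes English month abbreviations; adjust your locale if %b parses differently.

```r
df <- data.frame(doc  = c(2.800, 2.700, 0.500),
                 Date = c("Apr 1981", "May 1982", "Oct 1999"))

# Prepend a day so "%d %b %Y" can parse "Apr 1981" as 1981-04-01.
d <- as.Date(paste("01", df$Date), format = "%d %b %Y")

# julian() returns days since 1970-01-01 as a numeric offset.
df$jday <- as.numeric(julian(d))

# df$jday can now serve as the x variable, e.g. mblm::mblm(doc ~ jday, df).
```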

Convert time series to data.frame without losing the year and Month items

I have a time series, dt_ts. I want to convert it to a data frame without losing the year and month.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2005 41.26 40.02 38.24 35.37 39.35 38.90 43.51 40.32 38.14 41.04 41.78 40.48
2006 40.55 42.15 42.30 39.93 38.12 35.79 34.71 34.29 36.27 37.33 37.97 40.16
2007 40.74 39.59 36.74 37.87 38.87 39.35 37.17 38.31 32.44
I want something like:
Year Month Sales
2005 Jan 41.26
etc etc etc
A solution using dplyr, tidyr, and tibble.
library(dplyr)
library(tidyr)
library(tibble)
dt2 <- dt %>%
  rownames_to_column("Year") %>%
  gather(Month, Sales, -Year) %>%
  mutate(Month = factor(Month, levels = colnames(dt))) %>%
  arrange(Year, Month)
dt2
Year Month Sales
1 2005 Jan 41.26
2 2005 Feb 40.02
3 2005 Mar 38.24
4 2005 Apr 35.37
5 2005 May 39.35
6 2005 Jun 38.90
7 2005 Jul 43.51
8 2005 Aug 40.32
9 2005 Sep 38.14
10 2005 Oct 41.04
11 2005 Nov 41.78
12 2005 Dec 40.48
13 2006 Jan 40.55
14 2006 Feb 42.15
15 2006 Mar 42.30
16 2006 Apr 39.93
17 2006 May 38.12
18 2006 Jun 35.79
19 2006 Jul 34.71
20 2006 Aug 34.29
21 2006 Sep 36.27
22 2006 Oct 37.33
23 2006 Nov 37.97
24 2006 Dec 40.16
25 2007 Jan 40.74
26 2007 Feb 39.59
27 2007 Mar 36.74
28 2007 Apr 37.87
29 2007 May 38.87
30 2007 Jun 39.35
31 2007 Jul 37.17
32 2007 Aug 38.31
33 2007 Sep 32.44
34 2007 Oct NA
35 2007 Nov NA
36 2007 Dec NA
DATA
dt <- read.table(text = " Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2005 41.26 40.02 38.24 35.37 39.35 38.90 43.51 40.32 38.14 41.04 41.78 40.48
2006 40.55 42.15 42.30 39.93 38.12 35.79 34.71 34.29 36.27 37.33 37.97 40.16
2007 40.74 39.59 36.74 37.87 38.87 39.35 37.17 38.31 32.44",
header = TRUE, fill = TRUE)
One option would be to convert to xts, get the 'index', split it into two columns, and cbind with the vector 'ts1':
library(xts)
cbind(read.table(text = as.character(index(as.xts(ts1))),
                 col.names = c('Month', 'Year')),
      Sales = c(ts1))
data
set.seed(24)
ts1 <- ts(sample(50), start = c(2001, 1), frequency = 12)

Linear model/lmList with nested/multiple group categories?

I am trying to build a model for monthly energy consumption based on weather, grouped by location (there are ~1100) AND year (I would like to do it from 2011-2014). The data is called factin and looks like this:
Store Month Days UPD HD CD Year
1 August, 2013 31 6478.27 0.06 10.03 2013
1 September, 2013 30 6015.38 0.50 5.67 2013
1 October, 2013 31 5478.21 5.29 1.48 2013
1 November, 2013 30 5223.78 18.60 0.00 2013
1 December, 2013 31 5115.80 20.52 0.23 2013
6 January, 2011 31 4517.56 27.45 0.00 2011
6 February, 2011 28 4116.07 16.75 0.07 2011
6 March, 2011 31 3981.78 12.68 0.39 2011
6 April, 2011 30 4041.68 3.83 2.53 2011
6 May, 2011 31 4287.23 1.61 6.58 2011
And my model code, which just spits out one set of coefficients across all years for each store, looks like this:
factout <- lmList(UPD ~ HD + CD | Store, factin)
My question is: is there any way I can get coefficients for each store AND year without creating a separate data frame for each year?
dat <- read.table(header = T, stringsAsFactors = F, text = "Store Month year Days UPD HD CD Year
1 August 2013 31 6478.27 0.06 10.03 2013
1 September 2013 30 6015.38 0.50 5.67 2013
1 October 2013 31 5478.21 5.29 1.48 2013
1 November 2013 30 5223.78 18.60 0.00 2013
1 December 2013 31 5115.80 20.52 0.23 2013
6 January 2011 31 4517.56 27.45 0.00 2011
6 February 2011 28 4116.07 16.75 0.07 2011
6 March 2011 31 3981.78 12.68 0.39 2011
6 April 2011 30 4041.68 3.83 2.53 2011
6 May 2011 31 4287.23 1.61 6.58 2011")
library(nlme)  # provides lmList()
factout <- lmList(UPD ~ HD + CD | Store, dat)
data.frame(Store = unique(dat$Store), summary(factout)$coef[1:2,1,1:3])
(Intercept) HD CD
1 5405.108 -12.90986 107.2061
6 3581.307 32.93137 102.9780
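The answer above still fits one model per store. To get coefficients per store AND year without a separate data frame per year, one sketch (not from the original answer) is to split on the interaction of the two variables and fit an lm() per piece, which needs no extra packages:

```r
dat <- read.table(header = TRUE, text = "Store Year UPD HD CD
1 2013 6478.27 0.06 10.03
1 2013 6015.38 0.50 5.67
1 2013 5478.21 5.29 1.48
1 2013 5223.78 18.60 0.00
1 2013 5115.80 20.52 0.23
6 2011 4517.56 27.45 0.00
6 2011 4116.07 16.75 0.07
6 2011 3981.78 12.68 0.39
6 2011 4041.68 3.83 2.53
6 2011 4287.23 1.61 6.58")

# One lm() per Store-Year combination; drop = TRUE skips empty pairs.
models <- lapply(split(dat, interaction(dat$Store, dat$Year, drop = TRUE)),
                 function(d) lm(UPD ~ HD + CD, data = d))

# Matrix of coefficients, one row per Store-Year group.
t(sapply(models, coef))
```

Passing the same combined factor (built with interaction()) as the grouping variable to lmList should behave equivalently.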
