Calculating lag over a month - r

I have this data:
library(dplyr)
glimpse(samp)
Observations: 15
Variables: 6
$ date <date> 2013-01-04, 2013-01-31, 2013-01-09, 2013-01-20, 2013-01-29, 2013...
$ shop_id <int> 4, 1, 30, 41, 26, 16, 25, 10, 29, 52, 54, 42, 8, 59, 31
$ item_id <int> 1904, 17880, 14439, 15010, 10917, 10331, 2751, 1475, 16071, 13901...
$ item_cnt_day <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1
$ month <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
It's just a sample of a large data set, so there are gaps between the dates.
In the original data, the time series starts at 2013-01-01 and ends at 2015-11-30. My goal is to calculate a lag of one month. The problem is that months differ in length (some have 30 days, others have 31), while the lag function needs a fixed number of rows. Is there a way to calculate the lag month-wise?
The target variable is item_cnt_day, and the lag should be applied to its rolling mean. In this example each month has 5 observations, so the result should look like this:
library(RcppRoll)
library(dplyr)
samp %>%
mutate(r_mean_5 = lag(roll_meanr(item_cnt_day, 5), 1))
date shop_id item_id item_cnt_day month year r_mean_5
30717 2013-01-04 4 1904 1 1 2013 NA
43051 2013-01-31 1 17880 1 1 2013 NA
66273 2013-01-09 30 14439 1 1 2013 NA
105068 2013-01-20 41 15010 1 1 2013 NA
23332 2013-01-29 26 10917 1 1 2013 NA
28838 2013-02-22 16 10331 1 2 2013 1.0
40418 2013-02-08 25 2751 2 2 2013 1.0
62219 2013-02-12 10 1475 1 2 2013 1.2
98641 2013-02-16 29 16071 1 2 2013 1.2
21905 2013-02-23 52 13901 2 2 2013 1.2
32219 2013-03-31 54 2972 1 3 2013 1.4
45156 2013-03-17 42 11184 1 3 2013 1.4
69513 2013-03-24 8 19405 1 3 2013 1.2
110206 2013-03-10 59 2255 1 3 2013 1.2
24473 2013-03-07 31 15119 1 3 2013 1.2
Here is the dput().
structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")

Maybe this?
library(lubridate)
df$lag <- df$date %m-% months(1)
df$rollmean <- sapply(1:nrow(df), function(x) mean(df[df$date <= df$date[x] & df$date >= df$lag[x], "item_cnt_day" ]))
date shop_id item_id item_cnt_day month year lag rollmean
30717 2013-01-04 4 1904 1 1 2013 2012-12-04 1.000000
43051 2013-01-31 1 17880 1 1 2013 2012-12-31 1.000000
66273 2013-01-09 30 14439 1 1 2013 2012-12-09 1.000000
105068 2013-01-20 41 15010 1 1 2013 2012-12-20 1.000000
23332 2013-01-29 26 10917 1 1 2013 2012-12-29 1.000000
28838 2013-02-22 16 10331 1 2 2013 2013-01-22 1.166667
40418 2013-02-08 25 2751 2 2 2013 2013-01-08 1.200000
62219 2013-02-12 10 1475 1 2 2013 2013-01-12 1.200000
98641 2013-02-16 29 16071 1 2 2013 2013-01-16 1.166667
21905 2013-02-23 52 13901 2 2 2013 2013-01-23 1.285714
32219 2013-03-31 54 2972 1 3 2013 2013-02-28 1.000000
45156 2013-03-17 42 11184 1 3 2013 2013-02-17 1.200000
69513 2013-03-24 8 19405 1 3 2013 2013-02-24 1.000000
110206 2013-03-10 59 2255 1 3 2013 2013-02-10 1.166667
24473 2013-03-07 31 15119 1 3 2013 2013-02-07 1.333333
%m-% computes, for every date, the date one month earlier while accounting for the different month lengths (31, 30, or 28 days), and the result is stored in the column lag. When the same day does not exist in the previous month, it rolls back to the last day of that month (2013-03-31 becomes 2013-02-28 above). Then, in sapply(), the mean of item_cnt_day is calculated over all observations whose date lies between lag and date of the current row.
So it doesn't matter how many rows each month has or how the rows are ordered.

The Date class supports seq with calendar intervals such as "1 month" (documentation).
So you can basically do:
calculate_lag <- function(date) {
return(seq(date, by = "1 month", length.out = 2)[2])
}
date_column <- as.Date(sapply( _YOUR_DATAFRAME_ , calculate_lag), origin="1970-01-01")
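A small base R sketch of how the seq-based helper above behaves, with two dates picked here purely for illustration. One thing worth checking before relying on it: seq steps the month field and then normalises the date, so a day that does not exist in the target month spills over into the following month rather than being clamped to the month end.

```r
# Same helper as in the answer: the date one calendar month later
calculate_lag <- function(date) {
  seq(date, by = "1 month", length.out = 2)[2]
}

# Illustrative dates (an assumption, not taken from the full data set)
dates <- as.Date(c("2013-01-04", "2013-01-31"))

# lapply keeps every element a Date; c() joins them back into one Date vector
shifted <- do.call(c, lapply(dates, calculate_lag))

format(shifted)
# "2013-02-04" "2013-03-03"  (31 February does not exist, so it overflows into March)
```

Using by = "-1 month" steps backwards in the same way. If clamping to the end of the month is what you need, the %m-% operator from the lubridate answer does that instead.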

I am not really familiar with calculating lags, but maybe this is what you want?
Data:
df <- structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")
Calculation:
df %>%
dplyr::mutate(days_in_month = lubridate::days_in_month(date)) %>%
tidyr::nest(-c(month, days_in_month)) %>%
dplyr::mutate(lag = purrr::map2(data, days_in_month, ~ stats::lag(.x$item_cnt_day, .y)))
EDIT based on comment:
maybe this then?
df %>%
tidyr::nest(-month) %>%
dplyr::mutate(
ndays = purrr::map_int(data, nrow),
lag = purrr::map2_dbl(data, ndays, ~ zoo::rollmean(.x$item_cnt_day, .y))
)
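If "lag by one month" is read as "use the mean of the previous calendar month", the month length stops mattering, because the lookup is keyed on the month instead of on a row count. Below is a minimal base R sketch under that assumption, using the sample values from the question. The column names month_mean, idx and prev_month_mean are made up for illustration.

```r
# Sample data reduced to the columns needed here (values from the question)
df <- data.frame(
  year  = rep(2013L, 15),
  month = rep(1:3, each = 5),
  item_cnt_day = c(1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1)
)

# One mean per calendar month, however many rows the month contains
monthly <- aggregate(item_cnt_day ~ year + month, data = df, FUN = mean)
names(monthly)[3] <- "month_mean"

# Running month index, so December -> January is also "one month earlier"
monthly$idx <- monthly$year * 12L + monthly$month
df$idx <- df$year * 12L + df$month

# For each row, look up the mean of the previous month (NA if there is none)
df$prev_month_mean <- monthly$month_mean[match(df$idx - 1L, monthly$idx)]

unique(df[c("month", "prev_month_mean")])
# month 1 -> NA, month 2 -> 1.0, month 3 -> 1.4
```

Note that this gives one value per month, whereas the rolling version in the question changes row by row within a month.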

Related

Changing value of row based on condition in other condition

My dataframe looks like this:
Index Year Renovation
1 2012 1
1 2018 1
2 2012 1
2 2018 1
3 2012 0
3 2018 0
I would like to change the Renovation value for 2012 to 0 if the Renovation value for 2018 is 1 for the same Index. So I am facing a double condition here. How can I do this in R?
You can use ifelse to check both conditions within each Index group.
library(dplyr)
df %>%
group_by(Index) %>%
mutate(Renovation = ifelse(Year == 2012 &
Renovation[match(2018, Year)] == 1, 0, Renovation))
# Index Year Renovation
# <int> <int> <dbl>
#1 1 2012 0
#2 1 2018 1
#3 2 2012 0
#4 2 2018 1
#5 3 2012 0
#6 3 2018 0
data
df <- structure(list(Index = c(1L, 1L, 2L, 2L, 3L, 3L), Year = c(2012L,
2018L, 2012L, 2018L, 2012L, 2018L), Renovation = c(1L, 1L, 1L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -6L))
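For comparison, here is a base R sketch of the same double condition without grouping. It first collects the Index values that were renovated in 2018 and then resets their 2012 rows. The helper name renovated_2018 is made up for illustration.

```r
# Data from the answer above
df <- data.frame(
  Index = c(1L, 1L, 2L, 2L, 3L, 3L),
  Year = c(2012L, 2018L, 2012L, 2018L, 2012L, 2018L),
  Renovation = c(1L, 1L, 1L, 1L, 0L, 0L)
)

# Condition 1: which Index values have Renovation == 1 in 2018?
renovated_2018 <- df$Index[df$Year == 2018 & df$Renovation == 1]

# Condition 2: for those Index values, set the 2012 row to 0
df$Renovation[df$Year == 2012 & df$Index %in% renovated_2018] <- 0L

df$Renovation
# 0 1 0 1 0 0
```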

How to insert 0's where values are missing

I have a dataframe, called dets_per_month, that looks like so...
Zone month yearcollected total
1 Jul 2017 183
1 Jul 2015 18
1 Aug 2015 202
1 Aug 2017 202
1 Aug 2017 150
1 Sep 2017 68
2 Apr 2018 65
2 Jun 2018 25
2 Sep 2018 278
I'm trying to insert 0's for months where there are no totals in a particular zone. This is the code I tried:
complete(dets_per_month, nesting(zone, month), yearcollected = 2016:2018, fill = list(count = 0))
But the output of this doesn't give me any 0's, instead it adds on columns from my original dataframe.
Can anyone tell me how to get 0's for this?
You could use complete after grouping by Zone and yearcollected. month.abb is a built-in constant holding the abbreviated English month names.
library(dplyr)
df %>%
group_by(Zone, yearcollected) %>%
tidyr::complete(month = month.abb, fill = list(total = 0))
# Zone yearcollected month total
# <int> <int> <chr> <dbl>
# 1 1 2015 Apr 0
# 2 1 2015 Aug 202
# 3 1 2015 Dec 0
# 4 1 2015 Feb 0
# 5 1 2015 Jan 0
# 6 1 2015 Jul 18
# 7 1 2015 Jun 0
# 8 1 2015 Mar 0
# 9 1 2015 May 0
#10 1 2015 Nov 0
# … with 27 more rows
data
df <- structure(list(Zone = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
month = structure(c(3L, 3L, 2L, 2L, 2L, 5L, 1L, 4L, 5L), .Label = c("Apr",
"Aug", "Jul", "Jun", "Sep"), class = "factor"), yearcollected = c(2017L,
2015L, 2015L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L),
total = c(183L, 18L, 202L, 202L, 150L, 68L, 65L, 25L, 278L
)), class = "data.frame", row.names = c(NA, -9L))
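To make explicit what complete is doing here, the following is a rough base R sketch of the same idea: build every Zone/year/month combination, left-join the observed totals, and replace the resulting NA values with 0. The names observed, grid and full are made up for illustration, and month is stored as character here instead of as a factor.

```r
# Data from the answer above, with month as plain character
df <- data.frame(
  Zone = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  month = c("Jul", "Jul", "Aug", "Aug", "Aug", "Sep", "Apr", "Jun", "Sep"),
  yearcollected = c(2017L, 2015L, 2015L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L),
  total = c(183L, 18L, 202L, 202L, 150L, 68L, 65L, 25L, 278L),
  stringsAsFactors = FALSE
)

# Zone/year pairs that occur in the data (3 of them)
observed <- unique(df[c("Zone", "yearcollected")])

# merge() with no shared columns yields the Cartesian product: 3 x 12 rows
grid <- merge(observed, data.frame(month = month.abb, stringsAsFactors = FALSE))

# Left join on Zone, yearcollected and month; unmatched months get NA totals
full <- merge(grid, df, all.x = TRUE)
full$total[is.na(full$total)] <- 0

nrow(full)
# 37 (36 combinations plus one duplicate Zone 1 / Aug / 2017 row)
```

If calendar order matters for later plotting or sorting, converting month with factor(month, levels = month.abb) is one option.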

Time lag between sequential observations giving negative values

I am trying to calculate the time between sequential observations. I have attached a sample of my data here.
A subset of my data looks like:
head(d1) #visualize the first few lines of the data
date time year km sps pp datetime next timedif seque
<fct> <fct> <int> <dbl> <fct> <dbl> <chr> <dbl> <dbl> <fct>
2012/06/21 23:23 2012 80 MUXX 1 2012-06-21 23:23 0 4144 10
2012/07/15 11:38 2012 80 MAMO 0 2012-07-15 11:38 1 33855 01
2012/07/20 22:19 2012 80 MICRO 0 2012-07-20 22:19 0 7841 00
2012/07/29 23:03 2012 80 MICRO 0 2012-07-29 23:03 0 13004 00
2012/10/18 2:54 2012 80 MICRO 0 2012-10-18 02:54 0 -971 00
2012/10/23 2:49 2012 80 MICRO 0 2012-10-23 02:49 0 -1094 00
Where:
pp: which species (sps) are predators (coded as 1) and which are prey (coded as 0)
next: very next pp after the current observation
timedif: time difference between the current observation and the next one
seque: this should be the sequence order: where the first number is the current pp and the second number is the next pp
To generate the datetime column, I did this:
d1$datetime=strftime(paste(d1$date,d1$time),'%Y-%m-%d %H:%M',usetz=FALSE) #converting the date/time into a new format
To make the other columns I used the following code:
d1 = d1 %>%
ungroup() %>%
group_by(km, year) %>%
mutate(next = dplyr::lag(pp)) %>%
mutate(timedif = as.numeric(as.POSIXct(datetime) - lag(as.POSIXct(datetime))))
d1 = d1[2:nrow(d1),] %>% mutate(seque = as.factor(paste0(pp,prev)))
I have two questions:
My lag function appears to be recording the previous pp event, not the next pp event. How do I fix this?
My timedif calculation is giving me negative values, which shouldn't be possible. Why is that happening?
Just in case, here is the output for str(d1):
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 96 obs. of 10 variables:
$ date : Factor w/ 1093 levels "2012/05/30","2012/05/31",..: 23 47 52 61 71 76 76 88 90 98 ...
$ time : Factor w/ 1439 levels "0:00","0:01",..: 983 219 919 963 1016 5 47 52 923 1058 ...
$ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
$ km : num 80 80 80 80 80 80 80 80 80 80 ...
$ sps : Factor w/ 17 levels "CACA","ERDO",..: 11 7 9 9 9 9 9 4 9 11 ...
$ pp : num 1 0 0 0 0 0 0 0 0 1 ...
$ datetime: chr "2012-06-21 23:23" "2012-07-15 11:38" "2012-07-20 22:19" "2012-07-29 23:03" ...
$ next : num 0 1 0 0 0 0 0 0 0 0 ...
$ timedif : num 4144 33855 7841 13004 14453 ...
$ seque : Factor w/ 4 levels "00","01","10",..: 3 2 1 1 1 1 1 1 1 3 ...
And also:
dput(d1[1:10,])
structure(list(
date = structure(c(23L, 47L, 52L, 61L, 71L, 76L, 76L, 88L, 90L, 98L),
.Label = c("2012/05/30", "2012/05/31", "2012/06/01", "2012/06/02", "2012/06/03", "2012/06/04", "2012/06/05", "2013/06/18", "2013/06/19", "2013/06/20", "2013/06/21", "2013/06/22", "2014/07/19", "2014/07/20", "2014/07/21", "2014/07/22", "2014/07/23", "2015/08/06", "2015/08/07", "2015/08/08", "2015/08/09", "2015/08/10"),
class = "factor"),
time = structure(c(983L, 219L, 919L, 963L, 1016L, 5L, 47L, 52L, 923L, 1058L),
.Label = c("0:00", "0:01", "0:02", "0:03", "0:04", "0:05", "0:06", "0:07", "0:33","0:34", "0:35", "0:36", "0:37","10:06", "10:07", "10:08", "10:09", "10:10", "10:11", "10:12", "10:13", "2:05", "2:06", "2:07", "2:08", "2:09", "2:10", "2:11", "9:54", "9:55", "9:56", "9:57", "9:58", "9:59"),
class = "factor"),
year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L),
km = c(80, 80, 80, 80, 80, 80, 80, 80, 80, 80),
sps = structure(c(11L, 7L, 9L, 9L, 9L, 9L, 9L, 4L, 9L, 11L),
.Label = c("CACA", "ERDO", "FEDO", "LEAM", "LOCA", "MAAM", "MAMO", "MEME", "MICRO", "MUVI", "MUXX", "ONZI", "PRLO", "TAHU", "TAST", "URAM", "VUVU"),
class = "factor"),
pp = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1),
datetime = c("2012-06-21 23:23", "2012-07-15 11:38", "2012-07-20 22:19", "2012-07-29 23:03", "2012-08-08 23:56", "2012-08-13 00:04", "2012-08-13 00:46", "2012-08-25 00:51", "2012-08-27 22:23", "2012-09-04 03:38"),
prev = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
timedif = c(4144, 33855, 7841, 13004, 14453, 5768, 42, 17285, 4172, 10395),
seque = structure(c(3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L), .Label = c("00", "01", "10", "11"),
class = "factor")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -10L))
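On the first question above: lag looks back one row, while the "next" observation needs a look forward (dplyr::lead, or shifting the vector by hand). The base R sketch below uses the first three datetime values from the data. On the second question, one possible cause of negative differences is rows that are not in chronological order within a group. That is only a guess without the full data, so the sketch sorts by time first to rule it out. The names t, next_pp and idx are made up for illustration.

```r
# First three rows of the sample, parsed with an explicit format and time zone
d <- data.frame(
  pp = c(1, 0, 0),
  datetime = c("2012-06-21 23:23", "2012-07-15 11:38", "2012-07-20 22:19"),
  stringsAsFactors = FALSE
)
d$t <- as.POSIXct(d$datetime, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Sort chronologically so that the following row really is later in time
d <- d[order(d$t), ]

# Index of the following row (NA for the last row): a hand-written "lead"
idx <- c(seq_len(nrow(d))[-1], NA)
d$next_pp <- d$pp[idx]

# Fixed units stop difftime from choosing mins/hours/days on its own
d$timedif <- as.numeric(difftime(d$t[idx], d$t, units = "mins"))

d$timedif
# 33855  7841    NA
```

Inside a grouped dplyr pipeline, the same two steps would be arrange(datetime) followed by lead() in place of lag().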

R: weighted aggregation

I have a dataset on this form:
set.seed(4561) # Make the results reproducible
df=data.frame(
colour=rep(c("green","red","blue"),each=3),
year=rep("2017",9),
month=rep(c(1,2,3),3),
price=c(200,254,188,450,434,490,100,99,97),
work=ceiling(runif(9,30,60)),
gain=ceiling(runif(9,1,10)),
work_weighed_price=NA,
gain_weighed_price=NA
)
For each colour, year, month I have a price (output variable) and two input variables called gain and work. In reality I have many more input variables, but this suffices to show what I desire to do with my dataframe.
> df
colour year month price work gain work_weighed_price gain_weighed_price
1 green 2017 1 200 33 9 NA NA
2 green 2017 2 254 56 5 NA NA
3 green 2017 3 188 42 8 NA NA
4 red 2017 1 450 39 3 NA NA
5 red 2017 2 434 45 2 NA NA
6 red 2017 3 490 36 8 NA NA
7 blue 2017 1 100 50 8 NA NA
8 blue 2017 2 99 45 8 NA NA
9 blue 2017 3 97 56 4 NA NA
I wish to calculate the weighted gain and work (and also the weighted price), where the weight is the price for that month and year, divided by the sum of price across colours:
desired_output=data.frame(
year=rep("2017",3),
month=rep(c(1,2,3),1),
price=c(200*(200/(200+450+100))+450*(450/(200+450+100))+100*(100/(200+450+100)),
254*(254/(254+434+99))+434*(434/(254+434+99))+99*(99/(254+434+99)),
188*(188/(188+490+97))+490*(490/(188+490+97))+97*(97/(188+490+97))),
work_weighed_price=c(47*(200/(200+450+100))+44*(450/(200+450+100))+52*(100/(200+450+100)),
44*(254/(254+434+99))+42*(434/(254+434+99))+32*(99/(254+434+99)),
38*(188/(188+490+97))+52*(490/(188+490+97))+52*(97/(188+490+97))) ,
gain_weighed_price=c(5*(200/(200+450+100))+8*(450/(200+450+100))+10*(100/(200+450+100)),
3*(254/(254+434+99))+7*(434/(254+434+99))+9*(99/(254+434+99)),
2*(188/(188+490+97))+4*(490/(188+490+97))+9*(97/(188+490+97)))
)
> desired_output
year month price work_weighed_price gain_weighed_price
1 2017 1 336.6667 45.86667 7.466667
2 2017 2 333.7649 41.38755 5.960610
3 2017 3 367.5523 48.60387 4.140645
How would I attack this in R?
You can use the weighted.mean function
library(dplyr)
df %>%
group_by(year, month) %>%
# weight every column by price within each year/month group
summarise_at(vars(price, work, gain),
list(price_weighted = ~ weighted.mean(., price)))
# # A tibble: 3 x 5
# # Groups: year [?]
# year month price_price_weighted work_price_weighted gain_price_weighted
# <int> <int> <dbl> <dbl> <dbl>
# 1 2017 1 337 45.9 7.47
# 2 2017 2 334 41.4 5.96
# 3 2017 3 368 48.6 4.14
Or, in data.table
library(data.table)
setDT(df)
df[, lapply(.SD, weighted.mean, price)
, .SDcols = c('price', 'work', 'gain')
, by = .(year, month)]
# year month price work gain
# 1: 2017 1 336.6667 45.86667 7.466667
# 2: 2017 2 333.7649 41.38755 5.960610
# 3: 2017 3 367.5523 48.60387 4.140645
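As a quick check that weighted.mean matches the formula in the question, the sketch below recomputes month 1 by hand. The input values are the ones used in the desired_output expressions; manual and builtin are illustrative names.

```r
# Month 1 values taken from the desired_output formulas in the question
price <- c(200, 450, 100)
work  <- c(47, 44, 52)
gain  <- c(5, 8, 10)

# Manual version: the weights are price shares that sum to 1
w <- price / sum(price)
manual <- c(price = sum(price * w), work = sum(work * w), gain = sum(gain * w))

# weighted.mean normalises the weights itself, so raw prices can be passed
builtin <- c(price = weighted.mean(price, price),
             work  = weighted.mean(work, price),
             gain  = weighted.mean(gain, price))

round(manual, 4)
# price 336.6667, work 45.8667, gain 7.4667
all.equal(manual, builtin)
# TRUE
```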
An approach using dplyr. Your use of runif in your example df without setting seed and the fact that it doesn't line up with your desired output is causing some confusion. In the code below, I use a df that's consistent with your desired output.
library(dplyr)
df %>%
group_by(year, month) %>%
mutate(weight = price / sum(price)) %>%
mutate_at(vars(price, work, gain), list(weighed_price = ~ . * weight)) %>%
summarise_at(vars(ends_with("weighed_price")), sum)
# # A tibble: 3 x 5
# # Groups: year [?]
# year month work_weighed_price gain_weighed_price price_weighed_price
# <int> <int> <dbl> <dbl> <dbl>
# 1 2017 1 45.9 7.47 337.
# 2 2017 2 41.4 5.96 334.
# 3 2017 3 48.6 4.14 368.
df:
structure(list(colour = c("green", "green", "green", "red", "red",
"red", "blue", "blue", "blue"), year = c(2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L), month = c(1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L), price = c(200L, 254L, 188L, 450L,
434L, 490L, 100L, 99L, 97L), work = c(47L, 44L, 38L, 44L, 42L,
52L, 52L, 32L, 52L), gain = c(5L, 3L, 2L, 8L, 7L, 4L, 10L, 9L,
9L), work_weighed_price = c(NA, NA, NA, NA, NA, NA, NA, NA, NA
), gain_weighed_price = c(NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("colour",
"year", "month", "price", "work", "gain", "work_weighed_price",
"gain_weighed_price"), class = "data.frame", row.names = c(NA,
-9L))
A base R solution could be the following sequence of tapply instructions.
fun_price <- function(x){
s <- sum(x)
sum(x*(x/s))
}
fun_weighted <- function(x, w){
s <- sum(w)
sum(x*(w/s))
}
desired <- data.frame(year = unique(df$year), month = sort(unique(df$month)))
desired$price <- with(df, tapply(price, month, FUN = fun_price))
desired$work_weighed_price <- with(df, tapply(work, month, FUN = fun_weighted, w = price))
desired$gain_weighed_price <- with(df, tapply(gain, month, FUN = fun_weighted, w = price))
desired
# year month price work_weighed_price gain_weighed_price
#1 2017 1 336.6667 40.74092 6.622405
#2 2017 2 333.7649 48.56834 4.984429
#3 2017 3 367.5523 44.65052 6.659170

How to create a datetime object from separate date fields?

I have a dataset like this:
Year MM DD HH
158 2010 7 1 5
159 2010 7 1 5
160 2010 7 1 6
161 2010 7 1 6
structure(list(Year = c(2010L, 2010L, 2010L, 2010L), MM = c(7L,
7L, 7L, 7L), DD = c(1L, 1L, 1L, 1L), HH = c(5L, 5L, 6L, 6L)), .Names = c("Year",
"MM", "DD", "HH"), row.names = 158:161, class = "data.frame")
How can I create a single datetime column from these fields?
There are a few options, here's one (where x is your data.frame):
x$datetime <- ISOdatetime(x$Year, x$MM, x$DD, x$HH, 0, 0)
You can pass in the correct time zone if need be, see ?ISOdatetime.
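A self-contained sketch of the ISOdatetime approach on the sample rows, with the time zone set explicitly. "UTC" is only an assumption here; use whichever zone the data was recorded in.

```r
# Sample rows from the question
x <- data.frame(
  Year = rep(2010L, 4),
  MM = rep(7L, 4),
  DD = rep(1L, 4),
  HH = c(5L, 5L, 6L, 6L)
)

# Minutes and seconds are not in the data, so they are set to 0
x$datetime <- ISOdatetime(x$Year, x$MM, x$DD, x$HH, min = 0, sec = 0, tz = "UTC")

format(x$datetime, "%Y-%m-%d %H:%M:%S")
# "2010-07-01 05:00:00" "2010-07-01 05:00:00" "2010-07-01 06:00:00" "2010-07-01 06:00:00"
```

The result is a POSIXct column, so differences and sorting work directly on it.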
You can now do this in lubridate using make_date or make_datetime.
From the CRAN documentation:
make_datetime(year = 1970L, month = 1L, day = 1L, hour = 0L, min = 0L,
sec = 0, tz = "UTC")
make_date(year = 1970L, month = 1L, day = 1L)
Assuming you have your data in a data frame x:
transform(x,datetime = as.POSIXct(paste(paste(Year,MM,DD,sep="-"), paste(HH,"00",sep=":"))))
Year MM DD HH datetime
158 2010 7 1 5 2010-07-01 05:00:00
159 2010 7 1 5 2010-07-01 05:00:00
160 2010 7 1 6 2010-07-01 06:00:00
161 2010 7 1 6 2010-07-01 06:00:00

Resources