I have a dataset with multiple rows per ID like this
ID From To State
1 2004 2005 MD
1 2005 2005 MD
1 2005 2012 DC
1 2012 2015 DC
1 2015 2020 DC
1 2012 2013 MD
1 2013 2016 MD
1 2016 2019 MD
1 2019 2020 MD
2 2003 2004 OR
2 2004 2008 OR
2 2008 2013 AZ
2 2013 2015 AZ
My goal is to collapse the multiple From and To columns to create a smooth timeline like
ID From To State
1 2004 2005 MD
1 2005 2020 DC
1 2012 2020 MD
2 2003 2008 OR
2 2008 2015 AZ
Not sure how to accomplish this. An help is much appreciated. Thanks.
Group by 'ID', 'State' and the run-length-id of 'State', get the first of 'From' and last of 'To'
library(dplyr)
library(data.table)
df1 %>%
group_by(ID, State, grp = rleid(State)) %>%
summarise(From = first(From), To = last(To), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 5 × 4
ID State From To
<int> <chr> <int> <int>
1 1 DC 2005 2020
2 1 MD 2004 2005
3 1 MD 2012 2020
4 2 AZ 2008 2015
5 2 OR 2003 2008
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), From = c(2004L, 2005L, 2005L, 2012L, 2015L, 2012L,
2013L, 2016L, 2019L, 2003L, 2004L, 2008L, 2013L), To = c(2005L,
2005L, 2012L, 2015L, 2020L, 2013L, 2016L, 2019L, 2020L, 2004L,
2008L, 2013L, 2015L), State = c("MD", "MD", "DC", "DC", "DC",
"MD", "MD", "MD", "MD", "OR", "OR", "AZ", "AZ")),
class = "data.frame", row.names = c(NA,
-13L))
I have this data:
library(dplyr)
glimpse(samp)
Observations: 15
Variables: 6
$ date <date> 2013-01-04, 2013-01-31, 2013-01-09, 2013-01-20, 2013-01-29, 2013...
$ shop_id <int> 4, 1, 30, 41, 26, 16, 25, 10, 29, 52, 54, 42, 8, 59, 31
$ item_id <int> 1904, 17880, 14439, 15010, 10917, 10331, 2751, 1475, 16071, 13901...
$ item_cnt_day <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1
$ month <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
It´s just a sample of a large data set, so there are jumps between the date.
In the original data, the time series stars at 2013-01-01 and ends at 2015-11-30. The data are a time series. My goal is to calculate the lag for one month. The problem is that the length of a month is not consistent (i.e. some months have 30 other have 31 days). In order to calculate the lag, I have to set a number. However, as I mentioned before for a month it´s not possible to set a fixed number. Is there a way to calculate the lag month wise?
The target variable is item_cnt_day. The lag should be calculated for the rolling mean. In this example each month has 5 days so the result should like this:
library(RcppRoll)
library(dplyr)
samp %>%
mutate(r_mean_5 = lag(roll_meanr(item_cnt_day, 5), 1))
date shop_id item_id item_cnt_day month year r_mean_5
30717 2013-01-04 4 1904 1 1 2013 NA
43051 2013-01-31 1 17880 1 1 2013 NA
66273 2013-01-09 30 14439 1 1 2013 NA
105068 2013-01-20 41 15010 1 1 2013 NA
23332 2013-01-29 26 10917 1 1 2013 NA
28838 2013-02-22 16 10331 1 2 2013 1.0
40418 2013-02-08 25 2751 2 2 2013 1.0
62219 2013-02-12 10 1475 1 2 2013 1.2
98641 2013-02-16 29 16071 1 2 2013 1.2
21905 2013-02-23 52 13901 2 2 2013 1.2
32219 2013-03-31 54 2972 1 3 2013 1.4
45156 2013-03-17 42 11184 1 3 2013 1.4
69513 2013-03-24 8 19405 1 3 2013 1.2
110206 2013-03-10 59 2255 1 3 2013 1.2
24473 2013-03-07 31 15119 1 3 2013 1.2
Here is the dput().
structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")
Maybe this?
library(lubridate)
df$lag <- df$date %m-% months(1)
df$rollmean <- sapply(1:nrow(df), function(x) mean(df[df$date <= df$date[x] & df$date >= df$lag[x], "item_cnt_day" ]))
date shop_id item_id item_cnt_day month year lag rollmean
30717 2013-01-04 4 1904 1 1 2013 2012-12-04 1.000000
43051 2013-01-31 1 17880 1 1 2013 2012-12-31 1.000000
66273 2013-01-09 30 14439 1 1 2013 2012-12-09 1.000000
105068 2013-01-20 41 15010 1 1 2013 2012-12-20 1.000000
23332 2013-01-29 26 10917 1 1 2013 2012-12-29 1.000000
28838 2013-02-22 16 10331 1 2 2013 2013-01-22 1.166667
40418 2013-02-08 25 2751 2 2 2013 2013-01-08 1.200000
62219 2013-02-12 10 1475 1 2 2013 2013-01-12 1.200000
98641 2013-02-16 29 16071 1 2 2013 2013-01-16 1.166667
21905 2013-02-23 52 13901 2 2 2013 2013-01-23 1.285714
32219 2013-03-31 54 2972 1 3 2013 2013-02-28 1.000000
45156 2013-03-17 42 11184 1 3 2013 2013-02-17 1.200000
69513 2013-03-24 8 19405 1 3 2013 2013-02-24 1.000000
110206 2013-03-10 59 2255 1 3 2013 2013-02-10 1.166667
24473 2013-03-07 31 15119 1 3 2013 2013-02-07 1.333333
%m-% calculates for every date the date one month ago, while accounting for different length of the months (31 days, 30 days, 28 days) and puts it into the column lag. Then in sapply(), the mean of item_cnt_day is calculated for all observations whose date lies within the range of date and lag of the current iteration.
So it doesn't matter how many elements are there for each month or how the elements are ordered.
The date class supports seq for different time intervals (documentation).
So you can basically do:
calculate_lag <- function(date) {
return(seq(date, by = "1 month", length.out = 2)[2])
}
date_column <- as.Date(sapply( _YOUR_DATAFRAME_ , calculate_lag), origin="1970-01-01")
I am not really familiar calculating lag, but maybe that is what you want?
Data:
df <- structure(list(date = structure(c(15709, 15736, 15714, 15725,
15734, 15758, 15744, 15748, 15752, 15759, 15795, 15781, 15788,
15774, 15771), class = "Date"), shop_id = c(4L, 1L, 30L, 41L,
26L, 16L, 25L, 10L, 29L, 52L, 54L, 42L, 8L, 59L, 31L), item_id = c(1904L,
17880L, 14439L, 15010L, 10917L, 10331L, 2751L, 1475L, 16071L,
13901L, 2972L, 11184L, 19405L, 2255L, 15119L), item_cnt_day = c(1,
1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1), month = c(1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(2013L,
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L,
2013L, 2013L, 2013L, 2013L, 2013L)), row.names = c(30717L, 43051L,
66273L, 105068L, 23332L, 28838L, 40418L, 62219L, 98641L, 21905L,
32219L, 45156L, 69513L, 110206L, 24473L), class = "data.frame")
Calculation:
df %>%
dplyr::mutate(days_in_month = lubridate::days_in_month(date)) %>%
tidyr::nest(-c(month, days_in_month)) %>%
dplyr::mutate(lag = purrr::map2(data, days_in_month, ~ stats::lag(.x$item_cnt_day, .y)))
EDIT based on comment:
maybe this then?
df %>%
tidyr::nest(-month) %>%
dplyr::mutate(
ndays = purrr::map_int(data, nrow),
lag = purrr::map2_dbl(data, ndays, ~ zoo::rollmean(.x$item_cnt_day, .y))
)
I am trying to migrate this activity from excel/SQL to R and I am stuck - any help is very much appreciated. Thanks !
Format of Data:
There are unique customer ids. Each customer has purchases in different groups in different years.
Objective:
For each customer id - get one row of output. Use variable names stored in column and create columns - for each column assign sum of amount. Create a similar column and assign as 1 or 0 depending on presence or absence of revenue.
SOURCE:
Cust_ID Group Year Variable_Name Amount
1 1 A 2009 A_2009 2000
2 1 B 2009 B_2009 100
3 2 B 2009 B_2009 300
4 2 C 2009 C_2009 20
5 3 D 2009 D_2009 299090
6 3 A 2011 A_2011 89778456
7 1 B 2011 B_2011 884
8 1 C 2010 C_2010 34894
9 3 D 2010 D_2010 389849
10 2 A 2013 A_2013 742
11 1 B 2013 B_2013 25661
12 2 C 2007 C_2007 393
13 3 D 2007 D_2007 23
OUTPUT:
Cust_ID A_2009 B_2009 C_2009 D_2009 A_2011 …. A_2009_P B_2009_P
1 sum of amount .. 1 0 ….
2
3
dput of original data:
structure(list(Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L,
2L, 1L, 2L, 3L), Group = c("A", "B", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D"), Year = c(2009L, 2009L, 2009L,
2009L, 2009L, 2011L, 2011L, 2010L, 2010L, 2013L, 2013L, 2007L,
2007L), Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009",
"D_2009", "A_2011", "B_2011", "C_2010", "D_2010", "A_2013", "B_2013",
"C_2007", "D_2007"), Amount = c(2000L, 100L, 300L, 20L, 299090L,
89778456L, 884L, 34894L, 389849L, 742L, 25661L, 393L, 23L)), .Names = c("Cust_ID",
"Group", "Year", "Variable_Name", "Amount"), class = "data.frame", row.names = c(NA,
-13L))
One option:
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name,data=dat))
result <- data.frame(aggregate(Amount~Cust_ID, data=dat,sum),intm,(intm > 0)+0 )
Result (abridged):
Cust_ID Amount A_2009 A_2011 ... A_2009.1 A_2011.1
1 1 65539 4000 0 ... 1 0
2 2 1455 0 0 ... 0 0
3 3 90467418 0 89778456 ... 0 1
If the names are a concern, they can easily be fixed via:
names(res) <- gsub("\\.1","_P",names(res))