Forecasting for multiple products - r

I want to forecast values for multiple products (in this case product_id 1 and 2 but I actually have a few thousand products) at the same time.
product_id Date Revenue Value
1 1 1/10/12 in 0
2 1 1/13/12 in 1
3 1 2/14/16 in 0
4 1 3/5/16 out 0
5 1 1/5/17 out 0
6 1 3/15/17 out 0
7 2 11/1/11 in 1
8 2 3/14/15 in 2
9 2 1/15/16 in 3
10 2 3/15/17 out 0
11 2 4/11/17 out 0
12 2 5/16/17 out 0
If this were only one product, I would fill in the missing dates with:
# assumes dat$Date has already been converted with as.Date()
allDates <- seq.Date(
min(dat$Date),
max(dat$Date),
"day")
allValues <- merge(
x=data.frame(Date=allDates),
y=dat,
all.x=TRUE)
Make the data time series:
time <- ts(dat$Value, start= c(2011,11), frequency=52)
Forecast using hybrid model:
hm1 <- hybridModel(y = time, weights = "insample.errors")
plot(forecast(hm1))
Is there a way that I can do this for both product ids? Or is there a cleaner method without filling in the blank dates?
dat <-structure(list(product_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
Date = c("1/10/12", "1/13/12", "2/14/16", "3/5/16", "1/5/17", "3/15/17", "1/1/11", "3/14/15", "1/15/16", "3/15/17", "4/11/17", "5/16/17"),
Revenue = c("in", "in", "in", "out", "out", "out", "in", "in", "in", "out", "out", "out"),
Value = c(0L, 1L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 0L, 0L, 0L)
),
.Names = c("product_id", "Date", "Revenue", "Value"),
class = "data.frame", row.names = c(NA, -12L))
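One possible way to extend the single-product date filling to every product is to split the data by product_id and apply the same steps to each piece. The sketch below uses only base R and assumes the Date strings are month/day/two-digit-year; fill_dates is a hypothetical helper name, not part of any package.

```r
# Example data from the question (Revenue column omitted for brevity)
dat <- data.frame(
  product_id = rep(1:2, each = 6),
  Date = c("1/10/12", "1/13/12", "2/14/16", "3/5/16", "1/5/17", "3/15/17",
           "1/1/11", "3/14/15", "1/15/16", "3/15/17", "4/11/17", "5/16/17"),
  Value = c(0L, 1L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 0L, 0L, 0L)
)

# Assumption: dates are month/day/two-digit-year
dat$Date <- as.Date(dat$Date, format = "%m/%d/%y")

# Hypothetical helper: build a daily calendar for one product and
# left-join the observed values onto it (missing days become NA)
fill_dates <- function(d) {
  all_dates <- data.frame(Date = seq(min(d$Date), max(d$Date), by = "day"))
  out <- merge(all_dates, d[, c("Date", "Value")], all.x = TRUE)
  out$product_id <- d$product_id[1]
  out
}

# One filled data frame per product, stored in a named list
filled <- lapply(split(dat, dat$product_id), fill_dates)
```

Each element of filled could then be passed through ts() and hybridModel() in the same way as the single-product code, for example inside another lapply().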

Related

How can I create my own factor column in a dataframe?

I have a data frame and the following task: "Define your own criterion of income level, and split data according to levels of this criterion"
dput(head(creditcard))
structure(list(card = structure(c(2L, 2L, 2L, 2L, 2L, 2L), levels = c("no",
"yes"), class = "factor"), reports = c(0L, 0L, 0L, 0L, 0L, 0L
), age = c(37.66667, 33.25, 33.66667, 30.5, 32.16667, 23.25),
income = c(4.52, 2.42, 4.5, 2.54, 9.7867, 2.5), share = c(0.03326991,
0.005216942, 0.004155556, 0.06521378, 0.06705059, 0.0444384
), expenditure = c(124.9833, 9.854167, 15, 137.8692, 546.5033,
91.99667), owner = structure(c(2L, 1L, 2L, 1L, 2L, 1L), levels = c("no",
"yes"), class = "factor"), selfemp = structure(c(1L, 1L,
1L, 1L, 1L, 1L), levels = c("no", "yes"), class = "factor"),
dependents = c(3L, 3L, 4L, 0L, 2L, 0L), days = c(54L, 34L,
58L, 25L, 64L, 54L), majorcards = c(1L, 1L, 1L, 1L, 1L, 1L
), active = c(12L, 13L, 5L, 7L, 5L, 1L), income_fam = c(1.13,
0.605, 0.9, 2.54, 3.26223333333333, 2.5)), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
I defined this criterion in this way
inc_l<-c("low","average","above average","high")
grad_fact<-function(x){
ifelse(x>=10, 'high',
ifelse(x>6 && x<10, 'above average',
ifelse(x>=3 && x<=6,'average',
ifelse(x<3, 'low'))))
}
And added a column like this
creditcard<-transform(creditcard, incom_level=factor(sapply(creditcard$income, grad_fact), inc_l, ordered = TRUE))
But I need to avoid sapply for this, so I tried it this way:
creditcard<-transform(creditcard, incom_level=factor(grad_fact(creditcard$income),inc_l, ordered = TRUE))
But in this case all the elements of the column take the value "average", and I don't understand why. Please help me figure out the problem.
Change && to &. && is not vectorized: it evaluates a single TRUE/FALSE (and in R 4.3.0 and later it raises an error when given a vector longer than one), whereas & compares elementwise. According to ?"&&":
& and && indicate logical AND and | and || indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right, proceeding only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.
In addition, the last ifelse didn't have a no case:
grad_fact<-function(x){
ifelse(x>=10, 'high',
ifelse(x>6 & x<10, 'above average',
ifelse(x>=3 & x<=6,'average',
ifelse(x<3, 'low', NA_character_))))
}
and then use
creditcard <- transform(creditcard, incom_level=
factor(grad_fact(income),inc_l, ordered = TRUE))
-output
creditcard
card reports age income share expenditure owner selfemp dependents days majorcards active income_fam incom_level
1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no 3 54 1 12 1.130000 average
2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no 3 34 1 13 0.605000 low
3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no 4 58 1 5 0.900000 average
4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no 0 25 1 7 2.540000 low
5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no 2 64 1 5 3.262233 above average
6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no 0 54 1 1 2.500000 low
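As a quick check of the vectorized version, the sketch below applies the corrected grad_fact to a few boundary values. The income vector here is made up for illustration and is not taken from creditcard.

```r
inc_l <- c("low", "average", "above average", "high")

# Corrected function from the answer: elementwise & instead of &&
grad_fact <- function(x) {
  ifelse(x >= 10, "high",
    ifelse(x > 6 & x < 10, "above average",
      ifelse(x >= 3 & x <= 6, "average",
        ifelse(x < 3, "low", NA_character_))))
}

# Illustrative incomes, including the boundaries 3, 6 and 10 and a missing value
income <- c(2.42, 3, 6, 9.7867, 10, NA)
levels_out <- factor(grad_fact(income), levels = inc_l, ordered = TRUE)
```

Because every comparison is elementwise, the whole vector is classified in one call, and a missing income stays NA in the resulting factor.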

Subsetting a dataframe based on summation of rows of a given column

I am dealing with data with three variables (i.e. id, time, gender). It looks like
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-12L)
)
That is, each id has four observations for time and gender. For each id, I want to keep the rows up to and including the first row at which the cumulative sum of time reaches 25 or more. Notice that for id 2 all observations are kept (the cumulative sum only reaches 25 at the last row), and for id 3 only the first observation is kept. The expected result would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-8L)
)
Any help on this is highly appreciated.
One option is using lag of cumsum as:
library(dplyr)
df %>% group_by(id,gender) %>%
filter(lag(cumsum(time), default = 0) < 25 )
# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Using data.table: (Updated based on feedback from #Renu)
library(data.table)
setDT(df)
df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
Another option would be to create a logical vector for each 'id', cumsum(time) >= 25, that is TRUE once the cumulative sum of 'time' is equal to or greater than 25.
Then you can filter for rows where the cumsum of this vector is less than or equal to 1, i.e. keep the entries up to and including the first TRUE for each 'id'.
df %>%
group_by(id) %>%
filter(cumsum( cumsum(time) >= 25 ) <= 1)
# A tibble: 8 x 3
# Groups: id [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Another dplyr construction:
dt <- df %>%
group_by(id) %>%
# cumulative sum of time within each id
mutate(sum_time = cumsum(time)) %>%
# keep rows up to and including the first one where the cumulative sum reaches 25
filter(lag(sum_time, default = 0) < 25) %>%
# drop the helper column from the result
select(-sum_time)
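For comparison, the same "cumulative sum before the current row" idea can be sketched in base R with ave(), without any package dependency. This is an alternative to the dplyr and data.table answers above, not a restatement of them; prev_cum and res are names chosen here for illustration.

```r
df <- data.frame(
  id     = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  time   = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
  gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
)

# Cumulative sum of time within each id, excluding the current row
prev_cum <- ave(df$time, df$id, FUN = function(x) cumsum(x) - x)

# Keep a row while the total accumulated before it is still below 25
res <- df[prev_cum < 25, ]
```

Subtracting x from cumsum(x) plays the role of lag(..., default = 0): the row that pushes the total to 25 or more is kept, and everything after it is dropped.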

R - replace values by row given some statement in if loop with another value in same df

I have a dataset with which I want to conduct a multilevel analysis. Therefore I have two rows for every patient, and a column named couple with 1's and 2's (1 = patient, 2 = partner of patient).
Now, I have variables with date of birth and age, for both patient and partner in different columns that are now on the same row.
What I want to do is to write a code that does:
if mydata$couple == 2, then replace mydata$dateofbirthpatient with mydata$dateofbirthpartner
And that for every row. Since I have multiple variables that I want to replace, it would be lovely if I could get this in a loop and just 'add' variables that I want to replace.
What I tried so far:
mydf_longer <- if (mydf_long$couple == 2) {
mydf_long$pgebdat <- mydf_long$prgebdat
}
Of course this didn't work, but simply stated this is what I want.
I also started with this code, following the example in "By row, replace values equal to value in specified column", but I don't know how to finish it:
mydf_longer[6:7][mydf_longer[,1:4]==mydf_longer[2,2]] <-
Any ideas? Let me know if you need more information.
Example of data:
# id couple groep_MNC zkhs fbeh pgebdat p_age pgesl prgebdat pr_age
# 1 3 1 1 1 1 1955-12-01 42.50000 1 <NA> NA
# 1.1 3 2 1 1 1 1955-12-01 42.50000 1 <NA> NA
# 2 5 1 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.5
# 2.1 5 2 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.5
# 3 7 1 1 1 1 1958-04-10 40.25000 1 <NA> NA
# 3.1 7 2 1 1 1 1958-04-10 40.25000 1 <NA> NA
mydf_long <- structure(
list(id = c(3L, 3L, 5L, 5L, 7L, 7L),
couple = c(1L, 2L, 1L, 2L, 1L, 2L),
groep_MNC = c(1L, 1L, 1L, 1L, 1L, 1L),
zkhs = c(1L, 1L, 1L, 1L, 1L, 1L),
fbeh = c(1L, 1L, 1L, 1L, 1L, 1L),
pgebdat = structure(c(-5145, -5145, -9764, -9764, -4284, -4284), class = "Date"),
p_age = c(42.5, 42.5, 55.16667, 55.16667, 40.25, 40.25),
pgesl = c(1L, 1L, 1L, 1L, 1L, 1L),
prgebdat = structure(c(NA, NA, -2815, -2815, NA, NA), class = "Date"),
pr_age = c(NA, NA, 36.5, 36.5, NA, NA)),
.Names = c("id", "couple", "groep_MNC", "zkhs", "fbeh", "pgebdat",
"p_age", "pgesl", "prgebdat", "pr_age"),
row.names = c("1", "1.1", "2", "2.1", "3", "3.1"),
class = "data.frame"
)
The following for loop should work if you only want to change the values based on a condition:
for(i in 1:nrow(mydata)){
if(mydata$couple[i] == 2){
mydata$pgebdat[i] <- mydata$prgebdat[i]
}
}
Or, as suggested by #lmo, the following vectorized assignment is faster:
mydata$pgebdat[mydata$couple == 2] <- mydata$prgebdat[mydata$couple == 2]
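Since the question asks for a loop to which more variables can be added, one possible sketch is a named vector that maps each patient column to its partner column. The name col_pairs and the reduced example data are assumptions made for illustration; only the column names come from the question.

```r
# Reduced version of the example data (only the columns involved)
mydf_long <- data.frame(
  id       = c(3L, 3L, 5L, 5L, 7L, 7L),
  couple   = c(1L, 2L, 1L, 2L, 1L, 2L),
  pgebdat  = structure(c(-5145, -5145, -9764, -9764, -4284, -4284), class = "Date"),
  p_age    = c(42.5, 42.5, 55.16667, 55.16667, 40.25, 40.25),
  prgebdat = structure(c(NA, NA, -2815, -2815, NA, NA), class = "Date"),
  pr_age   = c(NA, NA, 36.5, 36.5, NA, NA)
)

# Hypothetical mapping: names are the columns to overwrite,
# values are the partner columns to copy from
col_pairs <- c(pgebdat = "prgebdat", p_age = "pr_age")

# Rows that describe the partner
is_partner <- mydf_long$couple == 2

for (target in names(col_pairs)) {
  mydf_long[is_partner, target] <- mydf_long[is_partner, col_pairs[[target]]]
}
```

Adding another variable then only requires one more entry in col_pairs. Note that partner rows without partner data end up as NA, as they would with the single-column assignment.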

Calculate difference between two values in grouped sequences

This is a follow-up question for this post:
Loop through dataframe in R and measure time difference between two values
I already got excellent help with the following code to calculate the time difference in minutes between a certain Stimuli and the next Response:
df$Date <- as.POSIXct(strptime(df$Date,"%d.%m.%Y %H:%M"))
df %>%
arrange(User,Date)%>%
mutate(difftime= difftime(lead(Date),Date, units = "mins") ) %>%
group_by(User)%>%
filter((StimuliA==1 | StimuliB==1) & lead(Responses)==1)
Dataset:
structure(list(User = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L), Date = c("24.11.2015 20:39",
"25.11.2015 11:42", "11.01.2016 22:46", "26.11.2015 22:42", "04.03.2016 05:45",
"24.11.2015 13:13", "25.11.2015 13:59", "27.11.2015 12:18", "28.05.2016 06:49",
"06.07.2016 09:46", "03.12.2015 09:32", "07.12.2015 08:18", "08.12.2015 19:40",
"08.12.2015 19:40", "22.12.2015 08:50", "22.12.2015 08:52", "22.12.2015 08:52",
"22.12.2015 20:46"), StimuliA = c(1L, 0L, 0L, 1L, 1L, 1L, 0L,
1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), StimuliB = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L), Responses = c(0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, 1L, 1L, 1L, 1L)), .Names = c("User", "Date", "StimuliA",
"StimuliB", "Responses"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -18L), spec = structure(list(cols = structure(list(
User = structure(list(), class = c("collector_integer", "collector"
)), Date = structure(list(), class = c("collector_character",
"collector")), StimuliA = structure(list(), class = c("collector_integer",
"collector")), StimuliB = structure(list(), class = c("collector_integer",
"collector")), Responses = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("User", "Date", "StimuliA", "StimuliB",
"Responses")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
Goal/Question: The lead function helped to determine the time difference between a Stimuli == 1 (A or B) and the next response, sorted by date/time (Responses == 1). How would I change that code to find the time difference between Stimuli A or B and the LAST response in the sequence, i.e. the last response before the next Stimuli occurs?
Desired output:
User Date StimuliA StimuliB Responses time diff Seq_ID
1 24.11.2015 20:39 1 0 0 1_1_0
1 25.11.2015 11:42 0 0 1 1_1_0
1 11.01.2016 22:46 0 0 1 69247 1_1_0
2 26.11.2015 22:42 1 0 0 2_1_0
2 04.03.2016 05:45 0 1 0 2_1_1
3 24.11.2015 13:13 1 0 0 3_1_0
3 25.11.2015 13:59 0 0 1 1486 3_1_0
3 27.11.2015 12:18 1 0 0 3_2_0
3 28.05.2016 06:49 0 0 1 3_2_0
3 06.07.2016 09:46 0 0 1 319528 3_2_0
4 03.12.2015 09:32 1 0 0 4_1_0
4 07.12.2015 08:18 1 0 0 4_2_0
4 08.12.2015 19:40 0 0 1 2122 4_1_0
4 08.12.2015 19:40 0 1 0 4_2_1
4 22.12.2015 08:50 0 0 1 19510 4_2_1
5 22.12.2015 08:52 0 0 1 5_0_0
5 22.12.2015 08:52 0 0 1 5_0_0
5 22.12.2015 20:46 0 0 1 5_0_0
For Stimuli A this meant the values c(69247, 319528, 2122) and B c(1486, 19510).
Try this.
# df$Date <- as.POSIXct(strptime(df$Date,"%d.%m.%Y %H:%M"))
df %>%
arrange(User, Date) %>%
group_by(User) %>%
mutate(
last.date = Date[which(StimuliA == 1L)[c(1,1:sum(StimuliA == 1L))][cumsum(StimuliA == 1L)+ 1]]
) %>%
mutate(
timesince = ifelse(Responses == 1L, Date - last.date, NA)
)
This works by first creating a column that records the date of the last stimulus, and then using ifelse to get the difference between the current date and that date for rows that are responses. You can then filter to extract only the LAST response.
There is a cleaner way to do the "last.date" operation with zoo::na.locf, but I didn't want to assume you were OK with another package dependency.
EDIT To identify the sequence (if I correctly understand what you mean by "sequence"), continue the chain with
%>% mutate(sequence = cumsum(StimuliA))
to identify sequences defined as observations following a positive Stimuli. To filter out the last response of a sequence, continue the chain with
%>% group_by(User, sequence) %>%
filter(timesince == max(timesince, na.rm = TRUE))
to group by sequence (and user) and then extract the maximum time difference associated with each sequence (which will correspond to the last positive response of a sequence).
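The idea of carrying the last stimulus date forward and then taking the largest difference per sequence can be sketched in base R for a single user. The sketch uses the five rows of User 3 from the question, treats only StimuliA as starting a sequence, and fixes the time zone to UTC as an assumption; the variable names are chosen here for illustration.

```r
# Rows of User 3 from the question, already sorted by time (UTC assumed)
dates <- as.POSIXct(c("2015-11-24 13:13", "2015-11-25 13:59", "2015-11-27 12:18",
                      "2016-05-28 06:49", "2016-07-06 09:46"), tz = "UTC")
stim <- c(1L, 0L, 1L, 0L, 0L)   # StimuliA
resp <- c(0L, 1L, 0L, 1L, 1L)   # Responses

# Sequence number: increases by one at every stimulus (0 = before any stimulus)
seq_id <- cumsum(stim)

# Date of the most recent stimulus for every row (NA before the first stimulus)
stim_dates <- dates[stim == 1L]
last_stim <- stim_dates[ifelse(seq_id == 0L, NA, seq_id)]

# Minutes since the last stimulus, only for response rows
mins <- as.numeric(difftime(dates, last_stim, units = "mins"))
is_resp <- resp == 1L

# The last response of a sequence has the largest time difference
last_resp <- tapply(mins[is_resp], seq_id[is_resp], max)
```

For these rows the result contains one value per sequence, which can be compared with the "time diff" column of the desired output for User 3.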

Compare dates across multiple rows and replace values if condition is met in R

I have a set of dates and times for several individuals (ID) that correspond to our primary outcome measure (Y) and a covariate (X1).
My objective is to replace missing X1 values for each of the Y rows if the X1 measurement was recorded within a +/- 24 hour period from the date/time that the Y variable was measured. To make this easier to visualize (and load into R), here is how the data are currently arranged:
structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), TIME = structure(1:15, .Label = c("01/01/2013 12:01",
"01/03/2013 08:49", "01/03/2013 20:52", "02/01/2013 05:00", "02/03/2013 05:30",
"02/03/2013 21:14", "02/05/2013 05:15", "02/12/2013 05:03", "02/15/2013 04:16",
"02/16/2013 04:12", "02/16/2013 21:02", "03/01/2010 17:58", "03/02/2010 00:10",
"03/03/2010 10:45", "03/04/2010 09:00"), class = "factor"), Y = structure(c(1L,
5L, 7L, 1L, 1L, 2L, 1L, 1L, 1L, 4L, 3L, 1L, 8L, 1L, 6L), .Label = c(".",
"22", "35", "4", "5", "6", "8", "9"), class = "factor"), X1 = structure(c(2L,
1L, 1L, 7L, 7L, 1L, 4L, 4L, 3L, 1L, 1L, 6L, 1L, 5L, 1L), .Label = c(".",
"0.1", "0.2", "0.4", "0.6", "0.9", "1.0"), class = "factor")), .Names = c("ID",
"TIME", "Y", "X1"), class = "data.frame", row.names = c(NA, -15L))
To simplify the desired output, I would like to only display the rows with non-missing Y values, such that the end product would look like this:
ID TIME Y X1
1 1 01/03/2013 08:49 5 .
2 1 01/03/2013 20:52 8 .
3 2 02/03/2013 21:14 22 .
4 2 02/16/2013 04:12 4 0.2
5 2 02/16/2013 21:02 35 .
6 3 03/02/2010 00:10 9 0.9
7 3 03/04/2010 09:00 6 0.6
Is it possible to (1) iterate across multiple rows and evaluate the absolute value of 24 hours to get the difference between the X1 and Y measurements and (2) to replace the missing values of X1 with those that are within the +/- 24 hour window?
Any thoughts on how to go about this would be greatly appreciated!
If you convert your data into an xts object, you can use xts's time-based subsetting to get what you want.
Note: the following code assumes there is at most one X1 value within the 24-hour window around each Y measurement.
require(xts)
xx <- xts(DF[, c(1, 4, 5)], as.POSIXct(paste0(DF$Date, " ", DF$TIME), format = "%m/%d/%Y %H:%M"))
sapply(index(xx[!is.na(xx$Y)]), FUN = function(tt) {
startTime <- tt - 24 * 60 * 60
endTime <- tt + 24 * 60 * 60
y <- xx[paste(startTime, endTime, sep = "/")]
if (nrow(y[!is.na(y$X1), "X1"]) != 0) {
return(as.vector(y[!is.na(y$X1), "X1"]))
} else {
return(NA)
}
})
## [1] 0.9 0.6 NA NA 1.0 0.2 NA
xx[!is.na(xx$Y), "X1"] <- sapply(index(xx[!is.na(xx$Y)]), FUN = function(tt) {
startTime <- tt - 24 * 60 * 60
endTime <- tt + 24 * 60 * 60
y <- xx[paste(startTime, endTime, sep = "/")]
if (nrow(y[!is.na(y$X1), "X1"]) != 0) {
return(as.vector(y[!is.na(y$X1), "X1"]))
} else {
return(NA)
}
})
xx[!is.na(xx$Y), "X1"]
## X1
## 2010-03-02 00:10:00 0.9
## 2010-03-04 09:00:00 0.6
## 2013-01-03 08:49:00 NA
## 2013-01-03 20:52:00 NA
## 2013-02-03 21:14:00 1.0
## 2013-02-16 04:12:00 0.2
## 2013-02-16 21:02:00 NA
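A package-free variant of the same lookup can be sketched in base R. It differs from the xts answer in two assumed ways: it restricts candidates to the same ID, and when several X1 values fall inside the window it takes the nearest one. The data frame d re-enters the question's data with "." replaced by NA, the time zone is fixed to UTC, and nearest_x1 is a hypothetical helper name.

```r
d <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  TIME = c("01/01/2013 12:01", "01/03/2013 08:49", "01/03/2013 20:52",
           "02/01/2013 05:00", "02/03/2013 05:30", "02/03/2013 21:14",
           "02/05/2013 05:15", "02/12/2013 05:03", "02/15/2013 04:16",
           "02/16/2013 04:12", "02/16/2013 21:02", "03/01/2010 17:58",
           "03/02/2010 00:10", "03/03/2010 10:45", "03/04/2010 09:00"),
  Y  = c(NA, 5, 8, NA, NA, 22, NA, NA, NA, 4, 35, NA, 9, NA, 6),
  X1 = c(0.1, NA, NA, 1, 1, NA, 0.4, 0.4, 0.2, NA, NA, 0.9, NA, 0.6, NA)
)
d$TIME <- as.POSIXct(d$TIME, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Hypothetical helper: nearest non-missing X1 of the same ID within +/- 24 hours
nearest_x1 <- function(i) {
  cand <- which(d$ID == d$ID[i] & !is.na(d$X1))
  gap <- abs(as.numeric(difftime(d$TIME[cand], d$TIME[i], units = "hours")))
  keep <- gap <= 24
  if (!any(keep)) return(NA_real_)
  d$X1[cand[keep][which.min(gap[keep])]]
}

# Keep only rows with a Y measurement and fill in X1 where possible
y_rows <- which(!is.na(d$Y))
res <- d[y_rows, ]
res$X1 <- vapply(y_rows, nearest_x1, numeric(1))
```

The filled X1 values agree with the xts output above, apart from row order: res keeps the original row order, while xts sorts by timestamp.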
