I have a dataframe (customer) that looks like this:
email order_no date
a#stack.com 0012 2014-02-13
a#stack.com 0013 2014-03-13
a#stack.com 0014 2014-06-13
b#stack.com 0015 2014-05-13
b#stack.com 0016 2014-05-20
b#stack.com 0017 2014-07-20
I want to create a new field that holds the interval between consecutive orders for each customer. The first step would be to order by customer and then by date ascending:
customer <- arrange(customer, email, date)
The next step would be to iterate through each customer and calculate the order interval so the result set looks like this:
email order_no date days_interval
a#stack.com 0012 2014-02-13 0
a#stack.com 0013 2014-03-13 30
a#stack.com 0014 2014-06-13 90
b#stack.com 0015 2014-05-13 0
b#stack.com 0016 2014-05-20 7
b#stack.com 0017 2014-07-20 60
Can this be achieved without a for loop, and what would be the most efficient way of doing it?
With a for loop, this is what I would do:
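# start every order at 0 days, then fill in the gap to the previous order
customer$days_interval <- 0L
for (i in 2:nrow(customer)) {
  if (customer$email[i] == customer$email[i - 1]) {
    customer$days_interval[i] <- as.integer(
      difftime(customer$date[i], customer$date[i - 1], units = "days"))
  }
}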
Is this feasible without using a for loop?
diff should work for you. It takes a vector of length n and returns a vector of length n-1 with the differences between the vector's items. Below is an example.
> data <- data.frame(name=c("jeff","steve","jim"),date=today()+seq(-3:-5))
> data
name date
1 jeff 2015-04-28
2 steve 2015-04-29
3 jim 2015-04-30
> diff(data$date)
Time differences in days
[1] 1 1
You just need to combine this with your current work. Note that the line below treats the whole date column as one sequence, so it does not restart at each customer:
customer$days_interval <- c(0, diff(customer$date))
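Because diff() on the whole column also subtracts across the boundary between two customers, a per-customer variant may be closer to what the question asks for. The following is a minimal base R sketch, assuming the rows can be sorted by email and date first. The sample data frame is rebuilt from the question, and ave() is a swapped-in base R grouping helper that neither the question nor this answer uses.

```r
# Sample data rebuilt from the question (emails kept as written there)
customer <- data.frame(
  email    = rep(c("a#stack.com", "b#stack.com"), each = 3),
  order_no = c("0012", "0013", "0014", "0015", "0016", "0017"),
  date     = as.Date(c("2014-02-13", "2014-03-13", "2014-06-13",
                       "2014-05-13", "2014-05-20", "2014-07-20"))
)

# Assumption: sort by customer, then by date, before taking differences
customer <- customer[order(customer$email, customer$date), ]

# ave() applies the function within each email group and returns the
# results in the original row order; the first order of a group gets 0
customer$days_interval <- ave(
  as.numeric(customer$date), customer$email,
  FUN = function(d) c(0, diff(d))
)

print(customer)
```

With the sample data this gives 0, 28, 92 for the first customer and 0, 7, 61 for the second, which agrees with the intervals reported by the dplyr answer in this thread (apart from 0 instead of NA for each first order).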
Here's what I'd do, using dplyr and lubridate:
library(dplyr)
library(lubridate)
df %>%
  group_by(email) %>%
  mutate(date = ymd(date)) %>%
  arrange(email, date) %>%   # sort within each customer before lagging
  mutate(days_interval = difftime(date, lag(date), units = "days"))
Here's what I get:
email order_no date days_interval
1 a#stack.com 12 2014-02-13 NA days
2 a#stack.com 13 2014-03-13 28 days
3 a#stack.com 14 2014-06-13 92 days
4 b#stack.com 15 2014-05-13 NA days
5 b#stack.com 16 2014-05-20 7 days
6 b#stack.com 17 2014-07-20 61 days
I'm trying to populate "FinalDate" based on "ExpectedDate" and "ObservedDate".
The rules are: for each group, if observed date is greater than previous expected date and less than the next expected date then final date is equal to observed date, otherwise final date is equal to expected date.
How can I modify the code below to make sure that:
- FinalDate is filled in by Group
- Iteration numbers don't skip any rows
library(dplyr)
set.seed(2)
dat<-data.frame(Group=sample(LETTERS[1:10], 100, replace=TRUE),
Date=sample(seq(as.Date('2013/01/01'), as.Date('2020/01/01'), by="day"), 100))%>%
mutate(ExpectedDate=Date+sample(10:200, 100, replace=TRUE),
ObservedDate=Date+sample(10:200, 100, replace=TRUE))%>%
group_by(Group)%>%
arrange(Date)%>%
mutate(n=row_number())%>%arrange(Group)%>%ungroup()%>%
as.data.frame()
#generate some missing values in "ObservedDate"
dat[sample(nrow(dat),20), "ObservedDate"]<-NA
dat$FinalDate<-NA
for (i in 1:nrow(dat)){
dat[i, "FinalDate"]<-if_else(!is.na(dat$"ObservedDate")[i] &&
dat[i, "ObservedDate"] > dat[i-1, "ExpectedDate"] &&
dat[i, "ObservedDate"] < dat[i+1, "ExpectedDate"],
dat[i, "ObservedDate"],
dat[i,"ExpectedDate"])
}
dat$FinalDate<-as.Date(dat$FinalDate) # convert numeric to Date format
For example, in the output below:
- At i=90, the code looks for the previous ExpectedDate within letter I. We want it to look for ExpectedDate only within letter J. If there is no previous expected date for a group and ObservedDate is greater than ExpectedDate but less than the next ExpectedDate then FinalDate should be filled with ExpectedDate.
- At i=100, the code generates NA because there is no next observation available. We want this value to be filled in such that for the last observation in each group, FinalDate=ObservedDate if ObservedDate is greater than this last ExpectedDate within the group, else ExpectedDate.
Group Date ExpectedDate ObservedDate n FinalDate
88 I 2015-09-07 2015-12-05 <NA> 7 2015-12-05
89 I 2018-08-02 2018-11-01 2018-08-13 8 2018-11-01
90 J 2013-07-24 2013-08-30 2013-08-12 1 2013-08-30
91 J 2013-11-22 2014-01-02 2014-04-05 2 2014-04-05
92 J 2014-11-03 2015-03-23 2015-05-10 3 2015-05-10
93 J 2015-08-30 2015-12-09 2016-02-04 4 2016-02-04
94 J 2016-04-18 2016-09-03 <NA> 5 2016-09-03
95 J 2016-10-10 2017-01-29 2017-04-14 6 2017-04-14
96 J 2017-02-14 2017-07-05 <NA> 7 2017-07-05
97 J 2017-04-21 2017-10-01 2017-08-26 8 2017-08-26
98 J 2017-10-01 2018-01-27 2018-02-28 9 2018-02-28
99 J 2018-08-03 2019-01-31 2018-10-20 10 2018-10-20
100 J 2019-04-25 2019-06-23 2019-08-16 11 <NA>
We can drop the for loop and use group_by, lag and lead from dplyr here:
library(dplyr)
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(ObservedDate > lag(ExpectedDate) &
ObservedDate < lead(ExpectedDate), ObservedDate, ExpectedDate))
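We can also do this with data.table::between. Note that between() includes both bounds by default; pass incbounds = FALSE if the comparison should be strict, as in the rule stated in the question: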
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(data.table::between(ObservedDate,
lag(ExpectedDate), lead(ExpectedDate)), ObservedDate, ExpectedDate))
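One thing to watch with lag() and lead(): the first and last row of each group are compared against NA, so if_else() returns NA there, and the same happens wherever ObservedDate is missing. The following is a minimal base R sketch of one way to handle those cases. The helper name fill_final_date and the toy data are hypothetical, and the edge rule is an assumption: a missing neighbour is treated as placing no constraint on that side, which is simpler than the first-row and last-row rules the question describes and would need adapting to match them exactly.

```r
# Hypothetical helper (not from the original answer); expects the rows
# of ONE group, already sorted by date
fill_final_date <- function(expected, observed) {
  n <- length(expected)
  # Neighbouring expected dates; NA where the group has no previous/next row
  prev_exp <- c(as.Date(NA), expected[-n])
  next_exp <- c(expected[-1], as.Date(NA))
  # Assumption: a missing neighbour places no constraint on that side
  after_prev  <- is.na(prev_exp) | observed > prev_exp
  before_next <- is.na(next_exp) | observed < next_exp
  # A missing ObservedDate always falls back to ExpectedDate
  use_obs <- !is.na(observed) & after_prev & before_next
  out <- expected
  out[use_obs] <- observed[use_obs]
  out
}

# Toy data (made up for illustration), sorted by group and date
toy <- data.frame(
  Group        = c("A", "A", "A", "B", "B"),
  ExpectedDate = as.Date(c("2020-01-10", "2020-02-10", "2020-03-10",
                           "2020-05-01", "2020-06-01")),
  ObservedDate = as.Date(c("2020-01-05", NA, "2020-03-20",
                           "2020-07-01", "2020-04-15"))
)

# Apply the helper to the row indices of each group
toy$FinalDate <- toy$ExpectedDate
for (rows in split(seq_len(nrow(toy)), toy$Group)) {
  toy$FinalDate[rows] <- fill_final_date(toy$ExpectedDate[rows],
                                         toy$ObservedDate[rows])
}
print(toy)
```

In a dplyr pipeline the same helper could be called inside a grouped mutate() on ExpectedDate and ObservedDate, so the per-group loop is only there to keep the sketch free of package dependencies.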
Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
The current date is 20/04/2017. How can I group the data from 11/03/2017 to 19/04/2017 into 4 equal parts and sum the sales for each part in R?
For example, I tried:
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data,ep,sum)
This is not working: it takes the range from the start date to the current date, but I need to gather the data from the start date to yesterday (19/4/2017) and split it into 4 equal parts.
Could anyone kindly guide me?
Thank you
Base R has a function cut.Date() which is built for the purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements supplied in the Q and the additional comment is:
1. Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
2. Group the data in four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the Q and the attempt to use xts::endpoints() with k = 4 indicate that the OP might have a different intention, namely to group the data in periods of four days each.)
3. Summarize the sales figures by period.
For the sake of brevity, data.table is used here for data manipulation and aggregation, lubridate for date manipulation
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the current date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
, start_date_of_period := cut.Date(Date, 4)][
, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note: If you want to reproduce the result in the future, you will have to replace the call to today(), which excludes the current day, by mdy("4/20/2017"), which is the last day in the sample data set supplied by the OP.
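If the OP instead meant periods of four days each (the alternative reading mentioned above), cut() on a Date vector also accepts an interval specification such as "4 days". The following is a minimal base R sketch of that reading, using only the first eight rows of the question's data; the object names sales, period and totals are made up for illustration.

```r
# First eight days of the sample data from the question
sales <- data.frame(
  Date  = seq(as.Date("2017-03-11"), by = "day", length.out = 8),
  Sales = c(1, 0, 40, 47, 83, 62, 13, 58)
)

# Each factor level is labelled with the first day of its 4-day period
period <- cut(sales$Date, breaks = "4 days")

# Sum the sales within each period
totals <- tapply(sales$Sales, period, sum)
print(totals)
```

Here the two periods start on 2017-03-11 and 2017-03-15 and sum to 88 and 216. Unlike cut.Date(Date, 4), the number of periods now grows with the length of the date range.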
I know that the lubridate package can give me the weekday for a single date. I am now dealing with a large dataset with many date entries, and I want to extract the weekday for each of them. Looking up every date by hand is not practical, so I would love a function that takes the date column of my data frame and returns the weekday corresponding to each date.
My data frame looks like this:
uinq_id Product_ID Date_of_order count
1 Aarkios04_2014-09-09 Aarkios04 2014-09-09 10
2 ABEE01_2014-08-18 ABEE01 2014-08-18 1
3 ABEE01_2014-08-19 ABEE01 2014-08-19 0
4 ABEE01_2014-08-20 ABEE01 2014-08-20 0
5 ABEE01_2014-08-21 ABEE01 2014-08-21 0
6 ABEE01_2014-08-22 ABEE01 2014-08-22 0
I am trying to generate:
uinq_id Product_ID Date_of_order count weekday
1 Aarkios04_2014-09-09 Aarkios04 2014-09-09 10 Tues
2 ABEE01_2014-08-18 ABEE01 2014-08-18 1 Mon
3 ABEE01_2014-08-19 ABEE01 2014-08-19 0 Tues
4 ABEE01_2014-08-20 ABEE01 2014-08-20 0 Wed
5 ABEE01_2014-08-21 ABEE01 2014-08-21 0 Thurs
6 ABEE01_2014-08-22 ABEE01 2014-08-22 0 Fri
Any help would be highly appreciated.
Thank you.
Using weekdays from base R you can do this for a vector all at once:
temp = data.frame(timestamp = Sys.Date() + 1:20)
> head(temp)
timestamp
1 2016-06-01
2 2016-06-02
3 2016-06-03
4 2016-06-04
5 2016-06-05
6 2016-06-06
temp$weekday = weekdays(temp$timestamp)
> head(temp)
timestamp weekday
1 2016-06-01 Wednesday
2 2016-06-02 Thursday
3 2016-06-03 Friday
4 2016-06-04 Saturday
5 2016-06-05 Sunday
6 2016-06-06 Monday
We can use format to get the output
df1$weekday <- format(as.Date(df1$Date_of_order), "%a")
df1$weekday
#[1] "Tue" "Mon" "Tue" "Wed" "Thu" "Fri"
According to ?strptime
%a - Abbreviated weekday name in the current locale on this platform.
(Also matches full name on input: in some locales there are no
abbreviations of names.)
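Since %a follows the current locale, the labels can differ from machine to machine. If fixed labels are needed, such as the "Tues" and "Thurs" spellings shown in the question, one option is to index a label vector by the numeric weekday. The following is a minimal base R sketch; the data frame orders and the vector day_labels are made up for illustration.

```r
# Dates taken from the question's sample data
orders <- data.frame(
  Date_of_order = as.Date(c("2014-09-09", "2014-08-18", "2014-08-19",
                            "2014-08-20", "2014-08-21", "2014-08-22"))
)

# Custom labels in the spelling the question asks for
day_labels <- c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat")

# as.POSIXlt()$wday counts days since Sunday (0 = Sunday, ..., 6 = Saturday),
# so add 1 to turn it into an index for day_labels
orders$weekday <- day_labels[as.POSIXlt(orders$Date_of_order)$wday + 1]

print(orders)
```

This works on the whole column at once, just like weekdays() and format(), but the result does not depend on the locale.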
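# weekdays() is in base R, so no extra package is needed;
# the format must match the data, which uses dashes (e.g. 2014-09-09)
date <- as.Date(yourdata$Date_of_order, format = "%Y-%m-%d")
yourdata$WeekDay <- weekdays(date)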
Here, I have a data set with start dates, end dates, and usage. I have calculated the number of days between the two dates and derived the daily usage (I am okay with one flat usage for each day for now).
Now, what I want to achieve is the sum of the usage for each day in the month of June across those time frames. For example, the first case contributes just its DAILY_USAGE to June 1st:
START_DATE END_DATE x DAYS DAILY_USAGE
1 2015-05-01 2015-06-01 261605.00 32 8175.156250
And for the second, I want to add the usage of 3905 to June 1st and also to June 2nd, because its time frame spans both days.
2015-05-04 2015-06-02 117159.00 30 3905.3000000
I want to do this for all 387 rows and end up with the sum of usage for each day, but I do not know how to do this for hundreds of records.
This is what my data set looks like right now:
str(YYY)
'data.frame': 387 obs. of 5 variables:
$ START_DATE : Date, format: "2015-05-01" "2015-05-04" "2015-05-11" "2015-05-13" ...
$ END_DATE : Date, format: "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01" ...
$ x : num 261605 1380796 183 103 489 ...
$ DAYS : num 32 29 22 20 19 12 1 34 30 29 ...
$ DAILY_USAGE: num 8175.16 47613.66 8.32 5.13 25.74 ...
Also, the head:
START_DATE END_DATE x DAYS DAILY_USAGE
1 2015-05-01 2015-06-01 261605.00 32 8175.1562500
2 2015-05-04 2015-06-01 1380796.00 29 47613.6551724
6 2015-05-21 2015-06-01 1392.00 12 116.0000000
7 2015-06-01 2015-06-01 2503.00 1 2503.0000000
8 2015-04-30 2015-06-02 0.00 34 0.0000000
9 2015-05-04 2015-06-02 117159.00 30 3905.3000000
10 2015-05-05 2015-06-02 193334.00 29 6666.6896552
13 2015-05-04 2015-06-03 630.00 31 20.3225806
and so on.
Example data set and results
I will call this data set EXAMPLE1 (mocked-up data for 3 days):
START_DATE END_DATE x DAYS DAILY_USAGE
5/1/2015 6/1/2015 261605 32 8175.15625
5/4/2015 6/1/2015 1380796 29 47613.65517
5/11/2015 6/1/2015 183 22 8.318181818
4/30/2015 6/2/2015 0 34 0
5/20/2015 6/2/2015 70 14 5
6/1/2015 6/2/2015 569 2 284.5
6/1/2015 6/3/2015 582 3 194
6/2/2015 6/3/2015 6 2 3
For the above example, the answer should look like this:
DAY USAGE
6/1/2015 56280.6296
6/2/2015 486.5
6/3/2015 197
How?
In Example 1, for June 1st, I added the usage of all rows except the last one, because the time frame of the last row does not include 06/01: it starts on 06/02 and ends on 06/03.
To get June 2nd, I added the usage of rows 4 to 8, because June 2nd lies between the start and end dates of all of those rows.
For June 3rd, I added only the last two rows to get 197.
So which rows to sum depends on the time frame given by START_DATE and END_DATE.
Hope this helps!
There might be an easier trick for this than writing 400 lines of if-else statements.
Thank you again for your time!
-Gyve
library(lubridate)
indx <- lapply(unique(mdy(df[,2])), '%within%', interval(mdy(df[,1]), mdy(df[,2])))
cbind.data.frame(DAY=unique(df$END_DATE),
USAGE=unlist(lapply(indx, function(x) sum(df$DAILY_USAGE[x]))))
# DAY USAGE
# 1 6/1/2015 56280.63
# 2 6/2/2015 486.50
# 3 6/3/2015 197.00
Explanation
We can expand it to explain what is happening:
indx <- lapply(unique(mdy(df[,2])), '%within%', interval(mdy(df[,1]), mdy(df[,2])))
Each unique end date is tested for whether it falls within the range of days spanned by the first and second columns. mdy is a quick way to convert such strings to Date objects with lubridate. The operator %within% tests a date against an interval, and the intervals are created with interval(start, end) from those two columns. The result is a list of logical indexes, one per unique end date, that we can subset the data by.
In our final data frame,
cbind.data.frame(DAY=unique(df$END_DATE),
creates the first column of dates.
And,
USAGE=unlist(lapply(indx, function(x) sum(df$DAILY_USAGE[x])))
takes the sum of df$DAILY_USAGE by the index that we created.
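The same idea can be written without lubridate by comparing dates directly. The following is a minimal base R sketch using the EXAMPLE1 rows from the question; the object names usage, days and totals are made up for illustration, and the dates are typed in ISO format rather than parsed from the m/d/Y strings.

```r
# EXAMPLE1 from the question, with dates entered as Date objects
usage <- data.frame(
  START_DATE  = as.Date(c("2015-05-01", "2015-05-04", "2015-05-11",
                          "2015-04-30", "2015-05-20", "2015-06-01",
                          "2015-06-01", "2015-06-02")),
  END_DATE    = as.Date(c("2015-06-01", "2015-06-01", "2015-06-01",
                          "2015-06-02", "2015-06-02", "2015-06-02",
                          "2015-06-03", "2015-06-03")),
  DAILY_USAGE = c(8175.15625, 47613.65517, 8.318181818, 0, 5, 284.5, 194, 3)
)

# Days to report on: every distinct end date, in ascending order
days <- sort(unique(usage$END_DATE))

# For each day, sum DAILY_USAGE over the rows whose time frame contains it
totals <- sapply(days, function(d) {
  sum(usage$DAILY_USAGE[usage$START_DATE <= d & d <= usage$END_DATE])
})

result <- data.frame(DAY = days, USAGE = totals)
print(result)
```

As in the lubridate version, only days that occur as an end date are reported. To cover every day of June, days could be replaced by a full sequence built with seq() between the first and last day of the month.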
This is what my data looks like. I would like to convert date and time columns to a time stamp and put it in a single column.
Any help appreciated. Thanks
DATE TIME CLOSE HIGH LOW OPEN VOLUME
1 20150216 1520 2283.85 2284 2275.6 2275.6 48309
2 20150216 1530 2282 2284 2273.15 2283.85 108856
3 20150218 920 2276.1 2280.1 2260.6 2280.1 94279
4 20150218 930 2271.6 2277.95 2271 2276.1 65932
5 20150218 940 2270.35 2275 2268.2 2271.6 53595
6 20150218 950 2270.65 2271.2 2265.55 2270.5 34546
7 20150218 1000 2274.15 2274.25 2268.65 2270.6 35414
8 20150218 1010 2270.1 2274.9 2267.1 2274.25 37334
You can try
df$DateTime <- as.POSIXct(sprintf('%08d %04d', df$DATE, df$TIME),
format ='%Y%m%d %H%M')
df1 <- df[-(1:2)]
head(df1,2)
# CLOSE HIGH LOW OPEN VOLUME DateTime
#1 2283.85 2284 2275.60 2275.60 48309 2015-02-16 15:20:00
#2 2282.00 2284 2273.15 2283.85 108856 2015-02-16 15:30:00
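The zero padding in sprintf() matters for times before 10:00: a TIME of 920 has to become "0920" so that %H%M can read it as 09:20. The following is a minimal sketch of that step on two rows from the question; the data frame name quotes is made up, DATE and TIME are assumed to be integer columns (as %d requires), and tz = "UTC" is set only to make the result reproducible across machines.

```r
# Two rows from the question; DATE and TIME assumed to be integers
quotes <- data.frame(DATE = c(20150216L, 20150218L),
                     TIME = c(1520L, 920L))

# Pad DATE to 8 digits and TIME to 4 digits, so 920 becomes "0920"
stamp <- sprintf("%08d %04d", quotes$DATE, quotes$TIME)
print(stamp)

# Parse the padded strings; the time zone is fixed here as an assumption
quotes$DateTime <- as.POSIXct(stamp, format = "%Y%m%d %H%M", tz = "UTC")
print(format(quotes$DateTime, "%Y-%m-%d %H:%M"))
```

The padded strings are "20150216 1520" and "20150218 0920", which parse to 2015-02-16 15:20 and 2015-02-18 09:20.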
Update
If you need to convert to xts, instead of creating a new column, we can remove the columns that are not needed (df[-(1:2)]) and specify order.by as the datetime vector ('indx')
library(xts)
indx <- as.POSIXct(sprintf('%08d %04d', df$DATE, df$TIME),
format ='%Y%m%d %H%M')
xt1 <- xts(df[-(1:2)], order.by=indx)