I have a daily revenue time series df from 01-01-2014 to 15-06-2017, and I want to aggregate the daily data to weekly revenue and make weekly predictions. Before aggregating, I need to create a continuous week variable, one that does NOT restart from week 1 when a new year begins. Since 01-01-2014 was not a Monday, I decided to start my first week on 06-01-2014.
My df now looks like this:
date year month total
7 2014-01-06 2014 1 1857679.4
8 2014-01-07 2014 1 1735488.0
9 2014-01-08 2014 1 1477269.9
10 2014-01-09 2014 1 1329882.9
11 2014-01-10 2014 1 1195215.7
...
709 2017-06-14 2017 6 1677476.9
710 2017-06-15 2017 6 1533083.4
I want to create a unique week variable starting from 2014-01-06 until the last row of my dataset (1257 rows in total), which is 2017-06-15.
I wrote a loop:
week = c()
for (i in 1:179) {
  week = rep(i, 7)
  print(week)
}
However, the result of this loop is not saved across iterations. When I type week, it just shows 179, 179, 179, 179, 179, 179, 179.
Where is the problem, and how can I append 180, 180, 180, 180 after the loop (for the final partial week)?
And if I add more data after 2017-06-15, how can I create the week variable automatically based on the last date in the data? (In other words, I don't want to have to count my daily observations, divide by 7, and handle the leftover days by hand to build the week index.)
Thank you!
Does this work?
library(lubridate)
#DATA
x = data.frame(date = seq.Date(from = ymd("2014-01-06"),
to = ymd("2017-06-15"), length.out = 15))
#Add year and week for each date
x$week = year(x$date) + week(x$date)/100
#Convert the addition of year and week to factor and then to numeric
x$week_variable = as.numeric(as.factor(x$week))
#Another alternative
x$week_variable2 = floor(as.numeric(x$date - min(x$date))/7) + 1
x
# date week week_variable week_variable2
#1 2014-01-06 2014.01 1 1
#2 2014-04-05 2014.14 2 13
#3 2014-07-04 2014.27 3 26
#4 2014-10-02 2014.40 4 39
#5 2014-12-30 2014.52 5 52
#6 2015-03-30 2015.13 6 65
#7 2015-06-28 2015.26 7 77
#8 2015-09-26 2015.39 8 90
#9 2015-12-24 2015.52 9 103
#10 2016-03-23 2016.12 10 116
#11 2016-06-21 2016.25 11 129
#12 2016-09-18 2016.38 12 141
#13 2016-12-17 2016.51 13 154
#14 2017-03-17 2017.11 14 167
#15 2017-06-15 2017.24 15 180
Here is the answer:
week = c()
for (i in 1:184) {
  for (j in 1:7) {
    week[j + (i - 1) * 7] = i
  }
}
week = as.data.frame(week)
I created a week variable running from week 1 to week 184 (the end of my dataset). Each week number is repeated 7 times because there are 7 days in a week. Then I assigned the week variable to my data frame.
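The nested loop works, but the same result falls out of a single rep() call, and the week count can even be derived from the date range itself, so new data past 2017-06-15 is handled automatically. A minimal sketch (the dates vector here is a stand-in for df$date):

```r
# Vectorized equivalent of the nested loop: repeat each week index 7 times.
week <- rep(1:184, each = 7)

# Deriving the week index from the dates themselves, so it adapts when
# new rows are appended; `dates` is a stand-in for df$date.
dates <- seq(as.Date("2014-01-06"), as.Date("2017-06-15"), by = "day")
week_auto <- floor(as.numeric(dates - min(dates)) / 7) + 1
```

With this second form there is no need to know the number of weeks in advance; the last (partial) week simply gets the next index.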
I have some data that looks like
CustomerID InvoiceDate
<fctr> <dttm>
1 13313 2011-01-04 10:00:00
2 18097 2011-01-04 10:22:00
3 16656 2011-01-04 10:23:00
4 16875 2011-01-04 10:37:00
5 13094 2011-01-04 10:37:00
6 17315 2011-01-04 10:38:00
7 16255 2011-01-04 11:30:00
8 14606 2011-01-04 11:34:00
9 13319 2011-01-04 11:40:00
10 16282 2011-01-04 11:42:00
It tells me when a person makes a transaction. I would like to know the time between transactions for each customer, preferably in days. I do this in the following way:
d <- data %>%
arrange(CustomerID,InvoiceDate) %>%
group_by(CustomerID) %>%
mutate(delta.t = InvoiceDate - lag(InvoiceDate), #calculating the difference
delta.day = as.numeric(delta.t, unit = 'days')) %>%
na.omit() %>%
arrange(CustomerID) %>%
inner_join(Ntrans) %>% #Existing data.frame telling me the number of transactions per customer
filter(N>=10) %>% #only want people with more than 10 transactions
select(-N)
However, the result doesn't make sense (shown below):
CustomerID InvoiceDate delta.t delta.day
<fctr> <dttm> <time> <dbl>
1 12415 2011-01-10 09:58:00 5686 days 5686
2 12415 2011-02-15 09:52:00 51834 days 51834
3 12415 2011-03-03 10:59:00 23107 days 23107
4 12415 2011-04-01 14:28:00 41969 days 41969
5 12415 2011-05-17 15:42:00 66314 days 66314
6 12415 2011-05-20 14:13:00 4231 days 4231
7 12415 2011-06-15 13:37:00 37404 days 37404
8 12415 2011-07-13 15:30:00 40433 days 40433
9 12415 2011-07-13 15:31:00 1 days 1
10 12415 2011-07-19 10:51:00 8360 days 8360
The differences measured in days are way off. What I want is something like SQL's window functions partitioned over CustomerID. How can I implement this?
If you just want to convert the difference to days, you can use the lubridate package.
> library('lubridate')
> library('dplyr')
>
> InvoiceDate <- c('2011-01-10 09:58:00', '2011-02-15 09:52:00', '2011-03-03 10:59:00')
> CustomerID <- c(111, 111, 111)
>
> dat <- data.frame('Invo' = InvoiceDate, 'ID' = CustomerID)
>
> dat %>% mutate('Delta' = as_date(Invo) - as_date(lag(Invo)))
Invo ID Delta
1 2011-01-10 09:58:00 111 NA days
2 2011-02-15 09:52:00 111 36 days
3 2011-03-03 10:59:00 111 16 days
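For the full per-customer case, the same idea combines with group_by(); a minimal sketch with made-up data, assuming InvoiceDate is parsed to POSIXct up front (leaving it as a factor or character is a common cause of nonsense differences like those in the question):

```r
library(dplyr)

# Toy data: two customers, timestamps parsed to POSIXct.
dat <- data.frame(
  CustomerID  = c(111, 111, 111, 222, 222),
  InvoiceDate = as.POSIXct(c("2011-01-10 09:58:00", "2011-02-15 09:52:00",
                             "2011-03-03 10:59:00", "2011-01-04 10:00:00",
                             "2011-01-06 10:00:00"), tz = "UTC")
)

d <- dat %>%
  arrange(CustomerID, InvoiceDate) %>%
  group_by(CustomerID) %>%
  # lag() respects the grouping, so each customer's first row gets NA;
  # units = "days" makes the conversion explicit regardless of the
  # automatic unit difftime would otherwise pick.
  mutate(delta.day = as.numeric(InvoiceDate - lag(InvoiceDate),
                                units = "days")) %>%
  ungroup()
```

The key point is that the subtraction happens on real date-time values, and the unit is fixed at conversion time rather than left to difftime's automatic choice.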
I have a dataframe (customer) that looks like this:
email order_no date
a#stack.com 0012 2014-02-13
a#stack.com 0013 2014-03-13
a#stack.com 0014 2014-06-13
b#stack.com 0015 2014-05-13
b#stack.com 0016 2014-05-20
b#stack.com 0017 2014-07-20
I want to create a new field that records the interval between orders for each customer. The first step would be to order by date, ascending:
customer <- arrange(customer, date)
The next step would be to iterate through each customer and calculate the order interval so the result set looks like this:
email order_no date days_interval
a#stack.com 0012 2014-02-13 0
a#stack.com 0013 2014-03-13 30
a#stack.com 0014 2014-06-13 90
b#stack.com 0015 2014-05-13 0
b#stack.com 0016 2014-05-20 7
b#stack.com 0017 2014-07-20 60
Can this be achieved without using a for loop? What's the most efficient way of doing this?
With a for loop, this is what I would do:
for (i in 2:nrow(customer)) {
  if (customer$email[i] == customer$email[i-1]) {
    customer$interval[i] <- as.integer(difftime(customer$date[i], customer$date[i-1]))
  }
}
Is this feasible without using a for loop?
diff should work for you. It takes a vector of length n and returns a vector of length n-1 containing the differences between consecutive items. Below is an example.
> data <- data.frame(name=c("jeff","steve","jim"),date=today()+seq(-3:-5))
> data
name date
1 jeff 2015-04-28
2 steve 2015-04-29
3 jim 2015-04-30
> diff(data$date)
Time differences in days
[1] 1 1
You just need to combine this with your existing work, for example:
customer$days_interval <- c(0, diff(customer$date))
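One caveat: diff() over the whole date column ignores the customer boundaries, so each customer's first order would get the gap to the previous customer's last order. Base R's ave() applies the same diff() logic within each email group, no loop required; a sketch using the data from the question:

```r
customer <- data.frame(
  email = c("a@stack.com", "a@stack.com", "a@stack.com",
            "b@stack.com", "b@stack.com", "b@stack.com"),
  date  = as.Date(c("2014-02-13", "2014-03-13", "2014-06-13",
                    "2014-05-13", "2014-05-20", "2014-07-20"))
)
customer <- customer[order(customer$email, customer$date), ]

# ave() evaluates FUN separately for each email group; prepending 0
# reproduces the "first order has interval 0" convention.
customer$days_interval <- ave(as.numeric(customer$date), customer$email,
                              FUN = function(x) c(0, diff(x)))
```

Note the actual intervals come out as 28, 92, 7, and 61 days; the 30/90/60 figures in the question appear to be rough estimates.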
Here's what I'd do, using dplyr and lubridate:
library(dplyr)
library(lubridate)
df %>%
group_by(email) %>%
mutate(date = ymd(date)) %>%
arrange(date) %>%
mutate(days_interval = difftime(date, lag(date), units="days"))
Here's what I get:
email order_no date days_interval
1 a#stack.com 12 2014-02-13 NA days
2 a#stack.com 13 2014-03-13 28 days
3 a#stack.com 14 2014-06-13 92 days
4 b#stack.com 15 2014-05-13 NA days
5 b#stack.com 16 2014-05-20 7 days
6 b#stack.com 17 2014-07-20 61 days
I am downloading data from Bloomberg and then updating it in R. The data frame looks as follows:
ticker date PX_LAST PE_RATIO
1 SMI Index 2014-09-30 8835.14 20.3692
2 SMI Index 2014-10-31 8837.78 20.3753
3 DAX Index 2014-09-30 9474.30 16.6487
4 DAX Index 2014-10-31 9326.87 16.3896
5 SMI Index 2014-11-28 9150.46 21.0962
6 SMI Index 2014-12-31 8983.37 20.6990
7 DAX Index 2014-11-28 9980.85 17.5388
8 DAX Index 2014-12-31 9805.55 16.8639
Now I would like to have this data frame sorted by ticker (not alphabetically, but in the order in which the tickers first appear) and then by date, so that the end result would be:
ticker date PX_LAST PE_RATIO
1 SMI Index 2014-09-30 8835.14 20.3692
2 SMI Index 2014-10-31 8837.78 20.3753
3 SMI Index 2014-11-28 9150.46 21.0962
4 SMI Index 2014-12-31 8983.37 20.6990
5 DAX Index 2014-09-30 9474.30 16.6487
6 DAX Index 2014-10-31 9326.87 16.3896
7 DAX Index 2014-11-28 9980.85 17.5388
8 DAX Index 2014-12-31 9805.55 16.8639
There's a function chgroup() in the data.table package that does exactly what you're looking for. It groups identical values of a vector together while preserving their original order of appearance. It is only available for character vectors (the ch stands for character).
require(data.table)
DF[chgroup(DF$ticker), ]
# ticker date PX_LAST PE_RATIO
# 1 SMIIndex 2014-09-30 8835.14 20.3692
# 2 SMIIndex 2014-10-31 8837.78 20.3753
# 5 SMIIndex 2014-11-28 9150.46 21.0962
# 6 SMIIndex 2014-12-31 8983.37 20.6990
# 3 DAXIndex 2014-09-30 9474.30 16.6487
# 4 DAXIndex 2014-10-31 9326.87 16.3896
# 7 DAXIndex 2014-11-28 9980.85 17.5388
# 8 DAXIndex 2014-12-31 9805.55 16.8639
If your ticker column is a factor, convert it to character type first.
You could try order
df$ticker <- factor(df$ticker, levels=unique(df$ticker))
df1 <- df[with(df, order(ticker, date)),]
row.names(df1) <- NULL
df1
# ticker date PX_LAST PE_RATIO
#1 SMI Index 2014-09-30 8835.14 20.3692
#2 SMI Index 2014-10-31 8837.78 20.3753
#3 SMI Index 2014-11-28 9150.46 21.0962
#4 SMI Index 2014-12-31 8983.37 20.6990
#5 DAX Index 2014-09-30 9474.30 16.6487
#6 DAX Index 2014-10-31 9326.87 16.3896
#7 DAX Index 2014-11-28 9980.85 17.5388
#8 DAX Index 2014-12-31 9805.55 16.8639
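If you'd rather not touch the factor levels at all, match() against unique() produces the same first-appearance ordering directly; a sketch rebuilt from the question's data (PE_RATIO omitted for brevity):

```r
df <- data.frame(
  ticker  = c("SMI Index", "SMI Index", "DAX Index", "DAX Index",
              "SMI Index", "SMI Index", "DAX Index", "DAX Index"),
  date    = as.Date(c("2014-09-30", "2014-10-31", "2014-09-30", "2014-10-31",
                      "2014-11-28", "2014-12-31", "2014-11-28", "2014-12-31")),
  PX_LAST = c(8835.14, 8837.78, 9474.30, 9326.87,
              9150.46, 8983.37, 9980.85, 9805.55)
)

# match() maps each ticker to its position among the unique values,
# i.e. the order of first appearance, so no factor conversion is needed.
df1 <- df[order(match(df$ticker, unique(df$ticker)), df$date), ]
row.names(df1) <- NULL
```

This keeps ticker as plain character data, which avoids surprises if the column is later combined with other data.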
I would like to extract certain rows from a data frame that contains one column of dates (column C). Here is a small example; After shows the desired output:
Before <- data.frame(A=c("0010","0011","0012","0015","0024","0032","0032","0033","0039","0039","0039","0041","0054"),
B=c(11,12,11,11,12,12,12,11,"NA","NA",11,11,11),
C=c("2014-01-07","2013-06-03","2013-07-29","2014-07-14","2012-12-17","2013-08-21","2013-08-21","2014-07-11","2012-10-06","2012-10-06","2013-10-22","2014-05-28","2014-03-26"))
After <- data.frame(A=c("0010","0011","0012","0015","0024","0032","0033","0039","0041","0054"),
B=c(11,12,11,11,12,12,11,11,11,11),
C=c("2014-01-07","2013-06-03","2013-07-29","2014-07-14","2012-12-17","2013-08-21","2014-07-11","2013-10-22","2014-05-28","2014-03-26"))
So what I'm aiming for is:
Keep only the entry with the latest date (out of rows 9, 10, 11 in Before) --> keep only row 8 in After
Return identical entries only once (rows 6 and 7 in Before) --> keep only row 6 in After
I wasn't able to find a solution using subset, unique etc. Any help appreciated!
Here are two data.table variations depending on the assumptions on data:
Assuming that your data already has the latest date for each group of A as the last element:
require(data.table)
setDT(Before)[, .SD[.N], by=A]
.SD holds the Subset of Data for each group in A, and .N holds the number of observations in that group. So .SD[.N] gives us the last observation for each group.
Without any assumptions:
require(data.table)
setDT(Before)[, C := as.Date(C)][, .SD[which.max(C)], by=A]
Here, we first replace C with as.Date(C) using data.table's := operator, which modifies by reference (without making any copy, hence fast and memory efficient). Then, for each subset of the data grouped by A, we select the row corresponding to the maximum value of C.
HTH
require(dplyr)
Before %>%
mutate(C=as.Date(C)) %>%
group_by(A) %>%
arrange(A,desc(C)) %>%
filter(row_number()==1)
#Source: local data frame [10 x 3]
#Groups: A
# A B C
#1 0010 11 2014-01-07
#2 0011 12 2013-06-03
#3 0012 11 2013-07-29
#4 0015 11 2014-07-14
#5 0024 12 2012-12-17
#6 0032 12 2013-08-21
#7 0033 11 2014-07-11
#8 0039 11 2013-10-22
#9 0041 11 2014-05-28
#10 0054 11 2014-03-26
split-apply-combine:
Before$C <- as.Date(Before$C)
library(plyr)
ddply(Before, .(A), function(df) {
df <- df[df$C==max(df$C),]
df[!duplicated(df),]
})
# A B C
#1 0010 11 2014-01-07
#2 0011 12 2013-06-03
#3 0012 11 2013-07-29
#4 0015 11 2014-07-14
#5 0024 12 2012-12-17
#6 0032 12 2013-08-21
#7 0033 11 2014-07-11
#8 0039 11 2013-10-22
#9 0041 11 2014-05-28
#10 0054 11 2014-03-26
Using the fact that dates act like numerics, something like the following might do the trick:
Before$C <- as.Date(Before$C) # Convert to dates
ans <- aggregate(C ~ A + B, max, data = Before) # Aggregate date, choose the last date
ans <- ans[ans$B != "NA", ] # Remove NA in col B
print(ans)
# A B C
#1 0010 11 2014-01-07
#2 0012 11 2013-07-29
#3 0015 11 2014-07-14
#4 0033 11 2014-07-11
#5 0039 11 2013-10-22
#6 0041 11 2014-05-28
#7 0054 11 2014-03-26
#8 0011 12 2013-06-03
#9 0024 12 2012-12-17
#10 0032 12 2013-08-21
The max of type Date will return the most recent one.
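As a minimal illustration of that last point, max() on a Date vector compares chronologically rather than treating the values as text:

```r
# max() on Date picks the latest date, exactly what the aggregate() call
# above relies on.
d <- as.Date(c("2012-10-06", "2013-10-22", "2012-10-06"))
latest <- max(d)
```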