calculation of week number in R - r

I have found many answer regarding the week number of a particular date. What I want is to get a week number for 2 years i.e for first year it will give 1 to 53 weeks and then keep the count from 53 only and should not start with 1 again. Is it possible in R?. Example data is shown below:

We can use rep to add 53 to the vector ('vN2') after finding the number of observations for each year.
vN2 + rep(c(0, 53), tapply(vN2, cumsum(c(TRUE, diff(vN2) < 0)), FUN = length))
data
set.seed(24)
vN <- rep(1:53, sample(1:5, 53, replace=TRUE))
vN1 <- rep(1:53, sample(1:6, 53, replace=TRUE))
vN2 <- c(vN, vN1)

Related

#R - Split Quarterly data into monthly data using R

Please see the sample data below.
I want to convert the quarterly sale data (with a start date and end date) into monthly sale data.
For example:
Data set A-Row 1 will be split into Data set B- Row 1, 2 and 3 for June, July and August separately and the sale will be pro rata based on number of days in that month, all other columns will be the same;
Data set A-Row 2 will pick up what was left in Row 1 (which ends in 5/9/2017) and formed a complete September.
Is there an efficient way to execute this, the actual data is a csv file with 100K x 15 data size, which will be split to approximately 300K x 15 new data set for monthly analysis.
Some key characteristic from sample question data includes:
The start day for the first quarterly sales data is the day that customer joins, so it could be any day;
All sales will be quarterly but in various days between 90, 91, or 92 days, but it is also possible to have imcomplete quarterly sale data as customer leave in the quarter.
Sample Question:
Customer.ID Country Type Sale Start..Date End.Date Days
1 1 US Commercial 91 7/06/2017 5/09/2017 91
2 1 US Commerical 92 6/09/2017 6/12/2017 92
3 2 US Casual 25 10/07/2017 3/08/2017 25
4 3 UK Commercial 64 7/06/2017 9/08/2017 64
Sample Answer:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 24 7/06/2017 30/06/2017 24
2 1 US Commercial 31 1/07/2017 31/07/2017 31
3 1 US Commercial 31 1/08/2017 31/08/2017 31
4 1 US Commercial 30 1/09/2017 30/09/2017 30
5 1 US Commercial 31 1/10/2017 31/10/2017 31
6 1 US Commercial 30 1/11/2017 30/11/2017 30
7 1 US Commercial 6 1/12/2017 6/12/2017 6
8 2 US Casual 22 10/07/2017 31/07/2017 22
9 2 US Casual 3 1/08/2017 3/08/2017 3
10 3 UK Commercial 24 7/06/2017 30/06/2017 24
11 3 UK Commercial 31 1/07/2017 31/07/2017 31
12 3 UK Commercial 9 1/08/2017 9/08/2017 9
I just ran CIAndrews' code. It seems to work for the most part, but it is very slow when run on a dataset with 10,000 rows. I eventually cancelled the execution after a few minutes of waiting. There's also an issue with the number of days: For example, July has 31 days, but the days variable only shows thirty. It's true that 31-1 = 30, but the first day should be counted as well.
The code below only takes about 21 seconds on my 2015 MacBook Pro (not including data generation), and takes care of the other problem, too.
library(tidyverse)
library(lubridate)
# generate data -------------------------------------------------------------
set.seed(666)
# assign variables
customer <- sample.int(n = 2000, size = 10000, replace = T)
country <- sample(c("US", "UK", "DE", "FR", "IS"), 10000, replace = T)
type <- sample(c("commercial", "casual", "other"), 10000, replace = T)
start <- sample(seq(dmy("7/06/2011"), today(), by = "day"), 10000, replace = T)
days <- sample(85:105, 10000, replace = T)
end <- start + days
sale <- sample(500:3000, 10000, replace = T)
# generate dataframe of artificial data
df_quarterly <- tibble(customer, country, type, sale, start, end, days)
# split quarters into months ----------------------------------------------
# initialize empty list with length == nrow(dataframe)
list_date_dfs <- vector(mode = "list", length = nrow(df_quarterly))
# for-loop generates new dates and adds as dataframe to list
for (i in 1:length(list_date_dfs)) {
# transfer dataframe row to variable `row`
row <- df_quarterly[i,]
# correct end date so split successful when interval doesn't cover full month
end_corr <- row$end + day(row$start) - day(row$end)
# use lubridate to compute first and last days of relevant months
m_start <- seq(row$start, end_corr, by = "month") %>%
floor_date(unit = "month")
m_end <- m_start + days_in_month(m_start) - 1
# replace first and last elements with original dates
m_start[1] <- row$start
m_end[length(m_end)] <- row$end
# compute the number of days per month as well as sales per month
# correct difference by adding 1
m_days <- as.integer(m_end - m_start) + 1
m_sale <- (row$sale / sum(m_days)) * m_days
# add tibble to list
list_date_dfs[[i]] <- tibble(customer = row$customer,
country = row$country,
type = row$type,
sale = m_sale,
start = m_start,
end = m_end,
days = m_days
)
}
# bind dataframe list elements into single dataframe
df_monthly <- bind_rows(list_date_dfs)
It's not pretty as it uses multiple functions and loops, since it consists out of multiple operations:
# Creating the dataset
library(tidyr)
customer <- c(1,1,2,3)
country <- c("US","US","US","UK")
type <- c("Commercial","Commercial","Casual","Commercial")
sale <- c(91,92,25,64)
Start <- as.Date(c("7/06/2017","6/09/2017","10/07/2017","7/06/2017"),"%d/%m/%Y")
Finish <- as.Date(c("5/09/2017","6/12/2017","3/08/2017","9/08/2017"),"%d/%m/%Y")
days <- c(91,92,25,64)
df <- data.frame(customer,country, type,sale, Start,Finish,days)
# Function to split per month
library(zoo)
addrowFun <- function(y){
temp <- do.call("rbind", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
CiAndrews,
Thanks for the help and patience. I have managed to get the answer with small change. I have replace the "rbind" with "rbind.fill" from "plyr" package and everything runs smoothly after that.
Please see the head of sample2.csv below
customer country type sale Start Finish days
1 43108181108 US Commercial 3330 17/11/2016 24/02/2017 99
2 43108181108 US Commercial 2753 24/02/2017 23/05/2017 88
3 43108181108 US Commercial 3043 13/02/2018 18/05/2018 94
4 43108181108 US Commercial 4261 23/05/2017 18/08/2017 87
5 43103703637 UK Casual 881 4/11/2016 15/02/2017 103
6 43103703637 UK Casual 1172 26/07/2018 1/11/2018 98
Please see the codes below:
library(tidyr)
#read data and change the start and finish to data type
data <- read.csv("Sample2.csv")
data$Start <- as.Date(data$Start, "%d/%m/%Y")
data$Finish <- as.Date(data$Finish, "%d/%m/%Y")
customer <- data$customer
country <- data$country
days <- data$days
Finish <- data$Finish
Start <- data$Start
sale <- data$sale
type <- data$type
df <- data.frame(customer, country, type, sale, Start, Finish, days)
# Function to split per month
library(zoo)
library(plyr)
addrowFun <- function(y){
temp <- do.call("rbind.fill", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop

Subtracting samples in R

In this code, I have selected a random sample of 30 and from this random sample of 30 taken a sample of 15.
I am stuck on how to subtract the sample of 15 that I took from the sample of 30. I.E subtract s1985 from b1985
Can someone help me please?
My code is below
function(df, n) df[sample(nrow(df), n), , drop = FALSE]
sample.df(subset(df, YEAR == "1985"), 30)
b1985 <-sample.df(subset(df, YEAR == "1985"), 30)
s1985 <-sample.df(subset(b1985), 15)
sample.df(subset(df, YEAR == "1986"), 30)
b1986 <-sample.df(subset(df, YEAR == "1986"), 30)
s1986 <-sample.df(subset(b1986), 15)
Hi all I have figured out how to do the subtraction.
Rename the rows so they are not the random rows drawn
row.names(b1985) <- 1:nrow(b1985)
row.names(s1985) <- 1:nrow(b1985)
row.names(b1986) <- 1:nrow(b1986)
row.names(s1986) <- 1:nrow(s1986)
Subtract by column number (in this case I want to add column 5)
b1985[1:30,5]-s1985[1:15,5]
b1986[1:30,5]-s1986[1:15,5]

R Programming 30 day Months

I'm currently writing a script in the R Programming Language and I've hit a snag.
I have time series data organized in a way where there are 30 days in each month for 12 months in 1 year. However, I need the data organized in a proper 365 days in a year calendar, as in 30 days in a month, 31 days in a month, etc.
Is there a simple way for R to recognize there are 30 days in a month and to operate within that parameter? At the moment I have my script converting the number of days from the source in UNIX time and it counts up.
For example:
startingdate <- "20060101"
endingdate <- "20121230"
date <- seq(from = as.Date(startingdate, "%Y%m%d"), to = as.Date(endingdate, "%Y%m%d"), by = "days")
This would generate an array of dates with each month having 29 days/30 days/31 days etc. However, my data is currently organized as 30 days per month, regardless of 29 days or 31 days present.
Thanks.
The first 4 solutions are basically variations of the same theme using expand.grid. (3) uses magrittr and the others use no packages. The last two work by creating long sequence of numbers and then picking out the ones that have month and day in range.
1) apply This gives a series of yyyymmdd numbers such that there are 30 days in each month. Note that the line defining yrs in this case is the same as yrs <- 2006:2012 so if the years are handy we could shorten that line. Omit as.numeric in the line defining s if you want character string output instead. Also, s and d are the same because we have whole years so we could omit the line defining d and use s as the answer in this case and also in general if we are always dealing with whole years.
startingdate <- "20060101"
endingdate <- "20121230"
yrs <- seq(as.numeric(substr(startingdate, 1, 4)), as.numeric(substr(endingdate, 1, 4)))
g <- expand.grid(yrs, sprintf("%02d", 1:12), sprintf("%02d", 1:30))
s <- sort(as.numeric(apply(g, 1, paste, collapse = "")))
d <- s[ s >= startingdate & s <= endingdate ] # optional if whole years
Run some checks.
head(d)
## [1] 20060101 20060102 20060103 20060104 20060105 20060106
tail(d)
## 20121225 20121226 20121227 20121228 20121229 20121230
length(d) == length(2006:2012) * 12 * 30
## [1] TRUE
2) no apply An alternative variation would be this. In this and the following solutions we are using yrs as calculated in (1) so we omit it to avoid redundancy. Also, in this and the following solutions, the corresponding line to the one setting d is omitted, again, to avoid redundancy -- if you don't have whole years then add the line defining d in (1) replacing s in that line with s2.
g2 <- expand.grid(yr = yrs, mon = sprintf("%02d", 1:12), day = sprintf("%02d", 1:30))
s2 <- with(g2, sort(as.numeric(paste0(yr, mon, day))))
3) magrittr This could also be written using magrittr like this:
library(magrittr)
expand.grid(yr = yrs, mon = sprintf("%02d", 1:12), day = sprintf("%02d", 1:30)) %>%
with(paste0(yr, mon, day)) %>%
as.numeric %>%
sort -> s3
4) do.call Another variation.
g4 <- expand.grid(yrs, 1:12, 1:30)
s4 <- sort(as.numeric(do.call("sprintf", c("%d%02d%02d", g4))))
5) subset sequence Create a sequence of numbers from the starting date to the ending date and if each number is of the form yyyymmdd pick out those for which mm and dd are in range.
seq5 <- seq(as.numeric(startingdate), as.numeric(endingdate))
d5 <- seq5[ seq5 %/% 100 %% 100 %in% 1:12 & seq5 %% 100 %in% 1:30]
6) grep Using seq5 from (5)
d6 <- as.numeric(grep("(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|30)$", seq5, value = TRUE))
Here's an alternative:
date <- unclass(startingdate):unclass(endingdate) %% 30L
month <- rep(1:12, each = 30, length.out = NN <- length(date))
year <- rep(1:(NN %/% 360 + 1), each = 360, length.out = NN)
(of course, we can easily adjust by adding constants to taste if you want a specific day to be 0, or a specific month, etc.)

R: cut by date and grouping by ID with data.table

I have a data.table with with a list of actors uniquely identified by id doing things on a date. There is no limit to number of things done by an actor on a particular date.
require(data.table)
set.seed(28100)
df.in <- data.table(id = sample(1:10, 100, replace=TRUE),
date = sample(2001:2012, 100, replace=TRUE))
Now I want to summarise my dataset finding the number of occurrences for each of the intervals of the following sequence
sequence <- seq(2000, 2012, 4)
df.out1 <- as.data.frame(table(cut(df.in$date, breaks = sequence)))
df.out1
# Var1 Freq
# 1 (2000,2004] 35
# 2 (2004,2008] 27
# 3 (2008,2012] 38
All good. But now instead of counting the occurrences I would like to count the number of actors active in each interval, that is with one or more occurrences.
Do you mean something like this?
df.in[, interv := cut(date, sequence)][, .(Actors = length(unique(id))), by = interv]
# interv Actors
#1: (2000,2004] 10
#2: (2008,2012] 9
#3: (2004,2008] 10
In case you are using the development version 1.9.5 from GitHub you could use uniqueN() instead of length(unique()).

Calculate number of days between two dates in r

I need to calculate the number of days elapsed between multiple dates in two ways and then output those results to new columns: i) number of days that has elapsed as compared to the first date (e.g., RESULTS$FIRST) and ii) between sequential dates (e.g., RESULTS$BETWEEN). Here is an example with the desired results. Thanks in advance.
library(lubridate)
DATA = data.frame(DATE = mdy(c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013",
"12/16/2013", "12/16/2015")))
RESULTS = data.frame(DATE = mdy(c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013",
"12/16/2013", "12/16/2015")),
FIRST = c(0, 24, 53, 107, 161, 891), BETWEEN = c(0, 24, 29, 54, 54, 730))
#Using dplyr package
library(dplyr)
df1 %>% # your dataframe
mutate(BETWEEN0=as.numeric(difftime(DATE,lag(DATE,1))),BETWEEN=ifelse(is.na(BETWEEN0),0,BETWEEN0),FIRST=cumsum(as.numeric(BETWEEN)))%>%
select(-BETWEEN0)
DATE BETWEEN FIRST
1 2013-07-08 0 0
2 2013-08-01 24 24
3 2013-08-30 29 53
4 2013-10-23 54 107
5 2013-12-16 54 161
6 2015-12-16 730 891
This will get you what you want:
d <- as.Date(DATA$DATE, format="%m/%d/%Y")
first <- c()
for (i in seq_along(d))
first[i] <- d[i] - d[1]
between <- c(0, diff(d))
This uses the as.Date() function in the base package to cast the vector of string dates to date values using the given format. Since you have dates as month/day/year, you specify format="%m/%d/%Y" to make sure it's interpreted correctly.
diff() is the lagged difference. Since it's lagged, it doesn't include the difference between element 1 and itself, so you can concatenate a 0.
Differences between Date objects are given in days by default.
Then constructing the output dataframe is simple:
RESULTS <- data.frame(DATE=DATA$DATE, FIRST=first, BETWEEN=between)
For the first part:
DATA = data.frame((c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013","12/16/2013", "12/16/2015")))
names(DATA)[1] = "V1"
date = as.Date(DATA$V1, format="%m/%d/%Y")
print(date-date[1])
Result:
[1] 0 24 53 107 161 891
For second part - simply use a for loop
You can just add each column with the simple difftime and lagged diff calculations.
DATA$FIRST <- c(0,
with(DATA,
difftime(DATE[2:length(DATE)],DATE[1], unit="days")
)
)
DATA$BETWEEN <- c(0,
with(DATA,
diff(DATE[1:(length(DATE) - 1)], unit="days")
)
)
identical(DATA, RESULTS)
[1] TRUE

Resources