This probably seems straightforward, but I am pretty stumped.
I have a set of dates ~ August 1 of each year and need to sum sales by week number. The earliest date is 2008-12-08 (YYYY-MM-DD). I need to create a "week_id" field where week #1 begins on 2008-12-08. And the date 2011-09-03 is week 142. Note that this is different since the calculation of week number does not reset every year.
I am putting up a small example dataset here:
data <- data.frame(
dates = c("2008-12-08", "2009-08-10", "2010-03-31", "2011-10-16", "2008-06-03", "2009-11-14" , "2010-05-05", "2011-09-03"))
data$date = as.Date(data$date)
Any help is appreciated
data$week_id = as.numeric(data$date - as.Date("2008-12-08")) %/% 7 + 1
This would take the day difference between the two dates and find the integer number of 7 days elapsed. I add one since we want the dates where zero weeks have elapsed since the start to be week 1 instead of week 0.
dates date week_id
1 2008-12-07 2008-12-07 0 # added for testing
2 2008-12-08 2008-12-08 1
3 2008-12-09 2008-12-09 1 # added for testing
4 2008-12-14 2008-12-14 1 # added for testing
5 2008-12-15 2008-12-15 2 # added for testing
6 2009-08-10 2009-08-10 36
7 2010-03-31 2010-03-31 69
8 2011-10-16 2011-10-16 149
9 2008-06-03 2008-06-03 -26
10 2009-11-14 2009-11-14 49
11 2010-05-05 2010-05-05 74
12 2011-09-03 2011-09-03 143
I have a dataframe df in R:
month abc1 def2 xyz3
201201 1 2 4
201202 2 5 7
201203 4 11 4
201204 6 23 40
I would like to convert each of the columns (of which there are ~50, each with ~100 monthly observations) to a time series format in order to check for seasonality in the data, using the decompose function.
I assumed a for loop using the ts function would be the best way of doing this. I would like to use something along the lines of the loop below, although I realise using a function on the left side of the <- produces an error. Is there a way to dynamically name variables generated by a loop?
for(i in 2:ncol(df)) {
paste(names(df[, i]), "_ts") <- ts(df[ ,i], start = c(2012, 1), end = c(2021,11), frequency = 12)
}
You could try zoo:
test = data.frame(month=c("201201", "201202", "201203", "201204"), abc1=c(1,2,3,4), def2=c(4,6,7,10), xyz3=c(12,15,16,19))
library(zoo)
ZOO =zoo(test[, c("abc1", "def2", "xyz3")], order.by=as.Date(paste0(test$month, "01"), format="%Y%m%d"))
ts(ZOO, frequency=12)
Output:
abc1 def2 xyz3
Jan 1 1 4 12
Feb 1 2 6 15
Mar 1 3 7 16
Apr 1 4 10 19
attr(,"index")
[1] 2012-01-01 2012-02-01 2012-03-01 2012-04-01
Update:
Now with correct frequency.
So I have some data with a time stamp, and for each row, I want to count the number of rows that fall within a certain time window. For example, if I have the data below with a time stamp in h:mm (column ts), I want to count the number of rows that occur from that time stamp to five minutes in the past (column count). The first n rows that are less than five minutes from the first data point should be NAs.
ts data count
1:01 123 NA
1:02 123 NA
1:03 123 NA
1:04 123 NA
1:06 123 5
1:07 123 5
1:10 123 3
1:11 123 4
1:12 123 4
This is straightforward to do with a for loop, but I've been trying to implement with the apply() family and have not yet found any success. Any suggestions?
EDIT: modified to account for the potential for multiple readings per minute, raised in comment.
Data with new mid-minute reading:
library(dplyr)
df %>%
# Take the text above and convert to datetime
mutate(ts = lubridate::ymd_hms(paste(Sys.Date(), ts))) %>%
# Count how many observations per minute
group_by(ts_min = lubridate::floor_date(ts, "1 minute")) %>%
summarize(obs_per_min = sum(!is.na(data))) %>%
# Add rows for any missing minutes, count as zero observations
padr::pad(interval = "1 min") %>%
replace_na(list(obs_per_min = 0)) %>%
# Count cumulative observations, and calc how many in window that
# begins 5 minutes ago and ends at end of current minute
mutate(cuml_count = cumsum(obs_per_min),
prior_cuml = lag(cuml_count) %>% tidyr::replace_na(0),
in_window = cuml_count - lag(prior_cuml, 5)) %>%
# Exclude unneeded columns and rows
select(-cuml_count, -prior_cuml) %>%
filter(obs_per_min > 0)
Output (now reflects add'l reading at 1:06:30)
# A tibble: 12 x 3
ts_min obs_per_min in_window
<dttm> <dbl> <dbl>
1 2018-09-26 01:01:00 1 NA
2 2018-09-26 01:02:00 1 NA
3 2018-09-26 01:03:00 1 NA
4 2018-09-26 01:04:00 1 NA
5 2018-09-26 01:06:00 2 6
6 2018-09-26 01:07:00 1 6
7 2018-09-26 01:10:00 1 4
8 2018-09-26 01:11:00 1 5
9 2018-09-26 01:12:00 1 4
I am trying to get a count of active clients per month, using data that has a start and end date to each client's episode. The code I am using I can't work out how to count per month, rather than per every n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which current counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, start.month, end.month, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
# Test data
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
I have a set of data regarding amounts of something users with unique IDs used between in a data.frame in r.
ID start date end date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
....
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000
The time series for each ID starts and ends at inconsistent times, and contain an inconsistent number of observations. However, they are all formatted in the above manner; the start and end dates are Date objects.
I would like to standardize the breakdowns for each ID to a monthly time series, with data points at the start of each month, weighing the observed amount numbers which happen to straddle two or more months accordingly.
In other words, I would like to turn this series into something like
ID start date end date amount
1 1-1-2012 2-1-2012 3096 = 6000 * 16/31
1 2-1-2012 3-1-2012 4339 = 6000*15/31+4000*14/39
1 3-1-2012 4-1-2012 etc
....
1 6-1-2012 7-1-2012 etc
2 1-1-2012 2-1-2012 etc
2 2-1-2012 3-1-2012 etc
2 3-1-2012 4-1-2012 etc
2 4-1-2012 5-1-2012 etc
2 5-1-2012 6-1-2012 etc
....
10000 1-1-2012 2-1-2012 etc
....
10000 6-1-2012 7-1-2012 etc
Where the value for ID 1 between 2/1/12 and 3/1/12 is calculated by weighing the number of days in the 1-15-2012 to 2-15-2012 observation that land in February (15 days / 31 days) with the amount in that observation span (6000) with the number of days in the 2-15 to 3-25 observation span that fall in February (14 days/ 39 days, as 2012 was a leap year) times the amount in that observation span (4000), yielding 6000*15/31+4000*14/39 = 4339. This should be done for each ID time series. We do not consider the case where the observation periods all fit into one month; but if they are spread out over more than two months they should be split up over that number of months with the appropriate weighings.
I'm rather new to r and could certainly use some help on this!
Here is using native R:
#The data
df=read.table(text='ID start_date end_date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000',
header=T,row.names = NULL,stringsAsFactors =FALSE)
df[,2]=as.Date(df[,2],"%m-%d-%Y")
df[,3]=as.Date(df[,3],"%m-%d-%Y")
df1=data.frame(n=1:length(df$ID),ID=df$ID)
df1$startm=as.Date(levels(cut(df[,2],"month"))[cut(df[,2],"month")],"%Y-%m-%d")
df1$endm=as.Date(levels(cut(df[,3],"month"))[cut(df[,3],"month")],"%Y-%m-%d")
df1=df1[,-1]
#compute days in month and total days
df$dayin=as.numeric((df1$endm-1)-df$start_date)
df$daytot=as.numeric(df$end_date-df$start_date)
#separate amount this month and next month
df$ammt=df$amount*df$dayin/df$daytot
df$ammt.1=df$amount*(df$daytot-df$dayin)/df$daytot
#using by compute new amount
df1$amount=do.call(c,
by(df[,c("ammt","ammt.1")],df$ID,function(d)d[,1]+c(0,d[-nrow(d),2]))
)
df1
> df1
ID startm endm amount
1 1 2012-01-01 2012-02-01 3096.774
2 1 2012-02-01 2012-03-01 4339.123
3 1 2012-03-01 2012-05-01 4306.038
4 1 2012-05-01 2012-06-01 1535.842
5 2 2012-01-01 2012-02-01 2500.000
6 2 2012-02-01 2012-03-01 4700.000
7 2 2012-03-01 2012-05-01 3754.545
8 10000 2012-01-01 2012-02-01 5302.326
9 10000 2012-02-01 2012-03-01 13572.674
10 10000 2012-03-01 2012-05-01 36553.571
11 10000 2012-05-01 2012-06-01 13000.000
To solve this I think the easiest way is to break it down into two problems.
How can I get a daily breakdown of the figures I'm interested in? This is my assumption based on the information you provided above.
How do I group by a date range and summarise to what I'm interested in?
For the following example, I will use the data set which I created using the code below:
df <- data.frame(
id=c(1,1,1,1,2,2,2),
start_date=as.Date(c("1-15-2012",
"2-15-2012",
"3-25-2012",
"5-26-2012",
"1-16-2012",
"2-27-2012",
"3-18-2012"), "%m-%d-%Y"),
end_date=as.Date(c("2-15-2012",
"3-25-2012",
"5-26-2012",
"6-13-2012",
"2-27-2012",
"3-18-2012",
"5-23-2012"), "%m-%d-%Y"),
amount=c(6000,
4000,
3000,
1000,
7000,
2000,
3000)
)
1. Provide daily figures
To provide the daily figures, firstly we get the daily contribution:
df$daily_contribution = df$amount/as.numeric(df$end_date - df$start_date)
Then, we will expand the date range using the start and end dates. There are a couple ways which you can do it, but seeing that you apply the dplyr tag, using the dplyr way we have:
library(dplyr)
df <- df %>%
rowwise() %>%
do(data.frame(id=.$id,
date=as.Date(seq(from=.$start_date, to=(.$end_date), by="day")),
daily_contribution=.$daily_contribution))
which has some output which looks like this:
Source: local data frame [285 x 3]
Groups: <by row>
id date daily_contribution
1 1 2012-01-15 193.5484
2 1 2012-01-16 193.5484
3 1 2012-01-17 193.5484
4 1 2012-01-18 193.5484
5 1 2012-01-19 193.5484
6 1 2012-01-20 193.5484
7 1 2012-01-21 193.5484
8 1 2012-01-22 193.5484
9 1 2012-01-23 193.5484
10 1 2012-01-24 193.5484
.. .. ... ...
2. Create a grouping variable
Next we create some kind of grouping variable that we're interested in. I've used lubridate for ease to get the month and year of the dates:
library(lubridate)
df$mnth=month(df$date)
df$yr=year(df$date)
Now with all of this we can easily use dplyr to summarise our information by the dates as required.
df %>%
group_by(id, mnth, yr) %>%
summarise(amount=sum(daily_contribution))
with output:
Source: local data frame [11 x 4]
Groups: id, mnth
id mnth yr amount
1 1 1 2012 3290.3226
2 1 2 2012 4441.6873
3 1 3 2012 2902.8122
4 1 4 2012 1451.6129
5 1 5 2012 1591.3978
6 1 6 2012 722.2222
7 2 1 2012 2666.6667
8 2 2 2012 4800.0000
9 2 3 2012 2436.3636
10 2 4 2012 1363.6364
11 2 5 2012 1045.4545
To get it precisely in the format you specified:
df %>% rowwise() %>%
mutate(start_date=as.Date(ISOdate(yr, mnth, 1)),
end_date=as.Date(ISOdate(yr, mnth+1, 1))) %>%
select(id, start_date, end_date, amount)
with output:
Source: local data frame [11 x 4]
Groups: <by row>
id start_date end_date amount
1 1 2012-01-01 2012-02-01 3290.3226
2 1 2012-02-01 2012-03-01 4441.6873
3 1 2012-03-01 2012-04-01 2902.8122
4 1 2012-04-01 2012-05-01 1451.6129
5 1 2012-05-01 2012-06-01 1591.3978
6 1 2012-06-01 2012-07-01 722.2222
7 2 2012-01-01 2012-02-01 2666.6667
8 2 2012-02-01 2012-03-01 4800.0000
9 2 2012-03-01 2012-04-01 2436.3636
10 2 2012-04-01 2012-05-01 1363.6364
11 2 2012-05-01 2012-06-01 1045.4545
as needed.
note: I can see from your example, that you have, 3096 = 6000 * 16/31 and 4339 = 6000*15/31+4000*14/39, but for the first one, as an example, you have 15 of Jan to 31 of Jan which is 17 days if the date range is inclusive. You can trivially alter this information if required.
Here's a solution using plyr and reshape. The numbers aren't the same as what you provided, so I may have misunderstood your intent though this seems to meet your stated goal (weighted average of amount by month).
df$index <- 1:nrow(df) #Create a unique index number
#Format the dates from factors to dates
df$start.date <- as.Date(df$start.date, format="%m/%d/%Y")
df$end.date <- as.Date(df$end.date, format="%m/%d/%Y")
library(plyr); library(reshape) #Load the libraries
#dlaply = (d)ataframe to (l)ist using (ply)r
#Subset on dataframe by "index" and perform a function on each subset called "X"
#Create a list containing:
# ID, each day from start to end date, amount recorded over that day
df2 <- dlply(df, .(index), function(X) {
ID <- X$ID #Keep the ID value
n.days <- as.numeric(difftime( X$end.date, X$start.date )) #Calculate time difference in days, report the result as a number
day <- seq(X$start.date, X$end.date, by="days") #Sequence of days
amount.per.day <- X$amount/n.days #Amount for that day
data.frame(ID, day, amount.per.day) #Last line is the output
})
#Change list back into data.frame
df3 <- ldply(df2, data.frame) #ldply = (l)ist to (d)ataframe using (ply)r
df3$mon <- as.numeric(format(df3$day, "%m")) #Assign a month to all dates
#Summarize by each ID and month: add up the daily amounts
ddply(df3, .(ID, mon), summarise, amount = sum(amount.per.day))
# ID mon amount
# 1 1 1 3290.3226
# 2 1 2 4441.6873
# 3 1 3 2902.8122
# 4 1 4 1451.6129
# 5 1 5 1591.3978
# 6 1 6 722.2222
# 7 2 1 2666.6667
# 8 2 2 4800.0000
# 9 2 3 2436.3636
# 10 2 4 1363.6364
# 11 2 5 1045.4545
Incidentally, for future posts, you can get faster answers if you provide the code to replicate your data. If your code is somewhat complicated, you can use dput(yourdata).
HTH!