Adding a seasons column to data table based on month dates - r

I'm using data.table and I am trying to make a new column, called "season", holding the corresponding season (e.g. summer, winter) based on a column called "MonthName".
I'm wondering whether there is a more efficient way to add a season column to a data table based on month values.
These are the first 6 of 300,000 observations; assume that the table is called "dt".
rrp Year Month Finyear hourminute AvgPriceByTOD MonthName
1: 35.27500 1999 1 1999 00:00 33.09037 Jan
2: 21.01167 1999 1 1999 00:00 33.09037 Jan
3: 25.28667 1999 2 1999 00:00 33.09037 Feb
4: 18.42334 1999 2 1999 00:00 33.09037 Feb
5: 16.67499 1999 2 1999 00:00 33.09037 Feb
6: 18.90001 1999 2 1999 00:00 33.09037 Feb
I have tried the following code:
dt[, Season := ifelse(MonthName == c("Jun", "Jul", "Aug"),"Winter", ifelse(MonthName == c("Dec", "Jan", "Feb"), "Summer", ifelse(MonthName == c("Sep", "Oct", "Nov"), "Spring" , ifelse(MonthName == c("Mar", "Apr", "May"), "Autumn", NA))))]
Which returns:
rrp Year Month Finyear hourminute AvgPriceByTOD MonthName Season
1: 35.27500 1999 1 1999 00:00 33.09037 Jan NA
2: 21.01167 1999 1 1999 00:00 33.09037 Jan Summer
3: 25.28667 1999 2 1999 00:00 33.09037 Feb Summer
4: 18.42334 1999 2 1999 00:00 33.09037 Feb NA
5: 16.67499 1999 2 1999 00:00 33.09037 Feb NA
6: 18.90001 1999 2 1999 00:00 33.09037 Feb Summer
I also get these warnings:
Warning messages:
1: In MonthName == c("Jun", "Jul", "Aug") :
longer object length is not a multiple of shorter object length
2: In MonthName == c("Dec", "Jan", "Feb") :
longer object length is not a multiple of shorter object length
3: In MonthName == c("Sep", "Oct", "Nov") :
longer object length is not a multiple of shorter object length
4: In MonthName == c("Mar", "Apr", "May") :
longer object length is not a multiple of shorter object length
Alongside this, for reasons that I don't know, some of the summer months are correctly assigned "Summer", but others are assigned NA, e.g. rows 1 and 2 should both be summer, but return different values.
Thanks in advance!

One pretty straightforward way is to use a lookup table to map month names to seasons:
# create a named vector where names are the month names and elements are seasons
seasons <- rep(c("winter","spring","summer","fall"), each = 3)
names(seasons) <- month.abb[c(6:12,1:5)] # thanks thelatemail for pointing out month.abb
seasons
# Jun Jul Aug Sep Oct Nov Dec Jan
#"winter" "winter" "winter" "spring" "spring" "spring" "summer" "summer"
# Feb Mar Apr May
#"summer" "fall" "fall" "fall"
Use it:
dt[, season := seasons[MonthName]]
data:
dt <- setDT(read.table(text=" rrp Year Month Finyear hourminute AvgPriceByTOD MonthName
1: 35.27500 1999 1 1999 00:00 33.09037 Jan
2: 21.01167 1999 1 1999 00:00 33.09037 Jan
3: 25.28667 1999 2 1999 00:00 33.09037 Feb
4: 18.42334 1999 2 1999 00:00 33.09037 Feb
5: 16.67499 1999 2 1999 00:00 33.09037 Feb
6: 18.90001 1999 2 1999 00:00 33.09037 Feb",
header = TRUE, stringsAsFactors = FALSE))
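The same lookup idea can also be applied to the numeric Month column instead of MonthName. The following is a minimal sketch, assuming the Southern Hemisphere seasons from the question; the vector name season_by_month and the toy Month values are made up for illustration.

```r
# Season for month numbers 1..12 (Southern Hemisphere, as in the question).
# Position i of this vector holds the season of month i.
season_by_month <- c("summer", "summer", "fall", "fall", "fall",
                     "winter", "winter", "winter",
                     "spring", "spring", "spring", "summer")

# Toy stand-in for dt$Month (hypothetical values)
Month <- c(1, 1, 2, 2, 6, 9)

# Indexing by month number returns one season per row
season <- season_by_month[Month]
season
```

With a data.table, the equivalent would presumably be dt[, season := season_by_month[Month]].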

A bit of typing, but the code is efficient
dt[MonthName %in% c("Jun","Jul","Aug"), Season := "Winter"]
dt[MonthName %in% c("Dec","Jan","Feb"), Season := "Summer"]
dt[MonthName %in% c("Sep","Oct","Nov"), Season := "Spring"]
dt[is.na(Season), Season := "Autumn"]
Here we are assigning by reference on a subset of the data.table.
I prefer this to a lot of nested ifelse calls.
If you want to check if a value is in a vector, you have to use %in%. See the different behaviour of:
myVec <- c("a","b","c")
"a" == myVec
[1] TRUE FALSE FALSE
"a" %in% myVec
[1] TRUE
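This also explains the pattern of NA values in the question. A small sketch, using a toy vector in place of dt$MonthName: `==` recycles the three month names along the column and compares position by position, so whether a row matches depends on its row number rather than on its month.

```r
# Toy stand-in for dt$MonthName (first six rows of the question's data)
MonthName <- c("Jan", "Jan", "Feb", "Feb", "Feb", "Feb")
summer <- c("Dec", "Jan", "Feb")

# `==` recycles `summer`: row 1 is compared with "Dec", row 2 with "Jan",
# row 3 with "Feb", row 4 with "Dec" again, and so on.
by_position <- MonthName == summer
by_position

# `%in%` tests membership, so every summer month matches.
by_membership <- MonthName %in% summer
by_membership
```

The rows where by_position is FALSE are the rows that fell through every nested ifelse and ended up NA.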


How to calculate the duration of employment spells

I have monthly data on people's employment status for 1 year. The dataframe includes 4 variables: ID of a person, country, month, and main activity in that specific month (Employed, Unemployed, Inactive, Other). Here is an example:
ID <- c(1:10, 1:10)
country <- c("AT", "BE", "CH", "CZ", "HR", "SO", "SV", "RU", "GR", "GE", "AT", "BE", "CH", "CZ", "HR",
"SO", "SV", "RU", "GR", "GE")
month <- c("Jan", "Feb", "Mar", "Apr", "May", "Aug", "Dec", "Nov", "Sep", "Jan", "Jun", "Jul", "Oct",
"Jan", "Feb", "Mar", "Apr", "May", "Aug", "Dec")
act <- c("Unemployed", "Employed", "Other", "Other", "Inactive", "Unemployed", "Employed",
"Employed", "Employed", "Unemployed", "Other", "Unemployed", "Unemployed", "Unemployed",
"Other", "Other", "Employed", "Other", "Other", "NA")
df <- data.frame(ID, country, month, act)
df[order(ID),]
ID country month act
1 1 AT Jan Unemployed
11 1 AT Jun Other
21 1 AT Nov Unemployed
2 2 BE Feb Employed
12 2 BE Jul Unemployed
22 2 BE Sep Unemployed
3 3 CH Mar Other
13 3 CH Oct Unemployed
23 3 CH Jan NA
4 4 CZ Apr Other
14 4 CZ Jan Unemployed
24 4 CZ Jun Unemployed
5 5 HR May Inactive
15 5 HR Feb Other
25 5 HR Jul Other
6 6 SO Aug Unemployed
16 6 SO Mar Other
26 6 SO Oct Employed
7 7 SV Dec Employed
17 7 SV Apr Employed
27 7 SV Nov Employed
8 8 RU Nov Employed
18 8 RU May Other
28 8 RU Jan NA
9 9 GR Sep Employed
19 9 GR Aug Other
29 9 GR Jun Inactive
10 10 GE Jan Unemployed
20 10 GE Dec NA
30 10 GE Aug Unemployed
My goal is to create a new dataframe in which every row represents one spell of employment, with the condition that the spell is both preceded and followed by a spell of unemployment. That way I include only spells where people switched from unemployment to employment and back to unemployment, and I can calculate the duration of these spells. Ideally the result would have 5 variables: PersID, Country, duration of spell, starting month, end month. It should look like this:
ID country spell_duration starting ending
1 1 AT 5 Jan May
11 1 AT 5 Jun Oct
2 2 BE 7 Feb Aug
12 2 BE 6 Jul Dec
3 3 CH 10 Mar Dec
13 3 CH 1 Oct Oct
4 4 CZ 8 Apr Nov
14 4 CZ 5 Jan May
5 5 HR 5 May Sep
15 5 HR 4 Feb May
6 6 SO 2 Aug Sep
16 6 SO 6 Mar Aug
7 7 SV 1 Dec Dec
17 7 SV 9 Apr Dec
8 8 RU 8 Nov Dec
18 8 RU 7 May Nov
9 9 GR 3 Sep Nov
19 9 GR 2 Aug Sep
10 10 GE 8 Jan Aug
20 10 GE 1 Dec Dec
I already found this solution by Maria (How to calculate number and duration of categorical spells by ID in R), but her problem is different. I don't want the overall duration of employment, and I also don't really need the number of spells.
I used the data.table package for manipulation, and the loops seem to work.
EDIT: One extra "}" left and I edited it. I've tried it and it works.
EDIT2: I added "setDT(df)" too.
library(data.table)
df <- fread(paste("ID country month act
1 AT Jan Unemployed
1 AT Jun Other
1 AT Nov Unemployed
2 BE Feb Employed
2 BE Jul Unemployed
2 BE Sep Unemployed
3 CH Mar Other
3 CH Oct Unemployed
3 CH Jan NA
4 CZ Apr Other
4 CZ Jan Unemployed
4 CZ Jun Unemployed
5 HR May Inactive
5 HR Feb Other
5 HR Jul Other
6 SO Aug Unemployed
6 SO Mar Other
6 SO Oct Employed
7 SV Dec Employed
7 SV Apr Employed
7 SV Nov Employed
8 RU Nov Employed
8 RU May Other
8 RU Jan NA
9 GR Sep Employed
9 GR Aug Other
9 GR Jun Inactive
10 GE Jan Unemployed
10 GE Dec NA
10 GE Aug Unemployed", collapse = '\n'))
setDT(df)
df[, monthInt := match(month, month.abb)]
df <- df[order(ID,monthInt)]
finalDt <- data.table()
for (i in unique(df[, ID])) {
  tempT <- df[ID == i]
  for (tim in 1:(nrow(tempT)-1)) {
    timT <- data.table(ID = tempT[tim,ID],
                       country = tempT[tim, country],
                       spell_duration = tempT[tim+1, monthInt] - tempT[tim, monthInt],
                       starting = month.abb[tempT[tim, monthInt]],
                       ending = month.abb[tempT[tim+1, monthInt]-1])
    finalDt <- rbind(finalDt,timT)
  }
}
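The nested loop above computes, for each person, the gap between consecutive observations. A loop-free sketch of the same idea in base R might use ave() to look up each row's next observed month within its ID. The toy data below is a small subset chosen for illustration, not the full example.

```r
# Toy subset: two people, three observations each (illustrative only)
df <- data.frame(ID    = c(1, 1, 1, 2, 2, 2),
                 month = c("Jan", "Jun", "Nov", "Feb", "Jul", "Sep"))

# Month name -> month number, then sort within each person
df$monthInt <- match(df$month, month.abb)
df <- df[order(df$ID, df$monthInt), ]

# For every row, the month number of the next row of the same ID
# (NA for the last observation of each person)
next_month <- ave(df$monthInt, df$ID, FUN = function(x) c(x[-1], NA))

# Gap to the next observation, as in the loop's spell_duration
df$spell_duration <- next_month - df$monthInt
df
```

Unlike the loop, this keeps the last row of each ID with an NA duration; dropping those rows would reproduce the loop's row count.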
Without much thought, the first thing that came to my mind. Very cumbersome, though. I'm sure there are more elegant solutions to this, but this doesn't require any additional packages.
data <- df
Empl_spells <- data.frame(ID = c(), Start = c(), End = c())
for(user in unique(data$ID)){
  # subset per user
  user_dat <- data[data$ID == user,]
  # initiate a list to store where changes occur and a counter for
  # entries to this list
  if(nrow(user_dat) > 2){
    Changes_data <- list()
    entry <- 1
    # for every row, check if it switches from employed to unemployed
    # or the opposite. Mark with "Break" if some other entry interrupts
    for(i in 2:nrow(user_dat)){
      if(user_dat$act[i] == "Employed" &
         user_dat$act[i-1] == "Unemployed"){
        Changes_data[[entry]] <- c("Start", i)
        entry <- entry + 1
      }else if(user_dat$act[i] == "Unemployed" &
               user_dat$act[i-1] == "Employed"){
        Changes_data[[entry]] <- c("End", i)
        entry <- entry + 1
      }else if(user_dat$act[i] != "Employed" &
               user_dat$act[i] != "Unemployed"){
        Changes_data[[entry]] <- c("Break", i)
        entry <- entry + 1
      }
    }
    # see where an "End" immediately follows a "Start" in the new list
    Changes_df <- do.call(rbind.data.frame, Changes_data)
    EmplToUnempl <- which(Changes_df[-nrow(Changes_df), 1] == "Start" & Changes_df[-1, 1] == "End")
    if(length(EmplToUnempl) >= 1){
      append <- data.frame(ID = user,
                           Start = user_dat$month[as.numeric(Changes_df[EmplToUnempl, 2])],
                           End = user_dat$month[as.numeric(Changes_df[EmplToUnempl + 1, 2])-1])
      # append the data to the data.frame for all of the people
      Empl_spells <- rbind(Empl_spells, append)
    }
  }
}
Since I don't have your data, I didn't test this. Is this what you want?
Edit (vectorize; probably makes it faster):
data <- df
users <- unique(data$ID)
calculate <- function(user){
  # subset per user
  user_dat <- data[data$ID == user,]
  # initiate a list to store where changes occur and a counter for
  # entries to this list
  if(nrow(user_dat) > 2){
    Changes_data <- list()
    entry <- 1
    # for every row, check if it switches from employed to unemployed
    # or the opposite. Mark with "Break" if some other entry interrupts
    for(i in 2:nrow(user_dat)){
      if(user_dat$act[i] == "Employed" &
         user_dat$act[i-1] == "Unemployed"){
        Changes_data[[entry]] <- c("Start", i)
        entry <- entry + 1
      }else if(user_dat$act[i] == "Unemployed" &
               user_dat$act[i-1] == "Employed"){
        Changes_data[[entry]] <- c("End", i)
        entry <- entry + 1
      }else if(user_dat$act[i] != "Employed" &
               user_dat$act[i] != "Unemployed"){
        Changes_data[[entry]] <- c("Break", i)
        entry <- entry + 1
      }
    }
    # see where an "End" immediately follows a "Start" in the new list
    Changes_df <- do.call(rbind.data.frame, Changes_data)
    EmplToUnempl <- which(Changes_df[-nrow(Changes_df), 1] == "Start" & Changes_df[-1, 1] == "End")
    if(length(EmplToUnempl) >= 1){
      append <- data.frame(ID = user,
                           Start = user_dat$month[as.numeric(Changes_df[EmplToUnempl, 2])],
                           End = user_dat$month[as.numeric(Changes_df[EmplToUnempl + 1, 2])-1])
      # return this person's spells; they are combined after lapply
      return(append)
    }
  }
}
empl_spells <- lapply(users, FUN = calculate)
Empl_spells <- do.call(rbind.data.frame, empl_spells)
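As an alternative to checking row by row, run-length encoding with base R's rle() can find employment spells that are bounded by unemployment on both sides. This is only a sketch under a strong assumption: one status per calendar month, in order, for a single person. The act vector below is made up for illustration and does not come from the question's data.

```r
# Hypothetical statuses for one person, Jan..Dec, one entry per month
act <- c("Unemployed", "Employed", "Employed", "Employed", "Unemployed",
         "Other", "Employed", "Employed", "Unemployed", "Unemployed",
         "Employed", "Other")

# Collapse consecutive identical statuses into runs
runs <- rle(act)
n <- length(runs$values)

# Status of the run before and after each run (NA at the edges)
prev_val <- c(NA, runs$values[-n])
next_val <- c(runs$values[-1], NA)

# Keep "Employed" runs with "Unemployed" directly before and after
keep <- which(runs$values == "Employed" &
              prev_val == "Unemployed" &
              next_val == "Unemployed")

# First and last month index of every run
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1L

spells <- data.frame(starting = month.abb[starts[keep]],
                     ending = month.abb[ends[keep]],
                     spell_duration = runs$lengths[keep])
spells
```

Applying this per person (for example with lapply over unique IDs, as in the edit above) would give one row per qualifying spell.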
Edit #2 (calculate duration):
MonthToNumeric <- function(x){
which(c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec") == x)
}
calcDuration <- function(Start, End){
return(MonthToNumeric(End) - MonthToNumeric(Start) + 1)
}
Empl_spells$Duration <- mapply(FUN = calcDuration, Start = Empl_spells[, 2], End = Empl_spells[, 3])
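The duration step can also be written without mapply, because match() against the built-in month.abb is already vectorized. A short sketch with made-up Start and End vectors:

```r
# Hypothetical start and end months of three spells
Start <- c("Jan", "Mar", "Oct")
End   <- c("May", "Dec", "Oct")

# match() turns month abbreviations into month numbers 1..12;
# +1 makes the duration inclusive of both the start and end month
Duration <- match(End, month.abb) - match(Start, month.abb) + 1L
Duration
```

Like the original helper, this assumes a spell does not wrap around the end of the year.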

How to calculate the average year

I have a 20-year monthly XTS time series
Jan 1990 12.3
Feb 1990 45.6
Mar 1990 78.9
..
Jan 1991 34.5
..
Dec 2009 89.0
I would like to get the average (12-month) year, or
Jan xx
Feb yy
...
Dec kk
where xx is the average of every January, yy of every February, and so on.
I have tried apply.yearly and lapply, but these return a single value, which is the overall 20-year average.
Would you have any suggestions? I appreciate it.
The lubridate package could be useful for you. I would use the functions year() and month() in conjunction with aggregate():
library(xts)
library(lubridate)
#set up some sample data
dates = seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="month")
df = data.frame(rand1 = runif(length(dates)), rand2 = runif(length(dates)))
my_xts = xts(df, dates)
#get the mean by year
aggregate(my_xts$rand1, by=year(index(my_xts)), FUN=mean)
This outputs something like:
2000 0.5947939
2001 0.4968154
2002 0.4941752
2003 0.5291211
2004 0.6631564
To find the mean for each month you can do:
#get the mean by month
aggregate(my_xts$rand1, by=month(index(my_xts)), FUN=mean)
which will output something like
1 0.5560279
2 0.6352220
3 0.3308571
4 0.6709439
5 0.6698147
6 0.7483192
7 0.5147294
8 0.3724472
9 0.3266859
10 0.5331233
11 0.5490693
12 0.4642588
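If adding lubridate is not desirable, the same per-month average can be sketched in base R with format() and tapply(). The data below is deterministic toy data, chosen so the result is easy to check by hand; with an xts object, the dates would presumably come from index() and the values from coredata().

```r
# Three years of monthly toy data: month m has values m, m + 10, m + 20
dates  <- seq(as.Date("2000-01-01"), as.Date("2002-12-01"), by = "month")
values <- rep(1:12, times = 3) + rep(c(0, 10, 20), each = 12)

# Month number of every observation
month_num <- as.integer(format(dates, "%m"))

# Average of all Januaries, all Februaries, and so on
monthly_mean <- tapply(values, month_num, mean)
monthly_mean
```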

R - split data to hydrological quarters

I wish to split my data sets into quarters of the year according to the definition of the hydrological year. According to Wikipedia, "Due to meteorological and geographical factors, the definition of the water years varies". In the USA, the hydrological year is the period between October 1st of one year and September 30th of the next.
I use the definition of the hydrological year for Poland (it starts on November 1st and ends on October 31st).
A sample data set looks as follows:
sampleData <- structure(list(date = structure(c(15946, 15947, 15875, 15910, 15869, 15888, 15823, 16059, 16068, 16067), class = "Date"),`example value` = c(-0.325806595888448, 0.116001346459147, 1.68884381116696, -0.480527505762716, -0.50307381813168,-1.12032214801472, -0.659699514672226, -0.547101497279717, 0.729148872679021,-0.769760735764215)), .Names = c("date", "example value"), row.names = c(NA, -10L), class = "data.frame")
For some reason, the function "cut" in my code complains that "breaks" and "labels" differ in length (but they don't). If I omit the "labels" option in cut (as below), the function works perfectly.
What is wrong with labels?
ToHydroQuarters <- function(df)
{
  result <- df
  yearStart <- as.numeric(format(min(df$date),'%Y'))-1
  #Hydrological year in Poland starts at November 1st
  DateStart <- as.Date(paste(yearStart,"-11-01",sep=""))
  breaks <- seq(from=DateStart, to=max(df$date)+90, by="quarter")
  breakYear <- format(breaks,'%Y')
  #Please, do not create labels in such way.
  #Please note that for November and December we have next hydrological year - since it started at 1st November. So, we need to check month to decide which year we have (?) or use cut function again as mentioned here: http://stackoverflow.com/questions/22073881/hydrological-year-time-series
  labels <- c(paste("Winter",breakYear[1]),
              paste("Spring",breakYear[2]),
              paste("Summer",breakYear[3]),
              paste("Autumn",breakYear[4]),
              paste("Autumn",breakYear[5]))
  ######Here is problem - once I add labels parameter, function complains about different lengths
  result$hydroYear <- cut(df$date, breaks)
  result
}
Firstly, I think it is unwise to hardcode the labels inside the function, since that is impossible to check without some kind of reproducible example. However, I can see what you're trying to achieve.
You claim that your breaks and labels have matching lengths, but the function itself doesn't always work. Even without the labels, the cut function did not process the last portion of the dates.
For example:
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
a <- ToHydroQuarters(df)
tail(a)
returns:
date hydroYear
971 2011-08-29 <NA>
972 2011-08-30 <NA>
973 2011-08-31 <NA>
974 2011-09-01 <NA>
975 2011-09-02 <NA>
976 2011-09-03 <NA>
Doing something like breaks <- seq(from=DateStart, to=max(df$date)+90, by="quarter") does resolve that issue, as it forces a break to actually exist. This might solve the labelling issue that you've had in your function, but it does not make the function "generic".
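On the length complaint itself: cut() needs exactly one label per interval, that is, one fewer label than there are break points. A minimal sketch that derives the labels from the breaks so the two cannot drift apart; the dates, the start date, and the "%Y-%m" label format are illustrative choices, not the asker's.

```r
# Toy dates spanning a few hydrological quarters
dates <- as.Date(c("2013-08-29", "2013-11-15", "2014-01-02"))

# Quarterly break points starting on 1 November; extend past the last
# date so that the final interval is closed by a break
breaks <- seq(as.Date("2012-11-01"), max(dates) + 92, by = "quarter")

# One label per interval: every break except the last one
labels <- format(head(breaks, -1), "%Y-%m")

res <- cut(dates, breaks = breaks, labels = labels, right = FALSE)
res
```

Here each label is simply the first month of its quarter; season names could be substituted as long as the vector keeps the same length.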
Personally on the coding side I think it would be better to convert the month, and year parts separately, because it would be easier to understand. For example, you could use library(lubridate) to easily extract the month and specify the breaks and the labels as you normally would. I was thinking the function could look something like this:
thq <- function(date) {
  mnth <- cut(month(date), breaks=c(1,4,7, 10, 12),
              right=FALSE, include.lowest=TRUE,
              labels=c("Spring", "Summer", "Autumn", "Winter"))
  return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
So then using some dummy data ...
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="month")))
thq <- function(date) {
  mnth <- cut(month(date), breaks=c(1,4,7, 10, 12),
              right=FALSE, include.lowest=TRUE,
              labels=c("Spring", "Summer", "Autumn", "Winter"))
  return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
df$newdate <- thq(df$date)
Which has the following output:
date newdate
1 2009-01-01 Spring 2009
2 2009-02-01 Spring 2009
3 2009-03-01 Spring 2009
4 2009-04-01 Summer 2009
5 2009-05-01 Summer 2009
6 2009-06-01 Summer 2009
7 2009-07-01 Autumn 2009
8 2009-08-01 Autumn 2009
9 2009-09-01 Autumn 2009
10 2009-10-01 Winter 2010
11 2009-11-01 Winter 2010
12 2009-12-01 Winter 2010
13 2010-01-01 Spring 2010
14 2010-02-01 Spring 2010
15 2010-03-01 Spring 2010
16 2010-04-01 Summer 2010
17 2010-05-01 Summer 2010
18 2010-06-01 Summer 2010
19 2010-07-01 Autumn 2010
20 2010-08-01 Autumn 2010
21 2010-09-01 Autumn 2010
22 2010-10-01 Winter 2011
23 2010-11-01 Winter 2011
24 2010-12-01 Winter 2011
25 2011-01-01 Spring 2011
26 2011-02-01 Spring 2011
27 2011-03-01 Spring 2011
28 2011-04-01 Summer 2011
29 2011-05-01 Summer 2011
30 2011-06-01 Summer 2011
31 2011-07-01 Autumn 2011
32 2011-08-01 Autumn 2011
33 2011-09-01 Autumn 2011
You can shift the months using the modulo operator if it is in a weird order...
thq <- function(date) {
  # use the function argument, not the global df
  mnth <- cut(((month(date)+1) %% 12), breaks=c(0, 3, 6, 9, 12),
              right=FALSE, include.lowest=TRUE,
              labels=c("Nov_Jan", "Feb_Apr", "May_Jul", "Aug_Oct")
  )
  # you will need to alter the return statement yourself, because
  # I feel there is enough information for you to do it, rather than
  # me changing it every time you change the question.
  return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
df$new <- thq(df$date)
head(df)
output:
> head(df)
date new
1 2009-01-01 Nov_Jan 2009
2 2009-01-02 Nov_Jan 2009
3 2009-01-03 Nov_Jan 2009
4 2009-01-04 Nov_Jan 2009
5 2009-01-05 Nov_Jan 2009
6 2009-01-06 Nov_Jan 2009
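The return statement left open above could be completed along the following lines. This is a base R sketch, assuming the Polish convention from the question (the hydrological year starts on 1 November and is named after the calendar year in which it ends); the function name hydro_quarter is hypothetical.

```r
hydro_quarter <- function(date) {
  m <- as.integer(format(date, "%m"))
  y <- as.integer(format(date, "%Y"))

  # Shift months so that November becomes 0, December 1, ..., October 11
  shifted <- (m + 1L) %% 12L

  # Integer division by 3 gives the quarter index 0..3
  quarter <- c("Nov_Jan", "Feb_Apr", "May_Jul", "Aug_Oct")[shifted %/% 3L + 1L]

  # November and December already belong to the next hydrological year
  hydro_year <- ifelse(m >= 11L, y + 1L, y)

  paste(quarter, hydro_year)
}

res <- hydro_quarter(as.Date(c("2013-10-31", "2013-11-01", "2014-02-15")))
res
```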

Fill in missing year in ordered list of dates

I have collected some time series data from the web and the timestamp that I got looks like below.
24 Jun
21 Mar
20 Jan
10 Dec
20 Jun
20 Jan
10 Dec
...
The interesting part is that the year is missing in the data, however, all the records are ordered, and you can infer the year from the record and fill in the missing data. So the data after imputing should be like this:
24 Jun 2014
21 Mar 2014
20 Jan 2014
10 Dec 2013
20 Jun 2013
20 Jan 2013
10 Dec 2012
...
Before rolling up my sleeves and writing a for loop with nested logic: is there an easy way that works out of the box in R to impute the missing year?
Thanks a lot for any suggestion!
Here's one idea
## Make data easily reproducible
df <- data.frame(day=c(24, 21, 20, 10, 20, 20, 10),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec"))
## Convert each month-day combo to its corresponding "julian date"
datestring <- paste("2012", match(df[[2]], month.abb), df[[1]], sep = "-")
date <- strptime(datestring, format = "%Y-%m-%d")
julian <- as.integer(strftime(date, format = "%j"))
## Transitions between years occur wherever julian date increases between
## two observations
df$year <- 2014 - cumsum(diff(c(julian[1], julian))>0)
## Check that it worked
df
# day month year
# 1 24 Jun 2014
# 2 21 Mar 2014
# 3 20 Jan 2014
# 4 10 Dec 2013
# 5 20 Jun 2013
# 6 20 Jan 2013
# 7 10 Dec 2012
The OP has requested to complete the years in descending order starting in 2014.
Here is an alternative approach which works without date conversion and fake dates. Furthermore, this approach can be modified to work with fiscal years which start on a different month than January.
# create sample dataset
df <- data.frame(
day = c(24L, 21L, 20L, 10L, 20L, 20L, 21L, 10L, 30L, 10L, 10L, 7L),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Jan", "Dec", "Jan",
"Jan", "Jan", "Jun"))
df$year <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb)) + df$day) > 0))
df
day month year
1 24 Jun 2014
2 21 Mar 2014
3 20 Jan 2014
4 10 Dec 2013
5 20 Jun 2013
6 20 Jan 2013
7 21 Jan 2012
8 10 Dec 2011
9 30 Jan 2011
10 10 Jan 2011
11 10 Jan 2011
12 7 Jun 2010
Completion of fiscal years
Let's assume the business has decided to start its fiscal year on February 1. Thus, January lies in a different fiscal year than February or March of the same calendar year.
To handle fiscal years, we only need to shuffle the factor levels accordingly:
df$fy <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb[c(2:12, 1)])) + df$day) > 0))
df
day month year fy
1 24 Jun 2014 2014
2 21 Mar 2014 2014
3 20 Jan 2014 2013
4 10 Dec 2013 2013
5 20 Jun 2013 2013
6 20 Jan 2013 2012
7 21 Jan 2012 2011
8 10 Dec 2011 2011
9 30 Jan 2011 2010
10 10 Jan 2011 2010
11 10 Jan 2011 2010
12 7 Jun 2010 2010
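Both variants can be folded into one helper. The following is a sketch only: the function name fill_years and its arguments are hypothetical, and it relies on the same assumption as above, namely that the records are ordered from newest to oldest with at least one record per year.

```r
fill_years <- function(day, month, latest_year, first_month = 1L) {
  # Month order of the (fiscal) year, starting at first_month
  lv <- month.abb[c(first_month:12, seq_len(first_month - 1L))]

  # Sortable key: position of the month in the year times 100, plus the day
  key <- 100L * match(month, lv) + as.integer(day)

  # Going back in time, the key normally decreases; every increase
  # means we have crossed into the previous year
  latest_year - cumsum(c(0L, diff(key) > 0))
}

day   <- c(24, 21, 20, 10, 20, 20, 10)
month <- c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec")

calendar_year <- fill_years(day, month, latest_year = 2014)
fiscal_year   <- fill_years(day, month, latest_year = 2014, first_month = 2L)
```

Here first_month = 2L corresponds to the fiscal year starting on February 1 from the example above.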

How can I avoid having to loop through and search through this data frame?

I have a 1 million row data frame that contains monthly water usage data (HCF) for various accounts from 2003-2010:
> head(LeakyAccts)
ACCOUNT Date HCF
1 10114488 Oct 2010 25
2 10114488 Sep 2007 24
3 10114488 Nov 2006 11
4 10114488 Jun 2008 18
5 10114488 Aug 2003 6
6 10114488 Jan 2008 30
Dates are yearmon's. I want to know how much each account used every month compared to the same month in the previous year. So for each row, I'd like to find the difference between the usage in that month (Date) and the usage in the same month the previous year (Date - 1). In other words, I want this:
for(i in 1:nrow(LeakyAccts)) {
  row <- which((LeakyAccts$ACCOUNT == LeakyAccts[i,]$ACCOUNT) & (LeakyAccts$Date == (LeakyAccts[i,]$Date - 1)))
  if (length(row) == 1) { # no previous year for 2003
    LeakyAccts[i,]$Difference <- LeakyAccts[i,]$HCF - LeakyAccts[row,]$HCF
  }
}
Needless to say, this loop takes hours to run and seems very un-R-like. How can I avoid using an ugly for loop and speed up the computation? Is there perhaps a way to do this using an apply function or a data.table?
I've reconfigured your data a little to give a complete example:
library(zoo)
dat <- structure(list(ACCOUNT = c(10114488L, 10114488L, 10114488L, 20114488L, 20114488L, 20114488L), Date = structure(c(2010.75, 2009.75, 2008.75, 2008, 2007, 2006), class = "yearmon"), HCF = c(25L, 24L, 11L, 18L, 6L, 30L)), .Names = c("ACCOUNT", "Date", "HCF"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
Which looks like:
ACCOUNT Date HCF
1 10114488 Oct 2010 25
2 10114488 Oct 2009 24
3 10114488 Oct 2008 11
4 20114488 Jan 2008 18
5 20114488 Jan 2007 6
6 20114488 Jan 2006 30
Since yearmon is essentially just a numeric value where a difference of 1 is a year's difference, you can get the matching differences from a year ago like:
dat$HCF - dat$HCF[match(dat$Date-1,dat$Date)]
#[1] 1 13 NA 12 -24 NA
...which you can also apply within each group like:
do.call(c,by(dat,dat$ACCOUNT,function(x) x$HCF - x$HCF[match(x$Date-1,x$Date)]))
#101144881 101144882 101144883 201144881 201144882 201144883
# 1 13 NA 12 -24 NA
Or using data.table like:
library(data.table)
dat <- as.data.table(dat)
dat[, Difference := HCF - HCF[match(Date-1,Date)], by=ACCOUNT]
dat
# ACCOUNT Date HCF Difference
#1: 10114488 Oct 2010 25 1
#2: 10114488 Oct 2009 24 13
#3: 10114488 Oct 2008 11 NA
#4: 20114488 Jan 2008 18 12
#5: 20114488 Jan 2007 6 -24
#6: 20114488 Jan 2006 30 NA
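If the dates are not yearmon objects, or if matching on fractional year values is a concern, the same lookup can be sketched in base R with an explicit key built from account, year and month. The column names year and month and the shortened account numbers below are assumptions made for this illustration.

```r
# Toy data with separate integer year and month columns (illustrative)
dat <- data.frame(ACCOUNT = c(1, 1, 1, 2, 2, 2),
                  year    = c(2010, 2009, 2008, 2008, 2007, 2006),
                  month   = c(10, 10, 10, 1, 1, 1),
                  HCF     = c(25, 24, 11, 18, 6, 30))

# Key of each row, and the key of the same account and month one year earlier
key_now  <- paste(dat$ACCOUNT, dat$year, dat$month)
key_prev <- paste(dat$ACCOUNT, dat$year - 1, dat$month)

# match() finds the row of last year's reading (NA if there is none)
dat$Difference <- dat$HCF - dat$HCF[match(key_prev, key_now)]
dat
```

Because the account is part of the key, no grouping step is needed, and the whole computation is a single vectorized match().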
