I have collected some time series data from the web, and the timestamps look like this:
24 Jun
21 Mar
20 Jan
10 Dec
20 Jun
20 Jan
10 Dec
...
The interesting part is that the year is missing from the data. However, all the records are ordered, so you can infer the year from each record's position and fill in the missing value. After imputing, the data should look like this:
24 Jun 2014
21 Mar 2014
20 Jan 2014
10 Dec 2013
20 Jun 2013
20 Jan 2013
10 Dec 2012
...
Before rolling up my sleeves and writing a for loop with nested logic: is there an easy way, ideally one that works out of the box in R, to impute the missing year?
Thanks a lot for any suggestion!
Here's one idea
## Make data easily reproducible
df <- data.frame(day=c(24, 21, 20, 10, 20, 20, 10),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec"))
## Convert each month-day combo to its day of the year (the "julian date");
## 2012 is used as a dummy year because it is a leap year, so 29 Feb parses too
datestring <- paste("2012", match(df[[2]], month.abb), df[[1]], sep = "-")
date <- strptime(datestring, format = "%Y-%m-%d")
julian <- as.integer(strftime(date, format = "%j"))
## Transitions between years occur wherever julian date increases between
## two observations
df$year <- 2014 - cumsum(diff(c(julian[1], julian))>0)
## Check that it worked
df
# day month year
# 1 24 Jun 2014
# 2 21 Mar 2014
# 3 20 Jan 2014
# 4 10 Dec 2013
# 5 20 Jun 2013
# 6 20 Jan 2013
# 7 10 Dec 2012
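The same idea can be wrapped in a small function. This is only a sketch: the name impute_year and its arguments are hypothetical (not from the answer above), and it assumes the records are sorted newest-first, use English month abbreviations, and are never more than a year apart.

```r
# Hypothetical helper: impute years for day/month records sorted newest-first.
impute_year <- function(day, month, start_year) {
  # Day of the year in a dummy leap year (2012), so 29 Feb is representable
  dates <- as.Date(paste(2012, match(month, month.abb), day, sep = "-"),
                   format = "%Y-%m-%d")
  doy <- as.integer(format(dates, "%j"))
  # Going back in time the day of the year normally decreases;
  # an increase means we have crossed into the previous year
  start_year - cumsum(c(FALSE, diff(doy) > 0))
}

years <- impute_year(day   = c(24, 21, 20, 10, 20, 20, 10),
                     month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec"),
                     start_year = 2014)
years
# [1] 2014 2014 2014 2013 2013 2013 2012
```

Note that a gap of more than twelve months between two consecutive records cannot be detected this way, because only the day/month ordering is available.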
The OP has requested to complete the years in descending order starting in 2014.
Here is an alternative approach which works without date conversion and fake dates. Furthermore, this approach can be modified to work with fiscal years which start on a different month than January.
# create sample dataset
df <- data.frame(
day = c(24L, 21L, 20L, 10L, 20L, 20L, 21L, 10L, 30L, 10L, 10L, 7L),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Jan", "Dec", "Jan",
"Jan", "Jan", "Jun"))
df$year <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb)) + df$day) > 0))
df
day month year
1 24 Jun 2014
2 21 Mar 2014
3 20 Jan 2014
4 10 Dec 2013
5 20 Jun 2013
6 20 Jan 2013
7 21 Jan 2012
8 10 Dec 2011
9 30 Jan 2011
10 10 Jan 2011
11 10 Jan 2011
12 7 Jun 2010
Completion of fiscal years
Let's assume the business has decided to start its fiscal year on February 1. Thus, January lies in a different fiscal year than February or March of the same calendar year.
To handle fiscal years, we only need to shuffle the factor levels accordingly:
df$fy <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb[c(2:12, 1)])) + df$day) > 0))
df
day month year fy
1 24 Jun 2014 2014
2 21 Mar 2014 2014
3 20 Jan 2014 2013
4 10 Dec 2013 2013
5 20 Jun 2013 2013
6 20 Jan 2013 2012
7 21 Jan 2012 2011
8 10 Dec 2011 2011
9 30 Jan 2011 2010
10 10 Jan 2011 2010
11 10 Jan 2011 2010
12 7 Jun 2010 2010
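The rotation of the factor levels can be parameterised by the month in which the fiscal year starts. The following is a sketch under that assumption; impute_fy and fy_start are hypothetical names introduced here for illustration.

```r
# Hypothetical helper generalising the approach above to any start month.
impute_fy <- function(day, month, start_year, fy_start = 1L) {
  # Rotate month.abb so the first month of the fiscal year gets level 1
  lev <- month.abb[c(fy_start:12, seq_len(fy_start - 1L))]
  # Sortable key: 100 * (position of month in fiscal year) + day
  key <- 100L * as.integer(factor(month, levels = lev)) + as.integer(day)
  # Records are newest-first, so an increasing key marks a year boundary
  start_year - cumsum(c(0L, diff(key) > 0))
}

day   <- c(24L, 21L, 20L, 10L, 20L, 20L, 21L, 10L, 30L, 10L, 10L, 7L)
month <- c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Jan", "Dec", "Jan",
           "Jan", "Jan", "Jun")

cal <- impute_fy(day, month, 2014)                 # calendar years
fy  <- impute_fy(day, month, 2014, fy_start = 2L)  # fiscal year starts in Feb
```

With fy_start = 1 the levels are simply month.abb, so the function reproduces the year column; with fy_start = 2 it reproduces the fy column shown above.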
Related
I have a list of dfs and a list of annual budgets.
Each df represents one business year, and each budget represents a total spend for that year.
# the business year starts from Feb and ends in Jan.
# the budget column is first populated with the % of annual budget allocation
df <- data.frame(monthly_budget=c(0.06, 0.13, 0.07, 0.06, 0.1, 0.06, 0.06, 0.09, 0.06, 0.06, 0.1, 0.15),
month=month.abb[c(2:12, 1)])
# dfs for 3 years
df2019_20 <- df
df2020_21 <- df
df2021_22 <- df
# budgets for 3 years
budget2019_20 <- 6000000
budget2020_21 <- 7000000
budget2021_22 <- 8000000
# into lists
df_list <- list(df2019_20, df2020_21, df2021_22)
budget_list <- list(budget2019_20, budget2020_21, budget2021_22)
I've written the following function to both apply the right year to Jan and fill in the rest by deparsing the respective df's name.
It works perfectly if I supply a single df and a single budget.
budget_func <- function(df, budget){
df_name <- deparse(substitute(df))
df <- df %>%
mutate(year=ifelse(month=="Jan",
as.numeric(str_sub(df_name, -2)) + 2000,
as.numeric(str_extract(df_name, "\\d{4}(?=_)")))
)
for (i in 1:12){
df[i,1] <- df[i,1] * budget
i <- i+1
}
return(df)
}
To speed things up I want to pass both lists as arguments to mapply. However I don't get the results I want - what am I doing wrong?
final_budgets <- mapply(budget_func, df_list, budget_list)
deparse/substitute works when a single dataset is passed by name, but inside a loop such as mapply the function receives an internal expression rather than the object's name. Instead, we can add a new argument that carries the name. The list also needs names when it is created: we can use list(df2019_20 = df2019_20, ...), or setNames, or, as an easier option, dplyr::lst, which names each element after the object passed.
library(dplyr)
library(stringr)

budget_func <- function(df, budget, nm1){
  df_name <- nm1
  df <- df %>%
    mutate(year=ifelse(month=="Jan",
                       as.numeric(str_sub(df_name, -2)) + 2000,
                       as.numeric(str_extract(df_name, "\\d{4}(?=_)")))
    )
  # scale the allocation shares by the annual budget (vectorised, no loop needed)
  df[[1]] <- df[[1]] * budget
  return(df)
}
-testing
df_list <- dplyr::lst(df2019_20, df2020_21, df2021_22)
budget_list <- list(budget2019_20, budget2020_21, budget2021_22)
Map(budget_func, df_list, budget_list, names(df_list))
-output
$df2019_20
monthly_budget month year
1 360000 Feb 2019
2 780000 Mar 2019
3 420000 Apr 2019
4 360000 May 2019
5 600000 Jun 2019
6 360000 Jul 2019
7 360000 Aug 2019
8 540000 Sep 2019
9 360000 Oct 2019
10 360000 Nov 2019
11 600000 Dec 2019
12 900000 Jan 2020
$df2020_21
monthly_budget month year
1 420000 Feb 2020
2 910000 Mar 2020
3 490000 Apr 2020
4 420000 May 2020
5 700000 Jun 2020
6 420000 Jul 2020
7 420000 Aug 2020
8 630000 Sep 2020
9 420000 Oct 2020
10 420000 Nov 2020
11 700000 Dec 2020
12 1050000 Jan 2021
$df2021_22
monthly_budget month year
1 480000 Feb 2021
2 1040000 Mar 2021
3 560000 Apr 2021
4 480000 May 2021
5 800000 Jun 2021
6 480000 Jul 2021
7 480000 Aug 2021
8 720000 Sep 2021
9 480000 Oct 2021
10 480000 Nov 2021
11 800000 Dec 2021
12 1200000 Jan 2022
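To see why deparse(substitute()) stops working inside a loop, here is a minimal base R sketch. The function arg_name is a toy introduced only for this illustration; lapply is used because the expression it passes is easy to show, and mapply behaves analogously.

```r
# Toy function (hypothetical) that reports the expression it was called with
arg_name <- function(x) deparse(substitute(x))

df2019_20 <- data.frame(v = 1)

# Called directly, the expression is the object's name
direct <- arg_name(df2019_20)                      # "df2019_20"

# Inside lapply, the function sees an internal indexing expression instead
looped <- lapply(list(df2019_20), arg_name)[[1]]   # "X[[i]]"

# Passing the names explicitly sidesteps the problem
named  <- list(df2019_20 = df2019_20)
passed <- Map(function(d, nm) nm, named, names(named))
```

Map is a wrapper around mapply that does not simplify the result, so it returns a plain list of data frames; mapply with its default SIMPLIFY = TRUE tries to collapse the results into a matrix, which is another reason the original call looked wrong.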
I have a wide data frame in R and I am trying to rename the column names so that I can reshape it to a long format.
Currently, the data is structured like this:
long lat V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V477
I'd like to rename the columns so that they are:
long lat Jan_1979 Feb_1979 Mar_1979 Apr_1979 ... Sept_2018
I'm not sure how to go about doing this. Any help would be appreciated.
There are multiple ways you could do this.
One way in base R is to use seq to create monthly dates and format them as needed. For example, you can create the first 10 names, starting from 1979-01-01, with
format(seq(as.Date('1979-01-01'), length.out = 10, by = "1 month"), "%b_%Y")
#[1] "Jan_1979" "Feb_1979" "Mar_1979" "Apr_1979" "May_1979" "Jun_1979" "Jul_1979"
#[8] "Aug_1979" "Sep_1979" "Oct_1979"
For your case, this should work
names(df)[3:479] <- format(seq(as.Date('1979-01-01'),
length.out = 477, by = "1 month"), "%b_%Y")
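One caveat: format(..., "%b") uses the month abbreviations of the current locale, so the names may not be English on every machine. A locale-independent sketch builds the same names from month.abb and integer arithmetic (idx and new_names are names chosen here for illustration):

```r
# Month offsets 0..476 counted from January 1979
idx <- 0:476

# Month abbreviation from the offset modulo 12, year from the integer division
new_names <- paste(month.abb[idx %% 12 + 1], 1979 + idx %/% 12, sep = "_")

head(new_names, 3)
# [1] "Jan_1979" "Feb_1979" "Mar_1979"
tail(new_names, 1)
# [1] "Sep_2018"
```

These could then be assigned with names(df)[3:479] <- new_names. Note that month.abb gives "Sep", not "Sept" as written in the question.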
We can use expand.grid to get all month year combinations:
name_combn <- expand.grid(month.abb, 1979:2018)[1:477,]
names(df) <- c('long', 'lat', paste(name_combn$Var1, name_combn$Var2, sep = "_"))
Output:
> head(name_combn, 20)
Var1 Var2
1 Jan 1979
2 Feb 1979
3 Mar 1979
4 Apr 1979
5 May 1979
6 Jun 1979
7 Jul 1979
8 Aug 1979
9 Sep 1979
10 Oct 1979
11 Nov 1979
12 Dec 1979
13 Jan 1980
14 Feb 1980
15 Mar 1980
16 Apr 1980
17 May 1980
18 Jun 1980
19 Jul 1980
20 Aug 1980
I'm using data.table and trying to add a new column called "Season", which holds the corresponding season (e.g. summer, winter) based on a column called "MonthName".
I'm wondering whether there is a more efficient way to add a season column to a data table based on month values.
These are the first 6 of 300,000 observations; assume that the table is called "dt".
rrp Year Month Finyear hourminute AvgPriceByTOD MonthName
1: 35.27500 1999 1 1999 00:00 33.09037 Jan
2: 21.01167 1999 1 1999 00:00 33.09037 Jan
3: 25.28667 1999 2 1999 00:00 33.09037 Feb
4: 18.42334 1999 2 1999 00:00 33.09037 Feb
5: 16.67499 1999 2 1999 00:00 33.09037 Feb
6: 18.90001 1999 2 1999 00:00 33.09037 Feb
I have tried the following code:
dt[, Season := ifelse(MonthName == c("Jun", "Jul", "Aug"), "Winter",
ifelse(MonthName == c("Dec", "Jan", "Feb"), "Summer",
ifelse(MonthName == c("Sep", "Oct", "Nov"), "Spring",
ifelse(MonthName == c("Mar", "Apr", "May"), "Autumn", NA))))]
Which returns:
rrp Year Month Finyear hourminute AvgPriceByTOD MonthName Season
1: 35.27500 1999 1 1999 00:00 33.09037 Jan NA
2: 21.01167 1999 1 1999 00:00 33.09037 Jan Summer
3: 25.28667 1999 2 1999 00:00 33.09037 Feb Summer
4: 18.42334 1999 2 1999 00:00 33.09037 Feb NA
5: 16.67499 1999 2 1999 00:00 33.09037 Feb NA
6: 18.90001 1999 2 1999 00:00 33.09037 Feb Summer
I also get these warnings:
Warning messages:
1: In MonthName == c("Jun", "Jul", "Aug") :
longer object length is not a multiple of shorter object length
2: In MonthName == c("Dec", "Jan", "Feb") :
longer object length is not a multiple of shorter object length
3: In MonthName == c("Sep", "Oct", "Nov") :
longer object length is not a multiple of shorter object length
4: In MonthName == c("Mar", "Apr", "May") :
longer object length is not a multiple of shorter object length
Alongside this, for reasons I don't understand, some of the summer months are correctly assigned "Summer" but others are assigned NA; e.g. rows 1 and 2 should both be summer, but they return different values.
Thanks in advance!
One pretty straightforward way is to use a lookup table to map month names to seasons:
# create a named vector where names are the month names and elements are seasons
seasons <- rep(c("winter","spring","summer","fall"), each = 3)
names(seasons) <- month.abb[c(6:12,1:5)] # thanks thelatemail for pointing out month.abb
seasons
# Jun Jul Aug Sep Oct Nov Dec Jan
#"winter" "winter" "winter" "spring" "spring" "spring" "summer" "summer"
# Feb Mar Apr May
#"summer" "fall" "fall" "fall"
Use it:
dt[, season := seasons[MonthName]]
data:
dt <- setDT(read.table(text=" rrp Year Month Finyear hourminute AvgPriceByTOD MonthName
1: 35.27500 1999 1 1999 00:00 33.09037 Jan
2: 21.01167 1999 1 1999 00:00 33.09037 Jan
3: 25.28667 1999 2 1999 00:00 33.09037 Feb
4: 18.42334 1999 2 1999 00:00 33.09037 Feb
5: 16.67499 1999 2 1999 00:00 33.09037 Feb
6: 18.90001 1999 2 1999 00:00 33.09037 Feb",
header = TRUE, stringsAsFactors = FALSE))
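The lookup relies on MonthName being a character vector, which is why stringsAsFactors = FALSE matters in the data above. If the column were a factor, indexing would use the factor's integer codes rather than the names. A small base R sketch of the pitfall (m_chr and m_fac are illustrative names):

```r
seasons <- setNames(rep(c("winter", "spring", "summer", "fall"), each = 3),
                    month.abb[c(6:12, 1:5)])

m_chr <- c("Jan", "Feb", "Jun")
m_fac <- factor(m_chr, levels = month.abb)

by_name  <- unname(seasons[m_chr])                # looked up by name
by_code  <- unname(seasons[m_fac])                # looked up by codes 1, 2, 6
by_fixed <- unname(seasons[as.character(m_fac)])  # looked up by name again

by_name
# [1] "summer" "summer" "winter"
by_code
# [1] "winter" "winter" "spring"
```

Wrapping a factor column in as.character() before the lookup avoids the silent mismatch.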
A bit of typing, but the code is efficient
dt[MonthName %in% c("Jun","Jul","Aug"), Season := "Winter"]
dt[MonthName %in% c("Dec","Jan","Feb"), Season := "Summer"]
dt[MonthName %in% c("Sep","Oct","Nov"), Season := "Spring"]
dt[MonthName %in% c("Mar","Apr","May"), Season := "Autumn"]
Here we are assigning by reference on a subset of the data.table.
I prefer this to a lot of nested ifelse calls.
If you want to check if a value is in a vector, you have to use %in%. See the different behaviour of:
myVec <- c("a","b","c")
"a" == myVec
[1] TRUE FALSE FALSE
"a" %in% myVec
[1] TRUE
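This also explains the odd NA pattern in the question: == recycles the shorter vector and compares position by position. A sketch using the six MonthName values from the question:

```r
MonthName <- c("Jan", "Jan", "Feb", "Feb", "Feb", "Feb")
summer    <- c("Dec", "Jan", "Feb")

# summer is recycled to Dec, Jan, Feb, Dec, Jan, Feb and compared pairwise
eq_result <- MonthName == summer
eq_result
# [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE

# %in% asks, for each element, whether it occurs anywhere in summer
in_result <- MonthName %in% summer
in_result
# [1] TRUE TRUE TRUE TRUE TRUE TRUE
```

The TRUE positions of the == result (rows 2, 3 and 6) are exactly the rows that were labelled "Summer" in the question's output. When the longer length is not a multiple of the shorter one, R additionally emits the "longer object length" warning.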
I've created a multiple line graph using ggplot2, where each line represents a year plotted against month. Volume is represented on the y-axis.
Here is the code I used to plot the figure:
ggplot(data=df26, aes(x=Month, y=C1, group=Year, colour=factor(Year))) +
geom_line(size=.75) + geom_point() +
scale_x_discrete(limits=c("Jan","Feb","Mar","Apr","May","Jun","Jul",
"Aug","Sep","Oct","Nov","Dec")) +
scale_y_continuous(labels=comma) +
scale_colour_manual(values=cPalette, name="Year") +
ylab("Volume")
Question: How do I also include another line to the plot that represents the mean volume within each month with the ability to modify the line thickness and color of that mean line? So far, all of my attempts at producing the right code have been unsuccessful (most likely due to my relative newbie status using R). Any help is much appreciated!
Edit: Dataframe df26 is provided below (as requested by a commenter):
Year Month C1
2010 Jan NA
2010 Feb NA
2010 Mar NA
2010 Apr NA
2010 May NA
2010 Jun NA
2010 Jul NA
2010 Aug 183.6516764
2010 Sep 120.6303348
2010 Oct 85.31007613
2010 Nov 13.7347988
2010 Dec 20.93950545
2011 Jan 13.35780833
2011 Feb 14.16910945
2011 Mar 9.786319721
2011 Apr 41.24848885
2011 May 122.3014387
2011 Jun 422.4012809
2011 Jul 539.8569592
2011 Aug 527.6301222
2011 Sep 385.8199781
2011 Oct 201.7846973
2011 Nov 27.91934061
2011 Dec 7.919004379
2012 Jan 10.22724424
2012 Feb 10.64391791
2012 Mar 88.06585438
2012 Apr 124.0320675
2012 May 325.1399457
2012 Jun 465.938168
2012 Jul 567.2273488
2012 Aug 459.769634
2012 Sep 333.8636373
2012 Oct 102.0607986
2012 Nov 23.18822051
2012 Dec 15.64841121
2013 Jan 7.458238256
2013 Feb 4.34972039
2013 Mar 26.2019396
2013 Apr 38.82781323
2013 May 257.0920645
2013 Jun 357.594195
2013 Jul 383.2780483
2013 Aug 456.469314
2013 Sep 319.3616298
2013 Oct NA
2013 Nov NA
2013 Dec 17.01748185
You need to calculate the means. Then you can plot them.
Using dplyr
library(dplyr)
df26means <- df26 %>%
group_by(Month) %>%
summarize(C1 = mean(C1, na.rm = TRUE))
Then add it to your plot:
ggplot(data=df26, aes(x=Month, y=C1, group=Year, colour=factor(Year))) +
geom_line(size=.75) + geom_point() +
scale_x_discrete(limits=c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
scale_y_continuous(labels=comma) +
scale_colour_manual(values=cPalette, name="Year") +
ylab("Volume") +
geom_line(data = df26means, aes(group = 1), size = 1.25, color = "black")
I'd recommend using annotate to add a nice piece of text on the plot identifying that line as the mean line. To get it in the legend, you'd probably need to set df26means$Year = "Mean", convert df26$Year to a character, rbind the two dataframes together, then convert Year to a factor. The plot code would be simpler, but the data wrangling is a bit more complicated.
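The data wrangling for the legend variant might look roughly like this. It is a base R sketch on a made-up stand-in for df26 (the values are invented for illustration), not the real data from the question.

```r
# Toy stand-in for df26; values are made up for illustration
df26 <- data.frame(Year  = rep(c(2011, 2012), each = 3),
                   Month = rep(c("Jan", "Feb", "Mar"), 2),
                   C1    = c(10, 20, NA, 30, 40, 50),
                   stringsAsFactors = FALSE)

# Monthly means; the formula interface drops rows where C1 is NA
means <- aggregate(C1 ~ Month, data = df26, FUN = mean)
means$Year <- "Mean"

# Year must be character in both pieces before they are stacked
df26$Year <- as.character(df26$Year)
combined  <- rbind(df26, means[, names(df26)])

# Put "Mean" last so it appears at the end of the legend
combined$Year <- factor(combined$Year, levels = c("2011", "2012", "Mean"))
```

Plotting combined with group = Year and colour = Year would then give the mean line its own legend entry; the palette passed to scale_colour_manual would need one extra colour for it.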
I have a 1 million row data frame that contains monthly water usage data (HCF) for various accounts from 2003-2010:
> head(LeakyAccts)
ACCOUNT Date HCF
1 10114488 Oct 2010 25
2 10114488 Sep 2007 24
3 10114488 Nov 2006 11
4 10114488 Jun 2008 18
5 10114488 Aug 2003 6
6 10114488 Jan 2008 30
Dates are of class yearmon. I want to know how much each account used every month compared to the same month in the previous year. So for each row, I'd like to find the difference between the usage in that month (Date) and the usage in the same month of the previous year (Date - 1). In other words, I want this:
for(i in 1:nrow(LeakyAccts)) {
row <- which((LeakyAccts$ACCOUNT == LeakyAccts[i,]$ACCOUNT) &
(LeakyAccts$Date == (LeakyAccts[i,]$Date - 1)))
if (length(row) == 1) { # no previous year for 2003
LeakyAccts[i,]$Difference <- LeakyAccts[i,]$HCF - LeakyAccts[row,]$HCF
}
}
Needless to say, this loop takes hours to run and seems very un-R-like. How can I avoid using an ugly for loop and speed up the computation? Is there perhaps a way to do this using an apply function or a data.table?
I've reconfigured your data a little to give a complete example:
library(zoo)
dat <- structure(list(
ACCOUNT = c(10114488L, 10114488L, 10114488L, 20114488L, 20114488L, 20114488L),
Date = structure(c(2010.75, 2009.75, 2008.75, 2008, 2007, 2006), class = "yearmon"),
HCF = c(25L, 24L, 11L, 18L, 6L, 30L)),
.Names = c("ACCOUNT", "Date", "HCF"),
row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
Which looks like:
ACCOUNT Date HCF
1 10114488 Oct 2010 25
2 10114488 Oct 2009 24
3 10114488 Oct 2008 11
4 20114488 Jan 2008 18
5 20114488 Jan 2007 6
6 20114488 Jan 2006 30
Since yearmon is essentially just a numeric value where a difference of 1 is a year's difference, you can get the matching differences from a year ago like:
dat$HCF - dat$HCF[match(dat$Date-1,dat$Date)]
#[1] 1 13 NA 12 -24 NA
...which you can also apply within each group like:
do.call(c,by(dat,dat$ACCOUNT,function(x) x$HCF - x$HCF[match(x$Date-1,x$Date)]))
#101144881 101144882 101144883 201144881 201144882 201144883
# 1 13 NA 12 -24 NA
Or using data.table like:
library(data.table)
dat <- as.data.table(dat)
dat[, Difference := HCF - HCF[match(Date-1,Date)], by=ACCOUNT]
dat
# ACCOUNT Date HCF Difference
#1: 10114488 Oct 2010 25 1
#2: 10114488 Oct 2009 24 13
#3: 10114488 Oct 2008 11 NA
#4: 20114488 Jan 2008 18 12
#5: 20114488 Jan 2007 6 -24
#6: 20114488 Jan 2006 30 NA
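If the dates are available as separate integer year and month columns, the same match idea works without zoo and without grouping, by building a text key per row. This is a sketch under that assumption; the year and month columns and the key names are introduced here for illustration and are not part of the original data.

```r
dat <- data.frame(
  ACCOUNT = c(10114488L, 10114488L, 10114488L, 20114488L, 20114488L, 20114488L),
  year    = c(2010L, 2009L, 2008L, 2008L, 2007L, 2006L),
  month   = c(10L, 10L, 10L, 1L, 1L, 1L),
  HCF     = c(25L, 24L, 11L, 18L, 6L, 30L)
)

# Key for each row, and the key of the same account/month one year earlier
this_key <- paste(dat$ACCOUNT, dat$year, dat$month)
prev_key <- paste(dat$ACCOUNT, dat$year - 1L, dat$month)

# match() returns NA where no earlier record exists, which propagates to NA
dat$Difference <- dat$HCF - dat$HCF[match(prev_key, this_key)]
dat$Difference
# [1]   1  13  NA  12 -24  NA
```

Because the account number is part of the key, a single vectorised match covers all accounts at once. It assumes at most one record per account and month.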