R - approximate missing month values using zoo package - r

Consider code:
library('zoo')
data <- c(1, 2, 4, 6)
dates <- c("2016-11-01", "2016-12-01", "2017-02-01", "2017-04-01");
z1 <- zoo(data, as.yearmon(dates))
z2 <- na.approx(z1)
Variable z2 looks like this:
nov 2016 dec 2016 feb 2017 apr 2017
1 2 4 6
But I need z2 to be similar to this:
nov 2016 dec 2016 jan 2017 feb 2017 mar 2017 apr 2017
1 2 3 4 5 6
I just need to approximate values for months where value is missing. Thanks for any hints.

With the new as.zoo argument, calendar, in zoo 1.8 (which defaults to TRUE so we don't have to specify it) we can just convert the input to "ts" and then back to "zoo" again applying na.approx after that:
na.approx(as.zoo(as.ts(z2)))
## Nov 2016 Dec 2016 Jan 2017 Feb 2017 Mar 2017 Apr 2017
## 1 2 3 4 5 6
With prior versions of zoo we can do the same but manually convert the index back to "yearmon":
na.approx(aggregate(as.zoo(as.ts(z2)), as.yearmon, c))
magrittr
Using zoo with magrittr these can be expressed as the following pipelines, respectively:
library(magrittr)
z2 %>% as.ts %>% as.zoo %>% na.approx
z2 %>% as.ts %>% as.zoo %>% aggregate(as.yearmon, c) %>% na.approx

One way using just na.approx and base R:
#add your data and dates together
df <- data.frame(data, dates = as.Date(dates))
#create all dates using seq
new_dates <- data.frame(dates = seq(as.Date(dates[1]), as.Date(dates[4]), by = 'month'))
#merge the two and then na.approx
new_df <- merge(new_dates, df, by = 'dates', all.x = TRUE)
na.approx(new_df$data)
Out:
[1] 1 2 3 4 5 6

Related

Formatting date column with different formats (including missing day information) - lubridate

I'm relatively new to R. I downloaded a dataset about clinical trial data, but it occurred to me, that the format of the dates in the relative column are mixed up: most of them are like "September 1, 2012", but some are missing the day information (e.g. October 2015).
I want to express them all in the same way (eg. yyyy-mm-dd), to work with them. That went fine, the only problem that is missing is the name of the output column. In the last function (date_correction) I planned to include an argument "output_col" which I can pass the intended name for the created (formatted) column, but it only prints output_col all the time.
Do you know, how I could handle this? To pass the intended name of the output column right into the function?
Is there a better way to solve my problem?
-> I even tried to manage more complex orders-argument for lubricate::parse_date_time like
parse_date_time(input_col, orders="mdy", "my")
but this didn't work.
Here's the code:
library("tidyverse")
library("lubridate")
Observation <- c(seq(1:5))
Date_original <- c("October 2014","August 2014","June 2013",
"June 24, 2010","January 2005")
df_dates <- data.frame(Observation, Date_original)
# looking for a comma in the cell
comma_detect <- function(a_string){
str_detect(a_string, ",")
}
# if comma: assume "mdy", if not apply "my" -> return formatted value
date_correction_row <- function(input_col){
if_else(comma_detect(input_col),
parse_date_time(input_col, orders="mdy"),
parse_date_time(input_col, orders="my"))
}
# prepare function for dataframe:
date_correction <- function(df, input_col, output_col){
mutate(df, output_col = date_correction_row(input_col))
}
df_dates %>% date_correction(df_dates$Date_original, date_formatted) %>% view()
OUTPUT
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01
In the code below we assume that output_col equals "Date". They all set the column name, give no warnings and use Date class.
1) Try each format and take the one that does not give NA. This uses only base R.
output_col <- "Date"
within(df_dates, assign(output_col, pmin(na.rm = TRUE,
as.Date(Date_original, "%B %d, %Y"),
as.Date(paste(Date_original, 1), "%B %Y %d"))))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
2) This can also be done in lubridate. It is important that my is the first rather than second argument to coalesce since it outputs NA for those values that do not match the format whereas mdy gives a wrong date so if that were first coalesce would never get to my. This approach is shorter than (3) but you might prefer the robustness (3) since it does not depend on what is returned for non-matching dates.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
mutate(!!output_col := coalesce(my(Date_original, quiet = TRUE),
mdy(Date_original)))
## Observation Date_original Date
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
3) If you prefer your own method of first checking for comma here is a variation of that which is more compact. It uses my and mdy instead of parse_date_time since my and mdy give Date class results which are more appropriate here than the POSIXct of parse_date_time given that there are no times.
library(dplyr)
library(lubridate)
output_col <- "Date"
df_dates %>%
mutate(!!output_col := if_else(grepl(",", Date_original),
mdy(Date_original), my(Date_original, quiet = TRUE)))
## 1 1 October 2014 2014-10-01
## 2 2 August 2014 2014-08-01
## 3 3 June 2013 2013-06-01
## 4 4 June 24, 2010 2010-06-24
## 5 5 January 2005 2005-01-01
When the date structure is known, I like to explicitly correct the date structure first, then parse. Here I use regex to sub in 1 when the day is missing, then we just parse like normal.
library(tidyverse)
df_dates %>%
mutate(
output_col = gsub("(?<!,)\\s(?=\\d{4})", " 1, ", Date_original, perl = TRUE) %>%
as.Date(., format = '%B %d, %Y')
)
Observation Date_original output_col
1 1 October 2014 2014-10-01
2 2 August 2014 2014-08-01
3 3 June 2013 2013-06-01
4 4 June 24, 2010 2010-06-24
5 5 January 2005 2005-01-01

Filling missing date months in a time series by Item

I am fairly new to R and I am trying to do the following task:
I have the following dataset:
df1 <- data.frame(ITEM = c("A","A","A","A","A","B","B","B","B","B"),
Date = c("Jan-2020","Feb-2020","May-2020","Jun-2020","Jul-2020","Jan-2020","Apr-2020","Jun-2020","Jul-2020","Aug-2020"))
Here is an image:
I have used the library "zoo" to change the date column into yearmon and I am trying to create rows for the missing "yearmon" dates. So something like this:
Anyone has any idea how I can do this?
Thank you
You can create a sequence of yearmon objects for each ITEM and use it in complete.
library(dplyr)
library(zoo)
library(tidyr)
df1 %>%
mutate(Date = as.yearmon(Date, '%b-%Y')) %>%
group_by(ITEM) %>%
complete(Date = seq(min(Date), max(Date), 1/12)) %>%
ungroup
# ITEM Date
# <chr> <yearmon>
# 1 A Jan 2020
# 2 A Feb 2020
# 3 A Mar 2020
# 4 A Apr 2020
# 5 A May 2020
# 6 A Jun 2020
# 7 A Jul 2020
# 8 B Jan 2020
# 9 B Feb 2020
#10 B Mar 2020
#11 B Apr 2020
#12 B May 2020
#13 B Jun 2020
#14 B Jul 2020
#15 B Aug 2020
If you want a sequence of date objects you can use :
df1 %>%
mutate(Date = as.Date(as.yearmon(Date, '%b-%Y'))) %>%
group_by(ITEM) %>%
complete(Date = seq(min(Date), max(Date), 'month')) %>%
ungroup()

Add Month and Year column from complete date column

I have a column with date formatted as MM-DD-YYYY, in the Date format.
I want to add 2 columns one which only contains YYYY and the other only contains MM.
How do I do this?
Once again base R gives you all you need, and you should not do this with sub-strings.
Here we first create a data.frame with a proper Date column. If your date is in text format, parse it first with as.Date() or my anytime::anydate() (which does not need formats).
Then given the date creating year and month is simple:
R> df <- data.frame(date=Sys.Date()+seq(1,by=30,len=10))
R> df[, "year"] <- format(df[,"date"], "%Y")
R> df[, "month"] <- format(df[,"date"], "%m")
R> df
date year month
1 2017-12-29 2017 12
2 2018-01-28 2018 01
3 2018-02-27 2018 02
4 2018-03-29 2018 03
5 2018-04-28 2018 04
6 2018-05-28 2018 05
7 2018-06-27 2018 06
8 2018-07-27 2018 07
9 2018-08-26 2018 08
10 2018-09-25 2018 09
R>
If you want year or month as integers, you can wrap as as.integer() around the format.
A base R option would be to remove the substring with sub and then read with read.table
df1[c('month', 'year')] <- read.table(text=sub("-\\d{2}-", ",", df1$date), sep=",")
Or using tidyverse
library(tidyverse)
separate(df1, date, into = c('month', 'day', 'year') %>%
select(-day)
Note: it may be better to convert to datetime class instead of using the string formatting.
df1 %>%
mutate(date =mdy(date), month = month(date), year = year(date))
data
df1 <- data.frame(date = c("05-21-2017", "06-25-2015"))

Date formatting MMM-YYYY

I have a dataset with dates in following format:
Initial:
Jan-2015 Apr-2013 Jun-2014 Jan-2015 Jan-2016 Jan-2015 Jan-2016 Jan-2015 Apr-2012 Nov-2012 Jun-2013 Sep-2013
Final:
Feb-2014 Jan-2013 Sep-2014 Apr-2013 Sep-2014 Mar-2013 Aug-2012 Apr-2012 Oct-2012 Oct-2013 Jun-2014 Oct-2013
I would like to perform these steps:
create dummy variables for Month and Year
Subtract these dates from another dates to find out duration (final- initials) in months
I would like to do these in R?
You could use as.yearmon from the zoo package for this.
library(zoo)
12 * (as.yearmon("Jan-2015", "%b-%Y") - as.yearmon("Feb-2014", "%b-%Y"))
# result
# [1] 11
To expand on #neilfws answer, you can use the month and year functions from the lubridate package to create your dummy variables with the month and year in your data frame.
Here is the code:
library(lubridate)
library(zoo)
df <- data.frame(Initial = c("Jan-2015", "Apr-2013", "Jun-2014", "Jan-2015", "Jan-2016", "Jan-2015",
"Jan-2016", "Jan-2015", "Apr-2012", "Nov-2012", "Jun-2013", "Sep-2013"),
Final = c("Feb-2014", "Jan-2013", "Sep-2014", "Apr-2013", "Sep-2014", "Mar-2013",
"Aug-2012", "Apr-2012", "Oct-2012", "Oct-2013", "Jun-2014", "Oct-2013"))
df$Initial <- as.character(df$Initial)
df$Final <- as.character(df$Final)
df$Initial <- as.yearmon(df$Initial, "%b-%Y")
df$Final <- as.yearmon(df$Final, "%b-%Y")
df$month_initial <- month(df$Initial)
df$year_intial <- year(df$Initial)
df$month_final <- month(df$Final)
df$year_final <- year(df$Final)
df$Difference <- 12*(df$Initial-df$Final)
And here is the final data.frame:
> head(df)
Initial Final month_initial year_intial month_final year_final Difference
1 Jan 2015 Feb 2014 1 2015 2 2014 11
2 Apr 2013 Jan 2013 4 2013 1 2013 3
3 Jun 2014 Sep 2014 6 2014 9 2014 -3
4 Jan 2015 Apr 2013 1 2015 4 2013 21
5 Jan 2016 Sep 2014 1 2016 9 2014 16
6 Jan 2015 Mar 2013 1 2015 3 2013 22
Hope this helps!

Aggregate data to weekly level with every week starting from Monday

I have a data frame like,
2015-01-30 1 Fri
2015-01-30 2 Sat
2015-02-01 3 Sun
2015-02-02 1 Mon
2015-02-03 1 Tue
2015-02-04 1 Wed
2015-02-05 1 Thu
2015-02-06 1 Fri
2015-02-07 1 Sat
2015-02-08 1 Sun
I want to aggregaate it to weekly level such that every week starts from "monday" and ends in "sunday". So, in the aggregated data for above, first week should end on 2015-02-01.
output should look like something for above
firstweek 6
secondweek 7
I tried this,
data <- as.xts(data$value,order.by=as.Date(data$interval))
weekly <- apply.weekly(data,sum)
But here in the final result, every week is starting from Sunday.
This should work. I've called the dataframe m and named the columns possibly different to yours.
library(plyr) # install.packages("plyr")
colnames(m) = c("Date", "count","Day")
start = as.Date("2015-01-26")
m$Week <- floor(unclass(as.Date(m$Date) - as.Date(start)) / 7) + 1
m$Week = as.numeric(m$Week)
m %>% group_by(Week) %>% summarise(count = sum(count))
The library plyr is great for data manipulation, but it's just a rough hack to get the week number in.
Convert to date and use the %W format to get a week number...
df <- read.csv(textConnection("2015-01-30, 1, Fri,
2015-01-30, 2, Sat,
2015-02-01, 3, Sun,
2015-02-02, 1, Mon,
2015-02-03, 1, Tue,
2015-02-04, 1, Wed,
2015-02-05, 1, Thu,
2015-02-06, 1, Fri,
2015-02-07, 1, Sat,
2015-02-08, 1, Sun"), header=F, stringsAsFactors=F)
names(df) <- c("date", "something", "day")
df$date <- as.Date(df$date, format="%Y-%m-%d")
df$week <- format(df$date, "%W")
aggregate(df$something, list(df$week), sum)
Wit dplyr and lubridate is this really easy thanks to the function isoweek
my.df <- read.table(header=FALSE, text=
'2015-01-30 1 Fri
2015-01-30 2 Sat
2015-02-01 3 Sun
2015-02-02 1 Mon
2015-02-03 1 Tue
2015-02-04 1 Wed
2015-02-05 1 Thu
2015-02-06 1 Fri
2015-02-07 1 Sat
2015-02-08 1 Sun')
my.df %>% mutate(week = isoweek(V1)) %>% group_by(week) %>% summarise(sum(V2))
or a bit shorter
my.df %>% group_by(isoweek(V1)) %>% summarise(sum(V2))

Resources