I have data consisting of random dates from 2008 to 2020 and their corresponding values:
Date Val
September 16, 2012 32
September 19, 2014 33
January 05, 2008 26
June 07, 2017 02
December 15, 2019 03
May 28, 2020 18
I want to fill in the missing dates from January 01, 2008 to March 31, 2020, with a corresponding value of 1.
I have looked at some posts such as Post1 and Post2, but I was not able to solve the problem based on them. I am a beginner in R.
I am looking for data like this:
Date Val
January 01, 2008 1
January 02, 2008 1
January 03, 2008 1
January 04, 2008 1
January 05, 2008 26
........
Use tidyr::complete:
library(dplyr)
df %>%
  mutate(Date = as.Date(Date, "%B %d, %Y")) %>%
  tidyr::complete(Date = seq(as.Date('2008-01-01'), as.Date('2020-03-31'),
                             by = 'day'), fill = list(Val = 1)) %>%
  mutate(Date = format(Date, "%B %d, %Y"))
# A tibble: 4,475 x 2
# Date Val
# <chr> <dbl>
# 1 January 01, 2008 1
# 2 January 02, 2008 1
# 3 January 03, 2008 1
# 4 January 04, 2008 1
# 5 January 05, 2008 26
# 6 January 06, 2008 1
# 7 January 07, 2008 1
# 8 January 08, 2008 1
# 9 January 09, 2008 1
#10 January 10, 2008 1
# … with 4,465 more rows
data
df <- structure(list(Date = c("September 16, 2012", "September 19, 2014",
"January 05, 2008", "June 07, 2017", "December 15, 2019", "May 28, 2020"
), Val = c(32L, 33L, 26L, 2L, 3L, 18L)), class = "data.frame",
row.names = c(NA, -6L))
We can create a data frame with the desired date range, join our data frame onto it, and replace all NAs with 1:
library(tidyverse)
days_seq %>%
  left_join(df) %>%
  mutate(Val = if_else(is.na(Val), as.integer(1), Val))
Joining, by = "Date"
# A tibble: 4,474 x 2
Date Val
<date> <int>
1 2008-01-01 1
2 2008-01-02 1
3 2008-01-03 1
4 2008-01-04 1
5 2008-01-05 33
6 2008-01-06 1
7 2008-01-07 1
8 2008-01-08 1
9 2008-01-09 1
10 2008-01-10 1
# ... with 4,464 more rows
Data
days_seq <- tibble(Date = seq(as.Date("2008/01/01"), as.Date("2020/03/31"), "days"))
df <- tibble::tribble(
~Date, ~Val,
"2012/09/16", 32L,
"2012/09/19", 33L,
"2008/01/05", 33L
)
df$Date <- as.Date(df$Date)
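For readers who prefer to avoid extra packages, the same join-and-fill idea can be sketched in base R with merge(). The short date range and the object names below are illustrative assumptions, not part of the answers above.

```r
# Base R sketch: fill missing days with a default value via merge().
# The small date range and the object names are illustrative assumptions.
df <- data.frame(
  Date = as.Date(c("2008-01-05", "2008-01-08")),
  Val  = c(26L, 33L)
)

# Every day in the range we want to cover
all_days <- data.frame(
  Date = seq(as.Date("2008-01-01"), as.Date("2008-01-10"), by = "day")
)

# all.x = TRUE keeps every row of all_days; unmatched days get NA in Val
filled <- merge(all_days, df, by = "Date", all.x = TRUE)

# Replace the NAs introduced by the join with the default value 1
filled$Val[is.na(filled$Val)] <- 1L

filled
```

Like the left_join() approach, this drops observations that fall outside the chosen range (such as May 28, 2020 in the question), whereas tidyr::complete() keeps them. That is why the two outputs above differ by one row.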
Related
I have a large dataset that spans over 20 years. I have a column for the date and another column for the hour ending (HE). I'm trying to add a new column to provide the hour by hour (hrxhr) information in a given year (so running total). So date: Jan 1, 2023, HE: 1 should be hrxhr: 1 and Dec 31, 2023, HE: 24, should be hrxhr:8760 (8784 on leap years).
Should look like this:
YEAR  MONTH  DAY  HOUR OF DAY  Month_num  Date        Date1  NEW COLUMN hrxhr
2023  Dec    31   22           12         2023-12-31  365    8758
2023  Dec    31   23           12         2023-12-31  365    8759
2023  Dec    31   24           12         2023-12-31  365    8760
2024  Jan    01   1            01         2024-01-01  1      1
2024  Jan    01   2            01         2024-01-01  1      2
At first I thought I could get the Julian date and then multiply it by the HE, but that is incorrect: Jan 2, 2023, HE: 1 would then equal 2, whereas the hrxhr running total should equal 25.
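The arithmetic the question is reaching for is (day of year - 1) * 24 + HE rather than day of year * HE. A minimal base R sketch, assuming HE runs from 1 to 24 and every day has exactly 24 hours (no daylight saving adjustment); the helper name hour_of_year is hypothetical:

```r
# Running hour of the year from a date and its hour ending (HE).
# Assumes HE is 1..24 and that every day has 24 hours (no DST handling).
hour_of_year <- function(date, he) {
  # "%j" is the day of the year: 1..365 (366 in leap years)
  day_of_year <- as.integer(format(as.Date(date), "%j"))
  # All full days before this one contribute 24 hours each
  (day_of_year - 1L) * 24L + as.integer(he)
}

hour_of_year("2023-01-02", 1)   # 25
hour_of_year("2023-12-31", 24)  # 8760
hour_of_year("2024-12-31", 24)  # 8784 (leap year)
```

If the Date1 column (day of year) already exists, the same formula can be applied to it directly as (Date1 - 1) * 24 + HOUR_OF_DAY.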
In base R:
df <- data.frame(
YEAR = c(2023L, 2023L, 2023L, 2024L, 2023L),
MONTH = c("Dec", "Dec", "Dec", "Jan", "Jan"), DAY = c(31L, 31L, 31L, 1L, 1L),
HOUR_OF_DAY = c(22L, 23L, 24L, 1L, 2L), Month_num = c(12L,
12L, 12L, 12L, 12L), Date = c("2023-12-31", "2023-12-31",
"2023-12-31", "2024-01-01", "2024-01-01"), Date1 = c(365L,
365L, 365L, 1L, 1L))
df$hrxhr <- mapply(\(from, to, by) length(seq.POSIXt(from, to, by)),
from = trunc(as.POSIXlt(df$Date), "years"),
to = as.POSIXlt(df$Date),
by="1 hour") + df$HOUR_OF_DAY - 1
df
#> YEAR MONTH DAY HOUR_OF_DAY Month_num Date Date1 hrxhr
#> 1 2023 Dec 31 22 12 2023-12-31 365 8758
#> 2 2023 Dec 31 23 12 2023-12-31 365 8759
#> 3 2023 Dec 31 24 12 2023-12-31 365 8760
#> 4 2024 Jan 1 1 12 2024-01-01 1 1
#> 5 2023 Jan 1 2 12 2024-01-01 1 2
If you are open to a tidyverse / lubridate solution, you could use
library(dplyr)
library(lubridate)
df1 %>%
  mutate(
    begin = ymd_hms(paste(year(Date), "-01-01 00:00:00")),
    target = ymd_hms(paste(Date, HOUR_OF_DAY, ":00:00")),
    hrxhr = time_length(interval(begin, target), "hours")) %>%
  select(-begin, -target)
This returns
# A tibble: 5 × 7
YEAR MONTH DAY HOUR_OF_DAY Month_num Date hrxhr
<dbl> <chr> <chr> <dbl> <dbl> <date> <dbl>
1 2023 Dec 31 22 12 2023-12-31 8758
2 2023 Dec 31 23 12 2023-12-31 8759
3 2023 Dec 31 24 12 2023-12-31 8760
4 2024 Jan 01 1 12 2024-01-01 1
5 2024 Jan 01 2 12 2024-01-01 2
Data
df1 <- structure(list(YEAR = c(2023, 2023, 2023, 2024, 2024), MONTH = c("Dec",
"Dec", "Dec", "Jan", "Jan"), DAY = c("31", "31", "31", "01",
"01"), HOUR_OF_DAY = c(22, 23, 24, 1, 2), Month_num = c(12, 12,
12, 12, 12), Date = structure(c(19722, 19722, 19722, 19723, 19723
), class = "Date")), row.names = c(NA, -5L), class = "data.frame")
I have columns like these:
year period period2 Sales
2015 201504 April 2015 10000
2015 201505 May 2015 11000
2018 201803 March 2018 12000
I want to convert the period or period2 column to a date type, for later use in time series analysis.
Data:
tibble::tibble(
year = c(2015,2015,2018),
period = c(201504, 201505,201803 ),
period2 = c("April 2015", "May 2015", "March 2018"),
Sales = c(10000,11000,12000)
)
Using the lubridate package, you can transform them into date variables:
df <- tibble::tibble(
year = c(2015,2015,2018),
period = c(201504, 201505,201803 ),
period2 = c("April 2015", "May 2015", "March 2018"),
Sales = c(10000,11000,12000)
)
library(dplyr)
df %>%
  mutate(period = lubridate::ym(period),
         period2 = lubridate::my(period2))
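A base R alternative, sketched under the assumption that each period should map to the first day of its month (a year-month alone is not a complete date, so a day has to be supplied). The Sys.setlocale() call is an assumption made so that %B matches English month names.

```r
# Base R sketch: turn year-month values into the first day of that month.
period  <- c(201504, 201505, 201803)
period2 <- c("April 2015", "May 2015", "March 2018")

# Append a day so the string is a complete date, then parse it
d1 <- as.Date(paste0(period, "01"), format = "%Y%m%d")

# %B matches full month names in the current locale; the "C" locale
# (English month names) is set here as an assumption
invisible(Sys.setlocale("LC_TIME", "C"))
d2 <- as.Date(paste("01", period2), format = "%d %B %Y")

d1
d2
```

Both columns end up as the same Date vector, so either one can serve as the time index.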
Suppose I'm given the following input dataframe:
ID Date
1 20th May, 2020
1 21st May, 2020
1 28th May, 2020
1 29th May, 2020
2 20th May, 2020
2 1st June, 2020
I want to generate the following dataframe:
ID Date Delta
1 20th May, 2020 0
1 21st May, 2020 1
1 28th May, 2020 7
1 29th May, 2020 1
2 20th May, 2020 0
2 1st June, 2020 12
The idea is to first group by ID. Then, within each ID, I iterate over the dates and subtract the previous date from the current date, except for the first date, whose delta is 0.
I have been using dplyr, but I am uncertain how to achieve this for groups and how to do it iteratively.
My goal is to filter the deltas and retain 0 and anything larger than 7, but it must follow the 'preceding date' logic within a specific ID.
library(dplyr)
dat %>%
  mutate(Date = as.Date(gsub("[a-z]{2} ", " ", Date), format = "%d %b, %Y")) %>%
  group_by(ID) %>%
  mutate(Delta = c(0, diff(Date))) %>%
  ungroup()
# # A tibble: 6 x 3
# ID Date Delta
# <dbl> <date> <dbl>
# 1 1 2020-05-20 0
# 2 1 2020-05-21 1
# 3 1 2020-05-28 7
# 4 1 2020-05-29 1
# 5 2 2020-05-20 0
# 6 2 2020-06-01 12
Steps:
1. remove the ordinal from numbers, so that we can
2. convert them to proper Date-class objects, then
3. diff them within ID groups.
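The first two steps can be tried in isolation. The sketch below is a slightly more conservative variant of the gsub() call above: it only strips a suffix that directly follows a digit. The "C" locale is set as an assumption so that %B matches English month names.

```r
# Strip ordinal suffixes ("st", "nd", "rd", "th") that directly follow digits.
x <- c("20th May, 2020", "21st May, 2020", "1st June, 2020")
cleaned <- sub("([0-9]+)(st|nd|rd|th)", "\\1", x)
cleaned

# Parse the cleaned strings; English month names are assumed
invisible(Sys.setlocale("LC_TIME", "C"))
dates <- as.Date(cleaned, format = "%d %B, %Y")
dates
```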
Data
dat <- structure(list(ID = c(1, 1, 1, 1, 2, 2), Date = c(" 20th May, 2020", " 21st May, 2020", " 28th May, 2020", " 29th May, 2020", " 20th May, 2020", " 1st June, 2020")), class = "data.frame", row.names = c(NA, -6L))
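The question's final goal, keeping only deltas of 0 or larger than 7, is one more filtering step. A base R sketch, assuming the dates are already parsed and sorted within each ID; the object name kept is illustrative:

```r
# Per-ID day differences, then keep rows with Delta == 0 or Delta > 7.
dat <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2),
  Date = as.Date(c("2020-05-20", "2020-05-21", "2020-05-28",
                   "2020-05-29", "2020-05-20", "2020-06-01"))
)

# ave() applies the function within each ID and returns values in row order;
# the first row of each ID gets 0 because it has no preceding date
dat$Delta <- ave(as.numeric(dat$Date), dat$ID,
                 FUN = function(d) c(0, diff(d)))

# "Larger than 7" is taken literally, so a delta of exactly 7 is dropped
kept <- dat[dat$Delta == 0 | dat$Delta > 7, ]
kept
```

With dplyr, the equivalent step after computing Delta would be filter(Delta == 0 | Delta > 7).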
Similar logic to @r2evans's answer, but with different functions.
library(dplyr)
library(lubridate)
df %>%
  mutate(Date = dmy(Date)) %>%
  group_by(ID) %>%
  mutate(Delta = as.integer(Date - lag(Date, default = first(Date)))) %>%
  ungroup()
# ID Date Delta
# <int> <date> <int>
#1 1 2020-05-20 0
#2 1 2020-05-21 1
#3 1 2020-05-28 7
#4 1 2020-05-29 1
#5 2 2020-05-20 0
#6 2 2020-06-01 12
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L), Date = c("20th May, 2020",
"21st May, 2020", "28th May, 2020", "29th May, 2020", "20th May, 2020",
"1st June, 2020")), class = "data.frame", row.names = c(NA, -6L))
I have daily scores and their corresponding dates, as shown below, and I am struggling to convert them to quarterly values. However, the dates are not in chronological order, and I am confused about how to deal with both issues.
See the sample below.
data
dates score
1 July 1, 2019 Monday 8
2 October 25, 2015 Sunday -3
3 June 17, 2020 Wednesday -5
4 January 17, 2018 Wednesday -1
5 April 15, 2019 Monday 6
6 October 30, 2019 Wednesday 10
7 March 6, 2017 Monday -2
8 November 19, 2018 Monday 3
9 June 11, 2020 Thursday 5
10 October 11, 2017 Wednesday -13
11 December 3, 2017 Sunday -8
12 November 14, 2018 Wednesday -6
13 August 22, 2017 Tuesday 8
14 December 13, 2017 Wednesday 5
15 January 22, 2016 Friday 5
dates <- sapply(date, function(x)
trimws(grep(paste(month.name, collapse = '|'), x, value = TRUE)));
sort(as.Date(dates,'%B %d, %Y %A'))
This is a job for lubridate. You can parse your date column with lubridate::parse_date_time() and extract the quarter they fall in with lubridate::quarter():
library("tibble")
library("dplyr")
library("lubridate")
tbl <- tribble(~date, ~score,
"July 1, 2019 Monday", 8,
"October 25, 2015 Sunday", -3,
"June 17, 2020 Wednesday", -5,
"January 17, 2018 Wednesday", -1,
"April 15, 2019 Monday", 6,
"October 30, 2019 Wednesday", 10,
"March 6, 2017 Monday", -2,
"November 19, 2018 Monday", 3,
"June 11, 2020 Thursday", 5,
"October 11, 2017 Wednesday", -13,
"December 3, 2017 Sunday", -8,
"November 14, 2018 Wednesday", -6,
"August 22, 2017 Tuesday", 8,
"December 13, 2017 Wednesday", 5,
"January 22, 2016 Friday", 5)
tbl %>%
  mutate(date = parse_date_time(date, "B d, Y")) %>%
  mutate(quarter = quarter(date, with_year = TRUE))
#> # A tibble: 15 x 3
#> date score quarter
#> <dttm> <dbl> <dbl>
#> 1 2019-07-01 00:00:00 8 2019.3
#> 2 2015-10-25 00:00:00 -3 2015.4
#> 3 2020-06-17 00:00:00 -5 2020.2
#> 4 2018-01-17 00:00:00 -1 2018.1
#> 5 2019-04-15 00:00:00 6 2019.2
#> 6 2019-10-30 00:00:00 10 2019.4
#> 7 2017-03-06 00:00:00 -2 2017.1
#> 8 2018-11-19 00:00:00 3 2018.4
#> 9 2020-06-11 00:00:00 5 2020.2
#> 10 2017-10-11 00:00:00 -13 2017.4
#> 11 2017-12-03 00:00:00 -8 2017.4
#> 12 2018-11-14 00:00:00 -6 2018.4
#> 13 2017-08-22 00:00:00 8 2017.3
#> 14 2017-12-13 00:00:00 5 2017.4
#> 15 2016-01-22 00:00:00 5 2016.1
If you are trying to change the dates column to the Date class, you can use as.Date.
df$new_date <- as.Date(trimws(df$dates), '%B %d, %Y')
Or this should also work with lubridate's mdy:
df$new_date <- lubridate::mdy(df$dates)
Once the data has been converted to date values per Ronak Shah's answer, we can use lubridate::quarter() to generate year and quarter values.
textData <- " dates|score
July 1, 2019 Monday| 8
October 25, 2015 Sunday| -3
June 17, 2020 Wednesday| -5
January 17, 2018 Wednesday| -1
April 15, 2019 Monday| 6
October 30, 2019 Wednesday| 10
March 6, 2017 Monday| -2
November 19, 2018 Monday| 3
June 11, 2020 Thursday| 5
October 11, 2017 Wednesday| -13
December 3, 2017 Sunday| -8
November 14, 2018 Wednesday| -6
August 22, 2017 Tuesday| 8
December 13, 2017 Wednesday| 5
January 22, 2016 Friday| 5
"
df <- read.csv(text=textData,
header=TRUE,
sep="|")
library(lubridate)
df$dt_quarter <- quarter(mdy(df$dates),with_year = TRUE,
fiscal_start = 1)
head(df)
We include with_year = TRUE and fiscal_start = 1 arguments to illustrate that one can change the output to include / exclude the year information, as well as change the start month for the year from the default of 1.
...and the output:
> head(df)
dates score dt_quarter
1 July 1, 2019 Monday 8 2019.3
2 October 25, 2015 Sunday -3 2015.4
3 June 17, 2020 Wednesday -5 2020.2
4 January 17, 2018 Wednesday -1 2018.1
5 April 15, 2019 Monday 6 2019.2
6 October 30, 2019 Wednesday 10 2019.4
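To make the effect of a different fiscal start month concrete, here is a base R sketch of the underlying arithmetic. It is not lubridate's implementation; the helper name fiscal_quarter is hypothetical and only illustrates how shifting the start month renumbers the quarters.

```r
# Quarter number (1..4) relative to a fiscal year starting in `fiscal_start`.
# Hypothetical helper for illustration, not part of lubridate.
fiscal_quarter <- function(date, fiscal_start = 1L) {
  month <- as.integer(format(as.Date(date), "%m"))
  # Months elapsed since the fiscal year began (0..11), grouped in threes
  ((month - fiscal_start) %% 12L) %/% 3L + 1L
}

fiscal_quarter("2019-07-01")                     # 3
fiscal_quarter("2019-07-01", fiscal_start = 7L)  # 1
fiscal_quarter("2019-01-15", fiscal_start = 7L)  # 3
```

With the default start of January, July falls in quarter 3; with a fiscal year starting in July, the same date is the first day of quarter 1.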
The yearqtr class represents a year and quarter as the year plus 0 for Q1, 0.25 for Q2, 0.5 for Q3 and 0.75 for Q4. If date is defined as a yearqtr object as below then as.integer(date) is the year and cycle(date) is the quarter: 1, 2, 3 or 4. Note that junk at the end of the date field is ignored by as.yearqtr so we only need to specify month, day and year percent codes.
If you want a Date object instead of a yearqtr object then uncomment one of the commented out lines.
data is defined reproducibly in the Note at the end. (In the future please use dput to display your input data to prevent ambiguity as discussed in the information at the top of the r tag page.)
library(zoo)
date <- as.yearqtr(data$dates, "%B %d, %Y")
# uncomment one of these lines if you want a Date object instead of yearqtr object
# date <- as.Date(date) # first day of quarter
# date <- as.Date(date, frac = 1) # last day of quarter
data.frame(date, score = data$score)[order(date), ]
giving the following sorted data frame assuming that we do not uncomment any of the commented out lines above.
date score
2 2015 Q4 -3
15 2016 Q1 5
7 2017 Q1 -2
...snip...
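The numeric encoding described above (year plus 0, 0.25, 0.5 or 0.75) can be illustrated without zoo. The two helper functions below are hypothetical and exist only to show the arithmetic; in practice as.yearqtr, as.integer and cycle do this work.

```r
# Illustration of the year + (quarter - 1) / 4 encoding; helper names are
# hypothetical and not part of zoo.
encode_yearqtr <- function(year, quarter) year + (quarter - 1) / 4

decode_yearqtr <- function(x) {
  year <- floor(x)
  # The fractional part is 0, 0.25, 0.5 or 0.75, i.e. (quarter - 1) / 4
  list(year = year, quarter = round((x - year) * 4) + 1)
}

x <- encode_yearqtr(2017, 4)
x                  # 2017.75
decode_yearqtr(x)  # year 2017, quarter 4
```

Because the encoding is a plain number that increases with time, sorting or ordering by it gives chronological order, which is what the order(date) call above relies on.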
Time series
If this is supposed to be a time series with a single aggregated score per quarter then we can get a zoo series like this where data is the original data defined in the Note below.
library(zoo)
to_ym <- function(x) as.yearqtr(x, "%B %d, %Y")
z <- read.zoo(data, FUN = to_ym, aggregate = "mean")
z
## 2015 Q4 2016 Q1 2017 Q1 2017 Q3 2017 Q4 2018 Q1 2018 Q4 2019 Q2
## -3.000000 5.000000 -2.000000 8.000000 -5.333333 -1.000000 -1.500000 6.000000
## 2019 Q3 2019 Q4 2020 Q2
## 8.000000 10.000000 0.000000
or as a ts object like this:
as.ts(z)
## Qtr1 Qtr2 Qtr3 Qtr4
## 2015 -3.000000
## 2016 5.000000 NA NA NA
## 2017 -2.000000 NA 8.000000 -5.333333
## 2018 -1.000000 NA NA -1.500000
## 2019 NA 6.000000 8.000000 10.000000
## 2020 NA 0.000000
Note
The input data in reproducible form:
data <- structure(list(dates = c("July 1, 2019 Monday", "October 25, 2015 Sunday",
"June 17, 2020 Wednesday", "January 17, 2018 Wednesday", "April 15, 2019 Monday",
"October 30, 2019 Wednesday", "March 6, 2017 Monday", "November 19, 2018 Monday",
"June 11, 2020 Thursday", "October 11, 2017 Wednesday", "December 3, 2017 Sunday",
"November 14, 2018 Wednesday", "August 22, 2017 Tuesday", "December 13, 2017 Wednesday",
"January 22, 2016 Friday"), score = c(8L, -3L, -5L, -1L, 6L,
10L, -2L, 3L, 5L, -13L, -8L, -6L, 8L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-15L))
Update
I have updated this answer several times, so be sure you are looking at the latest version.
I have a data set that contains a simple column consisting of dates, like this:
Dates
1 2012/04/10
2 2012/03/30
3 2012/03/24
4 2012/03/25
5 2012/04/10
6 2012/04/14
7 2012/04/21
My desired output is this:
Dates DateName
1 2012/04/10 April 2012
2 2015/03/30 March 2015
3 2011/03/24 March 2011
4 2016/12/25 December 2016
5 2014/06/10 June 2014
6 2014/05/14 May 2014
7 2018/07/21 July 2018
To do this I used the following code:
dt$Dates <- as.Date(dt$Dates)
dt$DateName <- format(dt$Dates,"%B %Y")
Whilst this works fine, my new column comes out a character class. I wish for this to come out as a date class instead. This is because I cannot sort this column by calendar date. Rather, it sorts alphabetically.
Is there a way to class or re-class my new date format as some sort of date or calendar class?
(I'm not necessarily looking for a base-R solution).
(If possible, I would also highly prefer to keep my new format as is).
I have tried the following lines of code and more, but these only return errors.
dt$DateName <- format.Date(dt$Dates,"%B %Y")
dt$DateName <- format.POSIXlt(dt$Dates,"%B %Y")
dt$DateName <- format.difftime(dt$Dates,"%B %Y")
dt$DateName <- as.Date(dt$Dates, format ="%B %Y")
You can convert the dates to the yearmon class:
dt$month_year <- zoo::as.yearmon(dt$Dates, "%Y/%m/%d")
dt
# Dates month_year
#1 2012/04/10 Apr 2012
#2 2012/03/30 Mar 2012
#3 2012/03/24 Mar 2012
#4 2012/03/25 Mar 2012
#5 2012/04/10 Apr 2012
#6 2012/04/14 Apr 2012
#7 2012/04/21 Apr 2012
class(dt$month_year)
#[1] "yearmon"
You can then sort them:
dt[order(dt$month_year), ]
# Dates month_year
#2 2012/03/30 Mar 2012
#3 2012/03/24 Mar 2012
#4 2012/03/25 Mar 2012
#1 2012/04/10 Apr 2012
#5 2012/04/10 Apr 2012
#6 2012/04/14 Apr 2012
#7 2012/04/21 Apr 2012
data
dt <- structure(list(Dates = structure(c(4L, 3L, 1L, 2L, 4L, 5L, 6L
), .Label = c("2012/03/24", "2012/03/25", "2012/03/30", "2012/04/10",
"2012/04/14", "2012/04/21"), class = "factor")), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7"))
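If the full "April 2012" label must be kept exactly as is, one option is to leave it as text and either order by the underlying Date column or store the label as a factor whose levels are in calendar order. This is a base R sketch with a shortened, illustrative data set; the "C" locale is set as an assumption so that %B gives English month names.

```r
# Keep the "%B %Y" label but make it sort by calendar date.
invisible(Sys.setlocale("LC_TIME", "C"))  # English month names assumed

dt <- data.frame(
  Dates = as.Date(c("2012/04/10", "2012/03/30", "2012/03/24"),
                  format = "%Y/%m/%d")
)
dt$DateName <- format(dt$Dates, "%B %Y")

# Plain character sorting is alphabetical: April comes before March
sort(dt$DateName)

# Option 1: order the rows by the real Date column
dt_sorted <- dt[order(dt$Dates), ]

# Option 2: a factor whose levels follow calendar order sorts correctly
dt$DateName <- factor(dt$DateName,
                      levels = unique(dt$DateName[order(dt$Dates)]))
sort(dt$DateName)
```

The factor is still not a date class, so date arithmetic will not work on it, but sorting, grouping and plotting will follow calendar order.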