Convert name of day to date format in R

I want to convert this data to a date format and create a new column holding the month-year value:
month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
day : chr [1:41188] "mon" "mon" "mon" "mon" ...
year : num [1:41188] 2008 2008 2008 2008 2008 ...
Here is a dput() of the first rows:
dput(head(df))
df <-
structure(list(month = structure(c(7L, 7L, 7L, 7L, 7L, 7L),
.Label = c("apr", "aug", "dec", "jul", "jun", "mar", "may",
"nov", "oct", "sep"), class = "factor"), day = c("mon", "mon",
"mon", "mon", "mon", "mon"), year = c(2008, 2008, 2008, 2008,
2008, 2008)), class = "data.frame", row.names = c(NA, -6L))
The main problem is the month and day columns, because month is a factor and day is a character vector.
I tried the following:
as.integer(factor(df$month, levels=month.abb))
And this:
match(df$month, month.abb)
I also tried this:
df$date<-paste(as.character(df$month), df$year)
This worked and returns:
$ date : chr [1:41188] "may 2008" "may 2008" "may 2008" "may 2008"
How can I convert this to a date format?

I'll arbitrarily pick the "first" day-of-month for each weekday that you've listed. To make it interesting, I'll change the weekdays so that we have some variability in the data.
df <-
structure(list(month = structure(c(7L, 7L, 7L, 7L, 7L, 7L),
.Label = c("apr", "aug", "dec", "jul", "jun", "mar", "may",
"nov", "oct", "sep"), class = "factor"), day = c("mon", "tue",
"wed", "fri", "sat", "sun"), year = c(2008, 2008, 2008, 2008,
2008, 2008)), class = "data.frame", row.names = c(NA, -6L))
df
# month day year
# 1 may mon 2008
# 2 may tue 2008
# 3 may wed 2008
# 4 may fri 2008
# 5 may sat 2008
# 6 may sun 2008
From here, we need to determine what the 1st day of each month is, and then find the first day-of-week that is at or after that day.
firstdow <- as.POSIXlt(paste(df$year, df$month, "01", sep = "-"), format = "%Y-%b-%d")$wday
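# POSIXlt's $wday runs 0-6 with Sunday = 0, so Monday = 1 lines up with the index below ("sun" = 7 is 0 modulo 7)
datadow <- match(df$day, c("mon", "tue", "wed", "thu", "fri", "sat", "sun"))
# days from the 1st of the month to the first matching weekday, plus 1 for the day-of-month
datadom <- (datadow - firstdow) %% 7 + 1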
df$date <- as.Date(paste(df$year, df$month, datadom, sep = "-"), format = "%Y-%b-%d")
df
# month day year date
# 1 may mon 2008 2008-05-05
# 2 may tue 2008 2008-05-06
# 3 may wed 2008 2008-05-07
# 4 may fri 2008 2008-05-02
# 5 may sat 2008 2008-05-03
# 6 may sun 2008 2008-05-04
And proof that this came up with the correct day-of-month to get the first day-of-week:
format(df$date, format = "%a")
# [1] "Mon" "Tue" "Wed" "Fri" "Sat" "Sun"
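The same calculation can be sketched as a small helper that works on numeric months, so it does not rely on the locale's month abbreviations when parsing. This is only an illustration, not the answerer's code: the name `first_weekday_date` is made up, and it assumes English three-letter month and weekday abbreviations as in the question.

```r
# Hypothetical helper: first date in a given month/year that falls on a
# given weekday. Months and weekdays are English three-letter abbreviations.
first_weekday_date <- function(year, month, day) {
  # Order matches POSIXlt's $wday, which runs 0..6 starting at Sunday
  dows <- c("sun", "mon", "tue", "wed", "thu", "fri", "sat")
  mon_num <- match(tolower(as.character(month)), tolower(month.abb))
  first <- as.Date(sprintf("%d-%02d-01", as.integer(year), mon_num))
  first_dow <- as.POSIXlt(first)$wday
  want_dow <- match(tolower(day), dows) - 1L
  # Offset in days from the 1st to the first matching weekday (0..6)
  first + (want_dow - first_dow) %% 7
}

first_weekday_date(2008, "may", c("mon", "thu", "sun"))
first_weekday_date(2008, "jun", "mon")
```

May 2008 starts on a Thursday and June 2008 on a Sunday, so the two calls exercise both a zero offset and a wrap-around.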

We could do two things:
1. Since month.abb is a built-in constant, we can use it to get the numeric month.
2. Use as.yearmon from the zoo package to get the month and year.
library(dplyr)
library(zoo)
df %>%
mutate(month = match(month, tolower(month.abb))) %>%
mutate(new_date = as.yearmon(paste(year, month), "%Y %m"))
Output (month names print in the session's locale; here "Mai" is May):
month day year new_date
1 5 mon 2008 Mai 2008
2 5 mon 2008 Mai 2008
3 5 mon 2008 Mai 2008
4 5 mon 2008 Mai 2008
5 5 mon 2008 Mai 2008
6 5 mon 2008 Mai 2008
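If a true Date column is needed without extra packages, one possible base R sketch is shown below. It assumes each row should map to the first day of its month (an arbitrary choice, since the data has no day-of-month), and the column names `date` and `month_year` are picked here for illustration.

```r
# Small stand-in for the question's data (month as a lowercase factor)
df <- data.frame(month = factor(c("may", "jun")), year = c(2008, 2008))

# Look up the month number by name; month.abb is capitalised, so lowercase it
mon_num <- match(as.character(df$month), tolower(month.abb))

# Assumption: represent each month-year by its first day
df$date <- as.Date(sprintf("%d-%02d-01", as.integer(df$year), mon_num))

# A month-year label derived from the Date column
df$month_year <- format(df$date, "%Y-%m")
df
```

Keeping the Date column alongside the label means the data can still be sorted and differenced as dates.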

Related

How to create new column based off information in other columns in R

I have a large dataset that spans over 20 years. I have a column for the date and another column for the hour ending (HE). I'm trying to add a new column to provide the hour by hour (hrxhr) information in a given year (so running total). So date: Jan 1, 2023, HE: 1 should be hrxhr: 1 and Dec 31, 2023, HE: 24, should be hrxhr:8760 (8784 on leap years).
It should look like this:
YEAR  MONTH  DAY  HOUR OF DAY  Month_num  Date        Date1  NEW COLUMN hrxhr
2023  Dec    31   22           12         2023-12-31  365    8758
2023  Dec    31   23           12         2023-12-31  365    8759
2023  Dec    31   24           12         2023-12-31  365    8760
2024  Jan    01   1            01         2024-01-01  1      1
2024  Jan    01   2            01         2024-01-01  1      2
At first I thought I could get the Julian date and then multiply that by the HE, but that is incorrect: Jan 2, 2023, HE: 1 would then equal 2, whereas the hrxhr/running total should equal 25.
In base R:
df <- data.frame(
  YEAR = c(2023L, 2023L, 2023L, 2024L, 2024L),
  MONTH = c("Dec", "Dec", "Dec", "Jan", "Jan"), DAY = c(31L, 31L, 31L, 1L, 1L),
  HOUR_OF_DAY = c(22L, 23L, 24L, 1L, 2L), Month_num = c(12L,
  12L, 12L, 12L, 12L), Date = c("2023-12-31", "2023-12-31",
  "2023-12-31", "2024-01-01", "2024-01-01"), Date1 = c(365L,
  365L, 365L, 1L, 1L))
# count the hourly steps from the start of the year to the date, then add the hour ending
df$hrxhr <- mapply(\(from, to, by) length(seq.POSIXt(from, to, by)),
  from = trunc(as.POSIXlt(df$Date), "years"),
  to = as.POSIXlt(df$Date),
  by="1 hour") + df$HOUR_OF_DAY - 1
df
#> YEAR MONTH DAY HOUR_OF_DAY Month_num Date Date1 hrxhr
#> 1 2023 Dec 31 22 12 2023-12-31 365 8758
#> 2 2023 Dec 31 23 12 2023-12-31 365 8759
#> 3 2023 Dec 31 24 12 2023-12-31 365 8760
#> 4 2024 Jan 1 1 12 2024-01-01 1 1
#> 5 2024 Jan 1 2 12 2024-01-01 1 2
If you are open to a tidyverse / lubridate solution, you could use
library(dplyr)
library(lubridate)
df1 %>%
mutate(
begin = ymd_hms(paste(year(Date), "-01-01 00:00:00")),
target = ymd_hms(paste(Date, HOUR_OF_DAY, ":00:00")),
hrxhr = time_length(interval(begin, target), "hours")) %>%
select(-begin, -target)
This returns
# A tibble: 5 × 7
YEAR MONTH DAY HOUR_OF_DAY Month_num Date hrxhr
<dbl> <chr> <chr> <dbl> <dbl> <date> <dbl>
1 2023 Dec 31 22 12 2023-12-31 8758
2 2023 Dec 31 23 12 2023-12-31 8759
3 2023 Dec 31 24 12 2023-12-31 8760
4 2024 Jan 01 1 12 2024-01-01 1
5 2024 Jan 01 2 12 2024-01-01 2
Data
df1 <- structure(list(YEAR = c(2023, 2023, 2023, 2024, 2024), MONTH = c("Dec",
"Dec", "Dec", "Jan", "Jan"), DAY = c("31", "31", "31", "01",
"01"), HOUR_OF_DAY = c(22, 23, 24, 1, 2), Month_num = c(12, 12,
12, 12, 12), Date = structure(c(19722, 19722, 19722, 19723, 19723
), class = "Date")), row.names = c(NA, -5L), class = "data.frame")
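The day-of-year idea from the question can also be made to work with plain arithmetic: every full day before the given date contributes 24 hours, and the hour ending is added on top. Below is a minimal base R sketch, assuming hours ending run from 1 to 24 and that every day counts as exactly 24 hours (no daylight-saving adjustment). The function name `hour_of_year` is made up for illustration.

```r
# Hypothetical helper: running hour within the year from a date and an
# hour-ending value (1..24). Assumes every day has exactly 24 hours.
hour_of_year <- function(date, hour_ending) {
  d <- as.Date(date)
  # POSIXlt's $yday is 0-based: 0 on 1 January
  full_days_before <- as.POSIXlt(d)$yday
  full_days_before * 24L + as.integer(hour_ending)
}

hour_of_year(c("2023-01-02", "2023-12-31", "2024-01-01", "2024-12-31"),
             c(1, 24, 1, 24))
```

Because the calculation only uses the calendar day, leap years fall out naturally: 31 December of a leap year has 365 full days before it.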

How to add an increasing index based on multiple columns in R

I have a data frame that contains the columns "hour", "day","month" and "count".
library(tidyverse)
set.seed(0)
df <- expand_grid(expand_grid(
hour = seq(0:23), # note: seq(0:23) returns 1:24, not 0:23
day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun")) %>%
mutate(count = sample(0:100, n(), replace = TRUE))
head(df)
# A tibble: 6 × 4
hour day month count
<int> <chr> <chr> <int>
1 1 Mon Jan 13
2 1 Mon Feb 67
3 1 Mon Mar 38
4 1 Mon Apr 0
5 1 Mon May 33
6 1 Mon Jun 86
I would like to add a new column named "id" that contains an increasing index which can be used to sort the data in chronological order. The solution I found is not particularly concise and requires me to set factor levels before calling arrange(). Is there another way to solve this issue that capitalises on the fact that I am working with (unformatted) dates?
This is my solution with arrange():
df2 <- df %>%
mutate(day = factor(day, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"))) %>%
arrange(month, day, hour) %>%
mutate(id = row_number())
head(df2)
# A tibble: 6 × 5
hour day month count id
<int> <fct> <fct> <int> <int>
1 1 Mon Jan 13 1
2 2 Mon Jan 43 2
3 3 Mon Jan 82 3
4 4 Mon Jan 66 4
5 5 Mon Jan 49 5
6 6 Mon Jan 79 6
Any suggestions are much appreciated. Thank you!
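One possible way to avoid setting factor levels is to compute the index arithmetically from each column's position in its natural order. The sketch below is an assumption-laden illustration rather than a definitive answer: it assumes the grid is complete (every hour for every day for every month), that hours run 1 to 24 as in the sample output, and that the desired order is month, then day, then hour, as in the arrange() call above.

```r
days   <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
months <- month.abb[1:6]

# Stand-in for the question's data, shuffled to show that the index does
# not depend on the row order
df <- expand.grid(hour = 1:24, day = days, month = months,
                  stringsAsFactors = FALSE)
set.seed(1)
df <- df[sample(nrow(df)), ]

# Position of each month and day in its natural order, combined into one
# running index: 24 hours per day, 7 days per month block
df$id <- (match(df$month, months) - 1L) * 7L * 24L +
  (match(df$day, days) - 1L) * 24L +
  df$hour

head(df[order(df$id), ])
```

Sorting by `id` then gives the same order as arranging by month, day, and hour with explicit factor levels.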

Add a column in R with season time

I have a dataset with Month, Year and Region columns, and I want to add a column with the season, like this:
Month      Year  Region  Season
January    2019  NY      Winter
February   2019  NY      Winter
March      2019  NY      Spring
September  2019  NY      Fall
How can I write R code that automatically adds a column where every December, January and February is Winter, every March, April and May is Spring, and so on?
Thanks a lot for helping. My attempt so far:
season <- c(data, Spring = "March", Spring = "April")
We can create a key-value dataset and do a join:
library(dplyr)
keydat <- tibble(Month = month.name,
Season = rep(c("Winter", "Spring", "Summer", "Fall", "Winter"),
c(2, 3, 3, 3, 1)))
df1 <- left_join(df1, keydat)
Output:
df1
Month Year Region Season
1 January 2019 NY Winter
2 February 2019 NY Winter
3 March 2019 NY Spring
4 September 2019 NY Fall
Data:
df1 <- structure(list(Month = c("January", "February", "March", "September"
), Year = c(2019L, 2019L, 2019L, 2019L), Region = c("NY", "NY",
"NY", "NY")), class = "data.frame", row.names = c(NA, -4L))
In base R you could do:
df1$Season <- c('Winter', 'Spring', 'Summer', 'Fall')[
1 + (match(df1$Month, month.name) %/% 3) %% 4]
Which results in:
df1
#> Month Year Region Season
#> 1 January 2019 NY Winter
#> 2 February 2019 NY Winter
#> 3 March 2019 NY Spring
#> 4 September 2019 NY Fall
(Using akrun's reproducible data)
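A third option, sketched here under the same assumption as the answers above (December to February is Winter, March to May is Spring, and so on), is a named lookup vector indexed by month name. The names `season_lookup`, `months` and `seasons` are chosen for this example only.

```r
# Lookup vector: names are the full month names, values are the seasons
# (assumes the Northern Hemisphere grouping described in the question)
season_lookup <- setNames(
  rep(c("Winter", "Spring", "Summer", "Fall", "Winter"), c(2, 3, 3, 3, 1)),
  month.name
)

months  <- c("January", "February", "March", "September", "December")
# Index by name, then drop the names so only the seasons remain
seasons <- unname(season_lookup[months])
seasons
```

With a data frame, the same lookup could be applied to the month column in one step, and a month name that is not in `month.name` would come back as NA rather than a wrong season.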

Combining abbreviated months and year into one variable in R

I have a time series data with a column for a month and a column for a year. The months are JAN, FEB, etc.
I'm trying to combine them into one month year variable in order to run time series analysis on it. I'm very new to R and could use any guidance.
Perhaps something like this?
library(dplyr)
c("JAN", "FEB", "MAR", "APR",
"MAY", "JUN", "JUL", "AUG",
"SEP", "OCT", "NOV", "DEC") %>%
rep(., times = 3) %>%
as.factor() -> months
c("2018", "2019", "2020") %>%
rep(., each = 12) %>%
as.factor() -> years
df1 <- cbind.data.frame(months, years)
paste(df1$months, df1$years, sep = ".") %>%
as.factor() -> merged.years.months
Start with your month/year df.
library(tidyverse)
library(lubridate)
events <- tibble(month = c("JAN", "MAR", "FEB", "NOV", "AUG"),
year = c(2018, 2019, 2018, 2020, 2019))
Let's say that each of your time periods starts on the first of the month.
series <- events %>%
mutate(mo1 = dmy(paste(1, month, year)))
This is what you want
R > series
# A tibble: 5 x 3
month year mo1
<chr> <dbl> <date>
1 JAN 2018 2018-01-01
2 MAR 2019 2019-03-01
3 FEB 2018 2018-02-01
4 NOV 2020 2020-11-01
5 AUG 2019 2019-08-01
These are now dates; you can use them in other analyses.
Base R solution:
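events <- within(events, {
  # look up the month number by name; factor codes of the sorted names would be alphabetical ranks, not month numbers
  month_no <- match(month, toupper(month.abb))
  date <- as.Date(paste(year, sprintf("%02d", month_no), "01", sep = "-"), "%Y-%m-%d")
  # keep only the new date column
  rm(month_no, month, year)
})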

Sum by months of the year with decades of data in R

I have a dataframe with some monthly data for 2 decades:
year month value
1960 January 925
1960 February 903
1960 March 1006
...
1969 December 892
1970 January 990
1970 February 866
...
1979 December 120
I would like to create a dataframe where I sum up the totals, for each decade, by month, as follows:
year month value
decade_60s January 4012
decade_60s February 8678
decade_60s March 9317
...
decade_60s December 3995
decade_70s January 8005
decade_70s February 9112
...
decade_70s December 325
I have been looking at the aggregate function, but this doesn't appear to be the right option.
I looked instead at some careful subsetting using the which function but this quickly became too messy.
For this kind of problem, what would be the correct approach? Will I need to use apply at some point, and if so, how?
I feel the temptation to use a for loop growing but I don't think this would be the best way to improve my skills in R..
Thanks for the advice.
PS: The month value is an ordinal factor, if this matters.
aggregate() is one way to go in base R.
First define the decade:
yourdata$decade <- cut(yourdata$year, breaks=c(1960,1970,1980), labels=c(60,70),
include.lowest=TRUE, right=FALSE)
Then aggregate the data:
aggregate(value ~ decade + month, data=yourdata , sum)
Then order the result by decade and month to get the required output.
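The three steps can be put together as follows. This is a sketch on made-up toy data, not the asker's values. It derives the decade with integer division instead of cut(), and the `decade_60s` style label assumes years within a single century, as in the question.

```r
# Toy data (values are made up for illustration); month is an ordered-by-
# level factor, so sorting follows the calendar rather than the alphabet
dat <- data.frame(
  year  = c(1960, 1961, 1969, 1970, 1975),
  month = factor(c("January", "January", "December", "January", "January"),
                 levels = month.name),
  value = c(1, 2, 3, 4, 5)
)

# 1. Define the decade: 1960..1969 -> "decade_60s", 1970..1979 -> "decade_70s"
dat$decade <- paste0("decade_", (dat$year %/% 10 * 10) %% 100, "s")

# 2. Sum the values per decade and month
res <- aggregate(value ~ decade + month, data = dat, FUN = sum)

# 3. Order by decade, then by month in calendar order
res <- res[order(res$decade, res$month), ]
res
```

No loop or apply call is needed; aggregate() handles the grouping, and the factor levels of `month` control the final ordering.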
plyr's count + gsub are definitely your friends here:
library(plyr)
dat <- structure(list(year = c(1960L, 1960L, 1960L, 1969L, 1970L, 1970L, 1979L),
month = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 1L),
.Label = c("December", "February", "January", "March"),
class = "factor"),
value = c(925L, 903L, 1006L, 892L, 990L, 866L, 120L)),
.Names = c("year", "month", "value"),
class = "data.frame", row.names = c(NA, -7L))
dat$decade <- gsub("[0-9]$", "0", dat$year)
count(dat, .(decade, month), wt_var=.(value))
## decade month freq
## 1 1960 December 892
## 2 1960 February 903
## 3 1960 January 925
## 4 1960 March 1006
## 5 1970 December 120
## 6 1970 February 866
## 7 1970 January 990
