How to clean messy date formats in a dataframe using R - r

What is a quick way to clean a column with multiple date formats and obtain only the year?
Suppose in r there is a dataframe (df) as below, which has aDatecolumn of characters with different dates formats.
df <- data.frame(z= paste("Date",seq(1:10)), Date=c("2000-10-22", "9/21/2001", "2003", "2017/2018", "9/28/2010",
"9/27/2011","2019/2020", "2017-10/2018-12", "NA", "" ))
df:
z Date
1 Date 1 2000-10-22
2 Date 2 9/21/2001
3 Date 3 2003
4 Date 4 2017/2018
5 Date 5 9/28/2010
6 Date 6 9/27/2011
7 Date 7 2019/2020
8 Date 8 2017-10/2018-12
9 Date 9 NA
10 Date 10
Using r commands what is a quick way to extract out the years e.g. 2003, 2010 from the Date column? The first year is to be selected for cells with two years in a row.
So that the expected output would be like below:
z Date year
1 Date 1 2000-10-22 2000
2 Date 2 9/21/2001 2001
3 Date 3 2003 2003
4 Date 4 2007/2018 2017
5 Date 5 9/28/2010 2010
6 Date 6 9/27/2011 2011
7 Date 7 2007/2018 2019
8 Date 8 2017-10/2018-12 2017
9 Date 9 NA NA
10 Date 10

Use extract from tidyr. If there are two years it will use the first.
library(dplyr)
library(tidyr)
df %>% extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)
giving:
z Date Year
1 Date 1 2000-10-22 2000
2 Date 2 9/21/2001 2001
3 Date 3 2003 2003
4 Date 4 2017/2018 2017
5 Date 5 9/28/2010 2010
6 Date 6 9/27/2011 2011
7 Date 7 2019/2020 2019
8 Date 8 2017-10/2018-12 2017
9 Date 9 NA NA
10 Date 10 NA
If the second year is needed as well then:
df %>%
extract(Date, "Year2", "\\d{4}.*(\\d{4})", remove = FALSE, convert = TRUE) %>%
extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)
giving:
z Date Year Year2
1 Date 1 2000-10-22 2000 NA
2 Date 2 9/21/2001 2001 NA
3 Date 3 2003 2003 NA
4 Date 4 2017/2018 2017 2018
5 Date 5 9/28/2010 2010 NA
6 Date 6 9/27/2011 2011 NA
7 Date 7 2019/2020 2019 2020
8 Date 8 2017-10/2018-12 2017 2018
9 Date 9 NA NA NA
10 Date 10 NA NA

Related

R - working days in a given week of a month

I am trying to get the working days of a given week on a given month.
This is my current data, as example:
year month week
2020 6 1
2020 6 2
2020 6 3
2020 6 4
2020 6 5
2020 7 1
2020 7 2
2020 7 3
2020 7 4
2020 7 5
The expected result:
year month week work_days
2020 6 1 5
2020 6 2 5
2020 6 3 5
2020 6 4 5
2020 6 5 2
2020 7 1 3
2020 7 2 5
2020 7 3 5
2020 7 4 5
2020 7 5 5
So as you can see I have year, month and week of the month but I can't get my head around how to get the working days for the week in R for any week of any month.
Thanks in advance
cal <- data.table(dates = seq.Date(ymd(20200601), ymd(20200731), by = "day")) %>%
.[, .(dates,
year = year(dates),
month = month(dates),
week = isoweek(dates), # new week starts on monday
weekday = !(weekdays(dates) %in% c("Sunday", "Saturday"))
)]
cal[, .(work_days = sum(weekday)), by = .(year, month, week)
][, week := rowid(month)][]

Merging two dataframes creates new missing observations

I have two dataframes with the following matching keys: year, region and province. They each have a set of variables (in this illustrative example I use x1 for df1 and x2 for df2) and both variables have several missing values on their own.
df1 df2
year region province x2 ... xn year region province x2 ... xn
2019 1 5 NA 2019 1 5 NA
2019 2 4 NA. 2019 2 4 NA.
2019 2 4 NA. 2019 2 4 NA
2018 3 7 13. 2018 3 7 13
2018 3 7 15 2018 3 7 15
2018 3 7 17 2018 3 7 17
I want to merge both dataframes such that they end up like this:
year region province x1 x2
2019 1 5 3 NA
2019 2 4 27 NA
2019 2 4 15 NA
2018 3 7 12 13
2018 3 7 NA 15
2018 3 7 NA 17
2017 4 9 NA 12
2017 4 9 19 30
2017 4 9 20 10
However, when doing so using merged_df <- merge(df1, df2, by=c("year","region","province"), all.x=TRUE), R seems to create a lot of additional missing values on each of the variable columns (x1 and x2), which were not there before. What is happening here? I have tried sorting both using df1 %>% arrange(province,-year) and df2 %>% arrange(province,-year), which is enough to have matching order in both dataframes, only to find the same issue when running the merge command. I've tried a bunch of other stuff too, but nothing seems to work. R's output sort of looks like this:
year region province x1 x2
2019 1 5 NA NA
2019 2 4 NA NA
2019 2 4 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2017 4 9 15 NA
2017 4 9 19 30
2017 4 9 20 10
I have done this before; in fact, one of the dataframes is an already merged dataframe in which I did not encounter this issue.
Maybe it is not clear the concept of merge(). I include two examples with example data. I hope you understand and it helps you.
#Data
set.seed(123)
DF1 <- data.frame(year=rep(c(2017,2018,2019),3),
region=rep(c(1,2,3),3),
province=round(runif(9,1,5),0),
x1=rnorm(9,3,1.5))
DF2 <- data.frame(year=rep(c(2016,2018,2019),3),
region=rep(c(1,2,3),3),
province=round(runif(9,1,5),0),
x2=rnorm(9,3,1.5))
#Merge based only in df1
Merged1 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all.x=T)
Merged1
year region province x1 x2
1 2017 1 2 2.8365510 NA
2 2017 1 3 3.7557187 NA
3 2017 1 5 4.9208323 NA
4 2018 2 4 2.8241371 NA
5 2018 2 5 6.7925048 1.460993
6 2018 2 5 0.4090941 1.460993
7 2019 3 1 5.5352765 NA
8 2019 3 3 3.8236451 4.256681
9 2019 3 3 3.2746239 4.256681
#Merge including all elements despite no match between ids
Merged2 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all = T)
Merged2
year region province x1 x2
1 2016 1 3 NA 4.052034
2 2016 1 4 NA 2.062441
3 2016 1 5 NA 2.673038
4 2017 1 2 2.8365510 NA
5 2017 1 3 3.7557187 NA
6 2017 1 5 4.9208323 NA
7 2018 2 1 NA 0.469960
8 2018 2 2 NA 2.290813
9 2018 2 4 2.8241371 NA
10 2018 2 5 6.7925048 1.460993
11 2018 2 5 0.4090941 1.460993
12 2019 3 1 5.5352765 NA
13 2019 3 2 NA 1.398264
14 2019 3 3 3.8236451 4.256681
15 2019 3 3 3.2746239 4.256681
16 2019 3 4 NA 1.906663

How to row bind all cases of a particular weekday in a given year-month into an R dataset

I have data that includes a date and day of week.
I would like to identify all instances of a particular weekday that match the given year/month/weekday
in the original data.
For instance if the first record has the date "2010-07-05" which is a Thursday, I want to rowbind all Thursdays
that occur in July of 2010 to my original dataset.
While adding those new rows, I also want to fill in those new rows with values from the original data for all columns, except one. The exception is a variable which indicates whether or not that row
was in the original dataset or not.
Example data:
(1) alldays -- this data includes all dates and weekdays for the appropriate years.
(2) dt1 -- this is the example dataset that includes the date Adate, and day of week dow that will be used to identify the year/month/weekday and then look for all dates within that same month for the given weekday. For example - all Thursdays in July of 2017 will need to row bound to the original data.
library(data.table)
library(tidyverse)
library(lubridate)
alldays <- data.table (date = seq(as.Date("2010-01-01"),
as.Date("2011-12-31"), by="days"))
alldays <- alldays %>%
dplyr::mutate(year = lubridate::year(date),
month = lubridate::month(date),
day = lubridate::day(date),
dow = weekdays(date))
setDT(alldays)
head(alldays)
date year month day dow
1 2010-01-01 2010 1 1 Friday
2 2010-01-02 2010 1 2 Saturday
3 2010-01-03 2010 1 3 Sunday
4 2010-01-04 2010 1 4 Monday
5 2010-01-05 2010 1 5 Tuesday
6 2010-01-06 2010 1 6 Wednesday
Here is an example of the primary dataset
id <- seq(1:2)
admit <- rep(1,2)
zip <- c(54123, 54789)
Adate <- as.Date(c("2010-07-15","2011-03-14"))
year <- c(2010, 2011)
month <- c(7,3)
day <- c(15,14)
dow <- c("Thursday","Monday")
dt1 <- data.table(id, admit, zip, Adate, year, month, day, dow)
dt1
#> id admit zip Adate year month day dow
#> 1: 1 1 54123 2010-07-15 2010 7 15 Thursday
#> 2: 2 1 54789 2011-03-14 2011 3 14 Monday
The resulting dataset should be:
id admit zip Adate year month day dow
1: 1 0 54123 2010-07-01 2010 7 1 Thursday
2: 1 0 54123 2010-07-08 2010 7 8 Thursday
3: 1 1 54123 2010-07-15 2010 7 15 Thursday
4: 1 0 54123 2010-07-22 2010 7 22 Thursday
5: 1 0 54123 2010-07-29 2010 7 29 Thursday
6: 2 0 54789 2011-03-07 2011 3 7 Monday
7: 2 1 54789 2011-03-14 2011 3 14 Monday
8: 2 0 54789 2011-03-21 2011 3 21 Monday
9: 2 0 54789 2011-03-28 2011 3 28 Monday
So we can see that the first date dt1 2010-07-15 associated with id=1, which was a Thursday fell within a month with 4 additional Thursday in that month which were added to the dataset. The variable admit is the indicator of whether that row was in the original or subsequently added by virtue of the being matched.
I have tried first selecting the additional dates from alldays with matching weekdays but I am running into issues on how to rowbind those back into the original dataset while filling in the other values appropriately. Eventually I will be running this on a dataset with about 300,000 rows.
Here is an option:
alldays[dt1[, .(id, zip, admit=0L, year, month, dow)],
on=.(year, month, dow), allow.cartesian=TRUE][
dt1, on=.(id, date=Adate), admit := i.admit][]
output:
date year month day dow id zip admit
1: 2010-07-01 2010 7 1 Thursday 1 54123 0
2: 2010-07-08 2010 7 8 Thursday 1 54123 0
3: 2010-07-15 2010 7 15 Thursday 1 54123 1
4: 2010-07-22 2010 7 22 Thursday 1 54123 0
5: 2010-07-29 2010 7 29 Thursday 1 54123 0
6: 2011-03-07 2011 3 7 Monday 2 54789 0
7: 2011-03-14 2011 3 14 Monday 2 54789 1
8: 2011-03-21 2011 3 21 Monday 2 54789 0
9: 2011-03-28 2011 3 28 Monday 2 54789 0

How to merging R based on various column info and only one unique id

I m trying to merge 2 datasets:
dataset 1
id, month, year, postal
dataset 2
id, month, year, postal, Income, name, division
dataset 1
id year month postal
1 2010 9 j0r1h0
2 2010 8 j0r1h0
....
....
7 2007 6 j3x4p2
dataset 2
id, year, month, postal, name, division
1 2010 9 j0r1h0 john starting
2 2010 8 j0r1h0 lili retired
I want to keep all my columns and rows in dataset 1 and get the extra columns from dataset 2, like Income and division.
I get wrong result, duplicate field in month and year when I tried:
merge(a,b,by=c(postal,month,year,all.x=TRUE)
This is my expected result:
id year month postal name division
1 2010 9 j0r1h0 john starting
2 2010 8 j0r1h0 lili retired
3 2010 7 j1v3c4 verna starting
4 2009 1 j23c5 Greg medium
5 2007 1 j2j4d3 Greg medium
6 2008 2 j2p4s3 na na
7 2007 6 j3x4p2 na starting
And this is my result:
id year month postal name division
1 2010 9 j0r1h0 john starting
2 2010 8 j0r1h0 lili retired
3 2010 8 j0r1h0 na na
4 2010 7 na na na
5 2010 7 j1v3c4 verna starting
6 2009 1 j23c5 Greg medium
7 2007 1 j2j4d3 Greg medium
8 2008 2 j2p4s3 na na
9 2007 6 j3x4p2 na starting
9 2007 1 j3x4p2 na starting
my real data set size is over 200000 x 16

Combine data in many row into a columnn

I have a data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?
One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)
One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)
1
If every year has the same number of data, you could split the data and cbind it using base R
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2
If every year does not have the same number of data, you could use rbind.fill of plyr
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3
Yet another way is to use dcast to convert data from long to wide format
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA

Resources