I have a dataframe like this:
df <- data.frame(id = c("a", "b"), date1 = c("06/10/2003", "2006-05-12"), date2 = c("2003-07-15", "10/01/2010"))
id date1 date2
a 06/10/2003 2003-07-15
b 2006-05-12 10/01/2010
I would like to convert these characters to dates. So far, I have been able to do it one column at a time with the following code:
df$new_date <- as.Date(df$date1, format = "%m/%d/%Y")
df$new_date2 <- as.Date(df$date1, format = "%Y-%m-%d")
df <- df %>%
mutate(date1 = coalesce(new_date,new_date2))
But I have a bunch of columns. Is there a way to loop this? Thanks in advance!
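We can use parse_date_time() from lubridate, along with across() within mutate(). Note that parse_date_time() returns POSIXct date-times, so wrap the call in as.Date() if a Date column is needed: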
library(tidyverse)
df %>%
mutate(across(starts_with("date"),
~lubridate::parse_date_time(.,orders = c("mdy", "ymd"))))
# id date1 date2
# 1 a 2003-06-10 2003-07-15
# 2 b 2006-05-12 2010-10-01
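If adding lubridate is not an option, the question's own coalesce idea can be wrapped in a small helper and applied to every date column in base R. This is a minimal sketch that assumes only the two formats shown in the question occur; parse_mixed is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: try month/day/year first, then fall back to ISO year-month-day
# for the entries that did not parse.
parse_mixed <- function(x) {
  out <- as.Date(x, format = "%m/%d/%Y")
  miss <- is.na(out)
  out[miss] <- as.Date(x[miss], format = "%Y-%m-%d")
  out
}

df <- data.frame(id = c("a", "b"),
                 date1 = c("06/10/2003", "2006-05-12"),
                 date2 = c("2003-07-15", "10/01/2010"))

# Select every column whose name starts with "date" and convert it in place.
date_cols <- grep("^date", names(df), value = TRUE)
df[date_cols] <- lapply(df[date_cols], parse_mixed)
str(df)
```

Strings that match neither format stay NA, so it is worth checking anyNA() on the result before relying on it.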
You could reshape the data frame using pivot_longer so all of the dates are in one column, then use a vectorized condition to address each of the formatting variations in turn, then use pivot_wider to return the data frame to its original shape.
library(tidyverse)
pivot_longer(df, cols = c("date1", "date2")) %>%
mutate(
value = case_when(
grepl("-", value) ~ as.Date(value, format = "%Y-%m-%d"),
grepl("/", value) ~ as.Date(value, format = "%m/%d/%Y")
)
) %>%
pivot_wider(names_from = "name", values_from = "value")
# A tibble: 2 x 3
id date1 date2
<chr> <date> <date>
1 a 2003-06-10 2003-07-15
2 b 2006-05-12 2010-10-01
Related
I have a dataset which has multiple start dates and end dates for each Id. I would like to take the earliest date from the "startDate" column and the latest date from the endDate column.
data = data.frame(ID=c(1,1,1,1,2,2,2),
startDate= c("2018-01-31", "2018-01-31", "2018-01-31", "2019-06-06",
"2002-06-07", "2002-06-07", "2002-09-12"),
endDate = c(NA,NA,NA,"2019-07-09",NA,NA, "2002-10-02"))
This is the output I was hoping to get:
data = data.frame(ID=c(1,2),
startDate= c("2018-01-31","2002-06-07"),
endDate = c("2019-07-09","2002-10-02"))
After some trial and error I figured out how to do this with the following code, but I would prefer something more efficient if at all possible. I need to do this repeatedly and I would rather not have to create two separate dataframes. Thank you guys for your help!
data_start <- data %>%
group_by(ID) %>%
arrange(startDate) %>%
slice(1L)
data_end <- data %>%
group_by(ID) %>%
arrange(desc(endDate)) %>%
slice(1L)
data <- left_join(data_start[,c(1,2)], data_end[,c(1,3)], by="ID")
Or with first and last, which assumes the rows within each ID are already sorted by date (as they are in the example):
library(dplyr)
data %>%
group_by(ID) %>%
summarise(
startDate = first(startDate),
endDate = last(endDate)
)
# A tibble: 2 x 3
ID startDate endDate
* <dbl> <chr> <chr>
1 1 2018-01-31 2019-07-09
2 2 2002-06-07 2002-10-02
You can use min and max after converting the columns to dates:
data %>% group_by(ID) %>%
summarise(startDate = min(as.Date(startDate), na.rm = TRUE),
endDate = max(as.Date(endDate), na.rm = TRUE))
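Because min() and max() do not depend on row order, the same summary can also be sketched in base R without dplyr. The rows are deliberately shuffled here to show that sorting is not required. This assumes every ID has at least one non-missing endDate (otherwise max() with na.rm = TRUE warns and returns -Inf); per_id and result are illustrative names:

```r
data <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
                   startDate = c("2018-01-31", "2018-01-31", "2018-01-31", "2019-06-06",
                                 "2002-06-07", "2002-06-07", "2002-09-12"),
                   endDate = c(NA, NA, NA, "2019-07-09", NA, NA, "2002-10-02"))

# Shuffle the rows: the result below should not depend on their order.
data <- data[c(4, 1, 7, 2, 5, 3, 6), ]

data$startDate <- as.Date(data$startDate)
data$endDate <- as.Date(data$endDate)

# One small data frame per ID, then stack them.
per_id <- lapply(split(data, data$ID), function(g) {
  data.frame(ID = g$ID[1],
             startDate = min(g$startDate),
             endDate = max(g$endDate, na.rm = TRUE))
})
result <- do.call(rbind, per_id)
result
```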
I am having trouble standardizing the date format to dd-Mon-YYYY (e.g. 23-Jul-2020). This is my current code:
Dataset
date
1 23/07/2020
2 22-Jul-2020
Current Output
df$date<-as.Date(df$date)
df$date = format(df$date, "%d-%b-%Y")
date
1 20-Jul-0022
2 <NA>
Desired Output
date
1 23-Jul-2020
2 22-Jul-2020
You can try this way
library(lubridate)
df$date <- dmy(df$date)
df$date <- format(df$date, format = "%d-%b-%Y")
# date
# 1 23-Jul-2020
# 2 22-Jul-2020
Data
df <- read.table(text = "date
1 23/07/2020
2 22-Jul-2020", header = TRUE)
I've saved your example data set as a dataframe named df. I used group_by from dplyr to allow each date to be converted separately to the correct format.
library(tidyverse)
df %>%
group_by(date) %>%
mutate(date = as.Date(date, tryFormats = c("%d-%b-%Y", "%d/%m/%Y"))) %>%
mutate(date = format(date, "%d-%b-%Y"))
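The reason each date has to be converted separately is that as.Date() with tryFormats picks a single format, based on the first non-missing element, and applies it to the whole vector. A small base-R sketch of the difference, assuming an English (or C) locale so that %b recognises "Jul"; the variable names are illustrative:

```r
x <- c("23/07/2020", "22-Jul-2020")
fmts <- c("%d-%b-%Y", "%d/%m/%Y")

# Whole vector at once: the format is chosen from the first element only,
# so the second entry fails to parse and becomes NA.
whole <- as.Date(x, tryFormats = fmts)

# One element at a time: each string gets its own format.
each <- do.call(c, lapply(x, as.Date, tryFormats = fmts))

whole
format(each, "%d-%b-%Y")
```

Note that the element-wise version raises an error if a string matches none of the formats, rather than returning NA.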
Having a dataframe like this:
data.frame(id = c(1,3), timestamp = c("20-10-2009 11:35:12", "01-01-2017 12:21:21"), stringsAsFactors = FALSE)
How can I keep only the year in the timestamp column, given that all years are after 2000? Expected output:
data.frame(id = c(1,3), timestamp = c("2009", "2017"), stringsAsFactors = FALSE)
Base R:
format(as.Date(df$timestamp, "%d-%m-%Y %H:%M:%S"), "%Y")
[1] "2009" "2017"
So in the dataframe:
df$year <- format(as.Date(df$timestamp, "%d-%m-%Y %H:%M:%S"), "%Y")
id timestamp year
1 1 20-10-2009 11:35:12 2009
2 3 01-01-2017 12:21:21 2017
Another option, if you're into or familiar with regex, is this:
sub(".*([0-9]{4}).*", "\\1", df$timestamp)
[1] "2009" "2017"
See if this answers your question. The code and its output are as follows:
library(lubridate)
library(tidyverse)
# Note the argument name: stringsAsFactors (a misspelled name would be added as an extra column)
df <- data.frame(id = c(1,3,4), timestamp = c("20-10-2009 11:35:12", "01-01-2017 12:21:21","01-01-1998 12:21:21"), stringsAsFactors = FALSE)
df$timestamp <- dmy_hms(df$timestamp)
df1 <- df %>%
filter(year(timestamp) > 2000) %>%
mutate(new_year = year(timestamp))
df1
#id timestamp new_year
#1 1 2009-10-20 11:35:12 2009
#2 3 2017-01-01 12:21:21 2017
If you're not afraid of external packages, one option would be to make use of the lubridate package:
library(dplyr)  # provides %>% and mutate()
df <- data.frame(id = c(1,3), timestamp = c("20-10-2009 11:35:12", "01-01-2017 12:21:21"))
df <- df %>%
mutate(timestamp = lubridate::dmy_hms(timestamp)) %>%
mutate(year = lubridate::year(timestamp))
If you actually want to replace the timestamp column, change the last mutate command to assign to timestamp instead of year. Result:
id timestamp year
1 1 2009-10-20 11:35:12 2009
2 3 2017-01-01 12:21:21 2017
I have a tidyverse solution to your problem:
library(tidyverse)
data.frame(id = c(1,3), timestamp = c("20-10-2009 11:35:12", "01-01-2017 12:21:21"), stringsAsFactors = FALSE) %>%
mutate(timestamp = timestamp %>%
str_extract("\\d{4}"))
str_extract() with the pattern "\\d{4}" returns the first run of four consecutive digits in each string, which is the year for timestamps in this day-month-year layout.
Having two dataframes with dates like this:
df1 <- data.frame(id = c(1,1,1,1,2,2), date=c("2019/12/11 20:30:12", "2019/12/12 09:20:12", "2019/12/12 11:30:40", "2019/12/13 20:12:34", "2019/12/11 12:20:12", "2019/12/11 19:20:12"), values = c(23,4,1,3,4,2))
df2 <- data.frame(id = c(1,2), date = c("2019/12/12 09:20:12", "2019/12/11 19:20:12"))
How can I use the date for each id in the second dataframe to keep only the rows of the first dataframe that fall on or before that date?
Example of expected output:
data.frame(id = c(1,1,2,2), date=c("2019/12/11 20:30:12", "2019/12/12 09:20:12", "2019/12/11 12:20:12", "2019/12/11 19:20:12"), values = c(23,4,4,2))
We can do a left_join and then filter
library(dplyr)
library(lubridate)
left_join(df1, df2, by = 'id') %>%
filter(ymd_hms(date.x) <= ymd_hms(date.y)) %>%
select(id, date = date.x, values)
#id date values
#1 1 2019/12/11 20:30:12 23
#2 1 2019/12/12 09:20:12 4
#3 2 2019/12/11 12:20:12 4
#4 2 2019/12/11 19:20:12 2
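The same join-then-filter idea can be sketched in base R with merge(). The column name cutoff and the variables merged, keep and result are illustrative choices, and the timestamps are parsed as UTC on the assumption that both dataframes use the same time zone:

```r
df1 <- data.frame(id = c(1, 1, 1, 1, 2, 2),
                  date = c("2019/12/11 20:30:12", "2019/12/12 09:20:12",
                           "2019/12/12 11:30:40", "2019/12/13 20:12:34",
                           "2019/12/11 12:20:12", "2019/12/11 19:20:12"),
                  values = c(23, 4, 1, 3, 4, 2))
df2 <- data.frame(id = c(1, 2),
                  date = c("2019/12/12 09:20:12", "2019/12/11 19:20:12"))

# Rename the cutoff column so the two date columns do not clash after the merge.
names(df2)[names(df2) == "date"] <- "cutoff"
merged <- merge(df1, df2, by = "id")

# Compare as date-times rather than as strings.
fmt <- "%Y/%m/%d %H:%M:%S"
keep <- as.POSIXct(merged$date, format = fmt, tz = "UTC") <=
  as.POSIXct(merged$cutoff, format = fmt, tz = "UTC")

result <- merged[keep, c("id", "date", "values")]
result
```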
I would like to retain my current date column in year-month format as date. It currently gets converted to chr format. I have tried as_datetime but it coerces all values to NA.
The format I am looking for is: "2017-01"
library(lubridate)
df<- data.frame(Date=c("2017-01-01","2017-01-02","2017-01-03","2017-01-04",
"2018-01-01","2018-01-02","2018-02-01","2018-03-02"),
N=c(24,10,13,12,10,10,33,45))
df$Date <- as_datetime(df$Date)
df$Date <- ymd(df$Date)
df$Date <- strftime(df$Date,format="%Y-%m")
Thanks in advance!
lubridate only handles dates, and dates have days. However, as alistaire mentions, you can floor them to the month if you want to work monthly:
library(tidyverse)
library(lubridate)  # floor_date() and as_date(); older tidyverse versions do not attach lubridate
df_month <-
df %>%
mutate(Date = floor_date(as_date(Date), "month"))
If you e.g. want to aggregate by month, just group_by() and summarize().
df_month %>%
group_by(Date) %>%
summarize(N = sum(N)) %>%
ungroup()
#> # A tibble: 4 x 2
#> Date N
#> <date> <dbl>
#>1 2017-01-01 59
#>2 2018-01-01 20
#>3 2018-02-01 33
#>4 2018-03-01 45
You can solve this with the zoo::as.yearmon() function. The solution follows:
library(tidyquant)
library(magrittr)
library(dplyr)
df <- data.frame(Date=c("2017-01-01","2017-01-02","2017-01-03","2017-01-04",
"2018-01-01","2018-01-02","2018-02-01","2018-03-02"),
N=c(24,10,13,12,10,10,33,45))
df %<>% mutate(Date = zoo::as.yearmon(Date))
You can use the cut function with breaks = "month" to map every date to the first day of its month, so all dates within the same month share the same value in the newly created column.
This is useful for grouping the other variables in your data frame by month (essentially what you are trying to do). Note that cut creates a factor, but it can be converted back to a date, so you can still have a date class in your data frame.
You can't simply drop the day from a date (because then it is no longer a date). Afterwards you can create a nicely formatted label for axes or tables. For example:
true_date <-
as.POSIXlt(
c(
"2017-01-01",
"2017-01-02",
"2017-01-03",
"2017-01-04",
"2018-01-01",
"2018-01-02",
"2018-02-01",
"2018-03-02"
),
format = "%F"
)
df <-
data.frame(
Date = cut(true_date, breaks = "month"),
N = c(24, 10, 13, 12, 10, 10, 33, 45)
)
## here df$Date is a 'factor'. You could use substr to create a formatted column
df$formated_date <- substr(df$Date, start = 1, stop = 7)
## and you can convert back to date class. format = "%F", is ISO 8601 standard date format
df$true_date <- strptime(x = as.character(df$Date), format = "%F")
str(df)
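The point that a Date always carries a day can be shown with a shorter base-R sketch: keep a real Date pinned to the first of the month for grouping and sorting, and derive the "2017-01" text only for display. The variable names are illustrative:

```r
d <- as.Date(c("2017-01-01", "2017-01-02", "2018-02-01", "2018-03-02"))

# Still a Date: every day is replaced by the first of its month.
month_start <- as.Date(format(d, "%Y-%m-01"))

# Character label for tables or axes; this is no longer a Date.
month_label <- format(month_start, "%Y-%m")

class(month_start)
month_label
```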