Parsing dates with different formats using lubridate


I am importing data from a CSV file where the date column contains dates recorded in different formats. I wish to parse the column so that it has the class Date and all of the dates are formatted in the same style (i.e. %d-%m-%Y). I wish to use lubridate, as I have some experience with it and want to get better at using it.
I have looked for answers in "Parsing dates with different formats" and "Parsing dates in multiple formats in R using lubridate", but I found the answers incomplete.
Typically when I import csv data I change the col_types like so:
potatoes <- read_csv("data/potato_prices.csv",
                     col_types = cols(
                       DATE = col_date(format = "%Y-%m-%d"),
                       'M04003DE00BERM372NNBR' = col_double())) %>%
  rename("Price" = "M04003DE00BERM372NNBR")
but because my DATE column contains dates in different formats, dates not formatted like "%Y-%m-%d" return NA and the class of the column appears as unknown.
I have also tried col_guess() instead of col_date() with an exact format, then mutating the DATE column with the following code, but it has not worked as I would like.
potatoes <- read_csv("data/potato_prices.csv",
                     col_types = cols(
                       DATE = col_guess(),
                       'M04003DE00BERM372NNBR' = col_double()))
potatoes <- potatoes %>%
  mutate(DATE = parse_date_time(DATE, orders = c("Ymd", "dmY"))) %>%
  rename("Price" = "M04003DE00BERM372NNBR")
Here is an example of how my data appears in Excel in CSV format:
DATE <- c("1879-01-01", "1879-02-01", "1879-03-01", "1879-04-01", "1/05/1990", "1/06/1990", "1/07/1990", "1/08/1990", "1/09/1990", "1/10/1990")
Price <- c("23", "17.9", "17.8", "18", "20", "22", "20", "19", "17.2", "15")
spuds <- data.frame(DATE, Price)
I wish to have a tibble with two columns: DATE as class Date and Price as a double. I will then create plots using ggplot, and I think it will be easiest if my DATE column is of class Date.
Thanks

The following function tries each of the date formats passed in its argument format, in turn. It uses the lubridate function guess_formats to work out the possible strptime-style formats from that argument, then fills in any values that are still NA with the next format.
as_Date <- function(x, format = c("ymd", "dmy", "mdy")){
  fmt <- lubridate::guess_formats(x, format)
  fmt <- unique(fmt)
  y <- as.Date(x, format = fmt[1])
  for(i in seq_along(fmt)[-1]){
    na <- is.na(y)
    if(!any(na)) break
    y[na] <- as.Date(x[na], format = fmt[i])
  }
  y
}
formats <- c("ymd", "dmy")
as_Date(spuds$DATE, formats)
#[1] "1879-01-01" "1879-02-01" "1879-03-01" "1879-04-01"
#[5] "1990-05-01" "1990-06-01" "1990-07-01" "1990-08-01"
#[9] "1990-09-01" "1990-10-01"
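As an alternative that stays closer to the lubridate code in the question, parse_date_time can be given both orders and its result wrapped in as.Date, since parse_date_time returns POSIXct rather than Date. This is a minimal sketch, assuming the column only ever contains year-month-day and day-month-year strings like the sample data; the vector name dates_raw is made up for illustration.

```r
library(lubridate)

# Sample of the mixed formats from the question (dates_raw is a hypothetical name)
dates_raw <- c("1879-01-01", "1879-04-01", "1/05/1990", "1/10/1990")

# parse_date_time() tries each order; the result is POSIXct, so convert to Date
dates <- as.Date(parse_date_time(dates_raw, orders = c("Ymd", "dmY")))

class(dates)                 # "Date"
format(dates, "%d-%m-%Y")    # character strings, for display only
```

A Date column always prints as %Y-%m-%d; format() returns character strings, so it is probably best applied only at display time (for example in axis labels) rather than stored in the tibble.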

Related

How to format properly date-time column in R using mutate?

I am trying to convert a string column to a date-time series.
The rows in the column look like this example: "2019-02-27T19:08:29+000"
(dateTime is the name of the column)
mutate(df,dateTime=as.Date(dateTime, format = "%Y-%m-%dT%H:%M:%S+0000"))
But the result is:
2019-02-27
What about the hours, minutes and seconds?
I need them to apply a filter by date-time.

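Your code is almost correct. Only two things were wrong: the format had an extra 0 (the data ends in +000, not +0000), and as.Date keeps only the date part, so use as.POSIXct to keep the time: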
library("dplyr")
df <- data.frame(dateTime = "2019-02-27T19:08:29+000",
                 stringsAsFactors = FALSE)
mutate(df, dateTime = as.POSIXct(dateTime, format = "%Y-%m-%dT%H:%M:%S+000"))
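Once the column is POSIXct, filtering by date-time is an ordinary comparison. The sketch below uses base R only; the second row, the cutoff value and the choice of tz = "UTC" are assumptions added for illustration (the trailing +000 is matched as literal text, not parsed as an offset).

```r
# Two sample rows; the second one is invented for the example
df <- data.frame(dateTime = c("2019-02-27T19:08:29+000",
                              "2019-02-28T07:15:00+000"),
                 stringsAsFactors = FALSE)

# Convert to POSIXct, keeping hours, minutes and seconds
df$dateTime <- as.POSIXct(df$dateTime,
                          format = "%Y-%m-%dT%H:%M:%S+000",
                          tz = "UTC")   # assumption: the timestamps are UTC

# Keep only rows at or after a hypothetical cutoff
cutoff <- as.POSIXct("2019-02-28 00:00:00", tz = "UTC")
later <- df[df$dateTime >= cutoff, , drop = FALSE]
later
```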

Strptime fails when working with a dataframe

Strptime seems to be missing something in this scenario:
aDateInPOSIXct <- strptime("2018-12-31", format = "%Y-%m-%d")
someText <- "asdf"
df <- data.frame(aDateInPOSIXct, someText, stringsAsFactors = FALSE)
bDateInPOSIXct <- strptime("2019-01-01", format = "%Y-%m-%d")
df[1,1] <- bDateInPOSIXct
Assignment of bDateInPOSIXct to the data frame fails with:
Error in as.POSIXct.numeric(value) : 'origin' must be supplied
And a warning:
provided 11 variables to replace 1 variables
I want to use both POSIXct dates and POSIXct date-times to compare this and that. It's way less work than manipulating character strings -- and POSIX takes care of the time zone issues. Unfortunately, I'm missing something.

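Despite the variable names, strptime returns a POSIXlt object, which is a list of date-time components rather than a single number, so it cannot be assigned into one data frame cell. You only need to convert the results of strptime to POSIXct explicitly: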
aDateInPOSIXct <- as.POSIXct(strptime("2018-12-31", format = "%Y-%m-%d"))
someText <- "asdf"
df <- data.frame(aDateInPOSIXct, someText, stringsAsFactors = FALSE)
bDateInPOSIXct <- as.POSIXct(strptime("2019-01-01", format = "%Y-%m-%d"))
df[1,1] <- bDateInPOSIXct
Check the R documentation which says:
Character input is first converted to class "POSIXlt" by strptime: numeric input is first converted to "POSIXct".
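The difference between the two classes can be inspected directly. This is a small sketch with made-up variable names (lt, ct, df, when) and an explicit tz = "UTC" chosen only so the result does not depend on the local time zone.

```r
# strptime() gives POSIXlt: a list of components (sec, min, hour, ...)
lt <- strptime("2019-01-01", format = "%Y-%m-%d", tz = "UTC")
class(lt)
is.list(unclass(lt))

# as.POSIXct() gives a single number of seconds since 1970-01-01 UTC
ct <- as.POSIXct(lt)
class(ct)
length(unclass(ct))

# A POSIXct value fits into one data frame cell without errors
df <- data.frame(when = as.POSIXct("2018-12-31", tz = "UTC"))
df[1, 1] <- ct
df
```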

R, csv file, As.Date returns only NAs

I have a csv file containing financial data (i.e. dates with corresponding prices). My goal is to load these data in R and convert the dates from character data to dates. I tried the following:
data<-read.csv("data.csv",sep=";")
attach(data)
as.Date(Date,format="%Y-%b-%d") #'Date' is the column containing the dates
Unfortunately, this only leads to NAs in Date. Things that were proposed in other threads on this issue but did not help me:
reading in the csv file with 'stringsAsFactors=FALSE'
formatting the dates in Excel as dates
Here is a sample of my csv file:
Date;Open;High;Low;Close;Volume;Adj Close
30.10.2015;10842.51953;10850.58008;10748.7002;10850.13965;89270000;10850.13965
29.10.2015;10867.19043;10886.98047;10741.13965;10800.83984;122513100;10800.83984
28.10.2015;10728.16016;10848.41016;10691.62988;10831.95996;0;10831.95996
27.10.2015;10761.37012;10807.41016;10692.19043;10692.19043;0;10692.19043
26.10.2015;10791.17969;10863.08984;10756.83008;10801.33984;73091500;10801.33984
23.10.2015;10610.33008;10847.46973;10586.95996;10794.54004;0;10794.54004
22.10.2015;10213.00977;10508.25;10194.74023;10491.96973;107511600;10491.96973
21.10.2015;10185.41992;10277.58984;10107.91992;10238.09961;70021400;10238.09961
20.10.2015;10174.79981;10194.53027;10080.19043;10147.67969;67235200;10147.67969

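Your format argument was incorrect, which is usually the cause of NAs when coercing strings to Date objects. Your dates are written as day.month.year with numeric months, whereas "%Y-%b-%d" expects a four-digit year, an abbreviated month name and a day, separated by hyphens. You can use this instead: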
R> as.Date(Df$Date, format = "%d.%m.%Y")
#[1] "2015-10-30" "2015-10-29" "2015-10-28" "2015-10-27" "2015-10-26"
#[6] "2015-10-23" "2015-10-22" "2015-10-21" "2015-10-20"
Instead of attach, you can use alternatives such as within to avoid qualifying your column names. For example,
Df <- within(Df, {
  Date <- as.Date(Date, format = "%d.%m.%Y")
})
##
R> class(Df$Date)
#[1] "Date"
Data:
Df <- read.table(
text = "Date;Open;High;Low;Close;Volume;Adj Close
30.10.2015;10842.51953;10850.58008;10748.7002;10850.13965;89270000;10850.13965
29.10.2015;10867.19043;10886.98047;10741.13965;10800.83984;122513100;10800.83984
28.10.2015;10728.16016;10848.41016;10691.62988;10831.95996;0;10831.95996
27.10.2015;10761.37012;10807.41016;10692.19043;10692.19043;0;10692.19043
26.10.2015;10791.17969;10863.08984;10756.83008;10801.33984;73091500;10801.33984
23.10.2015;10610.33008;10847.46973;10586.95996;10794.54004;0;10794.54004
22.10.2015;10213.00977;10508.25;10194.74023;10491.96973;107511600;10491.96973
21.10.2015;10185.41992;10277.58984;10107.91992;10238.09961;70021400;10238.09961
20.10.2015;10174.79981;10194.53027;10080.19043;10147.67969;67235200;10147.67969",
header = TRUE, stringsAsFactors = FALSE, sep = ";")
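A quick way to confirm that a format string matches the data is to count the NAs it produces on a few values. The snippet below is a sketch using two dates from the sample file; the names x, wrong and right are only for illustration.

```r
x <- c("30.10.2015", "29.10.2015")

# The format from the question does not match the strings, so every value is NA
wrong <- as.Date(x, format = "%Y-%b-%d")
sum(is.na(wrong))

# The format matching day.month.year parses every value
right <- as.Date(x, format = "%d.%m.%Y")
sum(is.na(right))
right
```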

Why does readr store date objects as integer values?

When reading in csv files using the readr package date objects are stored as integer values. When I say stored as integer I don't mean the class of the date column, I mean the underlying date value R stores. This prevents the ability to use the dplyr join functions if one data frame's dates are stored as numeric values and the other's are integer. I've included a reproducible example below. Is there anything I can do to prevent this behavior?
library(readr)
library(dplyr)  # needed for full_join() below
df1 <- data.frame(Date = as.Date(c("2012-11-02", "2012-11-04", "2012-11-07", "2012-11-09", "2012-11-11")), Text = c("Why", "Does", "This", "Happen", "?"), stringsAsFactors = F)
class(df1$Date)
# [1] "Date"
dput(df1$Date[1])
# structure(15646, class = "Date")
# Write to dummy csv
write.csv(df1, file = "dummy_csv.csv", row.names = F)
# Read back in data using both read.csv and read_csv
df2 <- read.csv("dummy_csv.csv", as.is = T, colClasses = c("Date", "character"))
df3 <- read_csv("dummy_csv.csv")
# Examine structure of date values
class(df2$Date)
# [1] "Date"
class(df3$Date)
# [1] "Date"
dput(df2$Date[1])
# structure(15646, class = "Date")
dput(df3$Date[1])
# structure(15646L, class = "Date")
# Try to join using dplyr joins
both <- full_join(df2, df3, by = c("Date"))
Error: cannot join on columns 'Date' x 'Date': Cant join on 'Date' x 'Date' because of incompatible types (Date / Date)
# Base merge works
both2 <- merge(df2, df3, by = "Date")
# converting a POSIXct object to Date is also stored as numeric
temp_date <- as.Date(as.POSIXct("11OCT2012:19:00:00", format = "%d%b%Y:%H:%M:%S"))
dput(temp_date)
# structure(15624, class = "Date")
Judging by this issue on the dplyr repo, it seems that Hadley thinks this is a feature, but whenever your date values are stored differently you can't join on them, and I haven't figured out a way to convert the integer date object to a numeric one. Is there any way to stop the readr package from doing this, or any way to convert a Date object stored as an integer to a numeric value?

According to Hadley himself, this is a bug in dplyr, not readr. He says that storing dates as integer rather than numeric values when reading in files is fine, but dplyr should be able to handle the difference, as merge does.
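For the second part of the question, an integer-backed Date can be rebuilt as a double-backed one in base R. This is a sketch of one possible workaround, not something taken from the linked issue; the names int_date and dbl_date are hypothetical, and newer dplyr versions may not need it at all.

```r
# A Date backed by an integer, like the one read_csv produced in the question
int_date <- structure(15646L, class = "Date")
typeof(int_date)   # "integer"

# Strip to a plain double, then rebuild the Date from days since 1970-01-01
dbl_date <- as.Date(as.numeric(int_date), origin = "1970-01-01")
typeof(dbl_date)   # "double"

# Both objects still represent the same calendar day
int_date == dbl_date
```

Applied to the example above, this would be df3$Date <- as.Date(as.numeric(df3$Date), origin = "1970-01-01") before the join.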

Subsetting data table in R by date

I need to subset my data by a date range; the code is below.
I read in two .csv files (data2010, data2), change the date format to exclude the timestamp, rename the headers so they are the same for both files, then merge them (data2011).
The files seem to merge correctly, but when I subset by the date range, no observations are returned.
Also, the dates sort like 01/01/10 01/01/11 01/02/10 01/02/11, that is, the same month and day paired across different years.
data2010 <- read.csv(file="2010final.csv")
data2 <- read.csv(file="2011final.csv")
#change format of timestamp to date with mm/dd/yyyy for 2011
data2$newdate <-strptime(as.character(data2$Date), "%m/%d/%y")
data2$Date <- format(data2$newdate, "%m/%d/%y")
data2$newdate <- NULL
#rename and format 2010
names(data2010) <- c("Region", "District", "Age", "Gender", "Marital Status", "Date", "Reason")
data2010$newdate <-strptime(as.character(data2010$Date), "%m/%d/%y %H")
data2010$Date <- format(data2010$newdate, "%m/%d/%y")
data2010$newdate <- NULL
#merge
data2011 <- rbind(data2010, data2)
summary(data2011)
str(data2011)
#I see from the above commands that the files have merged
jan6Before <- subset(data2011, Date >= "12/22/10" & Date <= "01/06/11")
summary(jan6Before)
str(jan6Before)
#But this does not produce any observations

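I suspect it's because your Date variable is a character vector, not a date, and it is being compared to another character constant, e.g. "12/22/10". Strings are compared character by character, so "12/22/10" sorts after "01/06/11" and no value can satisfy both conditions.
I suggest you have a look at the lubridate package. You can then easily convert the character values (in this case month-day-year) before comparing, e.g. mdy(Date) >= mdy("12/22/10").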

Merge on your variable newDate, and use that for subsetting also.
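The same idea can be sketched in base R without lubridate. The data frame visits and its four dates are invented for the example; only the "%m/%d/%y" format and the date range come from the question.

```r
# Hypothetical data in the question's mm/dd/yy layout
visits <- data.frame(Date = c("12/21/10", "12/25/10", "01/03/11", "01/10/11"),
                     stringsAsFactors = FALSE)

# As strings, a date inside the range still fails the comparison
as_text <- "12/25/10" >= "12/22/10" & "12/25/10" <= "01/06/11"
as_text   # FALSE

# Convert to Date first, then compare against Date values
visits$Date <- as.Date(visits$Date, format = "%m/%d/%y")
jan6Before <- subset(visits,
                     Date >= as.Date("2010-12-22") & Date <= as.Date("2011-01-06"))
jan6Before
```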
