When reading in CSV files using the readr package, date objects are stored as integer values. By "stored as integer" I don't mean the class of the date column; I mean the underlying value R stores for each date. This prevents using the dplyr join functions when one data frame's dates are stored as numeric (double) values and the other's as integer. I've included a reproducible example below. Is there anything I can do to prevent this behavior?
library(readr)
library(dplyr)
df1 <- data.frame(Date = as.Date(c("2012-11-02", "2012-11-04", "2012-11-07", "2012-11-09", "2012-11-11")), Text = c("Why", "Does", "This", "Happen", "?"), stringsAsFactors = F)
class(df1$Date)
# [1] "Date"
dput(df1$Date[1])
# structure(15646, class = "Date")
# Write to dummy csv
write.csv(df1, file = "dummy_csv.csv", row.names = F)
# Read back in data using both read.csv and read_csv
df2 <- read.csv("dummy_csv.csv", as.is = T, colClasses = c("Date", "character"))
df3 <- read_csv("dummy_csv.csv")
# Examine structure of date values
class(df2$Date)
# [1] "Date"
class(df3$Date)
# [1] "Date"
dput(df2$Date[1])
# structure(15646, class = "Date")
dput(df3$Date[1])
# structure(15646L, class = "Date")
# Try to join using dplyr joins
both <- full_join(df2, df3, by = c("Date"))
Error: cannot join on columns 'Date' x 'Date': Cant join on 'Date' x 'Date' because of incompatible types (Date / Date)
# Base merge works
both2 <- merge(df2, df3, by = "Date")
# Converting a POSIXct object to Date also gives a value stored as numeric (double)
temp_date <- as.Date(as.POSIXct("11OCT2012:19:00:00", format = "%d%b%Y:%H:%M:%S"))
dput(temp_date)
# structure(15624, class = "Date")
Judging by this issue on the dplyr repo, it seems like Hadley thinks this is a feature, but any time your date values are stored differently you can't join on them, and I haven't figured out a way to convert the integer-backed date object to a numeric one. Is there any way to stop the readr package from doing this, or any way to convert a Date object stored as an integer to a numeric value?
According to Hadley himself, this is a bug in dplyr, not readr. He says that storing numeric vs. integer values when reading in files is OK, but dplyr should be able to handle the difference like merge does.
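As for converting an integer-backed Date into a double-backed one, here is a minimal base R sketch. It is an illustration only, not something taken from the linked issue, and the vector x is made-up data mimicking what dput showed for df3$Date.

```r
# Hypothetical example value: a Date whose underlying storage is integer,
# like the one read_csv produced in the question.
x <- structure(15646L, class = "Date")
typeof(x)      # storage type of the original value

# Strip the class, coerce the raw day count to double, and rebuild the Date.
fixed <- as.Date(as.numeric(unclass(x)), origin = "1970-01-01")
typeof(fixed)  # storage type after the round trip
```

Applied to the example above, this would be df3$Date <- as.Date(as.numeric(unclass(df3$Date)), origin = "1970-01-01") before calling full_join.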
Related
I have two databases where I need to combine columns based on 2 common Date columns, with the condition that the DAY of those dates is the same.
"2020/01/01 20:30" MUST MATCH "2020/01/01 17:50"
All dates are in POSIXct format.
While I could use some preprocessing with string parsing or the like, I wanted to handle it via lubridate/dplyr like:
DB_New <- left_join(DB_A,DB_B, by=c((date(Date1) = date(Date2)))
Notice that I am using the function "date" from lubridate to match the condition explained above. However, I am getting the error below:
DB_with_rain <- left_join(DB_FEB_2019_join,Chuvas_BH, by=c(date(Saida_Real)= date(DateTime)))
Error: unexpected '=' in "DB_with_rain <- left_join(DB_FEB_2019_join,Chuvas_BH, by=c(date(Saida_Real)="
Inside the by argument we cannot do the conversion; it expects column names as strings. The conversion should be done before the left_join:
library(dplyr)
DB_FEB_2019_join %>%
  mutate(Saida_Real = as.Date(Saida_Real, format = "%Y/%m/%d %H:%M")) %>%
  left_join(Chuvas_BH %>%
              mutate(DateTime = as.Date(DateTime, format = "%Y/%m/%d %H:%M")),
            by = c(Saida_Real = "DateTime"))
With lubridate, character date-times can instead be parsed with ymd_hm and then converted to Date class with as.Date.
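The same "build a day key first, then join" idea can be sketched in base R, with merge(..., all.x = TRUE) standing in for left_join. The two small data frames, their values, and the helper column name day are made up for illustration; the UTC time zone is also an assumption, and it matters because as.Date on a POSIXct value uses the time zone to decide which calendar day a timestamp falls on.

```r
# Hypothetical stand-ins for the two tables in the question.
a <- data.frame(
  Saida_Real = as.POSIXct(c("2020-01-01 20:30", "2020-01-02 09:00"), tz = "UTC"),
  id = 1:2
)
b <- data.frame(
  DateTime = as.POSIXct(c("2020-01-01 17:50", "2020-01-03 08:00"), tz = "UTC"),
  rain = c(5.2, 0)
)

# Derive a calendar-day key on both sides before joining.
a$day <- as.Date(a$Saida_Real, tz = "UTC")
b$day <- as.Date(b$DateTime, tz = "UTC")

# Left join on the day key: every row of a is kept, unmatched rows get NA.
joined <- merge(a, b, by = "day", all.x = TRUE)
joined
```

Keeping the original timestamp columns and adding a separate key avoids overwriting the time-of-day information, which may still be needed after the join.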
I have a large dataset that I'm importing from a txt file that has multiple date variables that are being formatted as number values 20190101, is there a way to assign a date format as part of import? There is no header in the file and I'm assigning names and lengths sample code below.
library(readr)
df <- read_fwf("file name",
               fwf_cols(id = 8,
                        update_date = 8,
                        name = 35),
               skip = 0)
Or is there a way to convert multiple values in one statement vs one at a time?
df$update_date <- as.Date(as.character(df$update_date), "%Y%m%d")
Here is a way to convert multiple values in one statement into Dates
(assuming yyyy mm dd). Here we target all columns that end with "date" in their name.
library(dplyr)
df <- data.frame(update_date = c(20190101, 20190102, 20190103),
                 end_date = c(20200101, 20200102, 20200103))
df %>% mutate_at(vars(ends_with("date")), ~ as.Date(as.character(.x), format = "%Y%m%d"))
You might similarly select the columns with
vars(starts_with("date"))
or
vars(update_date, end_date)
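On the import side of the question, readr can parse the dates while reading if the column is declared with col_date and an explicit format. The sketch below assumes the file layout from the question (widths 8, 8, 35); the temporary file and its two rows are invented purely so the example is runnable.

```r
library(readr)

# Write a tiny fixed-width file with made-up rows: id (8), update_date (8), name (35).
tmp <- tempfile(fileext = ".txt")
writeLines(
  sprintf("%08d%s%-35s", 1:2, c("20190101", "20190215"), c("Alice", "Bob")),
  tmp
)

# Declare the date column's format up front so it is parsed during import.
df_fwf <- read_fwf(
  tmp,
  fwf_cols(id = 8, update_date = 8, name = 35),
  col_types = cols(
    id = col_character(),
    update_date = col_date(format = "%Y%m%d"),
    name = col_character()
  )
)
class(df_fwf$update_date)
```

With several date columns, each one gets its own col_date(format = "%Y%m%d") entry in cols().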
I am using RStudio for my R sessions and I have the following R code:
library(dplyr)
library(lubridate)
d1 <- read.csv("mydata.csv", stringsAsFactors = FALSE, header = TRUE)
d2 <- d1 %>%
  mutate(PickUpDate = ymd(PickUpDate))
str(d2$PickUpDate)
The output of the last line of code above is as follows:
Date[1:14258], format: "2016-10-21" "2016-07-15" "2016-07-01" "2016-07-01" "2016-07-01" "2016-07-01" ...
I need an additional column (let's call it MthDD) in the data frame d2, which will hold the month and day of the "PickUpDate" column. So, column MthDD needs to be in the format mm-dd but, most importantly, it should still be of the date type.
How can I achieve this?
UPDATE:
I have tried the following but it outputs the new column as a character type. I need the column to be of the date type so that I can use it as the x-axis component of a plot.
d2$MthDD <- format(as.Date(d2$PickUpDate), "%m-%d")
Date objects do not display as mm-dd. You can create a character string with that representation but it will no longer be of Date class -- it will be of character class.
If you want an object that displays as mm-dd and still acts like a Date object what you can do is create a new S3 subclass of Date that displays in the way you want and use that. Here we create a subclass of Date called mmdd with an as.mmdd generic, an as.mmdd.Date method, an as.Date.mmdd method and a format.mmdd method. The last one will be used when displaying it. mmdd will inherit methods from Date class but you may still need to define additional methods depending on what else you want to do -- you may need to experiment a bit.
as.mmdd <- function(x, ...) UseMethod("as.mmdd")
as.mmdd.Date <- function(x, ...) structure(x, class = c("mmdd", "Date"))
as.Date.mmdd <- function(x, ...) structure(x, class = "Date")
format.mmdd <- function(x, format = "%m-%d", ...) format(as.Date(x), format = format, ...)
DF <- data.frame(x = as.Date("2018-03-26") + 0:2) # test data
DF2 <- transform(DF, y = as.mmdd(x))
giving:
> DF2
           x     y
1 2018-03-26 03-26
2 2018-03-27 03-27
3 2018-03-28 03-28
> class(DF2$y)
[1] "mmdd" "Date"
> as.Date(DF2$y)
[1] "2018-03-26" "2018-03-27" "2018-03-28"
Try using this:
PickUpDate2 <- format(PickUpDate,"%m-%d")
PickUpDate2 <- as.Date(PickUpDate2, "%m-%d")
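This should work, and you can then bind_cols the result or just add it to the data frame right away, as in the code you provided. Note that as.Date fills in the current year when the format contains no year, so the result is a full Date in the current year rather than a bare month-day. The code then becomes: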
d2$PickUpDate2 <- format(d2$PickUpDate,"%m-%d")
d2$PickUpDate2 <- as.Date(d2$PickUpDate2, "%m-%d")
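For the plotting use case, one possible variation is to map every date onto a fixed reference year instead of relying on the current year, and to format only the axis labels as mm-dd. This is a sketch under assumptions: the reference year 2000 is an arbitrary choice (picked because it is a leap year, so 02-29 stays valid), and the PickUpDate vector is made-up example data.

```r
# Hypothetical sample of pick-up dates from different years.
PickUpDate <- as.Date(c("2016-10-21", "2016-07-15", "2015-07-01"))

# Replace the year with a fixed reference year; the result is still a Date.
MthDD <- as.Date(format(PickUpDate, "2000-%m-%d"))
class(MthDD)

# Labels for display (for example on a plot axis) can be formatted as mm-dd.
labels <- format(MthDD, "%m-%d")
labels
```

Because MthDD remains a Date, it sorts and spaces correctly on an axis, while the labels hide the placeholder year.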
I have a csv file containing financial data (i.e. dates with corresponding prices). My goal is to load these data in R and convert the dates from character data to dates. I tried the following:
data<-read.csv("data.csv",sep=";")
attach(data)
as.Date(Date,format="%Y-%b-%d") #'Date' is the column containing the dates
Unfortunately, this only leads to NAs in Date. Things that were proposed in other threads on this issue but did not help me:
reading in the csv file with 'stringsAsFactors=FALSE'
formatting the dates in Excel as dates
Here is a sample of my csv file:
Date;Open;High;Low;Close;Volume;Adj Close
30.10.2015;10842.51953;10850.58008;10748.7002;10850.13965;89270000;10850.13965
29.10.2015;10867.19043;10886.98047;10741.13965;10800.83984;122513100;10800.83984
28.10.2015;10728.16016;10848.41016;10691.62988;10831.95996;0;10831.95996
27.10.2015;10761.37012;10807.41016;10692.19043;10692.19043;0;10692.19043
26.10.2015;10791.17969;10863.08984;10756.83008;10801.33984;73091500;10801.33984
23.10.2015;10610.33008;10847.46973;10586.95996;10794.54004;0;10794.54004
22.10.2015;10213.00977;10508.25;10194.74023;10491.96973;107511600;10491.96973
21.10.2015;10185.41992;10277.58984;10107.91992;10238.09961;70021400;10238.09961
20.10.2015;10174.79981;10194.53027;10080.19043;10147.67969;67235200;10147.67969
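Your format argument did not match the data, which is usually the cause of NAs when coercing strings to Date objects: the dates are written as day.month.year with numeric months, whereas "%Y-%b-%d" expects year, abbreviated month name, and day separated by hyphens. You can use this instead: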
R> as.Date(Df$Date, format = "%d.%m.%Y")
#[1] "2015-10-30" "2015-10-29" "2015-10-28" "2015-10-27" "2015-10-26"
#[6] "2015-10-23" "2015-10-22" "2015-10-21" "2015-10-20"
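To see the effect of the format string in isolation, here is a small self-contained sketch using two of the date strings from the sample file. The variable names bad and good are just illustrative.

```r
x <- c("30.10.2015", "29.10.2015")

# Format does not match the strings (wrong order, wrong separator, month name
# expected), so parsing fails and NA is returned for each element.
bad <- as.Date(x, format = "%Y-%b-%d")

# Format mirrors the strings: day, numeric month, four-digit year, dots between.
good <- as.Date(x, format = "%d.%m.%Y")

is.na(bad)
good
```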
Instead of attach, you can use alternatives such as within to avoid qualifying your column names. For example,
Df <- within(Df, {
  Date <- as.Date(Date, format = "%d.%m.%Y")
})
##
R> class(Df$Date)
#[1] "Date"
Data:
Df <- read.table(
text = "Date;Open;High;Low;Close;Volume;Adj Close
30.10.2015;10842.51953;10850.58008;10748.7002;10850.13965;89270000;10850.13965
29.10.2015;10867.19043;10886.98047;10741.13965;10800.83984;122513100;10800.83984
28.10.2015;10728.16016;10848.41016;10691.62988;10831.95996;0;10831.95996
27.10.2015;10761.37012;10807.41016;10692.19043;10692.19043;0;10692.19043
26.10.2015;10791.17969;10863.08984;10756.83008;10801.33984;73091500;10801.33984
23.10.2015;10610.33008;10847.46973;10586.95996;10794.54004;0;10794.54004
22.10.2015;10213.00977;10508.25;10194.74023;10491.96973;107511600;10491.96973
21.10.2015;10185.41992;10277.58984;10107.91992;10238.09961;70021400;10238.09961
20.10.2015;10174.79981;10194.53027;10080.19043;10147.67969;67235200;10147.67969",
header = TRUE, stringsAsFactors = FALSE, sep = ";")
I need to subset my data by a date range, below is the code.
I read in two .csv (data2010, data2), I changed the date format to exclude the timestamp, rename the headers so they are the same for both files, then merge(data2011).
The files seem to actually merge but when I subset by the date range, no observations are created.
However, the dates are grouped like 01/01/10 01/01/11 01/02/10 01/02/11, i.e. a same month/same day/different year pairing.
data2010 <- read.csv(file="2010final.csv")
data2 <- read.csv(file="2011final.csv")
#change format of timestamp to date with mm/dd/yyyy for 2011
data2$newdate <-strptime(as.character(data2$Date), "%m/%d/%y")
data2$Date <- format(data2$newdate, "%m/%d/%y")
data2$newdate <- NULL
#rename and format 2010
names(data2010) <- c("Region", "District", "Age", "Gender", "Marital Status", "Date", "Reason")
data2010$newdate <-strptime(as.character(data2010$Date), "%m/%d/%y %H")
data2010$Date <- format(data2010$newdate, "%m/%d/%y")
data2010$newdate <- NULL
#merge
data2011 <- rbind(data2010, data2)
summary(data2011)
str(data2011)
#I see from the above commands that the files have merged
jan6Before <- subset(data2011, Date >= "12/22/10" & Date <= "01/06/11")
summary(jan6Before)
str(jan6Before)
#But this does not produce any observations
I suspect it's because your Date variable is a character, not a date, being compared to another character constant, i.e. "12/22/10". Character comparison goes left to right, so "12/22/10" sorts after "01/06/11" and no value can be both >= the first and <= the second.
I suggest you have a look at the package lubridate. You can then easily convert character (in this case month-day-year) to dates to compare, e.g. mdy(Date) >= mdy("12/22/10").
Merge on your newdate variable, and use that for subsetting as well.
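The difference between comparing strings and comparing dates can be sketched in base R, with as.Date used here in place of lubridate's mdy. The three example values are made up; two of them fall inside the requested range.

```r
# Hypothetical mm/dd/yy strings, as produced by format(..., "%m/%d/%y").
Date <- c("12/25/10", "01/03/11", "02/01/11")

# Character comparison: nothing can satisfy both conditions at once.
n_string_matches <- sum(Date >= "12/22/10" & Date <= "01/06/11")

# Convert to real dates first, then compare against Date bounds.
d <- as.Date(Date, format = "%m/%d/%y")
keep <- d >= as.Date("2010-12-22") & d <= as.Date("2011-01-06")

n_string_matches
Date[keep]
```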