identify date format in R before converting - r

I have a simple data set which has a date column and a value column. I noticed that the date sometimes comes in as mmddyy (%m/%d/%y) format and other times in mmddYYYY (%m/%d/%Y) format. What is the best way to standardize the dates so that i can do other calculations without this formatting causing issues?
I tried the answers provided here
Changing date format in R
and here
How to change multiple Date formats in same column
Neither of these were able to fix the problem.
Below is a sample of the data
Date, Market
12/17/09,1.703
12/18/09,1.700
12/21/09,1.700
12/22/09,1.590
12/23/2009,1.568
12/24/2009,1.520
12/28/2009,1.500
12/29/2009,1.450
12/30/2009,1.450
12/31/2009,1.450
1/4/2010,1.440
When i read it into a new vector using something like this
dt <- as.Date(inp$Date, format="%m/%d/%y")
I get the following output for the above segment
dt Market
2009-12-17 1.703
2009-12-18 1.700
2009-12-21 1.700
2009-12-22 1.590
2020-12-23 1.568
2020-12-24 1.520
2020-12-28 1.500
2020-12-29 1.450
2020-12-30 1.450
2020-12-31 1.450
2020-01-04 1.440
As you can see we skipped from 2009 to 2020 at 12/23 because of change in formatting. Any help is appreciated. Thanks.

> dat$Date <- gsub("[0-9]{2}([0-9]{2})$", "\\1", dat$Date)
> dat$Date <- as.Date(dat$Date, format = "%m/%d/%y")
> dat
Date Market
# 1 2009-12-17 1.703
# 2 2009-12-18 1.700
# 3 2009-12-21 1.700
# 4 2009-12-22 1.590
# 5 2009-12-23 1.568
# 6 2009-12-24 1.520
# 7 2009-12-28 1.500
# 8 2009-12-29 1.450
# 9 2009-12-30 1.450
# 10 2009-12-31 1.450
# 11 2010-01-04 1.440

Related

Convert tibble to xts for analysis with performanceanalytics package

I have a tibble with a date and return column, that looks as follows:
> head(return_series)
# A tibble: 6 x 2
date return
<chr> <dbl>
1 2002-01 0.0292
2 2002-02 0.0439
3 2002-03 0.0240
4 2002-04 0.00585
5 2002-05 -0.0169
6 2002-06 -0.0686
I first add the day to the date column with the following code:
return_series$date <- as.Date(as.yearmon(return_series$date))
# A tibble: 6 x 2
date return
<date> <dbl>
1 2002-01-01 0.0292
2 2002-02-01 0.0439
3 2002-03-01 0.0240
4 2002-04-01 0.00585
5 2002-05-01 -0.0169
6 2002-06-01 -0.0686
My goal is to convert the return_series tibble to xts data to use it for further analysis with the PerformanceAnalytics package. But when I use the command as.xts I receive the following error:
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
How can I change the format to xts or is there an other possibility to work with the PerformanceAnalytics package instead of converting to xts?
Thank you very much for your help!
You need to follow the xts documentation more closely:
> tb <- as_tibble(data.frame(date=as.Date("2002-01-01") + (0:5)*30,
+ return=rnorm(6)))
> tb
# A tibble: 6 × 2
date return
<date> <dbl>
1 2002-01-01 0.223
2 2002-01-31 -0.352
3 2002-03-02 0.149
4 2002-04-01 1.42
5 2002-05-01 -1.04
6 2002-05-31 0.507
>
> x <- xts(tb[,-1], order.by=as.POSIXct(tb[[1]]))
> x
return
2001-12-31 18:00:00 0.222619
2002-01-30 18:00:00 -0.352288
2002-03-01 18:00:00 0.149319
2002-03-31 18:00:00 1.421967
2002-04-30 19:00:00 -1.035087
2002-05-30 19:00:00 0.507046
>
An xts object prefers a POSIXct datetime object, which you can convert from a Date object. For a (closely-related) zoo object you could keep Date.

Summarise a vector and then append the summary statistics to the original dataframe in R

Intro:
I would like to compute the mean, standard deviation, and standard error of a numeric vector in a given dataframe and then create three new vectors using these summary statistics. I then need to combine them with the original dataframe.
Example Code:
## Creating our dataframe:
datetime <- c("5/12/2017 16:15:00","5/16/2017 16:45:00","5/19/2017 17:00:00")
datetime <- as.POSIXct(datetime, format = "%m/%d/%Y %H:%M:%S")
values <- c(1,2,3)
df <- data.frame(datetime, values)
## Here's the current output:
head(df)
datetime values
1 2017-05-12 16:15:00 1
2 2017-05-16 16:45:00 2
3 2017-05-19 17:00:00 3
## And here's the desired output:
head(df1)
datetime values mean sd se
1 2017-05-12 16:15:00 1 2 0.816 0.471
2 2017-05-16 16:45:00 2 2 0.816 0.471
3 2017-05-19 17:00:00 3 2 0.816 0.471
Thanks in advance!
For those who are curious as to why I am trying to do this, I am following this tutorial. I need to make one of those line graph plots with errorbars for some calibrations between a low-cost sensor and an expensive reference instrument.
You can do the assignment simultaneously. Suppose you already have the helper function for you choice of sd and se:
sd0 <- function(x){sd(x) / sqrt(length(x)) * sqrt(length(x) - 1)}
se0 <- function(x){ sd0(x) / sqrt(length(x))}
Then you can try:
df[c('mean', 'sd', 'se')] <- lapply(list(mean, sd0, se0), function(f) f(df$values))
# > df
# datetime values mean sd se
# 1 2017-05-12 16:15:00 1 2 0.8164966 0.4714045
# 2 2017-05-16 16:45:00 2 2 0.8164966 0.4714045
# 3 2017-05-19 17:00:00 3 2 0.8164966 0.4714045
Here is the dplyr solution, with sd0 and se0 given in mt1022's answer:
df %>% mutate("mean"=mean(values),"sd"=sd0(values),"se"=se0(values))

How can I filter specifically for certain months if the days are not the same in each year?

This is probably a very simple question that has been asked already but..
I have a data frame that I have constructed from a CSV file generated in excel. The observations are not homogeneously sampled, i.e they are for "On Peak" times of electricity usage. That means they exclude different days each year. I have 20 years of data (1993-2012) and am running both non Robust and Robust LOESS to extract seasonal and linear trends.
After the decomposition has been done, I want to focus only on the observations from June through September.
How can I create a new data frame of just those results?
Sorry about the formatting, too.
Date MaxLoad TMAX
1 1993-01-02 2321 118.6667
2 1993-01-04 2692 148.0000
3 1993-01-05 2539 176.0000
4 1993-01-06 2545 172.3333
5 1993-01-07 2517 177.6667
6 1993-01-08 2438 157.3333
7 1993-01-09 2302 152.0000
8 1993-01-11 2553 144.3333
9 1993-01-12 2666 146.3333
10 1993-01-13 2472 177.6667
As Joran notes, you don't need anything other than base R:
## Reproducible data
df <-
data.frame(Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by="month"),
MaxLoad = floor(runif(25,2000,3000)), TMAX=runif(25,100,200))
## One option
df[months(df$Date) %in% month.name[6:9],]
# Date MaxLoad TMAX
# 4 2009-06-15 2160 188.4607
# 5 2009-07-15 2151 164.3946
# 6 2009-08-15 2694 110.4399
# 7 2009-09-15 2460 150.4076
# 16 2010-06-15 2638 178.8341
# 17 2010-07-15 2246 131.3283
# 18 2010-08-15 2483 112.2635
# 19 2010-09-15 2174 160.9724
## Another option: strftime() will be more _generally_ useful than months()
df[as.numeric(strftime(df$Date, "%m")) %in% 6:9,]

R Search for a particular time from index

I use an xts object. The index of the object is as below. There is one for every hour of the day for a year.
"2011-01-02 18:59:00 EST"
"2011-01-02 19:58:00 EST"
"2011-01-02 20:59:00 EST"
In columns are values associated with each index entry. What I want to do is calculate the standard deviation of the value for all Mondays at 18:59 for the complete year. There should be 52 values for the year.
I'm able to search for the day of the week using the weekdays() function, but my problem is searching for the time, such as 18:59:00 or any other time.
You can do this by using interaction to create a factor from the combination of weekdays and .indexhour, then use split to select the relevant observations from your xts object.
set.seed(21)
x <- .xts(rnorm(1e4), seq(1, by=60*60, length.out=1e4))
groups <- interaction(weekdays(index(x)), .indexhour(x))
output <- lapply(split(x, groups), function(x) c(count=length(x), sd=sd(x)))
output <- do.call(rbind, output)
head(output)
# count sd
# Friday.0 60 1.0301030
# Monday.0 59 0.9204670
# Saturday.0 60 0.9842125
# Sunday.0 60 0.9500347
# Thursday.0 60 0.9506620
# Tuesday.0 59 0.8972697
You can use the .index* family of functions (don't forget the '.' in front of 'index'!):
fxts[.indexmon(fxts)==0] # its zero-based (!) and gives you all the January values
fxts[.indexmday(fxts)==1] # beginning of month
fxts[.indexwday(SPY)==1] # Mondays
require(quantmod)
> fxts
value
2011-01-02 19:58:00 1
2011-01-02 20:59:00 2
2011-01-03 18:59:00 3
2011-01-09 19:58:00 4
2011-01-09 20:59:00 5
2011-01-10 18:59:00 6
2011-01-16 18:59:00 7
2011-01-16 19:58:00 8
2011-01-16 20:59:00 9`
fxts[.indexwday(fxts)==1] #this gives you all the Mondays
for subsetting the time you use
fxts["T19:30/T20:00"] # this will give you the time period you are looking for
and here you combine weekday and time period
fxts["T18:30/T20:00"] & fxts[.indexwday(fxts)==1] # to get a logical vector or
fxts["T18:30/T21:00"][.indexwday(fxts["T18:30/T21:00"])==1] # to get the values
> value
2011-01-03 18:58:00 3
2011-01-10 18:59:00 6

which() with objects of type character

I have a questions that might be too basic, but here it is...
I want to extract monthly data from a dataset like this:
Date Obs
1 2001-01-01 120
2 2001-01-02 100
3 2001-01-03 150
4 2001-01-04 175
5 2001-01-05 121
6 2001-01-06 100
I just want to get the rows from the data where I have a certain month(e.g. January), this works perfectly:
output=which(strftime(dataset[,1],"%m")=="01",dataset[,1])
However when I try to create a loop to go through all the months using a variable that is declared has character it doesn't work and I only get "FALSE".
value=as.character(k)
output=which(strftime(dataset[,1],"%m")==value,dataset[,1])
Do not parse dates as strings. That is too error prone. Parse dates as dates, and do logical comparisons on them.
Here is one approach, creating January to March data and sub-setting February based on a comparison:
R> output <- data.frame(date=seq(as.Date("2011-01-01"), by=7, length=10),
+ value=cumsum(runif(10)*100))
R> output
date value
1 2011-01-01 8.29916
2 2011-01-08 44.82950
3 2011-01-15 72.08662
4 2011-01-22 134.19277
5 2011-01-29 221.67744
6 2011-02-05 245.77195
7 2011-02-12 314.82081
8 2011-02-19 396.34661
9 2011-02-26 437.14286
10 2011-03-05 442.41321
R> output[ output[,"date"] >= as.Date("2011-02-01") &
+ output[,"date"] <= as.Date("2011-02-28"), ]
date value
6 2011-02-05 245.772
7 2011-02-12 314.821
8 2011-02-19 396.347
9 2011-02-26 437.143
R>
Another approach uses the xts package:
R> oo <- xts(output[,"value"], order.by=output[,"date"])
R> oo
[,1]
2011-01-01 8.29916
2011-01-08 44.82950
2011-01-15 72.08662
2011-01-22 134.19277
2011-01-29 221.67744
2011-02-05 245.77195
2011-02-12 314.82081
2011-02-19 396.34661
2011-02-26 437.14286
2011-03-05 442.41321
R> oo["2011-02-01::2011-02-28"]
[,1]
2011-02-05 245.772
2011-02-12 314.821
2011-02-19 396.347
2011-02-26 437.143
R>
as xts has convenient date parsing for the index; see the package documentation for details.
I'm assuming k is an integer in 1:12. I suspect you may be better off using abbreviated month names:
value <- month.abb[k]
output <- which(strftime(dataset[,1],"%b")==value,dataset[,1])
The reason you way isn't working is because the month number is zero-padded and "1" != "01".
You can also use dates as dates with POSIXlt()$mon
as.POSIXlt(output$date)$mon # Note that Jan = 0 and Feb=1
[1] 0 0 0 0 0 1 1 1 1 2
There are several other packages such as chron, lubridate and gdata that provide date handling functions. I found the functions in lubridate particularly intuitive and less prone to errors in my clumsy hands.

Resources