How to standardize datetime format in R [duplicate] - r

A sample of my dataframe:
date
1 25 February 1987
2 20 August 1974
3 9 October 1984
4 18 August 1992
5 19 September 1995
6 16-Oct-63
7 30-Sep-65
8 22 Jan 2008
9 13-11-1961
10 18 August 1987
11 15-Sep-70
12 5 October 1994
13 5 December 1984
14 03/23/87
15 30 August 1988
16 26-10-1993
17 22 August 1989
18 13-Sep-97
I have a large dataframe with a date variable that contains dates in multiple formats. Most of the formats are shown above; there are a couple of very rare others too. The formats differ because the data were pulled together from various websites that each used a different format.
I have tried using straightforward conversions e.g.
strftime(mydf$date,"%d/%m/%Y")
but this sort of conversion will not work when there are multiple formats. I don't want to resort to multiple gsub-type edits. Am I missing a simpler solution?
Code for example:
structure(list(date = structure(c(12L, 8L, 18L, 6L, 7L, 4L, 14L,
10L, 1L, 5L, 3L, 17L, 16L, 11L, 15L, 13L, 9L, 2L), .Label = c("13-11-1961",
"13-Sep-97", "15-Sep-70", "16-Oct-63", "18 August 1987", "18 August 1992",
"19 September 1995", "20 August 1974", "22 August 1989", "22 Jan 2008",
"03/23/87", "25 February 1987", "26-10-1993", "30-Sep-65", "30 August 1988",
"5 December 1984", "5 October 1994", "9 October 1984"), class = "factor")), .Names = "date", row.names = c(NA,
-18L), class = "data.frame")

You may try parse_date_time in package lubridate which "allows the user to specify several format-orders to handle heterogeneous date-time character representations" using the orders argument. Something like...
library(lubridate)
parse_date_time(x = df$date,
orders = c("d m y", "d B Y", "m/d/y"),
locale = "eng")
...should be able to handle most of your formats. Please note that b/B formats are locale sensitive.
Other date-time formats which can be used in orders are listed in the Details section in ?strptime.

Here is a base solution:
fmts <- c("%d-%b-%y", "%d %b %Y", "%d-%m-%Y", "%m/%d/%y")
d <- as.Date(as.numeric(apply(outer(DF$date, fmts, as.Date), 1, na.omit)), "1970-01-01")
We have made the simplifying assumption that exactly 1 format works for each input date. That seems to be the case in the example, but if not, replace na.omit with function(x) c(na.omit(x), NA)[1], which keeps the first format that parses and returns NA when none does.
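If more than one format could match, or none does, a plain loop that keeps the first successful parse is another way to write the same idea. This is only a sketch: parse_mixed is a made-up helper name, not a base R function, and month names are still matched in the current locale.

```r
# Hypothetical helper: try each format in turn and keep the first one that parses.
parse_mixed <- function(x, fmts) {
  x <- as.character(x)                 # factors become plain strings
  out <- rep(as.Date(NA), length(x))   # start with all-NA Dates
  for (f in fmts) {
    todo <- is.na(out)                 # only retry entries not parsed yet
    out[todo] <- as.Date(x[todo], format = f)
  }
  out
}

fmts <- c("%d-%b-%y", "%d %b %Y", "%d-%m-%Y", "%m/%d/%y")
parse_mixed(c("25 February 1987", "13-11-1961", "03/23/87"), fmts)
```

Entries that match none of the formats stay NA, so which(is.na(result)) shows the rare formats that still need their own entry in fmts. The order of fmts matters when a string could be read by more than one format.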
Note that a two digit year can be ambiguous but here it seems it should always be in the past so we subtract 100 years if not:
past <- function(x) ifelse(x > Sys.Date(), seq(from=x, length=2, by="-100 year")[2], x)
as.Date(sapply(d, past), "1970-01-01")
For the sample data the last line gives:
[1] "1987-02-25" "1974-08-20" "1984-10-09" "1992-08-18" "1995-09-19"
[6] "1963-10-16" "1965-09-30" "2008-01-22" "1961-11-13" "1987-08-18"
[11] "1970-09-15" "1994-10-05" "1984-12-05" "1987-03-23" "1988-08-30"
[16] "1993-10-26" "1989-08-22" "1997-09-13"
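The century correction can also be written without sapply. The sketch below assumes, as the answer does, that every date should lie in the past; fix_century and its today argument are names invented for this example.

```r
# Hypothetical helper: move any date later than `today` back by 100 years.
fix_century <- function(d, today = Sys.Date()) {
  future <- !is.na(d) & d > today      # NA entries are left untouched
  lt <- as.POSIXlt(d[future])          # broken-down time, year is editable
  lt$year <- lt$year - 100L
  d[future] <- as.Date(lt)
  d
}

# "%y" reads 63 as 2063, so the first value needs the correction
d <- as.Date(c("16-Oct-63", "13-Sep-97"), format = "%d-%b-%y")
fix_century(d)
```

Passing today explicitly makes the result reproducible, since the default depends on the day the code is run.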

Try writing a function and calling it later. For example, if you have a character string in "dd-mm-yyyy" format and only want to extract the month from it:
month <- function(date_var){
# Convert the input to Date, then extract the month as a two-digit string
date_val <- as.Date(date_var, format = "%d-%m-%Y")
format(date_val, "%m")
}
Now call it on your vector; the function converts the character strings to Date itself. The output here is "04":
month("12-04-2014")

Related

Data frame with date field as a factor and mix of date values

I have a data frame with a date field stored as a factor, containing a mix of formats as shown below. How can I standardize it, convert it to a date format, and extract the month and year?
1 Oct 24 2013 3:59PM
2 Nov 5 2013 3:00PM
3 Nov 26 2013 1:00PM
4 2015-05-05 21:09:00
5 Nov 19 2013 1:00PM
6 2015-05-28 20:23:00
7 2015-05-28 20:24:00
8 Nov 12 2013 1:00PM
9 2015-05-28 20:29:00
10 2015-05-28 20:26:00
See if data can be parsed with one format (the default for as.POSIXct.factor in this case) and then try the other if unsuccessful:
dats$dt2 <- as.POSIXct( # sapply() drops the POSIXct class and returns numeric seconds, so convert back
  sapply(trim(dats$dt), # sjmisc::trim() only needed if there are extra spaces
    function(d) if( !is.na( strptime(d, "%Y-%m-%d %H:%M:%S") ) ){
      as.POSIXct(d) } else {
      # %p (AM/PM) only works together with %I (12-hour clock), not %H
      as.POSIXct( d, format="%b %d %Y %I:%M%p") }), origin="1970-01-01" )
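The same two-format idea can be sketched without sapply by parsing the whole vector once per format and filling the gaps. The two-element vector x and the choice of tz = "UTC" are assumptions made for the example; note that %p (AM/PM) is paired with %I here.

```r
# Two of the sample values, one per format
x <- c("Oct 24 2013 3:59PM", "2015-05-05 21:09:00")

# Parse every element with each format; non-matching elements become NA
iso  <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
ampm <- as.POSIXct(x, format = "%b %d %Y %I:%M%p", tz = "UTC")

# Keep the ISO parse where it worked, otherwise fall back to the AM/PM parse
dt <- iso
dt[is.na(dt)] <- ampm[is.na(dt)]

# Month and year as character strings
format(dt, "%m")
format(dt, "%Y")
```

For a factor column, wrap it in as.character() first. Month abbreviations and AM/PM markers are read in the current locale.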
You could try parse_date_time() from the lubridate package. I find it makes dealing with multiple formats a lot easier. It's just a matter of fiddling around with the orders argument. Here we can use c("mdyR", "ymdT") for our orders vector.
library(lubridate)
parse_date_time(df$V1, c("mdyR", "ymdT"))
# [1] "2013-10-24 15:59:00 UTC" "2013-11-05 15:00:00 UTC"
# [3] "2013-11-26 13:00:00 UTC" "2015-05-05 21:09:00 UTC"
# [5] "2013-11-19 13:00:00 UTC" "2015-05-28 20:23:00 UTC"
# [7] "2015-05-28 20:24:00 UTC" "2013-11-12 13:00:00 UTC"
# [9] "2015-05-28 20:29:00 UTC" "2015-05-28 20:26:00 UTC"
To extract the month and year, we can do the following.
pdt <- parse_date_time(df$V1, c("mdyR", "ymdT"))
month(pdt)
# [1] 10 11 11 5 11 5 5 11 5 5
year(pdt)
# [1] 2013 2013 2013 2015 2013 2015 2015 2013 2015 2015
Data:
df <- structure(list(V1 = structure(c(10L, 9L, 8L, 1L, 7L, 2L, 3L,
6L, 5L, 4L), .Label = c("2015-05-05 21:09:00", "2015-05-28 20:23:00",
"2015-05-28 20:24:00", "2015-05-28 20:26:00", "2015-05-28 20:29:00",
"Nov 12 2013 1:00PM", "Nov 19 2013 1:00PM", "Nov 26 2013 1:00PM",
"Nov 5 2013 3:00PM", "Oct 24 2013 3:59PM"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-10L))

Sum by months of the year with decades of data in R

I have a dataframe with some monthly data for 2 decades:
year month value
1960 January 925
1960 February 903
1960 March 1006
...
1969 December 892
1970 January 990
1970 February 866
...
1979 December 120
I would like to create a dataframe where I sum up the totals, for each decade, by month, as follows:
year month value
decade_60s January 4012
decade_60s February 8678
decade_60s March 9317
...
decade_60s December 3995
decade_70s January 8005
decade_70s February 9112
...
decade_70s December 325
I have been looking at the aggregate function, but this doesn't appear to be the right option.
I looked instead at some careful subsetting using the which function but this quickly became too messy.
For this kind of problem, what would be the correct approach? Will I need to use apply at some point, and if so, how?
I feel the temptation to use a for loop growing, but I don't think this would be the best way to improve my skills in R.
Thanks for the advice.
PS: The month value is an ordinal factor, if this matters.
aggregate is one way to go using base R.
First define the decade:
yourdata$decade <- cut(yourdata$year, breaks=c(1960,1970,1980), labels=c(60,70),
include.lowest=TRUE, right=FALSE)
Then aggregate the data:
aggregate(value ~ decade + month, data=yourdata , sum)
Then order the result by decade and month to get the required output.
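Put together, the steps might look like the sketch below. The five rows of yourdata are made-up toy values, not the asker's data, and the decade label is built with integer division rather than cut() so that it matches the "decade_60s" style asked for.

```r
# Toy data (invented for illustration); month is an ordered factor as in the question
yourdata <- data.frame(
  year  = c(1960, 1960, 1965, 1970, 1979),
  month = factor(c("January", "February", "January", "January", "January"),
                 levels = month.name, ordered = TRUE),
  value = c(925, 903, 100, 990, 10)
)

# 1965 -> 1960 -> 60 -> "decade_60s"
yourdata$decade <- paste0("decade_", (yourdata$year %/% 10 * 10) %% 100, "s")

# Sum value within each decade/month pair
res <- aggregate(value ~ decade + month, data = yourdata, FUN = sum)

# Sort by decade, then by calendar month (the factor levels give the order)
res <- res[order(res$decade, res$month), ]
res
```

Because month is an ordered factor with levels in calendar order, order() sorts January before February instead of alphabetically.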
plyr's count + gsub are definitely your friends here:
library(plyr)
dat <- structure(list(year = c(1960L, 1960L, 1960L, 1969L, 1970L, 1970L, 1979L),
month = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 1L),
.Label = c("December", "February", "January", "March"),
class = "factor"),
value = c(925L, 903L, 1006L, 892L, 990L, 866L, 120L)),
.Names = c("year", "month", "value"),
class = "data.frame", row.names = c(NA, -7L))
dat$decade <- gsub("[0-9]$", "0", dat$year)
count(dat, .(decade, month), wt_var=.(value))
## decade month freq
## 1 1960 December 892
## 2 1960 February 903
## 3 1960 January 925
## 4 1960 March 1006
## 5 1970 December 120
## 6 1970 February 866
## 7 1970 January 990
