I have data with dates in a format that is not directly usable. The data are either annual, quarterly or monthly. Annual dates are stored correctly, quarterly dates are in the form 1Q2010, and monthly dates are in the form JAN2010.
So something like
library(tidyverse)
library(data.table)
MWE <- data.table(date=c("JAN2020","FEB2020","1Q2020","2020"),
value=rnorm(4,2,1))
> MWE
date value
1: JAN2020 2.5886057
2: FEB2020 0.5913031
3: 1Q2020 1.6237973
4: 2020 1.4093762
I want to have them in a standard format. I think a decently readable way to do that is to replace the non-standard elements, so that these elements:
Date_Brute <- c("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC","1Q","2Q","3Q","4Q")
Replaced by these ones
Date_Standardisee <- c("01-01","01-02","01-03","01-04","01-05","01-06","01-07", "01-08","01-09","01-10","01-11","01-12","01-01","01-04","01-07","01-10")
Now I think gsub does not work with vectors of patterns. I have found this answer that suggests using stringr::str_replace_all, but I have not been able to make it work in a data.table.
I am open to other functions that replace one vector by another, but would like to avoid, for instance, slicing the data and using specific date-parsing functions.
Desired output :
> MWE
date value
1: 01-01-2020 2.5886057
2: 01-02-2020 0.5913031
3: 01-01-2020 1.6237973
4: 2020 1.4093762
You can try lubridate::parse_date_time(), which takes a vector of candidate formats to attempt in the conversion:
library(lubridate)
library(data.table)
MWE[, date := parse_date_time(date, orders = c("bY","qY", "Y"))]
date value
1: 2020-01-01 -0.4948354
2: 2020-02-01 1.0227036
3: 2020-01-01 2.6285688
4: 2020-01-01 1.9158595
We can use grep with as.yearqtr and as.yearmon to convert those 'date' elements into Date class and further change it to the specified format
library(zoo)
library(data.table)
MWE[grep('Q', date), date := format(as.Date(as.yearqtr(date,
'%qQ %Y')), '%d-%m-%Y')]
MWE[grep("[A-Z]", date), date := format(as.Date(as.yearmon(date)), '%d-%m-%Y')]
Output:
MWE
# date value
#1: 01-01-2020 0.8931051
#2: 01-02-2020 2.9813625
#3: 01-01-2020 1.1918638
#4: 2020 2.8001267
Another option is fcoalesce (from data.table) with myd from lubridate:
library(lubridate)
MWE[, date := fcoalesce(format(myd(date, truncated = 2), '%d-%m-%Y'), date)]
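For the vector-for-vector replacement that the question originally asked about, here is a minimal sketch using stringr::str_replace_all with a named vector inside a data.table. The name lookup is only illustrative, the values in MWE are fixed numbers rather than random draws, and two details are assumptions added here: the patterns are anchored with "^", and a trailing "-" is appended to each replacement so that the result matches the desired dd-mm-yyyy output.

```r
library(data.table)
library(stringr)

MWE <- data.table(date  = c("JAN2020", "FEB2020", "1Q2020", "2020"),
                  value = c(2.5, 0.6, 1.6, 1.4))

Date_Brute <- c("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP",
                "OCT","NOV","DEC","1Q","2Q","3Q","4Q")
Date_Standardisee <- c("01-01","01-02","01-03","01-04","01-05","01-06",
                       "01-07","01-08","01-09","01-10","01-11","01-12",
                       "01-01","01-04","01-07","01-10")

# Named vector: names are regex patterns (anchored at the start),
# values are the replacements with a trailing "-" before the year.
lookup <- setNames(paste0(Date_Standardisee, "-"), paste0("^", Date_Brute))

# str_replace_all applies every pattern/replacement pair to the column.
MWE[, date := str_replace_all(date, lookup)]
MWE$date
# [1] "01-01-2020" "01-02-2020" "01-01-2020" "2020"
```

Annual values such as "2020" match none of the patterns and are left untouched.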
I am trying to compare two dataframe date columns.
In the first dataframe
Name DOB
Alex 25071986
Jane 14122002
Sujan 28021999
The DOB in ddmmyyyy format.
In the other dataframe
Name DOB
Alex 0250786
Jane 1141202
Sujan 0280299
The DOB is in cddmmyy format.
Here c represents the number of centuries elapsed since 1900. So for 1986 it is 0, for 2002 it is 1, and so on.
What I have done so far is:
1) abc <- lubridate::dmy(df1[,DOB])
which shows abc in YYYY-MM-DD format.
2) a <- strftime(abc, format = "%C%d%m%y")
which gives me CCDDMMYY for example for 2016-12-11 it gives 20111216
Which is not what I need, I need it to be 1111216 (CDDMMYY).
Can someone help?
As suggested by @Rui Barradas,
sprintf("%d%s", lubridate::year(abc) %/% 100 - 19, strftime(abc, format = "%d%m%y"))
did the job for me.
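For reference, a base R sketch of the whole conversion from ddmmyyyy to cddmmyy, using the three DOB values from the question. The variable names are illustrative, and it assumes DOB is held as character so that leading zeros are not lost.

```r
# DOB values in ddmmyyyy form, kept as character
dob <- c("25071986", "14122002", "28021999")

# Parse as day-month-year
d <- as.Date(dob, format = "%d%m%Y")

# Century digit: 0 for 19xx, 1 for 20xx, and so on
century <- as.integer(format(d, "%Y")) %/% 100L - 19L

# Prepend the century digit to ddmmyy
cddmmyy <- paste0(century, format(d, "%d%m%y"))
cddmmyy
# [1] "0250786" "1141202" "0280299"
```

The result can then be compared directly with the DOB column of the second data frame, provided that column is also stored as character.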
I have a data frame
> df
Age year sex
12 80210 F
13 9123 M
I want to convert the year value 80210 to 26june1982. How can I convert the Julian days so that the new data frame contains year in day-month-year format?
You can convert Julian dates to dates using as.Date and specifying the appropriate origin:
as.Date(8210, origin=as.Date("1960-01-01"))
#[1] "1982-06-24"
However, 80210 needs an origin pretty long ago.
You should subtract the origin from the year column.
as.Date(c(80210,9123)-80210,origin='1982-06-26')
[1] "1982-06-26" "1787-11-08"
There are some options for doing this job in the R package date.
See for example on page 4, the function date.mmddyy, which says:
Given a vector of Julian dates, this returns them in the form “10/11/89”, “28/7/54”, etc.
Try this code:
age = c(12,13)
year = c(8210,9123)
sex = c("F","M")
df = data.frame(age, year, sex) # without cbind(), age and year stay numeric
library(date)
date = date.mmddyy(year, sep = "/")
df2 = transform(df,year=date) #hint provided by jilber
df2
age year sex
1 12 6/24/82 F
2 13 12/23/84 M
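The same conversion can be sketched in base R only, which also gives a way to check which origin a day count refers to. This assumes the counts are days since 1960-01-01, as in the answers above; the question's expected result of 26 June 1982 would need a different origin.

```r
origin <- as.Date("1960-01-01")   # assumed origin of the day counts
days   <- c(8210, 9123)

# Day counts to Date objects
dates <- as.Date(days, origin = origin)
format(dates, "%d-%m-%Y")
# [1] "24-06-1982" "23-12-1984"

# Round trip: Date minus origin gives the day counts back
as.numeric(dates - origin)
# [1] 8210 9123
```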
I have questionnaire data where participants have inputted their date of birth in a wide variety of formats:
ID <- c(101,102,103,104,105,106,107)
dob <- c("20/04/2001","29/10/2000","September 1 2012","15/11/00","20.01.1999","April 20th 1999", "04/08/01")
df <- data.frame(ID, dob)
Before doing any analysis, I need to be able to subset the data when it is not in the correct format (i.e. dd/mm/yr) and then correct each cell in turn manually.
I tried using:
df$dob <- strptime(dob, "%d/%m/%Y")
... to highlight which of my dates were in the correct format, but I just get NAs for the dates that are inputted incorrectly which is not helpful if I want to subsequently change them (using the ID as a reference).
Does anyone have any ideas which may be able to help me?
Check out the lubridate package.
library(lubridate)
parse_date_time(dob, c("dmy", "Bdy"))
# [1] "2001-04-20 UTC" "2000-10-29 UTC" "2012-09-01 UTC" "0000-11-15 UTC" "1999-01-20 UTC"
# [6] "1999-04-20 UTC" "0001-08-04 UTC"
Disclaimer: I'm not sure if I understood your question completely.
In the snippet below, dob2 will contain a date or NA based on whether dob is in the correct format. You should be able to filter for is.na(dob2) to get the incorrect data. Note that 03/04/2013 can be interpreted as 3rd April or 4th March, but you seem to be assuming it to be 3rd April so I went with it.
library(data.table)
ID <- c(101,102,103,104,105,106,107)
dob <- c("20/04/2001","29/10/2000","September 1 2012","15/11/00","20.01.1999","April 20th 1999", "04/08/01")
df <- data.table(ID, dob)
df[,dob2 := as.Date(dob, "%d/%m/%Y")]
EDIT- added output. btw, you could also have done something like df[is.na(as.Date(dob, "%d/%m/%Y"))]
ID dob dob2
1: 101 20/04/2001 2001-04-20
2: 102 29/10/2000 2000-10-29
3: 103 September 1 2012 <NA>
4: 104 15/11/00 0000-11-15
5: 105 20.01.1999 <NA>
6: 106 April 20th 1999 <NA>
7: 107 04/08/01 0001-08-04
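Note that in the output above the two-digit years (IDs 104 and 107) are parsed as years 0000 and 0001 instead of being flagged. A minimal base R sketch that also catches those combines a strict pattern check with the as.Date() parse. The column name ok and the object to_fix are illustrative, and it assumes "correct" means exactly dd/mm/yyyy.

```r
ID  <- c(101, 102, 103, 104, 105, 106, 107)
dob <- c("20/04/2001", "29/10/2000", "September 1 2012", "15/11/00",
         "20.01.1999", "April 20th 1999", "04/08/01")
df  <- data.frame(ID, dob)

# Strict shape check: two digits / two digits / four digits
strict <- grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", df$dob)

# Calendar check: the string must also parse as a real date
parsed <- as.Date(df$dob, format = "%d/%m/%Y")

df$ok  <- strict & !is.na(parsed)

# Rows that need manual correction, with ID as the reference
to_fix <- df[!df$ok, ]
to_fix$ID
# [1] 103 104 105 106 107
```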
I need to convert a date (m/d/y format) into 3 separate columns on which I hope to run an algorithm. (I'm trying to convert my dates into Julian Day Numbers.) I saw this suggestion for another user for separating data out into multiple columns using Oracle. I'm using R and am thoroughly stuck on how to code this appropriately. Would A1, A2... represent my new column headings, and what would the format difference be with the "update set" section?
update <tablename> set A1 = substr(ORIG, 1, 4),
A2 = substr(ORIG, 5, 6),
A3 = substr(ORIG, 11, 6),
A4 = substr(ORIG, 17, 5);
I'm trying hard to improve my skills in R but cannot figure this one...any help is much appreciated. Thanks in advance... :)
I use the format() method for Date objects to pull apart dates in R. Using Dirk's datetxt, here is how I would go about breaking up a date into its constituent parts:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
datetxt <- as.Date(datetxt)
df <- data.frame(date = datetxt,
year = as.numeric(format(datetxt, format = "%Y")),
month = as.numeric(format(datetxt, format = "%m")),
day = as.numeric(format(datetxt, format = "%d")))
Which gives:
> df
date year month day
1 2010-01-02 2010 1 2
2 2010-02-03 2010 2 3
3 2010-09-10 2010 9 10
Note what several others have said; you can get the Julian dates without splitting out the various date components. I added this answer to show how you could do the breaking apart if you needed it for something else.
Given a text variable x, like this:
> x
[1] "10/3/2001"
then:
> as.Date(x,"%m/%d/%Y")
[1] "2001-10-03"
converts it to a date object. Then, if you need it:
> julian(as.Date(x,"%m/%d/%Y"))
[1] 11598
attr(,"origin")
[1] "1970-01-01"
gives you a Julian date (relative to 1970-01-01).
Don't try the substring thing...
See help(as.Date) for more.
Quick ones:
Julian date converters already exist in base R, see eg help(julian).
One approach may be to parse the date as a POSIXlt and to then read off the components. Other date / time classes and packages will work too but there is something to be said for base R.
Parsing dates as string is almost always a bad approach.
Here is an example:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
dates <- as.Date(datetxt) ## you could examine these as well
plt <- as.POSIXlt(dates) ## now as POSIXlt types
plt[["year"]] + 1900 ## years are with offset 1900
#[1] 2010 2010 2010
plt[["mon"]] + 1 ## and months are on the 0 .. 11 interval
#[1] 1 2 9
plt[["mday"]]
#[1] 2 3 10
df <- data.frame(year=plt[["year"]] + 1900,
month=plt[["mon"]] + 1, day=plt[["mday"]])
df
# year month day
#1 2010 1 2
#2 2010 2 3
#3 2010 9 10
And of course
julian(dates)
#[1] 14611 14643 14862
#attr(,"origin")
#[1] "1970-01-01"
To convert a date (m/d/y format) into 3 separate columns, consider the df:
df <- data.frame(date = c("01-02-18", "02-20-18", "03-23-18"))
df
date
1 01-02-18
2 02-20-18
3 03-23-18
Convert to date format
df$date <- as.Date(df$date, format="%m-%d-%y")
df
date
1 2018-01-02
2 2018-02-20
3 2018-03-23
To get three separate columns with year, month and day:
library(lubridate)
df$year <- year(ymd(df$date))
df$month <- month(ymd(df$date))
df$day <- day(ymd(df$date))
df
date year month day
1 2018-01-02 2018 1 2
2 2018-02-20 2018 2 20
3 2018-03-23 2018 3 23
Hope this helps.
Hi Gavin: another way [using your idea] is:
The data-frame we will use is oilstocks which contains a variety of variables related to the changes over time of the oil and gas stocks.
The variables are:
colnames(stocks)
"bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC"
"emMN" "emMN.1" "chdate" "chV" "cbO" "chC" "chMN" "chMX"
One of the first things to do is change the emdate field, which is an integer vector, into a date vector.
realdate<-as.Date(emdate,format="%m/%d/%Y")
Next we want to split emdate column into three separate columns representing month, day and year using the idea supplied by you.
dfdate <- data.frame(date=realdate)
year <- as.numeric(format(realdate,"%Y"))
month <- as.numeric(format(realdate,"%m"))
day <- as.numeric(format(realdate,"%d"))
ls() will include the individual vectors, day, month, year and dfdate.
Now merge the dfdate, day, month, year into the original data-frame [stocks].
ostocks<-cbind(dfdate,day,month,year,stocks)
colnames(ostocks)
"date" "day" "month" "year" "bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC" "emMN" "emMX" "chdate" "chV"
"cbO" "chC" "chMN" "chMX"
Similar results and I also have date, day, month, year as separate vectors outside of the df.
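A compact sketch of the same steps on a small stand-in data frame. The two-row stocks object below is hypothetical, and it assumes emdate is stored as "mm/dd/yyyy" text (or a factor of such strings); if it really holds plain integers the parsing step would need to change.

```r
# Hypothetical stand-in for the stocks data frame
stocks <- data.frame(emdate = c("01/02/2010", "02/03/2010"),
                     emV    = c(100, 200))

# as.character() also covers the case where emdate is a factor
realdate <- as.Date(as.character(stocks$emdate), format = "%m/%d/%Y")

# Build the date parts and bind them in front of the original columns
ostocks <- data.frame(date  = realdate,
                      day   = as.numeric(format(realdate, "%d")),
                      month = as.numeric(format(realdate, "%m")),
                      year  = as.numeric(format(realdate, "%Y")),
                      stocks)
colnames(ostocks)
# [1] "date"   "day"    "month"  "year"   "emdate" "emV"
```

Referring to stocks$emdate explicitly avoids relying on the data frame having been attached.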