Conversion with as.Date() of CSV file fails - R

I have a 1 GB CSV file with dates and corresponding values. The dates are in an undefined format, so they are displayed as numbers in Excel, like this:
DATE FXVol.DKK.EUR,0.75,4
38719 0.21825
I cannot open the CSV file in Excel and change the column to the date format I want, since I would lose data that way.
If I now import the data into R and convert the dates:
as.Date( workingfilereturns[,1], format = "%Y-%m-%d")
It always yields dates that are about 70 years off, e.g. 2076 instead of 2006. I really have no idea what is going wrong or how to fix this issue.

(Note: I have added a note about some quirks in R when dealing with Excel data. You may want to skip directly to that at the bottom; what follows first is the original answer.)
Going by your sample data, 38719 appears to be the number of days which have elapsed since January 1, 1900. So you can just add this number of days to January 1, 1900 to arrive at the correct Date object which you want:
as.Date("1900-01-01") + workingfilereturns[,1]
or
as.Date("1900-01-01") + workingfilereturns$DATE
Example:
> as.Date("1900-01-01") + 38719
[1] "2006-01-04"
Update:
As @Roland correctly pointed out, you could also use as.Date.numeric while specifying an origin of January 1, 1900:
> as.Date.numeric(38719, origin="1900-01-01")
[1] "2006-01-04"
Bug warning:
As the asker @Methamortix pointed out, my solution, namely using January 1, 1900, as the origin, yields a date which is two days too late in R. There are two reasons for this:
In R, the origin is indexed with 0, meaning that as.Date.numeric(0, origin="1900-01-01") is January 1, 1900, in R, but Excel starts counting at 1, meaning that formatting the number 1 in Excel as a Date yields January 1, 1900. This explains why R is one day ahead of Excel.
(Hold your breath) It appears that Excel has a bug in the year 1900: Excel thinks that February 29, 1900 actually happened, even though 1900 was not a leap year (http://www.miniwebtool.com/leap-years-list/?start_year=1850&end_year=2020). As a result, for dates after February 28, 1900, R ends up a second day ahead of Excel, two days in total.
As evidence of this, consider the following code:
> as.Date.numeric(57, origin="1900-01-01")
[1] "1900-02-27"
> as.Date.numeric(58, origin="1900-01-01")
[1] "1900-02-28"
> as.Date.numeric(59, origin="1900-01-01")
[1] "1900-03-01"
In other words, R's as.Date() correctly skipped over February 29th. But type the number 60 into a cell in Excel, format as date, and it will come back as February 29, 1900. My guess is that this has been reported somewhere, possibly on Stack Overflow or elsewhere, but let this serve as another reference point.
So, going back to the original question, the origin needs to be offset by 2 days when dealing with Excel dates in R, whenever the date is greater than February 28, 1900 (which is the case in the original problem). So the asker should convert his date column in the following way:
as.Date.numeric(workingfilereturns$DATE - 2, origin="1900-01-01")
where the date column has been rolled back by two days to sync up with the values in Excel.
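Equivalently, the two-day offset can be folded into the origin itself by using December 30, 1899, the origin Windows Excel effectively uses, which avoids mutating the data column. A minimal sketch with the sample value from the question:

```r
# Shifting the origin back two days gives the same result as
# subtracting 2 from every Excel serial number:
as.Date(38719, origin = "1899-12-30")
#> [1] "2006-01-02"
```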

Related

Trouble obtaining quarterly values from a date variable in stata

I am starting with a date_of_survey variable that is a string formatted as YYYY-MM-DD. I then run the following commands to convert it to a date variable, and display that variable in a useful format:
gen date = date(date_of_survey, "YMD")
gen date_clean = date
format date_clean %dM_d,_CY
drop date_of_survey
That leaves me with a "date_clean" variable displayed as "September 3, 2020" and a corresponding "date" variable displayed as "22161" (equal to days since January 1, 1960).
I now need to create a variable that indicates the year and quarter of each observation, preferably in YYYY-QQ format. I assumed this shouldn't be difficult, but no matter how I have coded it, I wind up with years in the 7000s and inaccurate quarters. I must be misunderstanding how the dates are stored. My first instinct was to try a simple format date %tq command, but I'm still not getting the output I need. Any help is much appreciated. I read over the help files, and can't find the discrepancy that's causing this little problem.
ANSWER: I needed to convert the daily date into a quarterly date (quarters since 1960q1): a qofd() function call before the format %tq did the trick!
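A sketch of that fix in Stata, reusing the variable names from the question (qdate is a made-up name for the new variable):

```stata
* qofd() converts a daily date (days since 01jan1960) into a
* quarterly date (quarters since 1960q1); only then does %tq display correctly
gen qdate = qofd(date)
format qdate %tq
```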

How could I generate a numerical value for Time?

I have time data for mixed linear analysis.
I hope to use R to center on time to get a numerical value.
Below is an example:
TIME = 0 at Wave 1 (0 month, September 2006),
TIME = 0.67 at Wave 2 (8 months, May 2007),
TIME = 1 at Wave 3 (12 months, September 2007),
TIME = 1.67 at Wave 4 (20 months, May 2008),
TIME = 2 at Wave 5 (24 months, September 2008),
TIME = 2.67 at Wave 6 (32 months, May 2009).
Expected format:
Time = ? Wave 1 is April 2020
Time = ? Wave 2 is July 2020
Time = ? Wave 3 is Jan 2021
Time = ? Wave 4 is April 2021
I hope to calculate the numerical value Time.
How could I use R to generate a Time Value like the example shows?
Perhaps I'm unfamiliar with this approach, but it doesn't look like you are "centering." It looks like you are calculating durations. Specifically, each of the values in your example is just the time (in years) since wave 1 (i.e., May 2009 is 2.67 years from Sep 2006). There's nothing wrong with this, I just want to make sure we are working on the same problem.
Assuming you are just looking for the amount of time between two dates, you have two options.
Option 1: Lubridate
The lubridate package is generally the easiest way to work with dates. If you don't use it yet, I think you'll really appreciate how easy it makes handling dates and times in R (but it does need to be installed with install.packages('lubridate')).
library(lubridate)
wave_dates <- c('April 1, 2020', 'July 1, 2020', 'Jan 1, 2021', 'April 1, 2021')
wave_dates <- mdy(wave_dates) # lubridate converts from string to date objects
# get times in years
(wave_dates - min(wave_dates))/dyears(1)
# > [1] 0.0000000 0.2491444 0.7529090 0.9993155
Option 2: Base R
If you want to use base R, you'll need to make sure your dates are converted into a format R can understand with strptime(). Make sure to consult the ?strptime documentation for all of the different formatting instructions you can give it (there are a lot). In this case, we need...
wave_dates <- c('April 1, 2020', 'July 1, 2020', 'January 1, 2021', 'April 1, 2021')
wave_dates <- strptime(wave_dates, '%B %d, %Y') # base R converts from string to date objects
difftime(wave_dates, min(wave_dates), units = 'days') / 365
#> [1] 0.0000000 0.2493151 0.7535388 1.0000000
Note that when using difftime() we need to divide our answer by 365 because it doesn't have a units = 'years' option. This is because some years (leap years) are a different length than others and base R is generally not designed to handle that. In contrast, lubridate can.

Excel date time conversion to POSIXct

I'm reading a date time column from Excel into R, yet converting to POSIXct with the origin 1970-01-01 only yields date times in 1970. Same with the lubridate package using that origin. Any thoughts on how to remedy this?
The xlsx package works fine, but for some reason openxlsx does not.
test2 <- read.xlsx (FN, sheet = 3, startRow = 35, cols = 1:77)
test2$dt <- as.POSIXct(test2$DateTime, origin="1970-01-01")
DateTime reads in from Excel as numeric, e.g. 43306.29.
It should be 7-25-2018 7:00:00 after conversion to POSIXct format, but instead it is 1970-01-01 05:01:46.
One needs to know the key differences between the two time systems here.
In Excel 43306.29 means 43306 days from Jan 1, 1900 (day 1) and the 0.29 is the fraction of the day (about 7 hours here).
R uses the Unix timekeeping standard, so POSIXct tracks the number of seconds since Jan 1, 1970 (the beginning of time for a Unix programmer).
So in order to convert from Excel to R, you need to convert the number of days since the origin into a number of seconds (60 sec * 60 min * 24 hours = 86,400 seconds per day).
as.POSIXct(43306.29*3600*24 , origin="1899-12-30")
#"2018-07-25 02:57:36 EDT"
as.POSIXct(43306.29*3600*24 , origin="1899-12-30", tz="UTC")
#"2018-07-25 06:57:36 UTC"
Note: Windows Excel assumes there was a leap year in 1900, which there wasn't, so the origin needs a correction to Dec 30, 1899.

R - Converting integer into date [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I have integers like the following:
41764 41764 42634 42634 42445 42445 41792 41807 41813 41842 41838 41848 41849 41837
Which need to be converted into date, the time doesn't matter.
I'm told that when converted they should be in the year 2014; the conversions I've tried so far have given the year as either 1984 or 2084.
Thanks!
Sam Firke's janitor package includes a function for cleaning up this Excel mess:
x <- c(41764L, 41764L, 42634L, 42634L, 42445L, 42445L, 41792L, 41807L,
41813L, 41842L, 41838L, 41848L, 41849L, 41837L)
janitor::excel_numeric_to_date(x)
## [1] "2014-05-05" "2014-05-05" "2016-09-21" "2016-09-21" "2016-03-16" "2016-03-16" "2014-06-02"
## [8] "2014-06-17" "2014-06-23" "2014-07-22" "2014-07-18" "2014-07-28" "2014-07-29" "2014-07-17"
Excel reader functions likely take care of this for you, which would be the best approach.
I assume you have Excel date integers here. Microsoft Office Excel stores dates as sequential numbers that are called serial values. For example, in Microsoft Office Excel for Windows, January 1, 1900 is serial number 1, and January 1, 2008 is serial number 39448 because it is 39,448 days after January 1, 1900.
Please note that Excel incorrectly assumes the year 1900 is a leap year.
Microsoft Excel correctly handles all other leap years, including century years that are not leap years (for example, 2100). Only the year 1900 is incorrectly handled.
See Microsoft Knowledge Base for further information.
There is an offset of two days between the R script proposed by @loki and a calculation in Excel.
Please read the following snippet from R's date conversion help (?as.Date):
## date given as number of days since 1900-01-01 (a date in 1989)
as.Date(32768, origin = "1900-01-01")
## Excel is said to use 1900-01-01 as day 1 (Windows default) or
## 1904-01-01 as day 0 (Mac default), but this is complicated by Excel
## incorrectly treating 1900 as a leap year.
## So for dates (post-1901) from Windows Excel
as.Date(35981, origin = "1899-12-30") # 1998-07-05
## and Mac Excel
as.Date(34519, origin = "1904-01-01") # 1998-07-05
## (these values come from http://support.microsoft.com/kb/214330)
Use as.Date() as @MFR pointed out. However, use the origin 1900-01-01:
x <- c(41764, 41764, 42634, 42634, 42445, 42445, 41792, 41807,
41813, 41842, 41838, 41848, 41849, 41837)
as.Date(x, origin = "1900-01-01")
# [1] "2014-05-07" "2014-05-07" "2016-09-23" "2016-09-23" "2016-03-18"
# [6] "2016-03-18" "2014-06-04" "2014-06-19" "2014-06-25" "2014-07-24"
# [11] "2014-07-20" "2014-07-30" "2014-07-31" "2014-07-19"
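As the two-day offset noted above shows, using the Windows Excel origin of 1899-12-30 instead reproduces janitor's output exactly (a sketch using the same x as above):

```r
x <- c(41764, 41764, 42634, 42634, 42445, 42445, 41792, 41807,
       41813, 41842, 41838, 41848, 41849, 41837)

# The 1899-12-30 origin already accounts for both the off-by-one
# indexing and Excel's phantom 1900-02-29:
as.Date(x, origin = "1899-12-30")
# [1] "2014-05-05" "2014-05-05" "2016-09-21" "2016-09-21" "2016-03-16"
# [6] "2016-03-16" "2014-06-02" "2014-06-17" "2014-06-23" "2014-07-22"
# [11] "2014-07-18" "2014-07-28" "2014-07-29" "2014-07-17"
```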

Post-Process a Stata %tw date in R

The %tw format in Stata has the form: 1960w1 which has no equivalent in R.
Therefore %tw dates must be post-processed.
Importing a .dta file into R, the date is an integer like 1304 (instead of 1985w5) or 1426 (instead of 1987w23). If it was a simple time series you could set a starting date as follows:
ts(df, start= c(1985,5), frequency=52)
Another possibility would be:
as.Date(Camp$date, format= "%Yw%W" , origin = "1985w5")
But if each row is not a single date, then you must convert it.
The package ISOweek is based on ISO-8601 with the form "1985-W05" and does not process the Stata %tw.
The lubridate package does not work with this format either. Its week() function returns the number of complete seven-day periods that have occurred between the date and January 1st, plus one. week function
In Stata week 1 of any year starts on 1 January, whatever day of the week that is. Stata Documentation on Dates
In R's %W format for Date, the week starts with Monday as the first day.
From ?strptime, %V is:
the week of the year as decimal number (00--53) as defined in ISO 8601. If the week (starting on Monday) containing 1 January has four or more days in the new year, then it is considered week 1. Otherwise, it is the last week of the previous year, and the next week is week 1. (Accepted but ignored on input.)
Larmarange noted on GitHub that haven doesn't interpret these dates properly:
months, week, quarter and halfyear are specific format from Stata,
respectively %tm, %tw, %tq and %th. I'm not sure that there are
corresponding formats available in R. So far they are imported as
integers.
Is there a way to convert Stata %tw to a date format R understands?
Here is a Stata file with dates.
This won't be an answer in terms of R code, but it is commentary on Stata weeks that can't be fitted into a comment.
Strictly, dates in Stata are not defined by the display formats that make them intelligible to people. A date in Stata is always a numeric variable, scalar, or macro, with origin at the first instance in 1960 (0 is the first day, week, quarter, and so on of 1960). Thus it is at best a shorthand to talk about %tw dates, etc. We can use display to see the effects of different date display formats:
. di %td 0
01jan1960
. di %tw 0
1960w1
. di %tq 0
1960q1
. di %td 42
12feb1960
. di %tw 42
1960w43
. di %tq 42
1970q3
A subtle point made explicit above is that changing the display format will not change what is stored, i.e. the numeric value.
Otherwise put, dates in Stata are not distinct data types; they are just integers made intelligible as dates by a pertinent display format.
The question presupposes that it was correct to describe some weekly dates in terms of Stata weeks. This seems unlikely, as I know no instance in which a body outside StataCorp uses the week rules of Stata, not only that week 1 always starts on 1 January, but also that week 52 always includes either 8 or 9 days and hence that there is never a week 53 in a calendar year.
So, you need to go upstream and find out what the data should have been. Failing some explanation, my best advice is to map the 52 weeks of each year to the days that start them, namely days 1(7)358 of each calendar year.
Stata weeks won't map one-to-one to any other scheme for defining weeks.
More in this article on Stata weeks
It's not completely clear what the question is but the year and week corresponding to 1304 are:
wk <- 1304
1960 + wk %/% 52
## [1] 1985
wk %% 52 + 1
## [1] 5
so assuming that the first week of the year is week 1 and starts on Jan 1st, the beginning of the above week is this date:
as.Date(paste(1960 + wk %/% 52, 1, 1, sep = "-")) + 7 * (wk %% 52)
## [1] "1985-01-29"
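The arithmetic above can be wrapped in a small helper (a sketch; the name stata_week_to_date is made up here, and it assumes Stata's rule that week 1 of every year starts on 1 January and every year has exactly 52 weeks):

```r
# Convert a Stata %tw integer (weeks since 1960w1, which is 0) to the
# Date on which that week begins, under Stata's 52-weeks-per-year rule
stata_week_to_date <- function(wk) {
  year <- 1960 + wk %/% 52
  week <- wk %% 52  # 0-based week within the year
  as.Date(paste(year, 1, 1, sep = "-")) + 7 * week
}

stata_week_to_date(1304)
## [1] "1985-01-29"
```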
