I have time data for mixed linear analysis.
I want to use R to center time so that each wave gets a numerical value.
Below is an example:
TIME = 0 at Wave 1 (0 month, September 2006),
TIME = 0.67 at Wave 2 (8 months, May 2007),
TIME = 1 at Wave 3 (12 months, September 2007),
TIME = 1.67 at Wave 4 (20 months, May 2008),
TIME = 2 at Wave 5 (24 months, September 2008),
TIME = 2.67 at Wave 6 (32 months, May 2009).
Expected format:
Time = ? Wave 1 is April 2020
Time = ? Wave 2 is July 2020
Time = ? Wave 3 is Jan 2021
Time = ? Wave 4 is April 2021
I hope to calculate the numerical value Time.
How could I use R to generate a Time Value like the example shows?
Perhaps I'm unfamiliar with this approach, but it doesn't look like you are "centering." It looks like you are calculating durations. Specifically, each of the values in the example that you give is just the time (in years) since wave 1 (i.e., May 2009 is 32 months, or 2.67 years, from Sep 2006). There's nothing wrong with this, I just want to make sure we are working on the same problem.
Assuming you are just looking for the amount of time between two dates, you have two options.
Option 1: Lubridate
The lubridate package is generally the easiest way to work with dates. If you don't use it yet, I think you'll really appreciate how easy it makes handling dates and times in R (but it does need to be installed with install.packages('lubridate')).
library(lubridate)
wave_dates <- c('April 1, 2020', 'July 1, 2020', 'Jan 1, 2021', 'April 1, 2021')
wave_dates <- mdy(wave_dates) # lubridate converts from string to date objects
# get times in years
(wave_dates - min(wave_dates))/dyears(1)
# > [1] 0.0000000 0.2491444 0.7529090 0.9993155
Option 2: Base R
If you want to use base R, you'll need to make sure your dates are converted into a format R can understand with strptime(). Make sure to consult ?strptime()'s documentation for all of the different formatting instructions you can give it (there are a lot). In this case, we need...
wave_dates <- c('April 1, 2020', 'July 1, 2020', 'January 1, 2021', 'April 1, 2021')
wave_dates <- strptime(wave_dates, '%B %d, %Y') # base R converts from string to date objects
difftime(wave_dates, min(wave_dates), units = 'days') / 365
#> [1] 0.0000000 0.2493151 0.7535388 1.0000000
Note that when using difftime() we need to divide our answer by 365 because it doesn't have a units = 'years' option. This is because some years (leap years) are a different length than others, so base R does not define a 'years' unit. In contrast, lubridate's dyears(1) is a fixed average year of 365.25 days, which is why the two sets of results differ slightly.
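The values in the question look like whole calendar months divided by 12 (8 months gives 0.67) rather than elapsed days. A minimal base R sketch of that month-based calculation follows. It assumes each wave is dated to the first of its month, and the variable names are illustrative.

```r
# Wave dates, assumed to fall on the first day of each month
wave_dates <- as.Date(c("2020-04-01", "2020-07-01", "2021-01-01", "2021-04-01"))

# Calendar year and month of each wave
yr <- as.integer(format(wave_dates, "%Y"))
mo <- as.integer(format(wave_dates, "%m"))

# Whole months since the earliest wave, then converted to years
first <- which.min(wave_dates)
months_since_first <- (yr - yr[first]) * 12 + (mo - mo[first])
time_years <- months_since_first / 12
time_years
#> [1] 0.00 0.25 0.75 1.00
```

This gives the round values 0.25 and 0.75 instead of the slightly uneven day-based results, which may be closer to what the original coding scheme intended.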
Related
I have a hard time converting character to date in R.
I have a file where the dates are given as "2014-01", where the first is the year and the second is the week of the year. I want to convert this to a date type.
I have tried the following
z <- as.Date('2014-01', '%Y-%W')
print(z)
Output: "2014-12-05"
This is not what I want. I want to get the same format out, i.e. the output should be "2014-01", but now as a date type.
It sounds like you are dealing with some version of year week, which exists in three forms in lubridate:
week() returns the number of complete seven day periods that have
occurred between the date and January 1st, plus one.
isoweek() returns the week as it would appear in the ISO 8601 system,
which uses a reoccurring leap week.
epiweek() is the US CDC version of epidemiological week. It follows
same rules as isoweek() but starts on Sunday. In other parts of the
world the convention is to start epidemiological weeks on Monday,
which is the same as isoweek.
Lubridate has functions to extract these from a date, but I don't know of a built-in way to go the other direction, from a week to one representative day (out of 7 possible). One simple way, if you're dealing with the first version, would be to add 7 * (Week - 1) days to January 1 of the year.
library(dplyr)
data.frame(yearweek = c('2014-01', '2014-03')) %>%
  tidyr::separate(yearweek, c("Year", "Week"), convert = TRUE) %>%
  mutate(Date = as.Date(paste0(Year, "-01-01")) + 7 * (Week-1))
  Year Week       Date
1 2014    1 2014-01-01
2 2014    3 2014-01-15
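If the week numbers instead follow the %W convention from ?strptime (weeks start on Monday, and week 1 begins at the first Monday of the year), base R can parse them once a weekday is appended. This is only a sketch under that assumption. A Date always carries a specific day, so the "2014-01" form is recovered by formatting the result.

```r
yearweek <- c("2014-01", "2014-03")

# Append "-1" (Monday, matched by %u) so each string identifies a single day
dates <- as.Date(paste0(yearweek, "-1"), format = "%Y-%W-%u")

# Format the Date back into the original year-week label when needed
labels <- format(dates, "%Y-%W")
```

Keeping the column as a Date and only formatting it for display means sorting and date arithmetic still work.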
For example:
Year A   Year B
1990     2021
1980     2021
Thanks in advance.
It depends on what you expect.
If you only want to use the difftime() function with date objects built from years alone (as below), it will work (the day and month are set to today's for the calculation).
> a = as.Date("2021", format = "%Y")
> b = as.Date("2010", format = "%Y")
> difftime(a,b)
Time difference of 4018 days
But if you want to get the difference in years, it is not possible, as the function documentation clearly states that the return value unit must be one of: units = c("auto", "secs", "mins", "hours", "days", "weeks")
You might find a better way to handle date/time data with the lubridate package.
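Since difftime() stops at weeks, one possible workaround is to ask for days and convert by hand. The sketch below uses two made-up full dates and divides by 365.25 as an average year length, which is an approximation rather than an exact calendar count.

```r
a <- as.Date("2021-06-15")
b <- as.Date("2010-06-15")

# difftime() has no "years" unit, so request days and drop the difftime class
days <- as.numeric(difftime(a, b, units = "days"))
days
#> [1] 4018

# Approximate years, using 365.25 days as the average year length
years <- days / 365.25
```

The result is close to, but not exactly, 11, because the span contains three leap days rather than 2.75.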
difftime() requires date objects. I tried reproducing this using only years but was unable to.
Why not simply use the absolute value (abs()) if you're only interested in the year difference?
Here is an example, with the difference added as a new column:
Year_A <- c(1990, 1980)
Year_B <- c(2021, 2021)
df <- data.frame(Year_A, Year_B)
df$diff <- abs(Year_A - Year_B)
P.S. I noticed the answer above mine was added while I was answering, and I can't comment on it due to low rep. I see you can't use "years" as a unit value there, the largest being weeks, but you can convert from days or weeks to years if that's what you're after.
Using R
Got large clinical health data set to play with, but dates are weird
Most problematic is 2digityear/halfyear, as in 98/2, meaning at some point in 1998 after July 1
I have split the column up into 2 character columns, e.g. 98 and 2 but now need to convert the 2 digit year character string into an actual year.
I tried as.Date(data$variable,format="%Y") but not only did I get a conversion to 0098 as the year rather than 1998, I also got today's month and day arbitrarily added (the actual data has no month or day).
as in 0098-06-11
How do I get just 1998 instead?
Not elegant, but you can get that with a combination of lubridate and as.Date.
library(lubridate)
data <- data.frame(variable = c(95, 96, 97,98,99), date=c(1,2,3,4,5))
data$variableUpdated <- year(as.Date(as.character(data$variable), format="%y"))
And with base R only:
data$variableUpdated <- format(as.Date(as.character(data$variable), format="%y"),"%Y")
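To keep the half-year information as well, one option is to map each value to the first day of its half. This is a sketch, assuming the raw strings look like "98/2" and that the century follows the same pivot that %y uses (00 to 68 become 20xx, 69 to 99 become 19xx). The names raw and half_start are illustrative.

```r
raw <- c("98/2", "99/1", "01/2")

# Split "yy/h" into a two-digit year and a half-year indicator
parts <- strsplit(raw, "/", fixed = TRUE)
yy    <- as.integer(vapply(parts, `[`, character(1), 1))
half  <- as.integer(vapply(parts, `[`, character(1), 2))

# Expand to a four-digit year with the same pivot that %y uses
year <- ifelse(yy >= 69L, 1900L + yy, 2000L + yy)

# First day of each half-year: January 1 or July 1
start_month <- ifelse(half == 1L, 1L, 7L)
half_start  <- as.Date(sprintf("%d-%02d-01", year, start_month))
half_start
#> [1] "1998-07-01" "1999-01-01" "2001-07-01"
```

For clinical data the pivot matters: a two-digit "05" would be read as 2005, so it is worth checking the plausible range of years in the data first.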
I have a 1 GB csv file with dates and corresponding values. The dates are in an "undefined format", so they are displayed as numbers in Excel, like this:
DATE FXVol.DKK.EUR,0.75,4
38719 0.21825
I cannot open the csv file and change it to the date format I like since I would lose data in this way.
If I now import the data to R and convert the Dates:
as.Date( workingfilereturns[,1], format = "%Y-%m-%d")
It always yields dates that are 70 years + so 2076 instead of 2006. I really have no idea what goes wrong or how to fix this issue.
(Note: I have added a note about some quirks in R when dealing with Excel data. You may want to skip directly to that at the bottom; what follows first is the original answer.)
Going by your sample data, 38719 appears to be the number of days which have elapsed since January 1, 1900. So you can just add this number of days to January 1, 1900 to arrive at the correct Date object which you want:
as.Date("1900-01-01") + workingfilereturns[,1]
or
as.Date("1900-01-01") + workingfilereturns$DATE
Example:
> as.Date("1900-01-01") + 38719
[1] "2006-01-04"
Update:
As @Roland correctly pointed out, you could also use as.Date.numeric while specifying an origin of January 1, 1900:
> as.Date.numeric(38719, origin="1900-01-01")
[1] "2006-01-04"
Bug warning:
As the asker @Methamortix pointed out, my solution, namely using January 1, 1900, as the origin, yields a date which is two days too late in R. There are two reasons for this:
1. In R, the origin is indexed with 0, meaning that as.Date.numeric(0, origin="1900-01-01") is January 1, 1900, in R, but Excel starts counting at 1, meaning that formatting the number 1 in Excel as a Date yields January 1, 1900. This explains why R is one day ahead of Excel.
2. (Hold your breath) It appears that Excel has a bug in the year 1900, specifically Excel thinks that February 29, 1900 actually happened, even though 1900 was not a leap year (http://www.miniwebtool.com/leap-years-list/?start_year=1850&end_year=2020). As a result, when dealing with dates greater than February 28, 1900, R is a second day ahead of Excel.
As evidence of this, consider the following code:
> as.Date.numeric(57, origin="1900-01-01")
[1] "1900-02-27"
> as.Date.numeric(58, origin="1900-01-01")
[1] "1900-02-28"
> as.Date.numeric(59, origin="1900-01-01")
[1] "1900-03-01"
In other words, R's as.Date() correctly skipped over February 29th. But type the number 60 into a cell in Excel, format as date, and it will come back as February 29, 1900. My guess is that this has been reported somewhere, possibly on Stack Overflow or elsewhere, but let this serve as another reference point.
So, going back to the original question, the origin needs to be offset by 2 days when dealing with Excel dates in R, where the date is later than February 28, 1900 (which is the case in the original problem). The asker should therefore convert the date column in the following way:
as.Date.numeric(workingfilereturns$DATE - 2, origin="1900-01-01")
where the date column has been rolled back by two days to sync up with the values in Excel.
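The same two-day correction can also be written by shifting the origin instead of the data, since December 30, 1899 is two days before January 1, 1900. The sketch below also shows where the dates in 2076 likely came from, assuming the serial numbers were counted from R's own epoch of January 1, 1970. The variable name excel_serial is illustrative.

```r
excel_serial <- 38719

# Shifted origin: equivalent to subtracting 2 and using "1900-01-01"
as.Date(excel_serial, origin = "1899-12-30")
#> [1] "2006-01-02"

# Counting the same number from R's epoch lands 70 years later
format(as.Date(excel_serial, origin = "1970-01-01"), "%Y")
#> [1] "2076"
```

Like the version above, this only holds for serial numbers after Excel's phantom February 29, 1900 (that is, values above 60).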
The %tw format in Stata has the form: 1960w1 which has no equivalent in R.
Therefore %tw dates must be post-processed.
Importing a .dta file into R, the date is an integer like 1304 (instead of 1985w5) or 1426 (instead of 1987w23). If it was a simple time series you could set a starting date as follows:
ts(df, start= c(1985,5), frequency=52)
Another possibility would be:
as.Date(Camp$date, format= "%Yw%W" , origin = "1985w5")
But if each row is not a single date, then you must convert it.
The package ISOweek is based on ISO-8601 with the form "1985-W05" and does not process the Stata %tw.
The lubridate package does not work with this format either. Its week() function returns the number of complete seven-day periods that have occurred between the date and January 1st, plus one (see the documentation for week()).
In Stata, week 1 of any year starts on 1 January, whatever day of the week that is (see the Stata documentation on dates).
With the %W format in R, weeks start on Monday.
From strptime %V is
the Week of the year as decimal number (00--53) as defined in ISO
8601. If the week (starting on Monday) containing 1 January has four or more days in the new year, then it is considered week 1. Otherwise,
it is the last week of the previous year, and the next week is week 1.
(Accepted but ignored on input.) Strptime
Larmarange noted on Github that Haven doesn't interpret dates properly:
months, week, quarter and halfyear are specific format from Stata,
respectively %tm, %tw, %tq and %th. I'm not sure that there are
corresponding formats available in R. So far they are imported as
integers.
Is there a way to convert Stata %tw to a date format R understands?
Here is a Stata file with dates
This won't be an answer in terms of R code, but it is commentary on Stata weeks that can't be fitted into a comment.
Strictly, dates in Stata are not defined by the display formats that make them intelligible to people. A date in Stata is always a numeric variable or scalar or macro defined with origin the first instance in 1960. Thus it is at best a shorthand to talk about %tw dates, etc. We can use display to see the effects of different date display formats:
. di %td 0
01jan1960
. di %tw 0
1960w1
. di %tq 0
1960q1
. di %td 42
12feb1960
. di %tw 42
1960w43
. di %tq 42
1970q3
A subtle point made explicit above is that changing the display format will not change what is stored, i.e. the numeric value.
Otherwise put, dates in Stata are not distinct data types; they are just integers made intelligible as dates by a pertinent display format.
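The same point can be checked from R, where the imported integers are all that is left. A small sketch, assuming daily values count days from 1 January 1960 and quarterly values count quarters from 1960q1, as the display output above suggests:

```r
# Daily date: 42 days after 1 January 1960
as.Date(42, origin = "1960-01-01")
#> [1] "1960-02-12"

# Quarterly date: 42 quarters after 1960q1
q <- 42
paste0(1960 + q %/% 4, "q", q %% 4 + 1)
#> [1] "1970q3"
```

The integer 42 is the same in both cases; only the assumed unit changes the date it stands for.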
The question presupposes that it was correct to describe some weekly dates in terms of Stata weeks. This seems unlikely, as I know no instance in which a body outside StataCorp uses the week rules of Stata, not only that week 1 always starts on 1 January, but also that week 52 always includes either 8 or 9 days and hence that there is never a week 53 in a calendar year.
So, you need to go upstream and find out what the data should have been. Failing some explanation, my best advice is to map the 52 weeks of each year to the days that start them, namely days 1(7)358 of each calendar year.
Stata weeks won't map one-to-one to any other scheme for defining weeks.
More in this article on Stata weeks
It's not completely clear what the question is but the year and week corresponding to 1304 are:
wk <- 1304
1960 + wk %/% 52
## [1] 1985
wk %% 52 + 1
## [1] 5
so assuming that the first week of the year is week 1 and starts on Jan 1st, the beginning of the above week is this date:
as.Date(paste(1960 + wk %/% 52, 1, 1, sep = "-")) + 7 * (wk %% 52)
## [1] "1985-01-29"
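Wrapped into a vectorised helper, under the same assumption (52 weeks per year, week 1 starting on 1 January), this could look as follows. The function name stata_week_to_date is made up for illustration and is not part of any package.

```r
stata_week_to_date <- function(wk) {
  year <- 1960 + wk %/% 52   # 52 Stata weeks per calendar year
  week <- wk %% 52 + 1       # week within the year, 1 to 52
  # Start of the week: 1 January plus 7 days per completed week
  as.Date(paste(year, 1, 1, sep = "-")) + 7 * (week - 1)
}

stata_week_to_date(c(0, 1304, 1426))
#> [1] "1960-01-01" "1985-01-29" "1987-06-04"
```

It could then be applied to a whole imported column, for example Camp$date from the question.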