Post-Process a Stata %tw date in R

The %tw format in Stata has the form: 1960w1 which has no equivalent in R.
Therefore %tw dates must be post-processed.
After importing a .dta file into R, the date is an integer such as 1304 (instead of 1985w5) or 1426 (instead of 1987w23). If it were a simple time series, you could set a starting date as follows:
ts(df, start= c(1985,5), frequency=52)
Another possibility would be:
as.Date(Camp$date, format= "%Yw%W" , origin = "1985w5")
But if each row is not a single date, then you must convert it.
The package ISOweek is based on ISO-8601 with the form "1985-W05" and does not process the Stata %tw.
The lubridate package does not work with this format either. Its week() function returns the number of complete seven-day periods that have occurred between the date and January 1st, plus one (see the lubridate documentation for the week function).
In Stata, week 1 of any year starts on 1 January, whatever day of the week that is (see the Stata documentation on dates).
In R, the %W format treats Monday as the first day of the week.
From ?strptime, %V is
the Week of the year as decimal number (00--53) as defined in ISO
8601. If the week (starting on Monday) containing 1 January has four or more days in the new year, then it is considered week 1. Otherwise,
it is the last week of the previous year, and the next week is week 1.
(Accepted but ignored on input.)
Larmarange noted on GitHub that haven doesn't interpret these dates properly:
months, week, quarter and halfyear are specific format from Stata,
respectively %tm, %tw, %tq and %th. I'm not sure that there are
corresponding formats available in R. So far they are imported as
integers.
Is there a way to convert Stata %tw to a date format R understands?
Here is a Stata file with dates

This won't be an answer in terms of R code, but it is commentary on Stata weeks that can't be fitted into a comment.
Strictly, dates in Stata are not defined by the display formats that make them intelligible to people. A date in Stata is always a numeric variable, scalar, or macro whose origin is the first instance in 1960 (1 January 1960 for daily dates, 1960w1 for weekly dates, 1960q1 for quarterly dates). Thus it is at best a shorthand to talk about %tw dates, etc. We can use display to see the effects of different date display formats:
. di %td 0
01jan1960
. di %tw 0
1960w1
. di %tq 0
1960q1
. di %td 42
12feb1960
. di %tw 42
1960w43
. di %tq 42
1970q3
A subtle point made explicit above is that changing the display format will not change what is stored, i.e. the numeric value.
Otherwise put, dates in Stata are not distinct data types; they are just integers made intelligible as dates by a pertinent display format.
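To make the point concrete on the R side, here is a minimal sketch that reproduces the three display results above from the stored integers alone. The helper names stata_td, stata_tw and stata_tq are hypothetical (they are not part of haven or any other package), and the arithmetic assumes Stata's rules of exactly 52 weeks and 4 quarters per year, counted from 1960:

```r
# Hypothetical helpers: render a stored Stata integer the way Stata would display it.
# Assumption: 52 weeks and 4 quarters per calendar year, origin in 1960.
stata_td <- function(x) as.Date(x, origin = "1960-01-01")                 # daily
stata_tw <- function(x) sprintf("%dw%d", 1960 + x %/% 52, x %% 52 + 1)    # weekly
stata_tq <- function(x) sprintf("%dq%d", 1960 + x %/% 4,  x %% 4 + 1)     # quarterly

stata_td(42)
## [1] "1960-02-12"
stata_tw(42)
## [1] "1960w43"
stata_tq(42)
## [1] "1970q3"
```

The same integer 42 yields three different labels, which is why an imported %tw column arrives in R as plain integers.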
The question presupposes that it was correct to describe some weekly dates in terms of Stata weeks. This seems unlikely: I know of no instance in which a body outside StataCorp uses Stata's week rules, namely that week 1 always starts on 1 January and that week 52 always includes either 8 or 9 days, so that there is never a week 53 in a calendar year.
So, you need to go upstream and find out what the data should have been. Failing some explanation, my best advice is to map the 52 weeks of each year to the days that start them, namely days 1, 8, ..., 358 of each calendar year (1(7)358 in Stata's numlist notation).
Stata weeks won't map one-to-one to any other scheme for defining weeks.
More in this article on Stata weeks

It's not completely clear what the question is but the year and week corresponding to 1304 are:
wk <- 1304
1960 + wk %/% 52
## [1] 1985
wk %% 52 + 1
## [1] 5
so assuming that the first week of the year is week 1 and starts on Jan 1st, the beginning of the above week is this date:
as.Date(paste(1960 + wk %/% 52, 1, 1, sep = "-")) + 7 * (wk %% 52)
## [1] "1985-01-29"
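The same arithmetic can be wrapped in a vectorized helper so that a whole imported column is converted at once. This is only a sketch: the function name stata_tw_to_date is hypothetical, and it assumes, as above, that each Stata week is represented by the day that starts it (week 1 starting on 1 January):

```r
# Hypothetical helper: convert Stata %tw integers to the Date that starts each week.
# Assumption: week 1 starts on 1 January and every year has exactly 52 weeks.
stata_tw_to_date <- function(wk) {
  year <- 1960 + wk %/% 52   # calendar year
  week <- wk %% 52           # zero-based week within the year
  as.Date(paste0(year, "-01-01"), format = "%Y-%m-%d") + 7 * week
}

stata_tw_to_date(c(1304, 1426))
## [1] "1985-01-29" "1987-06-04"
```

Because %/% and %% in R round towards minus infinity, the same code should also handle negative values (weeks before 1960).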

Related

Convert from character to date in a "YYYY-WW" format in R

I have a hard time converting character to date in R.
I have a file where the dates are given as "2014-01", where the first is the year and the second is the week of the year. I want to convert this to a date type.
I have tried the following
z <- as.Date('2014-01', '%Y-%W')
print(z)
Output: "2014-12-05"
This is not what I want. I want to get the same format out, i.e., the output should be "2014-01", but now as a date type.
It sounds like you are dealing with some version of year week, which exists in three forms in lubridate:
week() returns the number of complete seven day periods that have
occurred between the date and January 1st, plus one.
isoweek() returns the week as it would appear in the ISO 8601 system,
which uses a reoccurring leap week.
epiweek() is the US CDC version of epidemiological week. It follows
same rules as isoweek() but starts on Sunday. In other parts of the
world the convention is to start epidemiological weeks on Monday,
which is the same as isoweek.
Lubridate has functions to extract these from a date, but I don't know of a built-in way to go the other direction, from a week to one representative day (out of 7 possible). One simple way, if you're dealing with the first version, would be to add 7 * (Week - 1) to January 1 of the year.
library(dplyr)
data.frame(yearweek = c('2014-01', '2014-03')) %>%
  tidyr::separate(yearweek, c("Year", "Week"), convert = TRUE) %>%
  mutate(Date = as.Date(paste0(Year, "-01-01")) + 7 * (Week - 1))
  Year Week       Date
1 2014    1 2014-01-01
2 2014    3 2014-01-15
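A Date object cannot itself be stored "as" a year-week; the "2014-01" form can only be a label produced when formatting. A base R sketch of the round trip, assuming the first convention above (week 1 starts on January 1), could look like this. Both function names are hypothetical:

```r
# Hypothetical helpers for the "week 1 starts on January 1" convention.
yearweek_to_date <- function(x) {
  parts <- do.call(rbind, strsplit(x, "-", fixed = TRUE))
  year  <- as.integer(parts[, 1])
  week  <- as.integer(parts[, 2])
  as.Date(sprintf("%04d-01-01", year), format = "%Y-%m-%d") + 7 * (week - 1)
}

date_to_yearweek <- function(d) {
  lt <- as.POSIXlt(as.Date(d))
  # yday is zero-based, so complete 7-day periods since January 1, plus one
  sprintf("%04d-%02d", lt$year + 1900L, lt$yday %/% 7L + 1L)
}

d <- yearweek_to_date(c("2014-01", "2014-03"))
d
## [1] "2014-01-01" "2014-01-15"
date_to_yearweek(d)
## [1] "2014-01" "2014-03"
```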

as.Date produces unexpected result in a sequence of week-based dates

I am working on the transformation of week based dates to month based dates.
When checking my work, I found the following problem in my data which is the result of a simple call to as.Date()
as.Date("2016-50-4", format = "%Y-%U-%u")
as.Date("2016-50-5", format = "%Y-%U-%u")
as.Date("2016-50-6", format = "%Y-%U-%u")
as.Date("2016-50-7", format = "%Y-%U-%u") # this is the problem
The previous code yields correct date for the first 3 lines:
"2016-12-15"
"2016-12-16"
"2016-12-17"
The last line of code however, goes back 1 week:
"2016-12-11"
Can anybody explain what is happening here?
Working with week of the year can become very tricky. You may try to convert the dates using the ISOweek package:
# create date strings in the format given by the OP
wd <- c("2016-50-4","2016-50-5","2016-50-6","2016-50-7", "2016-51-1", "2016-52-7")
# convert to "normal" dates
ISOweek::ISOweek2date(stringr::str_replace(wd, "-", "-W"))
The result
#[1] "2016-12-15" "2016-12-16" "2016-12-17" "2016-12-18" "2016-12-19" "2017-01-01"
is of class Date.
Note that the ISO week-based date format is yyyy-Www-d with a capital W preceding the week number. This is required to distinguish it from the standard month-based date format yyyy-mm-dd.
So, in order to convert the date strings provided by the OP using ISOweek2date() it is necessary to insert a W after the first hyphen which is accomplished by replacing the first - by -W in each string.
Also note that ISO weeks start on Monday and the days of the week are numbered 1 to 7. The year which belongs to an ISO week may differ from the calendar year. This can be seen from the sample dates above where the week-based date 2016-W52-7 is converted to 2017-01-01.
About the ISOweek package
Back in 2011, the %G, %g, %u, and %V format specifications weren't available to strptime() in the Windows version of R. This was annoying as I had to prepare weekly reports including week-on-week comparisons. I spent hours to find a solution for dealing with ISO weeks, ISO weekdays, and ISO years. Finally, I ended up creating the ISOweek package and publishing it on CRAN. Today, the package still has its merits as the aforementioned formats are ignored on input (see ?strptime for details).
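For readers who want to see what such a conversion involves, here is a minimal base R sketch of the ISO 8601 rule. It is not the ISOweek implementation; the function name iso_to_date is hypothetical, and it relies on the fact that 4 January always falls in ISO week 1:

```r
# Hypothetical helper: ISO year, ISO week and ISO weekday (1 = Monday) to Date.
iso_to_date <- function(year, week, day) {
  jan4 <- as.Date(sprintf("%04d-01-04", as.integer(year)), format = "%Y-%m-%d")
  # Monday of ISO week 1: step back from 4 January to the preceding Monday
  week1_monday <- jan4 - (as.POSIXlt(jan4)$wday + 6) %% 7
  week1_monday + 7 * (week - 1) + (day - 1)
}

iso_to_date(2016, c(50, 52), 7)
## [1] "2016-12-18" "2017-01-01"
```

The second value shows the same year rollover as the sample dates above: 2016-W52-7 is 1 January 2017.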
As @lmo said in the comments, %u stands for the weekday as a decimal number (1–7, with Monday as 1) and %U stands for the week of the year as a decimal number (00–53) using Sunday as the first day. Thus, as.Date("2016-50-7", format = "%Y-%U-%u") will result in "2016-12-11".
However, if that should give "2016-12-18", then you should use a week format that also has Monday as the starting day. From the documentation in ?strptime, you would expect the format "%Y-%V-%u" to give the correct output, since %V stands for the week of the year as a decimal number (01–53) with Monday as the first day.
Unfortunately, it doesn't:
> as.Date("2016-50-7", format = "%Y-%V-%u")
[1] "2016-01-18"
However, at the end of the explanation of %V it says "Accepted but ignored on input", meaning that it won't work.
You can circumvent this behavior as follows to get the correct dates:
# create a vector of dates
d <- c("2016-50-4","2016-50-5","2016-50-6","2016-50-7", "2016-51-1")
# convert to the correct dates
as.Date(paste0(substr(d,1,8), as.integer(substring(d,9))-1), "%Y-%U-%w") + 1
which gives:
[1] "2016-12-15" "2016-12-16" "2016-12-17" "2016-12-18" "2016-12-19"
The issue is because for %u, 1 is Monday and 7 is Sunday of the week. The problem is further complicated by the fact that %U assumes week begins on Sunday.
For the given input and expected behavior of format = "%Y-%U-%u", the output of line 4 is consistent with the output of previous 3 lines.
That is, if you want to use format = "%Y-%U-%u", you should pre-process your input. In this case, the fourth line would have to be as.Date("2016-51-7", format = "%Y-%U-%u") as revealed by
format(as.Date("2016-12-18"), "%Y-%U-%u")
# "2016-51-7"
Instead, you are currently passing "2016-50-7".
A better way of doing it might be to use the approach suggested in Uwe Block's answer. Since you are happy with "2016-50-4" being transformed to "2016-12-15", I suspect that in your raw data Monday is counted as 1 too. You could also create a custom function that changes the value of %U to count the week number as if the week begins on Monday, so that the output is as you expected.
# Function to change value of %U so that the week begins on Monday
# (expects a single string, not a vector)
pre_process = function(x, delim = "-"){
  y = unlist(strsplit(x, delim))
  # If the last day of the year is 7 (Sunday for %u),
  # add 1 to the week to make it the week 00 of the next year
  # I think there might be a better solution for this
  if (y[2] == "53" && y[3] == "7"){
    x = paste(as.integer(y[1]) + 1, "00", y[3], sep = delim)
  } else if (y[3] == "7"){
    # If the day is 7 (Sunday for %u), add 1 to the week
    x = paste(y[1], as.integer(y[2]) + 1, y[3], sep = delim)
  }
  return(x)
}
And usage would be
as.Date(pre_process("2016-50-7"), format = "%Y-%U-%u")
# [1] "2016-12-18"
I'm not quite sure how to handle when the year ends on a Sunday.

Determine week number from date over several years

I'm looking for a way to determine the week number (week beginning on Monday) over several years. That means I don't want to have 0-53 but if, let's say I have 2 years of dates, I want them to be numbered with 0-106 in R.
I tried strftime(Datum, format ="%W") but then I only get the annual week number and not as a whole.
Given that you did not provide any data, I took the liberty of creating some:
#create data
Datum<-c("2013-03-01", "2014-06-02", "2013-06-01")
# format data to year-month-day with strptime
Datum<-strptime(Datum, "%Y-%m-%d")
You now need to identify the origin year. As I'm sure you are aware, not all years have the same number of weeks (366/7 ≈ 52.29 in a leap year vs. 365/7 ≈ 52.14 in a standard calendar year), but as this is unlikely to matter for only 2 years, we can use the number of weeks returned by the strftime function.
origin.year=as.numeric(min(substring(Datum,1,4)))
# number of weeks in first year (offset for second year)
n.weeks<-52
Now we can create a vector containing the number of weeks to offset each week in Datum (X).
X<-as.numeric(substring(Datum,1,4)!=origin.year)*n.weeks
We can then simply add this vector to the number of weeks returned by strftime when it is applied to Datum
week.vec<-as.numeric(strftime(Datum, "%W"))+X
This will work for 2 years, but if you have more years than this, you will need to modify the offsets to account for this.
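For more than two years, one way to avoid per-year offsets altogether is to count whole weeks between Mondays. The following is only a sketch under an assumption that differs slightly from the code above: week 0 is the Monday-based week containing the earliest date, not the first week of the origin year. The function name running_week is hypothetical:

```r
# Hypothetical helper: continuous, Monday-based week index across any number of years.
# Assumption: week 0 is the week that contains the earliest date in d.
running_week <- function(d) {
  d <- as.Date(d)                                  # also accepts POSIXlt from strptime
  monday <- d - (as.POSIXlt(d)$wday + 6) %% 7      # Monday on or before each date
  as.integer(monday - min(monday)) %/% 7L          # whole weeks since the first Monday
}

running_week(c("2013-03-01", "2014-06-02", "2013-06-01"))
## [1]  0 66 13
```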

How to convert specific time format to timestamp in R? [duplicate]

I am working on the "Localization Data for Person Activity Data Set" dataset from UCI. In this dataset, date and time are stored together in one column with the following format:
27.05.2009 14:03:25:777
27.05.2009 14:03:25:183
27.05.2009 14:03:25:210
27.05.2009 14:03:25:237
...
I am wondering if there is any way to convert this column to a timestamp using R.
First of all, we need to replace the colon separating the milliseconds from the seconds with a dot, otherwise the final step won't work (thanks to Dirk Eddelbuettel for this one). Since in the end R will use the separators it wants, to be quicker, I'll just go ahead and replace all the colons with dots:
x <- "27.05.2009 14:03:25:777" # this is a simplified version of your data
y <- gsub(":", ".", x) # this is your vector with the aforementioned substitution
By the way, this is how your vector should look after gsub:
> y
[1] "27.05.2009 14.03.25.777"
Now, in order to have it show the milliseconds, you first need to adjust an R option and then use a function called strptime, which will convert your date vector to POSIXlt (an R-friendly format). Just do the following:
> options(digits.secs = 3) # this tells R you want it to consider 3 digits for seconds.
> strptime(y, "%d.%m.%Y %H.%M.%OS") # this finally formats your vector; the separators must match the dots now in y
[1] "2009-05-27 14:03:25.777"
I've learned this nice trick here. This other answer also says you can skip the options setting and use, for example, strptime(y, "%d.%m.%Y %H:%M:%OS3"), but it doesn't work for me. Henrik noted that the function's help page, ?strptime states that the %OS3 bit is OS-dependent. I'm using an updated Ubuntu 13.04 and using %OS3 yields NA.
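As a variation, only the last colon (the one in front of the milliseconds) needs to change, which lets the format string keep the usual colons between hours, minutes and seconds. A minimal sketch follows; the choice of tz = "UTC" is an assumption made here to keep the result reproducible, since the dataset's actual time zone is not stated:

```r
x <- c("27.05.2009 14:03:25:777", "27.05.2009 14:03:25:183")

# Replace only the final colon, i.e. the one before the milliseconds
y <- sub(":([0-9]+)$", ".\\1", x)

options(digits.secs = 3)                               # print 3 digits of fractional seconds
p  <- strptime(y, "%d.%m.%Y %H:%M:%OS", tz = "UTC")    # POSIXlt; tz is an assumption
ct <- as.POSIXct(p)                                    # compact numeric timestamp
```

On input, plain %OS reads the fractional seconds; ct can then be stored as a data frame column.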
When using strptime (or other POSIX-related functions such as as.Date), keep in mind some of the most common conversions used (edited for brevity, as suggested by DWin. Complete list at strptime):
%a Abbreviated weekday name in the current locale.
%A Full weekday name in the current locale.
%b Abbreviated month name in the current locale.
%B Full month name in the current locale.
%d Day of the month as decimal number (01–31).
%H Hours as decimal number (00–23). Times such as 24:00:00 are accepted for input.
%I Hours as decimal number (01–12).
%j Day of year as decimal number (001–366).
%m Month as decimal number (01–12).
%M Minute as decimal number (00–59).
%p AM/PM indicator in the locale. Used in conjunction with %I and not with %H.
%S Second as decimal number (00–61), allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).
%U Week of the year as decimal number (00–53) using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
%w Weekday as decimal number (0–6, Sunday is 0).
%W Week of the year as decimal number (00–53) using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
%y Year without century (00–99). On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19
%Y Year with century. Note that whereas there was no zero in the original Gregorian calendar, ISO 8601:2004 defines it to be valid (interpreted as 1BC)

Bucketing data into weekly, bi-weekly, monthly and quarterly data in R

I have a data frame with two columns. Date, Gender
I want to change the Date column to the start of the week for that observation. For example, if Jun-28-2011 is a Tuesday, I'd like to change it to Jun-27-2011. Basically, I want to re-label Date fields such that two data points that are in the same week have the same Date.
I also want to be able to do it bi-weekly, or monthly, and especially quarterly.
Update:
Let's use this as a dataset.
datset <- data.frame(date = as.Date("2011-06-28")+c(1:100))
One slick way to do this that I just learned recently is to use the lubridate package:
library(lubridate)
datset <- data.frame(date = as.Date("2011-06-28")+c(1:100))
#Add 1, since floor_date appears to round down to Sundays
floor_date(datset$date,"week") + 1
I'm not sure about how to do bi-weekly binning, but monthly and quarterly are easily handled with the respective base functions:
quarters(datset$date)
months(datset$date)
EDIT: Interestingly, floor_date from lubridate does not appear to be able to round down to the nearest quarter, but the function of the same name in ggplot2 does.
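For bi-weekly binning, and for labelling each observation with the first day of its period (not just the period's name), a base R sketch could look like the following. All four function names are hypothetical, weeks are assumed to start on Monday, and the bi-weekly bins are assumed to be anchored at the first week present in the data:

```r
# Hypothetical helpers; weeks are assumed to start on Monday.
week_start <- function(d) {
  d <- as.Date(d)
  d - (as.POSIXlt(d)$wday + 6) %% 7          # Monday on or before d
}
biweek_start <- function(d, anchor = min(week_start(d))) {
  w <- week_start(d)
  anchor + 14 * (as.integer(w - anchor) %/% 14L)   # 14-day bins from the anchor Monday
}
month_start <- function(d) as.Date(format(as.Date(d), "%Y-%m-01"))
quarter_start <- function(d) {
  lt <- as.POSIXlt(as.Date(d))
  first_month <- as.integer(3L * (lt$mon %/% 3L) + 1L)   # 1, 4, 7 or 10
  as.Date(sprintf("%04d-%02d-01", as.integer(lt$year + 1900L), first_month))
}

week_start("2011-06-28")
## [1] "2011-06-27"
quarter_start("2011-08-15")
## [1] "2011-07-01"
```

Unlike adding 1 to a Sunday-based floor, week_start() maps a Sunday to the Monday six days earlier, so the label never falls after the observation.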
Look at ?strftime. In particular, the following formats:
%b: Abbreviated month name in the
current locale. (Also matches full
name on input.)
%B: Full month name
in the current locale. (Also matches
abbreviated name on input.)
%m: Month as decimal number (01–12).
%W: Week of the year as decimal number
(00–53) using Monday as the first day
of week (and typically with the first
Monday of the year as day 1 of week
1). The UK convention.
eg:
> strftime("2011-07-28","Month: %B, Week: %W")
[1] "Month: July, Week: 30"
> paste("Quarter:",ceiling(as.integer(strftime("2011-07-28","%m"))/3))
[1] "Quarter: 3"
