Project Euler #19 in R

I need help understanding why I am getting the wrong answer for Problem 19 of Project Euler.
The problem is:
You are given the following information, but you may prefer to do some research for yourself.
1 Jan 1900 was a Monday.
Thirty days has September,
April, June and November.
All the rest have thirty-one,
Saving February alone,
Which has twenty-eight, rain or shine.
And on leap years, twenty-nine.
A leap year occurs on any year evenly divisible by 4, but not on a century unless it is divisible by 400.
How many Sundays fell on the first of the month during the twentieth century (1 Jan 1901 to 31 Dec 2000)?
#rm(list=ls())
days=seq(from=as.Date("1900/1/1"), to=as.Date("2000/12/31"), by="month")
firstSundays=days[weekdays(as.Date(days))=="Sunday"&months(as.Date(days))=="January"]
length(firstSundays)
The answer it gives me is 14 and when I look at firstSundays it gives me:
[1] "1905-01-01" "1911-01-01" "1922-01-01" "1928-01-01" "1933-01-01"
[6] "1939-01-01" "1950-01-01" "1956-01-01" "1961-01-01" "1967-01-01"
[11] "1978-01-01" "1984-01-01" "1989-01-01" "1995-01-01"
I don't understand what is going on here. Could someone please explain? I am fairly new to R and I'm not sure what I am doing wrong.

Your sequence starts at 1 Jan 1900, which is outside the required range, and the condition months(as.Date(days)) == "January" keeps only 1 January dates, so you are counting years whose New Year's Day was a Sunday rather than every first of the month that fell on a Sunday. Drop that condition and start the sequence in 1901. To compute it in R you could do as follows:
firsts_of_months <- seq(as.Date("1901-01-01"), as.Date("2000-12-01"), by = "1 month")
sum(weekdays(firsts_of_months) == "Sonntag") # weekdays() is locale-dependent; "Sonntag" is German, use "Sunday" in an English locale
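If you want to avoid the locale dependence altogether, a small variant (a sketch using base R's POSIXlt, whose $wday component is 0 for Sunday) works the same way:
firsts_of_months <- seq(as.Date("1901-01-01"), as.Date("2000-12-01"), by = "1 month")
sum(as.POSIXlt(firsts_of_months)$wday == 0)  # Sunday is coded as 0, no locale-specific names needed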

best practices for avoiding roundoff gotchas in date manipulation

I am doing some date/time manipulation and experiencing explicable, but unpleasant, round-tripping problems when converting date -> time -> date. I have temporarily overcome this problem by rounding at appropriate points, but I wonder if there are best practices for date handling that would be cleaner. I'm using a mix of base-R and lubridate functions.
tl;dr is there a good, simple way to convert from decimal date (YYYY.fff) to the Date class (and back) without going through POSIXt and incurring round-off (and potentially time-zone) complications?
Start with a few days from 1918, as separate year/month/day columns (not a critical part of my problem, but it's where my pipeline happens to start):
library(lubridate)
dd <- data.frame(year=1918,month=9,day=1:12)
Convert year/month/day -> date -> time:
dd <- transform(dd,
                time = decimal_date(make_date(year, month, day)))
The successive differences in the resulting time vector, scaled up to days, are not exactly 1 because of roundoff: this is understandable but leads to problems down the road.
table(diff(dd$time)*365)
## 0.999999999985448 1.00000000006844
## 9 2
Now suppose I convert back to a date: the dates are slightly before or after midnight (off by <1 second in either direction):
d2 <- lubridate::date_decimal(dd$time)
# [1] "1918-09-01 00:00:00 UTC" "1918-09-02 00:00:00 UTC"
# [3] "1918-09-03 00:00:00 UTC" "1918-09-03 23:59:59 UTC"
# [5] "1918-09-04 23:59:59 UTC" "1918-09-05 23:59:59 UTC"
# [7] "1918-09-07 00:00:00 UTC" "1918-09-08 00:00:00 UTC"
# [9] "1918-09-09 00:00:00 UTC" "1918-09-09 23:59:59 UTC"
# [11] "1918-09-10 23:59:59 UTC" "1918-09-12 00:00:00 UTC"
If I now want dates (rather than POSIXct objects) I can use as.Date(), but to my dismay as.Date() truncates rather than rounding ...
tt <- as.Date(d2)
## [1] "1918-09-01" "1918-09-02" "1918-09-03" "1918-09-03" "1918-09-04"
## [6] "1918-09-05" "1918-09-07" "1918-09-08" "1918-09-09" "1918-09-09"
##[11] "1918-09-10" "1918-09-12"
So the differences are now 0/1/2 days:
table(diff(tt))
# 0 1 2
# 2 7 2
I can fix this by rounding first:
table(diff(as.Date(round(d2))))
## 1
## 11
but I wonder if there is a better way (e.g. keeping POSIXct out of my pipeline and staying with dates) ...
As suggested by this R-help desk article from 2004 by Grothendieck and Petzoldt:
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
The extensive table in this article shows how to translate among Date, chron, and POSIXct, but doesn't include decimal time as one of the candidates ...
It seems like it would be best to avoid converting back from decimal time if at all possible.
When converting from date to decimal date, one also needs to account for time. Since Date does not have a specific time associated with it, decimal_date inherently assumes it to be 00:00:00.
However, if we are concerned only with the date (and not the time), we could assume the time to be anything. Arguably, middle of the day (12:00:00) is as good as the beginning of the day (00:00:00). This would make the conversion back to Date more reliable as we are not at the midnight mark and a few seconds off does not affect the output. One of the ways to do this would be to add 12*60*60/(365*24*60*60) to dd$time
dd$time2 <- dd$time + 12*60*60/(365*24*60*60)
data.frame(dd[1:3],
           "00:00:00" = as.Date(date_decimal(dd$time)),
           "12:00:00" = as.Date(date_decimal(dd$time2)),
           check.names = FALSE)
# year month day 00:00:00 12:00:00
#1 1918 9 1 1918-09-01 1918-09-01
#2 1918 9 2 1918-09-02 1918-09-02
#3 1918 9 3 1918-09-03 1918-09-03
#4 1918 9 4 1918-09-03 1918-09-04
#5 1918 9 5 1918-09-04 1918-09-05
#6 1918 9 6 1918-09-05 1918-09-06
#7 1918 9 7 1918-09-07 1918-09-07
#8 1918 9 8 1918-09-08 1918-09-08
#9 1918 9 9 1918-09-09 1918-09-09
#10 1918 9 10 1918-09-09 1918-09-10
#11 1918 9 11 1918-09-10 1918-09-11
#12 1918 9 12 1918-09-12 1918-09-12
It should be noted, however, that the value of decimal time obtained in this way will be different.
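For reference, the shift added to get dd$time2 is constant and tiny: it is just half a day expressed as a fraction of a 365-day year.
12 * 60 * 60 / (365 * 24 * 60 * 60)  # the half-day offset in year units
# [1] 0.001369863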
lubridate::decimal_date() is returning a numeric. If I understand you correctly, the question is how to convert that numeric into Date and have it round appropriately without bouncing through POSIXct.
as.Date(1L, origin = '1970-01-01') shows us that we can provide as.Date with days since some specified origin and convert immediately to the Date type. Knowing this, we can skip the year part entirely and set it as origin. Then we can convert our decimal dates to days:
as.Date((dd$time - trunc(dd$time)) * 365, origin = "1918-01-01")
So, a function like this might do the trick (at least for years without leap days):
date_decimal2 <- function(decimal_date) {
  years <- trunc(decimal_date)
  origins <- paste0(years, "-01-01")
  # cf. https://stackoverflow.com/questions/14449166/dates-with-lapply-and-sapply
  do.call(c, mapply(as.Date.numeric, x = (decimal_date - years) * 365,
                    origin = origins, SIMPLIFY = FALSE))
}
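A quick round-trip check on the example data from above (hedged: this is only for years without leap days, and it assumes the fractional day offset never lands just below an integer; rounding the offset first would guard against that):
date_decimal2(dd$time)
# should give back "1918-09-01" through "1918-09-12"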
Side note: I admit I went down a bit of a rabbit hole trying to move origin around to deal with the pre-1970 dates. I found that the further origin shifted from the target date, the weirder the results got (and not in ways that seemed to be easily explained by leap days). Since origin is flexible, I decided to target it right on top of the target values. For leap days, seconds, and whatever other weirdness time has in store for us, on your own head be it. =)

Subsetting by year gives different results using ymd() vs. year()

I am getting different nrows subsetting by year using ymd() and year() in package lubridate and I am trying to figure out what might be causing that disparity.
A 331kb CSV file with 10k dates is here. A url pointing to Google Drive and Dropbox kept throwing up errors, beyond my newbie skills to figure out.
require(data.table)
require(lubridate)
teaSet <- fread("../teaSet.csv", na.strings=c("NA","N/A", ""))
teaSet$opened <- ymd_hms(teaSet$opened, tz = "")
teaSet$year <- as.factor(teaSet$year)
ymd2010 <- teaSet[opened >= ymd("2010-01-01") & opened <= ymd("2010-12-31"),]
#1480 obs.
year2010 <- teaSet[year(opened)==2010,]
#1483 obs
summary(teaSet$year)
#2010 2011 2012 2013 2014 2015 2016
#1483 1408 1317 1414 1521 1701 1156
Can anyone explain what I am missing? I was subsetting by date range and then by year() and noticed the year() and ymd() counts were different. I created a factor column for years (and cleverly named it "year") to speed things up - my dataset has 13 million rows - but that is not directly relevant to my question. Seemed like a good idea when I started. I tried different sample sizes and the disparity remains across sizes. Thanks!
Looking over the problem some more, it looks like ymd("2010-12-31") is 12:00 AM (midnight at the start of the 31st), not the end of that day, so any observation with a time later on 31 December falls outside the <= filter.
There are two possible solutions that I can see: use the next day in the filter, or convert all of your date/times to plain dates with GMT.
If you change the upper bound to opened < ymd("2011-1-1") it will work.
require(lubridate)
library(data.table)
teaSet <- fread("teaSet.csv", na.strings=c("NA","N/A", ""))
teaSet$opened <- ymd_hms(teaSet$opened, tz = "")
teaSet$year <- as.factor(teaSet$year)
ymd2010 <- teaSet[opened >= ymd("2010-01-01") & opened < ymd("2011-1-1"),]
print(dim(ymd2010))
#a second possible option - not as clean as the prior one
teaSet$opened <- ymd_hms(teaSet$opened, tz = "GMT")
ymd2010_2 <- teaSet[as.Date(opened) >= ymd("2010-01-01") & as.Date(opened) <= ymd("2010-12-31")]
print(dim(ymd2010_2))
year2010 <- teaSet[year(opened)==2010,]
print(dim(year2010))
summary(teaSet$year)
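To see exactly which observations account for the 1480 vs. 1483 difference, a quick check along these lines (a sketch, assuming teaSet as loaded above) lists the 2010 rows whose timestamp falls after midnight on 31 December:
gap <- teaSet[year(opened) == 2010 & opened > ymd("2010-12-31", tz = tz(opened))]
print(dim(gap))  # should show the 3 rows the original <= filter misses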
I agree the timezone issue is unintuitive, but it is what it is. Nice job testing for and catching the inconsistency in your original solution.

How to convert values into a percentage in R

So this is the question.
Suppose you track your commute times for two weeks (10 days) and you find the following times in minutes
17 16 20 24 22 15 21 15 17 22
Suppose that the ‘24’ was a mistake and it should have been 18. Write code that fixes this, i.e. changes ‘24’ to ‘18’. Then compute the new mean and standard deviation of the commute times.
Write code which counts the number of instances in which the commute time is at least 20 minutes. Then convert this into a percentage.
This is my solution for Q3; when I run this code I get the output shown below. Could anybody tell me whether my solution is correct?
commute <- c(17,16,20,24,22,15,21,15,17,22)
commute[commute==24] <- 18
n <- length(commute)
sum((commute>=20)/n)
#[1] 0.4
To complete user20650's answer, you could use a formatted string to display the outcome as a percentage, as requested:
sprintf("%0.2f%%",100* mean(commute>=20))
[1] "40.00%"

How to switch rows in R?

I have an array with the following content:
> head(MEAN)
1901DJF 1901JJA 1901MAM 1901SON 1902DJF 1902JJA
-0.45451556 -0.72922229 -0.17669396 -1.12095590 -0.86523850 -0.04031273
This should be a time series of seasonal mean values from 1901 to 2009. The problem is that the generated column names are strictly alphabetically ordered. In terms of seasons this doesn't make much sense, e.g. JJA (June, July, August) comes before MAM (March, April, May).
How could I switch each MAM and JJA entry of the array?
PS: MEAN is generated by applying tapply to the data.frame pdsi:
> head(pdsi)
date scPDSI month seas seasyear
1 1901-01-01 -0.10881074 Jan DJF 1901DJF
2 1901-02-01 -0.22287750 Feb DJF 1901DJF
3 1901-03-01 -0.12233192 Mär MAM 1901MAM
4 1901-04-01 -0.04440915 Apr MAM 1901MAM
5 1901-05-01 -0.36334082 Mai MAM 1901MAM
6 1901-06-01 -0.52079030 Jun JJA 1901JJA
>
> MEAN <- tapply(pdsi$scPDSI, ts.pdsi$seasyear, mean, na.rm = T)
Maybe there is also a more elegant way to calculate seasonal means...
You can change the order of the factor levels:
pdsi[["seasyear"]] = factor(pdsi[["seasyear"]], levels = c("1901DJF", "1901MAM", etc))
Alternatively, I think a fairly simple way of re-ordering your means is the following; it does assume, however, that your data are already ordered chronologically in the data set. If that holds, this should work.
I also created some random data rather than copying yours, but the result should be the same.
seasons = c("1901DJF", "1901MAM", "1901JJA")
seasons = rep(seasons, c(2, 3, 1))
data = data.frame(runif(1:6), seasons)
MEAN = tapply(data[,1], data[,2], mean)
1901DJF 1901JJA 1901MAM
0.5799779 0.3724785 0.6514327
order = unique(seasons)
MEAN[order]
1901DJF 1901MAM 1901JJA
0.5799779 0.6514327 0.3724785
What this does is take the order of seasyear in the data set, and reorders the object MEAN to reflect that order. Again, it assumes your data is chronologically ordered in the raw file, but I think this is a safe assumption. Apologies if it is not the case.

Assigning week numbers in a time series to obtain weekly average price

Let's say I have a time series with daily data (business days), and I would like to organize the data by business weeks (Monday-Friday), in a similar fashion to this webpage from the EIA on crude oil futures prices:
http://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RCLC1&f=D
As you can see the prices are nicely organized by weeks in this webpage.
Is there any function in R that could organize the data in a similar fashion?
You can obtain the data in .xls format at:
http://www.eia.gov/dnav/pet/hist_xls/RCLC1d.xls
What I would like to do is assign a week number to each daily observation, something like this (look at the weeks column):
Date Price weeks day
1983-04-04 29.44 1 Monday
1983-04-05 29.71 1 Tuesday
1983-04-06 29.92 1 Wednesday
1983-04-07 30.17 1 Thursday
1983-04-08 30.38 1 Friday
1983-04-11 30.26 2 Monday
...
...
So far I have used the week() function from the lubridate package, but it is not working well. It seems that once a year hits the 53rd week, the function fails to start the following year's week count properly.
I have been trying to stay away from rep- or seq-by-5/7 kinds of solutions, since there may be some observations that I need to filter from the data later on. I would prefer a solution that doesn't depend on the particular vector of my data but is more general, i.e. one that works from a date class such as POSIXct, xts or zoo.
Any hints would be greatly appreciated.
Wouldn't this work?:
as.POSIXlt()$yday %/% 7
I realize that it does have part of what you wanted to avoid, but it does draw its starting point from a recognized class. For your data (noting that I read it in with colClasses = c("Date", "numeric", "numeric", "character")):
> 1 + as.POSIXlt(dat$Date)$yday %/% 7
[1] 14 14 14 14 14 15
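Another route (a sketch) is to count whole weeks from the first Monday in the series using plain Date arithmetic; this stays in the Date class and reproduces the 1, 1, ..., 2 numbering from the example above, assuming dat$Date is of class Date and the series starts on Monday 1983-04-04:
dat$weeks <- as.numeric(dat$Date - as.Date("1983-04-04")) %/% 7 + 1  # week index, 1-based
dat$day   <- weekdays(dat$Date)                                      # weekday name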
If you want to replicate those interval labels, try adding multiples of 7 to any Monday and Friday:
paste(as.Date(strptime("1983 Apr- 4", format = "%Y %b- %d")) + 39*7,
      " to ",
      as.Date(strptime("1983 Apr- 8", format = "%Y %b- %d")) + 39*7,
      sep = "")
#[1] "1984-01-02 to 1984-01-06" # The first new year change
paste(as.Date(strptime("1983 Apr- 4", format = "%Y %b- %d")) + (39 + 52)*7,
      " to ",
      as.Date(strptime("1983 Apr- 8", format = "%Y %b- %d")) + (39 + 52)*7,
      sep = "")
#[1] "1984-12-31 to 1985-01-04" # The second new year change
#[1] "1984-12-31 to 1985-01-04" # The second new year change
Here's a function that will accept an integer vector:
from8Apr83dts <- function(numwks) {
  paste(as.Date(strptime("1983 Apr- 4", format = "%Y %b- %d")) + numwks*7,
        " to ",
        as.Date(strptime("1983 Apr- 8", format = "%Y %b- %d")) + numwks*7,
        sep = "")
}
# Usage
from8Apr83dts(39:40)
#[1] "1984-01-02 to 1984-01-06" "1984-01-09 to 1984-01-13"
