I have a data frame like,
2015-01-30 1 Fri
2015-01-30 2 Sat
2015-02-01 3 Sun
2015-02-02 1 Mon
2015-02-03 1 Tue
2015-02-04 1 Wed
2015-02-05 1 Thu
2015-02-06 1 Fri
2015-02-07 1 Sat
2015-02-08 1 Sun
I want to aggregate it to weekly level such that every week starts on Monday and ends on Sunday. So, in the aggregated data for the above, the first week should end on 2015-02-01.
The output for the above should look something like this:
firstweek 6
secondweek 7
I tried this,
library(xts)
data <- as.xts(data$value, order.by = as.Date(data$interval))
weekly <- apply.weekly(data,sum)
But in the final result, every week starts on Sunday.
This should work. I've called the data frame m and named the columns possibly differently from yours.
library(dplyr) # install.packages("dplyr")
colnames(m) = c("Date", "count","Day")
start = as.Date("2015-01-26") # a Monday on or before the first date
m$Week <- floor(unclass(as.Date(m$Date) - as.Date(start)) / 7) + 1
m$Week = as.numeric(m$Week)
m %>% group_by(Week) %>% summarise(count = sum(count))
dplyr is great for data manipulation, but this is just a rough hack to get the week number in.
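As a possible alternative that avoids hard-coding the start date, each row can be keyed by the Monday its week begins on. This is a minimal base R sketch, not part of the answer above; the column name week_start is made up for illustration, and the dates are copied from the question exactly as posted.

```r
# Example data from the question (only the date and count columns are needed)
m <- data.frame(
  Date  = as.Date(c("2015-01-30", "2015-01-30", "2015-02-01", "2015-02-02",
                    "2015-02-03", "2015-02-04", "2015-02-05", "2015-02-06",
                    "2015-02-07", "2015-02-08")),
  count = c(1, 2, 3, 1, 1, 1, 1, 1, 1, 1)
)

# "%u" is the weekday as a number, 1 (Monday) to 7 (Sunday).
# Subtracting (weekday - 1) days moves every date back to its Monday.
m$week_start <- m$Date - (as.integer(format(m$Date, "%u")) - 1L)

# Sum the counts within each Monday-to-Sunday week
weekly <- aggregate(count ~ week_start, data = m, FUN = sum)
weekly
#   week_start count
# 1 2015-01-26     6
# 2 2015-02-02     7
```

Keying on the Monday date rather than a week number also keeps weeks distinct across years.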
Convert to date and use the %W format to get a week number...
df <- read.csv(textConnection("2015-01-30, 1, Fri
2015-01-30, 2, Sat
2015-02-01, 3, Sun
2015-02-02, 1, Mon
2015-02-03, 1, Tue
2015-02-04, 1, Wed
2015-02-05, 1, Thu
2015-02-06, 1, Fri
2015-02-07, 1, Sat
2015-02-08, 1, Sun"), header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
names(df) <- c("date", "something", "day")
df$date <- as.Date(df$date, format="%Y-%m-%d")
df$week <- format(df$date, "%W")
aggregate(df$something, list(df$week), sum)
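One caveat worth checking if the data span a year boundary: %W restarts at 00 in January, so a single Monday-to-Sunday week can be split in two. The sketch below illustrates this with two dates that are not in the question's data, and shows the ISO 8601 codes %G and %V (which ?strptime documents as output-only) as one possible way to keep such a week together.

```r
# Wednesday and Thursday of the same Monday-to-Sunday week
d <- as.Date(c("2014-12-31", "2015-01-01"))

format(d, "%W")      # "52" "00": the week is split at the year boundary
format(d, "%G-W%V")  # "2015-W01" "2015-W01": ISO week-year and week number agree
```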
With dplyr and lubridate this is really easy, thanks to the function isoweek:
library(dplyr)
library(lubridate)
my.df <- read.table(header=FALSE, text=
'2015-01-30 1 Fri
2015-01-30 2 Sat
2015-02-01 3 Sun
2015-02-02 1 Mon
2015-02-03 1 Tue
2015-02-04 1 Wed
2015-02-05 1 Thu
2015-02-06 1 Fri
2015-02-07 1 Sat
2015-02-08 1 Sun')
my.df %>% mutate(week = isoweek(V1)) %>% group_by(week) %>% summarise(sum(V2))
or a bit shorter
my.df %>% group_by(isoweek(V1)) %>% summarise(sum(V2))
I would like to format my date variable to %d %b %Y (e.g. 05 May 2020). However, once it has been formatted, it becomes a character variable and sorting the variable from the earliest date to the latest date would not be possible (e.g. 05 May 2020 is sorted before 26 Apr 2020).
Data:
df <- structure(list(Date = structure(c(1588204800, 1587945600, 1588464000, 1588032000,
1588291200, 1588377600, 1588118400), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA, -7L))
# > df
# Date
# 1 2020-04-30
# 2 2020-04-27
# 3 2020-05-03
# 4 2020-04-28
# 5 2020-05-01
# 6 2020-05-02
# 7 2020-04-29
Here is what sorting a formatted date variable looks like:
df %>%
mutate(Date = format(Date, "%d %b %Y")) %>%
arrange(Date)
# Date
# 1 01 May 2020
# 2 02 May 2020
# 3 03 May 2020
# 4 27 Apr 2020
# 5 28 Apr 2020
# 6 29 Apr 2020
# 7 30 Apr 2020
So, this is what I have done, which works, but I would like to know if this is really correct or if there are alternatives to solve this.
df %>%
mutate(Date = factor(Date, labels = format(sort(unique(Date)), "%d %b %Y"), ordered = TRUE)) %>%
arrange(Date)
# Date
# 1 27 Apr 2020
# 2 28 Apr 2020
# 3 29 Apr 2020
# 4 30 Apr 2020
# 5 01 May 2020
# 6 02 May 2020
# 7 03 May 2020
Edit:
Actually, the reason for wanting to format and arrange it is so that I have direct access to more readable date formats when building my dashboard for my users.
When it comes to ggplot(), even after arrange() and mutate() with format(), the faceted plots are always shown in sorted character order. Example below:
df %>%
arrange(Date) %>%
mutate(n = 1:n(),
Date = format(Date, "%d %b %Y")) %>%
ggplot() +
geom_bar(aes(x = n)) +
facet_wrap(~Date)
If you want to use dates in plots, the main idea is to set the factor levels in the order in which you want to show the data. Arrange the dates first, then attach factor levels based on the order in which the formatted dates occur.
library(dplyr)
library(ggplot2)
df %>%
arrange(Date) %>%
mutate(n = row_number(),
Date = format(Date, "%d %b %Y"),
Date = factor(Date, levels = unique(Date))) %>%
ggplot() + geom_bar(aes(x = n)) + facet_wrap(~Date)
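The effect of the factor levels can be checked without plotting. The following is a small base R sketch on three of the dates from the question; the variable names are illustrative only.

```r
dates <- as.POSIXct(c("2020-04-30", "2020-04-27", "2020-05-03"), tz = "UTC")

# Sort while the values are still date-times, then format
labels <- format(sort(dates), "%d %b %Y")

# Levels follow order of first appearance, which is chronological here
f <- factor(labels, levels = unique(labels))

as.character(sort(f))  # chronological: 27 Apr, 30 Apr, 03 May
sort(labels)           # plain character sort puts "03 May 2020" first
```

Because ggplot2 orders facets by factor level, the first ordering is the one that carries over to the plot.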
My original solution is below, but the better solution is so simple it hurts a little that I didn't spot it immediately - do your arrange() before your mutate() - at that point it is a date-type variable so will sort the way you want it to:
df %>%
arrange(Date) %>%
mutate(Date = format(Date, "%d %b %Y"))
Giving:
Date
1 27 Apr 2020
2 28 Apr 2020
3 29 Apr 2020
4 30 Apr 2020
5 01 May 2020
6 02 May 2020
7 03 May 2020
Alternatively, you could add an as.Date(..., format = "%d %b %Y") to your arrange():
df %>%
mutate(Date = format(Date, "%d %b %Y")) %>%
arrange(as.Date(Date, format = "%d %b %Y"))
Personally, I prefer the tidyverse solution for dates - lubridate. Here:
library(lubridate)
df %>%
mutate(Date = ymd(Date)) %>%
arrange(Date)
In short, you can parse your dates by combining d for day, m for month and y for year. You can add time, too. For example,
ymd_hms("20150102 12:23:01")
As the example shows, we do not have to bother about the separator. If you have access, this is a nice paper on that package. Otherwise, there are many tutorials out there on lubridate.
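For comparison, a base R sketch of the same parse is shown here. Unlike ymd_hms(), as.POSIXct() needs the format spelled out; the time zone is set to UTC as an assumption so the result does not depend on the local setting.

```r
# Base R counterpart of ymd_hms("20150102 12:23:01"), with an explicit format
x <- as.POSIXct("20150102 12:23:01", format = "%Y%m%d %H:%M:%S", tz = "UTC")
format(x, "%Y-%m-%d %H:%M:%S")  # "2015-01-02 12:23:01"
```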
In R, how can I produce a list of dates of all 2nd to last Wednesdays of the month in a specified date range? I've tried a few things but have gotten inconsistent results for months with five Wednesdays.
To generate a regular sequence of dates you can use seq with dates for the from and to parameters. See the seq.Date documentation for more options.
Create a data frame with the date, the month and the weekday, and then obtain the second-to-last Wednesday for each month with the help of aggregate.
day_sequence = seq(as.Date("2020/1/1"), as.Date("2020/12/31"), "day")
df = data.frame(day = day_sequence,
month = months(day_sequence),
weekday = weekdays(day_sequence))
# Keep only Wednesdays (weekdays() returns locale-dependent names)
df = df[df$weekday == "Wednesday",]
result = aggregate(day ~ month, df, function(x){head(tail(x,2),1)})
tail(x, 2) returns the last two elements, then head(..., 1) gives the first of those two.
Result:
month day
1 April 2020-04-22
2 August 2020-08-19
3 December 2020-12-23
4 February 2020-02-19
5 January 2020-01-22
6 July 2020-07-22
7 June 2020-06-17
8 March 2020-03-18
9 May 2020-05-20
10 November 2020-11-18
11 October 2020-10-21
12 September 2020-09-23
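Two things to watch with this approach: month names alone would merge the same month of different years if the range spans more than one year, and weekdays() depends on the locale. A hedged variant that groups by year-month and uses the numeric weekday code %u (1 = Monday, so Wednesday is "3") might look like this; the date range and variable names are illustrative.

```r
# A range that crosses a year boundary
days <- seq(as.Date("2019-12-01"), as.Date("2020-01-31"), by = "day")

# Keep only Wednesdays, independent of locale
wed <- days[format(days, "%u") == "3"]

# Group by year-month and take the second-to-last Wednesday of each group
ym <- format(wed, "%Y-%m")
picked <- vapply(split(wed, ym),
                 function(x) format(head(tail(x, 2), 1)),
                 character(1))
picked
#      2019-12      2020-01
# "2019-12-18" "2020-01-22"
```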
There are probably simpler ways of doing this, but the function below does what the question asks for. It returns a named vector of days such that:
they are between from and to;
their weekday is day, where 1 is Monday;
they are the n to last such weekday of the month.
By n to last I mean the nth counting from the end of the month.
whichWeekday <- function(from, to, day, n, format = "%Y-%m-%d"){
from <- as.Date(from, format = format)
to <- as.Date(to, format = format)
day <- as.character(day)
d <- seq(from, to, by = "days")
m <- format(d, "%Y-%m")
f <- c(TRUE, m[-1] != m[-length(m)])
f <- cumsum(f)
wed <- tapply(d, f, function(x){
i <- which(format(x, "%u") == day)
x[ tail(i, n)[1] ]
})
y <- as.Date(wed, origin = "1970-01-01")
setNames(y, format(y, "%Y-%m"))
}
whichWeekday("2019-01-01", "2020-03-31", 3, 2) # day = 3 is Wednesday under "%u"
# 2019-01 2019-02 2019-03 2019-04 2019-05
#"2019-01-23" "2019-02-20" "2019-03-20" "2019-04-17" "2019-05-22"
# 2019-06 2019-07 2019-08 2019-09 2019-10
#"2019-06-19" "2019-07-24" "2019-08-21" "2019-09-18" "2019-10-23"
# 2019-11 2019-12 2020-01 2020-02 2020-03
#"2019-11-20" "2019-12-18" "2020-01-22" "2020-02-19" "2020-03-18"
I have a column with date formatted as MM-DD-YYYY, in the Date format.
I want to add 2 columns one which only contains YYYY and the other only contains MM.
How do I do this?
Once again base R gives you all you need, and you should not do this with sub-strings.
Here we first create a data.frame with a proper Date column. If your date is in text format, parse it first with as.Date() or my anytime::anydate() (which does not need formats).
Then given the date creating year and month is simple:
R> df <- data.frame(date=Sys.Date()+seq(1,by=30,len=10))
R> df[, "year"] <- format(df[,"date"], "%Y")
R> df[, "month"] <- format(df[,"date"], "%m")
R> df
date year month
1 2017-12-29 2017 12
2 2018-01-28 2018 01
3 2018-02-27 2018 02
4 2018-03-29 2018 03
5 2018-04-28 2018 04
6 2018-05-28 2018 05
7 2018-06-27 2018 06
8 2018-07-27 2018 07
9 2018-08-26 2018 08
10 2018-09-25 2018 09
R>
If you want year or month as integers, you can wrap as.integer() around the format() call.
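Put together for text dates in MM-DD-YYYY form, a minimal sketch could look like the following. The two sample dates are the ones used as example data elsewhere in this thread; the format string is the only part that needs adapting to other layouts.

```r
# Parse MM-DD-YYYY text into a proper Date column first
df <- data.frame(date = as.Date(c("05-21-2017", "06-25-2015"), format = "%m-%d-%Y"))

# Then extract year and month as integers
df$year  <- as.integer(format(df$date, "%Y"))
df$month <- as.integer(format(df$date, "%m"))
df
#         date year month
# 1 2017-05-21 2017     5
# 2 2015-06-25 2015     6
```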
A base R option would be to remove the substring with sub and then read with read.table
df1[c('month', 'year')] <- read.table(text=sub("-\\d{2}-", ",", df1$date), sep=",")
Or using tidyverse
library(tidyverse)
separate(df1, date, into = c('month', 'day', 'year')) %>%
  select(-day)
Note: it may be better to convert to a Date class instead of manipulating the string.
library(lubridate) # mdy(), month(), year()
df1 %>%
  mutate(date = mdy(date), month = month(date), year = year(date))
data
df1 <- data.frame(date = c("05-21-2017", "06-25-2015"))
Consider code:
library('zoo')
data <- c(1, 2, 4, 6)
dates <- c("2016-11-01", "2016-12-01", "2017-02-01", "2017-04-01")
z1 <- zoo(data, as.yearmon(dates))
z2 <- na.approx(z1)
Variable z2 looks like this:
nov 2016 dec 2016 feb 2017 apr 2017
1 2 4 6
But I need z2 to be similar to this:
nov 2016 dec 2016 jan 2017 feb 2017 mar 2017 apr 2017
1 2 3 4 5 6
I just need to approximate values for months where value is missing. Thanks for any hints.
With the new as.zoo argument, calendar, in zoo 1.8 (which defaults to TRUE so we don't have to specify it) we can just convert the input to "ts" and then back to "zoo" again applying na.approx after that:
na.approx(as.zoo(as.ts(z2)))
## Nov 2016 Dec 2016 Jan 2017 Feb 2017 Mar 2017 Apr 2017
## 1 2 3 4 5 6
With prior versions of zoo we can do the same but manually convert the index back to "yearmon":
na.approx(aggregate(as.zoo(as.ts(z2)), as.yearmon, c))
magrittr
Using zoo with magrittr these can be expressed as the following pipelines, respectively:
library(magrittr)
z2 %>% as.ts %>% as.zoo %>% na.approx
z2 %>% as.ts %>% as.zoo %>% aggregate(as.yearmon, c) %>% na.approx
One way using just na.approx and base R:
#add your data and dates together
df <- data.frame(data, dates = as.Date(dates))
#create all dates using seq
new_dates <- data.frame(dates = seq(as.Date(dates[1]), as.Date(dates[4]), by = 'month'))
#merge the two and then na.approx
new_df <- merge(new_dates, df, by = 'dates', all.x = TRUE)
na.approx(new_df$data)
Out:
[1] 1 2 3 4 5 6
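If zoo is not available, the same fill can be sketched with approx() from the stats package. The assumption made here is that months should be treated as evenly spaced, so the interpolation runs over a month index rather than over day counts; month_index is a hypothetical helper name.

```r
data  <- c(1, 2, 4, 6)
dates <- as.Date(c("2016-11-01", "2016-12-01", "2017-02-01", "2017-04-01"))

# Every month between the first and last observation
all_months <- seq(min(dates), max(dates), by = "month")

# Map each date to a running month number so that months are equally spaced
month_index <- function(d) {
  as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m"))
}

# Linear interpolation at every month in the full sequence
filled <- approx(x = month_index(dates), y = data, xout = month_index(all_months))$y
filled
# [1] 1 2 3 4 5 6
```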
If I have a date, say "2014-05-13" and I want to calculate the month in decimal, I would do this:
5 + 13/31 = 5.419355
How would it be possible in R to take a vector of dates and turn it into a "month decimal" vector?
For example:
dates = c("2010-01-24", "2013-04-08", "2014-03-05", "2013-03-08", "2014-02-14",
"2004-01-28", "2006-02-21", "2013-03-28", "2013-04-01", "2006-02-14",
"2006-01-28", "2014-01-19", "2012-03-12", "2014-01-30", "2005-04-17")
library(lubridate)
month(dates) + day(dates)/31
As you can see, it would be wrong to use "31" as the divisor, since the number of days differs depending on the month, and sometimes on the year (leap years).
So what would be the best solution?
You can use the monthDays function from the Hmisc package:
> require(Hmisc)
> library(lubridate)
> month(dates) + day(dates)/monthDays(dates)
[1] 1.774194 4.266667 3.161290 3.258065 2.500000 1.903226 2.750000 3.903226 4.033333
[10] 2.500000 1.903226 1.612903 3.387097 1.967742 4.566667
With magrittr,
library(magrittr)
library(lubridate)
dates %>% ymd() %>% { month(.) + day(.) / days_in_month(.) }
## Jan Apr Mar Mar Feb Jan Feb Mar Apr Feb Jan
## 1.774194 4.266667 3.161290 3.258065 2.500000 1.903226 2.750000 3.903226 4.033333 2.500000 1.903226
## Jan Mar Jan Apr
## 1.612903 3.387097 1.967742 4.566667
The vector gets named because days_in_month() returns a vector named by month, so add %>% unname() if you like.
Here is a base R hack that uses a trick I've seen on SO to get the first day of the next month and subtract 1 to return the last day of the month of interest.
# format dates to Date class
dates <- as.Date(dates)
# get the next month
nextMonths <- as.integer(substr(dates, 6, 7)) + 1L
# replace next month with 1 if it is equal to 13
nextMonths[nextMonths == 13] <- 1L
# extract the number of days using date formatting (%d), paste, and subtraction
dayCount <- as.integer(format(as.Date(paste(substr(dates, 1, 4),
nextMonths, "01", sep="-"))-1L, format="%d"))
dayCount
[1] 31 30 31 31 28 31 28 31 30 28 31 31 31 31 30
# get month with fraction using date formatting (%m)
as.integer(format(dates, format="%m")) + (as.integer(format(dates, format="%d")) / dayCount)
[1] 1.774194 4.266667 3.161290 3.258065 2.500000 1.903226 2.750000 3.903226 4.033333 2.500000
[11] 1.903226 1.612903 3.387097 1.967742 4.566667
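A variation on the same idea that avoids the substr() and paste() steps is sketched below. It relies on the fact that adding 31 days to the first of any month always lands in the following month; the function names days_in_month_base and month_decimal are made up for this example.

```r
# Number of days in the month of each date, using only base R
days_in_month_base <- function(d) {
  d <- as.Date(d)
  first <- as.Date(format(d, "%Y-%m-01"))                # first day of this month
  next_first <- as.Date(format(first + 31, "%Y-%m-01"))  # first day of next month
  as.integer(next_first - first)                         # difference in days
}

# Month number plus the fraction of the month that has elapsed
month_decimal <- function(d) {
  d <- as.Date(d)
  as.integer(format(d, "%m")) + as.integer(format(d, "%d")) / days_in_month_base(d)
}

days_in_month_base(c("2014-02-14", "2012-02-10", "2013-12-25"))  # 28 29 31
month_decimal("2014-05-13")                                     # 5.419355
```

December and leap-year Februaries need no special handling because the date arithmetic rolls over the year by itself.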