I have a dataframe with a column named date structured as below. Note that this is a small sample of my dataframe. I have different months and different years (my main date range is from 2005-01-03 to 2021-12-31). I want to count the number of days in each month and year combination, i.e. 2 days in 2005-12, 3 days in 2006-01, and so on. How can I get a vector of these counts?
df$date <- as.Date(c(
"2005-12-28", "2005-12-31", "2006-01-01", "2006-01-02", "2006-01-03", "2006-02-04", "2007-03-02", "2007-03-03", "2007-03-06", "2007-04-10", "2007-04-11"))
library(dplyr)
df %>%
# distinct(date) %>% # unnecessary if no dupe dates
mutate(month = lubridate::floor_date(date, "month")) %>%
count(month)
Result
month n
1 2005-12-01 2
2 2006-01-01 3
3 2006-02-01 1
4 2007-03-01 3
5 2007-04-01 2
Data used:
df <- structure(list(date = structure(c(13145, 13148, 13149, 13150,
13151, 13183, 13574, 13575, 13578, 13613, 13614), class = "Date")), row.names = c(NA,
-11L), class = "data.frame")
df %>% mutate(date = format(.$date, "%Y-%m")) %>% group_by(date) %>% count(date) -> out
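out gives you the summary by year and month as a tibble.
Here is another solution in base R. Note that it tabulates months and years separately rather than each year-month combination: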
a <- as.Date(c("2005-12-28", "2005-12-31", "2006-01-01",
"2006-01-02", "2006-01-03", "2006-02-04",
"2007-03-02", "2007-03-03", "2007-03-06",
"2007-04-10", "2007-04-11"))
date <- strsplit(as.character(a) , "-")
# to extract months
months <- lapply(date , function(x) x[2])
# to extract years
years <- lapply(date , function(x) x[1])
table(unlist(months))
#>
#> 01 02 03 04 12
#> 3 1 3 2 2
table(unlist(years))
#>
#> 2005 2006 2007
#> 2 4 5
Created on 2022-06-01 by the reprex package (v2.0.1)
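If the goal is one count per year-month combination, a base R sketch is to format each date as "%Y-%m" and tabulate that single key. This assumes the dates are already of class Date and that each date appears only once; the names ym, counts and count_vector are illustrative and not part of the answers above.

```r
a <- as.Date(c("2005-12-28", "2005-12-31", "2006-01-01",
               "2006-01-02", "2006-01-03", "2006-02-04",
               "2007-03-02", "2007-03-03", "2007-03-06",
               "2007-04-10", "2007-04-11"))

# One key per year-month combination, e.g. "2005-12"
ym <- format(a, "%Y-%m")

# Named table of counts, ordered by the key
counts <- table(ym)

# Plain integer vector of the counts, in the same order as names(counts)
count_vector <- as.vector(counts)

print(counts)
print(count_vector)
```

Because the key is zero-padded year-month text, the alphabetical order of the table is also the chronological order.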
I have a column in my large data set called Date. How do I extract both the year and month from it? I would like to create a column Month where the month goes from 1-12 and year where the year goes from the first year in my data set to the last year in my data set.
Thanks.
> typeof(data$Date)
[1] "character"
> head(data$Date)
[1] "2/06/2020 11:23" "12/06/2020 7:56" "12/06/2020 7:56" "29/06/2020 16:54" "3/06/2020 15:09" "25/06/2020 17:11"
dplyr and lubridate -
library(dplyr)
library(lubridate)
data <- data %>%
mutate(Date = dmy_hm(Date),
month = month(Date),
year = year(Date))
# Date month year
#1 2020-06-02 11:23:00 6 2020
#2 2020-06-12 07:56:00 6 2020
#3 2020-06-12 07:56:00 6 2020
#4 2020-06-29 16:54:00 6 2020
#5 2020-06-03 15:09:00 6 2020
#6 2020-06-25 17:11:00 6 2020
Base R -
data$Date <- as.POSIXct(data$Date, tz = 'UTC', format = '%d/%m/%Y %H:%M')
data <- transform(data, Month = format(Date, '%m'), Year = format(Date, '%Y'))
data
data <- structure(list(Date = c("2/06/2020 11:23", "12/06/2020 7:56",
"12/06/2020 7:56", "29/06/2020 16:54", "3/06/2020 15:09", "25/06/2020 17:11"
)), class = "data.frame", row.names = c(NA, -6L))
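Note that format() returns character strings such as "06". If integer months from 1 to 12 are wanted, one possible base R sketch is to parse into POSIXlt and read its components. The sample vector dates_chr and the variable names are illustrative; the day/month/year layout and the UTC time zone are assumptions carried over from the answer above.

```r
dates_chr <- c("2/06/2020 11:23", "12/06/2020 7:56", "29/06/2020 16:54")

# Parse day/month/year hour:minute strings
lt <- as.POSIXlt(dates_chr, tz = "UTC", format = "%d/%m/%Y %H:%M")

# POSIXlt stores months as 0-11 and years as years since 1900
month <- as.integer(lt$mon) + 1L
year  <- as.integer(lt$year) + 1900L

print(data.frame(Date = dates_chr, Month = month, Year = year))
```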
I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from start to end month (determined from floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence). So, unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
mutate(month = map2(floor_date(start_date, "month"),
floor_date(completed_date, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when the start month and completed month are the same, if that can occur), you can end each sequence one month earlier. The code below does this by subtracting 1 day from the first of the completed month, which lands in the previous month, and uses pmax so that a row whose start and end months coincide still counts that month.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
seq.Date,
by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
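The expand-then-count idea can also be sketched in base R on a tiny, made-up table. The data frame events, the helper first_of_month and the other names here are hypothetical and only illustrate the technique; they are not part of the answer above.

```r
# Two hypothetical events
events <- data.frame(
  start_date     = as.Date(c("2011-01-14", "2011-02-20")),
  completed_date = as.Date(c("2011-03-03", "2011-02-25"))
)

# Hypothetical helper: first day of the month containing each date
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

s <- first_of_month(events$start_date)
e <- first_of_month(events$completed_date)

# For each event, list every month from start to completion (inclusive)
month_lists <- lapply(seq_len(nrow(events)), function(i) {
  format(seq(s[i], e[i], by = "month"), "%Y-%m")
})

# Count how many events were pending in each month
pending <- table(unlist(month_lists))
print(pending)
```

The first event spans January to March and the second only February, so February is counted twice and the other two months once.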
I have some data in a format like the reproducible example below (code for data input below the question, at the end). Two things:
Not all dates have a value (i.e. many dates are missing).
Some dates have multiple values, eg 16 June 2020.
#> date value
#> 1 30-Jun-20 20
#> 2 29-Jun-20 -100
#> 3 26-Jun-20 -4
#> 4 16-Jun-20 -13
#> 5 16-Jun-20 40
#> 6 9-Jun-20 -6
For two week periods, ending on Tuesdays, I would like to take a sum of the value column.
So in the example data above, I want to sum ending on:
two weeks ending on Tuesday 16 June 2020 (i.e. from 3 June 2020 - 16 June 2020, inclusive)
two weeks ending on Tuesday 30 June 2020 (17 June 2020 - 30 June 2020 inclusive)
I'd ultimately like the code to continue summing all two week periods ending on every second Tuesday for when there's more data.
So my desired output is:
#2_weeks_end total
#30-Jun-20 -84
#16-Jun-20 21
Tidyverse and lubridate solutions would be my first preference.
Code for data input below:
df <- data.frame(
stringsAsFactors = FALSE,
date = c("30-Jun-20","29-Jun-20",
"26-Jun-20","16-Jun-20","16-Jun-20","9-Jun-20"),
value = c(20L, -100L, -4L, -13L, 40L, -6L)
)
df
Solution using findInterval().
library(dplyr)
library(lubridate)
df$date <- dmy(df$date)
df_intervals <- seq(as.Date("2020-06-03"), as.Date("2020-06-03")+14*3, 14)
df %>%
mutate(interval = findInterval(date, df_intervals)) %>%
mutate(`2_weeks_end` = df_intervals[interval+1]-1) %>%
group_by(`2_weeks_end`) %>%
summarise(total= sum(value))
Returns:
# A tibble: 2 x 2
2_weeks_end total
<date> <int>
1 2020-06-16 21
2 2020-06-30 -84
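To see how findInterval() assigns dates to two-week bins, here is a small base R sketch. The breakpoints start on Wednesday 3 June 2020 as in the answer; the sample dates and the names breaks, idx and period_end are illustrative.

```r
# Bin start dates, 14 days apart (each bin runs Wednesday to Tuesday)
breaks <- seq(as.Date("2020-06-03"), by = 14, length.out = 3)

dates <- as.Date(c("2020-06-09", "2020-06-16", "2020-06-17", "2020-06-30"))

# Index of the last break that is <= each date
idx <- findInterval(dates, breaks)

# The bin ends the day before the next break starts
period_end <- breaks[idx + 1] - 1

print(data.frame(date = dates, bin = idx, period_end = period_end))
```

Here 16 June is the last day of the first bin and 17 June is the first day of the second, so the bins are inclusive at both ends.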
Here is an option if a weekly period (or any other unit that lubridate supports by default) is acceptable:
library(dplyr)
library(lubridate)
df%>%
mutate(date = as.Date(date, format = "%d-%b-%y"))%>%
group_by(week_ceil = ceiling_date(date - 1L, unit = "week", week_start = 2L))%>%
summarize(sums = sum(value))
Here is a data.table approach that creates a reference table followed by a non-equi join:
library(data.table)
library(lubridate) # for floor_date()
setDT(df)
df[, date := as.Date(date, format = "%d-%b-%y")]
ref_dt = df[, .(beg_date = seq.Date(from = floor_date(min(date), unit = "week", week_start = 3L),
to = max(date),
by = "2 weeks"))]
ref_dt[, end_date := beg_date +13L]
df[ref_dt,
on = .(date >= beg_date,
date <= end_date),
sum(value),
by = .EACHI]
## date date V1
##1: 2020-06-03 2020-06-16 21
##2: 2020-06-17 2020-06-30 -84
I have dates in the format mm/yyyy in column 1, and then results in column 2.
month Result
01/2018 96.13636
02/2018 96.40000
03/2018 94.00000
04/2018 97.92857
05/2018 95.75000
11/2017 98.66667
12/2017 97.78947
How can I order by month so that it starts from the first month (11/2017) and ends with the last (05/2018)?
I have tried a few 'orders', but none seem to be ordering by year and then by month
In tidyverse (w/ lubridate added):
library(tidyverse)
library(lubridate)
dfYrMon <-
df1 %>%
mutate(date = parse_date_time(month, "my"),
year = year(date),
month = month(date)
) %>%
arrange(year, month) %>%
select(date, year, month, result)
With data:
df1 <- tibble(month = c("01/2018", "02/2018", "03/2018", "04/2018", "05/2018", "11/2017", "12/2017"),
result = c(96.13636, 96.4, 94, 97.92857, 95.75, 98.66667, 97.78947))
Will get you this 'dataframe':
# A tibble: 7 x 4
date year month result
<dttm> <dbl> <dbl> <dbl>
1 2017-11-01 2017 11 98.66667
2 2017-12-01 2017 12 97.78947
3 2018-01-01 2018 1 96.13636
4 2018-02-01 2018 2 96.40000
5 2018-03-01 2018 3 94.00000
6 2018-04-01 2018 4 97.92857
7 2018-05-01 2018 5 95.75000
Making your data values atomic (year in its own column, month in its own column) generally improves the ease of manipulation.
Or if you want to use base R date manipulations instead of lubridate's:
library(tidyverse)
dfYrMon_base <-
df1 %>%
mutate(date = as.Date(paste0("01/", month), "%d/%m/%Y"),
year = format(date, "%Y"),
month = format(date, "%m")
) %>%
arrange(year, month) %>%
select(date, year, month, result)
dfYrMon_base
Note the datatypes created.
# A tibble: 7 x 4
date year month result
<date> <chr> <chr> <dbl>
1 2017-11-01 2017 11 98.66667
2 2017-12-01 2017 12 97.78947
3 2018-01-01 2018 01 96.13636
4 2018-02-01 2018 02 96.40000
5 2018-03-01 2018 03 94.00000
6 2018-04-01 2018 04 97.92857
7 2018-05-01 2018 05 95.75000
We can convert it to yearmon class and then do the order
library(zoo)
out <- df1[order(as.yearmon(df1$month, "%m/%Y"), df1$Result),]
row.names(out) <- NULL
out
# month Result
#1 11/2017 98.66667
#2 12/2017 97.78947
#3 01/2018 96.13636
#4 02/2018 96.40000
#5 03/2018 94.00000
#6 04/2018 97.92857
#7 05/2018 95.75000
data
df1 <- structure(list(month = c("01/2018", "02/2018", "03/2018", "04/2018",
"05/2018", "11/2017", "12/2017"), Result = c(96.13636, 96.4,
94, 97.92857, 95.75, 98.66667, 97.78947)), .Names = c("month",
"Result"), class = "data.frame",
row.names = c("1", "2", "3",
"4", "5", "6", "7"))
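If you would rather avoid extra packages, a minimal base R sketch is to build a sortable Date key by prefixing a day and then order on it. The names key and sorted are illustrative; the "01/" prefix is just an arbitrary day chosen so that as.Date can parse the string.

```r
df1 <- data.frame(
  month  = c("01/2018", "02/2018", "03/2018", "04/2018",
             "05/2018", "11/2017", "12/2017"),
  Result = c(96.13636, 96.4, 94, 97.92857, 95.75, 98.66667, 97.78947),
  stringsAsFactors = FALSE
)

# Turn "mm/yyyy" into a real Date (first of the month) to sort on
key <- as.Date(paste0("01/", df1$month), format = "%d/%m/%Y")

# Reorder the rows chronologically and reset the row names
sorted <- df1[order(key), ]
row.names(sorted) <- NULL
print(sorted)
```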
Suppose I have a daily rain data.frame like this:
df.meteoro = data.frame(Dates = seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"),
rain = rnorm(length(seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"))))
I'm trying to sum the accumulated rain over 14-day intervals with this code:
library(tidyverse)
library(lubridate)
df.rain <- df.meteoro %>%
mutate(TwoWeeks = round_date(Dates, "14 days")) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
The problem is that it isn't starting on 2017-01-19 but on 2017-01-15 and I was expecting my output dates to be:
"2017-02-02" "2017-02-16" "2017-03-02" "2017-03-16" "2017-03-30" "2017-04-13"
"2017-04-27" "2017-05-11" "2017-05-25" "2017-06-08" "2017-06-22" "2017-07-06" "2017-07-20"
"2017-08-03" "2017-08-17" "2017-08-31" "2017-09-14" "2017-09-28" "2017-10-12" "2017-10-26"
"2017-11-09" "2017-11-23" "2017-12-07" "2017-12-21" "2018-01-04" "2018-01-18"
TL;DR I have a year-long daily rain data.frame and want to sum the accumulated rain for the dates above.
Please help.
Use of round_date in the way you have shown it will not give you 14-day periods as you might expect. I have taken a different approach in this solution and generated a sequence of dates between your first and last dates and grouped these into 14-day periods then joined the dates to your observations.
startdate = min(df.meteoro$Dates)
enddate = max(df.meteoro$Dates)
dateseq =
data.frame(Dates = seq.Date(startdate, enddate, by = 1)) %>%
mutate(group = as.numeric(Dates - startdate) %/% 14) %>%
group_by(group) %>%
mutate(starts = min(Dates))
df.rain <- df.meteoro %>%
right_join(dateseq) %>%
group_by(starts) %>%
summarise(sum_rain = sum(rain))
head(df.rain)
> head(df.rain)
# A tibble: 6 x 2
starts sum_rain
<date> <dbl>
1 2017-01-19 6.09
2 2017-02-02 5.55
3 2017-02-16 -3.40
4 2017-03-02 2.55
5 2017-03-16 -0.12
6 2017-03-30 8.95
The right join to the date sequence ensures that a period with no observations at all would still be listed in the result (though in your case you have a complete year of dates anyway).
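When every day is present, the same 14-day grouping can be sketched in base R with integer division and tapply. The seed and the names group, starts and sum_rain are illustrative, and the rain values are random, so only the period start dates are meaningful here.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
dates <- seq(as.Date("2017-01-19"), as.Date("2018-01-18"), by = "day")
rain  <- rnorm(length(dates))

# Days elapsed since the first date, integer-divided into blocks of 14
group <- as.numeric(dates - min(dates)) %/% 14

# Label each block by its first day
starts <- format(min(dates) + 14 * group)

# Sum the rain within each block
sum_rain <- tapply(rain, starts, sum)
print(head(sum_rain))
```

A full year of 365 days gives 26 complete blocks plus one final block containing a single day.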
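round_date rounds to the nearest multiple of unit (here, 14 days), but the multiples are not counted from your first observation. As far as I can tell, lubridate counts multi-day units from the first day of each month, which would explain why your first group lands on 2017-01-15 (day 1 plus 14). Either way, it doesn't line up with your purpose.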
To get what you want, you can do the following:
df.rain = df.meteoro %>%
mutate(days_since_start = as.numeric(Dates - as.Date("2017/1/18")),
TwoWeeks = as.Date("2017/1/18") + 14*ceiling(days_since_start/14)) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
This computes days_since_start as the days since 2017/1/18 and then manually rounds to the next multiple of two weeks.
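A few hand-picked dates show what the ceiling arithmetic produces. This is a base R sketch with illustrative names (origin, dates, two_weeks); it uses the same origin of 2017-01-18 as the code above.

```r
origin <- as.Date("2017-01-18")
dates  <- as.Date(c("2017-01-19", "2017-02-01", "2017-02-02"))

# Whole days elapsed since the origin
days_since_start <- as.numeric(dates - origin)

# Round up to the next multiple of 14 days and convert back to a date
two_weeks <- origin + 14 * ceiling(days_since_start / 14)

print(data.frame(date = dates, days_since_start, two_weeks))
```

With this origin, 19 January through 1 February share the label 2017-02-01 and 2 February starts the next period, labelled 2017-02-15. If you need different labels, such as the dates listed in the question, the origin would have to be shifted accordingly.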
Assuming you want to round to the closest date from the ones you have specified I guess the following will work
library(plyr)      # ddply() and .()
library(lubridate) # ymd() and interval()
targetDates <- seq(ymd("2017-02-02"), ymd("2018-01-18"), by = '14 days')
df.meteoro$Dates <- targetDates[sapply(df.meteoro$Dates, function(x) which.min(abs(interval(targetDates, x))))]
sum_rain <- ddply(df.meteoro, .(Dates), summarize, sum_rain = sum(rain, na.rm = TRUE))
As you can see, not all dates have the same number of observations. Date "2017-02-02", for instance, collects all the records from "2017-01-19" through "2017-02-09", which is 22 records. From "2017-02-10" on, dates are rounded to "2017-02-16", and so on.
This may be a cheat, but assuming each row/observation is a separate day, why not just group by every 14 rows and sum?
# Assign interval groups, each 14 rows
df.meteoro$my_group <-rep(1:100, each=14, length.out=nrow(df.meteoro))
# Grab Interval Names
my_interval_names <- df.meteoro %>%
select(-rain) %>%
group_by(my_group) %>%
slice(1)
# Summarise
df.meteoro %>%
group_by(my_group) %>%
summarise(rain = sum(rain)) %>%
left_join(., my_interval_names)
#> Joining, by = "my_group"
#> # A tibble: 27 x 3
#> my_group rain Dates
#> <int> <dbl> <date>
#> 1 1 3.86 2017-01-19
#> 2 2 -0.581 2017-02-02
#> 3 3 -0.876 2017-02-16
#> 4 4 1.80 2017-03-02
#> 5 5 3.79 2017-03-16
#> 6 6 -3.50 2017-03-30
#> 7 7 5.31 2017-04-13
#> 8 8 2.57 2017-04-27
#> 9 9 -1.33 2017-05-11
#> 10 10 5.41 2017-05-25
#> # ... with 17 more rows
Created on 2018-03-01 by the reprex package (v0.2.0).