I need to find the rows holding the first and last day of each month in a large data frame, so that I can apply operations to exactly one month at a time in a for loop. Unfortunately, the data frame is not very homogeneous. Here is a reproducible example to work with:
dataframe <- data.frame(Date=c(seq.Date(as.Date("2020-01-01"),as.Date("2020-01-31"),by="day"),
seq.Date(as.Date("2020-02-01"),as.Date("2020-02-28"),by="day"),seq.Date(as.Date("2020-03-02"),
as.Date("2020-03-31"),by="day")))
We can create a grouping column by converting to yearmon and then take the first and last Date in each group:
library(zoo)
library(dplyr)
dataframe %>%
group_by(yearMon = as.yearmon(Date)) %>%
summarise(FirstDay = first(Date), LastDay = last(Date))
# A tibble: 3 x 3
# yearMon FirstDay LastDay
#* <yearmon> <date> <date>
#1 Jan 2020 2020-01-01 2020-01-31
#2 Feb 2020 2020-02-01 2020-02-28
#3 Mar 2020 2020-03-02 2020-03-31
If you need the calendar first and last day of each month, irrespective of which dates are present in the data:
library(lubridate)
dataframe %>%
group_by(yearMon = as.yearmon(Date)) %>%
summarise(First = floor_date(first(Date), 'month'),
Last = ceiling_date(last(Date), 'month')-1)
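Since the question asks for row positions to drive a for loop, here is a minimal base R sketch of that part. It assumes the rows are already sorted by Date; the names month_key, row_ranges and rows_per_month are illustrative and do not come from the answer above.

```r
# Rebuild the example data from the question
dataframe <- data.frame(Date = c(
  seq.Date(as.Date("2020-01-01"), as.Date("2020-01-31"), by = "day"),
  seq.Date(as.Date("2020-02-01"), as.Date("2020-02-28"), by = "day"),
  seq.Date(as.Date("2020-03-02"), as.Date("2020-03-31"), by = "day")
))

# One key per row, e.g. "2020-01"; sorts correctly as text
month_key <- format(dataframe$Date, "%Y-%m")

# First and last row index of every month (tapply returns groups in sorted order)
row_ranges <- data.frame(
  month     = sort(unique(month_key)),
  first_row = as.vector(tapply(seq_along(month_key), month_key, min)),
  last_row  = as.vector(tapply(seq_along(month_key), month_key, max))
)

# Loop over months, touching only the rows of the current month
rows_per_month <- integer(nrow(row_ranges))
for (i in seq_len(nrow(row_ranges))) {
  rows <- row_ranges$first_row[i]:row_ranges$last_row[i]
  rows_per_month[i] <- length(dataframe$Date[rows])
}
```

Here March starts at the row for 2020-03-02 because 2020-03-01 is missing from the data, which is the "first available day" behaviour of the first snippet rather than the calendar-based one.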
I have a dataset (precipitation) with four columns. I want to summarize (in table format) the amount of rain that occurred on a monthly basis for each month in 2019 and 2020 (sum and difference between two years). I am struggling with how to summarize heaps of daily data into a monthly summary while also keeping only the records whose quality is "Good".
Columns in Dataset:
colnames(rain_file)
"ID" "deviceID" "remarks" "date" "amount_rain" "quality"
Date (The date column is formatted as follows and there are multiple readings for each date)
head(rain_file$date)
2018-01-01 2018-01-01 2018-01-01 2018-01-01 2018-01-01 2018-01-01
1096 Levels: 2018-01-01 2018-01-02 2018-01-03 2018-01-04 2018-01-05 ... 2020-12-31
Quality (5 types of Quality, I only want to filter for "Good")
head(rain_file$quality)
Good Good Good Good Good Good...
Levels: Absent Good Lost Poor Snow Trace
I have this so far but it's not correct and I'm not sure what to do next...
data=read.table("rain_file.csv", header=TRUE, sep=",", fill=T, quote="\"")
dates=apply(data,1, function(x) {strsplit(x["date"],"-")})
data=cbind(data, t(as.data.frame(dates, row.names=c("year", "month", "day"))))
m_rain_df=tapply(data$amount_rain, data[,c("year","month")], mean, na.rm=T)
data=data.table(data)
m_rain_dt=data[, list(month_rain=mean(amount_rain, na.rm=T)), by=list(year, month)]
Here is a solution using dplyr:
library(tidyverse)
## create a dummy dataset
dat <-
tibble(
date = factor(c('2018-01-01', '2018-01-02', '2018-01-03', '2018-02-01', '2018-02-02', '2018-02-03')),
quality = factor(c('Absent', 'Good', 'Good', 'Snow', 'Good', 'Good')),
amount_rain = runif(6)
)
dat %>%
## split date column in year month day
mutate(date = as.character(date)) %>%
separate(date, c("year", "month", "day"), sep = '-') %>%
## keep only good quality data
filter(quality == 'Good') %>%
## summarize by year and month
group_by(year, month) %>%
summarise(
mean_amount_rain = mean(amount_rain)
)
Which gives:
# A tibble: 2 × 3
# Groups: year [1]
year month mean_amount_rain
<chr> <chr> <dbl>
1 2018 01 0.729
2 2018 02 0.466
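The question also asks for monthly sums and the difference between 2019 and 2020, which a mean-based summary does not cover. One possible sketch with dplyr and tidyr follows; the dummy values and the output column names rain_2019, rain_2020 and difference are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Dummy data shaped like the question's columns (values are made up)
rain <- tibble(
  date = factor(c("2019-01-01", "2019-01-02", "2019-02-01",
                  "2020-01-01", "2020-01-02", "2020-02-01")),
  quality = factor(c("Good", "Good", "Good", "Good", "Poor", "Good")),
  amount_rain = c(1, 2, 3, 4, 5, 6)
)

monthly <- rain %>%
  filter(quality == "Good") %>%                      # keep good-quality readings only
  mutate(date  = as.Date(as.character(date)),        # factor -> Date
         year  = format(date, "%Y"),
         month = format(date, "%m")) %>%
  filter(year %in% c("2019", "2020")) %>%
  group_by(year, month) %>%
  summarise(total_rain = sum(amount_rain, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = total_rain,
              names_prefix = "rain_") %>%            # one column per year
  mutate(difference = rain_2020 - rain_2019)

monthly
```

With this dummy data the "Poor" reading on 2020-01-02 is dropped before summing, so January 2020 totals 4 rather than 9.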
I am trying to convert a column in my dataset that contains week numbers into weekly Dates. I was trying to use the lubridate package but could not find a solution. The dataset looks like the one below:
df <- tibble(week = c("202009", "202010", "202011","202012", "202013", "202014"),
Revenue = c(4543, 6764, 2324, 5674, 2232, 2323))
So I would like to create a Date column in a weekly format, e.g. (2020-03-07, 2020-03-14).
Would anyone know how to convert these week numbers into weekly dates?
Maybe there is a more automated way, but try something like this. I think this gets the right days; I checked against a 2020 calendar. If something is off, it's a matter of adjusting the (week - 1) * 7 - 1 component to return what you want.
This just grabs the first day of the year, adds x weeks worth of days, and then uses ceiling_date() to find the next Sunday.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
separate(week, c("year", "week"), sep = 4, convert = TRUE) %>%
mutate(date = ceiling_date(ymd(paste(year, "01", "01", sep = "-")) +
(week - 1) * 7 - 1, "week", week_start = 7))
# # A tibble: 6 x 4
# year week Revenue date
# <int> <int> <dbl> <date>
# 1 2020 9 4543 2020-03-01
# 2 2020 10 6764 2020-03-08
# 3 2020 11 2324 2020-03-15
# 4 2020 12 5674 2020-03-22
# 5 2020 13 2232 2020-03-29
# 6 2020 14 2323 2020-04-05
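The same arithmetic can be written out in base R to make each step visible. This is only a sketch: week_to_sunday is a hypothetical helper name, and it assumes that a start date already falling on a Sunday should move to the following Sunday, which is my understanding of how ceiling_date() treats Date inputs.

```r
# Hypothetical helper: "YYYYWW" string -> the Sunday after (Jan 1 + (week - 1) * 7 - 1 days)
week_to_sunday <- function(yyyyww) {
  year <- as.integer(substr(yyyyww, 1, 4))
  week <- as.integer(substr(yyyyww, 5, 6))
  # First day of the year plus the week offset used in the answer above
  start <- as.Date(paste0(year, "-01-01")) + (week - 1) * 7 - 1
  # "%w" gives the weekday as a number, 0 = Sunday ... 6 = Saturday
  days_past_sunday <- as.integer(format(start, "%w"))
  # Move forward to the next Sunday
  start + (7 - days_past_sunday)
}

sundays <- week_to_sunday(c("202009", "202010", "202014"))
sundays
```

For week 9 the intermediate date is 2020-02-25, a Tuesday, so the helper adds five days and lands on 2020-03-01. Adding or subtracting a fixed number of days from the result shifts the weekday the dates are anchored to.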
I have some data in a format like the reproducible example below (code for data input below the question, at the end). Two things:
Not all dates have a value (i.e. many dates are missing).
Some dates have multiple values, eg 16 June 2020.
#> date value
#> 1 30-Jun-20 20
#> 2 29-Jun-20 -100
#> 3 26-Jun-20 -4
#> 4 16-Jun-20 -13
#> 5 16-Jun-20 40
#> 6 9-Jun-20 -6
For two week periods, ending on Tuesdays, I would like to take a sum of the value column.
So in the example data above, I want to sum ending on:
two weeks ending on Tuesday 16 June 2020 (i.e. from 3 June 2020 - 16 June 2020, inclusive)
two weeks ending on Tuesday 30 June 2020 (17 June 2020 - 30 June 2020 inclusive)
I'd ultimately like the code to continue summing all two week periods ending on every second Tuesday for when there's more data.
So my desired output is:
#2_weeks_end total
#30-Jun-20 -84
#16-Jun-20 21
Tidyverse and lubridate solutions would be my first preference.
Code for data input below:
df <- data.frame(
stringsAsFactors = FALSE,
date = c("30-Jun-20","29-Jun-20",
"26-Jun-20","16-Jun-20","16-Jun-20","9-Jun-20"),
value = c(20L, -100L, -4L, -13L, 40L, -6L)
)
df
Solution using findInterval().
library(dplyr)      # for %>%, mutate(), group_by(), summarise()
library(lubridate)  # for dmy()
df$date <- dmy(df$date)
df_intervals <- seq(as.Date("2020-06-03"), as.Date("2020-06-03")+14*3, 14)
df %>%
mutate(interval = findInterval(date, df_intervals)) %>%
mutate(`2_weeks_end` = df_intervals[interval+1]-1) %>%
group_by(`2_weeks_end`) %>%
summarise(total= sum(value))
Returns:
# A tibble: 2 x 2
2_weeks_end total
<date> <int>
1 2020-06-16 21
2 2020-06-30 -84
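To keep summing as more data arrives without extending a hard-coded sequence of breaks, the period end can be computed directly from one known period-ending Tuesday. The sketch below uses base R; the names anchor, periods_ahead, period_end and totals are illustrative, and the dates are assumed to be parsed already.

```r
# Example data from the question, with dates already parsed
df <- data.frame(
  date  = as.Date(c("2020-06-30", "2020-06-29", "2020-06-26",
                    "2020-06-16", "2020-06-16", "2020-06-09")),
  value = c(20L, -100L, -4L, -13L, 40L, -6L)
)

# Assumption: any Tuesday that ends a two-week period works as the anchor
anchor <- as.Date("2020-06-16")

# Number of 14-day periods between the anchor and each date, rounded up
# so that every date maps to the period end on or after it
periods_ahead <- ceiling(as.numeric(df$date - anchor) / 14)
period_end <- anchor + 14 * periods_ahead

# Sum per period; names of the result are the period-end dates
totals <- tapply(df$value, format(period_end), sum)
result <- data.frame(two_weeks_end = as.Date(names(totals)),
                     total = as.vector(totals))
result
```

Dates before the anchor get zero or negative period counts, so earlier data falls into earlier two-week periods without any further changes.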
Here is an option if you like weekly or any other unit that is in lubridate by default:
library(dplyr)
library(lubridate)
df%>%
mutate(date = as.Date(date, format = "%d-%b-%y"))%>%
group_by(week_ceil = ceiling_date(date - 1L, unit = "week", week_start = 2L))%>%
summarize(sums = sum(value))
Here is a data.table approach that creates a reference table followed by a non-equi join:
library(data.table)
library(lubridate)  # for floor_date()
setDT(df)
df[, date := as.Date(date, format = "%d-%b-%y")]
ref_dt = df[, .(beg_date = seq.Date(from = floor_date(min(date), unit = "week", week_start = 3L),
to = max(date),
by = "2 weeks"))]
ref_dt[, end_date := beg_date +13L]
df[ref_dt,
on = .(date >= beg_date,  # inclusive, so a value on the first day of a period is counted
date <= end_date),
sum(value),
by = .EACHI]
## date date V1
##1: 2020-06-03 2020-06-16 21
##2: 2020-06-17 2020-06-30 -84
I have a data.frame that doesn't account for leap years (i.e. every year has 365 days). I would like to repeat the value of the last day of February in each leap year. In the code below, DF is a fake data set from which I intentionally removed the leap day to create DF_NoLeapday. I would like to add a leap-day row to DF_NoLeapday by repeating the value of February 28 of the leap year (in this example, the Feb 28, 2004 value). I would prefer a general solution that can be applied to many years of data.
set.seed(55)
DF <- data.frame(date = seq(as.Date("2003-01-01"), to= as.Date("2005-12-31"), by="day"),
A = runif(1096, 0,10),
Z = runif(1096,5,15))
DF_NoLeapday <- DF[!(format(DF$date,"%m") == "02" & format(DF$date, "%d") == "29"), ,drop = FALSE]
We can use complete on the 'date' column which is already a Date class to expand the rows to fill in the missing dates
library(dplyr)
library(tidyr)
out <- DF_NoLeapday %>%
complete(date = seq(min(date), max(date), by = '1 day'))
dim(out)
#[1] 1096 3
out %>%
filter(date >= '2004-02-28', date <= '2004-03-01')
# A tibble: 3 x 3
# date A Z
# <date> <dbl> <dbl>
#1 2004-02-28 9.06 9.70
#2 2004-02-29 NA NA
#3 2004-03-01 5.30 7.35
By default, the other columns are filled with NA for the new rows. To use a different constant instead, pass the fill argument of complete.
To carry the previous values forward instead, use tidyr's fill() function:
out <- out %>%
fill(A, Z)
out %>%
filter(date >= '2004-02-28', date <= '2004-03-01')
# A tibble: 3 x 3
# date A Z
# <date> <dbl> <dbl>
#1 2004-02-28 9.06 9.70
#2 2004-02-29 9.06 9.70
#3 2004-03-01 5.30 7.35
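If the data may contain other gaps or NA values that should stay untouched, a narrower alternative is to copy only the Feb 28 rows. The base R sketch below is one way to do that under that assumption; the names feb28, leap_rows and filled are illustrative.

```r
set.seed(55)
DF <- data.frame(date = seq(as.Date("2003-01-01"), as.Date("2005-12-31"), by = "day"),
                 A = runif(1096, 0, 10),
                 Z = runif(1096, 5, 15))
DF_NoLeapday <- DF[format(DF$date, "%m-%d") != "02-29", ]

# Take every Feb 28 row and shift its date by one day
feb28 <- DF_NoLeapday[format(DF_NoLeapday$date, "%m-%d") == "02-28", ]
feb28$date <- feb28$date + 1

# Only in leap years is the day after Feb 28 a Feb 29; drop the others
leap_rows <- feb28[format(feb28$date, "%m-%d") == "02-29", ]

# Append the new rows and restore chronological order
filled <- rbind(DF_NoLeapday, leap_rows)
filled <- filled[order(filled$date), ]

filled[filled$date >= as.Date("2004-02-28") & filled$date <= as.Date("2004-03-01"), ]
```

Because the leap-year test is done on the shifted date itself, the same code applies to any number of years without listing leap years by hand.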
I've got a data set with reservation data in the format below:
property <- c('casa1', 'casa2', 'casa3')
check_in <- as.Date(c('2018-01-01', '2018-01-30','2018-02-28'))
check_out <- as.Date(c('2018-01-02', '2018-02-03', '2018-03-02'))
total_paid <- c(100,110,120)
df <- data.frame(property,check_in,check_out, total_paid)
My goal is to have the monthly total_paid amount divided by days and assigned to each month correctly for budget reasons.
While there's no issue for casa1, casa2 and casa3 have days reserved in both months and the totals get skewed because of this issue.
Any help much appreciated!
Here you go:
library(dplyr)
library(tidyr)
df %>%
mutate(id = seq_along(property), # helper variables
day_paid = total_paid / as.numeric(check_out - check_in),
date = check_in) %>%
group_by(id) %>%
complete(date = seq.Date(check_in, (check_out - 1), by = "day")) %>% # get date for each day of stay (except last)
ungroup() %>% # make one row per day of stay
mutate(month = cut(date, breaks = "month")) %>% # determine month of date
fill(property, check_in, check_out, total_paid, day_paid) %>%
group_by(id, month) %>%
summarise(property = unique(property),
check_in = unique(check_in),
check_out = unique(check_out),
total_paid = unique(total_paid),
paid_month = sum(day_paid)) # summarise per month
result:
# A tibble: 5 x 7
# Groups: id [3]
id month property check_in check_out total_paid paid_month
<int> <fct> <fct> <date> <date> <dbl> <dbl>
1 1 2018-01-01 casa1 2018-01-01 2018-01-02 100 100
2 2 2018-01-01 casa2 2018-01-30 2018-02-03 110 55
3 2 2018-02-01 casa2 2018-01-30 2018-02-03 110 55
4 3 2018-02-01 casa3 2018-02-28 2018-03-02 120 60
5 3 2018-03-01 casa3 2018-02-28 2018-03-02 120 60
I hope it's somewhat readable but please ask if there is something I should explain. Convention is that people don't pay the last day of a stay, so I took that into account.
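The allocation can also be spelled out in base R, which makes it easy to check that the monthly amounts add back up to total_paid. This is a sketch under the same convention (the check-out day is not paid); the names nights, per_night and monthly are illustrative.

```r
df <- data.frame(
  property   = c("casa1", "casa2", "casa3"),
  check_in   = as.Date(c("2018-01-01", "2018-01-30", "2018-02-28")),
  check_out  = as.Date(c("2018-01-02", "2018-02-03", "2018-03-02")),
  total_paid = c(100, 110, 120)
)

# Paid nights per reservation (check-out day excluded)
nights <- as.integer(df$check_out - df$check_in)

# One row per paid night, each carrying an equal share of the total
per_night <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(property = df$property[i],
             night    = df$check_in[i] + seq_len(nights[i]) - 1,
             paid     = df$total_paid[i] / nights[i])
}))

# Assign each night to its calendar month and sum
per_night$month <- format(per_night$night, "%Y-%m")
monthly <- aggregate(paid ~ property + month, data = per_night, FUN = sum)

# Sanity check: monthly shares per property add back up to total_paid
check <- tapply(monthly$paid, monthly$property, sum)
check
```

For casa2 the four paid nights (Jan 30, Jan 31, Feb 1, Feb 2) cost 27.5 each, giving 55 for January and 55 for February, in line with the result table above.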