I have a data set of reservations in the following format:
property <- c('casa1', 'casa2', 'casa3')
check_in <- as.Date(c('2018-01-01', '2018-01-30','2018-02-28'))
check_out <- as.Date(c('2018-01-02', '2018-02-03', '2018-03-02'))
total_paid <- c(100,110,120)
df <- data.frame(property,check_in,check_out, total_paid)
My goal is to split each reservation's total_paid evenly across the days of the stay and assign the daily amounts to the correct month, for budgeting purposes.
There is no issue for casa1, but casa2 and casa3 have reserved days in two different months, which skews the monthly totals.
Any help much appreciated!
Here you go:
library(dplyr)
library(tidyr)
df %>%
mutate(id = seq_along(property), # add a few helper variables
day_paid = total_paid / as.numeric(check_out - check_in),
date = check_in) %>%
group_by(id) %>%
complete(date = seq.Date(check_in, (check_out - 1), by = "day")) %>% # get date for each day of stay (except last)
ungroup() %>% # make one row per day of stay
mutate(month = cut(date, breaks = "month")) %>% # determine month of date
fill(property, check_in, check_out, total_paid, day_paid) %>%
group_by(id, month) %>%
summarise(property = unique(property),
check_in = unique(check_in),
check_out = unique(check_out),
total_paid = unique(total_paid),
paid_month = sum(day_paid)) # summarise per month
result:
# A tibble: 5 x 7
# Groups: id [3]
id month property check_in check_out total_paid paid_month
<int> <fct> <fct> <date> <date> <dbl> <dbl>
1 1 2018-01-01 casa1 2018-01-01 2018-01-02 100 100
2 2 2018-01-01 casa2 2018-01-30 2018-02-03 110 55
3 2 2018-02-01 casa2 2018-01-30 2018-02-03 110 55
4 3 2018-02-01 casa3 2018-02-28 2018-03-02 120 60
5 3 2018-03-01 casa3 2018-02-28 2018-03-02 120 60
I hope it's reasonably readable, but please ask if there is something I should explain. By convention, guests don't pay for the last day of a stay (the check-out day), so I took that into account.
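To sanity-check the per-month split by hand, here is a minimal base-R sketch of the same idea. It assumes, like the answer, that the check-out day is not charged. `split_by_month` is a hypothetical helper name made up for this illustration, not a dplyr or tidyr function.

```r
# Hypothetical helper (not part of the answer above): split one stay's
# payment across the calendar months it touches.
split_by_month <- function(check_in, check_out, total_paid) {
  # One entry per paid night; the check-out day itself is not charged.
  nights <- seq(check_in, check_out - 1, by = "day")
  per_night <- total_paid / length(nights)
  # Count the nights in each calendar month and scale by the nightly rate.
  nights_per_month <- table(format(nights, "%Y-%m"))
  data.frame(
    month = names(nights_per_month),
    paid_month = as.numeric(nights_per_month) * per_night
  )
}

# casa2 from the question: 4 paid nights, 2 in January and 2 in February.
casa2 <- split_by_month(as.Date("2018-01-30"), as.Date("2018-02-03"), 110)
print(casa2)
```

For casa2 this gives 55 for each month, which agrees with the tibble above, and the monthly amounts add back up to total_paid.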
I have weather data to summarize across different date intervals.
Here is the weather data:
library(dplyr)
rainfall_data <- read.csv(text = "
date,rainfall_daily_mm
01/01/2019,0
01/02/2019,1
01/03/2019,3
01/04/2019,45
01/05/2019,0
01/06/2019,0
01/07/2019,0
01/08/2019,43
01/09/2019,5
01/10/2019,0
01/11/2019,55
01/12/2019,6
01/13/2019,0
01/14/2019,7
01/15/2019,0
01/16/2019,7
01/17/2019,8
01/18/2019,89
01/19/2019,65
01/20/2019,3
01/21/2019,0
01/22/2019,0
01/23/2019,2
01/24/2019,0
01/25/2019,0
01/26/2019,0
01/27/2019,0
01/28/2019,22
01/29/2019,3
01/30/2019,0
01/31/2019,0
") %>%
mutate(date = as.Date(date, format = "%d/%m/%Y"))
And here are the date intervals I need summaries for from the weather file:
intervals <- read.csv(text= "
treatment,initial,final
A,01/01/2019,01/05/2019
B,01/13/2019,01/20/2019
C,01/12/2019,01/26/2019
D,01/30/2019,01/31/2019
E,01/11/2019,01/23/2019
F,01/03/2019,01/19/2019
G,01/01/2019,01/24/2019
H,01/26/2019,01/28/2019
") %>%
mutate(initial = as.Date(initial, format = "%d/%m/%Y"),
final = as.Date(final, format = "%d/%m/%Y"))
The expected outcome is this one:
This is what I've tried based on a similar question:
summary_by_date_interval <- rainfall_data %>%
mutate(group = cumsum(grepl(intervals$initial|intervals$final, date))) %>%
group_by(group) %>%
summarise(rainfall = sum(rainfall_daily_mm))
And this is the error I got:
Error in `mutate()`:
! Problem while computing `group = cumsum(grepl(intervals$initial |
intervals$final, date))`.
Caused by error in `Ops.Date()`:
! | not defined for "Date" objects
Run `rlang::last_error()` to see where the error occurred.
Any help will be really appreciated.
First, %d/%m/%Y needs to be %m/%d/%Y (otherwise you'll get wrong dates and many NAs).
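A small illustration of why the format string matters, using two dates in the same style as the question's data (the variable names are just for this sketch):

```r
x <- c("01/05/2019", "01/13/2019")

# Day-first format: the first value is read as 1 May, and the second
# fails to parse because there is no month 13.
day_first <- as.Date(x, format = "%d/%m/%Y")

# Month-first format: both values are read as January dates.
month_first <- as.Date(x, format = "%m/%d/%Y")

print(day_first)
print(month_first)
```

The day-first parse silently turns 5 January into 1 May and returns NA for 13 January, so the problem is easy to miss until you look at the parsed column.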
Then you could, for example, use lubridate's interval() and %within%:
library(dplyr)
library(lubridate)
intervals |>
group_by(treatment) |>
mutate(test = sum(rainfall_data$rainfall_daily_mm[rainfall_data$date %within% interval(initial, final)])) |>
ungroup()
Output:
# A tibble: 8 × 4
treatment initial final test
<chr> <date> <date> <int>
1 A 2019-01-01 2019-01-05 49
2 B 2019-01-13 2019-01-20 179
3 C 2019-01-12 2019-01-26 187
4 D 2019-01-30 2019-01-31 0
5 E 2019-01-11 2019-01-23 242
6 F 2019-01-03 2019-01-19 333
7 G 2019-01-01 2019-01-24 339
8 H 2019-01-26 2019-01-28 22
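If you would rather avoid the lubridate dependency, the same per-interval sum can be sketched in base R with plain date comparisons. This is only an illustration under assumptions: it uses the first five days of the question's rainfall data, and treatment "X" is a made-up interval added to show a second window.

```r
rainfall <- data.frame(
  date = seq(as.Date("2019-01-01"), as.Date("2019-01-05"), by = "day"),
  mm = c(0, 1, 3, 45, 0)
)
windows <- data.frame(
  treatment = c("A", "X"),  # "X" is hypothetical
  initial = as.Date(c("2019-01-01", "2019-01-03")),
  final = as.Date(c("2019-01-05", "2019-01-04"))
)

# For each interval, sum the rainfall on days between initial and final,
# both ends included.
windows$total_mm <- sapply(seq_len(nrow(windows)), function(i) {
  in_window <- rainfall$date >= windows$initial[i] &
    rainfall$date <= windows$final[i]
  sum(rainfall$mm[in_window])
})
print(windows)
```

Treatment A comes out at 49, the same as in the output above. Looping over row indices rather than over the date columns keeps the Date class intact inside the function.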
I have a dataset (precipitation) and want to summarize, in table format, the amount of rain that fell in each month of 2019 and 2020 (the sum for each year and the difference between the two years). I am struggling to summarize heaps of daily data into a monthly summary AND to keep only the readings whose quality is "Good".
Columns in Dataset:
colnames(rain_file)
"ID" "deviceID" "remarks" "date" "amount_rain" "quality"
Date (The date column is formatted as follows and there are multiple readings for each date)
head(rain_file$date)
2018-01-01 2018-01-01 2018-01-01 2018-01-01 2018-01-01 2018-01-01
1096 Levels: 2018-01-01 2018-01-02 2018-01-03 2018-01-04 2018-01-05 ... 2020-12-31
Quality (5 types of Quality, I only want to filter for "Good")
head(rain_file$quality)
Good Good Good Good Good Good...
Levels: Absent Good Lost Poor Snow Trace
I have this so far but it's not correct and I'm not sure what to do next...
data=read.table("rain_file.csv", header=TRUE, sep=",", fill=T, quote="\"")
dates=apply(data,1, function(x) {strsplit(x["date"],"-")})
data=cbind(data, t(as.data.frame(dates, row.names=c("year", "month", "day"))))
m_rain_df=tapply(data$amount_rain, data[,c("year","month")], mean, na.rm=T)
data=data.table(data)
m_rain_dt=data[, list(month_rain=mean(amount_rain, na.rm=T)), by=list(year, month)]
Here is a solution using dplyr:
library(tidyverse)
## create a dummy dataset
dat <-
tibble(
date = factor(c('2018-01-01', '2018-01-02', '2018-01-03', '2018-02-01', '2018-02-02', '2018-02-03')),
quality = factor(c('Absent', 'Good', 'Good', 'Snow', 'Good', 'Good')),
amount_rain = runif(6)
)
dat %>%
## split date column in year month day
mutate(date = as.character(date)) %>%
separate(date, c("year", "month", "day"), sep = '-') %>%
## keep only good quality data
filter(quality == 'Good') %>%
## summarize by year and month
group_by(year, month) %>%
summarise(
mean_amount_rain = mean(amount_rain)
)
Which gives:
# A tibble: 2 × 3
# Groups: year [1]
year month mean_amount_rain
<chr> <chr> <dbl>
1 2018 01 0.729
2 2018 02 0.466
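The question also asks for monthly sums in two years and the difference between them. A possible base-R sketch of that part is below. The column names follow the question (date, quality, amount_rain), but the six rows of data are invented purely for illustration.

```r
# Made-up readings; only the column names come from the question.
rain <- data.frame(
  date = factor(c("2019-01-01", "2019-01-02", "2019-02-01",
                  "2020-01-01", "2020-01-02", "2020-02-01")),
  quality = factor(c("Good", "Poor", "Good", "Good", "Absent", "Good")),
  amount_rain = c(1, 2, 5, 4, 8, 3)
)

# Keep only "Good" readings, then derive year and month from the date.
good <- rain[rain$quality == "Good", ]
good$date <- as.Date(as.character(good$date))
good$year <- format(good$date, "%Y")
good$month <- format(good$date, "%m")

# Sum per month (rows) and year (columns).
monthly <- tapply(good$amount_rain, list(good$month, good$year), sum)

summary_table <- data.frame(
  month = rownames(monthly),
  sum_2019 = monthly[, "2019"],
  sum_2020 = monthly[, "2020"],
  difference = monthly[, "2020"] - monthly[, "2019"]
)
print(summary_table)
```

A month with no "Good" readings in one of the years would show up as NA in this table, so decide whether that should count as zero or as missing before taking the difference.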
I have some IDs with timestamps that are exhibiting behavior that is out of my control. This is causing new rows for the same ID, but the end_time for the first row of a given ID matches the start_time of the next row for that same ID. An example looks like this:
df <- data.frame(id = c("1", "1", "1", "2", "3", "3"),
start_time = c("7/4/2020 10:06:27", "7/16/2020 07:16:44", "7/16/2020 07:20:32", "7/9/2020 03:27:37", "7/4/2020 02:01:49", "7/9/2020 00:00:00"),
end_time = c("7/16/2020 07:16:44", "7/16/2020 07:20:32", "7/25/2020 18:17:46", "7/21/2020 20:13:16", "7/5/2020 09:17:54", "7/11/2020 15:43:22"))
> df
id start_time end_time
1 1 7/4/2020 10:06:27 7/16/2020 07:16:44
2 1 7/16/2020 07:16:44 7/16/2020 07:20:32
3 1 7/16/2020 07:20:32 7/25/2020 18:17:46
4 2 7/9/2020 03:27:37 7/21/2020 20:13:16
5 3 7/4/2020 02:01:49 7/5/2020 09:17:54
6 3 7/9/2020 00:00:00 7/11/2020 15:43:22
But I want to 'stitch' together the timestamps for id = 1 (and any other rows exhibiting this behavior) to form one distinct row, as below. Note how the end_time and start_time entries for this id match up exactly.
> df
id start_time end_time
1 1 7/4/2020 10:06:27 7/25/2020 18:17:46
2 2 7/9/2020 03:27:37 7/21/2020 20:13:16
3 3 7/4/2020 02:01:49 7/5/2020 09:17:54
4 3 7/9/2020 00:00:00 7/11/2020 15:43:22
The entries for id = 3 are fine as they are, because the end_time of the first row differs from the start_time of the second row, so the desired behavior is for those rows to be left unchanged.
Here's a dplyr solution:
library(dplyr)
df %>%
mutate_at(vars(start_time, end_time), ~ as.POSIXct(., format = "%m/%d/%Y %T")) %>%
arrange(id, start_time) %>%
group_by(id) %>%
mutate(grp = cumsum(start_time != dplyr::lag(end_time, default=end_time[1]))) %>%
group_by(id, grp) %>%
summarize(start_time = start_time[1], end_time = end_time[n()]) %>%
ungroup() %>%
select(-grp)
# # A tibble: 4 x 3
# id start_time end_time
# <chr> <dttm> <dttm>
# 1 1 2020-07-04 10:06:27 2020-07-25 18:17:46
# 2 2 2020-07-09 03:27:37 2020-07-21 20:13:16
# 3 3 2020-07-04 02:01:49 2020-07-05 09:17:54
# 4 3 2020-07-09 00:00:00 2020-07-11 15:43:22
Edited to add as.POSIXct. One comment suggested sorting by start_time so that each row is compared with the right neighbour, but since the timestamps were originally strings, there was no guarantee that they would sort chronologically. For instance, sort(c("7/31/2020", "12/01/2020")) puts December before July. That is unlikely to matter for this data, but converting first is still a good safeguard.
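A short demonstration of the sorting point, as a sketch. The two timestamps are the dates from the paragraph above with a midnight time added, and tz = "UTC" is an assumption made here only to keep the example reproducible.

```r
stamps <- c("7/31/2020 00:00:00", "12/01/2020 00:00:00")

# As strings, "1" sorts before "7", so December lands ahead of July.
sorted_strings <- sort(stamps)

# Parsed as date-times, the order is chronological.
parsed <- as.POSIXct(stamps, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
sorted_times <- sort(parsed)

print(sorted_strings)
print(sorted_times)
```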
I'm running into an RStudio data issue regarding properly melting data. It currently is in the following form:
Campaign, ID, Start Date, End Date, Total Number of Days, Total Spend, Total Impressions, Total Conversions
I would like my data to look like the following:
Campaign, ID, Date, Spend, Impressions, Conversions
Each 'date' should contain a specific day the campaign was run while spend, impressions, and conversions should equal Total Spend / Total # of Days, Total Impressions / Total # of Days, and Total Conversions / Total # of Days, respectively.
I'm working in RStudio so a solution in R is needed. Does anyone have experience manipulating data like this?
This works, but it's not particularly efficient. If your data is millions of rows or more, I've had better luck using SQL and inequality joins.
library(tidyverse)
#create some bogus data
data <- data.frame(ID = 1:10,
StartDate = sample(seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day"), 10),
Total = runif(10)) %>%
mutate(EndDate = StartDate + floor(runif(10) * 14))
#generate all dates between the min and max in the dataset
AllDates = data.frame(Date = seq.Date(min(data$StartDate), max(data$EndDate), "day"),
Dummy = TRUE)
#join via a dummy variable to add rows for all dates to every ID
data %>%
mutate(Dummy = TRUE) %>%
inner_join(AllDates, by = c("Dummy" = "Dummy")) %>%
#filter to just the dates between the start and end
filter(Date >= StartDate, Date <= EndDate) %>%
#divide the total by the number of days
group_by(ID) %>%
mutate(TotalPerDay = Total / n()) %>%
select(ID, Date, TotalPerDay)
# A tibble: 91 x 3
# Groups: ID [10]
ID Date TotalPerDay
<int> <date> <dbl>
1 1 2018-06-21 0.00863
2 1 2018-06-22 0.00863
3 1 2018-06-23 0.00863
4 1 2018-06-24 0.00863
5 1 2018-06-25 0.00863
6 1 2018-06-26 0.00863
7 1 2018-06-27 0.00863
8 1 2018-06-28 0.00863
9 1 2018-06-29 0.00863
10 1 2018-06-30 0.00863
# ... with 81 more rows
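If the cross join gets too large, one alternative is to expand each row into its own date sequence, so that no unneeded ID/date pairs are ever created. A minimal base-R sketch with two made-up campaigns (all values are invented; the column names mirror the bogus data above):

```r
campaigns <- data.frame(
  ID = c(1, 2),
  StartDate = as.Date(c("2018-01-01", "2018-01-10")),
  EndDate = as.Date(c("2018-01-04", "2018-01-11")),
  Total = c(100, 30)
)

# Build one small data frame per campaign, then stack them.
daily <- do.call(rbind, lapply(seq_len(nrow(campaigns)), function(i) {
  days <- seq(campaigns$StartDate[i], campaigns$EndDate[i], by = "day")
  data.frame(
    ID = campaigns$ID[i],
    Date = days,
    # Spread the total evenly over every day of the run, both ends included.
    TotalPerDay = campaigns$Total[i] / length(days)
  )
}))
print(daily)
```

Campaign 1 runs for four days at 25 per day and campaign 2 for two days at 15 per day. The same division would apply to impressions and conversions.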
Suppose I have a daily rain data.frame like this:
df.meteoro = data.frame(Dates = seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"),
rain = rnorm(length(seq(as.Date("2017/1/19"), as.Date("2018/1/18"), "days"))))
I'm trying to sum the accumulated rain over 14-day intervals with this code:
library(tidyverse)
library(lubridate)
df.rain <- df.meteoro %>%
mutate(TwoWeeks = round_date(Dates, "14 days")) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
The problem is that it isn't starting on 2017-01-19 but on 2017-01-15 and I was expecting my output dates to be:
"2017-02-02" "2017-02-16" "2017-03-02" "2017-03-16" "2017-03-30" "2017-04-13"
"2017-04-27" "2017-05-11" "2017-05-25" "2017-06-08" "2017-06-22" "2017-07-06" "2017-07-20"
"2017-08-03" "2017-08-17" "2017-08-31" "2017-09-14" "2017-09-28" "2017-10-12" "2017-10-26"
"2017-11-09" "2017-11-23" "2017-12-07" "2017-12-21" "2018-01-04" "2018-01-18"
TL;DR I have a year-long daily rain data.frame and want to sum the accumulated rain for the dates above.
Please help.
Use of round_date in the way you have shown it will not give you 14-day periods as you might expect. I have taken a different approach in this solution and generated a sequence of dates between your first and last dates and grouped these into 14-day periods then joined the dates to your observations.
startdate = min(df.meteoro$Dates)
enddate = max(df.meteoro$Dates)
dateseq =
data.frame(Dates = seq.Date(startdate, enddate, by = 1)) %>%
mutate(group = as.numeric(Dates - startdate) %/% 14) %>%
group_by(group) %>%
mutate(starts = min(Dates))
df.rain <- df.meteoro %>%
right_join(dateseq) %>%
group_by(starts) %>%
summarise(sum_rain = sum(rain))
head(df.rain)
# A tibble: 6 x 2
starts sum_rain
<date> <dbl>
1 2017-01-19 6.09
2 2017-02-02 5.55
3 2017-02-16 -3.40
4 2017-03-02 2.55
5 2017-03-16 -0.12
6 2017-03-30 8.95
Using a right-join to the date sequence is to ensure that if there are missing observation days that spanned a complete time period you'd still get that period listed in the result (though in your case you have a complete year of dates anyway).
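round_date rounds to the nearest multiple of unit (here, 14 days) counted from a fixed reference point, not from your first observation. Judging by your result, lubridate appears to count day multiples from the first of each month, so the boundaries fall on the 1st, 15th and 29th; that is why your first period lands on 2017-01-15, and it doesn't line up with your purpose.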
To get what you want, you can do the following:
df.rain = df.meteoro %>%
mutate(days_since_start = as.numeric(Dates - as.Date("2017/1/18")),
TwoWeeks = as.Date("2017/1/18") + 14*ceiling(days_since_start/14)) %>%
group_by(TwoWeeks) %>%
summarise(sum_rain = sum(rain))
This computes days_since_start as the days since 2017/1/18 and then manually rounds to the next multiple of two weeks.
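To see what the ceiling arithmetic does, here it is applied to four hand-picked dates in plain base R. This is only a worked illustration of the formula above, not extra functionality.

```r
anchor <- as.Date("2017-01-18")
dates <- as.Date(c("2017-01-19", "2017-02-01", "2017-02-02", "2017-02-15"))

# Days elapsed since the anchor: 1, 14, 15, 28.
days_since_start <- as.numeric(dates - anchor)

# Round up to the next multiple of 14 days and convert back to a date.
two_weeks <- anchor + 14 * ceiling(days_since_start / 14)

print(data.frame(dates, days_since_start, two_weeks))
```

With this anchor each group is labelled by the last day of its 14-day window, so 2017-01-19 through 2017-02-01 all map to 2017-02-01. Shift the anchor if you want the labels to fall on different dates.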
Assuming you want to round to the closest of the dates you have specified, I think the following will work (it needs lubridate for ymd() and interval(), and plyr for ddply()):
targetDates<-seq(ymd("2017-02-02"),ymd("2018-01-18"),by='14 days')
df.meteoro$Dates=targetDates[sapply(df.meteoro$Dates,function(x) which.min(abs(interval(targetDates,x))))]
sum_rain=ddply(df.meteoro,.(Dates),summarize,sum_rain=sum(rain,na.rm=T))
As you can see, not all dates get the same number of observations. Date "2017-02-02", for instance, collects all the records from "2017-01-19" to "2017-02-09", which is 22 records. From "2017-02-10" on, dates are rounded to "2017-02-16", and so on.
This may be a cheat, but assuming each row/observation is a separate day, why not just group every 14 rows and sum?
# Assign interval groups, each 14 rows
df.meteoro$my_group <-rep(1:100, each=14, length.out=nrow(df.meteoro))
# Grab Interval Names
my_interval_names <- df.meteoro %>%
select(-rain) %>%
group_by(my_group) %>%
slice(1)
# Summarise
df.meteoro %>%
group_by(my_group) %>%
summarise(rain = sum(rain)) %>%
left_join(., my_interval_names)
#> Joining, by = "my_group"
#> # A tibble: 27 x 3
#> my_group rain Dates
#> <int> <dbl> <date>
#> 1 1 3.86 2017-01-19
#> 2 2 -0.581 2017-02-02
#> 3 3 -0.876 2017-02-16
#> 4 4 1.80 2017-03-02
#> 5 5 3.79 2017-03-16
#> 6 6 -3.50 2017-03-30
#> 7 7 5.31 2017-04-13
#> 8 8 2.57 2017-04-27
#> 9 9 -1.33 2017-05-11
#> 10 10 5.41 2017-05-25
#> # ... with 17 more rows
Created on 2018-03-01 by the reprex package (v0.2.0).
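Because this approach counts rows rather than dates, it is worth checking the "one row per day, no gaps" assumption before relying on it. A small base-R sketch using the question's date range:

```r
dates <- seq(as.Date("2017-01-19"), as.Date("2018-01-18"), by = "day")

# TRUE only if every row is exactly one day after the previous one.
no_gaps <- all(as.numeric(diff(dates)) == 1)

# Same grouping as above: a new group id every 14 rows.
groups <- rep(1:100, each = 14, length.out = length(dates))
group_sizes <- table(groups)

print(no_gaps)
print(group_sizes[c("1", "27")])
```

The range holds 365 days, so groups 1 to 26 contain 14 days each and group 27 contains only the final day. Keep that in mind when comparing its sum with the others.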