How to calculate the total amount depending on the date in R?

I am new to R and I have problems with calculating the amount of bill for each month. I have the dataframe as below:
dat <- data.frame(
time = factor(c("Breakfast","Breakfast","Breakfast","Breakfast","Breakfast","Breakfast"), levels=c("Breakfast")), date=c("2020-01-20","2020-01-21","2020-01-22","2020-02-10","2020-02-11","2020-02-12"),
total_bill = c(12.7557,14.8,17.23,15.7,16.9,13.2)
)
My goal is to calculate the amount spent on Breakfast for each month. Here there are two months, and I want the total for January and for February separately.
Any help for this would be much appreciated. Thank you!

Does this answer your question?
sums <- tapply(dat$total_bill, format(as.Date(dat$date), "%B"), sum)
February January
45.8000 44.7857
sums is a named one-dimensional array (not a list), so to access, for example, the value for February, you can index it by position (or by name, sums["February"]):
sums[1]
February
45.8
Alternatively, you can convert sums into a dataframe and access the monthly sums via the month names:
sums <- as.data.frame.list(tapply(dat$total_bill, format(as.Date(dat$date), "%B"), sum))
sums$February
45.8
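Because format(..., "%B") returns month names in the current locale, looking up a month by its English name may fail on a non-English system. A minimal sketch that avoids this, assuming the dat data frame from the question (repeated here so the snippet is self-contained), builds the labels from R's built-in month.name constant. The variable name month_label is introduced only for illustration.

```r
# Data from the question (only the columns needed here)
dat <- data.frame(
  date = c("2020-01-20", "2020-01-21", "2020-01-22",
           "2020-02-10", "2020-02-11", "2020-02-12"),
  total_bill = c(12.7557, 14.8, 17.23, 15.7, 16.9, 13.2)
)

# POSIXlt stores the month as 0-11, so add 1 to index into month.name
month_label <- month.name[as.POSIXlt(as.Date(dat$date))$mon + 1]

# Named 1-d array of monthly sums; names are always English
sums <- tapply(dat$total_bill, month_label, sum)

# Access by name and drop the array attributes
feb_total <- as.numeric(sums["February"])
feb_total
```

Indexing by name rather than by position also keeps the lookup stable when more months are added to the data.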
Addition:
Another (fun) solution is via regex: you define the dates as a pattern and, using sub plus backreference \\1 to recall the two numbers between the dashes, reduce them to the months part:
tapply(dat$total_bill, sub("\\d{4}-(\\d{2})-\\d{2}", "\\1", dat$date), sum)
01 02
44.7857 45.8000
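Grouping by month name or month number alone merges the same month from different years, and month names sort alphabetically rather than chronologically. A small base R sketch that groups on a "year-month" key instead is shown here. It assumes the dat data frame from the question, and the column name year_month is made up for this example.

```r
# Data from the question (only the columns needed here)
dat <- data.frame(
  date = c("2020-01-20", "2020-01-21", "2020-01-22",
           "2020-02-10", "2020-02-11", "2020-02-12"),
  total_bill = c(12.7557, 14.8, 17.23, 15.7, 16.9, 13.2)
)

# "YYYY-MM" keys sort chronologically and keep years apart
dat$year_month <- format(as.Date(dat$date), "%Y-%m")

# One row per year-month with the summed bill
monthly <- aggregate(total_bill ~ year_month, data = dat, FUN = sum)
monthly
```

The result is an ordinary data frame, so a single month can be picked out with monthly[monthly$year_month == "2020-02", ].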

We can convert the 'date' to Date class, get the month, and use that as grouping column and sum the 'total_bill'
library(dplyr)
dat %>%
group_by(time, Month = format(as.Date(date), "%B")) %>%
summarise(total_bill = sum(total_bill, na.rm = TRUE))
# A tibble: 2 x 3
# Groups: time [1]
# time Month total_bill
# <fct> <chr> <dbl>
#1 Breakfast February 45.8
#2 Breakfast January 44.8
We can convert it to 'wide' format, if that is needed
library(tidyr)
out <- dat %>%
group_by(time, Month = format(as.Date(date), "%B")) %>%
summarise(total_bill = sum(total_bill, na.rm = TRUE)) %>%
pivot_wider(names_from = Month, values_from = total_bill)
out
# A tibble: 1 x 3
# Groups: time [1]
# time February January
# <fct> <dbl> <dbl>
# 1 Breakfast 45.8 44.8
If we also need to group by 'year'
out <- dat %>%
mutate(date = as.Date(date)) %>%
group_by(time, Year = format(date, "%Y"), Month = format(date, "%B")) %>%
summarise(total_bill = sum(total_bill, na.rm = TRUE))

library(dplyr)
d_sum <- dat %>%
group_by(month = substr(date, 1, 7)) %>%
summarise(sum = sum(total_bill))
d_sum
# A tibble: 2 x 2
month sum
<chr> <dbl>
1 2020-01 44.8
2 2020-02 45.8

Related

How to find the row containing the maximum value and its associated year when the Year column contains multiple years in R

How do I find the row containing the maximum value and its associated year when the Year column contains multiple years? My data frame contains monthly river discharge data from January 2013 to December 2020. For example, I would like to find the maximum discharge for 2013 together with the date (day/month/year) on which it occurred. How would I do that in R?
Year        Discharge
1/1/2013    23
2/1/2013    45
...         ...
12/31/2020  80
We can convert the column to Date class, get the year as a separate column, do a group by and slice the max row
library(dplyr)
library(lubridate)
df1 %>%
group_by(year = year(mdy(Year))) %>%
slice_max(n = 1, order_by = Discharge) %>%
ungroup
-output
# A tibble: 2 x 3
Year Discharge year
<chr> <int> <dbl>
1 2/1/2013 45 2013
2 12/31/2020 80 2020
If there are multiple date formats in the 'Year' column, use parse_date from parsedate
library(parsedate)
df1 %>%
group_by(year = year(parse_date(Year))) %>%
slice_max(n = 1, order_by = Discharge) %>%
ungroup
Update
Based on the dput in the comments, the 'Date' column is already in Date class
df1 %>%
group_by(year= year(Date)) %>%
slice_max(n = 1, order_by = Discharge, with_ties = FALSE) %>%
ungroup
-output
# A tibble: 1 x 3
Date Discharge year
<date> <dbl> <dbl>
1 2018-06-07 0.0116 2018
data
df1 <- structure(list(Year = c("1/1/2013", "2/1/2013", "12/31/2020"),
Discharge = c(23L, 45L, 80L)), class = "data.frame", row.names = c(NA,
-3L))
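For readers who prefer not to load dplyr and lubridate, the same "convert, extract the year, keep the max row per group" idea can be sketched in base R. This assumes the three-row df1 shown above and the month/day/year format. The helper column names Date and yr are introduced only for this example.

```r
# Example data from the answer above
df1 <- data.frame(
  Year = c("1/1/2013", "2/1/2013", "12/31/2020"),
  Discharge = c(23L, 45L, 80L)
)

# Parse month/day/year strings and pull out the calendar year
df1$Date <- as.Date(df1$Year, format = "%m/%d/%Y")
df1$yr <- as.integer(format(df1$Date, "%Y"))

# Split by year and keep the first row with the largest Discharge in each piece
max_rows <- do.call(
  rbind,
  lapply(split(df1, df1$yr), function(g) g[which.max(g$Discharge), ])
)
max_rows
```

Note that which.max() returns only the first maximum, so ties within a year are dropped, similar to with_ties = FALSE in slice_max().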

Mean of a few months for monthly data in R

I want to find the average of the months from Nov to March, say Nov 1982 to Mar 1983. Then, for my result, I want a column with year and mean in another. If the mean is taken till Mar 1983, I want the year to be shown as 1983 along with that mean.
This is what my data looks like.
I want my result to look like this.
1983 29.108
1984 26.012
I am not very good with R packages. If there is an easy way to do this, I would really appreciate any help. Thank you.
Here is one approach to get average of Nov-March every year.
library(dplyr)
df %>%
#Remove data for month April-October
filter(!between(month, 4, 10)) %>%
#arrange the data by year and month
arrange(year, month) %>%
#Remove 1st 3 months of the first year and
#last 2 months of last year
filter(!(year == min(year) & month %in% 1:3 |
year == max(year) & month %in% 11:12)) %>%
#Create a group column for every November entry
group_by(grp = cumsum(month == 11)) %>%
#Take average for each year
summarise(year = last(year),
value = mean(value)) %>%
select(-grp)
# A tibble: 2 x 2
# year value
# <int> <dbl>
#1 1982 0.308
#2 1983 -0.646
data
It is easier to help if you provide data in a reproducible format which can be copied easily.
set.seed(123)
df <- data.frame(year = rep(1981:1983, each = 12),month = 1:12,value = rnorm(36))
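An alternative way to think about the same problem is to give every November and December the label of the following year, so that each Nov-Mar season shares one "season year". The base R sketch below illustrates this under the assumption that the data has integer year and month columns as above. It uses a deterministic value column (1 to 36) instead of random numbers so the result can be checked by hand, and the names season_year and full_seasons are invented for the example.

```r
# Toy data: 3 years of monthly values 1..36 (assumption, for easy checking)
df <- data.frame(year = rep(1981:1983, each = 12),
                 month = 1:12,
                 value = seq_len(36))

# Keep only November to March
winter <- df[df$month %in% c(11, 12, 1, 2, 3), ]

# Nov/Dec belong to the season that ends in the following year
winter$season_year <- winter$year + (winter$month >= 11)

# Drop incomplete seasons (fewer than 5 months available)
n_months <- table(winter$season_year)
full_seasons <- as.integer(names(n_months)[n_months == 5])
winter <- winter[winter$season_year %in% full_seasons, ]

# Mean per season, labelled by the year in which the season ends
season_mean <- aggregate(value ~ season_year, data = winter, FUN = mean)
season_mean
```

With this toy data the 1982 season covers the values 11 to 15 and the 1983 season covers 23 to 27, so the means are 13 and 25.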
With dplyr
# remove the "#" at the beginning of the next line if dplyr or tidyverse is not installed
#install.packages("dplyr")
library(dplyr) #reading the library
colnames(df) <- c("year","month","value") #here I assumed your dataset is named df
df<- df%>%
group_by(year) %>%
summarize(av_value =mean(value))
You can do this as follows using tidyverse
library(tidyverse)
# each = 12 gives twelve consecutive months per year
year <- rep(1982:1984, each = 12)
month <- rep(1:12, 3)
value <- runif(length(month))
dat <- data.frame(year, month, value)
head(dat)
dat looks like your data
The trick then is to group_by and summarise
dat %>%
group_by(year) %>%
summarise(value = mean(value))
Which gives you
# A tibble: 3 × 2
year value
<int> <dbl>
1 1982 0.450
2 1983 0.574
3 1984 0.398

Creating intervals

I have a data set that I would like to split into 10-day intervals. The code that I included below does that, but at the end of a month there are days (e.g., the 30th or 31st) that end up in an interval by themselves.
I would like to either remove the intervals that create this or include them in the previous intervals.
For example:
If I separate the month of January into 10-day intervals, it would put the first 10 days in an element of a list, the second 10 days into another element, and the third 10 days into another one. It would then put January 31st into an element of the list by itself.
My desired output would be to either remove these elements from the list or more preferably include them in the third 10-day interval. Can that be done? If so, what would be the best way to do so?
library(lubridate)
library(tidyverse)
date <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2013"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(date = date,
x = runif(length(date), min = 60000, max = 80000),
y = runif(length(date), min = 800000, max = 900000),
ID)
int <- df %>%
arrange(ID) %>%
mutate(new = ceiling_date(date, '10 day')) %>%
# mutate(cut = data.table::rleid(cut(new, breaks = "10 day"))) %>%
group_by(new) %>%
group_split()
Here is a solution which splits the months by 10-day intervals but corrects new to assign day 31 of a month to the last period. So,
days 1 to 10 belong to the first third of a month,
days 11 to 20 to the second third, and
days 21 to 31 to the third third.
int <- df %>%
# arrange(ID) %>% # skipped for readability of result
mutate(new = floor_date(date, '10 day')) %>%
mutate(new = if_else(day(new) == 31, new - days(10), new)) %>%
group_by(new) %>%
group_split()
int[[1]]
# A tibble: 6 x 5
date x y ID new
<date> <dbl> <dbl> <int> <date>
1 2010-12-26 71469. 819084. 1 2010-12-21
2 2010-12-27 69417. 893227. 2 2010-12-21
3 2010-12-28 70865. 831341. 3 2010-12-21
4 2010-12-29 68322. 812423. 4 2010-12-21
5 2010-12-30 65643. 837395. 5 2010-12-21
6 2010-12-31 63638. 892200. 1 2010-12-21
Now, 2010-12-31 was assigned to the third third of December.
Note that new indicates the start of the interval because floor_date() is called instead of ceiling_date(). This is done to avoid potential problems with day arithmetic across month boundaries and to make clear which month the interval belongs to. For instance, for the last day of February, ceiling_date(ymd('2011-02-28'), '10 day') returns "2011-03-03" which is a date in March.
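The rule "days 1-10, 11-20, and 21 to end of month" can also be written as plain integer arithmetic on the day of the month, without any rounding function. The following base R sketch is one way to do it. The sample dates and the names day_of_month, start_day and interval_start are assumptions made for illustration.

```r
# A few sample dates, including a 31st and the end of February
dates <- as.Date(c("2011-01-05", "2011-01-10", "2011-01-11",
                   "2011-01-21", "2011-01-31", "2011-02-28"))

day_of_month <- as.integer(format(dates, "%d"))

# (day - 1) %/% 10 is 0, 1, 2 for days 1-30 and 3 for day 31;
# pmin() caps it at 2 so the 31st joins the last third
start_day <- pmin((day_of_month - 1L) %/% 10L, 2L) * 10L + 1L

# Rebuild a Date marking the first day of each interval
interval_start <- as.Date(sprintf("%s-%02d", format(dates, "%Y-%m"), start_day))
interval_start

# split() then yields one list element per interval
by_interval <- split(dates, interval_start)
```

Since the interval start always lies in the same month as the original date, no interval can spill over into the next month.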
If there is a single row in a group, give it the previous new value. Try this:
library(dplyr)
library(lubridate)
df %>%
arrange(ID, date) %>%
mutate(new = ceiling_date(date, '10 day')) %>%
add_count(new) %>%
mutate(new = if_else(n == 1, lag(new), new)) %>%
select(-n) %>%
group_split(new)
The above only works to combine groups that have a single observation. If we want to combine groups that cover more than one day, use the code below, which counts the number of distinct days in each group and merges a group into the previous one if it has fewer than n days.
n <- 2
df %>%
arrange(ID, date) %>%
mutate(new = ceiling_date(date, '10 day'),
ID = match(new, unique(new))) -> tmp
tmp %>%
group_by(new, ID) %>%
summarise(count_unique = n_distinct(date)) %>%
ungroup %>%
mutate(new = if_else(count_unique < n, lag(new), new)) %>%
inner_join(tmp, by = 'ID') %>%
select(new = new.x, date, x, y) %>%
group_split(new)
Alternative solution
library(lubridate)
library(tidyverse)
dt <- rep_len(seq(dmy("26-12-2010"), dmy("20-12-2013"), by = "days"), 500)
ID <- rep(seq(1, 5), 100)
df <- data.frame(dt = dt,
x = runif(length(dt), min = 60000, max = 80000),
y = runif(length(dt), min = 800000, max = 900000),
ID)
Include extra days (31st) into the last third
int_df <- df %>%
# arrange(ID) %>%
mutate(day_date = day(dt),
day_new = case_when(
day_date <= 10 ~ 1,
day_date <= 20 ~ 11,
TRUE ~ 21
),
new = ymd(paste(year(dt), month(dt), day_new, sep = "-"))) %>%
select(-c(day_date, day_new)) %>%
group_by(new) %>%
group_split()
int_df[[1]]
#> # A tibble: 6 x 5
#> dt x y ID new
#> <date> <dbl> <dbl> <int> <date>
#> 1 2010-12-26 62395. 837491. 1 2010-12-21
#> 2 2010-12-27 66236. 836481. 2 2010-12-21
#> 3 2010-12-28 79918. 818399. 3 2010-12-21
#> 4 2010-12-29 67613. 807213. 4 2010-12-21
#> 5 2010-12-30 72980. 899380. 5 2010-12-21
#> 6 2010-12-31 61004. 876191. 1 2010-12-21
Exclude extra days (31st)
int_df <- df %>%
# arrange(ID) %>%
mutate(day_date = day(dt),
day_new = case_when(
day_date <= 10 ~ 1,
day_date <= 20 ~ 11,
day_date <= 30 ~ 21,
TRUE ~ 31
),
new = ymd(paste(year(dt), month(dt), day_new, sep = "-"))) %>%
filter(day_date != 31) %>%
select(-c(day_date, day_new)) %>%
group_by(new) %>%
group_split()
int_df[[1]]
#> # A tibble: 5 x 5
#> dt x y ID new
#> <date> <dbl> <dbl> <int> <date>
#> 1 2010-12-26 62395. 837491. 1 2010-12-21
#> 2 2010-12-27 66236. 836481. 2 2010-12-21
#> 3 2010-12-28 79918. 818399. 3 2010-12-21
#> 4 2010-12-29 67613. 807213. 4 2010-12-21
#> 5 2010-12-30 72980. 899380. 5 2010-12-21
Created on 2021-07-03 by the reprex package (v2.0.0)

Addition of missing data after floor_date / detect and fill in missing data gaps

I would like to sum up a larger set of data per month. floor_date offers the right functionality to sum up the data from the individual days on a monthly level. But unfortunately I need to make sure that all months are included in the final table. The initial data, however, does not always cover all months, and after floor_date there must be a 0 in the corresponding months; the rows / months must not simply be missing. How can I ensure this automatically?
The following exemplary code clarifies my problem:
df <- data.frame(
time = c(as.Date("01-01-2020", format = "%d-%m-%Y"), as.Date("02-01-2020", format = "%d-%m-%Y"), as.Date("01-03-2020", format = "%d-%m-%Y")),
text = c("A", "A", "B")
)
library(dplyr)
library(lubridate)
df2 <- df %>%
mutate(month = floor_date(time, unit = "month")) %>%
select(text, month) %>%
group_by(month, text) %>%
summarise(n = n())
df2
# A tibble: 2 x 3
# Groups: month [2]
month text n
<date> <fct> <int>
1 2020-01-01 A 2
2 2020-03-01 B 1
It should be recognized that there is no data for B in month 2020-01, no data for A and B in month 2020-02, and no data for A in month 2020-03: these rows should be added with value 0.
Unfortunately, so far I have not found a solution to solve the problem in an automated way.
Thanks in advance!
I do not see the need for format() when creating the month variable with floor_date. Formatting turns the variable into character type, so no further date calculations can be performed on it.
Remove that step and use tidyr::complete to fill in the missing months, as shown below:
df <- data.frame(
time = c(as.Date("01-01-2020", format = "%d-%m-%Y"), as.Date("02-01-2020", format = "%d-%m-%Y"), as.Date("01-03-2020", format = "%d-%m-%Y")),
text = c("A", "A", "B")
)
library(lubridate, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
df %>%
mutate(month = floor_date(time, unit = "month")) %>%
group_by(text, month) %>%
summarise(n = n(), .groups = 'drop') %>%
complete(nesting(text), month = seq.Date(from = min(month), to = max(month), by = '1 month'), fill = list(n = 0))
# A tibble: 6 x 3
text month n
<chr> <date> <dbl>
1 A 2020-01-01 2
2 A 2020-02-01 0
3 A 2020-03-01 0
4 B 2020-01-01 0
5 B 2020-02-01 0
6 B 2020-03-01 1
Created on 2021-07-06 by the reprex package (v2.0.0)
Base R option using cut -
stack(table(cut(df$time,'month')))[2:1]
# ind values
#1 2020-01-01 2
#2 2020-02-01 0
#3 2020-03-01 1
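The base R idea can be extended to give one zero-filled row per text and month. The sketch below is one possible way, assuming the df from the question. It turns the month into a factor whose levels cover the full month sequence, so that table() reports empty combinations as 0. The names month_start, all_months and counts are introduced only for this example.

```r
# Data from the question
df <- data.frame(
  time = as.Date(c("2020-01-01", "2020-01-02", "2020-03-01")),
  text = c("A", "A", "B")
)

# First day of each observation's month
month_start <- as.Date(format(df$time, "%Y-%m-01"))

# Every month between the first and last observed month
all_months <- seq(min(month_start), max(month_start), by = "month")

# Factor levels include months with no data, so table() counts them as 0
df$month <- factor(format(month_start), levels = format(all_months))

counts <- as.data.frame(table(text = df$text, month = df$month),
                        responseName = "n")
counts
```

The result has 2 texts times 3 months, i.e. 6 rows, with n = 0 wherever a combination did not occur in the data.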

Summing the number of occurrences from m/d/y to y/m

I have data on each avalanche that occurred. I need to calculate the number of avalanches that occurred in each year and month, but the data only gives the exact days on which an avalanche occurred. How do I count the number of occurrences in each year-month? I also only need the winter year-months (Dec (12) - March (3)). Please help!
library(XML)
library(RCurl)
library(dplyr)
avalanche<-data.frame()
avalanche.url<-"https://utahavalanchecenter.org/observations?page="
all.pages<-0:202
for(page in all.pages){
this.url<-paste(avalanche.url, page, sep="")
this.webpage<-htmlParse(getURL(this.url))
thispage.avalanche<-readHTMLTable(this.webpage, which=1, header=T,stringsAsFactors=F)
names(thispage.avalanche)<-c('Date','Region','Location','Observer')
avalanche<-rbind(avalanche,thispage.avalanche)
}
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
str(avalancheslc)
The output should look something like:
Date AvalancheTotal
2000-01 1
2000-02 2
2000-03 8
2000-12 23
2001-01 16
.
.
.
.
.
2019-03 45
Using dplyr, you could get the variable of interest ("year-month") from the Date column, group by this variable, and then compute the number of rows in each group.
In a similar way, you can filter to only get the months you like:
library(dplyr)
winter_months <- c(1:3, 12)
avalancheslc %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
mutate(YearMonth = format(Date,"%Y-%m"),
Month = as.numeric(format(Date,"%m"))) %>%
filter(Month %in% winter_months) %>%
group_by(YearMonth) %>%
summarise(AvalancheTotal = n())
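The same steps (parse the date, keep winter months, count per year-month) can be tried on a tiny data set without scraping the website. The base R sketch below uses five made-up observation dates in the m/d/Y format. The obs data frame and its contents are purely hypothetical.

```r
# Hypothetical observation dates in month/day/year format
obs <- data.frame(Date = c("12/30/2018", "1/5/2019", "1/20/2019",
                           "3/2/2019", "4/1/2019"))

d <- as.Date(obs$Date, format = "%m/%d/%Y")

# TRUE for December through March
is_winter <- as.integer(format(d, "%m")) %in% c(12, 1:3)

# Count observations per "YYYY-MM" key among the winter dates
counts <- as.data.frame(table(YearMonth = format(d[is_winter], "%Y-%m")),
                        responseName = "AvalancheTotal",
                        stringsAsFactors = FALSE)
counts
```

The April observation is filtered out, and the remaining four fall into three year-months (2018-12, 2019-01, 2019-03).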
We can convert to yearmon from zoo and use that in the group_by to get the number of rows
library(dplyr)
library(zoo)
dim(avalancheslc)
#[1] 5494 4
out <- avalancheslc %>%
group_by(Date = format(as.yearmon(Date, "%m/%d/%Y"), "%Y-%m")) %>%
summarise(AvalancheTotal = n())
If we need only output from December to March, then filter the data
subOut <- out %>%
filter(as.integer(substr(Date, 6, 7)) %in% c(12, 1:3))
Or it can be filtered earlier in the chain
library(lubridate)
out <- avalancheslc %>%
mutate(Date = as.yearmon(Date, "%m/%d/%Y")) %>%
filter(month(Date) %in% c(12, 1:3)) %>%
count(Date)
dim(out)
#[1] 67 2
Now, for filling with 0's (crossing, unite and replace_na come from tidyr)
library(tidyr)
mths <- month.abb[c(12, 1:3)]
out1 <- crossing(Months = mths,
Year = year(min(out$Date)):year(max(out$Date))) %>%
unite(Date, Months, Year, sep= " ") %>%
mutate(Date = as.yearmon(Date)) %>%
left_join(out) %>%
mutate(n = replace_na(n, 0))
tail(out1)
# A tibble: 6 x 2
# Date n
# <S3: yearmon> <dbl>
#1 Mar 2014 100
#2 Mar 2015 94
#3 Mar 2016 96
#4 Mar 2017 93
#5 Mar 2018 126
#6 Mar 2019 163

Resources