R Calculate change in Weekly values Year on Year (with additional complication)

I have a data set of daily values spanning December 1, 2018 to April 1, 2020.
The columns are "date" and "value", as shown here:
date <- c("2018-12-01","2018-12-02", "2018-12-03",
...
"2020-03-30","2020-03-31","2020-04-01")
value <- c(1592,1825,1769,1909,2022, .... 2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
What I would like to do is sum the values by week and then calculate, for each week, the change from the previous year to the current year.
I know that I can sum by week using the following code:
Data_week <- df %>% group_by(week = cut(as.Date(date), "week")) %>% mutate(summed = sum(value))
My questions are twofold:
1) How do I sum by week and then reshape the data frame so that I can calculate the year-over-year change for each week (e.g. week of Dec. 1, 2019 / week of Dec. 1, 2018)?
2) How can I do the above using a "customized" week? Say I want to define weeks by moving back 7 days at a time from the latest date I have data for, e.g. the latest week would start on March 26th (the 7 days ending April 1st).

We can use lag from dplyr to compute the change from one week to the next, along with some convenience functions from lubridate.
library(dplyr)
library(lubridate)
df %>%
  mutate(year = year(date)) %>%
  group_by(week = week(date), year) %>%
  summarize(summed = sum(value)) %>%
  arrange(year, week) %>%
  ungroup() %>%
  mutate(change = summed - lag(summed))
# week year summed change
# <dbl> <dbl> <dbl> <dbl>
# 1 48 2018 3638. NA
# 2 49 2018 15316. 11678.
# 3 50 2018 13283. -2033.
# 4 51 2018 15166. 1883.
# 5 52 2018 12885. -2281.
# 6 53 2018 1982. -10903.
# 7 1 2019 14177. 12195.
# 8 2 2019 14969. 791.
# 9 3 2019 14554. -415.
#10 4 2019 12850. -1704.
#11 5 2019 1907. -10943.
If you would like to define "weeks" differently, there are also isoweek and epiweek. See this answer for a great explanation of your options.
Data
set.seed(1)
df <- data.frame(date = seq.Date(from = as.Date("2018-12-01"), to = as.Date("2019-01-29"), "days"), value = runif(60,1500,2500))
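The code above compares each week with the one before it. For the year-over-year comparison and the "customized" week from the question, here is a minimal sketch. It rests on two assumptions that are not in the original posts: weeks are counted back in 7-day blocks from the latest date, and "the same week last year" means the block exactly 52 weeks (364 days) earlier, so both blocks start on the same weekday. The names weeks_back, weekly and yoy are made up for illustration, and the values are random.

```r
library(dplyr)

# Toy daily data covering the span described in the question (random values).
set.seed(1)
dates <- seq.Date(as.Date("2018-12-01"), as.Date("2020-04-01"), by = "day")
df <- data.frame(date = dates, value = runif(length(dates), 1500, 2500))

latest <- max(df$date)

weekly <- df %>%
  # weeks_back is 0 for the 7 days ending on the latest date,
  # 1 for the 7 days before that, and so on.
  mutate(weeks_back = as.integer(latest - date) %/% 7L) %>%
  group_by(weeks_back) %>%
  summarise(week_start = min(date), n_days = n(),
            summed = sum(value), .groups = "drop") %>%
  filter(n_days == 7L)  # drop the incomplete oldest block

# Assumption: "last year's week" is the block 52 weeks earlier.
yoy <- weekly %>%
  inner_join(
    weekly %>% transmute(weeks_back = weeks_back - 52L, summed_prev = summed),
    by = "weeks_back"
  ) %>%
  mutate(change = summed - summed_prev,
         ratio = summed / summed_prev) %>%
  arrange(week_start)
```

With this data the latest block starts on 2020-03-26, as in the question, and yoy only has rows for weeks that have a counterpart 52 weeks earlier.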

Related

Sum table values per day

I have a table, as shown in the image, where each comment has a publication date with year, month, day and time. I would like to sum the sentiment values by day.
This is how the table is composed:
serie <- data.frame(comments$created_time,sentiment2$positive-sentiment2$negative)
Using dplyr you can do:
library(dplyr)
df %>%
  group_by(as.Date(comments.created_time)) %>%
  summarize(total = sum(sentiment))
Here is some sample data that will help others to troubleshoot and understand the data:
df <- tibble(comments.created_time = c("2015-01-26 22:43:00",
                                       "2015-01-26 22:44:00",
                                       "2015-01-27 22:43:00",
                                       "2015-01-27 22:44:00",
                                       "2015-01-28 22:43:00",
                                       "2015-01-28 22:44:00"),
             sentiment = c(1,3,5,1,9,1))
Using the sample data will yield:
# A tibble: 3 × 2
`as.Date(comments.created_time)` total
<date> <dbl>
1 2015-01-26 4
2 2015-01-27 6
3 2015-01-28 10
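The grouping column in that output carries the whole expression as its name. A small variation, assuming you would rather have a plain column name, is to name the expression inside group_by(). The name day below is just an illustrative choice.

```r
library(dplyr)

# Same sample data as above.
df <- tibble(
  comments.created_time = c("2015-01-26 22:43:00", "2015-01-26 22:44:00",
                            "2015-01-27 22:43:00", "2015-01-27 22:44:00",
                            "2015-01-28 22:43:00", "2015-01-28 22:44:00"),
  sentiment = c(1, 3, 5, 1, 9, 1)
)

daily <- df %>%
  # Naming the expression gives the grouping column a usable name.
  group_by(day = as.Date(comments.created_time)) %>%
  summarize(total = sum(sentiment))
```

The totals are the same as before; only the column name changes.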

Start and end of events assign to month of start based on condition

I have data for 2000 events, with the start date, end date and length of each event.
What I am trying to do is find the frequency of events by month and year. Several events are split between two consecutive months (say May and June), and I want each of these to be counted in the month in which it spends more days. If an event is split equally between two months, it should be counted in the month in which it starts.
Eg:
> date01[1:5,9:11]
# A tibble: 5 x 3
StrD EndD EvLength
<date> <date> <drtn>
1 1993-12-30 1994-01-01 3 days # this would be reported Dec frequency
2 2000-07-23 2000-08-02 11 days # this would be reported July frequency
3 2001-02-28 2001-03-01 2 days # this would be reported Feb frequency (as it started in Feb)
4 2006-05-29 2006-06-01 4 days # this would be reported May frequency (as more of its days fall in May)
5 2010-07-30 2010-08-04 6 days
I tried to use group_by (from dplyr), but still not able to figure it out.
The steps are:
1) Convert the dates to Date format with ymd() from the lubridate package.
2) Compute the days spent in the start month and in the end month with days_in_month() and basic arithmetic. The start day itself is counted, hence the +1.
3) Pick the month with more days using ifelse(); ties go to the start month.
4) Get the month abbreviation with month.abb[Month].
5) Get the year from the start date.
6) Group and summarise.
library(dplyr)
library(lubridate)
df %>%
  mutate(across(1:2, ymd)) %>%
  mutate(prev_month_days = days_in_month(StrD) - day(StrD) + 1,
         next_month_days = day(EndD)) %>%
  mutate(Month = ifelse(prev_month_days >= next_month_days, month(StrD), month(EndD))) %>%
  mutate(Month = month.abb[Month]) %>%
  mutate(Year = year(StrD)) %>%
  group_by(Year, Month) %>%
  summarise(n = n())
Output:
Year Month n
<int> <chr> <int>
1 1993 Dec 1
2 2000 Jul 1
3 2001 Feb 1
4 2006 May 1
5 2010 Aug 1
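For anyone who wants to run this end to end, here is a self-contained sketch using the five events from the question plus one hypothetical sixth event (1999-12-30 to 2000-01-05) added for illustration. It differs from the answer in one respect, which is my own assumption about what is wanted: the year is taken from the same date as the month, so an event assigned to January is counted in the new year rather than in the start year. The names events, assigned, counts and use_start are illustrative.

```r
library(dplyr)
library(lubridate)

events <- tibble(
  StrD = ymd(c("1993-12-30", "2000-07-23", "2001-02-28",
               "2006-05-29", "2010-07-30", "1999-12-30")),  # last row is hypothetical
  EndD = ymd(c("1994-01-01", "2000-08-02", "2001-03-01",
               "2006-06-01", "2010-08-04", "2000-01-05"))
)

assigned <- events %>%
  mutate(
    # Days spent in the start month (start day included) and in the end month.
    days_in_start_month = unname(days_in_month(StrD)) - day(StrD) + 1,
    days_in_end_month = day(EndD),
    # Ties go to the start month.
    use_start = days_in_start_month >= days_in_end_month,
    Month = month.abb[if_else(use_start, month(StrD), month(EndD))],
    # Assumption: the year follows the chosen month, not always the start date.
    Year = if_else(use_start, year(StrD), year(EndD))
  )

counts <- count(assigned, Year, Month)
```

Like the answer, this assumes that no event spans more than two calendar months.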

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
I manufactured a sample data set and used dplyr to build a solution. It is best practice to include code that generates sample data, so please do so in future questions.
# load required library
library(dplyr)

# generate data set of all site, well, and month combinations
## define valid values
sites <- letters[1:3]
wells <- 1:5
months <- month.name

## perform a series of merges (cross joins)
## on your system, the first two lines could be replaced with
## initial_tibble %>% dplyr::select(sites, wells) %>% unique()
full_sites_wells_months_set <-
  merge(sites, wells) %>%
  dplyr::rename(sites = x, wells = y) %>%
  merge(months) %>%
  dplyr::rename(months = y) %>%
  dplyr::arrange(sites, wells)

# create sample initial_tibble
## define fraction of records to keep, to simulate missing months
data_availability <- 0.8
initial_tibble <-
  full_sites_wells_months_set %>%
  dplyr::sample_frac(data_availability) %>%
  # generate random groundwater values
  dplyr::mutate(values = runif(nrow(full_sites_wells_months_set) * data_availability))

# generate final result: join the full expected set of sites, wells, and months
# to the actual data, then group by sites and wells and subtract the value
# from 6 months earlier
final_tibble <-
  full_sites_wells_months_set %>%
  dplyr::left_join(initial_tibble, by = c("sites", "wells", "months")) %>%
  dplyr::group_by(sites, wells) %>%
  dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
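The same idea (make the missing months explicit, then shift by six rows) can also be sketched with tidyr::complete() on the column names from the question. The toy values below are hypothetical, and the sketch assumes that month is a factor whose levels are the twelve month names in calendar order. As in the question's own attempt (February - August), the difference is the earlier month minus the month six months later, within each well and year.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the question's layout; several months are missing.
gw <- tibble(
  Well = 222,
  year = 1995,
  month = factor(c("January", "February", "April", "July", "August", "October"),
                 levels = month.name),
  value = c(8.40, 8.53, 8.92, 9.70, 9.66, 9.49)
)

diffs <- gw %>%
  # Add NA rows for unobserved months, but only for well-years that exist.
  complete(nesting(Well, year), month) %>%
  arrange(Well, year, month) %>%
  group_by(Well, year) %>%
  mutate(value_6m_later = lead(value, 6),
         diff = value - value_6m_later) %>%
  ungroup() %>%
  # Keep January to June; their partners are July to December.
  filter(as.integer(month) <= 6)
```

Pairs where either month is missing come out as NA rather than being silently matched to the wrong month.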

Using filter in dplyr to generate values for all rows

library(tidyverse)
library(nycflights13)
nycflights13::flights
If the following expression gives flights per day from the dataset:
daily <- dplyr::group_by( flights, year, month, day)
(per_day <- dplyr::summarize( daily, flights = n()))
I wanted something similar for cancelled flights:
canx <- dplyr::filter( flights, is.na(dep_time) & is.na(arr_time))
canx2 <- canx %>% dplyr::group_by( year, month, day)
My goal was to have the same length of data frame as for all summarised flights.
I can get number of flights cancelled per day:
(canx_day <- dplyr::summarize( canx2, flights = n()))
but obviously this is a slightly shorter data frame, so I cannot run e.g.:
canx_day$propcanx <- per_day$flights/canx_day$flights
Even if I introduce NAs I can replace them.
So my question is, should I not be using filter, or are there arguments to filter I should be applying?
Many thanks
You should not be using filter. As others suggest, this is easy with a canceled column, so our first step will be to create that column. Then you can easily get whatever you want with a single summarize. For example:
flights %>%
  mutate(canceled = as.integer(is.na(dep_time) & is.na(arr_time))) %>%
  group_by(year, month, day) %>%
  summarize(n_scheduled = n(),
            n_not_canceled = sum(!canceled),
            n_canceled = sum(canceled),
            prop_canceled = mean(canceled))
# # A tibble: 365 x 7
# # Groups: year, month [?]
# year month day n_scheduled n_not_canceled n_canceled prop_canceled
# <int> <int> <int> <int> <int> <int> <dbl>
# 1 2013 1 1 842 838 4 0.004750594
# 2 2013 1 2 943 935 8 0.008483563
# 3 2013 1 3 914 904 10 0.010940919
# 4 2013 1 4 915 909 6 0.006557377
# 5 2013 1 5 720 717 3 0.004166667
# 6 2013 1 6 832 831 1 0.001201923
# 7 2013 1 7 933 930 3 0.003215434
# 8 2013 1 8 899 895 4 0.004449388
# ...
This gives you the number of flights and canceled flights per day, grouped by flight, year, month and day:
nycflights13::flights %>%
  group_by(flight, year, month, day) %>%
  summarize(per_day = n(),
            canx = sum(ifelse(is.na(arr_time), 1, 0)))
There is a simple way to calculate the number of flights canceled per day. Let's assume that a Cancelled column is TRUE for each canceled flight. Then the daily count of canceled flights is:
flights %>%
  group_by(year, month, day) %>%
  summarize(canx_day = sum(Cancelled))
canx_day will contain the number of canceled flights for each day.
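If you do want to keep the filter() step from the question, one option is to join the filtered counts back onto the full per-day table and replace the resulting NAs with zero. The sketch below uses a small hypothetical stand-in for nycflights13::flights (same column names, made-up rows) so that it runs without the package. flights_toy, result and prop_cancelled are illustrative names.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for nycflights13::flights (hypothetical rows).
flights_toy <- tibble(
  year = 2013, month = 1, day = c(1, 1, 1, 2, 2, 3),
  dep_time = c(517, NA, 542, 554, 555, NA),
  arr_time = c(830, NA, 923, 812, 913, NA)
)

# All flights per day.
per_day <- flights_toy %>%
  count(year, month, day, name = "flights")

# Cancelled flights per day; days without cancellations are absent here.
canx_day <- flights_toy %>%
  filter(is.na(dep_time) & is.na(arr_time)) %>%
  count(year, month, day, name = "cancelled")

# The left join keeps every day from per_day; missing counts become 0.
result <- per_day %>%
  left_join(canx_day, by = c("year", "month", "day")) %>%
  mutate(cancelled = replace_na(cancelled, 0L),
         prop_cancelled = cancelled / flights)
```

The result has one row per day, so the proportion can be computed column-wise without worrying about mismatched lengths.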

Convert data.frame wide to long while concatenating date formats

In R (or another language), I want to transform the upper data frame into the lower one.
How can I do that?
Thank you in advance.
year month income expense
2016 07 50 15
2016 08 30 75
month income_expense
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
Well, it seems that you are trying to do several operations in one question: combining the date columns, melting your data, renaming some columns, and sorting.
This will give your expected output:
library(tidyr); library(reshape2); library(dplyr)
df %>% unite("month", c(year, month), sep = "-") %>%
  mutate(expense = -expense) %>% melt(value.name = "income_expense") %>%
  select(-variable) %>% arrange(month)
####     month income_expense
#### 1 2016-07             50
#### 2 2016-07            -15
#### 3 2016-08             30
#### 4 2016-08            -75
I'm using three different libraries here, for better readability of the code. It might be possible to do it with base R, though.
Here's a solution using only two packages, dplyr and tidyr
First, your dataset:
df <- dplyr::tibble(
  year = 2016,
  month = c("07", "08"),
  income = c(50, 30),
  expense = c(15, 75)
)
The mutate() function in dplyr creates/edits individual variables. The gather() function in tidyr will bring multiple variables/columns together in the way that you specify.
df <- df %>%
  dplyr::mutate(
    month = paste0(year, "-", month)
  ) %>%
  tidyr::gather(
    key = direction,        # your name for the new column containing classification 'key'
    value = income_expense, # your name for the new column containing values
    income:expense          # which columns you're acting on
  ) %>%
  dplyr::mutate(income_expense =
    ifelse(direction == 'expense', -income_expense, income_expense)
  )
The output has all the information you'd need (but we will clean it up in the last step)
> df
# A tibble: 4 × 4
year month direction income_expense
<dbl> <chr> <chr> <dbl>
1 2016 2016-07 income 50
2 2016 2016-08 income 30
3 2016 2016-07 expense -15
4 2016 2016-08 expense -75
Finally, we select() to drop columns we don't want, and then arrange it so that df shows the rows in the same order as you described in the question.
df <- df %>%
  dplyr::select(-year, -direction) %>%
  dplyr::arrange(month)
> df
# A tibble: 4 × 2
month income_expense
<chr> <dbl>
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
NB: I guess that I'm using three libraries, including magrittr for the pipe operator %>%. But, since the pipe operator is the best thing ever, I often forget to count magrittr.
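As a variation on the gather() approach, here is a minimal sketch with tidyr::pivot_longer(), which the tidyr documentation recommends over gather() for new code. It is not taken from either answer. The names long and direction are illustrative, and negating expense before reshaping is just one way to get the sign right.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  year = 2016,
  month = c("07", "08"),
  income = c(50, 30),
  expense = c(15, 75)
)

long <- df %>%
  # Build the "YYYY-MM" label and flip the sign of expenses up front.
  mutate(month = paste0(year, "-", month),
         expense = -expense) %>%
  # One output row per (input row, column), in input row order.
  pivot_longer(c(income, expense),
               names_to = "direction",
               values_to = "income_expense") %>%
  select(month, income_expense)
```

Because pivot_longer() emits the rows for each input row together, the result is already in the order shown in the question and needs no extra arrange().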
