Days since a variable changed (dplyr, R)

Does anyone know of a dplyr method for calculating the number of days since a variable changed (by groups)? For example, consider the number of days since a particular store last changed its price.
library(dplyr)
df <- data.frame(store = c(34, 34, 34, 34, 34, 28, 28, 28, 81, 81),
date = c(20111231, 20111224, 20111217, 20111210, 20111203,
20111224, 20111217, 20111203, 20111231, 20111224),
price = c(3.45, 3.45, 3.45, 3.36, 3.45, 3.17, 3.25, 3.15,
3.49, 3.17))
df <- df %>% mutate(date = as.Date(as.character(date), format = "%Y%m%d")) %>%
arrange(store, desc(date)) %>% group_by(store) %>%
mutate(pchange = price - lead(price))
df$days.since.change <- c(7, 14, 0, 21, 14, 7, 7, 0, 7, 0)
I'm trying to use dplyr to generate a variable called days.since.change. For example, store 34 charged $3.45 on 2011-12-31, a price that had been in effect for 21 days (since it charged $3.36 on 2011-12-10). The variable is entered manually above. The challenge is that a store might change its price back to an earlier price level, which invalidates some grouping strategies.

One option is to calculate the number of days between each price listing for each store and then add a second grouping variable that groups together consecutive dates during which the price didn't change. Then just take the cumulative sum over the days that passed.
I did this with the dataset sorted by date in ascending order, using lag instead of lead, to avoid calling arrange twice, but of course you could change this around. I also left the group variable in the dataset, which you likely won't want; you could remove it by ungrouping and then using select.
df %>% mutate(date = as.Date(as.character(date), format = "%Y%m%d")) %>%
arrange(store, date) %>%
group_by(store) %>%
mutate(pchange = price - lag(price), dchange = as.numeric(date - lag(date))) %>%
group_by(store, group = cumsum(c(1, diff(price) != 0))) %>%
mutate(dchange = cumsum(dchange))
Source: local data frame [10 x 6]
Groups: store, group
store date price pchange dchange group
1 28 2011-12-03 3.15 NA NA 1
2 28 2011-12-17 3.25 0.10 14 2
3 28 2011-12-24 3.17 -0.08 7 3
4 34 2011-12-03 3.45 NA NA 1
5 34 2011-12-10 3.36 -0.09 7 2
6 34 2011-12-17 3.45 0.09 7 3
7 34 2011-12-24 3.45 0.00 14 3
8 34 2011-12-31 3.45 0.00 21 3
9 81 2011-12-24 3.17 NA NA 1
10 81 2011-12-31 3.49 0.32 7 2
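To reduce this to the asker's single days.since.change column, the pipeline can be finished by ungrouping and dropping the helper columns. A sketch along the lines of the answer above (coalesce turns each store's leading NA into 0, matching the hand-built vector in the question):
```r
library(dplyr)

df %>%
  mutate(date = as.Date(as.character(date), format = "%Y%m%d")) %>%
  arrange(store, date) %>%
  group_by(store) %>%
  mutate(dchange = as.numeric(date - lag(date))) %>%
  group_by(store, group = cumsum(c(1, diff(price) != 0))) %>%
  mutate(days.since.change = coalesce(cumsum(dchange), 0)) %>%
  ungroup() %>%
  select(-group, -dchange, -pchange)
```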


Conditional Mutating using specific identifiers in a data frame

I have a dataset along the lines of the one below.
ID <- c(4,5,6,7,3,8,9)
quantity <- c(20,300, 350, 21, 44, 20, 230)
measurementvalue <- c("tin", "kg","kg","tin","tin","tin","kg")
kgs <- c(21,12, 30,23,33,11,24)
DF <- data.frame(ID, quantity, measurementvalue)
My standard way of deriving a new column totalkgs with conditional mutating is the code below.
DF <- DF %>%
mutate(totalkgs =
ifelse(measurementvalue == "tin", quantity * 5,
ifelse(measurementvalue == "kg", quantity * 1, quantity)))
However, the dataset has erroneous entries in the column quantity, so I'd like to perform a division for those specific identifiers. The final values of both the multiplication and the division should be stored in the column totalkgs. How do I go about this?
Let's assume the IDs with the erroneous data are 3, 5, 7, and 9, and I'd like to divide the values found in the column quantity by 10.
You could use case_when:
ID <- c(4,5,6,7,3,8,9)
quantity <- c(20,300, 350, 21, 44, 20, 230)
measurementvalue <- c("tin", "kg","kg","tin","tin","tin","kg")
kgs <- c(21,12, 30,23,33,11,24)
DF <- data.frame(ID, quantity, measurementvalue)
library(dplyr)
DF %>%
mutate(quantity2 = ifelse(ID %in% c(3,5,9,7), quantity/10, quantity)) %>%
mutate(totalkgs = case_when(measurementvalue == "tin" ~ quantity2 * 5,
measurementvalue == "kg" ~ quantity2 * 1,
TRUE ~ quantity2)) %>%
select(-quantity2) #if you want
#> ID quantity measurementvalue totalkgs
#> 1 4 20 tin 100.0
#> 2 5 300 kg 30.0
#> 3 6 350 kg 350.0
#> 4 7 21 tin 10.5
#> 5 3 44 tin 22.0
#> 6 8 20 tin 100.0
#> 7 9 230 kg 23.0
Created on 2022-07-05 by the reprex package (v2.0.1)
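The temporary quantity2 column can also be folded away by putting the ID correction directly into each case_when branch. A sketch of equivalent logic in a single mutate:
```r
library(dplyr)

DF %>%
  mutate(totalkgs = case_when(
    # corrected quantity (divide erroneous IDs by 10), then apply the unit factor
    measurementvalue == "tin" ~ ifelse(ID %in% c(3, 5, 7, 9), quantity / 10, quantity) * 5,
    measurementvalue == "kg"  ~ ifelse(ID %in% c(3, 5, 7, 9), quantity / 10, quantity),
    TRUE ~ quantity
  ))
```
Whether this reads better than the two-step version is a matter of taste; the helper-column approach keeps the correction and the unit conversion separate.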

Create a new dataframe showing the sum of each column

I have a dataframe that looks like this
Date Food Utility Travel
01 1.2 12.00 0
02 10.52 0 12.50
03 9.24 0 2.7
04 3.25 0 2.7
I want to create a new dataframe that shows in the first column the type of spending (e.g. food, utility) and then have the sum in another column. I do not need the date column in the new frame but don't want to omit it from the original.
I hope to have the below output.
Category Total
Utility 12.00
Food 24.21
Travel 17.9
I have tried creating a new value for each category, and then trying to pull them together in a dataframe but it has the transposed version, and seems a little long winded if I was to have lots of categories.
You could do this:
library(tidyverse)
test_data <- read_table2("Date Food Utility Travel
01 1.2 12.00 0
02 10.52 0 12.50
03 9.24 0 2.7
04 3.25 0 2.7")
test_data%>%
select(Food:Travel) %>%
pivot_longer(cols = everything(), names_to = "Category", values_to = "val") %>%
group_by(Category) %>%
summarise(Total = sum(val))
#> # A tibble: 3 x 2
#> Category Total
#> <chr> <dbl>
#> 1 Food 24.2
#> 2 Travel 17.9
#> 3 Utility 12
First select the columns you want, then pivot to long format, then summarize the categories by sum.
With base R, we can stack the columns except the first to a two column data.frame, and then do a group by sum with aggregate
aggregate(values ~ ind, stack(dat[-1]), sum)
# ind values
#1 Food 24.21
#2 Utility 12.00
#3 Travel 17.90
Or do colSums on the subset of columns and stack it
stack(colSums(dat[-1]))[2:1]
data
dat <- structure(list(Date = 1:4, Food = c(1.2, 10.52, 9.24, 3.25),
Utility = c(12, 0, 0, 0), Travel = c(0, 12.5, 2.7, 2.7)),
class = "data.frame", row.names = c(NA,
-4L))

Rolling 30-day geometric mean with variable width

The solution to this question by @ShirinYavari was almost what I needed, except for its static averaging-window width of 2. I have a dataset of random samples from multiple stations, and I want to calculate a rolling 30-day geometric mean. All samples within a 30-day window of a given sample should be averaged, and the width may change if the preceding samples are farther apart or closer together in time: for instance, you would need to average 2, 3, or more samples if 1, 2, or more preceding samples fall within 30 days of a given sample.
Here is some example data, plus my code attempt:
RESULT = c(50,900,25,25,125,50,25,25,2000,25,25,
25,25,25,25,25,25,325,25,300,475,25)
DATE = as.Date(c("2018-05-23","2018-06-05","2018-06-17",
"2018-08-20","2018-10-05","2016-05-22",
"2016-06-20","2016-07-25","2016-08-11",
"2017-07-21","2017-08-08","2017-09-18",
"2017-10-12","2011-04-19","2011-06-29",
"2011-08-24","2011-10-23","2012-06-28",
"2012-07-16","2012-08-14","2012-09-29",
"2012-10-24"))
FINAL_SITEID = c(rep("A", 5), rep("B", 8), rep("C", 9))
df=data.frame(FINAL_SITEID,DATE,RESULT)
data_roll <- df %>%
group_by(FINAL_SITEID) %>%
arrange(DATE) %>%
mutate(day=DATE-dplyr::lag(DATE, n=1),
day=replace_na(day, 1),
rnk=cumsum(c(TRUE, day > 30))) %>%
group_by(FINAL_SITEID, rnk) %>%
mutate(count=rowid(rnk)) %>%
mutate(GM30=rollapply(RESULT, width=count, geometric.mean, fill=RESULT, align="right"))
I get this error message, which seems like it should be an easy fix, but I can't figure it out:
Error: Column `rnk` must be length 5 (the group size) or one, not 6
The easiest way to compute rolling statistics over datetime windows is the runner package; you don't have to hack around to get exactly 30-day windows. The runner function allows you to apply any R function in a rolling window. Below is an example of a 30-day geometric.mean within each FINAL_SITEID group:
library(psych)
library(runner)
df %>%
group_by(FINAL_SITEID) %>%
arrange(DATE) %>%
mutate(GM30 = runner(RESULT, k = 30, idx = DATE, f = geometric.mean))
# FINAL_SITEID DATE RESULT GM30
# <fct> <date> <dbl> <dbl>
# 1 C 2011-04-19 25 25.0
# 2 C 2011-06-29 25 25.0
# 3 C 2011-08-24 25 25.0
# 4 C 2011-10-23 25 25.0
# 5 C 2012-06-28 325 325.
# 6 C 2012-07-16 25 90.1
# 7 C 2012-08-14 300 86.6
# 8 C 2012-09-29 475 475.
# 9 C 2012-10-24 25 109.
# 10 B 2016-05-22 50 50.0
The width argument of rollapply can be a vector of widths which can be set using findInterval. An example of this is shown in the Examples section of the rollapply help file and we use that below.
library(dplyr)
library(psych)
library(zoo)
data_roll <- df %>%
arrange(FINAL_SITEID, DATE) %>%
group_by(FINAL_SITEID) %>%
mutate(GM30 = rollapplyr(RESULT, 1:n() - findInterval(DATE - 30, DATE),
geometric.mean, fill = NA)) %>%
ungroup
giving:
# A tibble: 22 x 4
FINAL_SITEID DATE RESULT GM30
<fct> <date> <dbl> <dbl>
1 A 2018-05-23 50 50.0
2 A 2018-06-05 900 212.
3 A 2018-06-17 25 104.
4 A 2018-08-20 25 25.0
5 A 2018-10-05 125 125.
6 B 2016-05-22 50 50.0
7 B 2016-06-20 25 35.4
8 B 2016-07-25 25 25.0
9 B 2016-08-11 2000 224.
10 B 2017-07-21 25 25.0
# ... with 12 more rows
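The width vector that rollapplyr receives can be inspected in isolation: for each row, `seq_along(DATE) - findInterval(DATE - 30, DATE)` counts how many trailing observations fall within the 30-day window. A sketch using store A's dates from the question:
```r
# store A's sample dates
DATE <- as.Date(c("2018-05-23", "2018-06-05", "2018-06-17",
                  "2018-08-20", "2018-10-05"))

# findInterval(DATE - 30, DATE) gives, for each cutoff date, how many
# observations fall on or before it; subtracting from the row index
# leaves the number of observations inside the trailing 30-day window
widths <- seq_along(DATE) - findInterval(DATE - 30, DATE)
widths
# [1] 1 2 3 1 1
```
These widths match the output above: the third row averages three samples (50, 900, 25), while the fourth and fifth each stand alone because no earlier sample is within 30 days.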

Calculating the average over a few months at the turn of the year, for 3 different indexes and 30 years

I don't have real date values; I have one column with Year and another with Month, plus 3 more columns for 3 different indexes. There is one index value per month (so 12 months per year for 30 years, which is a lot of numbers), and I'd like the average value over a few months.
I need this index information to predict the pollen season in summer, so I'd like an average over the winter months (Dec-Jan-Feb-Mar) for NAO, and also an average over the winter months for AO and SO (so 3 averages for 3 indexes). I'd also like this value not only for one year but for all years. I think the complication is that the season spans the turn of the year: Dec 1988 - Jan 1989 - Feb 1989 (an average over months from different years). If I succeed with this I will try different combinations of months.
Year Month NAO AO SO
1 1988 1 1.02 0.26 -0.1
2 1988 2 0.76 -1.07 -0.4
3 1988 3 -0.17 -0.20 0.6
4 1988 4 -1.17 -0.56 0.1
5 1988 5 0.63 -0.85 0.9
6 1988 6 0.88 0.06 0.1
7 1988 7 -0.35 -0.14 1.0
8 1988 8 0.04 0.25 1.5
9 1988 9 -0.99 1.04 1.8
10 1988 10 -1.08 0.03 1.4
11 1988 11 -0.34 -0.03 1.7
12 1988 12 0.61 1.68 1.2
13 1989 1 1.17 3.11 1.5
14 1989 2 2.00 3.28 1.2
...
366 2018 6 1.09 0.38 -0.1
367 2018 7 1.39 0.61 0.2
368 2018 8 1.97 0.84 -0.3
index$Month<-as.character(index$Month)
#define function to compute average by consecutive season of interest/month_combination
compute_avg_season <- function(index, month_combination){
index<-index%>%
mutate(date=paste(Year,Month, "01",sep="-")) %>%
mutate(date=as.Date(date,"%Y-%b-%d")) %>%
arrange(date)%>%
mutate(winter_mths=ifelse(Month %in% month_combination, 1, NA))
index<-setDT(index)[,id :=rleid(winter_mths)]%>%
filter(!is.na(winter_mths))%>%
group_by(id)%>%
summarise(mean_winter_NAO=mean(NAO, na.rm = TRUE)),
Error: unexpected ',' in:
"group_by(id)%>%
summarise(mean_winter_NAO=mean(NAO, na.rm = TRUE)),"
summarise(mean_winter_NAO=mean(NAO, na.rm = TRUE),
+ mean_winter_AO=mean(AO, na.rm = TRUE),
+ mean_winter_SO=mean(SO, na.rm=TRUE))
Error in mean(NAO, na.rm = TRUE) : object 'NAO' not found
View(index)
Why do I get this error?
I updated the answer to the new insights from your comments:
# load libraries
library(dplyr)
library(data.table)
# pre-processing
index$Month <- as.character(index$Month) # Month is factor, make it character
colnames(index)[1] <- "Year" # simplify name of the Year column
# define a function to compute average by consecutive season of interest/month_combination (do not modify this function)
compute_avg_season <- function(df, month_combination) {
# mark combination of months as 1, else NA
df <- df %>%
# correction month MAY
mutate(Month = replace(Month, Month=="MAI", "MAY")) %>%
# create date
mutate(date = paste(Year, Month, "01", sep="-")) %>%
mutate(date = as.Date(date, "%Y-%b-%d")) %>%
# sort by date (you want average by consecutive months: DEC, JAN, FEB, MAR)
arrange(date) %>%
mutate(winter_mths = ifelse(Month %in% month_combination, 1, NA))
# add index for each set of months of interest and compute mean by index value
df <- setDT(df)[, id := rleid(winter_mths)] %>%
filter(!is.na(winter_mths)) %>%
group_by(id) %>%
summarise(mean_winter_NAO = mean(NAO, na.rm = TRUE),
mean_winter_AO = mean(AO, na.rm = TRUE),
mean_winter_SO = mean(SO, na.rm = TRUE))
return(df)
}
# Use the above-defined function to compute mean values by desired month combination:
# set the month combination
month_combination <- c("DEC", "JAN", "FEB", "MAR")
# compute mean values by month combination
compute_avg_season(index, month_combination)
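One caveat: the sample data shows Month as a number (1-12), while the function and month_combination expect uppercase month abbreviations. If your Month column is numeric, a hedged way to convert it first (assuming an English locale, so that "%Y-%b-%d" can parse the abbreviations):
```r
# map numeric months 1-12 to uppercase abbreviations ("JAN", ..., "DEC")
index$Month <- toupper(month.abb[as.integer(index$Month)])
```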

How to sum parts of a column?

I am doing a project working out a flashiness index for 15-minute flow data.
I have got code on how to work out flow data.
# new variable for lag time
flow_lagged_S <- S %>% mutate(
flow_lag = lag(flow, n = 1), # 1st calculate lag
Qi_Qi1 = abs(flow - flow_lag)) # calculate the abs value of the diff
# calculate sums following the formula
RB_index_S <- flow_lagged_S %>%
summarise(RB_index = sum(Qi_Qi1, na.rm = TRUE) / sum(flow, na.rm = TRUE))
The data spans several years, and at the moment I can calculate the flashiness for the whole station but not for every year.
For the last bit of the code I need to change it so that it calculates the sum for each year. How do I do that? So instead of summing the whole column Qi_Qi1, I need to sum Qi_Qi1 for, say, the year 2002.
so my table flow_lagged_S looks like this:
time_stamp flow year flow_lag Qi_Qi1
2002-10-24 22:45:00 9.50 2002 NA NA
2002-10-24 23:00:00 10.00 2002 9.50 0.50
2002-10-24 23:15:00 10.50 2002 10.00 0.50
2002-10-24 23:30:00 11.00 2002 10.50 0.70
You can use the group_by() function from the dplyr package:
df <- data.frame(time_stamp = c("2002-10-24 22:45:00", "2002-10-24 23:00:00", "2002-10-24 23:15:00", "2002-10-24 23:30:00"),
flow = c(9.5, 10, 10.5, 11),
year = c(2002, 2002, 2002, 2002),
flow_lag = c(NA, 9.5, 10, 10.5),
Qi_Qi = c(NA, .5, .5, .7))
df %>%
group_by(year) %>%
summarize(total = sum(Qi_Qi, na.rm = T))
The answer is:
# A tibble: 1 x 2
year total
<dbl> <dbl>
1 2002 1.7
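Applying the same pattern to the full flashiness formula from the question, the per-year index is just the grouped ratio of the two sums. A sketch assuming flow_lagged_S has the columns shown above:
```r
library(dplyr)

RB_index_by_year <- flow_lagged_S %>%
  group_by(year) %>%
  summarise(RB_index = sum(Qi_Qi1, na.rm = TRUE) / sum(flow, na.rm = TRUE))
```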
