I have a table of house prices and sale dates. I want to calculate the rolling median price over a time window of 365 days using the runner package. I only want one median price per date.
My problem is when I try the below code, I get more than one median price for a date if that date appears more than once. This isn't what I expected to occur. I thought there'd be one result for each day if I used group_by/summarise.
library(runner)
library(tidyverse)
library(lubridate)
startDate = as_date("2018-01-01")
endDate = as_date("2020-01-01")
# Create data
soldData <- tibble(
price = round(rnorm(100, mean=500000, sd=100000),-3),
date = sample(seq.Date(startDate,endDate,by="days"),100,replace=T))
# Fill in the missing dates between startDate and endDate
soldData <- bind_rows(soldData,anti_join(tibble(date=seq.Date(startDate,endDate,by="day")),soldData)) %>%
arrange(date)
# Find the duplicated dates
duplicatedDates <- soldData[duplicated(soldData$date),]$date
# I thought using group_by/summarise would return one medianPrice per date
results <- soldData %>%
group_by(date) %>%
summarise(medianPrice = runner(
price,
k = "365 days",
idx = date,
f = function(x) {median(x,na.rm=T)}))
# These are the problem rows.
duplicatedResults <- results %>%
filter(date %in% duplicatedDates)
Any idea where I'm going wrong?
From dplyr 1.0.0, you can have output that returns multiple rows from summarise.
First you need to deal with duplicate data which you already have in your data. What do you want to do of dates that have multiple occurrence? One way would be to take median/mean of them.
library(dplyr)
library(runner)
soldData %>%
group_by(date) %>%
summarise(price = median(price, na.rm = TRUE)) -> df
So now in df we only have one value for each date. You can now apply the runner function.
df %>%
mutate(medianPrice = runner(price,
k = "365 days",
idx = date,
f = function(x) {median(x,na.rm=T)}))
There is also zoo:rollmedianr which helps in calculating rolling median.
Related
I'm trying to use dplyr in R to difference a variable between two dates.
An simplified example:
# Simple script to test calculating the difference of a column between two dates
library(dplyr)
library(lubridate)
library(tibble)
dataA <- as.tibble(ymd('2020-01-01') + days(seq(0:45)))
colnames(dataA) = c('date')
dataA <- dataA %>% mutate(xvar = seq(0:45))
#add the difference in xvar between two dates
dataA <- dataA %>% mutate(startd = date, endd=date+days(3))
dataA <- dataA %>% group_by(date) %>%
filter(date >= startd & date <= endd) %>% mutate(vardiff = last(xvar)-first(xvar))
I've tried a number of different possibilities for this last statement but can't get the calculation I'm looking for. What I'm trying to achieve is the difference in xvar between January 5th and January 2nd and so on for the entire time series. How can this be achieved using dplyr statements?
Thanks!
We can use findInterval and this should also work when there are no exact matches
library(dplyr)
dataA %>%
mutate(vardiff = xvar[findInterval(endd, date)] -
xvar[findInterval(startd, date)])
Or in base R
transform(dataA, vardiff = xvar[findInterval(endd, date)] -
xvar[findInterval(startd, date)])
You can use match to get index of startd and endd to get corresponding xvar and subtract them:
library(dplyr)
dataA %>%
mutate(vardiff = xvar[match(endd, date)] - xvar[match(startd, date)])
This can also be written in base R using transform :
transform(dataA, vardiff = xvar[match(endd, date)] - xvar[match(startd, date)])
I have a data set with customers' expenses by date. I want to have the last three months expense and avg. expense based on the list visit of each customer. How can I do that in R?
below is the dataset
library(tidyverse)
library(lubridate)
name <- c('Mary','Sue','Peter','Mary','Mary','John','Sue',
'Peter','Peter','John','John','John','Mary','Mary',
'John','Mary','Peter','Sue')
date <- c('01/04/2018','03/02/2017','01/01/2019','24/04/2017',
'02/03/2019','31/05/2019','08/09/2019','17/12/2019',
'02/08/2017','10/11/2017','30/12/2017','18/02/2018',
'18/02/2018','18/10/2019','30/04/2019','18/09/2019',
'17/11/2019','08/08/2019'
)
expense <- c('300','450','550','980',
'787','300','2343','233',
'932','44','332','432',
'786','345','567','290','345','876')
data <- data.frame(name,
date=lubridate::dmy(date),expense)
Considering 3 months as 90 days we can subtract 90 days from the max date for each Name and take mean of expense only for dates which fall within the range.
library(dplyr)
data %>%
group_by(name) %>%
summarise(last_3_month_expense = mean(expense[date > max(date) - 90], na.rm = TRUE),
mean_expense = mean(expense, na.rm = TRUE))
data
Read the expense data as numeric not as factor/character.
data$expense <- as.numeric(as.character(data$expense))
We arrange by 'name', 'date', convert the 'expense' to numeric, calculate the sum of last 3 values of 'expense' and the mean of the 'expense' grouped by 'name' (assuming there is only data point per month)
library(dplyr)
data %>%
arrange(name, date) %>%
mutate(expense = as.numeric(as.character(expense))) %>%
group_by(name) %>%
summarise(last_three = sum(tail(expense, 3), na.rm = TRUE),
average_expense = mean(expense, na.rm = TRUE))
I have a data frame like DF below which will be imported directly from the database (as tibble).
library(tidyverse)
library(lubridate)
date_until <- dmy("31.05.2019")
date_val <- dmy("30.06.2018")
DF <- data.frame( date_bal = as.Date(c("2018-04-30", "2018-05-31", "2018-06-30", "2018-05-31", "2018-06-30")),
department = c("A","A","A","B","B"),
amount = c(10,20,30,40,50)
)
DF <- DF %>%
as_tibble()
DF
It represents the amount of money spent by each department in a specific month. My task is to project how much money will be spent by each department in the following months until a specified date in the future (in this case date_until=31.05.2019)
I would like to use tidyverse in order to generate additional rows for each department where the first column date_bal would be a sequence of dates from the last one from "original" DF up until date_until which is predefined. Then I would like to add additional column called "DIFF" which would represent the difference between DATE_BAL and DATE_VAL, where DATE_VAL is also predefined. My final result would look like this:
Final result
I have managed to do this in the following way:
first filter data from DF for department A
Create another DF2 by populating it with date sequence from min(dat_bal) to date_until from 1.
Merge data frames from 1. and 2. and then add calculated columns using mutate
Since I will have to repeat this procedure for many departments I wonder if it's possible to add rows (create date sequence) in existing DF (without creating a second DF and then merging).
Thanks in advance for your help and time.
I add one day to the dates, create a sequence and then rollback to the last day of the previous month.
seq(min(date_val + days(1)), date_until + days(1), by = 'months')[-1] %>%
rollback() %>%
tibble(date_bal = .) %>%
crossing(DF %>% distinct(department)) %>%
bind_rows(DF %>% select(date_bal, department)) %>%
left_join(DF) %>%
arrange(department, date_bal) %>%
mutate(
amount = if_else(is.na(amount), 0, amount),
DIFF = interval(
rollback(date_val, roll_to_first = TRUE),
rollback(date_bal, roll_to_first = TRUE)) %/% months(1)
)
I am working with data that stops in a specific year and is NA afterwards. And I need to calculate allot of variables based on lagged values of other variables. I would like to find a way that a whole series is calculated instead of each time one year when one of the variables is NA. I was looking at dplyr given that I am working with panel data and thus need to group it by ID.
I provide the example below:
set.seed(1)
df <- data.frame( year = c(seq(2000, 2018), seq(2000, 2018)) , id = c(rep(1, 19),rep(2, 19)), varA = floor(rnorm(38)*100), varB= floor(rnorm(38)*100), varC= floor(rnorm(38)*100))
df <- df %>% mutate(varA = if_else(year>2010, as.double(NA) , varA) ,
varB = if_else(year>2010, as.double(NA) , varB),
varC = if_else(year>2010, as.double(NA) , varC)) %>% group_by(id) %>% arrange(year)
What I would like is to find a way to calculate a variable that is equal to variable C when it is available, but afterwards is equal to a formula based on lagged values of variable C, B and A. When executing the code below, varResult and D are ony calculated for one year given that the lags are only available for one year:
df <- df %>% mutate( varD = lag(varA)*lag(varB),
varRESULT = if_else(is.na(varC), lag(varC, 1)/lag(varD, 2)*lag(varD, 1), varC))
But I would like to find a way to calculate immidiatly the whole serries (taking into account the panel dimension of the data) instead of heaving to repeat the code 7 times. Preferably a solution where you can calculate varD seperatly from varResults, given that in the final application I have multiple variables that are linked to each other.
Proposed solution:
Starting with the first NA, the "recursive" lags of vars varA, varB, and varC are equal to the last value of these variables.
Thus, starting from these initial variables, we can create new variables: varA1, varB1, and varC1 where we fill the NAs with the last value, by id:
library(dplyr)
library(tidyr) # for the function `fill`
df <- df %>%
mutate(varA1 = varA, varB1 = varB, varC1 = varC) %>%
group_by(id) %>%
arrange(year) %>%
fill(varA1, varB1, varC1) # fills with last value
Then, we apply the formula:
df <- df %>%
mutate( varD = lag(varA1)*lag(varB1),
varRESULT = if_else(is.na(varC), lag(varC1, 1)/lag(varD, 2)*lag(varD, 1), varC)) %>%
select(-varA1, -varB1, -varC1)
I extract my data
fluo <- read.csv("data/ctd_SOMLIT.csv", sep=";", stringsAsFactors=FALSE)
I display in three columns : the day, the month and the year based on the original date : Y - m - d
fluo$day <- day(as.POSIXlt(fluo$DATE, format = "%Y-%m-%d"))
fluo$month <- month(as.POSIXlt(fluo$DATE, format = "%Y-%m-%d"))
fluo$year <- year(as.POSIXlt(fluo$DATE, format = "%Y-%m-%d"))
This is a part of my data_frame:
Then, I do summarise and group_by in order to apply the function :
prof_DCM = fluo[max(fluo$FLUORESCENCE..Fluorescence.),2]
=> I want the depth of the max of FLUORESCENCE measured for each month, for each year.
mean_fluo <- summarise(group_by(fluo, month, year),
prof_DCM = fluo[max(fluo$FLUORESCENCE..Fluorescence.),2])
mean_fluo <- arrange(mean_fluo, year, month)
View(mean_fluo)
But it's not working ...
The values of prof_DCM still the same all along the column 3 of the data_frame:
Maybe try the following code.
library(dplyr)
mean_fluo <- fluo %>%
group_by(month,year) %>%
filter(FLUORESCENCE..Fluorescence. == max(FLUORESCENCE..Fluorescence.)) %>%
arrange(year,month)
View(mean_fluo)
You can select the variables you want to keep with 'select'
mean_fluo <- fluo %>%
group_by(month,year) %>%
filter(FLUORESCENCE..Fluorescence. == max(FLUORESCENCE..Fluorescence.)) %>%
arrange(year,month)%>%
select(c(month,year,PROFONDEUR))