I'm trying to use dplyr in R to difference a variable between two dates.
A simplified example:
# Simple script to test calculating the difference of a column between two dates
library(dplyr)
library(lubridate)
library(tibble)
dataA <- as.tibble(ymd('2020-01-01') + days(seq(0:45)))
colnames(dataA) = c('date')
dataA <- dataA %>% mutate(xvar = seq(0:45))
#add the difference in xvar between two dates
dataA <- dataA %>% mutate(startd = date, endd=date+days(3))
dataA <- dataA %>%
  group_by(date) %>%
  filter(date >= startd & date <= endd) %>%
  mutate(vardiff = last(xvar) - first(xvar))
I've tried a number of different possibilities for this last statement but can't get the calculation I'm looking for. What I'm trying to achieve is the difference in xvar between January 5th and January 2nd and so on for the entire time series. How can this be achieved using dplyr statements?
Thanks!
We can use findInterval, and this should also work when there are no exact matches:
library(dplyr)
dataA %>%
  mutate(vardiff = xvar[findInterval(endd, date)] -
           xvar[findInterval(startd, date)])
Or in base R:
transform(dataA, vardiff = xvar[findInterval(endd, date)] -
                   xvar[findInterval(startd, date)])
You can use match to get the indices of startd and endd, pull out the corresponding xvar values, and subtract them:
library(dplyr)
dataA %>%
  mutate(vardiff = xvar[match(endd, date)] - xvar[match(startd, date)])
This can also be written in base R using transform:
transform(dataA, vardiff = xvar[match(endd, date)] - xvar[match(startd, date)])
I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried this (partially based on: How to subtract months from a date in R?):
library(tidyverse)
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == max(my_dates)] -
           var1[my_dates == max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == max(my_dates)] -
           var1[my_dates == max(my_dates) - lag(max(my_dates), n = 1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for the help, and apologies for not including more data; I can edit if necessary.
Edited with a few potential answers:
# This gives me the value of var1 for the latest date
df2 <- df1 %>%
  mutate(value_1month = var1[my_dates == max(my_dates)])
# This gives me the second-latest date
df2 <- df1 %>%
  mutate(month1 = max(my_dates) %m-% months(1))
# This gives me the second-to-latest value
df2 <- df1 %>%
  mutate(var1_1month = var1[my_dates == max(my_dates) %m-% months(1)])
# This gives me the difference between the latest and second-to-latest values of var1
df2 <- df1 %>%
  mutate(diff_1month = var1[my_dates == max(my_dates)] -
           var1[my_dates == max(my_dates) %m-% months(1)])
mutate requires the output to be the same length as the number of rows of the original data. When we do the subsetting, the length is different, so we may need ifelse or case_when:
library(dplyr)
library(lubridate)
df1 %>%
  mutate(diff_1month = case_when(my_dates == max(my_dates) ~
                                   my_dates %m-% months(1)))
NOTE: Without a reproducible example, the column types and values are not clear.
Based on the OP's update, we may do an arrange first, grab the last two values of var1, and take the difference:
df1 %>%
  arrange(my_dates) %>%
  mutate(dif_1month = diff(tail(var1, 2)))
    .   my_dates var1 dif_1month
1 XYZ 2021-08-01    3         -1
2 XYZ 2021-09-01    2         -1
3 XYZ 2021-10-01    1         -1
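If you also need the latest value minus the value an arbitrary number of months back (the "or 3 months prior" part of the question), here is a minimal sketch, not from the answers above, assuming one row per consecutive month; months_back is a helper name introduced purely for illustration:
library(dplyr)
months_back <- 1   # 1 = previous month, 3 = three months prior, etc.
df1 %>%
  arrange(my_dates) %>%
  mutate(dif_nmonth = last(var1) - last(lag(var1, months_back)))
With months_back = 1 on the data above this reproduces the -1 shown in the output; with fewer rows than months_back + 1 it returns NA rather than an error.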
I have a table of house prices and sale dates. I want to calculate the rolling median price over a time window of 365 days using the runner package. I only want one median price per date.
My problem is when I try the below code, I get more than one median price for a date if that date appears more than once. This isn't what I expected to occur. I thought there'd be one result for each day if I used group_by/summarise.
library(runner)
library(tidyverse)
library(lubridate)
startDate = as_date("2018-01-01")
endDate = as_date("2020-01-01")
# Create data
soldData <- tibble(
  price = round(rnorm(100, mean = 500000, sd = 100000), -3),
  date = sample(seq.Date(startDate, endDate, by = "days"), 100, replace = TRUE))
# Fill in the missing dates between startDate and endDate
soldData <- bind_rows(soldData,
                      anti_join(tibble(date = seq.Date(startDate, endDate, by = "day")),
                                soldData)) %>%
  arrange(date)
# Find the duplicated dates
duplicatedDates <- soldData[duplicated(soldData$date),]$date
# I thought using group_by/summarise would return one medianPrice per date
results <- soldData %>%
  group_by(date) %>%
  summarise(medianPrice = runner(
    price,
    k = "365 days",
    idx = date,
    f = function(x) median(x, na.rm = TRUE)))
# These are the problem rows.
duplicatedResults <- results %>%
  filter(date %in% duplicatedDates)
Any idea where I'm going wrong?
Starting with dplyr 1.0.0, summarise can return multiple rows per group, which is why you see more than one median price for a duplicated date.
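A tiny illustration with made-up data (not the OP's): when a group's expression returns two values, summarise produces two rows for that group (newer dplyr versions warn about this and suggest reframe instead).
library(dplyr)
tibble(g = c("a", "a", "b"), x = 1:3) %>%
  group_by(g) %>%
  summarise(x = x)   # group "a" yields two rows, group "b" one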
First you need to deal with the duplicate dates that are already present in your data. What do you want to do with dates that occur more than once? One option is to take their median (or mean).
library(dplyr)
library(runner)
soldData %>%
  group_by(date) %>%
  summarise(price = median(price, na.rm = TRUE)) -> df
Now df has only one value for each date, and you can apply the runner function:
df %>%
  mutate(medianPrice = runner(price,
                              k = "365 days",
                              idx = date,
                              f = function(x) median(x, na.rm = TRUE)))
There is also zoo::rollmedianr, which helps with calculating a rolling median.
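A minimal sketch of that route, not part of the original answer: it assumes a fixed window of the last 31 sales (an illustrative odd window size) rather than a true 365-day calendar window, and drops the NA filler rows first so the rolling median stays well defined.
library(dplyr)
library(zoo)
soldData %>%
  filter(!is.na(price)) %>%   # drop the filler rows added for the missing dates
  arrange(date) %>%
  mutate(medianPrice = rollmedianr(price, k = 31, fill = NA))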
I've got a data frame (df1) with an ID variable and two date variables (dat1 and dat2).
I'd like to subset the data frame so that I get the observations for which the difference between dat2 and dat1 is less than or equal to 30 days.
I'm trying to use dplyr but I can't get it to work.
Any help would be much appreciated.
Starting point (df):
df1 <- data.frame(ID = c("a", "b", "c", "d", "e", "f"),
                  dat1 = c("01/05/2017", "01/05/2017", "01/05/2017",
                           "01/05/2017", "01/05/2017", "01/05/2017"),
                  dat2 = c("14/05/2017", "05/06/2017", "23/05/2017",
                           "15/10/2017", "15/11/2017", "15/12/2017"),
                  stringsAsFactors = FALSE)
Desired outcome (df):
dfgoal <- data.frame(ID = c("a", "c"),
                     dat1 = c("01/05/2017", "01/05/2017"),
                     dat2 = c("14/05/2017", "23/05/2017"),
                     newvar = c(13, 22))
Current code:
library(dplyr)
df2 <- df1 %>%
  mutate(newvar = as.Date(dat2) - as.Date(dat1)) %>%
  filter(newvar <= 30)
We need to convert to the Date class before doing the subtraction:
library(dplyr)
library(lubridate)
df1 %>%
  mutate_at(2:3, dmy) %>%
  mutate(newvar = as.numeric(dat2 - dat1)) %>%
  filter(newvar <= 30)
as.Date also needs to include the format argument; otherwise it assumes the default %Y-%m-%d format, whereas here the dates are in %d/%m/%Y:
df1 %>%
  mutate(newvar = as.numeric(as.Date(dat2, "%d/%m/%Y") - as.Date(dat1, "%d/%m/%Y"))) %>%
  filter(newvar <= 30)
#  ID       dat1       dat2 newvar
#1  a 01/05/2017 14/05/2017     13
#2  c 01/05/2017 23/05/2017     22
dplyr::group_by() fails to group the variables of the following data.frame contained in a pc-axis file:
library("pacman")
pacman::p_load(pxR, dplyr, janitor)
px_file <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1502040100_131"
pxR::read.px(base::url(px_file))$DATA$value %>%   # the data.frame
  janitor::clean_names() %>%
  dplyr::select(student_level = studienstufe,
                year = jahr,
                counts = value) %>%               # dplyr::rename() also fails
  dplyr::group_by(year, student_level) %>%        # not grouping!
  dplyr::summarise(totals = sum(counts))
I believe it could be due to an encoding issue, but I cannot find the problem. Any ideas? Thanks.
The only fault I could find is that you use select instead of rename. You wrote that rename didn't work for you, but this worked for me:
library("pacman")
library("dplyr")
library("janitor")
# Loading your data
pacman::p_load(pxR, dplyr, janitor)
px_file <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1502040100_131"
px <- pxR::read.px(base::url(px_file))$DATA$value
# Cleaning the column names
px1 <- px %>% janitor::clean_names()
# Rename the columns
px2 <- px1 %>%
  dplyr::rename(student_level = studienstufe,
                sex = geschlecht,
                year = jahr,
                counts = value)
# Grouping data
px3 <- px2 %>%
  dplyr::group_by(year, student_level) %>%
  dplyr::summarise(totals = sum(counts))
I split every step into its own data frame to inspect the result; this is not necessary. If this doesn't work, you may want to share your session info.
P.S. I also renamed the column geschlecht :)
I extract my data:
fluo <- read.csv("data/ctd_SOMLIT.csv", sep=";", stringsAsFactors=FALSE)
I split the date into three columns, the day, the month, and the year, based on the original date format (Y-m-d):
fluo$day <- day(as.POSIXlt(fluo$DATE, format = "%Y-%m-%d"))
fluo$month <- month(as.POSIXlt(fluo$DATE, format = "%Y-%m-%d"))
fluo$year <- year(as.POSIXlt(fluo$DATE, format = "%Y-%m-%d"))
This is a part of my data frame.
Then I use summarise and group_by in order to apply the function:
prof_DCM = fluo[max(fluo$FLUORESCENCE..Fluorescence.),2]
I want the depth at which the maximum FLUORESCENCE was measured, for each month of each year.
mean_fluo <- summarise(group_by(fluo, month, year),
                       prof_DCM = fluo[max(fluo$FLUORESCENCE..Fluorescence.), 2])
mean_fluo <- arrange(mean_fluo, year, month)
View(mean_fluo)
But it's not working: the values of prof_DCM stay the same all the way down column 3 of the data frame.
Maybe try the following code.
library(dplyr)
mean_fluo <- fluo %>%
  group_by(month, year) %>%
  filter(FLUORESCENCE..Fluorescence. == max(FLUORESCENCE..Fluorescence.)) %>%
  arrange(year, month)
View(mean_fluo)
You can select the variables you want to keep with select:
mean_fluo <- fluo %>%
  group_by(month, year) %>%
  filter(FLUORESCENCE..Fluorescence. == max(FLUORESCENCE..Fluorescence.)) %>%
  arrange(year, month) %>%
  select(c(month, year, PROFONDEUR))
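If you are on dplyr 1.0.0 or later, here is an equivalent sketch with slice_max, offered as an alternative rather than as part of the answer above; with_ties = FALSE keeps exactly one row per month/year even when the maximum fluorescence is tied.
library(dplyr)
mean_fluo <- fluo %>%
  group_by(month, year) %>%
  slice_max(FLUORESCENCE..Fluorescence., n = 1, with_ties = FALSE) %>%
  arrange(year, month) %>%
  select(c(month, year, PROFONDEUR))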