I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] -var1[my_dates==max(my_dates) %m-% months(1)]
I've also tried different variations of using lag():
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates)-lag(max(my_dates), n=1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
#this gives me the value of var1 of the latest date
df2 <- df1 %>%
mutate(value_1month = var1[my_dates==max(my_dates)])
#this gives me the date of the second latest date
df2 <- df1 %>%
mutate(month1 = max(my_dates) %m-%months(1))
#This gives me the second to latest value
df2 <- df1 %>%
mutate(var1_1month = var1[my_dates==max(my_dates) %m-%months(1)])
#This gives me the difference of the latest value and the second to last of var1
df2 <- df1 %>%
mutate(diff_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-%months(1)])
mutate requires the output to be of the same length as the number of rows of the original data. When we do the subsetting, the length is different. We may need ifelse or case_when
library(dplyr)
library(lubridate)
df1 %>%
mutate(diff_1month = case_when(my_dates==max(my_dates) ~
my_dates %m-% months(1)))
NOTE: Without a reproducible example, it is not clear about the column types and values
Based on the OP's update, we may do an arrange first, grab the last two 'val' and get the difference
df1 %>%
arrange(my_dates) %>%
mutate(dif_1month = diff(tail(var1, 2)))
. my_dates var1 dif_1month
1 XYZ 2021-08-01 3 -1
2 XYZ 2021-09-01 2 -1
3 XYZ 2021-10-01 1 -1
I am an aspiring data scientist, and this will be my first ever question on StackOF.
I have this line of code to help wrangle me data. My date filter is static. I would prefer not to have to go in an change this hardcoded value every year. What is the best alternative for my date filter to make it more dynamic? The date column is also difficult to work with because it is not a
"date", it is a "dbl"
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
Tried so far:
df %>%
filter(DATE >= 20191231)
# load packages (lubridate for dates)
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
This looks like this:
DATE
1 20191230
2 20191231
3 20200122
# and now...
df %>% # take the dataframe
mutate(DATE = ymd(DATE)) %>% # turn the DATE column actually into a date
filter(DATE >= floor_date(Sys.Date(), "year") - days(1))
...and filter rows where DATE is >= to one day before the first day of this year (floor_date(Sys.Date(), "year"))
DATE
1 2019-12-31
2 2020-01-22
I'm trying to replace vba code with R code. Currently in vba I use sumif in a range to find the total value of an ID depending on some dates. In R I'm using mutate an summarize but there's always an error. I donĀ“t know how to fix it.
If i want to find the value for ID=1 that made some value withing 2 days:
#sys.Date() = 2016-01-06
df
DATES ID VALUE
2016/01/01 1 10
2016/01/02 2 15
2016/01/05 1 13
the result must be:
ID Value
1 13
Currently, the code is:
df%>%
group_by(ID) %>%
mutate(Total_op = if (Sys.Date()-as.Date(Dates,format="%YYYY-%mm-
%dd")>=1) Value else 0)))%>%
summarize(SumTotal = sum(Total_op))%>%
collect
But the error showed is:
Error: Column 'sumTotal' must be length X (the group size) or one, not Y
With lubridate we can convert the DATES string to a datetime object and filter accordingly:
library(lubridate)
library(tidyverse)
Dat <- ymd("2016-01-06") #Set a date. Can be done by Sys.Date()
df %>%
mutate_at("DATES",ymd) %>% #convert to datetime
filter(DATES %within% interval(Dat-2,Dat)) %>% #filter entries in the last 2 days
group_by(ID) %>% #group by ID
summarise(SumTotal = sum(VALUE)) #summarise value as Sum
Let's say that I have the following data.
df = data.frame(name = c("A","A","A","B","B","B","B"),
date = c("2011-01-01","2011-03-01","2011-05-01",
"2011-01-01","2011-05-01","2011-06-01",
"2011-07-01"))
df
I know the last date in the data set and only want to pick those names where data is available for the last date. So in the above example, the last date is only available for name B. Thus, I want to select only the rows for name B.
I can do simple hacks like this to get the desired result.
last_date = "2011-07-01"
#unique(df$name[df$date %in% last_date])
df[df$name %in% unique(df$name[df$date %in% last_date]),]
However, I was wondering if there was a dplyr/tidyverse or data.table solution for this task.
There are multiple ways you can do this, with dplyr we can filter only those groups which have the last_date
library(dplyr)
df %>%
group_by(name) %>%
filter(last_date %in% date)
# name date
# <fct> <fct>
#1 B 2011-01-01
#2 B 2011-05-01
#3 B 2011-06-01
#4 B 2011-07-01
Or similarly in base R :
df[ave(df$date, df$name, FUN = function(x) last_date %in% x) == TRUE,]
Also, we can get all the name where you find last_date and filter those names from the original dataframe.
df[with(df, name %in% name[date %in% last_date]), ]
I looked at many posts with similar, but I believe less complex questions, and just cant seem to work out an answer for this.
I have a >1000000 lines of data, for example in this form:
date<-c("9/30/2012","10/31/2012","11/30/2012","12/31/2012","1/31/2013","2/28/2013","3/31/2013","10/31/2012","11/30/2012","12/31/2012","1/31/2013","2/28/2013","3/31/2013")
name<-c("a","a","a","a","a","a","a","b","b","b","b","b","b")
amount<-c(100,200,300,400,500,600,700,800,900,800,700,600,500)
data<-data.frame(name,date,amount)
View(data)
What I need is, for entries of the same name, sum the amount for dates that are in jan-mar, apr-jun, jul-sep, oct-dec in the same year.
This is my ideal output:
date2<-c("9/30/2012","12/31/2012","3/31/2013","12/31/2012","3/13/2013")
name2<-c("a","a","a","b","b")
amount2<-c(100,900,1800,2500,1800)
data2<-data.frame(name2,date2,amount2)
View(data2)
Will appreciate any input at all, to lead me towards the correct direction.
Thank you very much!
1. Using dplyr/zoo
We can convert the 'date' class from 'character' to 'Date', get the sum of 'amount' and last value of 'date' grouped by columns 'name' and 'Qtr' (from converting the 'date' to year quarter (as.yearqtr).
library(dplyr)
library(zoo)
data %>%
mutate(date=as.Date(date, format='%m/%d/%Y')) %>%
group_by(name, Qtr=as.character(as.yearqtr(date))) %>%
summarise(amount= sum(amount), date=last(date))
# name Qtr amount date
#1 a 2012 Q3 100 2012-09-30
#2 a 2012 Q4 900 2012-12-31
#3 a 2013 Q1 1800 2013-03-31
#4 b 2012 Q4 2500 2012-12-31
#5 b 2013 Q1 1800 2013-03-31
NOTE: Also added #docendo discimus suggestion to use last and changing the class of 'date' column. The Qtr column is 'character' as the as.yearqtr class is unsupported by dplyr (from the errors). The 'Qtr' column was not in the expected dataset 'data2'. So, I guess it doesn't matter whether it is 'character' or 'as.yearqtr'. If we don't change the 'date' column to 'Date' class, and do the change in the group_by step, this will give the same result as the 'data2'. The extra 'Qtr' column can be deleted.
2. Without using zoo
data %>%
mutate(date1 = as.Date(date, format = '%m/%d/%Y')) %>%
group_by(name, Qtr= sprintf('%s %s', format(date1, '%Y'),
quarters(date1))) %>%
summarise(amount = sum(amount), date=last(date)) %>%
ungroup() %>%
select(-Qtr) %>%
as.data.frame()
# name amount date
#1 a 100 9/30/2012
#2 a 900 12/31/2012
#3 a 1800 3/31/2013
#4 b 2500 12/31/2012
#5 b 1800 3/31/2013
NOTE2: Added a solution without using as.yearqtr, kept the same format for 'date' as in the expected output 'data2'
Here are a few approaches:
1) aggregate & zoo
library(zoo)
aggregate(amount ~ name + yearqtr,
transform(data, yearqtr = as.yearqtr(date, "%m/%d/%Y")),
sum)
2) data.table & zoo
library(data.table)
library(zoo)
dt <- data.table(data, key = "name,date")
dt[, date := as.yearqtr(date, "%m/%d/%Y")][, list(sum = sum(amount)), by = "name,date"]
Note that both these solutions convert the date to a real "yearqtr" object and not just to a character string. I haven't benchmarked these but typically data.table is very fast. You could create the data.table from data by reference using setDT for every greater performance but might prefer to keep them separate as well so we left them separate here.