Using ifelse statement when concatenating elements of a date variable - r

I am attempting to use two ifelse statements to create a new date variable that makes a series of assumptions to fill in the gaps of an existing date variable. Here is an example of what I mean:
id EffectiveDate EffectiveYear ED_NA EY_NA NewEffectiveDate
1 a 1972-10-05 1972 FALSE FALSE 1972-10-05
2 a <NA> 1985 TRUE FALSE 1985-01-01
3 a 1988-11-12 1988 FALSE FALSE 1988-11-12
4 b 2011-09-05 2011 FALSE FALSE 2011-09-05
5 b <NA> NA TRUE TRUE 2011-09-05
6 b <NA> 2012 TRUE FALSE 2012-01-01
7 c 2012-11-11 2012 FALSE FALSE 2012-11-11
8 c 2013-05-15 2013 FALSE FALSE 2013-05-15
quick code for id:EY_NA =
id <- c("a","a","a","b","b","b","c","c")
EffectiveDate <- c("1972-10-05",NA,"1988-11-12","2011-09-05",NA,NA,"2012-11-11","2013-05-15")
EffectiveYear <- c(1972,1985,1988,2011,NA,2012,2012,2013)
tdat <- data.frame(id, EffectiveDate, EffectiveYear)
tdat$ED_NA <- is.na(tdat$EffectiveDate)
tdat$EY_NA <- is.na(tdat$EffectiveYear)
What I'm trying to create in this example is the "NewEffectiveDate" variable. In plain English, what I want is, where EffectiveDate data are missing BUT EffectiveYear data are not missing, assume NewEffectiveDate is equal to January 1 of the EffectiveYear. If EffectiveDate AND EffectiveYear data are missing, assume the prior observation's EffectiveDate. Last, of course, if EffectiveDate data are not missing, select EffectiveDate.
Here is the latest code I used to attempt to solve the problem:
tdat %>% mutate(NewEffectiveDate = ifelse(ED_NA == 1 & EY_NA == 0,
as.Date(paste(EffectiveYear, 1, 1, sep="-")),
ifelse(ED_NA == 1 & EY_NA == 1),
as.Date(lag(EffectiveDate)),
EffectiveDate
))
When I try this particular code, I get an error message that reads: Error: unused arguments (as.Date(c(NA, 1, NA, 2, 3, NA, NA, 4)), c(1, NA, 2, 3, NA, NA, 4, 5))
I searched for similar questions with queries like "ifelse concatenate date" and some variations thereof, but haven't found anything that seems to apply to this particular problem.
I am very new to R (and CLIs, for that matter), so I apologize in advance if I'm overlooking a perfectly obvious solution. The transition from Excel to R has been interesting, but often painful when it comes to doing what seem like relatively straightforward tasks (though the dplyr package has been tremendously helpful).

id <- c("a","a","a","b","b","b","c","c")
EffectiveDate <- c("1972-10-05",NA,"1988-11-12","2011-09-05",NA,NA,"2012-11-11","2013-05-15")
EffectiveYear <- c(1972,1985,1988,2011,NA,2012,2012,2013)
tdat <- data.frame(id, EffectiveDate, EffectiveYear,
stringsAsFactors=FALSE)
library(dplyr)
library(zoo)
tdat %>%
  mutate(NewEffectiveDate = ifelse(!is.na(EffectiveDate),
                                   EffectiveDate,
                                   ifelse(is.na(EffectiveDate) & !is.na(EffectiveYear),
                                          paste0(EffectiveYear, "-01-01"),
                                          NA)),
         # overwrite the same column so the remaining NAs are filled in place
         NewEffectiveDate = na.locf(NewEffectiveDate))
This should give you what you need. I recommend using na.locf (last observation carried forward) from the zoo package rather than trying to handle the previous date yourself with lag.
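To make the "last observation carried forward" idea concrete, here is a minimal base-R sketch of it. The helper name locf is hypothetical and is not part of zoo. One difference to be aware of: this sketch keeps leading NAs in place, whereas na.locf drops them by default.

```r
# Base-R sketch of last observation carried forward (hypothetical helper, not zoo's implementation)
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # position of the latest non-NA value seen so far
  idx[idx == 0] <- NA                      # leading NAs have nothing to carry forward
  x[idx]
}

x <- c("2011-09-05", NA, "2012-01-01", NA, NA)
filled <- locf(x)
filled
```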

You can do
tdat$EffectiveDate <- as.Date(tdat$EffectiveDate)
tdat %>% mutate(NewEffectiveDate = as.Date(
  ifelse(!is.na(EffectiveDate), EffectiveDate,
         ifelse(!is.na(EffectiveYear), as.Date(paste(EffectiveYear, 1, 1, sep="-")),
                lag(EffectiveDate))),
  origin = "1970-01-01"  # ifelse() returns day counts; older R versions need the origin
)) -> res
res
# id EffectiveDate EffectiveYear NewEffectiveDate
# 1 a 1972-10-05 1972 1972-10-05
# 2 a <NA> 1985 1985-01-01
# 3 a 1988-11-12 1988 1988-11-12
# 4 b 2011-09-05 2011 2011-09-05
# 5 b <NA> NA 2011-09-05
# 6 b <NA> 2012 2012-01-01
# 7 c 2012-11-11 2012 2012-11-11
# 8 c 2013-05-15 2013 2013-05-15
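The outer as.Date() in the answer above matters because ifelse() does not keep the Date class: it returns the underlying day counts as plain numbers. A small illustration with made-up values (the variable names here are only for the example):

```r
d <- as.Date(c("1972-10-05", NA))
fallback <- as.Date("1985-01-01")

raw <- ifelse(is.na(d), fallback, d)
class(raw)   # "numeric": the Date class has been dropped

# Convert the day counts back to dates
fixed <- as.Date(raw, origin = "1970-01-01")
fixed
```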

There is a problem with your ifelse block: you closed the parenthesis of the second ifelse too early, so it has a test but no yes or no argument, and the remaining arguments are passed to the first ifelse as extras. That is what the "unused arguments" error is telling you.
This should work:
# ifelse() drops the Date class, so build the result as text and convert with as.Date() afterwards if needed
tdat %>% mutate(NewEffectiveDate = ifelse(ED_NA == 1 & EY_NA == 0,
                                          paste(EffectiveYear, "01", "01", sep="-"),
                                          ifelse(ED_NA == 1 & EY_NA == 1,
                                                 lag(as.character(EffectiveDate)),
                                                 as.character(EffectiveDate))))

Related

Check if two values within consecutive dates are identical

Let's say I have a tibble like
df <- tribble(
~date, ~place, ~wthr,
#------------/-----/--------
"2017-05-06","NY","sun",
"2017-05-06","CA","cloud",
"2017-05-07","NY","sun",
"2017-05-07","CA","rain",
"2017-05-08","NY","cloud",
"2017-05-08","CA","rain",
"2017-05-09","NY","cloud",
"2017-05-09","CA",NA,
"2017-05-10","NY","cloud",
"2017-05-10","CA","rain"
)
I want to check whether the weather in a specific region on a specific day was the same as on the previous day, and attach the result to df as a boolean column, so that
tribble(
~date, ~place, ~wthr, ~same,
#------------/-----/------/------
"2017-05-06","NY","sun", NA,
"2017-05-06","CA","cloud", NA,
"2017-05-07","NY","sun", TRUE,
"2017-05-07","CA","rain", FALSE,
"2017-05-08","NY","cloud", FALSE,
"2017-05-08","CA","rain", TRUE,
"2017-05-09","NY","cloud", TRUE,
"2017-05-09","CA", NA, NA,
"2017-05-10","NY","cloud", TRUE,
"2017-05-10","CA","rain", NA
)
Is there a good way to do this?
To get a logical column, group by place and compare each wthr value with the value in the previous row using lag. I added arrange on date to make sure the rows are in chronological order.
library(dplyr)
df %>%
arrange(date) %>%
group_by(place) %>%
mutate(same = wthr == lag(wthr, default = NA))
Edit: If you want to make sure dates are consecutive (1 day apart), you can include an ifelse that checks whether the difference between date and lag(date) is 1. If the dates are not 1 day apart, the result is coded as NA.
Note: Also, make sure your date column is of class Date:
df$date <- as.Date(df$date)
df %>%
arrange(date) %>%
group_by(place) %>%
mutate(same = ifelse(
date - lag(date) == 1,
wthr == lag(wthr, default = NA),
NA))
Output
date place wthr same
<chr> <chr> <chr> <lgl>
1 2017-05-06 NY sun NA
2 2017-05-06 CA cloud NA
3 2017-05-07 NY sun TRUE
4 2017-05-07 CA rain FALSE
5 2017-05-08 NY cloud FALSE
6 2017-05-08 CA rain TRUE
7 2017-05-09 NY cloud TRUE
8 2017-05-09 CA NA NA
9 2017-05-10 NY cloud TRUE
10 2017-05-10 CA rain NA

How to deal with outliers within and between observations in a panel data in R?

I have a dataset that shows the revenue over 20 years of around 100.000 companies. The data has many other variables, but, below, I'm writing a reproducible version of a simplified sample of this dataset.
my_data <- data.frame(Company = c("A","B","C","D"), CITY = c("Paris", "Paris", "Quimper", "Nice"), year_creation = c("2010", "2009", "2008", "2009"), revenue_2008 = c(NA, NA, 10, NA),
revenue_2009 = c(NA,10, 20, 15000), revenue_2010 = c(02, 10, 2500, 20000), revenue_2011 = c(14, 16, 10, 30000),
size = c(2, 3, 5, 1))
As you can see, I'm dealing with an unbalanced panel data that has outliers both within the observations (e.g., the sudden revenue of company C in the year 2010) and in between the observations (e.g., the company D that has much higher revenues than the others, even considering I've selected companies that were supposed to be similar)...
So, my question is, what is the best way to deal with these two types of outliers in R? I imagined that for the within outliers, the data in the wide-format should be better, right? But which code can run to check the outliers line by line (i.e., observation by observation)?
And for the second type of outliers? Is it better to convert the data for the long format? If yes, how could I test the outliers in the long format?
Thank you so much for your help!
Best,
How to detect outliers is mostly a statistical question. One method you could use is the Hampel filter (its pros and cons are outside the scope of this answer).
It considers values outside of median ± 3*(median absolute deviation) to be outliers.
Let's assume we use this criterion. You can run the within and between tests through the by argument of data.table.
Is it better to convert the data for the long format?
It makes the analysis easier, so I have converted the data with melt.
my_data <- data.frame(Company = c("A","B","C","D"), CITY = c("Paris", "Paris", "Quimper", "Nice"), year_creation = c("2010", "2009", "2008", "2009"), revenue_2008 = c(NA, NA, 10, NA),
revenue_2009 = c(NA,10, 20, 15000), revenue_2010 = c(02, 10, 2500, 20000), revenue_2011 = c(14, 16, 10, 30000),
size = c(2, 3, 5, 1))
library(data.table)
my_data <- as.data.table(my_data)
my_data <- melt(my_data, id.vars = c("Company", "CITY", "year_creation", "size"))
hampel_filter <- function(x){
x_med <- median(x, na.rm = TRUE)
x_mad <- mad(x, na.rm = TRUE)
(x > x_med + 3*x_mad | x < x_med - 3*x_mad)
}
my_data[, between_out := hampel_filter(value), by = variable]
my_data[, within_out := hampel_filter(value), by = Company]
> my_data
Company CITY year_creation size variable value between_out within_out
1: A Paris 2010 2 revenue_2008 NA NA NA
2: B Paris 2009 3 revenue_2008 NA NA NA
3: C Quimper 2008 5 revenue_2008 10 FALSE FALSE
4: D Nice 2009 1 revenue_2008 NA NA NA
5: A Paris 2010 2 revenue_2009 NA NA NA
6: B Paris 2009 3 revenue_2009 10 FALSE FALSE
7: C Quimper 2008 5 revenue_2009 20 FALSE FALSE
8: D Nice 2009 1 revenue_2009 15000 TRUE FALSE
9: A Paris 2010 2 revenue_2010 2 FALSE FALSE
10: B Paris 2009 3 revenue_2010 10 FALSE FALSE
11: C Quimper 2008 5 revenue_2010 2500 FALSE TRUE
12: D Nice 2009 1 revenue_2010 20000 TRUE FALSE
13: A Paris 2010 2 revenue_2011 14 FALSE FALSE
14: B Paris 2009 3 revenue_2011 16 FALSE TRUE
15: C Quimper 2008 5 revenue_2011 10 FALSE FALSE
16: D Nice 2009 1 revenue_2011 30000 TRUE FALSE
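Row 14 may look surprising, so here is the within-company calculation for company B worked through by hand. This is only an illustration of the same rule on the values from the table above. Because two of B's three observed revenues are identical, the MAD is 0 and any deviation from the median is flagged.

```r
# Company B's revenues for 2008-2011, taken from the table above
b <- c(NA, 10, 10, 16)

med <- median(b, na.rm = TRUE)   # 10
dev <- mad(b, na.rm = TRUE)      # 0, since at least half of the values equal the median

# Same rule as hampel_filter(): outside median +/- 3 * MAD
flag <- b > med + 3 * dev | b < med - 3 * dev
flag
```

With very short series like this one, a MAD of 0 is common, so the within-company flags deserve a manual check.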
You could also detect and treat outliers at the same time with Winsorize() from DescTools. See details: https://en.wikipedia.org/wiki/Winsorizing
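For readers who want to see what winsorizing does without installing a package, here is a rough base-R sketch. The function name winsorize_sketch and the choice of cut-off probabilities are assumptions made for illustration; this is not the DescTools implementation.

```r
# Clamp values to chosen quantiles (hypothetical helper, base R only)
winsorize_sketch <- function(x, probs = c(0.05, 0.95)) {
  lims <- unname(quantile(x, probs = probs, na.rm = TRUE))
  pmin(pmax(x, lims[1]), lims[2])   # NAs in x stay NA
}

# Company C's revenues from the question; only the upper tail is capped here
revenue_c <- c(10, 20, 2500, 10)
capped <- winsorize_sketch(revenue_c, probs = c(0, 0.75))
capped
```

Unlike the Hampel flags, this replaces extreme values rather than marking them, so it is worth keeping the original column alongside the capped one.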

Mutate based on two conditions in R dataframe

I have a R dataframe which can be generated from the code below
DF <- data.frame("Person_id" = c(1,1,1,1,2,2,2,2,3,3), "Type" = c("IN","OUT","IN","ANC","IN","OUT","IN","ANC","EM","ANC"), "Name" = c("Nara","Nara","Nara","Nara","Dora","Dora","Dora","Dora","Sara","Sara"),"day_1" = c("21/1/2002","21/4/2002","21/6/2002","21/9/2002","28/1/2012","28/4/2012","28/6/2012","28/9/2012","30/06/2004","30/06/2005"),"day_2" = c("23/1/2002","21/4/2002","","","30/1/2012","28/4/2012","","28/9/2012","",""))
What I would like to do is create two new columns as admit_start_date and admit_end_date based on few conditions which are given below
Rule 1
admit_start_date = day_1
admit_end_date = day_2 (sometimes day_2 can be NA. So refer Rule 2 below)
Rule 2
if day_2 is (null or blank or na) and Type is (Out or ANC or EM) then
admit_end_date = day_1
else (if Type is IN)
admit_end_date = day_1 + 5 (days)
This is what I am trying but doesn't seem to help
transform_dates = function(DF){ # this function is to create 'date' columns
DF %>%
mutate(admit_start_date = day_1) %>%
mutate(admit_end_date = day_2) %>%
admit_end_date = if_else(((Type == 'Out' & admit_end_date.isna() ==True|Type == 'ANC' & admit_end_date.isna() ==True|Type == 'EM' & admit_end_date.isna() ==True),day_1,day_1 + 5)
)
}
As you can see, I am not sure how to check for NA for a newly created column and replace those NAs with day_1 or day_1 + 5(days) based on Type column.
Can you please help?
I expect my output to be like as shown below
We can use case_when to specify each condition separately after converting "day" columns to actual date objects.
library(dplyr)
DF %>%
mutate_at(vars(starts_with('day')), as.Date, "%d/%m/%Y") %>%
mutate(admit_start_date = day_1,
admit_end_date = case_when(
!is.na(day_2) ~day_2,
is.na(day_2) & Type %in% c('OUT', 'ANC', 'EM') ~ day_1,
Type == 'IN' ~ day_1 + 5))
# Person_id Type Name day_1 day_2 admit_start_date admit_end_date
#1 1 IN Nara 2002-01-21 2002-01-23 2002-01-21 2002-01-23
#2 1 OUT Nara 2002-04-21 2002-04-21 2002-04-21 2002-04-21
#3 1 IN Nara 2002-06-21 <NA> 2002-06-21 2002-06-26
#4 1 ANC Nara 2002-09-21 <NA> 2002-09-21 2002-09-21
#5 2 IN Dora 2012-01-28 2012-01-30 2012-01-28 2012-01-30
#6 2 OUT Dora 2012-04-28 2012-04-28 2012-04-28 2012-04-28
#7 2 IN Dora 2012-06-28 <NA> 2012-06-28 2012-07-03
#8 2 ANC Dora 2012-09-28 2012-09-28 2012-09-28 2012-09-28
#9 3 EM Sara 2004-06-30 <NA> 2004-06-30 2004-06-30
#10 3 ANC Sara 2005-06-30 <NA> 2005-06-30 2005-06-30
The dates in the dataframe are not of class "Date" (you can check with class(DF$day_1)). Using mutate_at, we change their class to "Date" so we can do date arithmetic on them. starts_with('day') means that any column whose name starts with "day" is converted to "Date" class. We use mutate_at when we want to apply the same function to multiple columns.
case_when is an alternative to nested ifelse statements. The conditions are evaluated in order: for each row, the first condition that is satisfied determines the result and the remaining conditions are ignored. Hence, no else is required here. If none of the conditions is satisfied, it returns NA. Check ?case_when.
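A small illustration of that ordering, using made-up values rather than the question's data: the value 12 satisfies all three conditions, but only the first match is used, and the NA matches none of them.

```r
library(dplyr)

x <- c(1, 5, 12, NA)
size <- case_when(
  x > 10 ~ "large",   # checked first
  x > 3  ~ "medium",  # only reached when x > 10 is not TRUE
  x > 0  ~ "small"
)
size
```

Putting the most specific condition first is therefore what makes the later, broader conditions safe.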

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1%>%
group_by(id) %>%
arrange(id, test_date) %>%
filter(n() >= 2) %>% #To only get info on students who have taken the exam more than once and then merge it back in with the original data set using a join function
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous event
data1mc <- data1m %>%
arrange(id, event_date) %>%
group_by(id) %>%
mutate (multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
left_join (data1mc %>% select(id, event_date, multi_test),
by=c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
group_by(id) %>%
summarise(retest = n() > 1 & test_date[n()] < result_date[1])
Students who took the test only once get FALSE here because of the n() > 1 check. Without it, their single test date would be compared with their own result date and would come out TRUE. You could set these to NA instead if you want to keep "never retook" distinct from "retook after the result".
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.

general lag in time series panel data

I have a dataset akin to this
User Date Value
A 2012-01-01 4
A 2012-01-02 5
A 2012-01-03 6
A 2012-01-04 7
B 2012-01-01 2
B 2012-01-02 3
B 2012-01-03 4
B 2012-01-04 5
I want to create a lag of Value, respecting User.
User Date Value Value.lag
A 2012-01-01 4 NA
A 2012-01-02 5 4
A 2012-01-03 6 5
A 2012-01-04 7 6
B 2012-01-01 2 NA
B 2012-01-02 3 2
B 2012-01-03 4 3
B 2012-01-04 5 4
I've done it very inefficiently in a loop
df$value.lag1<-NA
levs<-levels(as.factor(df$User))
levs
for (i in 1:length(levs)) {
temper<- subset(df,User==as.numeric(levs[i]))
temper<- rbind(NA,temper[-nrow(temper),])
df$value.lag1[df$User==as.numeric(as.character(levs[i]))]<- temper
}
But this is very slow. I've looked at using by and tapply, but not figured out how to make them work.
I don't think XTS or TS will work because of the User element.
Any suggestions?
You can use ddply: it cuts a data.frame into pieces and transforms each piece.
d <- data.frame(
User = rep( LETTERS[1:3], each=10 ),
Date = seq.Date( Sys.Date(), length=30, by="day" ),
Value = rep(1:10, 3)
)
library(plyr)
d <- ddply(
d, .(User), transform,
# This assumes that the data is sorted
Value = c( NA, Value[-length(Value)] )
)
I think the easiest way, especially considering doing further analysis, is to convert your data frame to pdata.frame class from plm package.
After the conversion, the diff() and lag() operators can be used to create panel differences and lags.
library(plm)
df<-pdata.frame(df,index=c("id","date"))
df<-transform(df, l_value=lag(value,1))
For a panel without missing obs this is an intuitive solution:
df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2),
date = c(1992, 1993, 1991, 1990, 1994, 1992, 1991),
value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2))
df<-df[with(df, order(id,date)), ] # sort by id and then by date
df$l_value=c(NA,df$value[-length(df$value)]) # create a new var with data displaced by 1 unit
df$l_value[df$id != c(NA, df$id[-length(df$id)])] =NA # NA data with different current and lagged id.
df
id date value l_value
4 1 1990 5.3 NA
3 1 1991 3.3 5.3
1 1 1992 4.1 3.3
2 1 1993 4.5 4.1
5 1 1994 3.0 4.5
7 2 1991 5.2 NA
6 2 1992 3.2 5.2
I stumbled over a similar problem and wrote a function.
#df needs to be a structured balanced paneldata set sorted by id and date
#OBS the function deletes the row where the NA value would have been.
df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2,2,2,2),
date = c(1992, 1993, 1991, 1990, 1994, 1992, 1991
,1994,1990,1993),
value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2,5.3,3.4,5.6))
# sort paneldata set
library(dplyr)
df<-arrange(df,id,date)
#Function
# a=df
# b=colname of variable/variables that you want to lag
# q=number of lag years
# t=colname of date/time column
retraso<-function(a,b,q,t){
sto<-max(as.numeric(unique(a[[t]])))
sta<-min(as.numeric(unique(a[[t]])))
yo<-a[which(a[[t]]>=(sta+q)),]
la<-function(a,d,t,sto,sta){
ja<-data.frame(a[[d]],a[[t]])
colnames(ja)<-c(d,t)
ja<-ja[which(ja[[t]]<=(sto-q)),1]
return(ja)
}
for (i in 1:length(b)){
yo[[b[i]]] <-la(a,b[i],t,sto,sta)
}
return(yo)
}
#lag df 1 year
df<-retraso(df,"value",1,"date")
If you don't have gaps in the time variable, do
df %>% group_by(User) %>% mutate(value_lag = lag(Value, order_by = Date))
If you do have gaps in the time variable, see this answer
https://stackoverflow.com/a/26108191/3662288
Similarly, you could use tapply
# Create Data
user = c(rep('A',4),rep('B',4))
date = rep(seq(as.Date('2012-01-01'),as.Date('2012-01-04'),1),2)
value = c(4:7,2:5)
df = data.frame(user,date,value)
# Get lagged values
df$value.lag = unlist(tapply(df$value, df$user, function(x) c(NA,x[-length(x)])))
The idea is exactly the same: take value, split it by user, and then run a function on each subset. The unlist brings it back into vector format.
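One caveat with unlist(tapply(...)) is that the result comes back in group order, so it only lines up with the rows when the data frame is already sorted by user. As a hedged alternative, base R's ave() applies the function per group and writes each result back to that group's original row positions:

```r
# Same data as above, without the date column
df <- data.frame(user = rep(c("A", "B"), each = 4),
                 value = c(4:7, 2:5))

# Lag value by one row within each user; ave() keeps the original row order
df$value.lag <- ave(df$value, df$user,
                    FUN = function(x) c(NA, x[-length(x)]))
df
```

The rows still need to be in date order within each user for the lag to be meaningful.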
Provided the table is ordered by User and Date, this can be done with zoo. The trick is not to specify an index at this point.
library(zoo)
df <-read.table(text="User Date Value
A 2012-01-01 4
A 2012-01-02 5
A 2012-01-03 6
A 2012-01-04 7
B 2012-01-01 2
B 2012-01-02 3
B 2012-01-03 4
B 2012-01-04 5", header=TRUE, as.is=TRUE,sep = " ")
out <-zoo(df)
Value.lag <-lag(out,-1)[out$User==lag(out$User)]
res <-merge.zoo(out,Value.lag)
res <-res[,-(4:5)] # to remove extra columns
User.out Date.out Value.out Value.Value.lag
1 A 2012-01-01 4 <NA>
2 A 2012-01-02 5 4
3 A 2012-01-03 6 5
4 A 2012-01-04 7 6
5 B 2012-01-01 2 <NA>
6 B 2012-01-02 3 2
7 B 2012-01-03 4 3
8 B 2012-01-04 5 4
The collapse package now available on CRAN provides the most general C/C++ based solution to (fully-identified) panel-lags, leads, differences and growth rates / log differences. It has the generic functions flag, fdiff and fgrowth and associated lag / lead, difference and growth operators L, F, D and G. So to lag a panel dataset, it is sufficient to type:
L(data, n = 1, by = ~ idvar, t = ~ timevar, cols = 4:8)
which means: Compute 1 lag of columns 4 through 8 of data, identified by idvar and timevar. Multiple ID and time-variables can be supplied i.e. ~ id1 + id2, and sequences of lags and leads can also be computed on each column (i.e. n = -1:3 computes one lead and 3 lags). The same thing can also be done more programmatically with flag:
flag(data[4:8], 1, data$idvar, data$timevar)
Both of these options compute in under 1 millisecond on typical datasets (<30,000 obs.). Large-data performance is similar to data.table's shift. Similar programming applies to differences (fdiff / D) and growth rates (fgrowth / G). These functions are all S3 generic and have vector / time-series, matrix / ts-matrix and data.frame methods, as well as plm::pseries, plm::pdata.frame and grouped_df methods. Thus they can be used together with plm classes for panel data and with dplyr.
