Find the row of the next instance of a value in R

I have two columns Time and Event. There are two events A and B. Once an event A takes place, I want to find when the next event B occurs. Column Time_EventB is the desired output.
This is the data frame:
df <- data.frame(Event = sample(c("A", "B", ""), 20, replace = TRUE), Time = paste("t", seq(1,20)))
What is the code in R for finding the next instance of a value (B in this case)?
What is the code for, once the instance of B is found, returning the value of the corresponding Time column?
The code should be something like this:
data$Time_EventB <- ifelse(data$Event == "A", <Code for returning time of next instance of B>, "")
In Excel this can be done using VLOOKUP.

Here's a simple solution:
set.seed(1)
df <- data.frame(Event = sample(c("A", "B", ""), size = 20, replace = TRUE), time = 1:20)
as <- which(df$Event == "A")
bs <- which(df$Event == "B")
next_b <- sapply(as, function(a) {
  gaps <- bs - a                     # distance from this A to every B
  if (all(gaps < 0)) return(NA)      # no B occurs after this A
  bs[min(gaps[gaps > 0]) == gaps]    # row index of the nearest following B
})
df$next_b <- NA
df$next_b[as] <- df$time[next_b]
> df
   Event time next_b
1      A    1      2
2      B    2     NA
3      B    3     NA
4           4     NA
5      A    5      8
6           6     NA
7           7     NA
8      B    8     NA
9      B    9     NA
10     A   10     14
11     A   11     14
12     A   12     14
13         13     NA
14     B   14     NA
15         15     NA
16     B   16     NA
17         17     NA
18         18     NA
19     B   19     NA
20         20     NA

Here's an attempt using a "rolling join" from the data.table package:
library(data.table)
setDT(df)
df[Event=="B", .(time, nextb=time)][df, on="time", roll=-Inf][Event != "A", nextb := NA][]
#     time nextb Event
#  1:    1     2     A
#  2:    2    NA     B
#  3:    3    NA     B
#  4:    4    NA
#  5:    5     8     A
#  6:    6    NA
#  7:    7    NA
#  8:    8    NA     B
#  9:    9    NA     B
# 10:   10    14     A
# 11:   11    14     A
# 12:   12    14     A
# 13:   13    NA
# 14:   14    NA     B
# 15:   15    NA
# 16:   16    NA     B
# 17:   17    NA
# 18:   18    NA
# 19:   19    NA     B
# 20:   20    NA
Data borrowed from @thc's answer above.
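For completeness, here is a base R alternative (a sketch, not from the answers above, using the df built in the first answer): findInterval() counts, for each A position, how many B positions come before it, so the next element of bs is the first B after that A.
as <- which(df$Event == "A")
bs <- which(df$Event == "B")
idx <- findInterval(as, bs) + 1      # position in 'bs' of the first B after each A
df$next_b <- NA
df$next_b[as] <- df$time[bs[idx]]    # bs[idx] is NA when no later B exists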

Related

How to get the previous matching value

How do I get the value of a when the value of b matches the most recent previous value of b? E.g., row 3 of b matches the earlier row 1, and row 6 matches row 4. The NEW column below is the desired output.
df <- data.frame(year = c(2013,2013,2014,2014,2014,2015,2015,2015,2016,2016,2016),
                 a = c(10,11,NA,13,22,NA,19,NA,10,15,NA),
                 b = c(30.133,29,30.1223,33,17,33,11,17,14,13.913,14))
year    a        b  NEW
2013   10   30.133  NA
2013   11   29      NA
2014   NA  30.1223  10
2014   13   33      NA
2014   22   17      NA
2015   NA   33      13
2015   19   11      NA
2015   NA   17      22
2016   10   14      NA
2016   15   13.913  10
2016   NA   14      15
Thanks
For the OP's example case
One way is to use the duplicated() function.
# Input dataframe
df <- data.frame(year = c(2013,2013,2014,2014,2014,2015,2015,2015,2016,2016,2016),
                 a = c(10,11,NA,13,22,NA,19,NA,10,15,NA),
                 b = c(30,29,30,33,17,33,11,17,14,14,14))
# creating a new column with default values
df$NEW <- NA
# updating the value using the previous matching position
df$NEW[duplicated(df$b)] <- df$a[duplicated(df$b, fromLast = TRUE)]
# expected output
df
# year a b NEW
# 1 2013 10 30 NA
# 2 2013 11 29 NA
# 3 2014 NA 30 10
# 4 2014 13 33 NA
# 5 2014 22 17 NA
# 6 2015 NA 33 13
# 7 2015 19 11 NA
# 8 2015 NA 17 22
# 9 2016 10 14 NA
# 10 2016 15 14 10
# 11 2016 NA 14 15
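To see why this works: duplicated(df$b) flags each value's later occurrences, while duplicated(df$b, fromLast = TRUE) flags the earlier ones, so the assignment pairs every duplicate with the occurrence immediately before it. A minimal illustration (an addition, not in the original answer):
b <- c(30, 30, 33, 33, 33)
duplicated(b)                   # FALSE  TRUE FALSE  TRUE  TRUE  (later copies)
duplicated(b, fromLast = TRUE)  #  TRUE FALSE  TRUE  TRUE FALSE  (earlier copies)
# positions 2, 4, 5 receive the values from positions 1, 3, 4 respectively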
General purpose usage
The above solution fails when the duplicates are not in sequential order, as @DavidArenburg pointed out, so I have changed the fourth element (df$b[4] <- 14). The general solution requires another handy function, order(), and should work for the different possible cases.
# Input dataframe
df <- data.frame(year = c(2013,2013,2014,2014,2014,2015,2015,2015,2016,2016,2016),
                 a = c(10,11,NA,13,22,NA,19,NA,10,15,NA),
                 b = c(30,29,30,14,17,33,11,17,14,14,14))
# creating a new column with default values
df$NEW <- NA
# sort by the matching column
df <- df[order(df$b),]
# updating the value using the previous matching position
df$NEW[duplicated(df$b)] <- df$a[duplicated(df$b, fromLast = TRUE)]
# restore the original order
df <- df[order(as.integer(rownames(df))),]
# expected output
df
# year a b NEW
# 1 2013 10 30 NA
# 2 2013 11 29 NA
# 3 2014 NA 30 10
# 4 2014 13 14 NA
# 5 2014 22 17 NA
# 6 2015 NA 33 NA
# 7 2015 19 11 NA
# 8 2015 NA 17 22
# 9 2016 10 14 13
# 10 2016 15 14 10
# 11 2016 NA 14 15
Here, the solution is based on base package functions. I am sure there are other ways of doing this using other packages.
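Another base R option (a sketch, not from the original answers): scan each row for the most recent earlier row with the same b. This avoids the sort-and-restore dance at the cost of an explicit loop.
prev_idx <- sapply(seq_len(nrow(df)), function(i) {
  hits <- which(df$b[seq_len(i - 1)] == df$b[i])   # earlier rows with the same b
  if (length(hits)) hits[length(hits)] else NA_integer_
})
df$NEW2 <- df$a[prev_idx]   # NA where b has not been seen before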

In R, use mutate() to create a new column based on conditions by group

For each person there are two types of visits, and for each visit there are date records. The dataset is constructed as follows.
p <- c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(p, type, date)
So now the dataset looks like this.
> d
p type date
1 1 15 2014-02-03
2 1 20 2014-02-04
3 1 20 2014-02-06
4 2 15 2014-01-28
5 2 20 2014-02-03
6 2 15 2014-03-03
7 2 20 2014-03-13
8 3 20 2014-04-03
9 3 15 2014-04-09
10 3 15 2014-12-03
Now, I'd like to create three new columns:
1. An indicator of whether a type 20 visit happens within 7 days after the type 15 visit: 1 if yes, otherwise 0. (For example, for p2 the value in line 4 should be 1, and in line 6 it should be 0.)
2. The date of the first type 20 visit within 7 days after the type 15 visit. If there is no type 20 visit within 7 days, leave it blank. (For example, for p1 the value should be 2014-02-04, not 2014-02-06.)
3. The number of days between the type 15 visit and the type 20 visit found within 7 days of it. If there is no such type 20 visit, leave it blank. (For example, the value in line 1 should be 1.)
I'm a super newbie in R and basically have no idea what to do. I tried a for loop within each group, but it never works:
group_by(p) %>%
for (i in 1:length(date)) {
  if (type[i] == 15 && date[i] + 7 >= date[i+1:length(date)]) {
    indicator = 1
    first_date =
    days =        # Have no idea how to check in this part
  } else {
    indicator = 0
    first_date = NA
    days = NA
  }
The expected output is as below.
p type date ind first_date days
1 1 15 2014-02-03 1 2014-02-04 1 # = 2014-02-04 - 2014-02-03
2 1 20 2014-02-04 NA <NA> NA
3 1 20 2014-02-06 NA <NA> NA
4 2 15 2014-01-28 1 2014-02-03 6 # = 2014-02-03 - 2014-01-28
5 2 20 2014-02-03 NA <NA> NA
6 2 15 2014-03-03 0 <NA> NA # since (2014-03-13 - 2014-03-03) > 7
7 2 20 2014-03-13 NA <NA> NA
8 3 20 2014-04-03 NA <NA> NA #I don't care about the value for type 20 lines
9 3 15 2014-04-09 0 <NA> NA
10 3 15 2014-12-03 0 <NA> NA
So I came up with a new idea: what if we group records by p and by type == 15? Then we can use subtraction within groups as days, and the rest will be easy.
I found one way of doing this:
d[, group := cumsum(type == 15)]
However, this starts a new group at every type 15 record regardless of the person. How do I add p as another grouping condition?
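(A direct sketch of an answer to that follow-up, assuming d has been converted with setDT(): data.table's by argument restarts the counter per person. The answers below tackle the full problem.)
library(data.table)
setDT(d)[, group := cumsum(type == 15), by = p]  # counter restarts for each p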
I took a stab at this. There's one caveat, though: my answer assumes that after a type 15 visit occurs, the next visit within 7 days is a type 20 visit. If that's not the case, i.e. there's another type 15 visit within 7 days, the first type 15 visit won't be considered and only the second one matters:
library(dplyr)
library(tidyr)
library(lubridate)
d %>%
  mutate(rownum = 1:n()) %>%
  spread(type, date, sep = "_") %>%
  group_by(p) %>%
  mutate(ind = ifelse(lead(type_20) - type_15 <= 7, 1, 0)) %>%
  mutate(ind = ifelse(is.na(ind), 0, ind)) %>%
  mutate(ind = ifelse(is.na(type_15), NA, ind)) %>%
  mutate(first_date = ifelse(ind == 1, lead(type_20), NA)) %>%
  mutate(first_date = as.Date(first_date, origin = lubridate::origin)) %>%
  mutate(days = first_date - type_15) %>%
  gather("type", "date", type_15, type_20) %>%
  filter(!is.na(date)) %>%
  arrange(p, date) %>%
  select(p, type, date, ind, first_date, days)
# p type date ind first_date days
# <dbl> <chr> <date> <dbl> <date> <time>
#1 1 type_15 2014-02-03 1 2014-02-04 1 days
#2 1 type_20 2014-02-04 NA <NA> NA days
#3 1 type_20 2014-02-06 NA <NA> NA days
#4 2 type_15 2014-01-28 1 2014-02-03 6 days
#5 2 type_20 2014-02-03 NA <NA> NA days
#6 2 type_15 2014-03-03 0 <NA> NA days
#7 2 type_20 2014-03-13 NA <NA> NA days
#8 3 type_20 2014-04-03 NA <NA> NA days
#9 3 type_15 2014-04-09 0 <NA> NA days
#10 3 type_15 2014-12-03 0 <NA> NA days
Let me try to explain what I'm doing:
First, the type and date columns are spread so that each type's dates appear in a separate column (this makes it easier to compare dates of the two different types). Next come a couple of mutates. The first three apply the conditions outlined in the question, as follows: if lead(type_20) - type_15 <= 7, there was a type 20 visit within 7 days of a type 15 visit, so we mark that as 1, else 0. After this, if ind is NA we assume no type 20 visit was found, so we also mark it as 0. In the third mutate we mark the lines with a missing type_15 as NA.
The next three mutate lines add the columns outlined in 2 and 3 in the question.
Finally, the columns are gathered back up to their previous format, redundant rows are filtered out, the dataframe is arranged by p and date, and the needed columns are selected.
I hope this is clear enough. It might be helpful to run the code line by line, stopping to view the transformed data frame after each line to see how the transformations act on the dataframe.
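As an aside (not part of the original answer): in tidyr 1.0.0 and later, spread() and gather() are superseded by pivot_wider() and pivot_longer(), so the reshaping step could be written as, e.g.:
library(dplyr)
library(tidyr)
d %>%
  mutate(rownum = row_number()) %>%                 # keep rows distinct
  pivot_wider(names_from = type, values_from = date,
              names_prefix = "type_")               # columns type_15, type_20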
If you're willing to use some functions from the purrr package and to use some custom functions, here is another option...
Packages you'll need
library(dplyr)
library(purrr)
Set up data (as per question)
p <- c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(p, type, date)
Create custom functions that will work with the purrr map_* functions to iterate through your data frame and create ind and first_date.
# Function to manage ind
ind_manager <- function(type, date, dates_20) {
  if (type == 20)
    return(NA_integer_)
  checks <- map_lgl(dates_20, between, date, date + 7)
  return(as.integer(any(checks)))
}
# Function to manage first_date
first_date_manager <- function(ind, date, dates_20) {
  if (is.na(ind) || ind != 1)
    return(NA_character_)
  dates_20 <- dates_20[order(dates_20)]
  as.character(dates_20[which.max(date < dates_20)])
}
Save a vector of dates where type == 20 to be used as comparisons
dates_20 <- d$date[d$type == 20]
The final mutate() call
# mutate() call to create variables
d %>%
  mutate(
    ind = map2_int(type, date, ind_manager, dates_20),
    first_date = as.Date(map2_chr(ind, date, first_date_manager, dates_20)),
    days = as.integer(first_date - date)
  )
#> p type date ind first_date days
#> 1 1 15 2014-02-03 1 2014-02-04 1
#> 2 1 20 2014-02-04 NA <NA> NA
#> 3 1 20 2014-02-06 NA <NA> NA
#> 4 2 15 2014-01-28 1 2014-02-03 6
#> 5 2 20 2014-02-03 NA <NA> NA
#> 6 2 15 2014-03-03 0 <NA> NA
#> 7 2 20 2014-03-13 NA <NA> NA
#> 8 3 20 2014-04-03 NA <NA> NA
#> 9 3 15 2014-04-09 0 <NA> NA
#> 10 3 15 2014-12-03 0 <NA> NA
Here is a base R way. Generally, I prefer to create a function that does your task which can then be repeated on other pieces and debugged on test cases where it doesn't seem to work.
The first step is to define the pieces:
d <- structure(list(p = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                    type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15),
                    date = structure(c(16104, 16105, 16107, 16098, 16104, 16132,
                                       16142, 16163, 16169, 16407), class = "Date")),
               .Names = c("p", "type", "date"),
               row.names = c(NA, -10L), class = "data.frame")
id <- with(d, {
  id <- ave(type, p, FUN = function(x) cumsum(x == 15))
  factor(paste0(p, id), unique(paste0(p, id)))
})
sp <- split(d, id)
So, sp creates a list of data frames to which we will apply a function. Each piece is a single unique p with at most one type == 15 row (plus however many type == 20 rows follow).
The first two pieces are
sp[1:2]
# $`11`
# p type date
# 1 1 15 2014-02-03
# 2 1 20 2014-02-04
# 3 1 20 2014-02-06
#
# $`21`
# p type date
# 4 2 15 2014-01-28
# 5 2 20 2014-02-03
And we can apply the function below on each one
first_date(sp[[1]])
# p type date ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA
first_date(sp[[2]])
# p type date ind first_date days
# 4 2 15 2014-01-28 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA
Or all at once with a loop
(sp1 <- lapply(sp, first_date))
`rownames<-`(do.call('rbind', sp1), NULL)
# p type date ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA
# 4 2 15 2014-01-28 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA
# 6 2 15 2014-03-03 0 <NA> NA
# 7 2 20 2014-03-13 NA <NA> NA
# 8 3 20 2014-04-03 NA <NA> NA
# 9 3 15 2014-04-09 0 <NA> NA
# 10 3 15 2014-12-03 0 <NA> NA
You can take advantage of the arguments, like window, or any others you add, without changing much of the function. For example, to change the window (note this call runs on sp1, the already-augmented pieces, so a second set of ind/first_date/days columns is appended for side-by-side comparison):
(sp2 <- lapply(sp1, first_date, window = 14))
`rownames<-`(do.call('rbind', sp2), NULL)
# p type date ind first_date days ind first_date days
# 1 1 15 2014-02-03 1 2014-02-04 1 1 2014-02-04 1
# 2 1 20 2014-02-04 NA <NA> NA NA <NA> NA
# 3 1 20 2014-02-06 NA <NA> NA NA <NA> NA
# 4 2 15 2014-01-28 1 2014-02-03 6 1 2014-02-03 6
# 5 2 20 2014-02-03 NA <NA> NA NA <NA> NA
# 6 2 15 2014-03-03 0 <NA> NA 1 2014-03-13 10
# 7 2 20 2014-03-13 NA <NA> NA NA <NA> NA
# 8 3 20 2014-04-03 NA <NA> NA NA <NA> NA
# 9 3 15 2014-04-09 0 <NA> NA 0 <NA> NA
# 10 3 15 2014-12-03 0 <NA> NA 0 <NA> NA
first_date <- function(data, window = 7) {
  nr <- nrow(data)
  ## check at least one type 15 and > 1 row
  ty15 <- data$type == 15
  dt15 <- data$date[ty15]
  if (!any(ty15) | nr == 1L)
    return(cbind(data, ind = ifelse(any(ty15), 0, NA),
                 first_date = NA, days = NA))
  ## first date vector
  dts <- rep(min(data$date[!ty15]), nr)
  dts[!ty15] <- NA
  ## days from the type 15 date
  days <- as.numeric(data$date[!ty15] - min(dt15))
  days <- c(days, rep(NA, nr - length(days)))
  ## convert to NA if criteria not met
  to_na <- days > window | is.na(dts)
  days[to_na] <- dts[to_na] <- NA
  ## ind vector -- 1 or 0 if type 15, NA otherwise
  ind <- rep(NA, nr)
  ind[ty15] <- as.integer(!is.na(dts[ty15]))
  ## combine
  cbind(data, ind = ind, first_date = dts, days = days)
}

Grouped moving average in R

I'm trying to calculate a moving average in R over a particular field, but I need this moving average to be grouped by two or more other fields. The purpose of this new average is predictive analysis, so I need it to be trailing as well.
Any variables that do not have enough values to be averaged (such as student J) would ideally give either NA or its original Score value.
I've been trying rollapply and data.table and am having no luck!
I've provided the table of data and two moving averages (AVG2 with k=2 and AVG3 with k=3) to show exactly what I'm after. The moving average is on Score and the variables to group over are school, Student and area. Please help!
no school Student area Score AVG2 AVG3
1 I S A 5 NA NA
2 B S A 2 NA NA
3 B S A 7 NA NA
4 B O A 3 NA NA
5 B O B 9 NA NA
6 I O A 6 NA NA
7 I O B 3 NA NA
8 I S A 7 NA NA
9 I O A 1 NA NA
10 B S A 7 4.5 NA
11 I S A 3 NA NA
12 I O A 8 3.5 NA
13 B S A 3 7 5.33
14 I O A 4 4.5 5
15 B O A 1 NA NA
16 I S A 9 5 5
17 B S A 4 5 5.67
18 B O A 6 2 NA
19 I S A 3 6 6.33
20 I O B 8 NA NA
21 B S A 3 3.5 4.67
22 I O A 4 6 4.33
23 B O A 1 3.5 3.33
24 I S A 9 6 5
25 B S A 4 3.5 3.33
26 B O A 6 3.5 2.67
27 I J A 6 NA NA
Here is the code to recreate the initial table in R:
school <- c('I','B','B','B','B','I','I','I','I','B','I','I','B','I','B','I','B','B','I','I','B','I','B','I','B','B','I')
Student <- c('S','S','S','O','O','O','O','S','O','S','S','O','S','O','O','S','S','O','S','O','S','O','O','S','S','O','J')
area <- c('A','A','A','A','B','A','B','A','A','A','A','A','A','A','A','A','A','A','A','B','A','A','A','A','A','A','A')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1,9,4,6,3,8,3,4,1,9,4,6,6)
df <- data.frame(school, Student, area, Score)
You can try solving the problem using dplyr and TTR, but for student J from school I it is not possible to calculate a moving average, as there is only one measurement.
AVG2, calculated with stats::filter, gives the result you wanted; I also added AVG2b, calculated with TTR::SMA, to show a simple moving average in which the current measurement is also taken into account.
library(dplyr)
library(TTR)
df <- data.frame(school, Student, Score)
df$AVG2 <- NA
df$AVG2b <- NA
df[!(df$school == "I" & df$Student == "J"), ] <-
  df[!(df$school == "I" & df$Student == "J"), ] %>%
  group_by(school, Student) %>%
  mutate(AVG2 = stats::filter(Score, c(0, 0.5, 0.5), sides = 1),
         AVG2b = SMA(Score, n = 2))
> df
school Student Score AVG2 AVG2b
1 I S 5 NA NA
2 B S 2 NA NA
3 B S 7 NA 4.5
4 B O 3 NA NA
5 B O 9 NA 6.0
6 I O 6 NA NA
7 I O 3 NA 4.5
8 I S 7 NA 6.0
9 I O 1 4.5 2.0
10 B S 7 4.5 7.0
...
Here is a rollapply solution (DF below is the input data frame, here without the area column). Note that it appears you want the average of the prior two or three rows in the same group, i.e. excluding the data in the current row.
library(zoo)
roll <- function(x, n) {
  if (length(x) <= n) NA
  else rollapply(x, list(-seq(n)), mean, fill = NA)
}
transform(DF,
          AVG2 = ave(Score, school, Student, FUN = function(x) roll(x, 2)),
          AVG3 = ave(Score, school, Student, FUN = function(x) roll(x, 3)))
giving:
school Student Score AVG2 AVG3
1 I S 5 NA NA
2 B S 2 NA NA
3 B S 7 NA NA
4 B O 3 NA NA
5 B O 9 NA NA
6 I O 6 NA NA
7 I O 3 NA NA
8 I S 7 NA NA
9 I O 1 4.5 NA
10 B S 7 4.5 NA
11 I S 3 6.0 NA
12 I O 8 2.0 3.333333
13 B S 3 7.0 5.333333
14 I O 4 4.5 4.000000
15 B O 1 6.0 NA
16 I S 9 5.0 5.000000
17 B S 4 5.0 5.666667
18 B O 6 5.0 4.333333
19 I S 3 6.0 6.333333
20 I O 8 6.0 4.333333
21 B S 3 3.5 4.666667
22 I O 4 6.0 6.666667
23 B O 1 3.5 5.333333
24 I S 9 6.0 5.000000
25 B S 4 3.5 3.333333
26 B O 6 3.5 2.666667
27 I J 6 NA NA
Update: Fixed roll.
Here is an AVG2 calculation with data.table, which can be faster than the other approaches:
library(data.table)
dt <- data.table(df)
setkey(dt, school, Student, area)
dt[, c("start", "len") := .(ifelse(.I + 1 > .I[.N], 0, .I + 1),
                            pmax(pmin(1, .I[.N] - .I - 1), 0)),
   by = .(school, Student, area)][
  , AVG2 := mean(dt$Score[start:(start + len)]), by = 1:nrow(dt)]
dt$AVG2[dt$len == 0] <- NA
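A newer data.table option (a sketch, not from the original answers; frollmean() requires data.table 1.12.0 or later): a trailing mean of the previous k scores is frollmean() shifted by one, which reproduces AVG2 and AVG3 directly, with NA wherever a group has too few prior values.
library(data.table)
dtf <- as.data.table(df)   # fresh copy of the question's df
dtf[, AVG2 := shift(frollmean(Score, 2)), by = .(school, Student, area)]  # mean of the prior 2
dtf[, AVG3 := shift(frollmean(Score, 3)), by = .(school, Student, area)]  # mean of the prior 3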

R - data frame manipulation [duplicate]

This question already has answers here:
Reshape multiple value columns to wide format
(5 answers)
Closed 5 years ago.
Suppose I have this data frame:
df <- data.frame(ID = c("id1", "id1", "id1", "id2", "id2", "id3", "id3", "id3"),
                 Code = c("A", "B", "C", "A", "B", "A", "C", "D"),
                 Count = c(34,65,21,3,8,12,15,16), Value = c(3,1,8,2,3,3,5,8))
that looks like this:
df
ID Code Count Value
1 id1 A 34 3
2 id1 B 65 1
3 id1 C 21 8
4 id2 A 3 2
5 id2 B 8 3
6 id3 A 12 3
7 id3 C 15 5
8 id3 D 16 8
I would like to obtain this result data frame:
result <- data.frame(Code = c("A", "B", "C", "D"),
                     id1_count = c(34,65,21,NA), id1_value = c(3,1,8,NA),
                     id2_count = c(3, 8, NA, NA), id2_value = c(2, 3, NA, NA),
                     id3_count = c(12,NA,15,16), id3_value = c(3,NA,5,8))
that looks like this:
> result
Code id1_count id1_value id2_count id2_value id3_count id3_value
1 A 34 3 3 2 12 3
2 B 65 1 8 3 NA NA
3 C 21 8 NA NA 15 5
4 D NA NA NA NA 16 8
Is there a one-liner in base R that can do that? I am able to achieve the result I need, but not in the R way (i.e., with loops and so on). Any help is appreciated. Thank you.
You can try dcast from the data.table package, which can take multiple value.var columns (introduced in the development version v1.9.5; available on CRAN since v1.9.6).
library(data.table)
dcast(setDT(df), Code~ID, value.var=c('Count', 'Value'))
# Code Count_id1 Count_id2 Count_id3 Value_id1 Value_id2 Value_id3
#1: A 34 3 12 3 2 3
#2: B 65 8 NA 1 3 NA
#3: C 21 NA 15 8 NA 5
#4: D NA NA 16 NA NA 8
Or using reshape from base R
reshape(df, idvar='Code', timevar='ID', direction='wide')
# Code Count.id1 Value.id1 Count.id2 Value.id2 Count.id3 Value.id3
#1 A 34 3 3 2 12 3
#2 B 65 1 8 3 NA NA
#3 C 21 8 NA NA 15 5
#8 D NA NA NA NA 16 8
You could also try:
library(tidyr)
library(dplyr)
df %>%
  gather(key, value, -(ID:Code)) %>%
  unite(id_key, ID, key) %>%
  spread(id_key, value)
Which gives:
# Code id1_Count id1_Value id2_Count id2_Value id3_Count id3_Value
#1 A 34 3 3 2 12 3
#2 B 65 1 8 3 NA NA
#3 C 21 8 NA NA 15 5
#4 D NA NA NA NA 16 8
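A more recent tidyverse spelling (a sketch, not from the original answers; requires tidyr >= 1.0.0): pivot_wider() accepts several values_from columns at once, collapsing the gather/unite/spread pipeline into one call.
library(tidyr)
# produces columns named Count_id1, Count_id2, ..., Value_id3 by default
pivot_wider(df, names_from = ID, values_from = c(Count, Value))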

Retain and lag function in R as SAS

I am looking for functions in R similar to the lag1, lag2, and retain functions in SAS, which I can use with data.tables.
I know there are functions like embed and lag in R, but they don't return a single value or the previous value; they return a complete set of vectors.
Is there anything in R that I can use with data.table?
More info on the SAS functions: Retain, Lag.
You have to be aware that R works very differently from the DATA step in SAS. The lag function in SAS is used in the data step, within its implicit loop structure. The same goes for the retain function, which simply keeps a value constant while looping through the data.
R, on the other hand, works completely vectorized. This means that you have to rethink what you want to do and adapt accordingly.
retain is simply useless in R, as R recycles arguments by default. If you want to do this explicitly, you might look at e.g. rep() to construct a vector with constant values and a certain length.
lag is a matter of using indices, and just shifting position of all values in a vector. In order to keep a vector of the same length, you need to add some NA and remove some extra values.
A simple example: This SAS code lags a variable x and adds a variable year that has a constant value:
data one;
  retain year 2013;
  input x @@;
  y=lag1(x);
  z=lag2(x);
datalines;
1 2 3 4 5 6
;
In R, you could write your own lag function like this:
mylag <- function(x, k) c(rep(NA, k), head(x, -k))
This single line prepends k NAs to the vector and drops the last k values. The result is a lagged vector like the one lag1 etc. produce in SAS.
This allows something like:
nrs <- 1:6   # equivalent to datalines
one <- data.frame(
  x = nrs,
  y = mylag(nrs, 1),
  z = mylag(nrs, 2),
  year = 2013   # R recycles the constant, so no extra command is needed
)
The result is :
> one
x y z year
1 1 NA NA 2013
2 2 1 NA 2013
3 3 2 1 2013
4 4 3 2 2013
5 5 4 3 2013
6 6 5 4 2013
Exactly the same would work with a data.table object. The important note here is to rethink your strategy: Instead of thinking loopwise as you do with the DATA step in SAS, you have to start thinking in terms of vectors and indices when using R.
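Worth noting (an addition, not in the original answer): since the question specifically mentions data.table, its built-in shift() does exactly what mylag() does, vectorized and usable by group:
library(data.table)
one <- data.table(x = 1:6)
one[, `:=`(y = shift(x, 1),    # lag1
           z = shift(x, 2),    # lag2
           year = 2013)]       # 'retain'-style constant, recycled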
I would say the closest equivalent to retain, lag1, and lag2 would be the Lag function in the quantmod package.
It's very easy to use with data.tables. E.g.:
library(data.table)
library(quantmod)
d <- data.table(v1=c(rep('a', 10), rep('b', 10)), v2=1:20)
setkeyv(d, 'v1')
d[,new_var := Lag(v2, 1), by='v1']
d[,new_var2 := v2-Lag(v2, 3), by='v1']
d[,new_var3 := Next(v2, 2), by='v1']
This yields the following:
print(d)
v1 v2 new_var new_var2 new_var3
1: a 1 NA NA 3
2: a 2 1 NA 4
3: a 3 2 NA 5
4: a 4 3 3 6
5: a 5 4 3 7
6: a 6 5 3 8
7: a 7 6 3 9
8: a 8 7 3 10
9: a 9 8 3 NA
10: a 10 9 3 NA
11: b 11 NA NA 13
12: b 12 11 NA 14
13: b 13 12 NA 15
14: b 14 13 3 16
15: b 15 14 3 17
16: b 16 15 3 18
17: b 17 16 3 19
18: b 18 17 3 20
19: b 19 18 3 NA
20: b 20 19 3 NA
As you can see, Lag lets you look back and Next lets you look forward. Both functions are nice because they pad the result with NAs such that it has the same length as the input.
If you want to get even fancier, and higher-performance, you can look into rolling joins with data.table objects. This is a little bit different than what you are asking for, but it is conceptually related, and so powerful and awesome I have to share.
Start with a data.table:
library(data.table)
library(quantmod)
set.seed(42)
d1 <- data.table(
  id = c(rep('a', 10), rep('b', 10)),
  time = rep(1:10, 2),
  value = runif(20))
setkeyv(d1, c('id', 'time'))
print(d1)
id time value
1: a 1 0.9148060
2: a 2 0.9370754
3: a 3 0.2861395
4: a 4 0.8304476
5: a 5 0.6417455
6: a 6 0.5190959
7: a 7 0.7365883
8: a 8 0.1346666
9: a 9 0.6569923
10: a 10 0.7050648
11: b 1 0.4577418
12: b 2 0.7191123
13: b 3 0.9346722
14: b 4 0.2554288
15: b 5 0.4622928
16: b 6 0.9400145
17: b 7 0.9782264
18: b 8 0.1174874
19: b 9 0.4749971
20: b 10 0.5603327
You have another data.table you want to join, but not all time indexes are present in the second table:
d2 <- data.table(
  id = sample(c('a', 'b'), 5, replace = TRUE),
  time = sample(1:10, 5),
  value2 = runif(5))
setkeyv(d2, c('id', 'time'))
print(d2)
id time value2
1: a 4 0.811055141
2: a 10 0.003948339
3: b 6 0.737595618
4: b 8 0.388108283
5: b 9 0.685169729
A regular merge yields lots of missing values:
d2[d1,,roll=FALSE]
id time value2 value
1: a 1 NA 0.9148060
2: a 2 NA 0.9370754
3: a 3 NA 0.2861395
4: a 4 0.811055141 0.8304476
5: a 5 NA 0.6417455
6: a 6 NA 0.5190959
7: a 7 NA 0.7365883
8: a 8 NA 0.1346666
9: a 9 NA 0.6569923
10: a 10 0.003948339 0.7050648
11: b 1 NA 0.4577418
12: b 2 NA 0.7191123
13: b 3 NA 0.9346722
14: b 4 NA 0.2554288
15: b 5 NA 0.4622928
16: b 6 0.737595618 0.9400145
17: b 7 NA 0.9782264
18: b 8 0.388108283 0.1174874
19: b 9 0.685169729 0.4749971
20: b 10 NA 0.5603327
However, data.table allows you to roll the secondary index forward, WITHIN THE PRIMARY INDEX!
d2[d1,,roll=TRUE]
id time value2 value
1: a 1 NA 0.9148060
2: a 2 NA 0.9370754
3: a 3 NA 0.2861395
4: a 4 0.811055141 0.8304476
5: a 5 0.811055141 0.6417455
6: a 6 0.811055141 0.5190959
7: a 7 0.811055141 0.7365883
8: a 8 0.811055141 0.1346666
9: a 9 0.811055141 0.6569923
10: a 10 0.003948339 0.7050648
11: b 1 NA 0.4577418
12: b 2 NA 0.7191123
13: b 3 NA 0.9346722
14: b 4 NA 0.2554288
15: b 5 NA 0.4622928
16: b 6 0.737595618 0.9400145
17: b 7 0.737595618 0.9782264
18: b 8 0.388108283 0.1174874
19: b 9 0.685169729 0.4749971
20: b 10 0.685169729 0.5603327
This is pretty damn cool: old observations are rolled forward in time until they are replaced by new ones. If you want to replace the NA values at the beginning of the series, you can do so by rolling the first observation backwards:
d2[d1,,roll=TRUE, rollends=c(TRUE, TRUE)]
id time value2 value
1: a 1 0.811055141 0.9148060
2: a 2 0.811055141 0.9370754
3: a 3 0.811055141 0.2861395
4: a 4 0.811055141 0.8304476
5: a 5 0.811055141 0.6417455
6: a 6 0.811055141 0.5190959
7: a 7 0.811055141 0.7365883
8: a 8 0.811055141 0.1346666
9: a 9 0.811055141 0.6569923
10: a 10 0.003948339 0.7050648
11: b 1 0.737595618 0.4577418
12: b 2 0.737595618 0.7191123
13: b 3 0.737595618 0.9346722
14: b 4 0.737595618 0.2554288
15: b 5 0.737595618 0.4622928
16: b 6 0.737595618 0.9400145
17: b 7 0.737595618 0.9782264
18: b 8 0.388108283 0.1174874
19: b 9 0.685169729 0.4749971
20: b 10 0.685169729 0.5603327
These rolling joins are absolutely incredible, and I've never seen them implemented in any other open source package (see ?data.table for more info). It will take a little while to turn off your "SAS brain" and turn on your "R brain", but once you get over that initial hump you'll find that the language is much more expressive.
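One small modernization (an aside, not from the original answer): since data.table 1.9.6 the join columns can be supplied inline with on=, so the same rolling join works without setting keys first:
# equivalent to the keyed join above, assuming d1 and d2 as constructed earlier
d2[d1, on = .(id, time), roll = TRUE]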
For retain, try this:
retain <- function(x, event, outside = NA) {
  # event positions delimit runs; each run repeats the value observed at its start
  indices <- c(1, which(event), length(x) + 1)
  values <- c(outside, x[event])
  rep(values, diff(indices))
}
With this data, I want to carry the value down whenever w == b:
df <- data.frame(w = c("a","b","c","a","b","c"), x = 1:6,
                 y = c(1,1,2,2,2,3), stringsAsFactors = FALSE)
df$z <- retain(df$x - df$y, df$w == "b")
df
And here is the converse, obtain, which does not exist in SAS:
obtain <- function(x, event, outside = NA) {
  # like retain, but values propagate backwards from each event position
  indices <- c(0, which(event), length(x))
  values <- c(x[event], outside)
  rep(values, diff(indices))
}
Here's an example: I want to obtain the value in advance wherever w == b:
df$z2<-obtain(df$x-df$y,df$w=="b")
df
Thanks to Julien for helping.
Here's a related example: a cumulative sum with sqldf (the input table is reconstructed from the output below):
library(sqldf)
w_cum <- data.frame(id = 1:10, SomeNumt = 11:20)
w_cum <- sqldf("select t1.id, t1.SomeNumt, SUM(t2.SomeNumt) as cum_sum
                from w_cum t1
                inner join w_cum t2 on t1.id >= t2.id
                group by t1.id, t1.SomeNumt
                order by t1.id")
id SomeNumt cum_sum
1 11 11
2 12 23
3 13 36
4 14 50
5 15 65
6 16 81
7 17 98
8 18 116
9 19 135
10 20 155
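For comparison (an addition, not in the original answer): base R computes the same running total with one vectorized call, and ave() handles grouped variants.
# whole-column running total
w_cum$cum_sum <- cumsum(w_cum$SomeNumt)
# grouped version, if a grouping column 'g' existed (hypothetical):
# w_cum$cum_sum <- ave(w_cum$SomeNumt, w_cum$g, FUN = cumsum)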
