Lagging a single column in a time series - r

I am running R 4.0.3 with no internet access.
I want to lag a single column of a multi-column time series. I wasn't able to find a satisfactory answer anywhere else.
Intuitively the following makes sense to me, but it just doesn't work:
library(tsbox)
data <- data.frame(Date = c('2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01'),
                   col1 = c(1,2,3,4,5),
                   col2 = c(1,2,3,4,5))
data[,'Date'] <- as.POSIXct(data[,'Date'], format = '%Y-%m-%d')
timeseries <- ts_ts(ts_long(data))
timeseries[,'col1_L1'] <- lag(timeseries[,'col1'], 1)
What I get:
         col1 col2 col1_L1
Jan 2005    1    1       1
Feb 2005    2    2       2
Mar 2005    3    3       3
Apr 2005    4    4       4
May 2005    5    5       5
What I would expect from this code:
         col1 col2 col1_L1
Jan 2005    1    1      NA
Feb 2005    2    2       1
Mar 2005    3    3       2
Apr 2005    4    4       3
May 2005    5    5       4

I wasn't able to reproduce your example (likely for the reasons pointed out in the comments), but perhaps you could use the function from this post, e.g.:
data <- data.frame(Date = c('2005-01-01','2005-02-01','2005-03-01','2005-04-01','2005-05-01'),
                   col1 = c(1,2,3,4,5),
                   col2 = c(1,2,3,4,5))
data[,'Date'] <- as.POSIXct(data[,'Date'], format = '%Y-%m-%d')
lagpad <- function(x, k) {
  if (k > 0) {
    # positive k: lag by padding the front with NAs
    c(rep(NA, k), x)[1:length(x)]
  } else {
    # negative k: lead by padding the end with NAs
    c(x[(-k + 1):length(x)], rep(NA, -k))
  }
}
data$col1_L1 <- lagpad(data$col1, 1)
data
#>         Date col1 col2 col1_L1
#> 1 2005-01-01    1    1      NA
#> 2 2005-02-01    2    2       1
#> 3 2005-03-01    3    3       2
#> 4 2005-04-01    4    4       3
#> 5 2005-05-01    5    5       4
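As for why the original attempt appears to do nothing (assuming it runs at all in your session): stats::lag() does not pad with NA; it shifts the series' time attribute (tsp), and element-wise assignment back into the ts matrix discards that shifted time base, so the column comes back identical to col1. A minimal base-R sketch that aligns on time instead, assuming the timeseries object built above:
# stats::lag with k = -1 moves the series one period later in time
lagged <- stats::lag(timeseries[, 'col1'], k = -1)
# cbind on ts objects aligns columns on the union of time points, padding with NA
out <- cbind(timeseries, lagged)
colnames(out) <- c(colnames(timeseries), 'col1_L1')
# trim the extra trailing period introduced by the shift
out <- window(out, end = end(timeseries))
This should reproduce the expected output, with NA for col1_L1 in Jan 2005.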

Related

Create a new column with max values using the identifier column within a pipeline

I am trying to clean up some old code and convert it over to the tidyverse. I am trying to create a new column of data within a pipeline that is the maximum age of individual fish. Let's represent the columns of interest as:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
                     fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
                     agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
# which looks like this:
fish_1
year fishid agei
1 2012 a 1
2 2012 a 2
3 2015 b 1
4 2015 b 2
5 2015 b 3
6 2013 c 1
7 2013 c 2
8 2013 c 3
9 2013 c 4
10 2012 d 1
11 2012 d 2
12 2015 e 1
13 2015 e 2
14 2015 e 3
What I'm trying to do is create a new column agec that is the maximum age for each individual fish, repeated as many times as needed to fill the rows for that fish.
The desired output would be:
fish_2 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
                     fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
                     agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3),
                     agec = c(2,2,3,3,3,4,4,4,4,2,2,3,3,3))
# Which looks like:
fish_2
year fishid agei agec
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3
The way I had done this in the past was to use a plyr::ddply() call to create a new dataframe and then merge it with fish_1, like this:
caps <- plyr::ddply(fish_1, c('fishid'), plyr::summarize, agec = max(agei))
fish <- merge(fish_1, caps, by = 'fishid')
fish
fishid year agei agec
1 a 2012 1 2
2 a 2012 2 2
3 b 2015 1 3
4 b 2015 2 3
5 b 2015 3 3
6 c 2013 1 4
7 c 2013 2 4
8 c 2013 3 4
9 c 2013 4 4
10 d 2012 1 2
11 d 2012 2 2
12 e 2015 1 3
13 e 2015 2 3
14 e 2015 3 3
I'm hoping someone can help me achieve this data structure concisely within a pipeline. All of the similar questions I have found have been very verbose and not specific to this issue. I am new to the tidyverse and am having trouble getting the group_by() function (to replace the ddply() call) to work within a pipe; I'm hoping there is a simpler way.
UPDATE
For those interested, it appears both answers below are correct. The reason I struggled was that I was already completing other data manipulations within my pipeline and tried to create the agec column inside an earlier dplyr::mutate() call. You can refer to my comment on @Thomas's answer to see the error in my ways. Hope this helps.
Try dplyr instead of plyr
library(dplyr)
fish_1 %>%
  group_by(fishid) %>%
  mutate(agec = max(agei))
You can use group_by from dplyr to group your fish IDs and then simply call mutate (also dplyr) with max:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
                     fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
                     agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
fish_1 %>%
  group_by(fishid) %>%
  mutate(agec = max(agei))
# A tibble: 14 x 4
# Groups: fishid [5]
year fishid agei agec
<dbl> <chr> <dbl> <dbl>
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3
An option with data.table
library(data.table)
setDT(fish_1)[, agec := max(agei, na.rm = TRUE), fishid]
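For completeness, the same result in base R (my own sketch, not from the original answers): ave() applies a function within groups and returns a vector of the original length, so no separate summary-and-merge step is needed.
# maximum agei per fishid, recycled to fill each group's rows
fish_1$agec <- ave(fish_1$agei, fish_1$fishid, FUN = max)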

How do you select a max of one column and not NA's in another column in R?

I'm looking for a way in R to select, for each Year, the row with the maximum col1 among the rows where col2 is not NA.
Example dataframe named df1:
#df1
Year col1 col2
2016 4 NA # has NA
2016 2 NA # has NA
2016 1 3 # this is the max for 2016
2017 3 NA
2017 2 3 # this is the max for 2017
2017 1 3
2018 2 4 # this is the max for 2018
2018 1 NA
I would like the new dataset to only return
Year col1 col2
2016 1 3
2017 2 3
2018 2 4
If anyone can help, it would be very much appreciated.
In base R
out <- na.omit(df1)
merge(aggregate(col1 ~ Year, out, max), out) # thanks to Rui
# Year col1 col2
#1 2016 1 3
#2 2017 2 3
#3 2018 2 4
Using dplyr:
library(dplyr)
df1 %>%
  filter(!is.na(col2)) %>%
  group_by(Year) %>%
  arrange(desc(col1)) %>%
  slice(1)
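With dplyr 1.0.0 or later, slice_max() expresses the same idea more directly; a sketch assuming the same df1:
df1 %>%
  filter(!is.na(col2)) %>%
  group_by(Year) %>%
  slice_max(col1, n = 1)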
Using data.table:
library(data.table)
setDT(df1)
df1[!is.na(col2), .SD[which.max(col1)], by = Year]
This works in a fresh R session:
library(data.table)
dt = fread("Year col1 col2
2016 4 NA
2016 2 NA
2016 1 3
2017 3 NA
2017 2 3
2017 1 3
2018 2 4
2018 1 NA")
dt[!is.na(col2), .SD[which.max(col1)], by = Year]
# Year col1 col2
# 1: 2016 1 3
# 2: 2017 2 3
# 3: 2018 2 4

How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of the variable "assets", but I have some NAs. However, I cannot simply delete the NA observations; I need to delete the NA and all subsequent observations within the group.
Here is an example:
productreference <- c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year <- c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets <- c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf <- data.frame(productreference, Year, assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to carry out functions by group using plyr, and I have also been able to create a 0-1 column, where 0 indicates that assets has a valid entry and 1 flags a missing (NA) value.
mydf$missing <- ifelse(mydf$assets >= 0, 0, 1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
  filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[, nas := cumsum(is.na(assets)), by = "productreference"][nas == 0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
                    function(x) cumsum(is.na(x$assets)) == 0), mydf$productreference), ]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
setDT(mydf)[, if (any(is.na(assets))) .SD[seq(which(is.na(assets))[1] - 1)]
            else .SD, by = productreference]
You can do it using base R and a for loop. This code is a bit longer than some of the other answers. In the loop we subset mydf by productreference, and for every subset we look for the first occurrence of an NA in assets and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)) {
  s1 <- mydf[mydf$productreference == i, ]
  # keep rows up to (but not including) the first NA in assets
  s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets))) - 1), ]
  mydf2 <- rbind(mydf2, s2)
  mydf2 <- mydf2[!is.na(mydf2$assets), ]  # guards the edge case where a group's first row is NA
}
mydf2
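For completeness, the same per-group cumsum logic can be vectorized in base R with ave() (my own sketch, assuming mydf as defined above); rows at or after a group's first NA get a positive cumulative count and are dropped:
# cumulative count of NAs in assets, computed within each productreference
mydf[ave(is.na(mydf$assets), mydf$productreference, FUN = cumsum) == 0, ]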

Drop subgroup of obs in dataframe if first observation of group is NA

In R I have a dataframe df of this form:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
NA 5 2011 04 1234759
5 5 2011 05 1234759
5 5 2011 06 1234759
2 2 2001 11 1234760
NA NA 2001 11 1234760
Some of the a's and b's are NAs. I wish to subset the dataframe by id, order each subset by year and month, and then drop the whole subset/id if, in time order, the first observation of either a or b is NA.
For the example above, the intended result is:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
2 2 2001 11 1234760
NA NA 2001 11 1234760
I did it the non-vectorized way, which took forever to run, as follows:
df_summary <- as.data.frame(table(df$id), stringsAsFactors = FALSE)
df <- df[order(df$id, df$year, df$month), ]
remove <- ""
j <- 1
l <- 0
for (i in 1:nrow(df_summary)) {
  m <- df_summary$Freq[i]  # number of rows for this id
  if (is.na(df$a[j]) | is.na(df$b[j])) {
    l <- l + 1
    remove[l] <- df_summary$Var1[i]  # table() names the grouping column Var1
  }
  j <- j + m
}
df <- df[!(df$id %in% remove), ]
What is a faster, vectorized way, to achieve the same result?
What I tried, also to double-check my code:
dt <- setDT(df)
remove_vectorized <- dt[, list(remove_first_na = (is.na(a[1]) | is.na(b[1]))), by = id]
which suggests removing ALL observations, which is patently wrong.
Here are a few possible data.table approaches.
First, fixing your attempt:
library(data.table)
setDT(df)[, if(!is.na(a[1L]) & !is.na(b[1L])) .SD, by = id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or we can generalize this (probably at the expense of speed):
setDT(df)[, if(Reduce(`&`, !is.na(.SD[1L, .(a, b)]))) .SD, by = id]
## Or maybe `setDT(df)[, if(Reduce(`&`, !sapply(.SD[1L, .(a, b)], is.na))) .SD, by = id]`
## in order to avoid the matrix conversion
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Another way is to combine the unique and na.omit methods:
indx <- na.omit(unique(setDT(df), by = "id"), by = c("a", "b"))
Then, a simple subset will do
df[id %in% indx$id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or maybe a binary join?
df[indx[, .(id)], on = "id"]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or
indx <- na.omit(unique(setDT(df, key = "id")), by = c("a", "b"))
df[.(indx$id)]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
(The last two are mainly for illustration)
For more info regarding data.table, please visit Getting Started on GH
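If you prefer dplyr, a rough equivalent (my own sketch, not from the original answers) keeps a group only when its first row in time order has no NA in a or b:
library(dplyr)
df %>%
  arrange(id, year, month) %>%
  group_by(id) %>%
  filter(!is.na(first(a)) & !is.na(first(b))) %>%
  ungroup()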

expand.grid() based on values in two variables in R

I would like to expand a grid in R such that the expansion occurs over unique values of one variable but joint values of two variables. For example:
frame <- data.frame(id = seq(1:2), id2 = seq(1:2), year = c(2005, 2008))
I would like to expand the frame for each year, but such that id and id2 are considered jointly (e.g. (1,1) and (2,2)), to generate output like:
id id2 year
1 1 2005
1 1 2006
1 1 2007
1 1 2008
2 2 2005
2 2 2006
2 2 2007
2 2 2008
Does someone know how to do this using expand.grid()? I have not been able to wrangle the code past looking at each id uniquely, which produces a frame with all combinations, given the following code:
with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), id2 = unique(id2)))
Thanks for any and all help.
You could do this with reshape::expand.grid.df:
require(reshape)
expand.grid.df(data.frame(id = 1:2, id2 = 1:2), data.frame(year = 2005:2008))
id id2 year
1 1 1 2005
2 2 2 2005
3 1 1 2006
4 2 2 2006
5 1 1 2007
6 2 2 2007
7 1 1 2008
8 2 2 2008
Here is another way using base R:
indx <- diff(frame$year) + 1
indx1 <- rep(1:nrow(frame), each = indx)
frame1 <- transform(frame[indx1, 1:2], year = seq(frame$year[1], length.out = indx, by = 1))
row.names(frame1) <- NULL
frame1
# id id2 year
#1 1 1 2005
#2 1 1 2006
#3 1 1 2007
#4 1 1 2008
#5 2 2 2005
#6 2 2 2006
#7 2 2 2007
#8 2 2 2008
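Another base R option (my suggestion, not part of the original answers): when merge() is given two data frames with no columns in common, it returns their Cartesian product, which reproduces the grid directly:
# cross join: no shared column names, so merge() expands every row pairing
merge(data.frame(id = 1:2, id2 = 1:2), data.frame(year = 2005:2008))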
