Basic lag in R vector/dataframe

Basic lag in R vector/dataframe - r

Will most likely expose that I am new to R, but in SPSS, running lags is very easy. Obviously this is user error, but what I am missing?
x <- sample(c(1:9), 10, replace = T)
y <- lag(x, 1)
ds <- cbind(x, y)
ds
Results in:
x y
[1,] 4 4
[2,] 6 6
[3,] 3 3
[4,] 4 4
[5,] 3 3
[6,] 5 5
[7,] 8 8
[8,] 9 9
[9,] 3 3
[10,] 7 7
I figured I would see:
x y
[1,] 4
[2,] 6 4
[3,] 3 6
[4,] 4 3
[5,] 3 4
[6,] 5 3
[7,] 8 5
[8,] 9 8
[9,] 3 9
[10,] 7 3
Any guidance will be much appreciated.

I had the same problem, but I didn't want to use zoo or xts, so I wrote a simple lag function for data frames:
lagpad <- function(x, k) {
if (k>0) {
return (c(rep(NA, k), x)[1 : length(x)] );
}
else {
return (c(x[(-k+1) : length(x)], rep(NA, -k)));
}
}
This can lag forward or backwards:
x<-1:3;
(cbind(x, lagpad(x, 1), lagpad(x,-1)))
x
[1,] 1 NA 2
[2,] 2 1 3
[3,] 3 2 NA

Another way to deal with this is using the zoo package, which has a lag method that will pad the result with NA:
require(zoo)
> set.seed(123)
> x <- zoo(sample(c(1:9), 10, replace = T))
> y <- lag(x, -1, na.pad = TRUE)
> cbind(x, y)
x y
1 3 NA
2 8 3
3 4 8
4 8 4
5 9 8
6 1 9
7 5 1
8 9 5
9 5 9
10 5 5
The result is a multivariate zoo object (which is an enhanced matrix), but easily converted to a data.frame via
> data.frame(cbind(x, y))

lag does not shift the data, it only shifts the "time-base". x has no "time base", so cbind does not work as you expected. Try cbind(as.ts(x),lag(x)) and notice that a "lag" of 1 shifts the periods forward.
I would suggesting using zoo / xts for time series. The zoo vignettes are particularly helpful.

Using just standard R functions this can be achieved in a much simpler way:
x <- sample(c(1:9), 10, replace = T)
y <- c(NA, head(x, -1))
ds <- cbind(x, y)
ds

lag() works with time series, whereas you are trying to use bare matrices. This old question suggests using embed instead, like so:
lagmatrix <- function(x,max.lag) embed(c(rep(NA,max.lag), x), max.lag+1)
for instance
> x
[1] 8 2 3 9 8 5 6 8 5 8
> lagmatrix(x, 1)
[,1] [,2]
[1,] 8 NA
[2,] 2 8
[3,] 3 2
[4,] 9 3
[5,] 8 9
[6,] 5 8
[7,] 6 5
[8,] 8 6
[9,] 5 8
[10,] 8 5

The easiest way to me now appears to be the following:
require(dplyr)
df <- data.frame(x = sample(c(1:9), 10, replace = T))
df <- df %>% mutate(y = lag(x))

tmp<-rnorm(10)
tmp2<-c(NA,tmp[1:length(tmp)-1])
tmp
tmp2

This should accommodate vectors or matrices as well as negative lags:
lagpad <- function(x, k=1) {
i<-is.vector(x)
if(is.vector(x)) x<-matrix(x) else x<-matrix(x,nrow(x))
if(k>0) {
x <- rbind(matrix(rep(NA, k*ncol(x)),ncol=ncol(x)), matrix(x[1:(nrow(x)-k),], ncol=ncol(x)))
}
else {
x <- rbind(matrix(x[(-k+1):(nrow(x)),], ncol=ncol(x)),matrix(rep(NA, -k*ncol(x)),ncol=ncol(x)))
}
if(i) x[1:length(x)] else x
}

Using data.table:
> x <- sample(c(1:9), 10, replace = T)
> y <- data.table::shift(x)
> ds <- cbind(x, y)
> ds
x y
[1,] 5 NA
[2,] 4 5
[3,] 3 4
[4,] 3 3
[5,] 4 3
[6,] 8 4
[7,] 1 8
[8,] 7 1
[9,] 9 7
[10,] 7 9

a simple way to do the same may be copying the data to a new data
frame and changing the index number. Make sure the original table is indexed sequentially with no gaps
e.g.
tempData <- originalData
rownames(tempData) <- 2:(nrow(tempData)+1)
if you want it in the same data frame as the original use a cbind function

Two options, in base R and with data.table:
baseShiftBy1 <- function(x) c(NA, x[-length(x)])
baseShiftBy1(x)
[1] NA 3 8 4 8 9 1 5 9 5
data.table::shift(x)
[1] NA 3 8 4 8 9 1 5 9 5
Data:
set.seed(123)
(x <- sample(c(1:9), 10, replace = T))
[1] 3 8 4 8 9 1 5 9 5 5

I went with a similar solution to Andrew's (dedicated function instead of xts or zoo), but with a terser formulation that I find easier to reason about:
lagpad <- function(x, k) {
if (k == 0) { return(x) }
k.pos <- max(0, k)
k.neg <- max(0, -k)
c(rep(NA, k.pos), head(x, -k.pos), # empty if k<0, else lagging x
tail(x, -k.neg), rep(NA, k.neg)) # empty if k>0, else leading x
}

Just get rid of lag. Change your line for y to:
y <- c(NA, x[-1])

Related

How to use the base R lag-function? [duplicate]

Will most likely expose that I am new to R, but in SPSS, running lags is very easy. Obviously this is user error, but what I am missing?
x <- sample(c(1:9), 10, replace = T)
y <- lag(x, 1)
ds <- cbind(x, y)
ds
Results in:
x y
[1,] 4 4
[2,] 6 6
[3,] 3 3
[4,] 4 4
[5,] 3 3
[6,] 5 5
[7,] 8 8
[8,] 9 9
[9,] 3 3
[10,] 7 7
I figured I would see:
x y
[1,] 4
[2,] 6 4
[3,] 3 6
[4,] 4 3
[5,] 3 4
[6,] 5 3
[7,] 8 5
[8,] 9 8
[9,] 3 9
[10,] 7 3
Any guidance will be much appreciated.

I had the same problem, but I didn't want to use zoo or xts, so I wrote a simple lag function for data frames:
lagpad <- function(x, k) {
if (k>0) {
return (c(rep(NA, k), x)[1 : length(x)] );
}
else {
return (c(x[(-k+1) : length(x)], rep(NA, -k)));
}
}
This can lag forward or backwards:
x<-1:3;
(cbind(x, lagpad(x, 1), lagpad(x,-1)))
x
[1,] 1 NA 2
[2,] 2 1 3
[3,] 3 2 NA

Another way to deal with this is using the zoo package, which has a lag method that will pad the result with NA:
require(zoo)
> set.seed(123)
> x <- zoo(sample(c(1:9), 10, replace = T))
> y <- lag(x, -1, na.pad = TRUE)
> cbind(x, y)
x y
1 3 NA
2 8 3
3 4 8
4 8 4
5 9 8
6 1 9
7 5 1
8 9 5
9 5 9
10 5 5
The result is a multivariate zoo object (which is an enhanced matrix), but easily converted to a data.frame via
> data.frame(cbind(x, y))

lag does not shift the data, it only shifts the "time-base". x has no "time base", so cbind does not work as you expected. Try cbind(as.ts(x),lag(x)) and notice that a "lag" of 1 shifts the periods forward.
I would suggesting using zoo / xts for time series. The zoo vignettes are particularly helpful.

Using just standard R functions this can be achieved in a much simpler way:
x <- sample(c(1:9), 10, replace = T)
y <- c(NA, head(x, -1))
ds <- cbind(x, y)
ds

lag() works with time series, whereas you are trying to use bare matrices. This old question suggests using embed instead, like so:
lagmatrix <- function(x,max.lag) embed(c(rep(NA,max.lag), x), max.lag+1)
for instance
> x
[1] 8 2 3 9 8 5 6 8 5 8
> lagmatrix(x, 1)
[,1] [,2]
[1,] 8 NA
[2,] 2 8
[3,] 3 2
[4,] 9 3
[5,] 8 9
[6,] 5 8
[7,] 6 5
[8,] 8 6
[9,] 5 8
[10,] 8 5

The easiest way to me now appears to be the following:
require(dplyr)
df <- data.frame(x = sample(c(1:9), 10, replace = T))
df <- df %>% mutate(y = lag(x))

tmp<-rnorm(10)
tmp2<-c(NA,tmp[1:length(tmp)-1])
tmp
tmp2

This should accommodate vectors or matrices as well as negative lags:
lagpad <- function(x, k=1) {
i<-is.vector(x)
if(is.vector(x)) x<-matrix(x) else x<-matrix(x,nrow(x))
if(k>0) {
x <- rbind(matrix(rep(NA, k*ncol(x)),ncol=ncol(x)), matrix(x[1:(nrow(x)-k),], ncol=ncol(x)))
}
else {
x <- rbind(matrix(x[(-k+1):(nrow(x)),], ncol=ncol(x)),matrix(rep(NA, -k*ncol(x)),ncol=ncol(x)))
}
if(i) x[1:length(x)] else x
}

Using data.table:
> x <- sample(c(1:9), 10, replace = T)
> y <- data.table::shift(x)
> ds <- cbind(x, y)
> ds
x y
[1,] 5 NA
[2,] 4 5
[3,] 3 4
[4,] 3 3
[5,] 4 3
[6,] 8 4
[7,] 1 8
[8,] 7 1
[9,] 9 7
[10,] 7 9

a simple way to do the same may be copying the data to a new data
frame and changing the index number. Make sure the original table is indexed sequentially with no gaps
e.g.
tempData <- originalData
rownames(tempData) <- 2:(nrow(tempData)+1)
if you want it in the same data frame as the original use a cbind function

Two options, in base R and with data.table:
baseShiftBy1 <- function(x) c(NA, x[-length(x)])
baseShiftBy1(x)
[1] NA 3 8 4 8 9 1 5 9 5
data.table::shift(x)
[1] NA 3 8 4 8 9 1 5 9 5
Data:
set.seed(123)
(x <- sample(c(1:9), 10, replace = T))
[1] 3 8 4 8 9 1 5 9 5 5

I went with a similar solution to Andrew's (dedicated function instead of xts or zoo), but with a terser formulation that I find easier to reason about:
lagpad <- function(x, k) {
if (k == 0) { return(x) }
k.pos <- max(0, k)
k.neg <- max(0, -k)
c(rep(NA, k.pos), head(x, -k.pos), # empty if k<0, else lagging x
tail(x, -k.neg), rep(NA, k.neg)) # empty if k>0, else leading x
}

Just get rid of lag. Change your line for y to:
y <- c(NA, x[-1])

Sorting year and month data in a customized manner in R [duplicate]

Will most likely expose that I am new to R, but in SPSS, running lags is very easy. Obviously this is user error, but what I am missing?
x <- sample(c(1:9), 10, replace = T)
y <- lag(x, 1)
ds <- cbind(x, y)
ds
Results in:
x y
[1,] 4 4
[2,] 6 6
[3,] 3 3
[4,] 4 4
[5,] 3 3
[6,] 5 5
[7,] 8 8
[8,] 9 9
[9,] 3 3
[10,] 7 7
I figured I would see:
x y
[1,] 4
[2,] 6 4
[3,] 3 6
[4,] 4 3
[5,] 3 4
[6,] 5 3
[7,] 8 5
[8,] 9 8
[9,] 3 9
[10,] 7 3
Any guidance will be much appreciated.

I had the same problem, but I didn't want to use zoo or xts, so I wrote a simple lag function for data frames:
lagpad <- function(x, k) {
if (k>0) {
return (c(rep(NA, k), x)[1 : length(x)] );
}
else {
return (c(x[(-k+1) : length(x)], rep(NA, -k)));
}
}
This can lag forward or backwards:
x<-1:3;
(cbind(x, lagpad(x, 1), lagpad(x,-1)))
x
[1,] 1 NA 2
[2,] 2 1 3
[3,] 3 2 NA

Another way to deal with this is using the zoo package, which has a lag method that will pad the result with NA:
require(zoo)
> set.seed(123)
> x <- zoo(sample(c(1:9), 10, replace = T))
> y <- lag(x, -1, na.pad = TRUE)
> cbind(x, y)
x y
1 3 NA
2 8 3
3 4 8
4 8 4
5 9 8
6 1 9
7 5 1
8 9 5
9 5 9
10 5 5
The result is a multivariate zoo object (which is an enhanced matrix), but easily converted to a data.frame via
> data.frame(cbind(x, y))

lag does not shift the data, it only shifts the "time-base". x has no "time base", so cbind does not work as you expected. Try cbind(as.ts(x),lag(x)) and notice that a "lag" of 1 shifts the periods forward.
I would suggesting using zoo / xts for time series. The zoo vignettes are particularly helpful.

Using just standard R functions this can be achieved in a much simpler way:
x <- sample(c(1:9), 10, replace = T)
y <- c(NA, head(x, -1))
ds <- cbind(x, y)
ds

lag() works with time series, whereas you are trying to use bare matrices. This old question suggests using embed instead, like so:
lagmatrix <- function(x,max.lag) embed(c(rep(NA,max.lag), x), max.lag+1)
for instance
> x
[1] 8 2 3 9 8 5 6 8 5 8
> lagmatrix(x, 1)
[,1] [,2]
[1,] 8 NA
[2,] 2 8
[3,] 3 2
[4,] 9 3
[5,] 8 9
[6,] 5 8
[7,] 6 5
[8,] 8 6
[9,] 5 8
[10,] 8 5

The easiest way to me now appears to be the following:
require(dplyr)
df <- data.frame(x = sample(c(1:9), 10, replace = T))
df <- df %>% mutate(y = lag(x))

tmp<-rnorm(10)
tmp2<-c(NA,tmp[1:length(tmp)-1])
tmp
tmp2

This should accommodate vectors or matrices as well as negative lags:
lagpad <- function(x, k=1) {
i<-is.vector(x)
if(is.vector(x)) x<-matrix(x) else x<-matrix(x,nrow(x))
if(k>0) {
x <- rbind(matrix(rep(NA, k*ncol(x)),ncol=ncol(x)), matrix(x[1:(nrow(x)-k),], ncol=ncol(x)))
}
else {
x <- rbind(matrix(x[(-k+1):(nrow(x)),], ncol=ncol(x)),matrix(rep(NA, -k*ncol(x)),ncol=ncol(x)))
}
if(i) x[1:length(x)] else x
}

Using data.table:
> x <- sample(c(1:9), 10, replace = T)
> y <- data.table::shift(x)
> ds <- cbind(x, y)
> ds
x y
[1,] 5 NA
[2,] 4 5
[3,] 3 4
[4,] 3 3
[5,] 4 3
[6,] 8 4
[7,] 1 8
[8,] 7 1
[9,] 9 7
[10,] 7 9

a simple way to do the same may be copying the data to a new data
frame and changing the index number. Make sure the original table is indexed sequentially with no gaps
e.g.
tempData <- originalData
rownames(tempData) <- 2:(nrow(tempData)+1)
if you want it in the same data frame as the original use a cbind function

Two options, in base R and with data.table:
baseShiftBy1 <- function(x) c(NA, x[-length(x)])
baseShiftBy1(x)
[1] NA 3 8 4 8 9 1 5 9 5
data.table::shift(x)
[1] NA 3 8 4 8 9 1 5 9 5
Data:
set.seed(123)
(x <- sample(c(1:9), 10, replace = T))
[1] 3 8 4 8 9 1 5 9 5 5

I went with a similar solution to Andrew's (dedicated function instead of xts or zoo), but with a terser formulation that I find easier to reason about:
lagpad <- function(x, k) {
if (k == 0) { return(x) }
k.pos <- max(0, k)
k.neg <- max(0, -k)
c(rep(NA, k.pos), head(x, -k.pos), # empty if k<0, else lagging x
tail(x, -k.neg), rep(NA, k.neg)) # empty if k>0, else leading x
}

Just get rid of lag. Change your line for y to:
y <- c(NA, x[-1])

Copy column but drop observations one row in R [duplicate]

Will most likely expose that I am new to R, but in SPSS, running lags is very easy. Obviously this is user error, but what I am missing?
x <- sample(c(1:9), 10, replace = T)
y <- lag(x, 1)
ds <- cbind(x, y)
ds
Results in:
x y
[1,] 4 4
[2,] 6 6
[3,] 3 3
[4,] 4 4
[5,] 3 3
[6,] 5 5
[7,] 8 8
[8,] 9 9
[9,] 3 3
[10,] 7 7
I figured I would see:
x y
[1,] 4
[2,] 6 4
[3,] 3 6
[4,] 4 3
[5,] 3 4
[6,] 5 3
[7,] 8 5
[8,] 9 8
[9,] 3 9
[10,] 7 3
Any guidance will be much appreciated.

I had the same problem, but I didn't want to use zoo or xts, so I wrote a simple lag function for data frames:
lagpad <- function(x, k) {
if (k>0) {
return (c(rep(NA, k), x)[1 : length(x)] );
}
else {
return (c(x[(-k+1) : length(x)], rep(NA, -k)));
}
}
This can lag forward or backwards:
x<-1:3;
(cbind(x, lagpad(x, 1), lagpad(x,-1)))
x
[1,] 1 NA 2
[2,] 2 1 3
[3,] 3 2 NA

Another way to deal with this is using the zoo package, which has a lag method that will pad the result with NA:
require(zoo)
> set.seed(123)
> x <- zoo(sample(c(1:9), 10, replace = T))
> y <- lag(x, -1, na.pad = TRUE)
> cbind(x, y)
x y
1 3 NA
2 8 3
3 4 8
4 8 4
5 9 8
6 1 9
7 5 1
8 9 5
9 5 9
10 5 5
The result is a multivariate zoo object (which is an enhanced matrix), but easily converted to a data.frame via
> data.frame(cbind(x, y))

lag does not shift the data, it only shifts the "time-base". x has no "time base", so cbind does not work as you expected. Try cbind(as.ts(x),lag(x)) and notice that a "lag" of 1 shifts the periods forward.
I would suggesting using zoo / xts for time series. The zoo vignettes are particularly helpful.

Using just standard R functions this can be achieved in a much simpler way:
x <- sample(c(1:9), 10, replace = T)
y <- c(NA, head(x, -1))
ds <- cbind(x, y)
ds

lag() works with time series, whereas you are trying to use bare matrices. This old question suggests using embed instead, like so:
lagmatrix <- function(x,max.lag) embed(c(rep(NA,max.lag), x), max.lag+1)
for instance
> x
[1] 8 2 3 9 8 5 6 8 5 8
> lagmatrix(x, 1)
[,1] [,2]
[1,] 8 NA
[2,] 2 8
[3,] 3 2
[4,] 9 3
[5,] 8 9
[6,] 5 8
[7,] 6 5
[8,] 8 6
[9,] 5 8
[10,] 8 5

The easiest way to me now appears to be the following:
require(dplyr)
df <- data.frame(x = sample(c(1:9), 10, replace = T))
df <- df %>% mutate(y = lag(x))

tmp<-rnorm(10)
tmp2<-c(NA,tmp[1:length(tmp)-1])
tmp
tmp2

This should accommodate vectors or matrices as well as negative lags:
lagpad <- function(x, k=1) {
i<-is.vector(x)
if(is.vector(x)) x<-matrix(x) else x<-matrix(x,nrow(x))
if(k>0) {
x <- rbind(matrix(rep(NA, k*ncol(x)),ncol=ncol(x)), matrix(x[1:(nrow(x)-k),], ncol=ncol(x)))
}
else {
x <- rbind(matrix(x[(-k+1):(nrow(x)),], ncol=ncol(x)),matrix(rep(NA, -k*ncol(x)),ncol=ncol(x)))
}
if(i) x[1:length(x)] else x
}

Using data.table:
> x <- sample(c(1:9), 10, replace = T)
> y <- data.table::shift(x)
> ds <- cbind(x, y)
> ds
x y
[1,] 5 NA
[2,] 4 5
[3,] 3 4
[4,] 3 3
[5,] 4 3
[6,] 8 4
[7,] 1 8
[8,] 7 1
[9,] 9 7
[10,] 7 9

a simple way to do the same may be copying the data to a new data
frame and changing the index number. Make sure the original table is indexed sequentially with no gaps
e.g.
tempData <- originalData
rownames(tempData) <- 2:(nrow(tempData)+1)
if you want it in the same data frame as the original use a cbind function

Two options, in base R and with data.table:
baseShiftBy1 <- function(x) c(NA, x[-length(x)])
baseShiftBy1(x)
[1] NA 3 8 4 8 9 1 5 9 5
data.table::shift(x)
[1] NA 3 8 4 8 9 1 5 9 5
Data:
set.seed(123)
(x <- sample(c(1:9), 10, replace = T))
[1] 3 8 4 8 9 1 5 9 5 5

I went with a similar solution to Andrew's (dedicated function instead of xts or zoo), but with a terser formulation that I find easier to reason about:
lagpad <- function(x, k) {
if (k == 0) { return(x) }
k.pos <- max(0, k)
k.neg <- max(0, -k)
c(rep(NA, k.pos), head(x, -k.pos), # empty if k<0, else lagging x
tail(x, -k.neg), rep(NA, k.neg)) # empty if k>0, else leading x
}

Just get rid of lag. Change your line for y to:
y <- c(NA, x[-1])

Filtering data in R with same ID and determining rows which are in both data frames and which rows are not in both data frames

Does anyone know another method for filtering data when there is twice the same ID (Column X) in a data frame but with a different associate value (columns Y)?
Basically I wan to know which rows are in both data frame and after I want to know which row is not in both data frame (Actually I want the value of X and Y of this particular row)
Thank you in advance for your help!
> x <- seq(1:10)
> x[5] <- 4
> y <- (seq.int(1,19,2))
>
> x<- cbind(x,y)
> x
x y
[1,] 1 1
[2,] 2 3
[3,] 3 5
[4,] 4 7
[5,] 4 9
[6,] 6 11
[7,] 7 13
[8,] 8 15
[9,] 9 17
[10,] 10 19
>
> z <- x[1:4,]
> y <- x[6:10,]
>
> z <- rbind(z,y)
> z
x y
[1,] 1 1
[2,] 2 3
[3,] 3 5
[4,] 4 7
[5,] 6 11
[6,] 7 13
[7,] 8 15
[8,] 9 17
[9,] 10 19
>
> df1 <- z[z[,1] %in% x[,1]]
>
> matrix(df1,9,2) # As expected I'm getting 9 rows
[,1] [,2]
[1,] 1 1
[2,] 2 3
[3,] 3 5
[4,] 4 7
[5,] 6 11
[6,] 7 13
[7,] 8 15
[8,] 9 17
[9,] 10 19
>
> # Now I want to know what is the value inside the missing row
> df2 <- z[!z[,1] %in% x[,1]]
>
> matrix(df2,1,2) # I'm getting NA and NA, bu I was expecting the values 4 and 9
[,1] [,2]
[1,] NA NA

To use #hansjaneinvielleicht method:
xlist <- paste(x[,1], x[,2])
zlist <- paste(z[,1], z[,2])
setdiff(xlist, zlist)
# [1] "4 9"

What you're doing here is to filter for values that are not present in x[,1]. However, since 4 is in there, it's also filtered out.
Instead, I assume you'd probably want to work with setdiff method from dplyr (see the doc here)
Then use df2 <- setdiff(x, z)

I am using the cumcount here to adding another key for distinguish the duplicate value in x[,1]
v=ave(x[,1]==x[,1], x[,1], FUN=cumsum)
t=ave(z[,1]==z[,1], z[,1], FUN=cumsum)
df2 <- x[!paste(x[,1],v) %in% paste(z[,1],t)]
matrix(df2,1,2)
[,1] [,2]
[1,] 4 9

x <- data.frame(x)
z <- data.frame(z)
x$from <- "x"
z$from <- "z"
df2 <- merge(x, z, by = c("x", "y"), all.x = T)
df2
# x y from.x from.y
# 1 1 1 x z
# 2 2 3 x z
# 3 3 5 x z
# 4 4 7 x z
# 5 4 9 x <NA>
# 6 6 11 x z
# 7 7 13 x z
# 8 8 15 x z
# 9 9 17 x z
# 10 10 19 x z
df2 <- df2[is.na(df2$from.y),]
df2
# x y from.x from.y
# 5 4 9 x <NA>

Since my real problem was not the one posted since it was too complicated.
Basically, I was not able to apply any solution to my real problem since my real data frames were containing all data types and had a lot of columns.
But I was able to found a solution than work for my real problem but also for the problem posted in the question, so I post the answer than solved my real problem in case it can be useful for someone!
> dup <- which(duplicated(x[,1]) == TRUE)
> ans <- matrix(x[dup,],1,2)
> ans
[,1] [,2]
[1,] 4 9
> # I'm doing this in case the answer was not NA in df2 at the previous step, without
# providing the row "missing"
> df2 <- rbind(df2, ans)
> df2
[,1] [,2]
[1,] 4 9

Need to vectorize function that using loop (replace NA rows with values from vector)

How I can rewrite this function to vectorized variant. As I know, using loops are not good practice in R:
# replaces rows that contains all NAs with non-NA values from previous row and K-th column
na.replace <- function(x, k) {
for (i in 2:nrow(x)) {
if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
x[i, ] <- x[i - 1, k]
}
}
x
}
This is input data and returned data for function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9

Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all) which is a bit more readable.
If we knew that if any row has an NA in column 2 then all columns of that row are NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2])
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
REVISED Some revisions to improve speed as in comments below. Also added alternatives in discussion.

Notice that, unless you have a pathological condition where the first row is all NANA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i−1,]))all(is.na(x[i - 1, ])) is T or F because in the previous time thru the loop you "fixed" row i−1i-1 .
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) treat the k-th column as a vector, e.g. c(NA,1,NA,NA,3,NA,4,NA,NA) and "fill-down" all numeric values. That's been done many times on SO questions.
2) Every row which is entirely NA except for column k gets filled with that same value.
I think that's still best done using either a loop or apply
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.

The most important part in this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]
dt[!cond, names(dt) := .SD[[k]]]
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9

Here is another base only vectorized approach:
na.replace <- function(x, k) {
is.all.na <- rowSums(is.na(x)) == ncol(x)
ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
ref.idx[ref.idx == 0] <- NA
x[is.all.na, ] <- x[ref.idx[is.all.na], k]
x
}
And for fair comparison with #Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).

Finally I realized my version of vectorized solution and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Move Forward
# works as na.locf but much faster and accepts only 1D structures
na.lomf <- function(object, na.rm = F) {
idx <- which(!is.na(object))
if (!na.rm && is.na(object[1])) idx <- c(1, idx)
rep.int(object[idx], diff(c(idx, length(object) + 1)))
}
na.replace <- function(x, k) {
v <- x[, k]
i <- which(is.na(v))
r <- na.lomf(v)
x[i, ] <- r[i]
x
}

Here's a workaround with the na.locf function from zoo:
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm=F),]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Basic lag in R vector/dataframe - r

Using just standard R functions this can be achieved in a much simpler way: x <- sample(c(1:9), 10, replace = T) y <- c(NA, head(x, -1)) ds <- cbind(x, y) ds

The easiest way to me now appears to be the following: require(dplyr) df <- data.frame(x = sample(c(1:9), 10, replace = T)) df <- df %>% mutate(y = lag(x))

tmp<-rnorm(10) tmp2<-c(NA,tmp[1:length(tmp)-1]) tmp tmp2

Using data.table: > x <- sample(c(1:9), 10, replace = T) > y <- data.table::shift(x) > ds <- cbind(x, y) > ds x y [1,] 5 NA [2,] 4 5 [3,] 3 4 [4,] 3 3 [5,] 4 3 [6,] 8 4 [7,] 1 8 [8,] 7 1 [9,] 9 7 [10,] 7 9

Two options, in base R and with data.table: baseShiftBy1 <- function(x) c(NA, x[-length(x)]) baseShiftBy1(x) [1] NA 3 8 4 8 9 1 5 9 5 data.table::shift(x) [1] NA 3 8 4 8 9 1 5 9 5 Data: set.seed(123) (x <- sample(c(1:9), 10, replace = T)) [1] 3 8 4 8 9 1 5 9 5 5

Just get rid of lag. Change your line for y to: y <- c(NA, x[-1])

Related

How to use the base R lag-function? [duplicate]

Sorting year and month data in a customized manner in R [duplicate]

Copy column but drop observations one row in R [duplicate]

Filtering data in R with same ID and determining rows which are in both data frames and which rows are not in both data frames

Need to vectorize function that using loop (replace NA rows with values from vector)

Categories

Resources