I am trying to use the lag function from the dplyr package. However, when I give a lag > 0, I want the missing values to be replaced by the first value in x. How can I achieve this?
library(dplyr)
x<-c(1,2,3,4)
z<-lag(x,2)
z
## [1] NA NA 1 2
Since you are using the lag function from dplyr, there is a default argument that sets the value used for the vacated positions. So you can specify x[1] as the default.
lag(x, 2, default=x[1])
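If dplyr is not available, the same result can be sketched in base R by padding the front of the vector with its first value. The helper name lag_fill_first is hypothetical and not part of any package:

```r
# Hypothetical helper: lag x by k positions, filling the gap with x[1].
lag_fill_first <- function(x, k = 1L) {
  stopifnot(k >= 0)
  n <- length(x)
  if (k == 0L || n == 0L) return(x)   # nothing to shift
  k <- min(k, n)                      # a lag longer than x fills everything
  c(rep(x[1], k), x[seq_len(n - k)])
}

lag_fill_first(c(1, 2, 3, 4), 2)
# [1] 1 1 1 2
```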
Here's a modified function mylag:
mylag <- function(x, k = 1, ...)
  # seq_len(k) is empty for k = 0, so nothing is overwritten in that case
  replace(lag(x, k, ...), seq_len(k), x[1])
x <- 1:4
mylag(x, k = 2)
# [1] 1 1 1 2
May I suggest adapting the function so that it works both ways: for lag and lead (positive AND negative lags).
shift = function(x, lag, fill=FALSE) {
require(dplyr)
switch(sign(lag)/2+1.5,
lead( x, n=abs(lag), default=switch(fill+1, NA, tail(x, 1)) ),
lag( x, n=abs(lag), default=switch(fill+1, NA, head(x, 1)) )
)
}
It has a "fill" argument that automatically fills with the first or last value, depending on the sign of the lag.
> shift(1:10, -1)
#### [1] 2 3 4 5 6 7 8 9 10 NA
> shift(1:10, +1, fill=TRUE)
#### [1] 1 1 2 3 4 5 6 7 8 9
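The arithmetic inside the two switch() calls can be checked on its own. This is only a sketch of how the numeric selectors come out, using example lag values chosen for illustration:

```r
# sign(lag) / 2 + 1.5 maps negative lags to 1 (first branch, lead)
# and positive lags to 2 (second branch, lag).
lags <- c(-2, -1, 1, 3)
selector <- sign(lags) / 2 + 1.5
selector
# [1] 1 1 2 2

# fill + 1 maps FALSE to 1 (pad with NA) and TRUE to 2 (pad with an end value).
c(FALSE, TRUE) + 1
# [1] 1 2
```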
I have a vector that I want to modify so that it contains only elements that are equal to or larger than the previous element. The vector represents a phenomenon that should only increase or stay the same (e.g. cumulative deaths by day), but reporting errors result in elements that are less than the previous element. I want to correct this by replacing elements with previous ones until the vector meets this criterion.
raw data : 1 3 3 6 8 10 7 9 15 12
desired modified data: 1 3 3 6 6 6 7 9 9 12
library(zoo)
library(magrittr) # provides %>%
raw <- c(1, 3, 3, 6, 8, 10, 7, 9, 15, 12)
replace.errors <- function(x){
x %>%
replace(diff(x) < 0, NA) %>%
na.locf(na.rm=FALSE)
}
replace.errors(raw)
# [1] 1 3 3 6 8 8 7 9 9 12
My function does not work when several consecutive elements need to be replaced (here 8 and 10), because it just carries forward an element that is still greater than the next one.
A data.table option using nafill along with cummin
nafill(replace(raw, rev(cummin(rev(raw))) != raw, NA), type = "locf")
gives
> nafill(replace(raw, rev(cummin(rev(raw))) != raw, NA), type = "locf")
[1] 1 3 3 6 6 6 7 9 9 12
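To see why the reversed cummin flags the right elements, the intermediate vectors can be inspected in base R. The names future_min and bad are only illustrative:

```r
raw <- c(1, 3, 3, 6, 8, 10, 7, 9, 15, 12)

# For each position, the smallest value from that position to the end.
future_min <- rev(cummin(rev(raw)))
future_min
# [1]  1  3  3  6  7  7  7  9 12 12

# An element is an error if some later element is smaller than it.
bad <- raw != future_min
which(bad)
# [1] 5 6 9
```

Positions 5, 6 and 9 (the values 8, 10 and 15) are the ones set to NA before the last valid observation is carried forward.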
Following the same idea as the approach above, your function replace.errors can be defined as
replace.errors <- function(x){
x %>%
replace(rev(cummin(rev(.))) != (.), NA) %>%
na.locf()
}
such that
> replace.errors(raw)
[1] 1 3 3 6 6 6 7 9 9 12
Another option is to define a custom function like the one below
f <- function(v) {
for (k in which(c(FALSE, diff(v) < 0))) {
p <- max(v[v < v[k]])
v <- replace(v, tail(which(v == p), 1):(k - 1), p)
}
v
}
which gives
> f(raw)
[1] 1 3 3 6 6 6 7 9 9 12
Base R using @ThomasIsCoding's brilliant replace logic:
# Replace values breaching condition with NA: scrubbed => numeric vector
scrubbed <- replace(raw, rev(cummin(rev(raw))) != raw, NA_integer_)
# i) Interpolate constants:
res <- na.omit(scrubbed)[cumsum(!is.na(scrubbed))]
# OR
# ii) Interpolate constants using approx()
res <- approx(scrubbed, method = "constant", n = length(scrubbed))$y
Or in one expression:
approx(
replace(raw, rev(cummin(rev(raw))) != raw, NA_integer_),
method = "constant",
n = length(raw)
)$y
This smells a bit inefficient, and note that it drops the offending elements rather than replacing them, so the result is shorter than the input:
replace_errors <- function(raw) {
while (is.unsorted(raw)) {
raw <- raw[c(TRUE, diff(raw) >= 0)]
}
raw
}
I need to take an existing vector and create a new vector that contains the values:
$(x_1 + 2x_2 - x_3,\; x_2 + 2x_3 - x_4,\; \ldots,\; x_{n-2} + 2x_{n-1} - x_n)$
I've tried using xVec[n-2] + 2* xVec[n-1] - xVec[n] but this doesn't work!
Without zoo:
n <- 10
xVec <- seq(n)
idx <- seq(1, n-2)
xVec[idx] + 2* xVec[idx+1] - xVec[idx+2]
[1] 2 4 6 8 10 12 14 16
You need a rolling calculation, something that the zoo package provides:
vec <- 1:10
zoo::rollapply(vec, width = 3, FUN = function(z) z[1]+2*z[2]-z[3])
# [1] 2 4 6 8 10 12 14 16
Validation, using first three and last three:
1 + 2*2 - 3
# [1] 2
8 + 2*9 - 10
# [1] 16
Explanation: each time the function (passed to FUN=) is called, it is given a vector with width= elements in it. The first call is effectively z=1:3, the second call z=2:4, third z=3:5, etc.
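The windows described above can also be laid out explicitly with base R's embed(), which returns one window per row with the columns in reverse order. This is a sketch of the same calculation without zoo, not what rollapply does internally:

```r
vec <- 1:10

# Each row of w is one window of width 3, newest element first:
# row i is (vec[i + 2], vec[i + 1], vec[i]).
w <- embed(vec, 3)
head(w, 2)

# Weights are reversed to match the column order: -x3 + 2*x2 + x1.
res <- as.vector(w %*% c(-1, 2, 1))
res
# [1]  2  4  6  8 10 12 14 16
```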
You should know that by default it will return length(vec) - width + 1 elements in its return value. You can control this with fill= and align= arguments:
zoo::rollapply(1:10, width = 3, FUN = function(z) z[1]+2*z[2]-z[3], fill = NA)
# [1] NA 2 4 6 8 10 12 14 16 NA
zoo::rollapply(1:10, width = 3, FUN = function(z) z[1]+2*z[2]-z[3], fill = NA, align = "right")
# [1] NA NA 2 4 6 8 10 12 14 16
In a comment, B. Go has suggested to "reshape" the vector and wonders if this can be done in R as well.
In R, two packages provide functions to shift the elements of a vector: data.table and dplyr. (The lag() function from base R deals with time series objects.)
data.table
x <- 1:10
library(data.table)
shift(x, 2L) + 2 * shift(x) - x
[1] NA NA 2 4 6 8 10 12 14 16
dplyr
x <- 1:10
library(dplyr)
lag(x, 2L) + 2 * lag(x) - x
[1] NA NA 2 4 6 8 10 12 14 16
By default, both functions fill the positions vacated by the shift with NA. This explains why the first two elements of the result vector are NA.
To get rid of the leading NAs, the tail() function can be used, e.g.,
tail(shift(x, 2L) + 2 * shift(x) - x, -2L)
[1] 2 4 6 8 10 12 14 16
If you are up for a bit of matrix math:
xVec <- 1:10
linear_combo <- c(1, 2, -1)
m <- matrix(0, length(xVec), length(xVec))
for (index in seq_along(linear_combo)) {
m[row(m) == col(m) - index + 1] <- linear_combo[index]
}
m %*% xVec
Note in this case the last two elements are incomplete and should probably be dropped or replaced by NA.
head(m %*% xVec, -(length(linear_combo) - 1))
I would like to do a last observation carried forward for a variable, but only up to 2 observations. That is, for gaps of data of 3 or more NA, I would only carry the last observation forward for the next 2 observations and leave the rest as NA.
If I do this with zoo::na.locf, the maxgap parameter means that if a gap is longer than 2, none of its NAs are replaced, not even the first 2. Is there an alternative?
x <- c(NA,3,4,5,6,NA,NA,NA,7,8)
zoo::na.locf(x, maxgap = 2) # Doesn't replace the first 2 NAs after the 6, because the gap of NAs has length 3.
Desired_output <- c(NA,3,4,5,6,6,6,NA,7,8)
First apply na.locf0 with maxgap = 2, giving x0, and define a grouping variable g using rleid from the data.table package. For each group, use ave to apply keeper, which returns c(1, 1, NA, ..., NA) if the group is all NA and 1 otherwise. Multiply na.locf0(x) by the result.
library(data.table)
library(zoo)
mg <- 2
x0 <- na.locf0(x, maxgap = mg)
g <- rleid(is.na(x0))
keeper <- function(x) if (all(is.na(x))) ifelse(seq_along(x) <= mg, 1, NA) else 1
na.locf0(x) * ave(x0, g, FUN = keeper)
## [1] NA 3 4 5 6 6 6 NA 7 8
A solution using base R:
ave(x, cumsum(!is.na(x)), FUN = function(i){ i[1:pmin(length(i), 3)] <- i[1]; i })
# [1] NA 3 4 5 6 6 6 NA 7 8
cumsum(!is.na(x)) groups each run of NAs with the most recent non-NA value.
function(i){ i[1:pmin(length(i), 3)] <- i[1]; i } transforms the first two NAs of each group into the leading non-NA value of this group.
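The grouping step can be inspected directly, and the hard-coded 3 generalized. The function name locf_limited and its maxfill argument are hypothetical, introduced only for this sketch:

```r
x <- c(NA, 3, 4, 5, 6, NA, NA, NA, 7, 8)

# Each non-NA value starts a new group; the NAs that follow it share its id.
cumsum(!is.na(x))
# [1] 0 1 2 3 4 4 4 4 5 6

# Hypothetical generalization: carry forward at most maxfill times.
locf_limited <- function(x, maxfill = 2L) {
  ave(x, cumsum(!is.na(x)), FUN = function(g) {
    n <- min(length(g), maxfill + 1L)  # leading value plus maxfill NAs
    g[seq_len(n)] <- g[1]
    g
  })
}

locf_limited(x)
# [1] NA  3  4  5  6  6  6 NA  7  8
```

A leading NA forms its own group (id 0) whose first value is NA, so it stays NA.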
I'm trying to apply a function to a dataframe using ddply from the plyr package, but I'm getting some results that I don't understand. I have 3 questions about the results.
Given:
mydf <- data.frame(n  = c(12, 34, 9, 3, 22, 55),
                   x  = c(1, 2, 1, 1, 2, 2),
                   x1 = c(0, 1, 2, 1, 1, 2))
mydf looks like this:
n x x1
1 12 1 0
2 34 2 1
3 9 1 2
4 3 1 1
5 22 2 1
6 55 2 2
Question #1
If I do:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
mydf <- ddply(mydf, c("x") , .fun = k, .inform = TRUE)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "z", value = structure(c(12, 34, 9, :
replacement has 3 rows, data has 6
Error: with piece 1:
n x x1
1 12 1 0
2 9 1 2
3 3 1 1
I get this error regardless of whether I specify the variable to split by as c("x"), "x", or .(x). I don't understand why I'm getting this error message.
Question #2
But, what I really want to do is set up an if/else function because my dataset has variables x1, x2, x3, and x4 and I want to take those variables into account as well. But when I try something simple such as:
j <- function(x) {
if(x == 1){
mydf$z <- 0
} else {
mydf$z <- mydf$n
}
return(mydf)
}
mydf <- ddply(mydf, x, .fun = j, .inform = TRUE)
I get:
Warning messages:
1: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
2: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
Question #3
I'm confused about when to use function() and when to use function(x). Using function() for either j() or k() gives me a different error:
Error in .fun(piece, ...) : unused argument (piece)
Error: with piece 1:
n x x1 z
1 12 1 0 12
2 9 1 2 9
3 3 1 1 3
4 12 1 0 12
5 9 1 2 9
6 3 1 1 3
7 12 1 0 12
8 9 1 2 9
9 3 1 1 3
10 12 1 0 12
11 9 1 2 9
12 3 1 1 3
where column z is not correct. Yet I see a lot of functions written as function().
I sincerely appreciate any comments that can help me out with this
There's a lot that needs explaining here. Let's start with the simplest case. In your first example, all you need is:
mydf$z <- with(mydf,ifelse(x == 1,0,n))
An equivalent ddply solution might look like this:
ddply(mydf,.(x),transform,z = ifelse(x == 1,0,n))
Probably your biggest source of confusion is that you seem to not understand what is being passed as arguments to functions within ddply.
Consider your first attempt:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
The way ddply works is that it splits mydf up into several smaller data frames, based on the values in the column x. That means that each time ddply calls k, the argument passed to k is a data frame. Specifically, a subset of your primary data frame.
So within k, x is a subset of mydf, with all the columns. You should not be trying to modify mydf from within k. Modify x, and then return the modified version (if you must, though the options shown above are better). So we might rewrite your k like this:
k <- function(x) {
x$z <- ifelse(x$x == 1, 0, x$n)
return (x)
}
Note that you've created some confusion by using x both as the argument to k and as the name of one of your columns.
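The split-apply-combine idea can be sketched in base R to make visible what each call receives. This is only an illustration of the concept, not how plyr is implemented; the names piece and add_z are hypothetical, chosen to avoid the clash with the column x:

```r
mydf <- data.frame(n  = c(12, 34, 9, 3, 22, 55),
                   x  = c(1, 2, 1, 1, 2, 2),
                   x1 = c(0, 1, 2, 1, 1, 2))

# The function receives one piece (a data frame) and returns a data frame.
add_z <- function(piece) {
  piece$z <- ifelse(piece$x == 1, 0, piece$n)
  piece
}

pieces <- split(mydf, mydf$x)   # one data frame per value of x
sapply(pieces, nrow)            # each piece has 3 rows, not 6

res <- do.call(rbind, lapply(pieces, add_z))
res$z
# [1]  0  0  0 34 22 55
```

Because each piece has only 3 rows, assigning a length-3 result into the 6-row mydf is what triggered "replacement has 3 rows, data has 6".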
I'm learning R and I'm curious... I need a function that does this:
> fillInTheBlanks(c(1, NA, NA, 2, 3, NA, 4))
[1] 1 1 1 2 3 3 4
> fillInTheBlanks(c(1, 2, 3, 4))
[1] 1 2 3 4
and I produced this one... but I suspect there's a more R way to do this.
fillInTheBlanks <- function(v) {
## replace each NA with the latest preceding available value
orig <- v
result <- v
for(i in 1:length(v)) {
value <- v[i]
if (!is.na(value))
result[i:length(v)] <- value
}
return(result)
}
Package zoo has a function na.locf():
R> library("zoo")
R> na.locf(c(1, 2, 3, 4))
[1] 1 2 3 4
R> na.locf(c(1, NA, NA, 2, 3, NA, 4))
[1] 1 1 1 2 3 3 4
na.locf: Last Observation Carried Forward;
Generic function for replacing each ‘NA’ with the most recent non-‘NA’ prior to it.
See the source code of the function na.locf.default; it doesn't need a for-loop.
I'm doing some minimal copy&paste from the zoo library (thanks again rcs for pointing me at it) and this is what I really needed:
fillInTheBlanks <- function(S) {
## NA in S are replaced with observed values
## accepts a vector possibly holding NA values and returns a vector
## where all observed values are carried forward and the first is
## also carried backward. cfr na.locf from zoo library.
L <- !is.na(S)
c(S[L][1], S[L])[cumsum(L)+1]
}
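The indexing trick in the last line is easier to follow with the intermediate vectors written out. The variable names below are only illustrative:

```r
S <- c(1, NA, NA, 2, 3, NA, 4)
L <- !is.na(S)

observed <- S[L]                     # 1 2 3 4
idx <- cumsum(L)                     # 1 1 1 2 3 3 4: observed values seen so far
padded <- c(observed[1], observed)   # 1 1 2 3 4: extra slot for idx == 0

filled <- padded[idx + 1]
filled
# [1] 1 1 1 2 3 3 4

# With a leading NA, idx starts at 0 and picks the padded first value,
# which is how the first observation is carried backward.
S2 <- c(NA, 5, NA)
L2 <- !is.na(S2)
filled2 <- c(S2[L2][1], S2[L2])[cumsum(L2) + 1]
filled2
# [1] 5 5 5
```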
Just for fun (since it's slower than fillInTheBlanks), here's a version of na.locf relying on the rle function:
my.na.locf <- function(v,fromLast=F){
if(fromLast){
return(rev(my.na.locf(rev(v))))
}
nas <- is.na(v)
e <- rle(nas)
v[nas] <- rep.int(c(NA,v[head(cumsum(e$lengths),-1)]),e$lengths)[nas]
return(v)
}
e.g.
v1 <- c(3,NA,NA,NA,1,2,NA,NA,5)
v2 <- c(NA,NA,NA,1,7,NA,NA,5,NA)
my.na.locf(v1)
#[1] 3 3 3 3 1 2 2 2 5
my.na.locf(v2)
#[1] NA NA NA 1 7 7 7 5 5
my.na.locf(v1,fromLast=T)
#[1] 3 1 1 1 1 2 5 5 5
my.na.locf(v2,fromLast=T)
#[1] 1 1 1 1 7 5 5 5 NA
Another simple answer. This one handles the case where the first value is NA: there is nothing to carry forward, so my loop starts from index 2.
my_vec <- c(1, NA, NA, 2, 3, NA, 4)
fill.it <- function(vector){
  new_vec <- vector
  # start at index 2; seq_along()[-1] is empty for vectors shorter than 2
  for (i in seq_along(new_vec)[-1]){
    if(is.na(new_vec[i])) {
      new_vec[i] <- new_vec[i-1]
    }
  }
  return(new_vec)
}
Multiple R packages include a na.locf function, which does exactly that (imputeTS, zoo, spacetime, ...).
Here is an example with imputeTS:
library("imputeTS")
x <- c(1, NA, NA, 2, 3, NA, 4)
na.locf(x)
There are also more advanced methods for replacing missing values provided by the imputeTS package. (and by zoo also)