R: How to manage leading and trailing NAs when using rollmean()

I have this vector
vec <- c(NA, 1, 2, 3, 4, NA)
for which I wish to calculate a rolling mean with a window of size 3 aligned to the right (that is, if I understand correctly, looking backwards).
The expected rolling mean of my vector would be
# [1] NA NA NA 2 3 NA #
and yet if I do
rollmean(vec, 3, align='right', fill=NA)
I obtain
# [1] NA NA NA NA NA NA

You can use rollapply() with mean instead:
rollapply(vec,3,mean,fill=NA,align="right")
[1] NA NA NA 2 3 NA
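For readers without zoo at hand, the same right-aligned window can be sketched in base R. This only illustrates the expected result and is not how rollmean() or rollapply() work internally; the helper name roll_mean_right is made up for this example.

```r
# Hypothetical helper: right-aligned rolling mean in base R.
# A window containing any NA yields NA, like mean() without na.rm.
roll_mean_right <- function(x, k) {
  stopifnot(k >= 1, k <= length(x))
  windows <- embed(x, k)                      # one row per complete window
  c(rep(NA_real_, k - 1), rowMeans(windows))  # pad the incomplete windows
}

vec <- c(NA, 1, 2, 3, 4, NA)
roll_mean_right(vec, 3)
# [1] NA NA NA  2  3 NA
```

Only the two windows that contain no NA (1, 2, 3 and 2, 3, 4) produce a number, which matches the expected output in the question.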

Related

Index TRUE occurrences preserving NA in a new vector

I have what some of you might categorise as a dumb question, but I cannot solve it. I have this vector:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
And I want to get this in a new vector:
b <- c(NA,NA,1,NA,2,NA,3)
That simple. All the ways I am trying do not preserve the NA and I need them untouched. I would prefer if there would be a way in base R.
In base R, use cumsum() while excluding the NA values:
a <- c(NA,NA,TRUE,NA,TRUE,NA,TRUE)
a[!is.na(a)] <- cumsum(a[!is.na(a)])
Output:
[1] NA NA 1 NA 2 NA 3
Using replace from base R
b <- replace(a, !is.na(a), seq_len(sum(a, na.rm = TRUE)))
b
[1] NA NA 1 NA 2 NA 3
Or a slightly more compact option (if the values are logical/numeric):
cumsum(!is.na(a)) * a
[1] NA NA 1 NA 2 NA 3
Update
If the OP's vector is
a <- c(NA,NA,TRUE,NA,FALSE,NA,TRUE)
(a|!a) * cumsum(replace(a, is.na(a), 0))
[1] NA NA 1 NA 1 NA 2
Replacing the non-NAs with the cumulative sum:
replace(a, !is.na(a), cumsum(na.omit(a)))
# [1] NA NA 1 NA 2 NA 3
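To see why the update matters, two of the expressions quoted above can be compared on a vector that contains FALSE. This is a small check only; the variable names by_position and by_true are invented here.

```r
a <- c(NA, NA, TRUE, NA, FALSE, NA, TRUE)

# Numbers every non-NA position, so the FALSE consumes an index
# and then becomes 0 after multiplying by a
by_position <- cumsum(!is.na(a)) * a
by_position
# [1] NA NA  1 NA  0 NA  3

# Counts only the TRUE values seen so far and leaves NA in place
by_true <- replace(a, !is.na(a), cumsum(na.omit(a)))
by_true
# [1] NA NA  1 NA  1 NA  2
```

When the vector holds only TRUE and NA, both expressions give the same result.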

Why does R 'sample' some columns more than others?

I am testing the impact of missing data on regression analysis. So, using a simulated dataset, I want to randomly remove a proportion of observations (not entire rows) from a designated set of columns. I am using 'sample' to do this. Unfortunately, this leaves some columns with many more missing values than others. See an example below:
#Data frame with 5 columns, 10 rows
DF = data.frame(A = paste(letters[1:10]),B = rnorm(10, 1, 10), C = rnorm(10, 1, 10), D = rnorm(10, 1, 10), E = rnorm(10,1,10))
#Function to randomly delete a proportion (ProportionRemove) of records per column, for a designated set of columns (ColumnStart - ColumnEnd)
RandomSample = function(DataFrame, ColumnStart, ColumnEnd, ProportionRemove){
  #ci is the opposite of the proportion
  ci = 1-ProportionRemove
  Missing = sapply(DataFrame[(ColumnStart:ColumnEnd)], function(x) x[sample(c(TRUE, NA), prob = c(ci,ProportionRemove), size = length(DataFrame), replace = TRUE)])
}
#Randomly sample column 2 - 5 within DF, deleting 80% of the observation per column
Test = RandomSample(DF, 2, 5, 0.8)
I understand there is an element of randomness to this, but in 10 trials (10*4 = 40 columns), 17 of the columns had no data, and in one trial, one column still had 6 records (rather than the expected ~2) - see below.
B C D E
[1,] NA 24.004402 7.201558 NA
[2,] NA NA NA NA
[3,] NA 4.029659 NA NA
[4,] NA NA NA NA
[5,] NA 29.377632 NA NA
[6,] NA 3.340918 -2.131747 NA
[7,] NA NA NA NA
[8,] NA 15.967318 NA NA
[9,] NA NA NA NA
[10,] NA -8.078221 NA NA
In summary, I want to replace a proportion of observations with NAs in each column.
Any help is greatly appreciated!!!
This makes sense to me. As @Frank suggested (in a since-deleted comment ... *sigh*), "randomness" can give you really non-random-looking results (Dilbert: Tour of Accounting, 2001-10-25).
If you want random samples with guaranteed ratios, try this:
guaranteedSampling <- function(DataFrame, ProportionRemove) {
  n <- max(1L, floor(nrow(DataFrame) * ProportionRemove))
  inds <- replicate(ncol(DataFrame), sample(nrow(DataFrame), size=n), simplify=FALSE)
  DataFrame[] <- mapply(`[<-`, DataFrame, inds, MoreArgs=list(NA), SIMPLIFY=FALSE)
  DataFrame
}
set.seed(2)
guaranteedSampling(DF[2:5], 0.8)
# B C D E
# 1 NA NA NA NA
# 2 NA NA NA NA
# 3 NA NA NA NA
# 4 6.792463 10.582938 NA NA
# 5 NA NA -0.612816 NA
# 6 NA -2.278758 NA NA
# 7 NA NA NA 2.245884
# 8 NA NA NA 5.993387
# 9 7.863310 NA 9.042127 NA
# 10 NA NA NA NA
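A quick way to confirm a fixed ratio is to count the NAs per column. The sketch below uses a simpler per-column helper (na_fraction, a name invented here) and made-up data rather than the function above, so treat it as an illustration of the idea only.

```r
# Hypothetical helper: blank out exactly floor(p * length(x)) entries of x
na_fraction <- function(x, p) {
  n <- floor(length(x) * p)
  x[sample(length(x), size = n)] <- NA   # sample positions without replacement
  x
}

set.seed(1)
DF <- data.frame(B = rnorm(10), C = rnorm(10), D = rnorm(10), E = rnorm(10))
DF[] <- lapply(DF, na_fraction, p = 0.8)

colSums(is.na(DF))   # 8 missing values in every column
```

Because positions are drawn without replacement, the count per column is fixed and only the placement of the NAs is random.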
Further to @joran's comment, you either wanted nrow(DataFrame) or length(x).
The specific impact in your example is that you are producing a vector with 5 elements (because DF has 5 variables) each with 0.8 probability of being NA and 0.2 of being TRUE.
Then this statement (which is what the sapply is doing to each column you specify and in this case I'm applying to DF$B only):
DF$B[sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)]
does something that isn't immediately obvious to the uninitiated*. This:
sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)
gives a logical vector, which when used to extract elements of a vector is silently recycled. So let's say you end up with:
NA TRUE NA TRUE NA
When you subset DF$B you end up getting this:
DF$B[c(NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, TRUE, NA)]
Notice in your example how the top 5 numbers always follow the same pattern as the bottom 5 numbers. This explains why so many columns ended up being all NA, because there is a 0.32768 probability of getting 5 out of 5 NA which gets recycled to the whole column.
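The recycling described above is easy to reproduce on a plain vector. A minimal sketch with made-up values:

```r
x <- 1:10
idx <- c(NA, TRUE, NA, TRUE, NA)   # length 5, silently recycled to length 10

x[idx]
# [1] NA  2 NA  4 NA NA  7 NA  9 NA

# Chance that all five sampled values are NA when P(NA) = 0.8
0.8^5
# [1] 0.32768
```

The pattern in positions 1 to 5 repeats in positions 6 to 10, just like the top and bottom halves of the columns in the question.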
The other issue with your code is that the last expression in the function is an assignment, so the result is only returned invisibly. Here it is corrected and cleaned up, following the style guide at http://adv-r.had.co.nz/Style.html:
random_sample <- function(x, col_start, col_end, p) {
  sapply(x[col_start:col_end],
         function(y) y[sample(c(TRUE, NA), prob = c(1-p, p), size = length(y), replace = TRUE)])
}
*The uninitiated in this case includes me! I had no idea that logical vectors were recycled when used to extract until having a look at this question.

Specify class of NA in R (for if_else, dplyr)

In the if_else() function in dplyr, it requires that both the if:TRUE and if:FALSE elements are of the same class.
I wish to return NA from my if_else() statement.
But e.g.
if_else(mtcars$cyl > 5, NA, 1)
returns
Error: false has type 'double' not 'logical'
because a bare NA is logical, whereas 1 is numeric (double).
Wrapping as.numeric() around the NA works fine: e.g.
if_else(mtcars$cyl > 5, as.numeric(NA), 1)
returns
[1] NA NA 1 NA NA NA NA 1 1 NA NA NA NA NA NA NA NA 1 1 1 1 NA NA NA NA 1 1 1 NA NA NA 1
which is what I am hoping for.
But this feels kinda silly/unnecessary. Is there a better way of inputting NA as a "numeric NA" than wrapping it like this?
NB this only applies to the stricter dplyr::if_else not base::ifelse.
You can use NA_real_:
if_else(mtcars$cyl > 5, NA_real_, 1)
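Base R provides a typed NA constant for each atomic type that has one, which is what makes NA_real_ work here. A short check, using base R only:

```r
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# as.numeric(NA) and NA_real_ are the same value
identical(as.numeric(NA), NA_real_)   # TRUE
```

So NA_integer_ or NA_character_ would be the matching choice when the other branch of if_else() is an integer or a string.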
Or try the base function, which does not check types:
ifelse(mtcars$cyl > 5, NA, 1)
Or you can use if_else_ from the package hablar. It is as rigid as if_else from dplyr about types, but allows for a generic NA. For example:
library(hablar)
if_else_(mtcars$cyl > 5, NA, 1)

Methods to correct for lookahead bias

This is not a regex problem.
I am trying to correct for lookahead bias in data, basically to move the values up by 1. This is what I came up with. Does anyone have a better/faster/ built-in method to do this?
d<-c(1,2,3,4)
#correct for lookahead bias, move values up by 1
e<-d[-c(1)]
length(e)<-length(d)
cbind(d,e)
> cbind(d,e)
d e
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 NA
There are a couple of ways I could think of to do this. Both are fairly concise one-liners:
base
cbind(d, c(d[-1], NA))
data.table
rev(data.table::shift(rev(d), 1))
If we want to write it as a function, we can do that too. Note that this function does not attempt any error handling.
shift_up <- function(x, n = 1) c(x[-(1:n)], rep(NA, n))
Which is very useful for fans of the comic series Batman:
d <- 1:16
shift_up(d, 16)
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA #BATMAN!
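Since shift_up does no error handling, two edge cases are worth knowing: with n = 0 it drops the first element (because 1:0 is c(1, 0)), and with n larger than length(x) it returns a vector longer than x. One possible variant, under the assumption that the result should always keep the length of x, relies on out-of-range indices returning NA. The name shift_up_safe is made up for this sketch.

```r
# Hypothetical variant of shift_up: indices past the end yield NA,
# so the result always has the same length as x.
shift_up_safe <- function(x, n = 1L) {
  stopifnot(length(n) == 1L, n >= 0)
  x[seq_along(x) + n]
}

d <- c(1, 2, 3, 4)
shift_up_safe(d)       # [1]  2  3  4 NA
shift_up_safe(d, 0)    # [1] 1 2 3 4
shift_up_safe(d, 10)   # [1] NA NA NA NA
```

Negative shifts are rejected rather than guessed at, since shifting down needs different padding.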

Replacing each element of any object

Is there any clever way to replace each part of any object with some value (for example NA)?
Let's take these objects:
obj1 <- t.test(1:10)
obj2 <- matrix(1:9, 3)
obj3 <- 1:10
obj4 <- list(a = 1:10, b = letters[1:5], c = as.factor(1:10))
the expected output would be similar to
for (i in 1:length(obj1)) obj1[[i]] <- rep(NA, length(obj1[[i]]))
obj2 <- matrix(rep(NA, 9), 3)
obj3 <- rep(NA, 10)
obj4 <- list(a = rep(NA, 10), b = rep(NA, 5), c = rep(NA, 10))
So no matter if an object is a list, matrix, data.frame, vector etc. each part of the object is to be replaced with NA.
Is there any clever way to do so that does not need multiple loops, checking for object type every time and lots of exceptions (if (is.list(part)) ... etc.)?
You can take advantage of the fact that using an empty extraction index during assignment (i.e., x[] <- NA) replaces all elements with the right-hand side value. In your case, you could do something like this using rapply to attack all elements of all objects:
> rapply(mget(ls()), function(x) x[] <- rep(NA, length(x)), how = "replace")
$obj1
$obj1$statistic
[1] NA
$obj1$parameter
[1] NA
$obj1$p.value
[1] NA
$obj1$conf.int
[1] NA NA
$obj1$estimate
[1] NA
$obj1$null.value
[1] NA
$obj1$alternative
[1] NA
$obj1$method
[1] NA
$obj1$data.name
[1] NA
$obj2
[1] NA NA NA NA NA NA NA NA NA
$obj3
[1] NA NA NA NA NA NA NA NA NA NA
$obj4
$obj4$a
[1] NA NA NA NA NA NA NA NA NA NA
$obj4$b
[1] NA NA NA NA NA
$obj4$c
[1] NA NA NA NA NA NA NA NA NA NA
That's a very simple solution, though. You could probably complicate the function being passed to rapply so that it used S3 method dispatch to identify what class of object it was seeing and possibly return a different data structure (e.g., data.frame or matrix) accordingly, rather than just a vector of NAs.
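A lighter-weight step in that direction, without any method dispatch, is to return the modified object from the function instead of the right-hand side of the assignment. The sketch below assumes that keeping dim, names and class on each leaf is what is wanted; the helper name blank and the list objs are invented for the example.

```r
objs <- list(
  m = matrix(1:9, 3),
  v = 1:10,
  l = list(a = 1:10, b = letters[1:5], c = as.factor(1:10))
)

# Hypothetical helper: overwrite the contents but keep the container
blank <- function(x) {
  x[] <- NA   # empty index keeps attributes such as dim and class
  x           # return the object itself, not the assigned value
}

res <- rapply(objs, blank, how = "replace")

dim(res$m)          # [1] 3 3
is.factor(res$l$c)  # TRUE
```

With this version the matrix stays a 3 x 3 matrix of NA and the factor stays a factor, whereas the one-liner above flattens each leaf to a plain vector.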
