propagating data within a vector - r

I'm learning R and I'm curious... I need a function that does this:
> fillInTheBlanks(c(1, NA, NA, 2, 3, NA, 4))
[1] 1 1 1 2 3 3 4
> fillInTheBlanks(c(1, 2, 3, 4))
[1] 1 2 3 4
and I produced this one... but I suspect there's a more R way to do this.
fillInTheBlanks <- function(v) {
  ## replace each NA with the latest preceding available value
  result <- v
  for (i in seq_along(v)) {
    value <- v[i]
    if (!is.na(value))
      result[i:length(v)] <- value
  }
  return(result)
}

Package zoo has a function na.locf():
R> library("zoo")
R> na.locf(c(1, 2, 3, 4))
[1] 1 2 3 4
R> na.locf(c(1, NA, NA, 2, 3, NA, 4))
[1] 1 1 1 2 3 3 4
na.locf: Last Observation Carried Forward;
Generic function for replacing each ‘NA’ with the most recent non-‘NA’ prior to it.
See the source code of na.locf.default; it doesn't need a for-loop.

I'm doing some minimal copy&paste from the zoo library (thanks again rcs for pointing me at it) and this is what I really needed:
fillInTheBlanks <- function(S) {
  ## NA in S are replaced with observed values
  ## accepts a vector possibly holding NA values and returns a vector
  ## where all observed values are carried forward and the first is
  ## also carried backward. cfr na.locf from zoo library.
  L <- !is.na(S)
  c(S[L][1], S[L])[cumsum(L)+1]
}
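To see why the one-liner above works, it may help to look at the intermediate vectors. The following is only a sketch on an illustrative input with a leading NA; the names idx, pool and filled are introduced here for explanation and are not part of the original answer.

```r
S <- c(NA, 1, NA, NA, 2, 3, NA, 4)   # illustrative input with a leading NA

L <- !is.na(S)             # TRUE where a value was observed
idx <- cumsum(L) + 1       # how many values were seen so far, shifted by one
pool <- c(S[L][1], S[L])   # observed values, first one duplicated in front

idx
# [1] 1 2 2 2 3 4 4 5
pool
# [1] 1 1 2 3 4

filled <- pool[idx]        # look up the latest observed value per position
filled
# [1] 1 1 1 1 2 3 3 4
```

The duplicated first element of pool is what carries the first observation backward: positions before any observation have cumsum(L) equal to 0 and therefore index 1.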

Just for fun (since it's slower than fillInTheBlanks), here's a version of na.locf that relies on the rle function:
my.na.locf <- function(v, fromLast = FALSE){
  if(fromLast){
    return(rev(my.na.locf(rev(v))))
  }
  nas <- is.na(v)
  e <- rle(nas)
  v[nas] <- rep.int(c(NA,v[head(cumsum(e$lengths),-1)]),e$lengths)[nas]
  return(v)
}
e.g.
v1 <- c(3,NA,NA,NA,1,2,NA,NA,5)
v2 <- c(NA,NA,NA,1,7,NA,NA,5,NA)
my.na.locf(v1)
#[1] 3 3 3 3 1 2 2 2 5
my.na.locf(v2)
#[1] NA NA NA 1 7 7 7 5 5
my.na.locf(v1,fromLast=TRUE)
#[1] 3 1 1 1 1 2 5 5 5
my.na.locf(v2,fromLast=TRUE)
#[1] 1 1 1 1 7 5 5 5 NA
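The rle-based version is easier to follow once the run-length encoding of the NA mask is spelled out. This sketch only unpacks the intermediate values for v1; run_ends and fill_values are names introduced here for illustration.

```r
v1 <- c(3, NA, NA, NA, 1, 2, NA, NA, 5)

e <- rle(is.na(v1))        # runs of non-NA / NA positions
e$values
# [1] FALSE  TRUE FALSE  TRUE FALSE
e$lengths
# [1] 1 3 2 2 1

run_ends <- cumsum(e$lengths)   # index of the last element of each run
run_ends
# [1] 1 4 6 8 9

# For each run, the element just before it (NA for the first run).
# Only the entries belonging to NA runs are used by my.na.locf.
fill_values <- c(NA, v1[head(run_ends, -1)])
fill_values
# [1] NA  3 NA  2 NA
```

Repeating fill_values by e$lengths and keeping only the NA positions gives exactly the replacement values used in the function.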

Another simple answer. This one handles the case where the first value is NA: there is nothing earlier to carry forward, so it stays NA and the loop starts from index 2.
my_vec <- c(1, NA, NA, 2, 3, NA, 4)
fill.it <- function(vector){
  new_vec <- vector
  # seq_along(...)[-1] is 2, 3, ..., n and is empty for vectors shorter than 2
  for (i in seq_along(new_vec)[-1]){
    if (is.na(new_vec[i])) {
      new_vec[i] <- new_vec[i-1]
    }
  }
  return(new_vec)
}
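The same "carry the previous value along" idea can also be written without an explicit loop, using base R's Reduce with accumulate = TRUE. This is only a sketch; carry_forward is a hypothetical name, and it is not claimed to be faster than the loop.

```r
# carry_forward is a hypothetical helper, shown for illustration only
carry_forward <- function(v) {
  # keep the previous result whenever the current element is NA
  Reduce(function(prev, cur) if (is.na(cur)) prev else cur,
         v, accumulate = TRUE)
}

res1 <- carry_forward(c(1, NA, NA, 2, 3, NA, 4))
res1
# [1] 1 1 1 2 3 3 4

res2 <- carry_forward(c(NA, 1, NA))   # a leading NA stays NA
res2
# [1] NA  1  1
```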

Several R packages include a na.locf function that does exactly this (imputeTS, zoo, spacetime, ...).
Here is an example with imputeTS:
library("imputeTS")
x <- c(1, NA, NA, 2, 3, NA, 4)
na.locf(x)
The imputeTS package (and zoo as well) also provides more advanced methods for replacing missing values.

Related

How to apply coalesce to a matrix

I know I can coalesce three vectors in this way:
library(dplyr)  # provides coalesce()
y <- c(1, NA, NA, NA, 5)
z <- c(NA, NA, 3, 4, 6)
k <- c(NA, 1, NA, NA, 8)
coalesce(y,z,k)
However, in real data, I only get the whole dataset like:
d <- rbind(y,z,k)
And in real life d has a different number of rows every time (sometimes 3 rows, sometimes 4 rows), which means we cannot just write:
coalesce(d[1,],d[2,],d[3,])
I have thought about split function:
split(t(d),rep(1:nrow(d),each = ncol(d)))
And got:
$`1`
[1] 1 NA NA NA 5
$`2`
[1] NA NA 3 4 6
$`3`
[1] NA 1 NA NA 8
But the result of split cannot be passed directly to coalesce.
Is there a way to make this work?
You can use it within do.call(), i.e.
do.call(coalesce, as.data.frame(t(d)))
#[1] 1 1 3 4 5
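do.call(f, list(a, b, c)) is equivalent to f(a, b, c), which is why it copes with any number of rows. If dplyr is not available, a similar effect can be sketched in base R by folding a two-argument coalesce over the rows with Reduce. coalesce2 and rows are hypothetical names used only for this illustration.

```r
y <- c(1, NA, NA, NA, 5)
z <- c(NA, NA, 3, 4, 6)
k <- c(NA, 1, NA, NA, 8)
d <- rbind(y, z, k)

# hypothetical two-argument coalesce: take b wherever a is NA
coalesce2 <- function(a, b) ifelse(is.na(a), b, a)

# one list element per row of d, however many rows there are
rows <- lapply(seq_len(nrow(d)), function(i) d[i, ])

result <- Reduce(coalesce2, rows)
result
# [1] 1 1 3 4 5
```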

R function composition for the substitution of values in dataframe

Given the reproducible example below, my objective is to substitute, row by row, the original values in adjacent columns of a data frame with NA. I know this problem (in its many variants) has been posted before, but I have not yet found a solution that uses the approach I'm after, namely function composition.
In the reproducible example, column a drives the substitution: wherever a is NA, the other columns should become NA as well.
This is what I've done so far; the very last code snippet is a failing attempt at what I'm actually looking for.
#-----------------------------------------------------------
# ifelse approach, it works but...
# it's error prone: i.e. copy and paste for all columns can introduce a lot of troubles
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
df$b<-ifelse(is.na(df$a), NA, df$b)
df$c<-ifelse(is.na(df$a), NA, df$c)
df
#--------------------------------------------------------
# extraction and substitution approach
# same as above
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
df$b[is.na(df$a)]<-NA
df$c[is.na(df$a)]<-NA
df
#----------------------------------------------------------
# definition of a function
# it's a bit better, but still error prone because of the copy and paste
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
fix<-function(x,y){
  ifelse(is.na(x), NA, y)
}
df$b<-fix(df$a, df$b)
df$c<-fix(df$a, df$c)
df
#------------------------------------------------------------
# this approach is not working as expected!
# the idea behind is of function composition;
# lapply does the fix to some columns of data frame
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
fix2<-function(x){
  x[is.na(x[1])]<-NA
  x
}
df[]<-lapply(df, fix2)
df
Any help with this particular approach?
I'm stuck on how to properly write the substitution function passed to lapply.
Thanks
Using lexical closure
With a lexical closure, you first define a function that generates the function you need.
You can then use the generated function wherever you like.
# given a column all other columns' values at that row should become NA
# if the driver column's value at that row is NA
# using lexical scoping of R function definitions, one can reach that.
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
# whatever vector given, this vector's value should be changed
# according to first column's value
na_accustomizer <- function(df, driver_col) {
  ## Returns a function which will accustomize any vector/column
  ## to driver column's NAs
  function(vec) {
    vec[is.na(df[, driver_col])] <- NA
    vec
  }
}
df[] <- lapply(df, na_accustomizer(df, "a"))
df
## a b c
## 1 1 3 NA
## 2 2 NA 5
## 3 NA NA NA
#
# na_accustomizer(df, "a") returns
#
# function(vec) {
# vec[is.na(df[, "a"])] <- NA
# vec
# }
#
# which then can be used like you want:
# df[] <- lapply(df, na_accustomizer(df, "a"))
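The fix2 attempt in the question fails because, inside lapply, x is a single column, so x[1] is the first element of that column rather than column a. A closure avoids this by capturing the driver column when the function is generated. The variant below is a minimal sketch of the same idea that computes the NA mask once up front; make_masker and mask_by_a are hypothetical names, not part of the answer above.

```r
df <- data.frame(a = c(1, 2, NA), b = c(3, NA, 4), c = c(NA, 5, 6))

# make_masker is a hypothetical function factory
make_masker <- function(driver) {
  mask <- is.na(driver)    # computed once and captured by the closure
  function(vec) {
    vec[mask] <- NA        # blank out the rows where the driver is NA
    vec
  }
}

mask_by_a <- make_masker(df$a)
df[] <- lapply(df, mask_by_a)
df
```

Because the mask is stored in the closure's environment, later changes to df do not affect a function that has already been generated.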
Using normal functions
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
# define it for one column
overtake_NA <- function(df, driver_col, target_col) {
  df[, target_col] <- ifelse(is.na(df[, driver_col]), NA, df[, target_col])
  df
}
# define it for all columns of df
overtake_driver_col_NAs <- function(df, driver_col) {
  for (i in seq_len(ncol(df))) {
    df <- overtake_NA(df, driver_col, i)
  }
  df
}
overtake_driver_col_NAs(df, "a")
# a b c
# 1 1 3 NA
# 2 2 NA 5
# 3 NA NA NA
Generalize for any predicate function
driver_col_to_other_cols <- function(df, driver_col, pred) {
  ## copy the driver column's value into the other columns of df
  ## wherever the predicate function (pred) is fulfilled.
  # define it for one column
  overtake_ <- function(df, driver_col, target_col, pred) {
    selectors <- pred(df[, driver_col])
    # pred may return NA where driver_col has NAs (e.g. NA == 1 is NA);
    # treat those as "not selected". A predicate such as is.na never
    # returns NA, so this is a no-op for it.
    selectors[is.na(selectors)] <- FALSE
    df[, target_col] <- ifelse(selectors, df[, driver_col], df[, target_col])
    df
  }
  for (i in seq_len(ncol(df))) {
    df <- overtake_(df, driver_col, i, pred)
  }
  df
}
driver_col_to_other_cols(df, "a", function(x) x == 1)
# a b c
# 1 1 1 1
# 2 2 NA 5
# 3 NA 4 6
## if NAs in the selector vector were not reset to FALSE, this would give:
# a b c
# 1 1 1 1
# 2 2 NA 5
# 3 NA NA NA
## hence, when pred returns NA for the NAs in 'a',
## those rows have to keep the original columns' values.
driver_col_to_other_cols(df, "a", is.na)
# a b c
# 1 1 3 NA
# 2 2 NA 5
# 3 NA NA NA
Try this function: it takes your original dataset as input and returns the cleaned one as output:
Input
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
> df
a b c
1 1 3 NA
2 2 NA 5
3 NA 4 6
Function
fix<-function(df,var_x,list_y)
{
  df[is.na(df[,var_x]),list_y]<-NA
  return(df)
}
Output
fix(df,"a",c("b","c"))
a b c
1 1 3 NA
2 2 NA 5
3 NA NA NA

Aggregate NAs in R

I'm having trouble handling NAs while calculating aggregated means. Please see the following code:
tab=data.frame(a=c(1:3,1:3), b=c(1,2,NA,3,NA,NA))
tab
a b
1 1 1
2 2 2
3 3 NA
4 1 3
5 2 NA
6 3 NA
attach(tab)
aggregate(b, by=list(a), data=tab, FUN=mean, na.rm=TRUE)
Group.1 x
1 1 2
2 2 2
3 3 NaN
I want NA instead of NaN if the vector has all NAs i.e. I want the output to be
Group.1 x
1 1 2
2 2 2
3 3 NA
I tried using a custom function:
adjmean=function(x) {if(all(is.na(x))) NA else mean(x,na.rm=TRUE)}
However, I get the following error:
aggregate(b, by=list(a), data=tab, FUN=adjmean)
Error in FUN(X[[1L]], ...) :
unused argument (data = list(a = c(1, 2, 3, 1, 2, 3), b = c(1, 2, NA, 3, NA, NA)))
In short, if the column has all NAs I want NA as an output instead of NaN. If only some of the values are NA, then it should compute the mean ignoring the NAs.
Any help would be appreciated.
Thanks
This is very close to what you had, but replaces mean(x, na.rm=TRUE) with a custom function which either computes the mean of the non-NA values, or supplies NA itself:
R> with(tab,
aggregate(b, by=list(a), FUN=function(x)
if (any(is.finite(z<-na.omit(x)))) mean(z) else NA))
Group.1 x
1 1 2
2 2 2
3 3 NA
R>
That is really one line, but I broke it up to make it fit into the SO display.
And you already had a similar idea, but I altered the function a bit more to return suitable values in all cases.
There is nothing wrong with your function. What is wrong is that you are using an argument in the default method for aggregate that doesn't exist:
adjmean = function(x) {if(all(is.na(x))) NA else mean(x,na.rm=TRUE)}
attach(tab) ## Just because you did it. I don't recommend this.
## Your error
aggregate(b, by=list(a), data=tab, FUN=adjmean)
# Error in FUN(X[[i]], ...) :
# unused argument (data = list(a = c(1, 2, 3, 1, 2, 3), b = c(1, 2, NA, 3, NA, NA)))
## Dropping the "data" argument
aggregate(b, list(a), FUN = adjmean)
# Group.1 x
# 1 1 2
# 2 2 2
# 3 3 NA
If you wanted to use the data argument, you should use the formula method for aggregate. However, this method treats NA differently, so you need an additional argument, na.action.
Example:
detach(tab) ## I don't like having things attached
aggregate(b ~ a, data = tab, adjmean)
# a b
# 1 1 2
# 2 2 2
aggregate(b ~ a, data = tab, adjmean, na.action = na.pass)
# a b
# 1 1 2
# 2 2 2
# 3 3 NA
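The NaN in the question arises because, once na.rm = TRUE has removed every value, mean is taken over an empty vector. As a sketch of an alternative that avoids both attach and the data argument, the same adjmean function can be passed to base R's tapply; group_means is a name introduced here for illustration.

```r
tab <- data.frame(a = c(1:3, 1:3), b = c(1, 2, NA, 3, NA, NA))
adjmean <- function(x) if (all(is.na(x))) NA else mean(x, na.rm = TRUE)

# nothing is left to average after removing the NAs, hence NaN
all_na_mean <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

# one value per level of a: groups 1 and 2 give 2, group 3 gives NA
group_means <- tapply(tab$b, tab$a, adjmean)
group_means
```

tapply returns a named array rather than a data frame, so aggregate remains the better fit when a data frame is needed downstream.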

Modifying dplyr::lag function

I am trying to use the lag function from the dplyr package. However, when I give a lag > 0, I want the missing values to be replaced by the first value in x. How can we achieve this?
library(dplyr)
x<-c(1,2,3,4)
z<-lag(x,2)
z
## [1] NA NA 1 2
Since you are using the lag function from dplyr, you can use its default argument and specify x[1] as the default.
lag(x, 2, default=x[1])
Here's a modified function mylag:
mylag <- function(x, k = 1, ...)
  replace(lag(x, k, ...), seq_len(k), x[1])
x <- 1:4
mylag(x, k = 2)
# [1] 1 1 1 2
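For comparison, the same "lag and fill with the first value" behaviour can be sketched in base R without dplyr. lag_fill_first is a hypothetical helper, and capping k at the vector length is an assumption made here, not something the answers above specify.

```r
# lag_fill_first is a hypothetical base-R helper, for illustration only
lag_fill_first <- function(x, k = 1) {
  k <- min(k, length(x))                    # assumption: cap the shift at length(x)
  c(rep(x[1], k), head(x, length(x) - k))   # k copies of x[1], then the rest of x
}

r2 <- lag_fill_first(c(1, 2, 3, 4), 2)
r2
# [1] 1 1 1 2

r0 <- lag_fill_first(c(1, 2, 3, 4), 0)      # a lag of 0 returns x unchanged
r0
# [1] 1 2 3 4
```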
May I suggest adapting the function so that it works both ways: for lag and lead (positive AND negative lags).
shift = function(x, lag, fill=FALSE) {
  require(dplyr)
  switch(sign(lag)/2+1.5,
    lead( x, n=abs(lag), default=switch(fill+1, NA, tail(x, 1)) ),
    lag( x, n=abs(lag), default=switch(fill+1, NA, head(x, 1)) )
  )
}
It has a "fill" argument that automatically fills with the first or last value, depending on the sign of the lag.
> shift(1:10, -1)
#### [1] 2 3 4 5 6 7 8 9 10 NA
> shift(1:10, +1, fill=TRUE)
#### [1] 1 1 2 3 4 5 6 7 8 9

Subset of ESet /dividing ESet

Is it possible to subset a ExpressionSet like this:
SUB=ESet[,ESet@phenoData@data$x==c(0,1)]
x contains values from 0 to 9, and I only want the entries where x is 0 or 1.
Try the following:
SUB=ESet[, ESet$x %in% c(0,1)]
At first glance, the difference between == and %in% seems only subtle.
x <- 0:9
x[x==c(0, 1)]
[1] 0 1
x[x %in% c(0, 1)]
[1] 0 1
But %in% will never return NA, and this could be useful, or even essential, depending on what you want to do. In the following constructed example, == returns NA, whilst %in% returns the expected result:
x <- c(NA, 0:9)
x[x==c(0, 1)]
[1] NA
x[x %in% c(0, 1)]
[1] 0 1
But the difference runs much deeper than this. The help page ?"==" makes it clear that when binary comparisons are made between vectors of unequal length, the elements of the shorter vector are recycled as necessary.
Try for example the following:
x <- 0:9
x[x==c(1, 2)]
integer(0)
This results in an empty vector. If you recycle the vector c(1, 2), it quickly becomes apparent why:
x:        0 1 2 3 4 5 6 7 8 9
c(1, 2):  1 2 1 2 1 2 1 2 1 2
'==':     F F F F F F F F F F
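The recycling shown in the table can be made explicit with rep_len, which is a convenient way to check what == actually compares. The sketch below also contrasts it with %in%, which tests set membership via match; the variable names are introduced here for illustration.

```r
x <- 0:9

# what == silently does with the shorter vector
recycled <- rep_len(c(1, 2), length(x))
recycled
# [1] 1 2 1 2 1 2 1 2 1 2

any_equal <- any(x == recycled)   # no position happens to line up
any_equal
# [1] FALSE

members <- x[x %in% c(1, 2)]      # membership test, independent of position
members
# [1] 1 2

# %in% behaves like a match() lookup that never returns NA
same_as_match <- identical(x %in% c(1, 2), !is.na(match(x, c(1, 2))))
```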
