R function composition for the substitution of values in a data frame

Given the reproducible example below, my objective is to substitute, row-wise, the original values with NA in the adjacent columns of a data frame. I know this problem (in many variants) has already been posted, but I have not yet found a solution that uses the approach I'm after, i.e. function composition.
In the reproducible example, the column driving the substitution of the original values with NA is column a.
This is what I've done so far.
The very last code snippet is a failing attempt at what I'm actually looking for...
#-----------------------------------------------------------
# ifelse approach, it works but...
# it's error prone: i.e. copy and paste for all columns can introduce a lot of troubles
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
df$b<-ifelse(is.na(df$a), NA, df$b)
df$c<-ifelse(is.na(df$a), NA, df$c)
df
#--------------------------------------------------------
# extraction and substitution approach
# same as above
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
df$b[is.na(df$a)]<-NA
df$c[is.na(df$a)]<-NA
df
#----------------------------------------------------------
# definition of a function
# it's a bit better, but still error prone because of the copy and paste
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
fix<-function(x,y){
ifelse(is.na(x), NA, y)
}
df$b<-fix(df$a, df$b)
df$c<-fix(df$a, df$c)
df
#------------------------------------------------------------
# this approach is not working as expected!
# the idea behind is of function composition;
# lapply does the fix to some columns of data frame
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
fix2<-function(x){
x[is.na(x[1])]<-NA
x
}
df[]<-lapply(df, fix2)
df
Any help with this particular approach?
I'm stuck on how to properly write the substitution function passed to lapply.
Thanks

Using a lexical closure
With a lexical closure, you first define a function that generates the function you need.
You can then use the generated function as you wish.
# given a column all other columns' values at that row should become NA
# if the driver column's value at that row is NA
# using lexical scoping of R function definitions, one can reach that.
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
# whatever vector given, this vector's value should be changed
# according to first column's value
na_accustomizer <- function(df, driver_col) {
## Returns a function which will accustomize any vector/column
## to driver column's NAs
function(vec) {
vec[is.na(df[, driver_col])] <- NA
vec
}
}
df[] <- lapply(df, na_accustomizer(df, "a"))
df
## a b c
## 1 1 3 NA
## 2 2 NA 5
## 3 NA NA NA
#
# na_accustomizer(df, "a") returns
#
# function(vec) {
# vec[is.na(df[, "a"])] <- NA
# vec
# }
#
# which then can be used like you want:
# df[] <- lapply(df, na_accustomizer(df, "a"))
Using normal functions
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
df
# define it for one column
overtake_NA <- function(df, driver_col, target_col) {
df[, target_col] <- ifelse(is.na(df[, driver_col]), NA, df[, target_col])
df
}
# define it for all columns of df
overtake_driver_col_NAs <- function(df, driver_col) {
for (i in 1:ncol(df)) {
df <- overtake_NA(df, driver_col, i)
}
df
}
overtake_driver_col_NAs(df, "a")
# a b c
# 1 1 3 NA
# 2 2 NA 5
# 3 NA NA NA
Generalize for any predicate function
driver_col_to_other_cols <- function(df, driver_col, pred) {
## overtake any value of the driver column to the other columns of df,
## whenever predicate function (pred) is fulfilled.
# define it for one column
overtake_ <- function(df, driver_col, target_col, pred) {
selectors <- do.call(pred, list(df[, driver_col]))
# a predicate such as `x == 1` returns NA where the driver column is NA;
# treat those entries as "not fulfilled". This is a no-op for is.na(),
# which never returns NA, so the predicate needs no special-casing
# (inside this helper, deparse(substitute(pred)) would always be "pred").
selectors[is.na(selectors)] <- FALSE
df[, target_col] <- ifelse(selectors, df[, driver_col], df[, target_col])
df
}
for (i in 1:ncol(df)) {
df <- overtake_(df, driver_col, i, pred)
}
df
}
driver_col_to_other_cols(df, "a", function(x) x == 1)
# a b c
# 1 1 1 1
# 2 2 NA 5
# 3 NA 4 6
## if NA entries in the selector vector were not reset to FALSE, this would give
## (because of the NA in the selector vector):
# a b c
# 1 1 1 1
# 2 2 NA 5
# 3 NA NA NA
## hence, when pred does not itself test for NA in 'a',
## those rows have to keep the original columns' values.
driver_col_to_other_cols(df, "a", is.na)
# a b c
# 1 1 3 NA
# 2 2 NA 5
# 3 NA NA NA
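As a side note on why fix2 in the question did not work: inside lapply(df, fix2), x is a single column, so x[1] is that column's first element, not column a. A minimal base R sketch of the same idea without a closure factory is shown below; the names driver and targets are hypothetical, introduced only for illustration.

```r
df <- data.frame(a = c(1, 2, NA), b = c(3, NA, 4), c = c(NA, 5, 6))

driver  <- "a"                          # column whose NAs drive the substitution
targets <- setdiff(names(df), driver)   # every other column

# for each target column, set to NA the rows where the driver column is NA
df[targets] <- lapply(df[targets], function(v) {
  replace(v, is.na(df[[driver]]), NA)
})
df
```

Restricting the lapply to the target columns avoids touching the driver column at all, which is a design choice rather than a requirement.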

Try this function: it takes your original dataset as input and returns the cleaned one as output:
Input
df<-data.frame(a=c(1, 2, NA), b=c(3, NA, 4), c=c(NA, 5, 6))
> df
a b c
1 1 3 NA
2 2 NA 5
3 NA 4 6
Function
fix<-function(df,var_x,list_y)
{
df[is.na(df[,var_x]),list_y]<-NA
return(df)
}
Output
fix(df,"a",c("b","c"))
a b c
1 1 3 NA
2 2 NA 5
3 NA NA NA

Related

Extend runs of certain length

I have a 640 x 2500 dataframe with numeric values and several NA values. My goal is to find a minimum of 75 consecutive NA values in each row. For each such run, I want to replace the previous and following 50 cells with NA values too.
Here's a scaled down example of one row:
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
# run of four NA: ^ ^ ^ ^
I want to detect the run of four consecutive NA, and then replace three values before and three values after the run with NA:
c(1, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 4, 3)
# ^ ^ ^ ^ ^ ^
I have tried to first identify the consecutive NAs with rle, but running rle(is.na(df)) gives the error "'x' must be a vector of an atomic type". This occurs even when I select a single row.
Unfortunately, I do not know what the next steps would be for converting the previous and following 50 cells to NA.
Would highly appreciate any help on this, thanks in advance.

Because you comment that in your data "some [rows] begin and end with several NAs", hopefully this better represents the real data:
A B C D E F G H I J
1 1 2 3 NA NA 6 7 8 NA 10
2 1 NA NA NA 5 6 7 NA NA NA
3 1 2 3 4 NA NA NA 8 9 10
Let's assume that the minimum run length of NA to be expanded with NA is 2, and that two values before and two values after the run should be replaced with NA. In this example, row 2 would represent the case you mentioned in comment.
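The code in this answer operates on an object called d that is never constructed in the text. Assuming d holds exactly the table shown above, it could be rebuilt like this (a sketch, with the column values copied from that table):

```r
# reconstruct the example data shown above as a plain data.frame named d
d <- data.frame(
  A = c(1, 1, 1),
  B = c(2, NA, 2),
  C = c(3, NA, 3),
  D = c(NA, NA, 4),
  E = c(NA, 5, NA),
  F = c(6, 6, NA),
  G = c(7, 7, NA),
  H = c(8, NA, 8),
  I = c(NA, NA, 9),
  J = c(10, NA, 10)
)
d
```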
First some data wrangling. I prefer to work with a data.table in long format. With data.table we have access to the useful constants .I and .N, and can easily create run IDs with rleid.
# convert data.frame to data.table
library(data.table)
setDT(d)
# set minimum length of runs to be expanded
len = 2L
# set number of values to replace on each side of run
n = 2L
# number of columns of original data (for truncation of indices)
nc = ncol(d)
# create a row index to keep track of the original rows in the long format
d[ , ri := 1:.N]
# melt from wide to long format
d2 = melt(d, id.vars = "ri")
# order by row index
setorder(d2, ri)
Now the actual calculations on the runs and their indices:
d2[
# check if the run is an "NA run" and has sufficient length
d2[ , if(anyNA(value) & .N >= len){
# get indices before and after run, where values should be changed to NA
ix = c(.I[1] - n:1L, .I[.N] + 1L:n)
# truncate indices to keep them within (original) rows
ix[ix >= 1 + (ri - 1) * nc & ix <= nc * ri]},
# perform the calculation by row index and run index
# grab replacement indices
by = .(ri, rleid(is.na(value)))]$V1,
# at replacement indices, set value to NA
value := NA]
If desired, cast back to wide format
dcast(d2, ri ~ variable, value.var = "value")
# ri A B C D E F G H I J
# 1: 1 1 NA NA NA NA NA NA 8 NA 10
# 2: 2 NA NA NA NA NA NA NA NA NA NA
# 3: 3 1 2 NA NA NA NA NA NA NA 10

Type coercion worked for me:
rle(as.logical(is.na(x[MyRow, ])))
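Once rle runs on a plain logical vector, the remaining steps from the question can be sketched in base R. The helper name extend_na_runs and its arguments len and n are hypothetical; the sketch assumes the runs are detected on the original values only, so newly added NAs do not trigger further extension.

```r
# Hypothetical helper: for every NA run of at least `len` cells,
# also set the `n` cells before and after the run to NA.
extend_na_runs <- function(x, len, n) {
  r      <- rle(is.na(x))
  ends   <- cumsum(r$lengths)          # last index of each run
  starts <- ends - r$lengths + 1L      # first index of each run
  for (k in which(r$values & r$lengths >= len)) {
    idx <- c(starts[k] - seq_len(n), ends[k] + seq_len(n))
    idx <- idx[idx >= 1L & idx <= length(x)]   # stay inside the vector
    x[idx] <- NA
  }
  x
}

x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
extend_na_runs(x, len = 4, n = 3)
```

For a whole data frame, one option would be t(apply(as.matrix(df), 1, extend_na_runs, len = 75, n = 50)), which returns a matrix with one row per original row.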

Here is my solution for this. I wonder if there is a tidier solution than mine though.
library(data.table)
library(dplyr) # provides %>%, as_tibble() and slice() used below
df <- matrix(nrow = 1,ncol = 16)
df[1,] <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
df <- df %>%
as.data.table() # dataset created
# A function to do what you need
NA_replacer <- function(x){
Vector <- unlist(x) # pull the values into a vector
NAs <- which(is.na(Vector)) # locate the positions of the NAs
NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
NAs_Position_2 <- rle(NAs_Position_1) # Find their values
NAs <- NAs[which(
NAs_Position_1 == with(NAs_Position_2,
values[which(
lengths == 4)]))] # Locate the position of those NAs that are repeated exactly 4 times
if(length(NAs) == 4){ # Check whether there is a stretch of 4 NAs
Vector[seq(NAs[1]-3,
NAs[1]-1,1)] <- NA # this part deals with the 3 positions occurring before the first NA
Vector[seq(NAs[length(NAs)]+1,
NAs[length(NAs)]+3,1)] <- NA # this part deals with the 3 positions occurring after the last NA
}
Vector
}
> df # the original dataset
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 5 4 3 NA NA NA NA 6 9 3 2 4 3
# the transformed dataset
apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 NA NA NA NA NA NA NA NA NA NA 2 4 3
As an aside, the speed is quite good for a hypothetical dataframe sized 640*2500 where a stretch of 75 or more NAs have to be located and the 50 values before and after must be replaced with an NA.
df <- matrix(nrow = 640,ncol = 2500)
for(i in 1:nrow(df)){
df[i,] <- c(1:100,rep(NA,75),rep(1,2325))
}
NA_replacer <- function(x){
Vector <- unlist(x) # pull the values into a vector
NAs <- which(is.na(Vector)) # locate the positions of the NAs
NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
NAs_Position_2 <- rle(NAs_Position_1) # Find their values
NAs <- NAs[which(
NAs_Position_1 == with(NAs_Position_2,
values[which(
lengths >= 75)]))] # Locate the position of those NAs that are repeated 75 times or more
if(length(NAs) >= 75){ # Check if the condition is met
Vector[seq(NAs[1]-50,
NAs[1]-1,1)] <- NA # this part deals with the 50 positions occurring before the first NA
Vector[seq(NAs[length(NAs)]+1,
NAs[length(NAs)]+50,1)] <- NA # this part deals with the 50 positions occurring after the last NA
}
Vector
}
# Check how many NAs are present in the first row of the dataset prior to applying the function
which(is.na(df %>%
as_tibble() %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 75
df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
# Check how many NAs are present in the first row post applying the function
which(is.na(df %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 175
system.time(df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose())
user system elapsed
0.216 0.002 0.220

recoding integers in a vector so they register as NA instead?

I hope this isn't a silly question, but I am REALLY struggling to recode a variable in R so that certain values register as NA instead of the placeholder integer that got read in. Respondents who did not answer the question for that column were originally coded as -88, -89 and -99 instead of NA, and I only know how to remove them completely from that column.
I want to keep those rows and just have those inputs registered as missing. Recode doesn't seem to work because NA isn't a value.
Thanks!

Maybe you can try replace
v <- replace(v,v%in%c(-88,-89,-99),NA)
such that
> v
[1] 1 2 NA NA -1 NA NA
Dummy Data
v <- c(1,2,-88,-89,-1,-99,-89)

You can use the %in% operator to find all positions in a vector which match with another vector, and then set them to NA as follows:
dat = data.frame(V1 = c(10, 20, 30, -88, -89, -99))
dat$V1[dat$V1 %in% c(-88, -89, -99)] = NA
dat
V1
1 10
2 20
3 30
4 NA
5 NA
6 NA

Here's one way to do it, which will replace all values of -88, -89 and -99 in your data:
for (i in c(-88, -89, -99)){
data.df[data.df == i] <- NA
}
If you need to just replace in one column (e.g. column 'x'):
for (i in c(-88, -89, -99)){
data.df$x[data.df$x == i] <- NA
}
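If the placeholder codes appear in several columns, the same idea can be applied column by column with %in%, which never returns NA and so behaves predictably when a column already contains missing values. The sketch below uses made-up data and hypothetical names (dat, codes, num_cols).

```r
# made-up example data; -88, -89 and -99 are the placeholder codes
dat <- data.frame(x = c(10, -88, 30),
                  y = c(-99, 5, -89),
                  z = c("a", "b", "c"))
codes <- c(-88, -89, -99)

# restrict the recoding to numeric columns
num_cols <- names(dat)[vapply(dat, is.numeric, logical(1))]

# replace the placeholder codes with NA in each numeric column
dat[num_cols] <- lapply(dat[num_cols], function(v) replace(v, v %in% codes, NA))
dat
```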

The most appropriate answer to this question depends on the exact specifics of your data. If you have a numeric variable and all other values are positive, this would work:
somedata <-
tibble::tribble(
~v1, ~v2,
1, 2,
3, 4,
-88, 5,
6, -89,
-99, 1
)
library(tidyverse)
somedata %>%
mutate(v1 = ifelse(v1 < 0, NA, v1))
# A tibble: 5 x 2
v1 v2
<dbl> <dbl>
1 1 2
2 3 4
3 NA 5
4 6 -89
5 NA 1

Thanks so much again to everyone for your help!
I first converted the variable to numeric, then this seemed to work for me:
anesCSV$clinton.withNA <- replace(anesCSV$clintonthermo_numeric,anesCSV$clintonthermo_numeric%in%c(-88,-89,-99),NA)
As someone initially suggested:
v <- replace(v,v%in%c(-88,-89,-99),NA)
I did create a new variable to store the results personally!

Subtract rows with numeric values and ignore NAs

I have several data frames containing 18 columns with approx. 50000 rows. Each row entry represents a measurement at a specific site (= column), and the data contain NA values.
I need to subtract the consecutive rows per column (e.g. row(i+1)-row(i)) to detect threshold values, but I need to ignore (and retain) the NAs, so that only the entries with numeric values are subtracted from each other.
I found very helpful posts with data.table solutions for a single column Iterate over a column ignoring but retaining NA values in R, and for multiple column operations (e.g. Summarizing multiple columns with dplyr?).
However, I haven't managed to combine the approaches suggested in SO (i.e. apply diff over multiple columns and ignore the NAs)
Here's an example df for illustration and a solution I tried:
library(data.table)
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
That's how it works for a single column (after converting df to a data.table with setDT):
diff_x <- setDT(df)[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but for more columns at once
and that's how I apply a diff function over several columns with lapply
diff_all <- setDT(df)[,lapply(.SD, diff)] # not exactly what I want because NAs are not ignored and the difference between numeric values is not calculated
I'd appreciate any suggestion (base, data.table, dplyr ,... solutions) on how to implement a valid !is.na or similar statement into this second line of code very much.

Defining a helper function makes things a bit cleaner:
lag_diff <- function(x) {
which_nna <- which(!is.na(x))
out <- rep(NA_integer_, length(x))
out[which_nna] <- x[which_nna] - shift(x[which_nna])
out
}
cols <- c("x", "y", "z")
setDT(df)
df[, paste0("lag_diff_", cols) := lapply(.SD, lag_diff), .SDcols = cols]
Result:
# x y z lag_diff_x lag_diff_y lag_diff_z
# 1: 1 NA 6 NA NA NA
# 2: 2 4 2 1 NA -4
# 3: 3 5 7 1 1 5
# 4: NA 6 14 NA 1 7
# 5: NA NA 20 NA NA 6
# 6: 9 15 NA 6 9 NA
# 7: 8 14 NA -1 -1 NA
# 8: 7 13 2 -1 -1 -18

So you are looking for:
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
setDT(df)
# diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
df[, lapply(.SD, lag_d)]
or
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
as.data.frame(lapply(df, lag_d))
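For completeness, the same helper can be sketched without data.table by replacing shift() with base diff(). The function name lag_diff_base is hypothetical; the data are those from the question.

```r
df <- data.frame(x = c(1:3, NA, NA, 9:7),
                 y = c(NA, 4:6, NA, 15:13),
                 z = c(6, 2, 7, 14, 20, NA, NA, 2))

# difference between consecutive non-NA values; NA positions are kept as NA
lag_diff_base <- function(x) {
  ok <- !is.na(x)
  x[ok] <- c(NA, diff(x[ok]))   # first observed value has no predecessor
  x
}

res <- as.data.frame(lapply(df, lag_diff_base))
res
```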

Sort vector keeping NAs position in R

Problem 1 (solved)
How can I sort vector DoB:
DoB <- c(NA, 9, NA, 2, 1, NA)
while keeping the NAs in the same position?
I would like to get:
> DoB
[1] NA 1 NA 2 9 NA
I have tried this (borrowing from this answer)
NAs_index <- which(is.na(DoB))
DoB <- sort(DoB, na.last = NA)
for(i in 0:(length(NAs_index)-1))
DoB <- append(DoB, NA, after=(NAs_index[i+1]+i))
but
> DoB
[1] 1 NA 2 9 NA NA
Answer is
DoB[!is.na(DoB)] <- sort(DoB)
Thanks to @BigDataScientist and @akrun
Now, Problem 2
Say I have a vector id
id <- 1:6
that I would also like to reorder on the same principle, so that the values of id follow order(DoB) while the entries at the NA positions of DoB stay fixed:
> id
[1] 1 5 3 4 2 6

You could do:
DoB[!is.na(DoB)] <- sort(DoB)
Edit: Concerning the follow up question in the comments:
You can use order() for that and handle the NAs with the na.last parameter:
data <- data.frame(DoB = c(NA, 9, NA, 2, 1, NA), id = 1:6)
data$id[!is.na(data$DoB)] <- order(data$DoB, na.last = NA)
data$DoB[!is.na(data$DoB)] <- sort(data$DoB)

We create a logical index and then do the sort
i1 <- is.na(DoB)
DoB[!i1] <- sort(DoB[!i1])
DoB
#[1] NA 1 NA 2 9 NA

propagating data within a vector

I'm learning R and I'm curious... I need a function that does this:
> fillInTheBlanks(c(1, NA, NA, 2, 3, NA, 4))
[1] 1 1 1 2 3 3 4
> fillInTheBlanks(c(1, 2, 3, 4))
[1] 1 2 3 4
and I produced this one... but I suspect there's a more R way to do this.
fillInTheBlanks <- function(v) {
## replace each NA with the latest preceding available value
orig <- v
result <- v
for(i in 1:length(v)) {
value <- v[i]
if (!is.na(value))
result[i:length(v)] <- value
}
return(result)
}

Package zoo has a function na.locf():
R> library("zoo")
R> na.locf(c(1, 2, 3, 4))
[1] 1 2 3 4
R> na.locf(c(1, NA, NA, 2, 3, NA, 4))
[1] 1 1 1 2 3 3 4
na.locf: Last Observation Carried Forward;
Generic function for replacing each ‘NA’ with the most recent non-‘NA’ prior to it.
See the source code of the function na.locf.default, it doesn't need a for-loop.

I'm doing some minimal copy&paste from the zoo library (thanks again rcs for pointing me at it) and this is what I really needed:
fillInTheBlanks <- function(S) {
## NA in S are replaced with observed values
## accepts a vector possibly holding NA values and returns a vector
## where all observed values are carried forward and the first is
## also carried backward. cfr na.locf from zoo library.
L <- !is.na(S)
c(S[L][1], S[L])[cumsum(L)+1]
}
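To see why the cumsum indexing works, it may help to trace the intermediate vectors on a small input. The input below is made up and starts with an NA to show the backward carry; the variable names are for illustration only.

```r
S <- c(NA, 1, NA, NA, 2, 3, NA, 4)

L        <- !is.na(S)                 # TRUE where a value was observed
observed <- S[L]                      # 1 2 3 4
lookup   <- c(observed[1], observed)  # 1 1 2 3 4 (first value duplicated)
idx      <- cumsum(L) + 1             # 1 2 2 2 3 4 4 5

# cumsum(L) counts the observed values seen so far, so idx points at the
# latest observation; index 1 covers positions before the first observation
filled <- lookup[idx]
filled                                # 1 1 1 1 2 3 3 4
```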

Just for fun (since it's slower than fillInTheBlanks), here's a version of na.locf relying on rle function:
my.na.locf <- function(v,fromLast=F){
if(fromLast){
return(rev(my.na.locf(rev(v))))
}
nas <- is.na(v)
e <- rle(nas)
v[nas] <- rep.int(c(NA,v[head(cumsum(e$lengths),-1)]),e$lengths)[nas]
return(v)
}
e.g.
v1 <- c(3,NA,NA,NA,1,2,NA,NA,5)
v2 <- c(NA,NA,NA,1,7,NA,NA,5,NA)
my.na.locf(v1)
#[1] 3 3 3 3 1 2 2 2 5
my.na.locf(v2)
#[1] NA NA NA 1 7 7 7 5 5
my.na.locf(v1,fromLast=T)
#[1] 3 1 1 1 1 2 5 5 5
my.na.locf(v2,fromLast=T)
#[1] 1 1 1 1 7 5 5 5 NA

Another simple answer. This one allows for the first value being NA: nothing precedes it to carry forward, so my loop starts from index 2.
my_vec <- c(1, NA, NA, 2, 3, NA, 4)
fill.it <- function(vector){
new_vec <- vector
for (i in 2:length(new_vec)){
if(is.na(new_vec[i])) {
new_vec[i] <- new_vec[i-1]
} else {
next
}
}
return(new_vec)
}

Several R packages include an na.locf function that does exactly that (imputeTS, zoo, spacetime, ...).
Here is an example with imputeTS:
library("imputeTS")
x <- c(1, NA, NA, 2, 3, NA, 4)
na.locf(x)
There are also more advanced methods for replacing missing values provided by the imputeTS package. (and by zoo also)
