Recode values omitting NA's - r

I want to recode the values in a matrix so that all values <=.2 become 2, values <=.4 become 3, etc. However, there are missing values in my data, which I do not want to change (they should stay NA). Here is a simplified version of my code. Using na.omit works perfectly for the first change:
try <- matrix(c(0.78,0.62,0.29,0.47,0.30,0.63,0.30,0.20,0.15,0.58,0.52,0.64,
0.76,0.32,0.64,0.50,0.67,0.27, NA), nrow = 19)
try[na.omit(try <= .2)] <- 2 #Indeed changes .20 and .15 to 2 and leaves the NA as NA
However, when I do the same for a higher category, the NA is also changed:
try[na.omit(try <= .8)] <- 5 #changes all other values including the NA to 5
Can someone explain to me what is the difference between the two and why the second one also changes the NA-value while the first one does not? Or am I doing something else wrong?

You can do
try[try <= .8] <- 5
The NA values will remain as NA
Or create a logical condition to exclude the NA values
try[try <=.8 & !is.na(try)] <- 5
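As to why the two na.omit calls behave differently, a likely explanation (my reading of how R recycles logical indices, not something stated above) is that na.omit drops the NA from the logical index, leaving 18 values for 19 elements. R recycles a logical index that is too short, so element 19 (the NA) is indexed by the first logical value again: FALSE for 0.78 <= .2, TRUE for 0.78 <= .8. A minimal sketch on a plain vector, where the name x and the helper names are only illustrative (x is used to avoid masking base::try):

```r
x <- c(0.78, 0.62, 0.29, 0.47, 0.30, 0.63, 0.30, 0.20, 0.15, 0.58, 0.52, 0.64,
       0.76, 0.32, 0.64, 0.50, 0.67, 0.27, NA)

idx_low  <- na.omit(x <= .2)   # 18 logicals; the first one is FALSE (0.78 <= .2)
idx_high <- na.omit(x <= .8)   # 18 logicals; the first one is TRUE  (0.78 <= .8)
length(idx_low)                # one shorter than x

# The index is recycled, so element 19 reuses the first logical value
low <- x
low[idx_low] <- 2
is.na(low[19])                 # the NA survives, but only by accident

high <- x
high[idx_high] <- 5
high[19]                       # the NA has been overwritten

# Excluding NA explicitly keeps the index the same length as x
safe <- x
safe[!is.na(safe) & safe <= .8] <- 5
is.na(safe[19])
```

So na.omit is not doing what it appears to do here; whether the NA is kept depends only on the first value of the data.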

Related

How to fill missing values in a vector with the mean of value before and after the missing one

Currently I am trying to impute values in a vector in R. The conditions of the imputation are:
Find all NA values.
Then check if they have an existing value before and after them.
Also check if the value which follows the NA is larger than the value before the NA.
If the conditions are met, calculate a mean taking the values before and after.
Replace the NA value with the imputed one.
# example one
input_one = c(1,NA,3,4,NA,6,NA,NA)
# example two
input_two = c(NA,NA,3,4,5,6,NA,NA)
# example three
input_three = c(NA,NA,3,4,NA,6,NA,NA)
I started out to write code to detect the values which can
be imputed. But I got stuck with the following.
# incomplete function to detect the values
sapply(split(!is.na(input[c(rbind(which(is.na(c(input)))-1, which(is.na(c(input)))+1))]),
rep(1:(length(!is.na(input[c(which(is.na(c(input)))-1, which(is.na(c(input)))+1)]))/2), each = 2)), all)
This however only detects the NAs which might be
imputable and it only works with example one. It is incomplete and
unfortunately super hard to read and understand.
Any help with this would be highly appreciated.
We can use dplyr's lag and lead functions for that:
input_three = c(NA,NA,3,4,NA,6,NA,NA)
library(dplyr)
ifelse(is.na(input_three) & lead(input_three) > lag(input_three),
(lag(input_three) + lead(input_three))/ 2,
input_three)
Returns:
[1] NA NA 3 4 5 6 NA NA
Edit
Explanation:
We use ifelse which is the vectorized version of if. I.e. everything within ifelse will be applied to each element of the vectors.
First we test whether each element is NA and whether the following element is greater than the previous one. To get the previous and following elements we can use dplyr's lag and lead functions:
lag offsets a vector to the right (default is 1 step):
lag(1:5)
Returns:
[1] NA 1 2 3 4
lead offsets a vector to the left:
lead(1:5)
Returns:
[1] 2 3 4 5 NA
Now to the 'test' clause of ifelse:
is.na(input_three) & lead(input_three) > lag(input_three)
Which returns:
[1] NA NA FALSE FALSE TRUE FALSE NA NA
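Then, where the test evaluates to TRUE, we return the sum of the previous and following elements divided by 2; otherwise we return the original element. Where the test is NA, ifelse returns NA, which can only happen at positions that were already NA (for a non-NA element, is.na() is FALSE and FALSE & NA is FALSE).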
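If you would rather not load dplyr, the same idea can be sketched in base R by shifting the vector by hand. This is only an illustration under the conditions listed in the question; the names prev, nxt and fill are made up for the example:

```r
x <- c(NA, NA, 3, 4, NA, 6, NA, NA)

prev <- c(NA, head(x, -1))   # value before each element (like lag)
nxt  <- c(tail(x, -1), NA)   # value after each element (like lead)

# TRUE only where x is NA, both neighbours exist, and next > previous.
# Because FALSE & NA is FALSE, this index never contains NA.
fill <- is.na(x) & !is.na(prev) & !is.na(nxt) & nxt > prev

x[fill] <- (prev[fill] + nxt[fill]) / 2
x
```

Like the ifelse version, this only fills single NAs that sit between two existing values.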
Here's an example using the imputeTS library. It takes account of more than one NA in the sequence, ensures that the mean is calculated only if the next valid observation is greater than the last valid observation, and also ignores NA at the beginning and end.
library(imputeTS)
library(dplyr) # provides lag() and lead(), which the function below relies on
myimpute <- function(series) {
# Find where each NA is
nalocations <- is.na(series)
# Find the previous and the next observation for each row
last1 <- lag(series)
next1 <- lead(series)
# Carry forward the last and next observations over sequences of NA
# Each row will then get a last and next that can be averaged
cflast <- na_locf(last1, na_remaining = 'keep')
cfnext <- na_locf(next1, option = 'nocb', na_remaining = 'keep')
# Make a data frame
df <- data.frame(series, nalocations, last1, cflast, next1, cfnext)
# Calculate the mean where there is currently a NA
# making sure that the next is greater than the last
df$mean <- ifelse(df$nalocations, ifelse(df$cflast < df$cfnext, (df$cflast+df$cfnext)/2, NA), NA)
imputedseries <- ifelse(df$nalocations, ifelse(!is.na(df$mean), df$mean, NA), series)
#list(df, imputedseries) # comment this in and return it to see the intermediate data frame for debugging
imputedseries
}
myimpute(c(NA,NA,3,4,NA,NA,6,NA,NA,8,NA,7,NA,NA,9,NA,11,NA,NA))
# [1] NA NA 3 4 5 5 6 7 7 8 NA 7 8 8 9 10 11 NA NA
There is also the na_ma function in the imputeTS package for imputing moving averages.
In your case this would be with the following settings:
na_ma(x, k = 1, weighting = "simple")
k = 1 (meaning 1 value before and 1 after the NA are taken into account)
weighting = "simple" (the mean of these two values is calculated)
This can be applied quite easily with basically 1 line of code:
library(imputeTS)
na_ma(yourData, k = 1, weighting = "simple")
You could also take more values before and after the NA into account, e.g. k = 3. An interesting feature when you take more than one value on each side into account is the possibility of choosing a different weighting: with weighting = "linear", for example, the weights decrease in arithmetic progression (a linear weighted moving average), meaning that the further the values are from the NA, the less impact they have.

Fill column based on several conditions, with priorities of those conditions

I have a data set like this:
Age <- rnorm(n=100, mean=20, sd=5)
ind <- which(Age %in% sample(Age, 50))
Age[ind]<-NA
Age2 <- rnorm(n=100, mean=20, sd=5)
ing <- which(Age2 %in% sample(Age2, 50))
Age2[ing]<-NA
Age3 <- rnorm(n=100, mean=20, sd=5)
int <- which(Age3 %in% sample(Age3, 50))
Age3[int]<-NA
data<-data.frame(Age,Age2,Age3)
It's an old data set several different people put together where multiple columns mean the same thing (there are several columns for age in the real data set). As you can see, there are quite a few NA's. I'd like to create a unified "age" column. To do this, I'd ideally use the number from the first age column, but if that is NA I'd then preferentially use the number from Age2, and if it is also NA I'd use Age3, in that order (Age3 would never supersede Age2, etc.), as I trust the people who input the data in that order haha.
I'm aware of other answers on here for filling columns based on several conditions, like so: dplyr replacing na values in a column based on multiple conditions
But I'm not sure how to place priorities. Thank you!
You can use coalesce() from dplyr which will fill based on the first non-missing value from left to right.
library(dplyr)
df <-data.frame(Age,Age2,Age3)
df$new_age <- coalesce(!!!df)
head(df)
Age Age2 Age3 new_age
1 17.19762 NA NA 17.19762
2 18.84911 21.17693 NA 18.84911
3 27.79354 NA NA 27.79354
4 NA 15.19072 NA 15.19072
5 NA NA 27.99254 27.99254
6 28.57532 NA 19.55717 28.57532
A base R possibility could be:
apply(data, 1, function(x) x[which(!is.na(x))[1]])
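To see the priority order at work, here is a small deterministic sketch in base R. The values are made up for illustration, and Reduce with ifelse is just one way to express "take the first non-missing value from left to right"; it is not taken from the answers above:

```r
# Made-up data: each row tests a different fallback level
data <- data.frame(
  Age  = c(17, NA, NA, NA),
  Age2 = c(21, 15, NA, NA),
  Age3 = c(30, 28, 27, NA)
)

# Walk the columns left to right, keeping what is already filled
# and taking the next column only where the running result is NA
unified <- Reduce(function(a, b) ifelse(is.na(a), b, a), data)
unified
```

Row 1 keeps Age even though Age2 and Age3 are present, row 2 falls back to Age2, row 3 to Age3, and row 4 stays NA because nothing is available.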

Replace NA for multiple columns with average of values from other dataframe

I am trying to replace NA values in multiple columns from dataframe x1 by the average of the values from dataframes x2 and x3, based on the common and distinct attribute 'ID'.
All the dataframes(each dataframe is for a particular year) have the same column structure:
ID A B C .....
01 2 5 7 .....
02 NA NA NA .....
03 5 4 8 .....
I have found an answer to do it for 1 column at a time, thanks to this post.
x1$A[is.na(x1$A)] <- (x2$A[match(x1$ID[is.na(x1$A)],x2$ID)] + x3$A[match(x1$ID[is.na(x1$A)],x3$ID)])/2
But since I have about 100 columns to apply this to, I would really like to have a smarter way to do it.
I tried the suggestions from this post and also from here.
I came up with this code, but couldn't make it work.
x1[6:105] = as.data.frame(lapply(x1[6:105], function(x) ifelse(is.na(x), (x2$x[match(x1$ID, x2$ID)]+x3$x[match(x1$ID, x3$ID)])/2, x1$x)))
Got the following error:
Error in ifelse(is.na(x), (x2$x[match(x1$ID, x2$ID)] + x3$x[match(x1$ID, : replacement has length zero
I initially thought function(x) worked on the entire column and x represented the column name, but I think it represents each individual cell value and that is why it won't work.
I am a novice in R, I would surely appreciate some guidance to let me know where I am going wrong, in applying the logic to multiple columns.
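for (i in 1:ncol(x1)) {
  nas <- is.na(x1[,i]) # where are NAs
  if (sum(nas)==0) next # nothing to fill in this column
  ids <- x1$ID[nas] # ids of NAs
  nam <- colnames(x1)[i] # colname of the column
  # average the rows of x2 and x3 that match those ids (matched on ID, as in the question)
  x1[nas, i] <- (x2[match(ids, x2$ID), nam] + x3[match(ids, x3$ID), nam]) / 2
}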
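On the lapply attempt in the question: as far as I can tell, lapply over a data frame passes each column's values to the function, not its name, so x2$x looks for a column literally called "x", gets NULL, and that would explain the "length zero" error. One way around it is to iterate over the column names instead. A sketch on made-up data (the three small data frames are only for illustration):

```r
x1 <- data.frame(ID = c(1, 2, 3), A = c(2, NA, 5), B = c(5, NA, 4))
x2 <- data.frame(ID = c(3, 2, 1), A = c(6, 4, 2), B = c(1, 3, 5))
x3 <- data.frame(ID = c(1, 2, 3), A = c(2, 8, 6), B = c(5, 7, 1))

cols <- setdiff(names(x1), "ID")   # every column except the key

x1[cols] <- lapply(cols, function(nm) {
  # average of x2 and x3 for this column, aligned to the row order of x1
  fill <- (x2[[nm]][match(x1$ID, x2$ID)] + x3[[nm]][match(x1$ID, x3$ID)]) / 2
  ifelse(is.na(x1[[nm]]), fill, x1[[nm]])
})
x1
```

In the real data, cols would be something like names(x1)[6:105].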

Combine rows into one, replace NA

I have the following dataframe: (this is just a small sample)
VALUE COUNT AREA n_dd-2000 n_dd-2001 n_dd-2002 n_dd-2003 n_dd-2004 n_dd-2005 n_dd-2006 n_dd-2007 n_dd-2008 n_dd-2009 n_dd-2010
2 16 2431 243100 NA NA NA NA NA NA 3.402293 3.606941 4.000461 3.666381 3.499614
3 16 2610 261000 3.805082 4.013435 3.98 3.490139 3.433857 3.27813 NA NA NA NA NA
4 16 35419 3541900 NA NA NA NA NA NA NA NA NA NA NA
and I would like to combine all three rows into one row replacing NA with the number that appears in each column (there's only one number per column). Just ignore the first three columns. I used this code:
bdep[4,4:9] <- bdep[3,4:9]
to replace NA's with numbers from another row, but can't figure out how to repeat it for all the columns. The columns 4 and beyond have a sequence in each row of six numbers followed by 20 NA's, so I've tried going down the road of using lapply() and seq() or for loops, but my efforts are failing.
A simple solution is to replace the NAs with zeros and sum each column. Does this work for you?
#data
bdep <- rbind(c(rep(NA,6),3.402293,3.606941,4.000461,3.666381,3.499614),
c(3.805082,4.013435,3.98,3.490139,3.433857,3.27813, rep(NA,5)),
c(rep(NA,11)))
#solution
bdep2 <- ifelse(is.na(bdep), 0, bdep)
bdep3 <- apply(bdep2, 2, sum)
bdep3 #the row you want?
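The same result can probably be had in one step with colSums and na.rm = TRUE, which skips the NAs instead of replacing them first. A sketch on the same sample matrix (combined is just an illustrative name):

```r
bdep <- rbind(c(rep(NA, 6), 3.402293, 3.606941, 4.000461, 3.666381, 3.499614),
              c(3.805082, 4.013435, 3.98, 3.490139, 3.433857, 3.27813, rep(NA, 5)),
              rep(NA, 11))

# Sum each column, ignoring NA; with one number per column
# the sum is simply that number
combined <- colSums(bdep, na.rm = TRUE)
combined
```

One caveat that applies to both versions: a column that is NA in every row comes out as 0, not NA.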
I finally came to a solution by patching together some code I found in other posts (esp. sequencing and for loops). I think this would be considered messy coding, so I'd welcome other solutions. This should better describe what I was trying to do in the OP, where I was trying to generalize too much. Specifically, I have 17 variables, measured over 14 years (that's 238 columns), and something happened while generating these data where the first 6 years of a variable are in one row and the following 8 years are in the other row, so instead of re-running the model, I just wanted to combine the two rows into one.
Below are some sample data, simplified from my real scenario.
Create the data frame:
df <- data.frame(
VALUE = c(16, 16, 16),
COUNT = c(2431, 2610, 35419),
AREA = c(243100, 261000, 3541900),
n_dd_2000 = c(NA, 3.805, NA),
n_dd_2001 = c(3.402, NA, NA)
)
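The next two lines define which columns to copy. In each, start gives the first column of a block and len gives the length of that block: info covers two single-column blocks starting at column 4 (columns 4 and 5, to be taken from row 2), and info2 covers the single-column block at column 5 (to be taken from row 1):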
info <- data.frame(start=seq(4, by=1, length.out=2), len=rep(1,2))
info2 <- data.frame(start=seq(5, by=1, length.out=1), len=rep(1,2))
This is the code from my real dataset, where I started at column 4, repeated the pattern every 14 columns, 17 times, and looked at the first 6, then 8 columns:
info <- data.frame(start=seq(4, by=14, length.out=17), len=rep(c(6,8),17))
The two for loops below write the specified values in the sequence from row 2 and row 1 to row 3, respectively:
foo = sequence(info$len) + rep(info$start-1, info$len)
foo2 = sequence(info2$len) + rep(info2$start-1, info2$len)
for(n in 1:length(foo)){
df[3,foo[n]] <- df[2,foo[n]]
}
for(n in 1:length(foo2)){
df[3,foo2[n]] <- df[1,foo2[n]]
}
Then I removed the first two rows I got those values from and I'm left with one complete row, no NA's:
df <- df[-(1:2),]
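For what it's worth, if there really is at most one number per column, the bookkeeping with start and len may not be needed at all: you can take the first non-NA value of each column. A sketch on the sample data frame above, where first_non_na and merged are made-up names for illustration:

```r
df <- data.frame(
  VALUE = c(16, 16, 16),
  COUNT = c(2431, 2610, 35419),
  AREA = c(243100, 261000, 3541900),
  n_dd_2000 = c(NA, 3.805, NA),
  n_dd_2001 = c(3.402, NA, NA)
)

# First non-missing value of a column (NA if the whole column is NA)
first_non_na <- function(col) col[!is.na(col)][1]

merged <- df[3, ]                                 # keep the identifying columns of row 3
merged[4:5] <- lapply(df[4:5], first_non_na)      # fill the measurement columns
merged
```

For the real data the column range 4:5 would be replaced by the range of measurement columns.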

Conditional searching which omits NA values

I'm doing a conditional search of part of a dataset that has multiple NA values within each row.
Something like this (a preview)..
time1 time2 time3 time4 slice1 slice2 slice3 slice4
pt1 1 3 NA NA NA 1 3 5
pt2 NA 1 3 5 5 2 2 4
I want to do some conditional searching which applies a condition (comparing whether one column within a row is larger than another) for each row. I want to find all the rows (pt's) where a variable column (e.g. time1) is smaller than the corresponding column (e.g. slice 1).
all.smaller<-subset(patientdata, time1>slice1 & time2>slice2 & time3>slice3 & time4>slice4, na.rm=TRUE, select=c(1))
When I use this code (on a larger expanded table of this format), it only returns the rows without any NAs, where all the values are filled in. This makes sense given the use of '&'.
My question is: Is there a way to run this search so that the NA's are ignored, i.e. so that a row is returned when the condition (time1>slice1, time2>slice2 etc.) holds for every pair of columns where values are provided?
Any help is appreciated. Thanks.
You can make a function that takes a boolean (possibly NA) and maps it to TRUE if it is NA and its value otherwise.
na.true <- function(x) ifelse(is.na(x), TRUE, x)
You can then replace the condition in your subset call with
na.true(time1 > slice1) & na.true(time2 > slice2) & na.true(time3 > slice3) & na.true(time4 > slice4)
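Here is how that plays out on the two rows from the preview in the question. The data frame is rebuilt by hand from that preview, and kept is just an illustrative name:

```r
patientdata <- data.frame(
  time1  = c(1, NA),  time2  = c(3, 1), time3  = c(NA, 3), time4  = c(NA, 5),
  slice1 = c(NA, 5),  slice2 = c(1, 2), slice3 = c(3, 2),  slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)

# An NA comparison counts as "no objection", i.e. TRUE
na.true <- function(x) ifelse(is.na(x), TRUE, x)

kept <- subset(patientdata,
               na.true(time1 > slice1) & na.true(time2 > slice2) &
               na.true(time3 > slice3) & na.true(time4 > slice4))
rownames(kept)
```

pt1 is kept because its only complete pair (time2 = 3, slice2 = 1) passes; pt2 is dropped because time2 = 1 is not greater than slice2 = 2.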
You could try this.
n=1:4
cond <- paste0('((is.na(time',n,')|is.na(slice',n,'))|(time',n,'>slice',n,'))')
conds <- paste(cond, collapse=' & ')
all.smaller <- subset( patientdata, eval(parse(text=conds)) )
Essentially this checks if either time or slice are NA and forces a TRUE, and if not, check whether time is greater than slice. (Individually for each index.) It becomes clearer if you print out conds to see what it looks like.
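If you would rather avoid eval(parse()), the same mask can be built with a plain loop over the column indices. This is a sketch of an alternative, not the method above; it assumes the columns are named time1..time4 and slice1..slice4 as in the question, and uses the two preview rows as data:

```r
patientdata <- data.frame(
  time1  = c(1, NA),  time2  = c(3, 1), time3  = c(NA, 3), time4  = c(NA, 5),
  slice1 = c(NA, 5),  slice2 = c(1, 2), slice3 = c(3, 2),  slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)

keep <- rep(TRUE, nrow(patientdata))
for (n in 1:4) {
  cmp <- patientdata[[paste0("time", n)]] > patientdata[[paste0("slice", n)]]
  # NA means one side is missing: treat the pair as passing
  keep <- keep & (is.na(cmp) | cmp)
}

all.smaller <- patientdata[keep, ]
rownames(all.smaller)
```

The loop scales to any number of time/slice pairs by changing 1:4.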
