Imagine a small data set like the one below, composed of three variables:
v1 <- c(0, 1, NA, 1, NA, 0)
v2 <- c(0, 0, NA, 1, NA, NA)
v3 <- c(1, NA, 0, 0, NA, 0)
df <- data.frame(v1, v2, v3)
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
One can use the is.na command as follows to calculate the number of rows with at least one missing value - and R would return 4:
sum(is.na(df$v1) | is.na(df$v2) | is.na(df$v3))
Or the number of rows with all three values missing - and R would return 1:
sum(is.na(df$v1) & is.na(df$v2) & is.na(df$v3))
Two questions at this point:
(1) How can I calculate the number of rows where "exactly one" or "exactly two" values are missing?
(2) If I am to do the above in a large data set, how can I limit the scope of the calculation to v1, v2 and v3 (that is, without having to create a subset)?
I tried variations of is.na, nrow and df, but could not get any of them to work.
Thanks!
We can use rowSums on the logical matrix (is.na(df)) and check whether the number of NAs is equal to the value of interest.
n1 <- 1
sum(rowSums(is.na(df))==n1)
To make it easier, create a function to do this
f1 <- function(dat, n){
sum(rowSums(is.na(dat)) == n)
}
f1(df, 0)
#[1] 2
f1(df, 1)
#[1] 2
f1(df, 3)
#[1] 1
f1(df, 2)
#[1] 1
NOTE: rowSums is very fast, but on a large dataset materializing the full logical matrix is.na(df) can be costly in memory. Instead, we can loop over the columns (lapply(df, is.na)) and combine them with Reduce.
sum(Reduce(`+`, lapply(df, is.na))==1)
#[1] 2
f2 <- function(dat, n){
sum(Reduce(`+`, lapply(dat, is.na))==n)
}
f2(df, 1)
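To address question (2) in the post, a small illustrative addition: you can restrict the count to v1, v2 and v3 without storing a separate subset object by indexing those columns inline in either version.
sum(rowSums(is.na(df[c("v1", "v2", "v3")])) == 1)
#[1] 2
f2(df[c("v1", "v2", "v3")], 1)
#[1] 2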
Try this:
num.rows.with.x.NA <- function(df, x, cols=names(df)) {
return(sum(apply(df, 1, function(y) sum(is.na(y[cols])) == x)))
}
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
num.rows.with.x.NA(df, 0, names(df))
#[1] 2
num.rows.with.x.NA(df, 1, names(df))
#[1] 2
num.rows.with.x.NA(df, 2, names(df))
#[1] 1
num.rows.with.x.NA(df, 3, names(df))
#[1] 1
There are multiple ways to fill missing values in R. However, I can't find a solution that fills only the first n NAs of a gap and leaves the remaining NAs untouched.
Available options:
na_vector <- c(1, NA, NA, NA, 2, 3, NA, NA)
library(zoo)
na.locf(na_vector)
# Outputs: [1] 1 1 1 1 2 3 3 3
na.locf0(na_vector, maxgap = 2)
# Outputs: [1] 1 NA NA NA 2 3 3 3
How I would like it to be:
na_vector <- c(1, NA, NA, NA, 2, 3, NA, NA)
fill_na <- function(vector, n){
...
}
fill_na(na_vector, n = 1)
# Outputs: [1] 1 1 NA NA 2 3 3 NA
fill_na(na_vector, n = 2)
# Outputs: [1] 1 1 1 NA 2 3 3 3
Here is an option to get those outputs using dplyr and recursion: each pass of coalesce(vector, lag(vector)) fills one more NA at the start of every gap, so after n passes the first n NAs of each run are filled.
na_vector <- c(1, NA, NA, NA, 2, 3, NA, NA)
fill_na <- function(vector, n){
  if (n == 0) {
    vector
  } else {
    fill_na(
      vector = dplyr::coalesce(vector, dplyr::lag(vector)),
      n = n - 1
    )
  }
}
fill_na(na_vector, n = 1)
# [1] 1 1 NA NA 2 3 3 NA
fill_na(na_vector, n = 2)
# [1] 1 1 1 NA 2 3 3 3
Number the NA's in each consecutive run of NA's, giving a below, and then fill in only those whose number is less than or equal to n. This uses only vectorized operations internally, with no iteration or recursion.
library(collapse)
library(zoo)
fill_na <- function(x, n) {
a <- ave(x, groupid(is.na(x)), FUN = seq_along)
ifelse(a <= n, na.locf0(x), x)
}
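To make the mechanics concrete, here are the intermediate values for na_vector, worked out by hand for this example (collapse::groupid assigns one id per run of NA / non-NA values):
a <- ave(na_vector, groupid(is.na(na_vector)), FUN = seq_along)
# groupid(is.na(na_vector)) -> 1 2 2 2 3 3 4 4
# a                         -> 1 1 2 3 1 2 1 2  (position within its run)
# With n = 1, only positions where a <= 1 take the na.locf0() value,
# so just the first NA of each run is filled.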
fill_na(na_vector, 1)
## [1] 1 1 NA NA 2 3 3 NA
fill_na(na_vector, 2)
## [1] 1 1 1 NA 2 3 3 3
Here is a solution to impute everything except the last n NA's based on base R + imputeTS.
library(imputeTS)
na_vector <- c(1, NA, NA, NA, 2, 3, NA, NA)
# The function that allows imputing everything except the last n NAs
fill_except_last_n_na <- function(x, n) {
index <- which(rev(cumsum(rev(as.numeric(is.na(x))))) == n+1)
x[1:tail(index,1)] <- na_locf(x[1:tail(index,1)])
return(x)
}
# Call the new function
fill_except_last_n_na(na_vector,2)
## Result
[1] 1 1 1 1 2 3 NA NA
If you want to use an imputation option other than last observation carried forward, you can just replace na_locf with na_ma (moving average), na_interpolation (interpolation), na_kalman (Kalman smoothing on state space models), or another imputation function provided by the imputeTS package (see the imputeTS documentation for a list of functions).
I have a data set of boolean variables and I am trying to generate a new variable based on 3 of the existing booleans using ifelse().
The rules I'd like to implement are:
If any of the three columns have value 1, 1
If all of the three columns have value 0, 0
If all of the three columns have value NA, NA
If the three columns have some combination of 0 and NA, 0
Here is the code to generate a sample with 3 variables that I want to use to create a fourth:
df <- structure(list(var1 = c(NA, NA, NA, 0,1),
var2 = c(1, NA, 0,0, 1),
var3 = c(NA, NA, NA,0,1)), class = "data.frame", row.names = c(NA, -5L))
I have tried the following to generate the new variable according to my desired rules:
df$newvar1 <- ifelse(df$var1 == 1 | df$var2 == 1 | df$var3 == 1, 1,
                     ifelse(is.na(df$var1) & is.na(df$var2) & is.na(df$var3), NA, 0))

df$newvar2 <- ifelse((is.na(df$var1) | df$var1 == 0) &
                       (is.na(df$var2) | df$var2 == 0) &
                       (is.na(df$var3) | df$var3 == 0), 0,
                     ifelse(df$var1 == 1 | df$var2 == 1 | df$var3 == 1, 1,
                            ifelse(is.na(df$var1) & is.na(df$var2) & is.na(df$var3), NA, NA)))

df$newvar3 <- ifelse(df$var1 == 1 | df$var2 == 1 | df$var3 == 1, 1,
                     ifelse(is.na(df$var1) & is.na(df$var2) & is.na(df$var3), NA,
                            ifelse((is.na(df$var1) | df$var1 == 0) &
                                     (is.na(df$var2) | df$var2 == 0) &
                                     (is.na(df$var3) | df$var3 == 0), 0, 0)))
I don't understand why newvar1 and newvar3 have NA values corresponding to combinations of NAs and 0s when both examples use "&" between the na specifications (row 3 in the results).
I am assuming that NAs don't show up in newvar2 because the first ifelse() condition takes precedence.
Any insight to the ifelse() function or advice on how to get the results I'm looking for would be really helpful.
Here is another possible option using rowSums. The term NA ^ (rowSums(!is.na(df)) == 0) is NA only for rows where every value is NA (NA^1 is NA, NA^0 is 1), so those rows stay NA, while every other row is reduced to a plain 0/1 by the > 0 comparison and the leading +:
df$newvar <- +(rowSums(df, na.rm = TRUE) * NA ^ (rowSums(!is.na(df)) == 0) > 0)
# var1 var2 var3 newvar
#1 NA 1 NA 1
#2 NA NA NA NA
#3 NA 0 NA 0
#4 0 0 0 0
#5 1 1 1 1
This gives your expected results:
df$newvar <- 0
df$newvar[Reduce(`|`, lapply(df[1:3], `%in%`, 1))] <- 1
df$newvar[Reduce(`&`, lapply(df[1:3], is.na))] <- NA
df
# var1 var2 var3 newvar
# 1 NA 1 NA 1
# 2 NA NA NA NA
# 3 NA 0 NA 0
# 4 0 0 0 0
# 5 1 1 1 1
This defaults to 0 and only changes values under known conditions: any row containing a 1 is set to 1 (even if it also contains NAs), and all-NA rows are set to NA, so every remaining mix of 0s and NAs keeps the default 0. That matches your four rules; if rows mixing 1 and NA ever need different handling, it's not difficult to add another test for it.
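As a side note on why newvar1 and newvar3 come out NA for rows that mix 0s and NAs: the first test, df$var1 == 1 | df$var2 == 1 | df$var3 == 1, evaluates to NA there rather than FALSE (FALSE | NA is NA), and ifelse() returns NA wherever its test is NA, so the later branches are never reached. A minimal illustration:
NA == 1            # NA
FALSE | NA         # NA, so the "any equals 1" test is NA for a 0/NA row
ifelse(NA, 1, 0)   # NA, ifelse() propagates NA in the test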
I am trying to create a subset where I remove all rows with B == 0 when A == 1. However, I want to keep the NAs in variable B (only the 0s should be removed).
I tried df2 <- subset(df, B[df$A == 1] > 0) but the result makes no sense. Can someone help?
i <- c(1:10)
A <- c(0,1,1,1,0,0,1,1,0,1)
B <- c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA)
df <- data.frame(i, A, B)
subset takes a condition and returns only the rows where it evaluates to TRUE. If you try NA == 0 or NA != 0, the result is always NA, which is neither TRUE nor FALSE, and subset drops rows where the condition is NA just as it drops the FALSE ones. There are multiple ways around this:
subset(df, !(A == 1 & B == 0) | is.na(B))
or:
subset(df, !(A == 1 & B %in% 0))
There are plenty more options available, however.
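For example, assuming dplyr is an option, the same idea works with dplyr::filter, which likewise keeps only rows where the condition is TRUE:
library(dplyr)
filter(df, !(A == 1 & B %in% 0))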
This should work, if I understand it correctly:
subset(df, (df$A == 1) & ((df$B != 0) | (is.na(df$B))))
outputs:
    i A  B
2   2 1 10
3   3 1 13
4   4 1 NA
10 10 1 NA
If you do not want to specify every single column, you can just change the 0s to NA and the NAs (temporarily) to a placeholder value such as 999 or -999, then switch back after you are finished.
i <- c(1:10)
A <- c(0,1,1,1,0,0,1,1,0,1)
B <- c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA)
df <- data.frame(i, A, B)
df[is.na(df)] <- 999
df[df==0] <- NA
df <- na.omit(df)
df[df==999] <- NA
i A B
2 2 1 10
3 3 1 13
4 4 1 NA
10 10 1 NA
If i is unique, identify which cases you want to remove and keep the rest. Using %in% (rather than !=, which would recycle) makes this robust to any number of matching rows:
df[!(df$i %in% subset(df, A == 1 & B == 0)$i), ]
Output:
i A B
1 1 0 0
2 2 1 10
3 3 1 13
4 4 1 NA
5 5 0 NA
6 6 0 9
9 9 0 3
10 10 1 NA
I have a dataframe, say
data <- data.frame(x1 = c(5, NA, 1, 6),
x2 = c(4, 3, 0, NA),
c = c('a', 'b', 'a', NA)); data
x1 x2 c
1 5 4 a
2 NA 3 b
3 1 0 a
4 6 NA NA
I want to replace the NAs with 0 in the x1 and x2 columns only, so I use the lapply function as below:
data[c("x1","x2")] <- lapply(data[c("x1","x2")], function (x) {x[is.na(x)] <- 0}); data
This does not work as the output is:
x1 x2 c
1 0 0 a
2 0 0 b
3 0 0 a
4 0 0 NA
I then tried to create a separate function
fxNAtoZero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
and if I use this like below:
data[c("x1","x2")] <- lapply(data[c("x1","x2")], fxNAtoZero); data
it works, but the first case does not. I do not understand why the function created on the fly is not working in lapply.
Your problem is that your first attempt just returns the last expression of the function, and the value of an assignment like x[is.na(x)] <- 0 is its right-hand side, 0:
lapply(data[c("x1","x2")], function (x) {x[is.na(x)] <- 0})
$x1
[1] 0
$x2
[1] 0
while your second attempt explicitly returns the entire vector after changing the NAs, because you used return(). If you want to keep an anonymous function inside lapply, prefer:
lapply(data[c("x1","x2")], function (x) {ifelse(is.na(x),0,x) })
because ifelse returns a vector of the same length as the initial one.
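A small sketch (not from the original answers): you can also keep the in-place assignment and simply make the modified vector the last expression, or use replace():
lapply(data[c("x1","x2")], function (x) {x[is.na(x)] <- 0; x})
lapply(data[c("x1","x2")], function (x) replace(x, is.na(x), 0))
Both return full-length vectors, so assigning the result back to data[c("x1","x2")] works as intended.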
You can also try using dplyr and tidyr verbs to transform your data and replace NA's for the desired cases. This is perhaps a bit more readable than lapply, but note that x1 and x2 end up as character, because gather stacks them into a single column together with the character variable c.
data <- data.frame(x1 = c(5, NA, 1, 6),
x2 = c(4, 3, 0, NA),
c = c('a', 'b', 'a', NA),
id = c(1:4)) # create with row id, for spread
library(dplyr)
library(tidyr)

data %>% gather(k,v,-id) %>%
mutate(v=ifelse(is.na(v) & k!='c',0,v)) %>% # replace NA's based on conditions
spread(k,v) %>% select(-id)
c x1 x2
1 a 5 4
2 b 0 3
3 a 1 0
4 <NA> 6 0
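As a side note, assuming tidyr is acceptable here anyway (gather and spread come from it), tidyr::replace_na can target just x1 and x2 without reshaping, which also avoids the character conversion:
data %>% replace_na(list(x1 = 0, x2 = 0))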
I have a data frame that has NA's in every row. Some are on the left, some in the middle, and some on the right. Something like this:
a <- c(NA, NA, 1, NA)
b <- c(NA, 1, 1, NA)
c <- c(NA, NA, 1, 1)
d <- c(1, 1, NA, 1)
df <- data.frame(a, b, c, d)
df
# a b c d
# NA NA NA 1
# NA 1 NA 1
# 1 1 1 NA
# NA NA 1 1
I would like to replace with 0 all the NAs that appear after the first 1 in a row (in the middle or on the right), but keep the leading NAs before that first 1 as NA. So I would like an efficient way (my data frame is large) to get this data frame:
# a b c d
# NA NA NA 1
# NA 1 0 1
# 1 1 1 0
# NA NA 1 1
We can use apply to loop over the rows and find the index of the first occurrence of 1, then replace the NAs from that element to the last with 0.
df[] <- t(apply(df, 1, function(x) {
i1 <- which(x == 1)[1]
i2 <- i1:length(x)
x[i2][is.na(x[i2])] <- 0
x}))
Or another option is
df[] <- t(apply(df, 1, function(x) replace(x,
cumsum(x ==1 & !is.na(x)) >= 1 & is.na(x), 0)))
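If the data frame is very large, here is a fully vectorized base-R sketch (an addition, not part of the original answers; it assumes every row contains at least one 1, as in the example), which avoids the per-row apply:
m <- as.matrix(df)
# column index of the first 1 in each row
first1 <- max.col(!is.na(m) & m == 1, ties.method = "first")
# logical matrix: TRUE at positions at or after that first 1
after <- col(m) >= first1
m[after & is.na(m)] <- 0
df[] <- m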