Delete records containing more than 5 null values? - r

I would like to know how I can remove from a dataset the records that have more than 5 null values in the columns that define them. The following code allows you to delete records with any NA in any column. However, how can I modify it to do exactly what I ask? Any ideas?
df[complete.cases(df), ]

Here is an example data frame. One of the rows has 6 NA values.
We sum the NA values by row in a new column, filter where the number of NA is less than or equal to 5, then remove the new column.
df <- data.frame(a = c(1,NA,1,1),
b = c(1, NA, NA, 1),
c = c(1, NA, NA, NA),
d = c(1, NA, NA ,NA),
e = c(1, NA, NA, NA),
f = c(1, NA, NA, NA))
a b c d e f
1 1 1 1 1 1 1
2 NA NA NA NA NA NA
3 1 NA NA NA NA NA
4 1 1 NA NA NA NA
library(dplyr)
df %>%
  mutate(count = rowSums(is.na(.))) %>%  # NA count per row of the piped data
  filter(count <= 5) %>%
  select(-count)
a b c d e f
1 1 1 1 1 1 1
2 1 NA NA NA NA NA
3 1 1 NA NA NA NA

I'm assuming you are referring to NA values in your data, which indicate missing values. NULL is something different: it is returned by expressions and functions whose value is undefined. First create some reproducible data:
set.seed(42)
vals <- sample.int(1000, 250)
idx <- sample.int(250, 100)
vals[idx] <- NA
example <- as.data.frame(matrix(vals, 25))
Now compute the number of missing values by row and exclude the rows with more than 5 missing values:
na.count <- rowSums(is.na(example))
example[na.count<=5, ]
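The same idea can be wrapped in a small helper so that the threshold becomes a parameter. This is only a minimal sketch: the function name drop_rows_with_many_na and its max_na argument are hypothetical, not part of any package. It is applied here to the example data frame from the first answer.

```r
# Hypothetical helper: keep rows with at most `max_na` missing values.
drop_rows_with_many_na <- function(data, max_na = 5) {
  na_count <- rowSums(is.na(data))          # number of NA values per row
  data[na_count <= max_na, , drop = FALSE]  # drop = FALSE keeps a data frame
}

df <- data.frame(a = c(1, NA, 1, 1),
                 b = c(1, NA, NA, 1),
                 c = c(1, NA, NA, NA),
                 d = c(1, NA, NA, NA),
                 e = c(1, NA, NA, NA),
                 f = c(1, NA, NA, NA))

result <- drop_rows_with_many_na(df, max_na = 5)
nrow(result)  # 3: only the all-NA row has more than 5 missing values
```

Unlike the dplyr version, base subsetting keeps the original row names (1, 3, 4), which can help when tracing which records were dropped.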

Related

Rank order row values in R while keeping NA values

I'm trying to convert values in a data frame to rank order values by row. So take this:
df = data.frame(A = c(10, 20, NA), B = c(NA, 10, 20), C = c(20, NA, 10))
When I do this:
t(apply(df, 1, rank))
I get this:
[1,] 1 3 2
[2,] 2 1 3
[3,] 3 2 1
But I want the NA values to continue showing as NA, like so:
[1,] 1 NA 2
[2,] 2 1 NA
[3,] NA 2 1
Try using the argument na.last and set it to keep:
t(apply(df, 1, rank, na.last='keep'))
Output:
A B C
[1,] 1 NA 2
[2,] 2 1 NA
[3,] NA 2 1
As mentioned in the documentation of rank:
na.last:
for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed; if "keep" they are kept with rank NA.
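The four documented settings of na.last can be compared on a single vector. A minimal sketch, with a made-up vector x:

```r
x <- c(10, NA, 20)

r_last  <- rank(x, na.last = TRUE)    # 1 3 2  (NA ranked last)
r_first <- rank(x, na.last = FALSE)   # 2 1 3  (NA ranked first)
r_drop  <- rank(x, na.last = NA)      # 1 2    (NA removed, result is shorter)
r_keep  <- rank(x, na.last = "keep")  # 1 NA 2 (NA stays NA)
```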
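Here is a dplyr approach. Note that across() ranks within each column rather than within each row; for this particular data the two results happen to coincide.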
Libraries
library(dplyr)
Data
df <- tibble(A = c(10, 20, NA), B = c(NA, 10, 20), C = c(20, NA, 10))
Code
df %>%
  mutate(across(everything(), ~ rank(.x, na.last = "keep")))
Output
# A tibble: 3 x 3
A B C
<dbl> <dbl> <dbl>
1 1 NA 2
2 2 1 NA
3 NA 2 1
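Because across() works column by column, the dplyr code above matches the row-wise result only when row ranks and column ranks agree. The sketch below uses a hypothetical data frame df2, chosen so that the two differ, and base R only:

```r
df2 <- data.frame(A = c(10, 20, NA), B = c(NA, 40, 30), C = c(50, NA, 60))

# Rank within each row (what the question asks for)
by_row <- t(apply(df2, 1, rank, na.last = "keep"))

# Rank within each column (the column-by-column behaviour)
by_col <- sapply(df2, rank, na.last = "keep")

by_row[2, ]  # A = 1, B = 2, C = NA
by_col[2, ]  # A = 2, B = 2, C = NA
```

For row-wise ranks, the t(apply(...)) form from the first answer is the one to use.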

How to verify if when a column is NA the other is not?

I have a dataframe with two columns. I need to check if where a column is NA the other is not. Thanks
Edited.
I would like to know, for each row of the dataframe, if there are rows with both columns not NA.
You can use the following code to check which rows have no NA values:
df <- data.frame(x = c(1, NA),
                 y = c(2, NA))
which(rowSums(is.na(df)) == 0)
Output:
[1] 1
As you can see, the first row has no NA values, so both of its columns are non-missing.
Here is some simple code to add a column with the NA count for each row. No seed is set, so your values will differ; only the first 9 of the 25 rows are shown:
x <- sample(c(1, NA), 25, replace = TRUE)
y <- sample(c(1, NA), 25, replace = TRUE)
df <- data.frame(x, y)
df$NA_Count <- rowSums(is.na(df))
head(df, 9)
x y NA_Count
1 NA 1 1
2 NA NA 2
3 1 NA 1
4 1 NA 1
5 NA NA 2
6 1 NA 1
7 1 1 0
8 1 1 0
9 1 1 0
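If the goal is to test, row by row, that exactly one of the two columns is missing, xor() on the two is.na() vectors expresses that directly. A minimal sketch, assuming a hypothetical two-column data frame with columns named x and y:

```r
df <- data.frame(x = c(1, NA, 3, NA),
                 y = c(NA, 2, 4, NA))

# TRUE where exactly one of x and y is NA
one_missing <- xor(is.na(df$x), is.na(df$y))

# TRUE where both columns are present
both_present <- !is.na(df$x) & !is.na(df$y)

one_missing   # TRUE TRUE FALSE FALSE
both_present  # FALSE FALSE TRUE FALSE
```

all(one_missing) then answers whether the condition holds for every row, and any(both_present) whether some row has both values.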

R dataframe: combine conditions by processing

I have to find all columns that consist entirely of NA values. In every column that is not entirely NA, I have to replace the NAs with 0.
My solution is:
NA_check <- colSums(is.na(frame)) == nrow(frame) #True or False - all NA or not
frame[is.na(frame) & which(names(frame) %in% names(NA_check)[which(NA_check == FALSE, arr.ind=T)])] <- 0
These conditions work separately, but they don't work together or I get some errors combining them. How can I solve my problem?
P.S. This modification also doesn't work if NA_check is not all FALSE:
frame[is.na(frame[which(names(frame) %in% names(NA_check)[which(NA_check == FALSE, arr.ind=T)])])] <- 0
You can find the columns that have at least one non-NA value (that is, columns that are not entirely NA) and replace the NAs in that subset with 0.
not_all_NA <- colSums(!is.na(frame)) > 0
frame[not_all_NA][is.na(frame[not_all_NA])] <- 0
We can check this with an example :
frame <- data.frame(a = c(NA, NA, 3, 4), b = NA, c = c(NA, 1:3), d = NA)
frame
# a b c d
#1 NA NA NA NA
#2 NA NA 1 NA
#3 3 NA 2 NA
#4 4 NA 3 NA
not_all_NA <- colSums(!is.na(frame)) > 0
frame[not_all_NA][is.na(frame[not_all_NA])] <- 0
frame
# a b c d
#1 0 NA 0 NA
#2 0 NA 1 NA
#3 3 NA 2 NA
#4 4 NA 3 NA
We can also do this with dplyr :
library(dplyr)
frame %>% mutate(across(where(~ any(!is.na(.x))), ~ tidyr::replace_na(.x, 0)))

Conditional filter with if statements

My data consists of columns and rows. Each column has "NA" and different numbers.
For example column1 is:
2
1
1
NA
1
NA
NA
NA
I want to assign a column id to the numbers in each column.
for(j in 1:54){
if(!(col[j] <-"NA")){
col[j] <- i
}
}
Expected result for column1:
1
1
NA
NA
NA
1
NA
NA
1
Expected result for column2:
2
2
NA
NA
NA
2
NA
NA
2
You can use
v <- c(2, 1, NA, NA, 4, 5, NA)
id <- ifelse(!is.na(v), 1, NA)
id
[1]  1  1 NA NA  1  1 NA
This means you don't need the for loop here. If you can apply a function to a vector you should avoid using the for loop.
Also, please provide your data so that others can actually use it (like in my code above).
EDIT
According to the comments, you have multiple columns. You can use the same approach, applied to each column in turn:
df <- data.frame(a= c(2, 1, NA, NA, 4, 5, NA), b= c(3, NA, NA, NA, 5, NA, 6))
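id <- sapply(seq_along(df), function(i) {
  ifelse(!is.na(df[[i]]), i, NA)  # column index where a value is present
})
colnames(id) <- names(df)  # sapply over indices drops the column names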
id
a b
[1,] 1 2
[2,] 1 NA
[3,] NA NA
[4,] NA NA
[5,] 1 2
[6,] 1 NA
[7,] NA 2
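An alternative that avoids iterating over the columns is to build a matrix of column indices with col() and blank out the positions where the data is NA. A sketch using the same example data; the intermediate name m is just for illustration:

```r
df <- data.frame(a = c(2, 1, NA, NA, 4, 5, NA),
                 b = c(3, NA, NA, NA, 5, NA, 6))

m <- as.matrix(df)
id <- col(m)               # every cell holds its own column index
id[is.na(m)] <- NA         # keep NA wherever the data is missing
colnames(id) <- names(df)  # col() does not carry over the names
id
```

This yields the same matrix as the sapply version, except that the entries are integers.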

Replace NA's and delete columns in an efficient way

I've got a dataframe which looks like follows:
# Code:
m3 <- c(NA, -3, NA, NA, -3)
m2 <- c(rep(NA, 5))
m1 <- c(rep(NA, 5))
Zero <- c(rep(NA, 5))
p1 <- c(1, NA, NA, 1, NA)
p2 <- c(NA, NA, NA, 2, NA)
p3 <- c(3, NA, 3, 3, NA)
df <- data.frame(m3, m2, m1, Zero, p1, p2, p3)
# Output:
m3 m2 m1 Zero p1 p2 p3
1 NA NA NA NA 1 NA 3
2 -3 NA NA NA NA NA NA
3 NA NA NA NA NA NA 3
4 NA NA NA NA 1 2 3
5 -3 NA NA NA NA NA NA
I need to insert a -3 in the whole row, if there is a -3 in the first column. I also need to delete all columns, but p1, p2, and p3. The final result should look like follows:
# Final output:
p1 p2 p3
1 1 NA 3
2 -3 -3 -3
3 NA NA 3
4 1 2 3
5 -3 -3 -3
I found a solution, but it seems very inefficient to me. I need to perform this operation multiple times and therefore need a code, which is as efficient as possible. My inefficient solution looks like follows:
# Inefficient code:
for(i in 1:length(df$m3)){
if(is.na(df$m3[i]) == FALSE){
df[i, ] <- -3
}
}
df <- df[ , 5:length(df)]
Is there a more efficient way? Thank you very much in advance!
update values:
df[df$m3 %in% -3,] <- -3
select columns:
df <- df[, c("p1", "p2", "p3")]
You can use data.table:
library(data.table)
dt <- as.data.table(df)
dt[m3 == -3, paste0("p", 1:3) := -3]  # NA in m3 is treated as FALSE here
dt <- dt[, c("p1", "p2", "p3"), with = FALSE]
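To check that a vectorised base R approach reproduces the requested output, the question's data and expected result can be compared directly. This is a sketch: the expected data frame is typed in from the question, and the columns are selected before the update (a small variation on the answer above) so that only p1, p2 and p3 are touched.

```r
m3 <- c(NA, -3, NA, NA, -3)
p1 <- c(1, NA, NA, 1, NA)
p2 <- c(NA, NA, NA, 2, NA)
p3 <- c(3, NA, 3, 3, NA)
df <- data.frame(m3, m2 = NA, m1 = NA, Zero = NA, p1, p2, p3)

rows <- df$m3 %in% -3             # %in% never returns NA, unlike ==
out <- df[, c("p1", "p2", "p3")]  # keep only the p columns
out[rows, ] <- -3                 # overwrite the flagged rows

expected <- data.frame(p1 = c(1, -3, NA, 1, -3),
                       p2 = c(NA, -3, NA, 2, -3),
                       p3 = c(3, -3, 3, 3, -3))
all.equal(out, expected)
```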
