Find columns with all missing values - r

I am writing a function that needs to check whether (and which!) columns (variables) consist entirely of missing values (NA, <NA>). The following is a fragment of the function:
test1 <- data.frame (matrix(c(1,2,3,NA,2,3,NA,NA,2), 3,3))
test2 <- data.frame (matrix(c(1,2,3,NA,NA,NA,NA,NA,2), 3,3))
na.test <- function (data) {
if (colSums(!is.na(data) == 0)){
stop ("The some variable in the dataset has all missing value,
remove the column to proceed")
}
}
na.test (test1)
Warning message:
In if (colSums(!is.na(data) == 0)) { :
the condition has length > 1 and only the first element will be used
Q1: Why does this warning appear, and how can I fix it?
Q2: Is there a way to find which columns are all NA, for example to output a list of them (variable names or column numbers)?

This is easy enough to do with sapply and a small anonymous function:
sapply(test1, function(x)all(is.na(x)))
X1 X2 X3
FALSE FALSE FALSE
sapply(test2, function(x)all(is.na(x)))
X1 X2 X3
FALSE TRUE FALSE
And inside a function:
na.test <- function (x) {
w <- sapply(x, function(x)all(is.na(x)))
if (any(w)) {
stop(paste("All NA in columns", paste(which(w), collapse=", ")))
}
}
na.test(test1)
na.test(test2)
Error in na.test(test2) : All NA in columns 2
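On Q1: the warning appears because if() expects a single TRUE or FALSE, while colSums() returns one value per column. Note also that ! binds more loosely than == in R, so !is.na(data) == 0 is parsed as !(is.na(data) == 0), which counts missing rather than non-missing values. A minimal sketch of one possible fix, reducing the per-column test with any() (the names n_present and all_na are illustrative, not from the original post):

```r
test2 <- data.frame(matrix(c(1, 2, 3, NA, NA, NA, NA, NA, 2), 3, 3))

# Count the non-missing values in each column
n_present <- colSums(!is.na(test2))

# One logical per column: TRUE where nothing is present
all_na <- n_present == 0

# any() collapses the vector to a single value that if() can use
if (any(all_na)) {
  message("All NA in columns: ", paste(names(which(all_na)), collapse = ", "))
}
```

In R 4.2.0 and later, a condition of length greater than one inside if() is an error rather than a warning, so the reduction to a single value is required there.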

In dplyr
ColNums_NotAllMissing <- function(df){ # helper function
as.vector(which(colSums(is.na(df)) != nrow(df)))
}
df %>%
select(ColNums_NotAllMissing(.))
example:
x <- data.frame(x = c(NA, NA, NA), y = c(1, 2, NA), z = c(5, 6, 7))
x %>%
select(ColNums_NotAllMissing(.))
or, the other way around
Cols_AllMissing <- function(df){ # helper function
as.vector(which(colSums(is.na(df)) == nrow(df)))
}
x %>%
select(-Cols_AllMissing(.))
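Assuming dplyr 1.0.0 or later, the same selection can probably be written without a helper function by using where(). This is only a sketch of that alternative; the variable name kept is illustrative:

```r
library(dplyr)

x <- data.frame(x = c(NA, NA, NA), y = c(1, 2, NA), z = c(5, 6, 7))

# Keep only the columns that are not entirely NA
kept <- x %>%
  select(where(~ !all(is.na(.x))))

names(kept)
```

Dropping the negation inside where() would instead keep only the all-NA columns.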

To find the columns with all values missing
allmisscols <- apply(dataset, 2, function(x) all(is.na(x)))
colswithallmiss <- names(allmisscols[allmisscols])
print("the columns with all values missing")
print(colswithallmiss)
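Since apply() converts a data frame to a matrix before looping, a column-wise vapply() may be a safer choice for data frames with mixed column types. A small sketch under that assumption, using the test2 data from the question (the names is_all_na and all_na_cols are illustrative):

```r
test2 <- data.frame(matrix(c(1, 2, 3, NA, NA, NA, NA, NA, 2), 3, 3))

# vapply() works column by column and guarantees one logical per column
is_all_na <- vapply(test2, function(col) all(is.na(col)), logical(1))

# Names of the columns that are entirely missing
all_na_cols <- names(test2)[is_all_na]
print(all_na_cols)
```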

This one will generate the column names that are full of NAs:
library(purrr)
df %>% keep(~all(is.na(.x))) %>% names

To test whether columns have all missing values:
apply(test1,2,function(x) {all(is.na(x))})
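To drop every column that contains at least one NA (stricter than dropping only the all-NA columns):
test1.nona <- test1[ , colSums(is.na(test1)) == 0]
To get which columns have all missing values, compare the NA count with the number of rows:
which(colSums(is.na(test1)) == nrow(test1))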

dplyr approach to finding the number of NAs for each column:
df %>%
summarise_all(~ sum(is.na(.)))

The following command returns a named logical vector showing which columns contain at least one NA value:
sapply(dataframe, function(x) any(is.na(x)))
It complements the first answer, which only detects columns that are entirely NA.

sapply(b, function(X) sum(is.na(X)))
This gives the count of NA values in each column of the dataset, and 0 for any column that has none.

Variant dplyr approach:
dataframe %>% select_if(function(x) all(is.na(x))) %>% colnames()

Related

Conditional recoding big set of variables into NA

I'm looking for a solution to recode huge sets of variables into NA.
This should work like:
if (df$check1==1) then df$Q1,Q2,Q3....Q100 <-NA
if (df$check2==1) then df$R1,R2,R3....R500 <-NA
I'd like to keep list of variable names to change in separate lists (CSV).
I thought about ifelse or recode but not sure how to apply it to set variables on the output side. In mutate_if we have conditions on target variables..., so I got lost.
We could use mutate with across
library(dplyr)
df <- df %>%
mutate(across(Q1:Q100, ~ replace(.x, check1 == 1, NA)),
across(R1:R500, ~ replace(.x, check2 == 1, NA)))
In base R, we may use
qcols <- paste0("Q", 1:100)
rcols <- paste0("R", 1:500)
df[df$check1 == 1, qcols] <- NA
df[df$check2 == 1, rcols] <- NA
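The question also asks about keeping the variable names in separate lists. One way this might be sketched in base R is a named list that maps each check column to the columns it blanks out. The toy data and the name recode_map are hypothetical; in practice the vectors could be read from CSV files:

```r
# Toy data standing in for the real df (hypothetical values)
df <- data.frame(check1 = c(1, 0, 1), check2 = c(0, 1, 0),
                 Q1 = 1:3, Q2 = 4:6, R1 = 7:9)

# Hypothetical mapping: check column -> columns to set to NA
recode_map <- list(check1 = c("Q1", "Q2"), check2 = "R1")

for (flag in names(recode_map)) {
  # Treat a missing flag as "do not recode"
  hit <- !is.na(df[[flag]]) & df[[flag]] == 1
  df[hit, recode_map[[flag]]] <- NA
}
df
```

Guarding against NA in the check column avoids the error that NA values in a logical row index cause during assignment.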

Paste leading zero in columns A and B if column A meets condition

Data:
A B
"2058600192", "2058644"
"4087600101", "4087601"
"30138182591","30138011"
I am trying to add one leading 0 to columns A and B if column A is 10 characters.
This is what I have written so far:
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A)
data$B[i] <- paste0(0, data$B)
}
}
But I'm getting the following warning:
number of items to replace is not a multiple of replacement length
I've also tried using a dplyr solution, but I'm not sure how to mutate two columns based on one column. Any insight would be appreciated.
Your solution was already pretty good. You just made some very small mistakes. This code would give the correct output:
data <- data.frame(A = c("2058600192","4087600101","30138182591"), B = c("2058644","4087601","30138011"))
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A[i])
data$B[i] <- paste0(0, data$B[i])
}
}
The only difference is data$A[i] <- paste0(0, data$A[i]) instead of data$A[i] <- paste0(0, data$A). Without the [i] you would try to add the whole column.
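To see where the warning comes from, it may help to compare the lengths involved: paste0(0, data$A) produces one string per row, while data$A[i] is a single slot. The sketch below also shows a loop-free variant with a logical index (the name short is illustrative):

```r
data <- data.frame(A = c("2058600192", "4087600101", "30138182591"),
                   B = c("2058644", "4087601", "30138011"),
                   stringsAsFactors = FALSE)

length(paste0(0, data$A))     # one result per row of the data frame
length(paste0(0, data$A[1]))  # a single result, which fits a single slot

# Loop-free variant: decide once which rows to pad, then update both columns
short <- nchar(data$A) == 10
data$A[short] <- paste0("0", data$A[short])
data$B[short] <- paste0("0", data$B[short])
data
```

Computing short before either column is changed keeps the condition tied to the original length of A.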
You can get the index where the number of characters is equal to 10 and replace those values using lapply for multiple columns.
inds <- nchar(df$A) == 10
df[] <- lapply(df, function(x) replace(x, inds, paste0('0', x[inds])))
#If you want to replace only specific columns
#df[c('A', 'B')] <- lapply(df[c('A', 'B')], function(x)
# replace(x, inds, paste0('0', x[inds])))
df
# A B
#1 02058600192 02058644
#2 04087600101 04087601
#3 30138182591 30138011
data
df <- structure(list(A = c(2058600192, 4087600101, 30138182591), B = c(2058644L,
4087601L, 30138011L)), class = "data.frame", row.names = c(NA, -3L))
Just in case you were interested in using dplyr here's another solution using transmute.
df %>%
# Need to transmute B first, so that nchar is evaluated on the original A column and not on the one with leading zeros
transmute(B = ifelse(nchar(A) == 10, paste0(0, B), B),
A = ifelse(nchar(A) == 10, paste0(0, A), A)) %>%
# Just change the order of the columns to the original one
select(A,B)
Another way you can try
library(dplyr)
library(stringr)
df %>%
mutate(A = ifelse(str_length(A) == 10, str_pad(A, width = 11, side = "left", pad = "0"), A),
B = ifelse(grepl("^0", A), paste0("0", B), B))
# A B
# 1 02058600192 02058644
# 2 04087600101 04087601
# 3 30138182591 30138011
str_length detects the length of each string.
str_pad adds the leading zeros.
grepl detects the strings in column A that now start with a zero, so that a leading zero can be added to column B as well.
You can use the vectorized ifelse function here. Compute the condition once, before A is modified, so that both columns are padded based on the original length of A:
pad <- nchar(data$A) == 10
data$A <- ifelse(pad, paste0("0", data$A), data$A)
data$B <- ifelse(pad, paste0("0", data$B), data$B)
data
A B
1 02058600192 02058644
2 04087600101 04087601
3 30138182591 30138011

Argument is not numeric or logical: returning NA with one string column

Hello I would like to calculate mean for every numeric column in my data. For now I have:
for(i in names(MyData)){
avg <- mean(MyData[[i]], na.rm = TRUE)
print(avg)
}
but I get the warning in the title because the last column of MyData is the decision column and contains strings. Is there a way to ignore the string column? I know I could convert it to numbers, but I don't want to do that.
We can do this more easily if we use summarise_if from dplyr
library(dplyr)
MyData %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
In the OP's code, it is looping through each of the columns and just printing the result and not storing it. There is also a possibility that some columns are not numeric. In the below code, we pre-assign a vector ('v1') with 0 values to store the output. Create a logical condition with if/else and return the mean if it is numeric or else return NA
v1 <- numeric(length(MyData))
for(i in seq_along(MyData)) {
if(is.numeric(MyData[[i]])) {
v1[i] <- mean(MyData[[i]], na.rm = TRUE)
} else {
v1[i] <- NA_real_
}
}
In base R, this can also be done with sapply
i1 <- sapply(MyData, is.numeric)
sapply(MyData[i1], mean, na.rm = TRUE)
Or with colMeans
colMeans(MyData[i1], na.rm = TRUE)
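For a quick check of the base R approach, here is a small self-contained sketch. The toy MyData and the names is_num and col_means are made up for illustration and are not the asker's data:

```r
# Hypothetical data: two numeric columns and one string column
MyData <- data.frame(a = c(1, 2, NA), b = c(10, 20, 30),
                     label = c("x", "y", "z"),
                     stringsAsFactors = FALSE)

# Flag the numeric columns, then average only those
is_num <- vapply(MyData, is.numeric, logical(1))
col_means <- colMeans(MyData[is_num], na.rm = TRUE)
col_means
```

The string column is skipped entirely, so mean() is never called on it and the warning does not appear.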

Conditions & Subtraction from Matrix in R

I've looked at R create a vector from conditional operation on matrix, and using a similar solution does not yield what I want (and I'm not sure why).
My goal is to evaluate df with the following condition: if df > 2, df -2, else 0
Take df:
a <- seq(1,5)
b <- seq(0,4)
df <- cbind(a,b) %>% as.data.frame()
df is simply:
a b
1 0
2 1
3 2
4 3
5 4
df_final should look like this after a suitable function:
a b
0 0
0 0
1 0
2 1
3 2
I applied the following function with the result, and I'm not sure why it doesn't work (further explanation of a solution would be appreciated)
apply(df,2,function(df){
ifelse(any(df>2),df-2,0)
})
Yielding the following:
a b
-1 -2
Thank you SO community!
Let's fix your function and understand why it didn't work:
apply(df, # apply to df
2, # to each *column* of df
function(df){ # this function. Call the function argument (each column) df
# (confusing because this is the same name as the data frame...)
ifelse( # Looking at each column...
any(df > 2), # if there are any values > 2
df - 2, # then df - 2
0 # otherwise 0
)
})
any() returns a single value. ifelse() returns something the same shape as the test, so by making your test any(df > 2) (a single value), ifelse() will also return a single value.
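A short illustration of that shape rule on a single vector (the vector x and the result names are illustrative):

```r
x <- c(1, 2, 3, 4, 5)

# The test has length 1, so only the first element of x - 2 is returned
single_result <- ifelse(any(x > 2), x - 2, 0)

# The test has length 5, so each element is handled separately
elementwise <- ifelse(x > 2, x - 2, 0)

single_result
elementwise
```

The first call yields -1, which matches the value reported for column a in the question.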
Let's fix this by (a) changing the function to be of a different name than the input (for readability) and (b) getting rid of the any:
apply(df, # apply to df
2, # to each *column* of df
function(x){ # this function. Call the function argument (each column) x
ifelse( # Looking at each column...
x > 2, # when x is > 2
x - 2, # make it x - 2
0 # otherwise 0
)
})
apply is made for working on matrices. When you give it a data frame, the first thing it does is convert it to a matrix. If you want the result to be a data frame, you need to convert it back to a data frame.
Or we can use lapply instead. lapply returns a list, and by assigning it to the columns of df with df[] <- lapply(), we won't need to convert. (And since lapply doesn't do the matrix conversion, it knows by default to apply the function to each column.)
df[] <- lapply(df, function(x) ifelse(x > 2, x - 2, 0))
As a side note, df <- cbind(a,b) %>% as.data.frame() is a more complicated way of writing df <- data.frame(a, b)
Create the 'out' dataset by subtracting 2, then replace the values that are based on a logical condition to 0
out <- df - 2
out[out < 0] <- 0
Or in a single step
(df-2) * ((df - 2) > 0)
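Because the rule amounts to "subtract 2, but never go below 0", pmax() is another option. A sketch under that reading of the question, applied column by column so the result stays a data frame:

```r
a <- seq(1, 5)
b <- seq(0, 4)
df <- data.frame(a, b)

# pmax() takes the element-wise maximum of (x - 2) and 0
df[] <- lapply(df, function(x) pmax(x - 2, 0))
df
```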
Using apply
a <- seq(1,5)
b <- seq(0,4)
df <- cbind(a,b) %>% as.data.frame()
new_matrix <- apply(df, MARGIN=2,function(i)ifelse(i >2, i-2,0))
new_matrix
###if you want it to return a tibble/df
new_tibble <- apply(df, MARGIN=2,function(i)ifelse(i >2, i-2,0)) %>% as_tibble()

how can I apply a function to all dataframe variables?

I have a dataframe with about 90 variables and over 1 million observations. I want to calculate the percentage of NA rows for each variable. I have the following code:
sum(is.na(dataframe$variable) / nrow(dataframe) * 100)
My question is, how can I apply this function to all 90 variables, without having to type all variable names in the code?
Use lapply() with your method:
lapply(df, function(x) sum(is.na(x))/nrow(df)*100)
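If you want to return a data.frame rather than a list (via lapply()) or a vector (via sapply()), you can use summarise with across from the dplyr package (summarise_each and funs are deprecated in current dplyr releases):
library(dplyr)
df %>%
summarise(across(everything(), ~ sum(is.na(.x)) / length(.x)))
or, even more concisely (both versions return proportions; multiply by 100 for percentages):
df %>% summarise(across(everything(), ~ mean(is.na(.x))))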
data
df <- data.frame(
x = 1:10,
y = 1:10,
z = 1:10
)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA
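With the example data above, a base R one-liner is also possible: is.na() on a data frame returns a logical matrix, so colMeans() gives the share of missing values per column. A minimal sketch (the name pct_na is illustrative):

```r
df <- data.frame(x = 1:10, y = 1:10, z = 1:10)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA

# Share of NA per column, scaled to a percentage
pct_na <- colMeans(is.na(df)) * 100
pct_na
```

Here x has 3 of 10 values missing and y has 2 of 10, so the result should be about 30, 20 and 0 percent.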
