Using apply() but getting class list answer

I have a series of columns in a data.frame of which I'd like to get the last value, excluding any NAs. The function I'm using to get this done is
last_value <- function(x) tail(x[!is.na(x)], 1)
I'm using apply() to work this function across the 13 columns, for each observation (by row).
df$LastVal<-apply(df[,c(116, 561, 1006, 1451, 1896, 2341, 2786, 3231,
3676, 4121, 4566, 5011, 5456)], 1, FUN=last_value)
My problem is that the output comes out as a list of 5336 (total observations), instead of just a vector of the last values by row. The answers seem to be there but again, in list form. I've used this function before and it's worked fine. When I str() my columns, they're all integers.
Could this function get tripped up if there are no values and only NAs?
I should add that when I unlist() the new variable, I get an error that says "replacement has 4649 rows, data has 5336", so I do think this might have something to do with NAs.

First, check what last_value, as you have defined it, returns for a row that contains only NA values.
last_value <- function(x) tail(x[!is.na(x)], 1)
df <- matrix(1:24, 4)
df[2, ] <- NA
df <- as.data.frame(df)
apply(df, 1, last_value)
#[[1]]
#V6
#21
#
#[[2]]
#named integer(0)
#
#[[3]]
#V6
#23
#
#[[4]]
#V6
#24
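The problem is that the second element of this list has length zero. unlist() silently drops zero-length elements, so its result is shorter than the number of rows and cannot be assigned back to the data frame.
You have to test for a result of length zero and return NA in that case.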
last_value <- function(x) {
y <- tail(x[!is.na(x)], 1)
if(length(y) == 0) NA else y
}
apply(df, 1, last_value)
#[1] 21 NA 23 24
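To see why unlist() alone leads to the row-count mismatch reported in the question, here is a minimal sketch that inspects the element lengths of the list returned by apply(). It reuses the toy data from above as a plain matrix; the name last_value_safe and the vapply() call are assumptions made for illustration, not part of the original answer.

```r
last_value <- function(x) tail(x[!is.na(x)], 1)

m <- matrix(1:24, 4)
m[2, ] <- NA

# apply() falls back to a list because the results differ in length
res <- apply(m, 1, last_value)
lengths(res)          # 1 0 1 1
length(unlist(res))   # 3, one fewer than nrow(m)

# Hypothetical safe variant: always returns exactly one integer
last_value_safe <- function(x) {
  y <- x[!is.na(x)]
  if (length(y)) y[[length(y)]] else NA_integer_
}

# vapply() enforces the "one integer per row" contract explicitly
safe <- vapply(seq_len(nrow(m)), function(i) last_value_safe(m[i, ]), integer(1))
safe                  # 21 NA 23 24
```

Because vapply() raises an error when a result is not a single integer, a stray zero-length value would be caught immediately rather than surfacing later as a length mismatch.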

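You could use your function inside a column selection. Note that last_value(c(3, 4)) evaluates to 4, so this picks the last of the listed columns; it does not search each row for its last non-NA value.
Example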
df <- as.data.frame(matrix(1:12, 3, 4))
> df
V1 V2 V3 V4
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
last_value <- function(x) tail(x[!is.na(x)], 1)
> df[, last_value(c(3, 4))] # selects last column
[1] 10 11 12
Test with NA.
df[2, 4] <- NA
> df[, last_value(c(3, 4))]
[1] 10 NA 12
If you need an apply() approach, use @Rui Barradas' well-explained answer. If speed matters, consider this benchmark of both solutions:
Unit: microseconds
expr min lq mean median uq max neval cld
apply(df, 1, last_value) 166.095 172.6005 182.09241 177.449 188.2925 257.179 100 b
df[, last_value(c(3, 4))] 32.147 33.4230 36.12764 34.699 35.5920 131.396 100 a
For column-wise use, try sapply().
> sapply(df[, c(3, 4)], FUN=last_value)
V3 V4
9 12
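If speed matters and the goal is still the last non-NA value in each row, one vectorised base R sketch uses max.col() to locate the last non-missing entry per row. The helper name last_non_na is an assumption made for illustration; it does not appear in either answer.

```r
# Hypothetical helper: last non-NA value of each row, without apply()
last_non_na <- function(d) {
  m <- as.matrix(d)
  # Column index of the last non-NA entry per row. In a row that is all NA
  # every entry ties, so the last column is picked and the value stays NA.
  idx <- max.col(!is.na(m), ties.method = "last")
  unname(m[cbind(seq_len(nrow(m)), idx)])
}

df <- as.data.frame(matrix(1:12, 3, 4))
df[2, 4] <- NA   # row 2 now ends in NA, so its last value is in V3
df[3, ] <- NA    # row 3 has no values at all
last_non_na(df)  # 10 8 NA
```

Note that as.matrix() coerces mixed column types to a common type, so this sketch is best suited to data frames whose columns share one type, as in the question.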

Related

Remove columns that only contains NA NULL rows in R [duplicate]

I have a data frame where some of the columns contain NA values.
How can I remove columns where all rows contain NA values?

Try this:
df <- df[,colSums(is.na(df))<nrow(df)]
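One caveat with this kind of subsetting: if only a single column survives, `[` simplifies the result to a plain vector. A minimal sketch, using a made-up three-column data frame, shows how drop = FALSE keeps the data frame structure.

```r
# Toy data (an assumption for illustration): only x has any values
df <- data.frame(x = 1:3, y = NA, z = NA)

keep <- colSums(is.na(df)) < nrow(df)

as_vector <- df[, keep]                # simplified to a plain integer vector
as_frame  <- df[, keep, drop = FALSE]  # still a one-column data frame

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE
```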

The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df), which will be an object the same size as df.
Here are two approaches that are more memory and time efficient
An approach using Filter
Filter(function(x)!all(is.na(x)), df)
and an approach using data.table (for general time and memory efficiency)
library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=FALSE]
examples using large data (30 columns, 1e6 rows)
big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,TRUE), sample(250,1e6,TRUE)),simplify=FALSE)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)
system.time({df1 <- bd[,colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user system elapsed
## 0.26 0.03 0.29
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=FALSE]})
## user system elapsed
## 0.14 0.03 0.18
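The Filter() approach tests one column at a time instead of building is.na(df) for the whole data frame. As a small sketch of the same idea, the logical index can be built with vapply() and reused; the toy data frame and the variable names keep and res are assumptions for illustration.

```r
# Toy data: column b is entirely NA, a and c are only partly NA
df <- data.frame(a = c(1, NA, 3), b = NA, c = c(NA, NA, "u"))

# One logical per column; each is.na() call only touches a single column
keep <- vapply(df, function(x) !all(is.na(x)), logical(1))
keep

# List-style indexing keeps the result a data frame
res <- df[keep]
names(res)   # "a" "c"
```

Keeping the index in a variable makes it easy to report which columns were dropped, for example with names(df)[!keep].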

Update
You can now use select() with the where() selection helper. select_if() is superseded but still functional as of dplyr 1.0.2 (thanks to @mcstrother for bringing this to attention).
library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))
> temp
x y z
1 1 1 NA
2 2 2 NA
3 3 NA NA
4 4 4 NA
5 5 5 NA
> temp %>% select(where(not_all_na))
x y
1 1 1
2 2 2
3 3 NA
4 4 4
5 5 5
> temp %>% select(where(not_any_na))
x
1 1
2 2
3 3
4 4
5 5
Old Answer
dplyr now has a select_if verb that may be helpful here:
> temp
x y z
1 1 1 NA
2 2 2 NA
3 3 NA NA
4 4 4 NA
5 5 5 NA
> temp %>% select_if(not_all_na)
x y
1 1 1
2 2 2
3 3 NA
4 4 4
5 5 5
> temp %>% select_if(not_any_na)
x
1 1
2 2
3 3
4 4
5 5

Late to the game but you can also use the janitor package. This function will remove columns which are all NA, and can be changed to remove rows that are all NA as well.
df <- janitor::remove_empty(df, which = "cols")

Another way would be to use the apply() function.
If you have the data.frame
df <- data.frame (var1 = c(1:7,NA),
var2 = c(1,2,1,3,4,NA,NA,9),
var3 = c(NA)
)
then you can use apply() to see which columns fulfill your condition and so you can simply do the same subsetting as in the answer by Musa, only with an apply approach.
> !apply (is.na(df), 2, all)
var1 var2 var3
TRUE TRUE FALSE
> df[, !apply(is.na(df), 2, all)]
var1 var2
1 1 1
2 2 2
3 3 1
4 4 3
5 5 4
6 6 NA
7 7 NA
8 NA 9

Another option, using the purrr package:
library(dplyr)
df <- data.frame(a = NA,
b = seq(1:5),
c = c(rep(1, 4), NA))
df %>% purrr::discard(~all(is.na(.)))
df %>% purrr::keep(~!all(is.na(.)))

df[sapply(df, function(x) all(is.na(x)))] <- NULL

An old question, but I think we can update @mnel's nice answer with a simpler data.table solution:
DT[, .SD, .SDcols = \(x) !all(is.na(x))]
(I'm using the new \(x) lambda function syntax available in R>=4.1, but really the key thing is to pass the logical subsetting through .SDcols.)
Speed is equivalent.
microbenchmark::microbenchmark(
which_unlist = DT[, which(unlist(lapply(DT, \(x) !all(is.na(x))))), with=FALSE],
sdcols = DT[, .SD, .SDcols = \(x) !all(is.na(x))],
times = 2
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> which_unlist 51.32227 51.32227 56.78501 56.78501 62.24776 62.24776 2 a
#> sdcols 43.14361 43.14361 49.33491 49.33491 55.52621 55.52621 2 a

You can use remove_empty() from the janitor package:
library(janitor)
df %>%
remove_empty(c("rows", "cols")) #select either row or cols or both
Another dplyr approach, which keeps only the columns that contain no NA at all:
library(dplyr)
df %>% select_if(~all(!is.na(.)))
or
df %>% select_if(colSums(!is.na(.)) == nrow(df))
This is also useful if you only want to keep columns with more than a certain number of non-missing values, e.g.
df %>% select_if(colSums(!is.na(.))>500)

I hope this may also help. It could be written as a single command, but I find it easier to read when split into two. I wrapped the following instructions in a function, and it ran lightning fast.
naColsRemoval = function (DataTable) {
na.cols = DataTable [ , .( which ( apply ( is.na ( .SD ) , 2 , all ) ) )]
DataTable [ , unlist (na.cols) := NULL , with = F]
}
.SD will allow to limit the verification to part of the table, if you wish, but it will take the whole table as

A handy base R option could be colMeans():
df[, colMeans(is.na(df)) != 1]

janitor::remove_constant() also does this nicely, but note that it drops every column holding a single constant value, not only the all-NA ones.

From my experience of having trouble applying previous answers, I have found that I needed to modify their approach to achieve what the question asks:
How to get rid of columns where for ALL rows the value is NA?
First, note that my solution only works if you do not have duplicate columns (that issue is dealt with elsewhere on Stack Overflow).
Second, it uses dplyr.
Instead of
df <- df %>% select_if(~all(!is.na(.)))
I find that what works is
df <- df %>% select_if(~!all(is.na(.)))
The point is that the "not" operator "!" needs to be outside the universal quantifier all(). select_if operates on columns; here it keeps only those that do not satisfy the criterion "every element is NA".
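The difference between the two predicates is easy to check in base R. The following sketch applies both to each column of a small data frame; the variable names are assumptions chosen for illustration.

```r
temp <- data.frame(x = 1:5, y = c(1, 2, NA, 4, 5), z = rep(NA, 5))

# all(!is.na(col)): TRUE only for columns without any NA
no_na_at_all <- vapply(temp, function(col) all(!is.na(col)), logical(1))

# !all(is.na(col)): TRUE for columns with at least one non-NA value
not_all_na <- vapply(temp, function(col) !all(is.na(col)), logical(1))

names(temp)[no_na_at_all]  # "x"
names(temp)[not_all_na]    # "x" "y"
```

Column y contains one NA, so the first predicate rejects it while the second keeps it; only the second matches "remove columns where ALL values are NA".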

How to add columns according to element-wise multiplications

I have a table that contains two columns of numbers. I am trying to generate a new table in which each column is the first column multiplied by one element of the second column.
For example, I have this:
df = data.frame(A=c(2,5,3), B=c(3,2,4))
print(df)
A B
1 2 3
2 5 2
3 3 4
And I need:
   3  2  4
2  6  4  8
5 15 10 20
3  9  6 12

How about this? You might need to change how I subset A and B, depending on how your data.frame is set up.
df = data.frame(A=c(2, 5, 3), B=c(3, 2, 4))
df
element_wise_prod <- function(p_df) {
# use a more dynamic way to identify the two vectors of your dataframe
A <- p_df[, 1]
B <- p_df[, 2]
result <- t(sapply(A, function(x) x * B))
return(data.frame(result))
}
element_wise_prod(df)

This is a base function called outer(); you can choose whether to add, multiply, subtract, etc. For the multiplication asked for here:
outer(df$A, df$B, "*")
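For the data in the question, a minimal sketch of the outer() call could look like this. Labelling the rows by A and the columns by B is an assumption about the desired layout, based on the table shown in the question.

```r
df <- data.frame(A = c(2, 5, 3), B = c(3, 2, 4))

# outer() multiplies every element of A with every element of B;
# "*" is the default function, so it does not need to be spelled out
res <- outer(df$A, df$B)

# Label rows by the values of A and columns by the values of B
dimnames(res) <- list(as.character(df$A), as.character(df$B))
res
```

The result is a matrix; wrap it in as.data.frame() if a data frame is needed downstream.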

How about something like this:
df = data.frame(A=c(2,5,3), B=c(3,2,4))
add_column <- function(df, source_column, value_key){
modifiers <- df[value_key]
# Make names
value_key <- paste0("value", as.numeric(unlist(modifiers)))
# Make room
df[value_key] <- NA
column_i <- 1
for(column in value_key){
result <- df[source_column] * modifiers[column_i, 1]
# Modify here if you want multiplication or sum
df[column] <- result
column_i <- column_i + 1
}
return(df)
}
Which gives
> add_column(df, "A", "B")
A B value3 value2 value4
1 2 3 6 4 8
2 5 2 15 10 20
3 3 4 9 6 12
Benchmark
Of note, although my answer preserves column names, it is way slower than the other answer posted. See below.
library(microbenchmark)
mbm <- microbenchmark("add_column" = {add_column(df, "A", "B")},
"element_wise" = {element_wise_prod(df)})
mbm
Unit: microseconds
expr min lq mean median uq max neval
add_column 1055.127 1071.859 1125.2072 1088.6105 1188.004 1311.104 100
element_wise 131.732 144.879 207.6434 159.3645 174.581 4813.909 100

mutating data frame with search by row in R

I created this data frame as an illustration of a larger problem.
> df <- data.frame(x=c(NA, 12, NA, 67), y=c(32, NA, NA, NA), z=c(NA, NA, NA, NA))
> df
x y z
1 NA 32 NA
2 12 NA NA
3 NA NA NA
4 67 NA NA
I want it to look like this.
x
1 32
2 12
3 NA
4 67
This essentially searches each row for a number: if one is found, return it for that row; if none is found, return NA.
I created an empty vector.
> list <- c()
Then I ran a for loop that goes through each row, takes the elements that are not NA, and appends them to the 'list' vector.
> for (i in 1:4) {list <- c(list, df[i,!is.na(df[i,])])}
> list
[[1]]
[1] 32
[[2]]
[1] 12
[[3]]
[1] 67
> unlist(list)
32 12 67
This gets close, but the NA rows are ignored.
I also tried a grep pattern match. But as you can imagine, the grep family of calls are designed to run through vectors and not data frame rows.
Not sure how to move forward. Again, if it could look like:
x
1 32
2 12
3 NA
4 67

Use apply to check for values in each row:
apply(df, 1, function(x) { z <- x[!is.na(x)]; if(length(z)) z else NA})
# [1] 32 12 NA 67
Another strategy is to use rowSums, but this solution only works if there are no 0 values in your data.frame (if there are, this method will replace those results with NA):
x <- rowSums(df, na.rm = TRUE); x[x == 0] <- NA; x
# [1] 32 12 NA 67
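The zero-value limitation of the rowSums approach can be avoided by detecting all-NA rows directly rather than testing whether the sum equals 0. A minimal sketch, using a variant of the example data with a genuine 0 in it (the data and the name s are assumptions for illustration):

```r
# Row 2 holds a real 0, row 3 holds no values at all
df <- data.frame(x = c(NA, 0, NA, 67), y = c(32, NA, NA, NA), z = NA_real_)

s <- rowSums(df, na.rm = TRUE)

# Mark only the rows that contain no non-NA entry
s[rowSums(!is.na(df)) == 0] <- NA
s   # 32 0 NA 67
```

This still sums the values when a row contains more than one number, so it only fits data with at most one non-NA entry per row, as in the question.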

You could use the Reduce function to combine columns pair by pair:
Reduce(function(x, y) {x[!is.na(y)] <- y[!is.na(y)] ; x}, df)
# [1] 32 12 NA 67
This function should work with non-numeric data, handles rows with multiple non-NA elements gracefully (it takes the rightmost), and should be a good deal more efficient than one relying on apply.
df.big <- df[rep(1:4, 1000),]
library(microbenchmark)
microbenchmark(apply(df.big, 1, function(x) { z <- x[!is.na(x)]; if(length(z)) z else NA}), {x <- rowSums(df.big, na.rm = TRUE); x[x == 0] <- NA; x}, Reduce(function(x, y) {x[!is.na(y)] <- y[!is.na(y)] ; x}, df.big))
# Unit: microseconds
# expr min
# apply(df.big, 1, function(x) { z <- x[!is.na(x)] if (length(z)) z else NA }) 14550.050
# { x <- rowSums(df.big, na.rm = TRUE) x[x == 0] <- NA x } 239.826
# Reduce(function(x, y) { x[!is.na(y)] <- y[!is.na(y)] x }, df.big) 353.326
# lq mean median uq max neval
# 15322.4825 19124.8814 17008.2935 22037.387 43337.893 100
# 257.2215 389.4275 380.6595 424.593 1585.234 100
# 384.4750 457.9714 436.2400 511.085 799.992 100
Basically the approach is about as efficient as the rowSums one proposed by @Thomas, but it can handle character and other data.
