How to delete columns that contain ONLY NAs?


I have a data.frame containing some columns with all NA values. How can I delete them from the data.frame?
Can I use the function,
na.omit(...)
specifying some additional arguments?

One way of doing it:
df[, colSums(is.na(df)) != nrow(df)]
If the count of NAs in a column is equal to the number of rows, it must be entirely NA.
Or similarly
df[colSums(!is.na(df)) > 0]
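A minimal sketch of the idea on a toy data frame (the data and the names toy, keep and cleaned are made up for illustration). It also shows why na.omit() is not the right tool here: it drops rows that contain any NA, so a single all-NA column empties the data frame. Adding drop = FALSE is a precaution so the result stays a data frame even if only one column survives.

```r
# Toy data (hypothetical): 'nas' is entirely NA, 'vals' is only partly NA
toy <- data.frame(id = 1:3, nas = NA, vals = c(1, NA, 3))

# na.omit() works on rows: every row has an NA in 'nas', so no rows are left
nrow(na.omit(toy))

# Keep the columns that have at least one non-NA value
keep <- colSums(!is.na(toy)) > 0
cleaned <- toy[, keep, drop = FALSE]
names(cleaned)
```

The partly-NA column vals is kept; only nas is removed.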

Here is a dplyr solution:
df %>% select_if(~sum(!is.na(.)) > 0)
Update: The select_if() function is superseded as of dplyr 1.0. Here are two other solutions that use the where() tidyselect function:
df %>%
  select(where(~ sum(!is.na(.x)) > 0))
df %>%
  select(where(~ !all(is.na(.x))))

Another option is the janitor package:
df <- janitor::remove_empty(df, which = "cols")
https://github.com/sfirke/janitor

It seems like you want to remove ONLY columns with ALL NAs, leaving columns with some rows that do have NAs. I would do this (but I am sure there is an efficient vectorised solution):
# set seed for reproducibility; the vals column printed below may differ from your draw
set.seed(103)
df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , replace = TRUE ) )
df
# id nas vals
# 1 1 NA NA
# 2 2 NA 2
# 3 3 NA 1
# 4 4 NA 2
# 5 5 NA 2
# 6 6 NA 3
# 7 7 NA 2
# 8 8 NA 3
# 9 9 NA 3
# 10 10 NA 2
#Use this command to remove columns that are entirely NA values, it will leave columns where only some values are NA
df[ , ! apply( df , 2 , function(x) all(is.na(x)) ) ]
# id vals
# 1 1 NA
# 2 2 2
# 3 3 1
# 4 4 2
# 5 5 2
# 6 6 3
# 7 7 2
# 8 8 3
# 9 9 3
# 10 10 2
If you find yourself in the situation where you want to remove columns that have any NA values you can simply change the all command above to any.
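A small sketch of both variants side by side, on made-up data. It swaps apply() for vapply(), which is my substitution rather than the answer's code: apply() first converts the data frame to a matrix, while vapply() checks each column as it is. The names toy, all_na and has_any_na are hypothetical.

```r
# Toy data (hypothetical): 'nas' is entirely NA, 'vals' is only partly NA
toy <- data.frame(id = 1:3, nas = NA, vals = c(1, NA, 3))

# TRUE for columns in which every value is NA
all_na <- vapply(toy, function(x) all(is.na(x)), logical(1))

# TRUE for columns in which at least one value is NA
has_any_na <- vapply(toy, anyNA, logical(1))

# Drop only the all-NA columns (keeps 'id' and 'vals')
no_empty_cols <- toy[, !all_na, drop = FALSE]

# Drop every column that contains any NA (keeps only 'id')
complete_cols <- toy[, !has_any_na, drop = FALSE]
```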

Another option with Filter
Filter(function(x) !all(is.na(x)), df)
NOTE: Data from @Simon O'Hanlon's post.

An intuitive one-liner: dplyr::select_if(~!all(is.na(.))). It keeps only the columns that are not entirely missing, which is the same as deleting the all-NA columns.
> df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , repl = TRUE ) )
> df %>% glimpse()
Observations: 10
Variables: 3
$ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ nas <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ vals <int> NA, 1, 1, NA, 1, 1, 1, 2, 3, NA
> df %>% select_if(~!all(is.na(.)))
id vals
1 1 NA
2 2 1
3 3 1
4 4 NA
5 5 1
6 6 1
7 7 1
8 8 2
9 9 3
10 10 NA

Because performance was really important for me, I benchmarked all the functions above.
NOTE: Data from @Simon O'Hanlon's post, only with 15000 rows instead of 10.
library(tidyverse)
library(microbenchmark)
set.seed(123)
df <- data.frame(id = 1:15000,
                 nas = rep(NA, 15000),
                 vals = sample(c(1:3, NA), 15000,
                               replace = TRUE))
df
MadSconeF1 <- function(x) x[, colSums(is.na(x)) != nrow(x)]
MadSconeF2 <- function(x) x[colSums(!is.na(x)) > 0]
BradCannell <- function(x) x %>% select_if(~sum(!is.na(.)) > 0)
SimonOHanlon <- function(x) x[ , !apply(x, 2 ,function(y) all(is.na(y)))]
jsta <- function(x) janitor::remove_empty(x)
SiboJiang <- function(x) x %>% dplyr::select_if(~!all(is.na(.)))
akrun <- function(x) Filter(function(y) !all(is.na(y)), x)
mbm <- microbenchmark(
  "MadSconeF1" = {MadSconeF1(df)},
  "MadSconeF2" = {MadSconeF2(df)},
  "BradCannell" = {BradCannell(df)},
  "SimonOHanlon" = {SimonOHanlon(df)},
  "SiboJiang" = {SiboJiang(df)},
  "jsta" = {jsta(df)},
  "akrun" = {akrun(df)},
  times = 1000)
mbm
Results:
Unit: microseconds
         expr    min      lq      mean  median      uq      max neval cld
   MadSconeF1  154.5  178.35  257.9396  196.05  219.25   5001.0  1000 a
   MadSconeF2  180.4  209.75  281.2541  226.40  251.05   6322.1  1000 a
  BradCannell 2579.4 2884.90 3330.3700 3059.45 3379.30  33667.3  1000 d
 SimonOHanlon  511.0  565.00  943.3089  586.45  623.65 210338.4  1000 b
    SiboJiang 2558.1 2853.05 3377.6702 3010.30 3310.00  89718.0  1000 d
         jsta 1544.8 1652.45 2031.5065 1706.05 1872.65  11594.9  1000 c
        akrun   93.8  111.60  139.9482  121.90  135.45   3851.2  1000 a
autoplot(mbm)
# as_tibble() replaces tbl_df(), which is deprecated as of dplyr 1.0.0
mbm %>%
  as_tibble() %>%
  ggplot(aes(sample = time)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~expr, scales = "free")

Try as follows:
df <- df[,colSums(is.na(df))<nrow(df)]

Another option uses the map_lgl function from the purrr package, which returns a logical vector, together with [ to drop the columns that are all NA. Here is a reproducible example:
set.seed(7)
df <- data.frame(id = 1:5 , nas = rep(NA, 5) , vals = sample(c(1:3,NA), 5, repl = TRUE))
df
#> id nas vals
#> 1 1 NA 2
#> 2 2 NA 3
#> 3 3 NA 3
#> 4 4 NA NA
#> 5 5 NA 3
library(purrr)
df[!map_lgl(df, ~ all(is.na(.)))]
#> id vals
#> 1 1 2
#> 2 2 3
#> 3 3 3
#> 4 4 NA
#> 5 5 3
Created on 2022-08-28 with reprex v2.0.2

Related

Split df column of integers into individual digits in R

I have a df where one variable is an integer. I'd like to split this column into its individual digits. See my example below
Group Number
A 456
B 3
C 18
To
Group Number Digit1 Digit2 Digit3
A 456 4 5 6
B 3 3 NA NA
C 18 1 8 NA

We can use read.fwf from base R. Find the maximum number of characters (nchar) in the 'Number' column (mx). Read the 'Number' column after converting it to character (as.character), specify the 'widths' as 1 by replicating 1 mx times, and assign the output to new 'Digit' columns in the data:
mx <- max(nchar(df1$Number))
df1[paste0("Digit", seq_len(mx))] <- read.fwf(textConnection(
as.character(df1$Number)), widths = rep(1, mx))
-output
df1
# Group Number Digit1 Digit2 Digit3
#1 A 456 4 5 6
#2 B 3 3 NA NA
#3 C 18 1 8 NA
data
df1 <- structure(list(Group = c("A", "B", "C"), Number = c(456L, 3L,
18L)), class = "data.frame", row.names = c(NA, -3L))

Another base R option (I think @akrun's approach using read.fwf is much simpler)
cbind(
df,
with(
df,
type.convert(
`colnames<-`(do.call(
rbind,
lapply(
strsplit(as.character(Number), ""),
`length<-`, max(nchar(Number))
)
), paste0("Digit", seq(max(nchar(Number))))),
as.is = TRUE
)
)
)
which gives
Group Number Digit1 Digit2 Digit3
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA

Using splitstackshape::cSplit
splitstackshape::cSplit(df, 'Number', sep = '', stripWhite = FALSE, drop = FALSE)
# Group Number Number_1 Number_2 Number_3
#1: A 456 4 5 6
#2: B 3 3 NA NA
#3: C 18 1 8 NA

Updated
I realized I could use the max function to get the character count for each row and pass it to my map2 function, which saves some lines of code. This came from an accident that led to an inspiration by dear @ThomasIsCoding.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df %>%
rowwise() %>%
mutate(map2_dfc(Number, 1:max(nchar(Number)), ~ str_sub(.x, .y, .y))) %>%
unnest(cols = !c(Group, Number)) %>%
rename_with(~ str_replace(., "\\.\\.\\.", "Digit"), .cols = !c(Group, Number)) %>%
mutate(across(!c(Group, Number), as.numeric))
# A tibble: 3 x 5
Group Number Digit1 Digit2 Digit3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Data
df <- tribble(
~Group, ~Number,
"A", 456,
"B", 3,
"C", 18
)

Two base R methods:
no_cols <- max(nchar(as.character(df1$Number)))
# Using `strsplit()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(strsplit(as.character(df1$Number), ""),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
# Using `regmatches()` and `gregexpr()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(regmatches(df1$Number, gregexpr("\\d", df1$Number)),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
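Both methods rely on the same trick: assigning a larger length to a vector pads it with NA, which is what lines up numbers with fewer digits. A self-contained sketch of just that step, using the numbers from the question; the helper name pad and the variables numbers, n_digits, parts and digits are hypothetical.

```r
# Numbers from the question
numbers <- c(456L, 3L, 18L)
n_digits <- max(nchar(as.character(numbers)))

# Growing a vector with `length<-` fills the new positions with NA
pad <- function(x, n) {
  length(x) <- n
  x
}

# Split each number into single characters, then pad every piece to n_digits
parts <- lapply(strsplit(as.character(numbers), ""), pad, n = n_digits)

# Stack the padded pieces row by row and convert the characters to integers
digits <- matrix(as.integer(unlist(parts)), ncol = n_digits, byrow = TRUE,
                 dimnames = list(NULL, paste0("Digit", seq_len(n_digits))))
digits
```

The resulting matrix can be attached to the original data with cbind(), as in the answers above.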

Find last of several columns that is not NA (tidyverse)

Not sure what I'm doing wrong but I'm struggling getting the index per row of the last column (among several columns) that is not NA.
Using tidyverse and across, I'm getting as many output columns as input columns where I'd expect one single output column with the index of the respective column.
dat <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
I tried the following (among others, inspired by this one: Return last data frame column which is not NA):
dat %>%
mutate(last = across(-id, ~max.col(!is.na(.x), ties.method="last")))
Expected outcome would be:
id x y z last
1 1 1 NA 3 3
2 2 NA NA 1 3
3 3 NA NA NA NA

The problems with your current flow:
across is going to pass one column at a time to the function/expression; your code needs a row or a matrix/frame. For this, across is not appropriate.
Your desired output of NA for the last row is inconsistent with the logic: !is.na(.x) should return c(F,F,F), which still has a max. Your logic then requires a custom function, since you need to handle it differently.
Try this adaptation of max.col into a custom function:
max.col.notna <- function (m, ties.method = c("random", "first", "last")) {
ties.method <- match.arg(ties.method)
tieM <- which(ties.method == eval(formals()[["ties.method"]]))
out <- .Internal(max.col(as.matrix(m), tieM))
m[] <- !m %in% c(0,NA) # 'm[] <-' is required to maintain the matrix shape
replace(out, rowSums(m) == 0, NA_integer_)
}
dat %>%
mutate(last = max.col.notna(!is.na(select(., -id)), ties.method = "last"))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
Note: I've edited/changed the function several times, trying to ensure a consistent API to the intent of this custom function. As it stands now, the notna in the function name to me reflects a sense of "emptiness" (either 0 or NA). With this logic, the function is usable with logical (as here) and numeric data. Perhaps it's overkill, but I prefer APIs that operate consistently/predictably across input classes.
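The second problem listed above can be checked directly: for a row where every entry is FALSE, max.col() still reports a column. The sketch below shows that, then masks such rows using only the exported max.col() rather than .Internal(). This is a simplified variant for logical input, not a replacement for the custom function; the names not_na, raw and last are hypothetical.

```r
dat <- data.frame(id = c(1, 2, 3),
                  x = c(1, NA, NA),
                  y = c(NA, NA, NA),
                  z = c(3, 1, NA))

# Logical matrix: TRUE where a value is present
not_na <- !is.na(dat[-1])

# With ties.method = "last", an all-FALSE row still yields a column index
raw <- max.col(not_na, ties.method = "last")

# Replace the index with NA for rows that have no non-NA value at all
last <- replace(raw, rowSums(not_na) == 0, NA_integer_)
last
```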

tidyverse isn't really suited to row-wise operations. Most of the time, reshaping the data into long format (as shown in @Rui Barradas's answer) is a good approach.
Here is one way using rowwise keeping the data wide.
library(dplyr)
dat %>%
rowwise() %>%
mutate(last = {ind = which(!is.na(c_across(x:z)));
if(length(ind)) tail(ind, 1) else NA})
# id x y z last
# <dbl> <dbl> <lgl> <dbl> <int>
#1 1 1 NA 3 3
#2 2 NA NA 1 3
#3 3 NA NA NA NA

A base R solution:
# For each row, take the position of the last non-NA value, or NA if there is none
dat$last = apply(dat[,2:4], 1,
FUN = function(x) { ind <- which(!is.na(x)); if (length(ind)) max(ind) else NA })
dat
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA

You want to use c_across() and rowwise() to do this. rowwise() works similarly to group_by_all(), except it is more explicit. c_across() creates flat vectors out of columns (whereas across() creates tibbles).
First, we separately define a function that pulls out the index of the last non-NA value, or returns NA if there is none:
get_last <- function(x){
y <- c(NA,which(!is.na(x)))
y[length(y)]
}
We can then apply that function to c_across() of the variables we need, but only after converting the data into a rowwise_df using rowwise():
dat %>%
rowwise() %>%
mutate(last = get_last(c_across(x:z)))

base R
df <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
df$last <- apply(df[-1], 1, function(x) max(as.vector(!is.na(x)) * seq_len(length(x))))
df$last[df$last == 0] <- NA
df
#> id x y z last
#> 1 1 1 NA 3 3
#> 2 2 NA NA 1 3
#> 3 3 NA NA NA NA
Created on 2020-12-29 by the reprex package (v0.3.0)

Starting with a vector of NAs, you could step through each col and if the given element passes your check_fun returning TRUE, assign the index of that col to that element. The difference from the other answers here is that this does not check the condition row-wise or create a matrix from the data. Not sure whether creating two new temp vectors for each column is better/worse than just converting the entire data to a matrix first though.
library(tidyverse) # purrr and dplyr
last_matching_ind <- function(dat, check_fun){
check_fun <- as_mapper(check_fun)
reduce2(dat, seq_along(dat), .init = NA_integer_,
function(prev, dat, ind) if_else(check_fun(dat), ind, prev) )
}
dat %>%
mutate(last = last_matching_ind(dat[-1], ~ !is.na(.x)))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA

Can I assign a vector function result to split data column of a tibble?

I have some tidy data, comprising identical length series of 'value' ordered by 'idx', identified by 'id', categorized by 'type', with an empty result column 'rollAgg' tagged on.
> head(df,15)
# A tibble: 15 x 5
id idx type value rollAgg
<int> <int> <chr> <dbl> <dbl>
1 1 1 A 4.50 0
2 1 2 A 2.47 0
3 1 3 A 2.78 0
4 1 4 A 2.29 0
5 1 5 A 1.48 0
6 1 6 A 2.30 0
7 1 7 A 4.94 0
8 1 8 A 4.68 0
9 1 9 A 3.36 0
10 1 10 A 4.27 0
11 2 1 B 4.10 0
12 2 2 B 1.25 0
13 2 3 B 3.95 0
14 2 4 B 2.78 0
15 2 5 B 2.28 0
...
I want to split the data by 'id', and then use rollapply() to generate a vector of either rolling mean(value) or rolling sum(value), determined by 'type'.
Can I assign the vector result from rollapply() to the empty 'rollAgg' column in the split() data and then unsplit()? (rather than create an empty vector of required size and then cbind())
I can assign the result(s) to an empty vector (or matrix)
## switchable mean/sum function
mean_sum <- function(x, b = TRUE){
  if (b) {
    mean(x)
  } else {
    sum(x)
  }
}
##
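#dummy data (tibble() comes from the tibble package; zoo() and rollapply() come from zoo)
library(tibble)
library(zoo)
df <- tibble(id = rep(1:6, each = 10), idx = rep(1:10, 6), type = rep(c('A', 'B'), each = 10, times = 3), value = runif(60, 1, 5), rollAgg = 0.0)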
#test mean/sum function on single 'id', and assign result to 'rollAgg' column
d <- df[df$id==2,]
z <- zoo(d$value, order.by = d$idx)
par <- d$type[1]
d$rollAgg <- (rollapply(z, 5, mean_sum, b = (par == 'A'), fill = NA, align = 'right'))
#prepare split data
by_id <- split(df, df$id)
#assign result to pre-assigned matrix
result <- as_tibble(matrix(data=0.0, nrow = 10, ncol = 6, dimnames=list(NULL,seq(1,6,1))))
for (i in seq_along(by_id)){
par <- by_id[[i]]$type[1]
z <- zoo(by_id[[i]]$value, order.by = by_id[[i]]$idx)
result[[i]] <- rollapply(z, 5, mean_sum, b = (par == 'A'), fill = NA, align = 'right')
}
#... which works - columns are alternating mean() and sum():
> head(result, 10)
# A tibble: 10 x 6
`1` `2` `3` `4` `5` `6`
<S3: zoo> <S3: zoo> <S3: zoo> <S3: zoo> <S3: zoo> <S3: zoo>
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 NA NA NA NA NA NA
5 2.702983 14.35262 2.308507 16.58130 2.808490 14.63715
6 2.263146 13.47958 2.026396 14.90904 2.733020 14.75438
7 2.757074 15.46849 2.073545 16.27923 2.508627 14.56983
8 3.135715 14.84012 2.003807 13.15344 2.834664 14.33360
9 3.348647 15.67731 2.377744 14.19039 2.584147 16.21944
10 3.907222 14.40763 2.520130 14.86086 2.915271 15.48656
#try to assign result direct to split data, without success...
for (i in by_id){
par <- i$type[1]
z <- zoo(i$value, order.by = i$idx)
i$rollAgg <- rollapply(z, 5, mean_sum, b = (par == 'A'), fill = NA, align = 'right')
}
# finally, not sure how to unsplit() by_id to revert to original df...

If your aim is to run rollapply on value separately for each id then instead of using split use ave:
b <- TRUE
roll <- function(x) rollapplyr(x, 5, mean_sum, b = b, fill = NA)
transform(df, rollAgg = ave(value, id, FUN = roll))
or
b <- TRUE
rollb <- function(b) {
function(x) rollapplyr(x, 5, mean_sum, b = b, fill = NA)
}
transform(df, rollAgg = ave(value, id, FUN = rollb(b)))

About "not sure how to unsplit...": here is a short answer with the iris data :)
unsplit(split(x = iris, f = iris$Species), f = iris$Species)
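The loop in the question fails because for (i in by_id) gives i a copy of each piece, so assigning to i$rollAgg never reaches by_id. A sketch of the split / modify / unsplit round trip that returns the modified pieces instead. To keep it free of package dependencies it swaps rollapply() for a small hand-written helper, roll_right, which is hypothetical and only mimics a right-aligned window filled with NA; the data is a smaller made-up version of the question's.

```r
set.seed(1)
# Smaller dummy data in the shape used in the question
df <- data.frame(id = rep(1:2, each = 6), idx = rep(1:6, 2),
                 type = rep(c("A", "B"), each = 6), value = runif(12, 1, 5))

# Hypothetical stand-in for rollapply(..., fill = NA, align = "right"):
# applies fun to the last k values, leaving NA until the window is full
roll_right <- function(x, k, fun) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= k) out[i] <- fun(x[(i - k + 1):i])
  }
  out
}

by_id <- split(df, df$id)

# lapply() returns the modified pieces, so the new column is not lost
by_id <- lapply(by_id, function(d) {
  fun <- if (d$type[1] == "A") mean else sum
  d$rollAgg <- roll_right(d$value, 5, fun)
  d
})

# unsplit() puts the rows back in their original order
res <- unsplit(by_id, df$id)
```

In the question's own loop, the equivalent fix would be to index by position and assign back with by_id[[i]]$rollAgg <- ... before calling unsplit().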

A neat solution, based on rollapply with dplyr and magrittr:
df <- tibble(id = rep(1:6, each = 10), idx = rep(1:10, 6), type = rep(c('A', 'B'), each = 10, times = 3), value = runif(60, 1, 5))
df %<>%
group_by(id) %>%
mutate(rollAgg=rollapply(value,5, mean_sum, b = (type[1] == 'A'), fill = NA, align="right"))

Variable names as Input in an R Function

I have a dataframe with several numeric variables along with factors. I wish to run over the numeric variables and replace the negative values with missing values. I couldn't do that.
My alternative idea was to write a function that gets a dataframe and a variable, and does it. It didn't work either.
My code is:
NegativeToMissing = function(df,var)
{
df$var[df$var < 0] = NA
}
Error in $<-.data.frame(`*tmp*`, "var", value = logical(0)) : replacement has 0 rows, data has 40
what am I doing wrong ?
Thank you.

Here is an example with some dummy data.
df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
col2 = 1:5,
col3 = LETTERS[1:5])
df1
# col1 col2 col3
#1 -1 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 -3 5 E
Now find columns that are numeric
numeric_cols <- sapply(df1, is.numeric)
And replace negative values
df1[numeric_cols] <- lapply(df1[numeric_cols], function(x) replace(x, x < 0 , NA))
df1
# col1 col2 col3
#1 NA 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 NA 5 E
You could also do
df1[df1 < 0] <- NA
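As for why the function in the question fails: df$var looks for a column literally named "var" and does not use the value of the argument, so it finds nothing and the replacement has zero rows. The function also never returns the modified data frame. A minimal base R sketch of a working version, assuming the column name is passed as a string; the name negative_to_missing is hypothetical.

```r
# Replace negative values in one column with NA.
# 'var' is a column name given as a character string (an assumption of this sketch).
negative_to_missing <- function(df, var) {
  # [[ ]] evaluates 'var', unlike $, which treats it as a literal name
  df[[var]][df[[var]] < 0] <- NA
  df  # return the modified copy; the caller's data frame is not changed in place
}

df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
                  col2 = 1:5,
                  col3 = LETTERS[1:5])

fixed <- negative_to_missing(df1, "col1")
fixed$col1
```

Because R passes arguments by value, the result has to be assigned, for example df1 <- negative_to_missing(df1, "col1").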

With tidyverse, we can make use of mutate_if
library(tidyverse)
df1 %>%
mutate_if(is.numeric, ~ replace(., . < 0, NA)) # funs() is deprecated, so a formula is used instead

If you still want to change only one selected variable, a solution with dplyr would be to use non-standard evaluation:
library(dplyr)
NegativeToMissing <- function(df, var) {
quo_var = quo_name(var)
df %>%
mutate(!!quo_var := ifelse(!!var < 0, NA, !!var))
}
NegativeToMissing(data, var=quo(val1)) # use quo() function without ""
# val1 val2
# 1 1 1
# 2 NA 2
# 3 2 3
Data used:
data <- data.frame(val1 = c(1, -1, 2),
val2 = 1:3)
data
# val1 val2
# 1 1 1
# 2 -1 2
# 3 2 3

Sum of two Columns of Data Frame with NA Values

I have a data frame with some NA values. I need the sum of two of the columns. If a value is NA, I need to treat it as zero.
a b c d
1 2 3 4
5 NA 7 8
Column e should be the sum of b and c:
e
5
7
I have tried a lot of things, and done two dozen searches with no luck. It seems like a simple problem. Any help would be appreciated!

dat$e <- rowSums(dat[,c("b", "c")], na.rm=TRUE)
dat
# a b c d e
# 1 1 2 3 4 5
# 2 5 NA 7 8 7
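A self-contained version of the same call, since the answer does not define dat. The first two rows come from the question; the third row, where both b and c are NA, is made up to show that na.rm = TRUE then yields 0. The last step, which keeps NA for such rows instead, is an optional variation and the names both_na and e_keep_na are hypothetical.

```r
# Rows 1-2 are from the question; row 3 is a made-up case with b and c both NA
dat <- data.frame(a = c(1, 5, 9),
                  b = c(2, NA, NA),
                  c = c(3, 7, NA),
                  d = c(4, 8, 0))

# NA is treated as zero, so a row with no values at all sums to 0
dat$e <- rowSums(dat[, c("b", "c")], na.rm = TRUE)

# Optional variation: report NA when both inputs are missing
both_na <- rowSums(is.na(dat[, c("b", "c")])) == 2
dat$e_keep_na <- ifelse(both_na, NA, dat$e)
dat
```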

dplyr solution, taken from here:
library(dplyr)
dat %>%
rowwise() %>%
mutate(e = sum(b, c, na.rm = TRUE))

Here is another solution, with concatenated ifelse():
# No assignments inside ifelse(); the extra branch covers the case where only c is NA
dat$e <- ifelse(is.na(dat$b) & is.na(dat$c), 0, ifelse(is.na(dat$b), dat$c, ifelse(is.na(dat$c), dat$b, dat$b + dat$c)))
# a b c d e
#1 1 2 3 4 5
#2 5 NA 7 8 7
Edit: here is another solution that uses with(), as suggested by @kasterma in the comments. It is much more readable and straightforward:
dat$e <- with(dat, ifelse(is.na(b) & is.na(c), 0, ifelse(is.na(b), c, ifelse(is.na(c), b, b + c))))

If you want to keep NA when both columns have it, you can use the data.table way.
Sample data:
library(data.table)
dt <- data.table(x = sample(c(NA, 1, 2, 3), 100, replace = TRUE), y = sample(c(NA, 1, 2, 3), 100, replace = TRUE))
Solution:
dt[, z := ifelse(is.na(x) & is.na(y), NA_real_, rowSums(.SD, na.rm = TRUE)), .SDcols = c("x", "y")]

I hope this helps.
In some cases you have a few columns that are not numeric. This approach handles both cases.
Note: c_across() requires dplyr version 1.0.0 or later.
library(dplyr)
df <- data.frame(
TEXT = c("text1", "text2"), a = c(1,5), b = c(2, NA), c = c(3,7), d = c(4,8))
df2 <- df %>%
rowwise() %>%
mutate(e = sum(c_across(a:d), na.rm = TRUE))
# A tibble: 2 x 6
# Rowwise:
# TEXT a b c d e
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 text1 1 2 3 4 10
# 2 text2 5 NA 7 8 20
