Find last of several columns that is not NA (tidyverse) - r

Not sure what I'm doing wrong, but I'm struggling to get the per-row index of the last column (among several columns) that is not NA.
Using tidyverse and across, I'm getting as many output columns as input columns where I'd expect one single output column with the index of the respective column.
dat <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
I tried the following (among others, inspired by this one: Return last data frame column which is not NA):
dat %>%
mutate(last = across(-id, ~max.col(!is.na(.x), ties.method="last")))
Expected outcome would be:
id x y z last
1 1 1 NA 3 3
2 2 NA NA 1 3
3 3 NA NA NA NA

The problems with your current flow:
across is going to pass one column at a time to the function/expression; your code needs a row or a matrix/frame. For this, across is not appropriate.
Your desired output of NA for the last row is inconsistent with the logic: !is.na(.x) should return c(F,F,F), which still has a max. Your logic then requires a custom function, since you need to handle it differently.
Try this adaptation of max.col into a custom function:
max.col.notna <- function (m, ties.method = c("random", "first", "last")) {
ties.method <- match.arg(ties.method)
tieM <- which(ties.method == eval(formals()[["ties.method"]]))
out <- .Internal(max.col(as.matrix(m), tieM))
m[] <- !m %in% c(0,NA) # 'm[] <-' is required to maintain the matrix shape
replace(out, rowSums(m) == 0, NA_integer_)
}
dat %>%
mutate(last = max.col.notna(!is.na(select(., -id)), ties.method = "last"))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
Note: I've edited/changed the function several times, trying to ensure a consistent API to the intent of this custom function. As it stands now, the notna in the function name to me reflects a sense of "emptiness" (either 0 or NA). With this logic, the function is usable with logical (as here) and numeric data. Perhaps it's overkill, but I prefer APIs that operate consistently/predictably across input classes.
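If calling .Internal() directly feels fragile, roughly the same result can be had with the exported max.col() plus a rowSums() check. This is only a sketch, not the function above; the helper name last_not_na is made up for illustration.

```r
# Hypothetical helper: per-row position (within m) of the last non-NA entry.
last_not_na <- function(m) {
  not_na <- !is.na(as.matrix(m))
  out <- max.col(not_na, ties.method = "last")
  # Rows with no non-NA entry still get a "max" from max.col(); blank them out.
  out[rowSums(not_na) == 0] <- NA_integer_
  out
}

dat <- data.frame(id = c(1, 2, 3),
                  x = c(1, NA, NA),
                  y = c(NA, NA, NA),
                  z = c(3, 1, NA))
dat$last <- last_not_na(dat[-1])
dat
```

Note that the index counts positions within the columns passed in (here x, y, z), not within the whole data frame.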

tidyverse isn't really suited to row-wise operations. Most of the time, reshaping the data into long format (as shown in @Rui Barradas's answer) is a good approach.
Here is one way using rowwise keeping the data wide.
library(dplyr)
dat %>%
rowwise() %>%
mutate(last = {ind = which(!is.na(c_across(x:z)));
if(length(ind)) tail(ind, 1) else NA})
# id x y z last
# <dbl> <dbl> <lgl> <dbl> <int>
#1 1 1 NA 3 3
#2 2 NA NA 1 3
#3 3 NA NA NA NA

An R base solution:
# index of the last non-NA value in each row, or NA if the row is all NA
dat$last = apply(dat[, 2:4], 1,
FUN = function(x) { ind <- which(!is.na(x)); if (length(ind)) max(ind) else NA_integer_ })
dat
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA

You want to use c_across() and rowwise() to do this. rowwise() works similarly to group_by_all(), except it is more explicit. c_across() creates flat vectors out of columns (whereas across() creates tibbles).
If we first define a separate function that returns the index of the last non-NA value, or NA if there are none:
get_last <- function(x){
y <- c(NA,which(!is.na(x)))
y[length(y)]
}
We can then apply that function to c_across() over the variables we need, but only after converting to a rowwise_df using rowwise():
dat %>%
rowwise() %>%
mutate(last = get_last(c_across(x:z)))
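To see what get_last() returns before wiring it into rowwise(), it can be called on plain vectors. The inputs below are made up to cover the three cases in the example data plus one with a trailing NA.

```r
get_last <- function(x) {
  # The leading NA is the fallback that is returned when nothing is non-NA.
  y <- c(NA, which(!is.na(x)))
  y[length(y)]
}

get_last(c(1, NA, 3))    # 3
get_last(c(NA, NA, 1))   # 3
get_last(c(1, NA, NA))   # 1
get_last(c(NA, NA, NA))  # NA
```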

base R
df <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
df$last <- apply(df[-1], 1, function(x) max(as.vector(!is.na(x)) * seq_len(length(x))))
df$last[df$last == 0] <- NA
df
#> id x y z last
#> 1 1 1 NA 3 3
#> 2 2 NA NA 1 3
#> 3 3 NA NA NA NA
Created on 2020-12-29 by the reprex package (v0.3.0)

Starting with a vector of NAs, you could step through each col and if the given element passes your check_fun returning TRUE, assign the index of that col to that element. The difference from the other answers here is that this does not check the condition row-wise or create a matrix from the data. Not sure whether creating two new temp vectors for each column is better/worse than just converting the entire data to a matrix first though.
library(tidyverse) # purrr and dplyr
last_matching_ind <- function(dat, check_fun){
check_fun <- as_mapper(check_fun)
reduce2(dat, seq_along(dat), .init = NA_integer_,
function(prev, dat, ind) if_else(check_fun(dat), ind, prev) )
}
dat %>%
mutate(last = last_matching_ind(dat[-1], ~ !is.na(.x)))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
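The same column-by-column idea can be sketched in base R with a plain loop, which may make the mechanics easier to follow. The variable names cols and last are chosen here for illustration.

```r
dat <- data.frame(id = c(1, 2, 3),
                  x = c(1, NA, NA),
                  y = c(NA, NA, NA),
                  z = c(3, 1, NA))

cols <- c("x", "y", "z")
last <- rep(NA_integer_, nrow(dat))
for (i in seq_along(cols)) {
  # Later columns overwrite earlier ones, so the last match wins.
  last[!is.na(dat[[cols[i]]])] <- i
}
dat$last <- last
dat
```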

Related

Dplyr: add multiple columns with mutate/across from character vector

I want to add several columns (filled with NA) to a data.frame using dplyr. I've defined the names of the columns in a character vector. Usually, with only one new column, you can use the following pattern:
test %>%
mutate(!!new_column := NA)
However, I don't get it to work with across:
library(dplyr)
test <- data.frame(a = 1:3)
add_cols <- c("col_1", "col_2")
test %>%
mutate(across(!!add_cols, ~ NA))
#> Error: Problem with `mutate()` input `..1`.
#> x Can't subset columns that don't exist.
#> x Columns `col_1` and `col_2` don't exist.
#> ℹ Input `..1` is `across(c("col_1", "col_2"), ~NA)`.
test %>%
mutate(!!add_cols := NA)
#> Error: The LHS of `:=` must be a string or a symbol
expected_output <- data.frame(
a = 1:3,
col_1 = rep(NA, 3),
col_2 = rep(NA, 3)
)
expected_output
#> a col_1 col_2
#> 1 1 NA NA
#> 2 2 NA NA
#> 3 3 NA NA
Created on 2021-10-05 by the reprex package (v1.0.0)
With the first approach, the column names are correctly created, but then it directly tries to find it in the existing column names. In the second approach, I can't use anything other than a single string.
Is there a tidyverse solution or do I need to resort to the good old for loop?
The !! works for a single element
for(nm in add_cols) test <- test %>%
mutate(!! nm := NA)
-output
> test
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
Or another option is
test %>%
bind_cols(setNames(rep(list(NA), length(add_cols)), add_cols))
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
In base R, this is easier
test[add_cols] <- NA
Which can be used in a pipe
test %>%
`[<-`(., add_cols, value = NA)
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
across works only on columns that already exist: it loops over columns present in the data and either modifies them or creates new columns from them via the .names argument.
We could make use of add_column from tibble
library(tibble)
library(janitor)
add_column(test, !!! add_cols) %>%
clean_names %>%
mutate(across(all_of(add_cols), ~ NA))
a col_1 col_2
1 1 NA NA
2 2 NA NA
3 3 NA NA
Another approach:
library(tidyverse)
f <- function(x) NA # value used to fill each new column
mutate(test, !!!map(set_names(add_cols), f)) # splice a named list of new columns

Paste two columns in R but NAs are included in new column [duplicate]

This question already has answers here:
How to omit NA values while pasting numerous column values together?
(2 answers)
suppress NAs in paste()
(13 answers)
Closed 1 year ago.
I am trying to concatenate two columns in R using:
df_new$conc_variable <- paste(df$var1, df$var2)
My dataset looks as follows:
id var1 var2
1 10 NA
2 NA 8
3 11 NA
4 NA 1
I am trying to get it such that there is a third column:
id var1 var2 conc_var
1 10 NA 10
2 NA 8 8
3 11 NA 11
4 NA 1 1
but instead I get:
id var1 var2 conc_var
1 10 NA 10NA
2 NA 8 8NA
3 11 NA 11NA
4 NA 1 1NA
Is there a way to exclude NAs in the paste process? I tried including na.rm = FALSE but that just added FALSE at the end of the NA in the conc_var column. Here is the dataset:
id <- c(1,2,3,4)
var1 <- c(10, NA, 11, NA)
var2 <- c(NA, 8, NA, 1)
df <- data.frame(id, var1, var2)
One out of many options is to use ifelse as in:
df <- data.frame(var1 = c(10, NA, 11, NA),
var2 = c(NA, 8, NA, 1))
df$new <- ifelse(is.na(df$var1), yes = df$var2, no = df$var1)
print(df)
Depending on the circumstances rowSums might be suitable as well as in
df$new2 <- rowSums(df[, c("var1", "var2")], na.rm = TRUE)
print(df)
You can use tidyr::unite -
df <- tidyr::unite(df, conc_var, var1, var2, na.rm = TRUE, remove = FALSE)
df
# id conc_var var1 var2
#1 1 10 10 NA
#2 2 8 NA 8
#3 3 11 11 NA
#4 4 1 NA 1
If, as in the example, each row has only one non-NA value, you can also use pmax or coalesce.
pmax(df$var1, df$var2, na.rm = TRUE)
dplyr::coalesce(df$var1, df$var2)
You could use glue from the glue package instead.
glue::glue(10, NA, .na = '')
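When a row may hold more than one non-NA value, a base R sketch is to drop the NAs per row before pasting. This assumes the columns should be joined with no separator, as in the desired output above.

```r
df <- data.frame(id = c(1, 2, 3, 4),
                 var1 = c(10, NA, 11, NA),
                 var2 = c(NA, 8, NA, 1))

# For each row, keep only the non-NA entries and paste them together.
df$conc_var <- apply(df[c("var1", "var2")], 1,
                     function(r) paste(r[!is.na(r)], collapse = ""))
df
```

The new column is character; wrap it in as.numeric() if a numeric result is needed and every row has at most one value.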

Replace certain variables with NA, one variable is NA

What is the best function to use if I want to replace certain variables with NA based on a conditional?
If status = NA, then score_1:score_3 will be NA
tried:
if(df2$status == NA){
df2$score_2 <- NA
}else{
df2$score_2 <- df$score_2
}
Thanks in advance
One option in base R is to find the NAs in 'Status' and set the columns whose names start with 'Score' to NA in those rows
i1 <- is.na(df2$Status)
df2[i1, grep("^Score_\\d+$", names(df2))] <- NA
Or an option in dplyr
library(dplyr)
df2 %>%
mutate_at(vars(starts_with('Score')), ~ replace(., is.na(Status), NA))
You can do this by finding which rows have NA in Status and then setting the score columns in those rows to NA.
df <- data.frame(client_id = 1:4,
Date = 1:4,
Status = c(1, NA, 1, NA),
Score1 = runif(4)*100,
Score2 = runif(4)*100,
Score3 = runif(4)*100)
idx <- is.na(df$Status)
df[idx, 4:6] <- NA
df
#> client_id Date Status Score1 Score2 Score3
#> 1 1 1 1 48.08677 16.62185 91.80062
#> 2 2 2 NA NA NA NA
#> 3 3 3 1 14.04552 64.55724 56.45998
#> 4 4 4 NA NA NA NA
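The attempt in the question fails because comparing anything with NA using == yields NA rather than TRUE or FALSE, so is.na() is needed instead. A small sketch with made-up data (the column names follow the question's description and are assumptions):

```r
status <- c(1, NA, 1, NA)
status == NA    # NA NA NA NA: never TRUE, so it cannot drive an if()
is.na(status)   # FALSE TRUE FALSE TRUE

# Hypothetical data frame for illustration.
df2 <- data.frame(status = status,
                  score_1 = c(10, 20, 30, 40),
                  score_2 = c(5, 6, 7, 8),
                  score_3 = c(1, 2, 3, 4))

score_cols <- grep("^score_", names(df2))
df2[is.na(df2$status), score_cols] <- NA
df2
```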

How to remove rows with inf from a dataframe in R

I have a very large dataframe(df) with approximately 35-45 columns(variables) and rows greater than 300. Some of the rows contains NA,NaN,Inf,-Inf values in single or multiple variables and I have used
na.omit(df) to remove rows with NA and NaN, but I can't remove rows with Inf and -Inf values using the na.omit function.
While searching I came across the thread "Remove rows with Inf and NaN in R" and used the modified code df[is.finite(df)], but it does not remove the rows with Inf and -Inf and also gives this error:
Error in is.finite(df) : default method not implemented for type
'list'
EDITED
Remove the entire row if one or more of its columns contain Inf or -Inf
To remove the rows with +/-Inf I'd suggest the following:
df <- df[!is.infinite(rowSums(df)),]
or, equivalently,
df <- df[is.finite(rowSums(df)),]
The second option (the one with is.finite() and without the negation) removes also rows containing NA values in case that this has not already been done.
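One caveat worth checking with the rowSums() approach: a row that contains both Inf and -Inf sums to NaN, which is.infinite() does not flag, so only the is.finite() variant drops it. The data below is made up to show the difference.

```r
df <- data.frame(a = c(1, Inf, Inf, 4),
                 b = c(2, 3, -Inf, NA))

rs <- rowSums(df)   # 3, Inf, NaN, NA

keep_not_inf <- df[!is.infinite(rs), ]  # row 3 (Inf and -Inf) and row 4 (NA) survive
keep_finite  <- df[is.finite(rs), ]     # only row 1 survives

rownames(keep_not_inf)
rownames(keep_finite)
```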
Depending on the data, there are a couple options using scoped variants of dplyr::filter() and is.finite() or is.infinite() that might be useful:
library(dplyr)
# sample data
df <- tibble(a = c(1, 2, 3, NA), b = c(5, Inf, 8, 8), c = c(9, 10, Inf, 11), d = c('a', 'b', 'c', 'd')) # data_frame() is deprecated in favour of tibble()
# across all columns:
df %>%
filter_all(all_vars(!is.infinite(.)))
# note that is.finite() is FALSE for NA and for character values, so with the character column d this drops every row:
df %>%
filter_all(all_vars(is.finite(.)))
# checking only numeric columns:
df %>%
filter_if(~is.numeric(.), all_vars(!is.infinite(.)))
# checking only select columns, in this case a through c:
df %>%
filter_at(vars(a:c), all_vars(!is.infinite(.)))
The is.finite works on vector and not on data.frame object. So, we can loop through the data.frame using lapply and get only the 'finite' values.
lapply(df, function(x) x[is.finite(x)])
If the number of Inf, -Inf values are different for each column, the above code will have a list with elements having unequal length. So, it may be better to leave it as a list. If we want a data.frame, it should have equal lengths.
If we want to remove rows contain any NA or Inf/-Inf values
df[Reduce(`&`, lapply(df, function(x) !is.na(x) & is.finite(x))),]
Or a compact option by @nicola
df[Reduce(`&`, lapply(df, is.finite)),]
If we are ready to use a package, a compact option would be NaRV.omit
library(IDPmisc)
NaRV.omit(df)
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(1:5, NA, -Inf, Inf),
20*5, replace=TRUE), ncol=5))
To keep the rows without Inf we can do:
df[apply(df, 1, function(x) all(is.finite(x))), ]
NAs are handled by this as well:
is.finite() returns FALSE for NA, so all() is FALSE for that row and the row is dropped.
Rows with NaN are dropped for the same reason.
set.seed(24)
df <- as.data.frame(matrix(sample(c(0:9, NA, -Inf, Inf, NaN), 20*5, replace=TRUE), ncol=5))
df2 <- df[apply(df, 1, function(x) all(is.finite(x))), ]
Here are the results of the different is.~-functions:
x <- c(42, NA, NaN, Inf)
is.finite(x)
# [1] TRUE FALSE FALSE FALSE
is.na(x)
# [1] FALSE TRUE TRUE FALSE
is.nan(x)
# [1] FALSE FALSE TRUE FALSE
df[!is.infinite(df$x),]
wherein x is the column of df that contains the infinite values. The first answer posted was contingent on rowsums but for my own problem, the df had columns which could not be added.
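When some columns cannot be summed (for example character columns), one base R sketch is to test only the numeric columns and combine the per-column results. The names is_num and has_inf are chosen here for illustration.

```r
df <- data.frame(a = c(1, 2, 3, NA),
                 b = c(5, Inf, 8, 8),
                 c = c(9, 10, Inf, 11),
                 d = c("a", "b", "c", "d"))

is_num  <- vapply(df, is.numeric, logical(1))
# TRUE for a row if any numeric column holds Inf or -Inf in that row.
has_inf <- Reduce(`|`, lapply(df[is_num], is.infinite))
clean   <- df[!has_inf, ]
clean
```

Rows with NA are kept here, because is.infinite(NA) is FALSE.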
It took me a while to work this out for dplyr 1.0.0, so I thought I would post a new version of @sbha's solutions using c_across, since filter_all and filter_if are superseded.
library(dplyr)
df <- tibble(a = c(1, 2, 3, NA), b = c(5, Inf, 8, 8), c = c(9, 10, Inf, 11), d = c('a', 'b', 'c', 'd'))
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 2 Inf 10 b
# 3 3 8 Inf c
# 4 NA 8 11 d
df %>%
rowwise %>%
filter(!all(is.infinite(c_across(where(is.numeric)))))
# # A tibble: 4 x 4
# # Rowwise:
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 2 Inf 10 b
# 3 3 8 Inf c
# 4 NA 8 11 d
df %>%
rowwise %>%
filter(!any(is.infinite(c_across(where(is.numeric)))))
# # A tibble: 2 x 4
# # Rowwise:
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 NA 8 11 d
df %>%
rowwise %>%
filter(!any(is.infinite(c_across(a:c))))
# # A tibble: 2 x 4
# # Rowwise:
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 NA 8 11 d
To be honest, I think @sbha's answer is simpler!
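A shorter variant without rowwise() may be possible with if_any(), assuming dplyr 1.0.4 or later (where if_any() and if_all() were added). This is a sketch on the same sample data, not a replacement for the answers above.

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3, NA),
             b = c(5, Inf, 8, 8),
             c = c(9, 10, Inf, 11),
             d = c("a", "b", "c", "d"))

# Drop rows where any numeric column is Inf or -Inf; NA rows are kept.
out <- df %>%
  filter(!if_any(where(is.numeric), is.infinite))
out
```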
I had this problem and none of the above solutions worked for me. I used the following to remove rows with +/-Inf in columns 15 and 16 of my dataframe.
d<-subset(c, c[,15:16]!="-Inf")
e<-subset(d, d[,15:16]!="Inf")
I consider myself new to coding and I couldn't get the recommendations above to work with my code.
I found a less complicated way to reduce a dataframe with 2 lines, first by replacing Inf with NA, then by selecting rows with complete data:
Df[sapply(Df, is.infinite)] <- NA
Df<-Df[complete.cases(Df), ]

How to replace NA values in a table for selected columns

There are a lot of posts about replacing NA values. I am aware that one could replace NAs in the following table/frame with the following:
x[is.na(x)]<-0
But what if I want to restrict it to only certain columns? Let me show you an example.
First, let's start with a dataset.
set.seed(1234)
x <- data.frame(a=sample(c(1,2,NA), 10, replace=T),
b=sample(c(1,2,NA), 10, replace=T),
c=sample(c(1:5,NA), 10, replace=T))
Which gives:
a b c
1 1 NA 2
2 2 2 2
3 2 1 1
4 2 NA 1
5 NA 1 2
6 2 NA 5
7 1 1 4
8 1 1 NA
9 2 1 5
10 2 1 1
Ok, so I only want to restrict the replacement to columns 'a' and 'b'. My attempt was:
x[is.na(x), 1:2]<-0
and:
x[is.na(x[1:2])]<-0
Which does not work.
My data.table attempt, where y<-data.table(x), was obviously never going to work:
y[is.na(y[,list(a,b)]), ]
I want to pass columns inside the is.na argument but that obviously wouldn't work.
I would like to do this in a data.frame and a data.table. My end goal is to recode the 1:2 to 0:1 in 'a' and 'b' while keeping 'c' the way it is, since it is not a logical variable. I have a bunch of columns so I don't want to do it one by one. And, I'd just like to know how to do this.
Do you have any suggestions?
You can do:
x[, 1:2][is.na(x[, 1:2])] <- 0
or better (IMHO), use the variable names:
x[c("a", "b")][is.na(x[c("a", "b")])] <- 0
In both cases, 1:2 or c("a", "b") can be replaced by a pre-defined vector.
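An equivalent sketch that some find easier to read replaces the NAs column by column with lapply(). The small data frame below is made up so the result is easy to check; cols is the pre-defined vector mentioned above.

```r
x <- data.frame(a = c(1, NA, 2),
                b = c(NA, 2, 1),
                c = c(NA, 5, 3))

cols <- c("a", "b")
# Replace NA with 0 in the selected columns only; column c is left untouched.
x[cols] <- lapply(x[cols], function(v) replace(v, is.na(v), 0))
x
```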
Building on @Robert McDonald's tidyr::replace_na() answer, here are some dplyr options for controlling which columns the NAs are replaced:
library(tidyverse)
# by column type:
x %>%
mutate_if(is.numeric, ~replace_na(., 0))
# select columns defined in vars(col1, col2, ...):
x %>%
mutate_at(vars(a, b, c), ~replace_na(., 0))
# all columns:
x %>%
mutate_all(~replace_na(., 0))
Edit 2020-06-15
Since data.table 1.12.4 (Oct 2019), data.table gains two functions to facilitate this: nafill and setnafill.
nafill operates on columns:
cols = c('a', 'b')
y[ , (cols) := lapply(.SD, nafill, fill=0), .SDcols = cols]
setnafill operates on tables (the replacements happen by-reference/in-place)
setnafill(y, cols=cols, fill=0)
# print y to show the effect
y[]
This will also be more efficient than the other options; see ?nafill for more, including the last-observation-carried-forward (LOCF) and next-observation-carried-backward (NOCB) versions of NA imputation for time series.
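To illustrate the three fill modes, nafill() can be tried on a plain numeric vector. This assumes data.table 1.12.4 or later is installed; the vector v is made up.

```r
library(data.table)

v <- c(1, NA, NA, 4, NA)

const <- nafill(v, fill = 0)        # constant fill:        1 0 0 4 0
locf  <- nafill(v, type = "locf")   # carry last forward:   1 1 1 4 4
nocb  <- nafill(v, type = "nocb")   # carry next backward:  1 4 4 4 NA
```

With "nocb" the trailing NA stays NA because there is no later observation to carry back.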
This will work for your data.table version:
for (col in c("a", "b")) y[is.na(get(col)), (col) := 0]
Alternatively, as David Arenburg points out below, you can use set (side benefit - you can use it either on data.frame or data.table):
for (col in 1:2) set(x, which(is.na(x[[col]])), col, 0)
This is now trivial in tidyr with replace_na(). The function appears to work for data.tables as well as data.frames:
tidyr::replace_na(x, list(a=0, b=0))
Not sure if this is more concise, but this function will also find and allow replacement of NAs (or any value you like) in selected columns of a data.table:
update.mat <- function(dt, cols, criteria) {
require(data.table)
x <- as.data.frame(which(criteria==TRUE, arr.ind = TRUE))
y <- as.matrix(subset(x, x$col %in% which((names(dt) %in% cols), arr.ind = TRUE)))
y
}
To apply it:
y[update.mat(y, c("a", "b"), is.na(y))] <- 0
The function creates a matrix of the selected columns and rows (cell coordinates) that meet the input criteria (in this case is.na == TRUE).
We can solve it the data.table way with the tidyr::replace_na function and lapply
library(data.table)
library(tidyr)
setDT(df)
df[,c("a","b","c"):=lapply(.SD,function(x) replace_na(x,0)),.SDcols=c("a","b","c")]
The same approach also handles pasting columns that contain NA: first apply replace_na(x, ""), then use stringr::str_c to combine the columns.
Starting from the data.table y, you can just write:
y[, (cols):=lapply(.SD, function(i){i[is.na(i)] <- 0; i}), .SDcols = cols]
Don't forget to library(data.table) before creating y and running this command.
This needed a bit extra for dealing with NA's in factors.
Found a useful function here, which you can then use with mutate_at or mutate_if:
replace_factor_na <- function(x){
x <- as.character(x)
x <- if_else(is.na(x), 'NONE', x)
x <- as.factor(x)
}
df <- df %>%
mutate_at(
vars(vector_of_column_names),
replace_factor_na
)
Or apply to all factor columns:
df <- df %>%
mutate_if(is.factor, replace_factor_na)
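In base R, a similar effect can be sketched without converting to character, by adding the new level first and then assigning it. The label "NONE" follows the function above; the factor f is made up.

```r
f <- factor(c("a", NA, "b", NA))

# A value can only be assigned to a factor if it is already one of its levels.
levels(f) <- c(levels(f), "NONE")
f[is.na(f)] <- "NONE"
f
```

This keeps the existing level order and appends "NONE" at the end.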
For a specific column, there is an alternative with sapply
DF <- data.frame(A = letters[1:5],
B = letters[6:10],
C = c(2, 5, NA, 8, NA))
DF_NEW <- sapply(seq(1, nrow(DF)),
function(i) ifelse(is.na(DF[i,3]) ==
TRUE,
0,
DF[i,3]))
DF[,3] <- DF_NEW
DF
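For a single column, the row-by-row sapply() can usually be avoided, since is.na() is already vectorised. A sketch on the same data:

```r
DF <- data.frame(A = letters[1:5],
                 B = letters[6:10],
                 C = c(2, 5, NA, 8, NA))

# Logical indexing replaces every NA in column C in one step.
DF$C[is.na(DF$C)] <- 0
DF
```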
For completeness, built upon @sbha's answer, here is the tidyverse version with the across() function that's available in dplyr since version 1.0 (which supersedes the *_at() variants, and others):
# random data
set.seed(1234)
x <- data.frame(a = sample(c(1, 2, NA), 10, replace = T),
b = sample(c(1, 2, NA), 10, replace = T),
c = sample(c(1:5, NA), 10, replace = T))
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
# with the magrittr pipe
x %>% mutate(across(1:2, ~ replace_na(.x, 0)))
#> a b c
#> 1 2 2 5
#> 2 2 2 2
#> 3 1 0 5
#> 4 0 2 2
#> 5 1 2 NA
#> 6 1 2 3
#> 7 2 2 4
#> 8 2 1 4
#> 9 0 0 3
#> 10 2 0 1
# with the native pipe (since R 4.1)
x |> mutate(across(1:2, ~ replace_na(.x, 0)))
#> a b c
#> 1 2 2 5
#> 2 2 2 2
#> 3 1 0 5
#> 4 0 2 2
#> 5 1 2 NA
#> 6 1 2 3
#> 7 2 2 4
#> 8 2 1 4
#> 9 0 0 3
#> 10 2 0 1
Created on 2021-12-08 by the reprex package (v2.0.1)
it's quite handy with data.table and stringr
library(data.table)
library(stringr)
x[, lapply(.SD, function(xx) {str_replace_na(xx, 0)})]
FYI
this works fine for me
DataTable DT = new DataTable();
DT = DT.AsEnumerable().Select(R =>
{
R["Campo1"] = valor;
return (R);
}).ToArray().CopyToDataTable();

Resources