Removing empty rows of a data file in R

I have a dataset with empty rows. I would like to remove them:
myData <- myData[-which(apply(myData, 1, function(x) all(is.na(x)))), ]
It seems to work. But now I would like to add a column to my data and initialize its first value:
myData$newCol[1] <- -999
Error in `$<-.data.frame`(`*tmp*`, "newCol", value = -999) :
replacement has 1 rows, data has 0
Unfortunately this fails, and I don't really understand why or how to solve it.
It worked when I removed one line at a time using:
TgData = TgData[2:nrow(TgData),]
Or anything similar.
It also works when I use only the first 13,000 rows.
But it doesn't work with my actual data, which has 32,000 rows.
What did I do wrong? It seems to make no sense to me.

I assume you want to remove rows that are all NAs. Then you can do the following:
data <- rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)) # sample data
data
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1   NA    4
[3,]    4    6    7
[4,]   NA   NA   NA
[5,]    4    8   NA
data[rowSums(is.na(data)) != ncol(data),]
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1   NA    4
[3,]    4    6    7
[4,]    4    8   NA
If you want to remove rows that have at least one NA, just change the condition:
data[rowSums(is.na(data)) == 0,]
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    6    7
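As an aside, this is likely what went wrong in the question: when no row is entirely NA, which() returns integer(0), and negative indexing with an empty index selects zero rows instead of all of them. A minimal sketch (the data frame ok is made up for illustration):
ok <- data.frame(a = 1:3, b = 4:6)  # contains no all-NA rows
idx <- which(apply(ok, 1, function(x) all(is.na(x))))
idx
# integer(0)
ok[-idx, ]  # 0 rows: -integer(0) is still integer(0)
The rowSums() condition above avoids this pitfall because it always returns a full-length logical vector.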

If you have empty rows, not NAs, you can do:
data[!apply(data == "", 1, all),]
To remove both (NAs and empty):
data <- data[!apply(is.na(data) | data == "", 1, all),]

Here are some dplyr options:
# sample data
df <- data.frame(a = c('1', NA, '3', NA), b = c('a', 'b', 'c', NA), c = c('e', 'f', 'g', NA))
library(dplyr)
# remove rows where all values are NA:
df %>% filter_all(any_vars(!is.na(.)))
df %>% filter_all(any_vars(complete.cases(.)))
# remove rows where any value is NA (keep only complete rows):
df %>% filter_all(all_vars(!is.na(.)))
df %>% filter_all(all_vars(complete.cases(.)))
# or more succinctly:
df %>% filter(complete.cases(.))
df %>% na.omit
# dplyr and tidyr:
library(tidyr)
df %>% drop_na
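In newer dplyr versions the filter_all()/_vars() verbs are superseded; a rough equivalent with if_any()/if_all() (same df as above) would be:
# keep rows with at least one non-NA value:
df %>% filter(if_any(everything(), ~ !is.na(.x)))
# keep only rows with no NA at all:
df %>% filter(if_all(everything(), ~ !is.na(.x)))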

An alternative for removing rows that are all NA uses the janitor package:
library(janitor)
myData %>% remove_empty("rows")
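remove_empty() can also drop all-NA columns in the same call, e.g.:
myData %>% remove_empty(c("rows", "cols"))  # drop empty rows and empty columns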

This is similar to some of the answers above, but here you can specify the percentage of missing values at or above which a row is removed (with the argument pct):
drop_rows_all_na <- function(x, pct = 1) x[!(rowSums(is.na(x)) >= ncol(x) * pct), ]
Here x is a data frame and pct is the threshold proportion of NA values at which a row is dropped.
pct = 1 removes rows whose values are 100% NA.
pct = .5 removes rows with at least half of their values NA.
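A quick usage sketch with the sample matrix data from the first answer:
drop_rows_all_na(data)             # drops only the all-NA row
drop_rows_all_na(data, pct = 0.5)  # drops rows that are at least half NA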

Using dplyr's if_all/if_any
Drop rows with any NA, i.e. select rows with no NA values:
df %>% filter(!if_any(a:c, is.na))
# a b c
#1 1 a e
#2 3 c g
#Also
df %>% filter(if_all(a:c, Negate(is.na)))
Drop rows with all NA values, i.e. select rows with at least one non-NA value:
df %>% filter(!if_all(a:c, is.na))
# a b c
#1 1 a e
#2 <NA> b f
#3 3 c g
#Also
df %>% filter(if_any(a:c, Negate(is.na)))
data
Using data from #sbha -
df <- data.frame(a = c('1', NA, '3', NA),
                 b = c('a', 'b', 'c', NA),
                 c = c('e', 'f', 'g', NA))

Here's yet another answer if you just want a handy function wrapper. Note that many of the solutions above remove rows with ANY NA, whereas this one removes only rows that are ALL NA.
data <- rbind(c(1,2,3), c(1, NA, 4), c(4,6,7), c(NA, NA, NA), c(4, 8, NA)) # sample data
data
rmNArows <- function(d) {
  # keep rows where the number of NAs is less than the number of columns
  goodRows <- apply(d, 1, function(x) sum(is.na(x)) != ncol(d))
  d[goodRows, ]
}
rmNArows(data)
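With the sample matrix above, only the all-NA row is dropped:
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1   NA    4
# [3,]    4    6    7
# [4,]    4    8   NA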

Related

Selecting Rows with Missing Data in a Range of Columns

There are several ways to identify and manipulate individual cells with missing data in R, e.g., with complete.cases or even rowSums.
However, I've not been able to find, or figure out myself, an expedient way to select rows that have missing data within a subsetted range of columns.
For example, in dataframe df:
df <- data.frame(D1 = c('A', 'B', 'C', 'D'),
                 D2 = c(NA, 0, 1, 1),
                 V1 = c(11, NA, 33, NA),
                 V2 = c(111, 222, NA, NA))
df
#   D1 D2 V1  V2
# 1  A NA 11 111
# 2  B  0 NA 222
# 3  C  1 33  NA
# 4  D  1 NA  NA
I would like to select all rows that have missing data in both columns V1 and V2, thus selecting row D but not rows B or C (or A).
I have a larger range of columns than given in this toy example, so selecting a set of columns with, e.g., & could make for a long command.
N.B., a similar SO question addresses selecting rows where none are NAs.
You can try this (with dplyr loaded):
df %>% filter(is.na(V1) & is.na(V2))
OUTPUT
D1 D2 V1 V2
1 D 1 NA NA
You can use dplyr::if_all. You can select the columns very flexibly with tidyselect, for instance using :, c, starts_with...
library(dplyr)
df %>%
  filter(if_all(V1:V2, is.na))
# D1 D2 V1 V2
#1 D 1 NA NA
Also works (this shows the flexibility of tidyselect):
filter(df, if_all(3:4, is.na))
filter(df, if_all(starts_with("V"), is.na))
filter(df, if_all(c(V1, V2), is.na))
filter(df, if_all((last_col()-1):last_col(), is.na))
filter(df, if_all(num_range("V", 1:2), is.na))

If statement with two conditions and NA

I am looking to use a conditional statement to access rows whose date is before 0021-01-11 and that have an NA value in a specific column (People_vaccinated, for example). For those rows I want to impute zero.
I want to use an IF statement with (condition1 AND condition2).
Condition1 could be df$People_vaccinated == NA and condition2 could be df$date < 'given date'.
Maybe this will help. Note that df$People_vaccinated == NA cannot work as a condition, because comparisons with NA always return NA; use is.na() instead.
df <- data.frame(Date = c('0021-01-07', '0021-01-08', '0021-01-11', '0021-01-12'),
                 a = c(2, NA, 3, NA),
                 b = c(1, NA, 2, 3))
ind <- match('0021-01-11', df$Date)
df$a[1:ind][is.na(df$a[1:ind])] <- 0
df
# Date a b
#1 0021-01-07 2 1
#2 0021-01-08 0 NA
#3 0021-01-11 3 2
#4 0021-01-12 NA 3
Or using dplyr -
library(dplyr)
df <- df %>%
  mutate(a = replace(a,
                     row_number() <= match('0021-01-11', Date) & is.na(a), 0))
df
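If the Date column can be parsed as real dates, the two conditions from the question also combine directly in base R; a sketch, assuming as.Date() handles the format:
df$a[as.Date(df$Date) <= as.Date('0021-01-11') & is.na(df$a)] <- 0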

Tidy way to add column if missing from data frame

I am looking for a tidy way to add a missing column if not present in the dataset. For example, df1 does not contain column "c".
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4))
desired output:
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4), c=c(NA, NA, NA, NA))
Assuming you don't want to overwrite the column if it is already present in your data, you can use tibble's add_column() together with an if condition that checks whether the column already exists.
library(dplyr)
library(tibble)  # add_column() comes from tibble
df1 <- data.frame(a=c(1:3, NA), b=c(NA,2:4))
if(!'c' %in% names(df1)) df1 <- df1 %>% add_column(c = NA)
df1
# a b c
#1 1 NA NA
#2 2 2 NA
#3 3 3 NA
#4 NA 4 NA
Tidy way would be dplyr::mutate I guess.
library(dplyr)
df1 <- df1 %>%
  mutate(c = NA)
No need to specify multiple NA as it will be recycled to fill all rows of the data frame.
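If several columns might be missing, a small base R sketch (the vector needed is hypothetical) adds each absent one in a loop:
needed <- c('c', 'd')
for (col in setdiff(needed, names(df1))) df1[[col]] <- NA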

Find last of several columns that is not NA (tidyverse)

Not sure what I'm doing wrong, but I'm struggling to get, per row, the index of the last column (among several columns) that is not NA.
Using tidyverse and across, I'm getting as many output columns as input columns, where I'd expect a single output column holding the index of the respective column.
dat <- data.frame(id = c(1, 2, 3),
                  x = c(1, NA, NA),
                  y = c(NA, NA, NA),
                  z = c(3, 1, NA))
I tried the following (among others, inspired by this one: Return last data frame column which is not NA):
dat %>%
  mutate(last = across(-id, ~ max.col(!is.na(.x), ties.method = "last")))
Expected outcome would be:
  id  x  y  z last
1  1  1 NA  3    3
2  2 NA NA  1    3
3  3 NA NA NA   NA
The problems with your current flow:
across is going to pass one column at a time to the function/expression; your code needs a row or a matrix/frame. For this, across is not appropriate.
Your desired output of NA for the last row is inconsistent with the logic: !is.na(.x) should return c(F,F,F), which still has a max. Your logic then requires a custom function, since you need to handle it differently.
Try this adaptation of max.col into a custom function:
max.col.notna <- function(m, ties.method = c("random", "first", "last")) {
  ties.method <- match.arg(ties.method)
  tieM <- which(ties.method == eval(formals()[["ties.method"]]))
  out <- .Internal(max.col(as.matrix(m), tieM))
  m[] <- !m %in% c(0, NA)  # 'm[] <-' is required to maintain the matrix shape
  replace(out, rowSums(m) == 0, NA_integer_)
}
dat %>%
  mutate(last = max.col.notna(!is.na(select(., -id)), ties.method = "last"))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
Note: I've edited/changed the function several times, trying to ensure a consistent API to the intent of this custom function. As it stands now, the notna in the function name to me reflects a sense of "emptiness" (either 0 or NA). With this logic, the function is usable with logical (as here) and numeric data. Perhaps it's overkill, but I prefer APIs that operate consistently/predictably across input classes.
tidyverse isn't really suitable for row-wise operations. Most of the time, reshaping the data into long format (as shown in #Rui Barradas' answer) is a good approach.
Here is one way using rowwise keeping the data wide.
library(dplyr)
dat %>%
  rowwise() %>%
  mutate(last = {ind <- which(!is.na(c_across(x:z)));
                 if (length(ind)) tail(ind, 1) else NA})
# id x y z last
# <dbl> <dbl> <lgl> <dbl> <int>
#1 1 1 NA 3 3
#2 2 NA NA 1 3
#3 3 NA NA NA NA
An R base solution, taking the largest index among the non-NA positions (or NA when there is none):
dat$last <- apply(dat[, 2:4], 1, function(x) {
  ind <- which(!is.na(x))
  if (length(ind)) max(ind) else NA_integer_
})
dat
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
You want to use c_across() and rowwise() to do this. rowwise() works similarly to group_by_all(), except it is more explicit. c_across() creates flat vectors out of columns (whereas across() creates tibbles).
If we first define a function separately to pull out the last non-NA value, or return NA if there are none:
get_last <- function(x){
  y <- c(NA, which(!is.na(x)))
  y[length(y)]
}
We can then apply that function to c_across() of the variables we need, after converting into a rowwise_df using rowwise():
dat %>%
  rowwise() %>%
  mutate(last = get_last(c_across(x:z)))
base R
df <- data.frame(id = c(1, 2, 3),
                 x = c(1, NA, NA),
                 y = c(NA, NA, NA),
                 z = c(3, 1, NA))
df$last <- apply(df[-1], 1, function(x) max(as.vector(!is.na(x)) * seq_len(length(x))))
df$last[df$last == 0] <- NA
df
#> id x y z last
#> 1 1 1 NA 3 3
#> 2 2 NA NA 1 3
#> 3 3 NA NA NA NA
Created on 2020-12-29 by the reprex package (v0.3.0)
Starting with a vector of NAs, you can step through each column, and whenever check_fun returns TRUE for an element, assign that column's index to the corresponding position. Unlike the other answers here, this neither checks the condition row-wise nor creates a matrix from the data. Whether creating two new temporary vectors per column beats converting the entire data to a matrix first is unclear, though.
library(tidyverse) # purrr and dplyr
last_matching_ind <- function(dat, check_fun){
  check_fun <- as_mapper(check_fun)
  reduce2(dat, seq_along(dat), .init = NA_integer_,
          function(prev, dat, ind) if_else(check_fun(dat), ind, prev))
}
dat %>%
  mutate(last = last_matching_ind(dat[-1], ~ !is.na(.x)))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA

How to remove rows with inf from a dataframe in R

I have a very large data frame (df) with approximately 35-45 columns (variables) and more than 300 rows. Some rows contain NA, NaN, Inf, or -Inf values in one or more variables. I have used na.omit(df) to remove rows with NA and NaN, but I can't remove rows with Inf and -Inf values using na.omit().
While searching I came across the thread Remove rows with Inf and NaN in R and used the modified code df[is.finite(df)], but it does not remove the rows with Inf and -Inf, and it also gives this error:
Error in is.finite(df) : default method not implemented for type
'list'
EDITED
Remove the entire row even if only one or a few of the columns contain Inf or -Inf.
To remove the rows with +/-Inf I'd suggest the following:
df <- df[!is.infinite(rowSums(df)),]
or, equivalently,
df <- df[is.finite(rowSums(df)),]
The second option (the one with is.finite() and without the negation) also removes rows containing NA values, in case that has not already been done.
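Note that rowSums() requires every column to be numeric. If df mixes in character or factor columns, a sketch that restricts the check to the numeric ones (num_cols is introduced here for illustration):
num_cols <- sapply(df, is.numeric)
df <- df[is.finite(rowSums(df[, num_cols, drop = FALSE])), ]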
Depending on the data, there are a couple of options using scoped variants of dplyr::filter() and is.finite() or is.infinite() that might be useful:
library(dplyr)
# sample data
df <- tibble(a = c(1, 2, 3, NA), b = c(5, Inf, 8, 8), c = c(9, 10, Inf, 11), d = c('a', 'b', 'c', 'd'))
# across all columns:
df %>%
  filter_all(all_vars(!is.infinite(.)))
# note that is.finite() does not work with NA or strings:
df %>%
  filter_all(all_vars(is.finite(.)))
# checking only numeric columns:
df %>%
  filter_if(~ is.numeric(.), all_vars(!is.infinite(.)))
# checking only select columns, in this case a through c:
df %>%
  filter_at(vars(a:c), all_vars(!is.infinite(.)))
is.finite() works on vectors rather than on data.frame objects, so we can loop through the data.frame with lapply() and keep only the 'finite' values.
lapply(df, function(x) x[is.finite(x)])
If the number of Inf/-Inf values differs between columns, the code above returns a list with elements of unequal length, so it may be better to leave the result as a list; a data.frame requires all columns to have equal lengths.
If we want to remove rows contain any NA or Inf/-Inf values
df[Reduce(`&`, lapply(df, function(x) !is.na(x) & is.finite(x))),]
Or a compact option by #nicola
df[Reduce(`&`, lapply(df, is.finite)),]
If we are ready to use a package, a compact option would be NaRV.omit
library(IDPmisc)
NaRV.omit(df)
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(1:5, NA, -Inf, Inf),
                                  20*5, replace=TRUE), ncol=5))
To keep the rows without Inf we can do:
df[apply(df, 1, function(x) all(is.finite(x))), ]
Rows containing NA or NaN are also removed, because is.finite() returns FALSE for NA and NaN as well, so all() is FALSE for those rows.
set.seed(24)
df <- as.data.frame(matrix(sample(c(0:9, NA, -Inf, Inf, NaN), 20*5, replace=TRUE), ncol=5))
df2 <- df[apply(df, 1, function(x) all(is.finite(x))), ]
Here are the results of the different is.* functions:
x <- c(42, NA, NaN, Inf)
is.finite(x)
# [1] TRUE FALSE FALSE FALSE
is.na(x)
# [1] FALSE TRUE TRUE FALSE
is.nan(x)
# [1] FALSE FALSE TRUE FALSE
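For completeness, is.infinite(), used in several answers here, is TRUE only for Inf and -Inf:
is.infinite(x)
# [1] FALSE FALSE FALSE  TRUE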
df[!is.infinite(df$x),]
where x is the column of df that contains the infinite values. The first answer posted relied on rowSums(), but for my own problem the df had columns that could not be added.
It took me a while to work this out for dplyr 1.0.0, so I thought I would put up the new version of #sbha's solutions using c_across(), since filter_all and filter_if are superseded.
library(dplyr)
df <- tibble(a = c(1, 2, 3, NA), b = c(5, Inf, 8, 8), c = c(9, 10, Inf, 11), d = c('a', 'b', 'c', 'd'))
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 2 Inf 10 b
# 3 3 8 Inf c
# 4 NA 8 11 d
df %>%
  rowwise %>%
  filter(!all(is.infinite(c_across(where(is.numeric)))))
# # A tibble: 4 x 4
# # Rowwise:
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 2 Inf 10 b
# 3 3 8 Inf c
# 4 NA 8 11 d
df %>%
  rowwise %>%
  filter(!any(is.infinite(c_across(where(is.numeric)))))
# # A tibble: 2 x 4
# # Rowwise:
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 NA 8 11 d
df %>%
  rowwise %>%
  filter(!any(is.infinite(c_across(a:c))))
# # A tibble: 2 x 4
# # Rowwise:
# a b c d
# <dbl> <dbl> <dbl> <chr>
# 1 1 5 9 a
# 2 NA 8 11 d
To be honest, I think #sbha's answer is simpler!
I had this problem and none of the above solutions worked for me. I used the following to remove rows with +/-Inf in columns 15 and 16 of my dataframe.
d <- subset(c, c[, 15] != "-Inf" & c[, 16] != "-Inf")
e <- subset(d, d[, 15] != "Inf" & d[, 16] != "Inf")
I consider myself new to coding and I couldn't get the recommendations above to work with my code.
I found a less complicated way to reduce a dataframe in two lines: first replace Inf with NA, then select the rows with complete data:
Df[sapply(Df, is.infinite)] <- NA
Df <- Df[complete.cases(Df), ]
