Find the index of columns containing more than 5 NA values - r

I want to subset a dataframe and extract only the columns that contain 5 or more NA values.
df <- data.frame(A = rep(1, 10),
                 B = c(rep(2, 5), rep(3, 5)),
                 D = rep(5, 10),
                 E = c(rep(1, 2), rep(NA, 6), rep(6, 2)),
                 F = c(rep(NA, 2), rep(2, 8)))
A B D E F
1 1 2 5 1 NA
2 1 2 5 1 NA
3 1 2 5 NA 2
4 1 2 5 NA 2
5 1 2 5 NA 2
6 1 3 5 NA 2
7 1 3 5 NA 2
8 1 3 5 NA 2
9 1 3 5 6 2
10 1 3 5 6 2
So in this example I want to have the index of the column "E".
My original dataset has about 3000 columns, so speed is more or less important.
I have been trying to do this with sum(is.na) and filter_if(any_vars), but to no avail.

Using colSums with is.na
names(df)[colSums(is.na(df))>5]
[1] "E"

We can use colSums on the logical matrix is.na(df), get the index with which, and extract the names:
names(which(colSums(is.na(df)) >= 5))
#[1] "E"

which(unlist(lapply(df, function(x) sum(is.na(x)) > 5)))
E 
4 
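Since the question mentions a dplyr attempt, a dplyr-flavoured sketch is possible too; this assumes dplyr >= 1.0 for where() and is not one of the answers above. The colSums() approaches are vectorised and scale fine to 3000 columns.
library(dplyr)
# keep only the columns that contain five or more NAs
df %>% select(where(function(x) sum(is.na(x)) >= 5))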

Related

How to make some values of a column NA with respect to another column

I want to set column A to NA in each row where column B is 2:
data
A B
1 2
2 4
NA 5
6 2
output
A B
NA 2
2 4
NA 5
NA 2
The first and last rows have B equal to 2, so A becomes NA in those rows.
Here's a way using ifelse in base R -
df$A <- ifelse(df$B == 2, NA_real_, df$A)
set.seed(0)
df <- data.frame(A = sample(1:10, size = 5, replace = T),
                 B = sample(1:10, size = 5, replace = T))
df
A B
1 9 7
2 4 2
3 7 3
4 1 1
5 2 5
df$A[df$B == 2] <- NA
df
A B
1 9 7
2 NA 2
3 7 3
4 1 1
5 2 5
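If dplyr is already in use, the same replacement can be sketched with mutate() and replace() (this is an addition, not part of the answers above); replace() keeps the column's type, so no NA_real_/NA_integer_ juggling is needed.
library(dplyr)
df <- df %>% mutate(A = replace(A, B == 2, NA))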

Applying custom function to each row uses only first value of argument

I am trying to recode NA values to 0 in a subset of columns using the following dataset:
set.seed(1)
df <- data.frame(
  id = c(1:10),
  trials = sample(1:3, 10, replace = T),
  t1 = c(sample(c(1:9, NA), 10)),
  t2 = c(sample(c(1:7, rep(NA, 3)), 10)),
  t3 = c(sample(c(1:5, rep(NA, 5)), 10))
)
Each row has a certain number of trials associated with it (between 1 and 3), specified by the trials column. Columns t1-t3 represent scores for each trial.
The number of trials indicates the subset of columns in which NAs should be recoded to 0: NAs that are within the number of trials represent missing data, and should be recoded as 0, while NAs outside the number of trials are not meaningful, and should remain NAs. So, for a row where trials == 3, an NA in column t3 would be recoded as 0, but in a row where trials == 2, an NA in t3 would remain an NA.
So, I tried using this function:
replace0 <- function(x, num.sun) {
  x[which(is.na(x[1:(num.sun + 2)]))] <- 0
  return(x)
}
This works well for single vectors. When I try applying the same function to a data frame with apply(), though:
apply(df, 1, replace0, num.sun = df$trials)
I get a warning saying:
In 1:(num.sun + 2) :
numerical expression has 10 elements: only the first used
The result is that instead of having the value of num.sun change every row according to the value in trials, apply() simply uses the first value in the trials column for every single row. How could I apply the function so that the num.sun argument changes according to the value of df$trials?
Thanks!
Edit: as some have commented, the original example data had some non-NA scores that didn't make sense according to the trials column. Here's a corrected dataset:
df <- data.frame(
  id = c(1:5),
  trials = c(rep(1, 2), rep(2, 1), rep(3, 2)),
  t1 = c(NA, 7, NA, 6, NA),
  t2 = c(NA, NA, 3, 7, 12),
  t3 = c(NA, NA, NA, 4, NA)
)
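One direct fix for the warning itself, sketched here by reusing the question's replace0() and df (it is not one of the answers below), is to loop over row indices so that each call receives that row's own trials value:
# apply replace0() row by row, so num.sun is a single value per call
res <- t(sapply(seq_len(nrow(df)),
                function(i) replace0(unlist(df[i, ]), num.sun = df$trials[i])))
res <- as.data.frame(res)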
Another approach:
# create an index of the NA values
w <- which(is.na(df), arr.ind = TRUE)
# create an index with the max column by row where an NA is allowed to be replaced by a zero
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
# subset 'w' such that only the NA's which fall in the scope of 'm' remain
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
# use 'i' to replace the allowed NA's with a zero
df[i] <- 0
which gives:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
You could easily wrap this in a function:
replace.NA.with.0 <- function(df) {
  w <- which(is.na(df), arr.ind = TRUE)
  m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
  i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
  df[i] <- 0
  return(df)
}
Now, using replace.NA.with.0(df) will produce the above result.
As noted by others, some rows (1, 3 & 10) have more values than trials. You could tackle that problem by rewriting the above function to:
replace.with.NA.or.0 <- function(df) {
  w <- which(is.na(df), arr.ind = TRUE)
  # rebuild the row / max-allowed-column index so the function is self-contained
  m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
  df[w] <- 0
  v <- tapply(m[,2], m[,1], FUN = function(x) tail(x:5, -1))
  ina <- matrix(as.integer(unlist(stack(v)[2:1])), ncol = 2)
  df[ina] <- NA
  return(df)
}
Now, using replace.with.NA.or.0(df) produces the following result:
id trials t1 t2 t3
1 1 1 3 NA NA
2 2 2 2 2 NA
3 3 2 6 6 NA
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 NA
9 9 2 1 3 NA
10 10 1 9 NA NA
Here I just rewrite your function using double subsetting, x[paste0('t', x['trials'])], which overcomes the problem the other two solutions have with row 6.
replace0 <- function(x) {
  # browser()
  x_na <- x[paste0('t', x['trials'])]
  if (is.na(x_na)) { x[paste0('t', x['trials'])] <- 0 }
  return(x)
}
t(apply(df, 1, replace0))
id trials t1 t2 t3
[1,] 1 1 3 NA 5
[2,] 2 2 2 2 NA
[3,] 3 2 6 6 4
[4,] 4 3 NA 1 2
[5,] 5 1 5 NA NA
[6,] 6 3 7 NA 0
[7,] 7 3 8 7 0
[8,] 8 2 4 5 1
[9,] 9 2 1 3 NA
[10,] 10 1 9 4 3
Here is a way to do it:
x <- is.na(df)
df[x & t(apply(x, 1, cumsum)) > 3 - df$trials] <- 0
The output looks like this:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
Note: rows 1, 3, and 10 are problematic since they have more non-NA values than trials.
Here's a tidyverse way; note that it doesn't give the same output as the other solutions.
Your example data shows results for trials that "didn't happen"; I assumed your real data doesn't.
library(tidyverse)
df %>%
  nest(matches("^t\\d")) %>%
  mutate(data = map2(data, trials, ~mutate_all(., replace_na, 0) %>% select(., 1:.y))) %>%
  unnest()
# id trials t1 t2 t3
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
Using the more commonly used gather strategy this would be:
df %>%
  gather(k, v, matches("^t\\d")) %>%
  arrange(id) %>%
  group_by(id) %>%
  slice(1:first(trials)) %>%
  mutate_at("v", ~replace(., is.na(.), 0)) %>%
  spread(k, v)
# # A tibble: 10 x 5
# # Groups: id [10]
# id trials t1 t2 t3
# <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
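gather() and spread() are superseded in current tidyr; assuming a recent tidyr is installed (an assumption, not part of the original answer), the same idea can be sketched with pivot_longer() and pivot_wider(), which should reproduce the output above:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(matches("^t\\d"), names_to = "k", values_to = "v") %>%
  group_by(id) %>%
  mutate(v = replace(v, row_number() <= trials & is.na(v), 0),  # missing scores within trials -> 0
         v = replace(v, row_number() > trials, NA)) %>%         # scores beyond trials -> NA
  ungroup() %>%
  pivot_wider(names_from = k, values_from = v)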

Remove semi duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on column A & C) so that the row with values in column B are kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', 'b', and the logical vector is.na(b), so that all NA elements come last within each group of 'a'. Then apply duplicated and keep only the non-duplicated rows:
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
First create two data sets, one containing the rows whose value in column a is duplicated and one containing the rows whose value in column a appears only once:
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
Now use na.omit on x to delete the rows with NA, then rbind x and x1 to get the final data set.
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9
You can use dplyr to do this.
df %>% distinct(a, c, .keep_all = TRUE)
Output
a b c d
1 A NA 1 1
2 A 2 2 3
3 B 4 4 4
4 B 1 1 5
5 C 2 2 7
6 D NA 2 9
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr
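Note that distinct() keeps the first row it meets within each a/c group, which is why row 1 (with NA in b) survives above. If the row with a value in b should win, one option (a sketch, not part of the original answer) is to sort the NAs to the back of each group first:
df %>%
  arrange(a, is.na(b), b) %>%
  distinct(a, c, .keep_all = TRUE)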

Selecting values in a dataframe based on a priority list

I am new to R so am still getting my head around the way it works. My problem is as follows: I have a data frame and a prioritised list of columns (pl), and I need:
To find the maximum value across the columns in pl for each row and create a new column with this value (df$MAX)
Using the priority list, to take the first non-NA priority value and subtract it from this maximum, returning the absolute difference
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first line the maximum is from column A (15) and the priority value is from column D (9) since E is a NA. The answer I want should look like this.
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
df$MAX <- apply(df[,pl], 1, max, na.rm = T)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df,1,max,na.rm=T)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
Note that this code will break for any row that is all NA. There is no check against that.
Here is a simple version. First I take only the pl columns; then, for each row, I remove the NAs and compute the max.
df <- dat[, pl]
cbind(dat, t(apply(df, 1, function(x) {
  x <- na.omit(x)
  c(max(x), max(x) - x[1])
})))
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
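A vectorised variant, as a sketch that assumes the question's data frame is stored in df (it is not one of the posted answers): pmax() with na.rm = TRUE returns NA rather than -Inf for the all-NA row, so no clean-up step is needed.
pl <- c("E", "D", "A", "B")
# row-wise max over the priority columns; an all-NA row gives NA, not -Inf
df$MAX <- do.call(pmax, c(df[pl], na.rm = TRUE))
# first non-NA value in priority order for each row
pr <- apply(df[pl], 1, function(x) x[!is.na(x)][1])
df$MAX_PR <- df$MAX - pr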

R dataframe filtering

I have a dataframe df as follows:
A B C
NA 1 2
2 NA 3
4 5 6
7 8 9
What I want to do is remove all the rows that have NA.
If I use
apply(df,1,function(row) all(!is.na(row)))
I get a logical vector with TRUE (if the row does not contain an NA) and FALSE (if it does).
But how do I get the row names so that I can do something like
df2 <- df[-c(list of rows that contain NA), ]
which will give me a new data frame without the rows that contain NA.
Thanks in advance.
Assuming you have a dataframe that looks like this:
A B C
1 NA 1 2
2 2 NA 3
3 4 5 6
4 7 8 9
Then try:
df1[apply(df1,1,function(x) !any(is.na(x))), ]
A B C
3 4 5 6
4 7 8 9
It doesn't use rownames but rather a logical vector. I guess Joshua and I read your question differently, but we used the same method.
Joshua's suggestion is more compact:
> na.omit(df1)
A B C
3 4 5 6
4 7 8 9
And it reminds me that I should have used:
> df1[complete.cases(df1), ]
A B C
3 4 5 6
4 7 8 9
You can use the logical vector from your apply call to index your data.frame.
> Data[!apply(Data,1,function(row) all(!is.na(row))),]
A B C
1 NA 1 2
2 2 NA 3
> # or like this:
> Data[apply(Data,1,function(row) any(is.na(row))),]
A B C
1 NA 1 2
2 2 NA 3
is.na on a data.frame returns a matrix, which is a better candidate for apply:
df <- read.table(textConnection(" A B C
NA 1 2
2 NA 3
4 5 6
7 8 9
"))
## a matrix
is.na(df)
## logical for selecting rows that are all NA
apply(df, 1, function(x) all(is.na(x)))
## one liner
df[!apply(df, 1, function(x) all(is.na(x))), ]
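A small alternative sketch (not from the answers above): count the NAs per row and keep only the complete rows, which is equivalent to complete.cases() here.
## keep rows with no NA in any column
df[rowSums(is.na(df)) == 0, ]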
