Select columns, excluding some which are all NA - r

Suppose I have this dataframe
df <- data.frame(keep = c(1, NA, 2),
also_want = c(NA, NA, NA),
maybe = c(1, 2, NA),
maybe_2 = c(NA, NA, NA))
Edit: In the actual dataframe there are many columns I'd like to keep, so spelling them all out isn't viable. These columns are all the columns that do not start with maybe. The maybe columns, instead, do have a common naming like maybe, maybe_1 etc. that could work with grep or stringr::str_detect
I want to select keep, and also_want. I also want any of the maybe columns that have values other than NA
desired_df
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
I can use select_if to get all columns that have non-NA values, but then I lose also_want
library(dplyr)
df %>%
select_if(~sum(!is.na(.)) > 0)
keep maybe
1 1 1
2 NA 2
3 2 NA
Thoughts?

With dplyr 1.0.0 you can use the where function inside a select statement to test for conditions that your variables have to satisfy, but first you specify the variables you also want to keep.
EDIT
I've added the condition that only the "maybe" variables have to contain values other than NA; every column that does not start with "maybe" is selected first.
df %>%
select(!starts_with("maybe"), starts_with("maybe") & where(~sum(!is.na(.)) > 0))
Output
# keep also_want maybe
# 1 1 NA 1
# 2 NA NA 2
# 3 2 NA NA

Following your comments, in base R we can use:
df[,!apply(
rbind(
grepl("maybe",colnames(df)),
!apply(df, 2, function(x) !all(is.na(x)))
)
,2,all)]
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
Or if you prefer seeing the same code all on 1 line:
df[,!apply(rbind(grepl("maybe",colnames(df)),!apply(df, 2, function(x) !all(is.na(x)))),2,all)]
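The same idea may be easier to follow when the two conditions are named separately. This is only a sketch on the toy data from the question; the names is_maybe and all_na are made up for illustration, and the pattern "^maybe" assumes the maybe columns always start with that prefix, as the question describes.

```r
# Toy data from the question
df <- data.frame(keep = c(1, NA, 2),
                 also_want = c(NA, NA, NA),
                 maybe = c(1, 2, NA),
                 maybe_2 = c(NA, NA, NA))

# TRUE for columns whose name starts with "maybe" (assumed naming scheme)
is_maybe <- grepl("^maybe", names(df))
# TRUE for columns that contain only NA
all_na <- vapply(df, function(x) all(is.na(x)), logical(1))

# Drop a column only if it is a "maybe" column AND entirely NA
result <- df[, !(is_maybe & all_na), drop = FALSE]
names(result)
```

drop = FALSE keeps the result a data frame even if only one column survives.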

I eventually figured this out: use str_detect to select all non-maybe columns, and a one-liner inside sapply to also select any other columns (i.e. any maybe columns) that have non-NA values.
library(dplyr)
library(stringr)
df %>%
select_if(stringr::str_detect(names(.), "maybe", negate = TRUE) |
sapply(., function(x) sum(!is.na(x)) > 0))
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA

Related

How to count the number of rows with NA values in specific columns?

I have a dataframe with 502543 obs. of 9 variables including the ID (which is repeated several times). I need to count how many rows have NA values in all variables except in ID. I don't want to delete this ID column, as later I will need to count n_distinct(ID), that's why I am looking for a method to count rows with NA values in all columns except in this one.
My dataframe looks like this sample:
ID neckpain backpain kneepain
1 Yes NA NA
2 NA NA NA
3 Yes Yes Yes
2 NA NA NA
3 Yes Yes Yes
4 NA NA NA
The outcome I'm trying to obtain would be
nrows: 3
Thanks in advance
As a supplement, here is a dplyr solution.
library(dplyr)
# if_all() requires dplyr >= 1.0.4; across() inside filter() is deprecated
df %>% filter(if_all(-ID, is.na)) %>% count()
# n
# 1 3
Assuming that ID is your first column, then
sum(rowSums(is.na(df[-1])) == ncol(df[-1]))
#[1] 3
If you want to look at it from the opposite direction (i.e. 0 columns with non-NA), then you can use the suggestion by @RonakShah,
sum(rowSums(!is.na(df[-1])) == 0)
Keeping in the tidyverse world (assumed since you wanted to use n_distinct)
library(tidyverse)
##Your data
data <- tibble(ID = c(1,2,3,2,3,4),
neckpain = c('Yes',NA,'Yes',NA,'Yes',NA),
backpain = c(NA,NA,'Yes',NA,'Yes',NA),
kneepain = c(NA,NA,'Yes',NA,'Yes',NA))
## Count the rows where the cherry-picked columns are all missing
nrow(data %>%
rowwise() %>%
mutate(row_total = sum(is.na(neckpain),
is.na(backpain),
is.na(kneepain))) %>%
filter(row_total == 3))
[1] 3
## Or, to do it across all non-ID columns, as noted in the comments
nrow(data %>%
mutate(row_total = rowSums(is.na(.[2:4]))) %>%
filter(row_total == 3))
[1] 3
Here is a one-liner:
sum(apply(df1[-1], 1, function(x) all(is.na(x))))
#[1] 3
Data
df1 <- read.table(text = "
ID neckpain backpain kneepain
1 Yes NA NA
2 NA NA NA
3 Yes Yes Yes
2 NA NA NA
3 Yes Yes Yes
4 NA NA NA
", header = TRUE)
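Since the question also mentions needing n_distinct(ID) later, the sketch below keeps the all-NA flag as a vector so it can be reused. It is a base R illustration on the sample data only; value_cols and all_missing are names introduced here, and selecting columns by name (rather than assuming ID is the first column) is a design choice, not something the answers above require.

```r
df1 <- read.table(text = "
ID neckpain backpain kneepain
1 Yes NA NA
2 NA NA NA
3 Yes Yes Yes
2 NA NA NA
3 Yes Yes Yes
4 NA NA NA
", header = TRUE)

# Select the non-ID columns by name rather than by position
value_cols <- setdiff(names(df1), "ID")
# TRUE for rows where every non-ID column is NA
all_missing <- rowSums(!is.na(df1[value_cols])) == 0

n_rows <- sum(all_missing)                     # rows that are entirely NA
n_ids <- length(unique(df1$ID[all_missing]))   # distinct IDs among those rows
```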

Subtract columns in R data frame but keep values of var1 or var2 when the other is NA

I wanted to subtract one column from the other in R and this turned out more complicated than I thought.
Suppose this is my data (columns a and b) and column c is what I want, namely a - b but keeping a when b==NA and vice versa:
a b c
1 2 1 1
2 2 NA 2
3 NA 3 3
4 NA NA NA
Now I tried different things but most of the time it returned NA when at least one column was NA. For example:
matrixStats::rowDiffs(data, na.rm=T) # only works for matrix-format, and returns NA's
dat$c <- dat$a - dat$b + ifelse(is.na(dat$b),dat$a,0) + ifelse(is.na(dat$a),dat$b,0) # seems like a desperately basic solution, but not even this does the trick as it also returns NA's
apply(dat[,(1:2)], MARGIN = 1,FUN = diff, na.rm=T) # returns NA's
dat$b<-dat$b*(-1)
dat$c<-rowSums(dat,na.rm=T) # this kind of works but it's a really ugly workaround
Also, if you can think of a dplyr solution, please share your knowledge. I didn't even know what to try.
Will delete this question if you think it's a duplicate of an existing one, though none of the existing threads were particularly helpful.
Try this (Base R Solution):
If df$b is NA, take the value of df$a; else if df$a is NA, take the value of df$b; otherwise compute df$a - df$b.
df$c=ifelse(is.na(df$b),df$a,ifelse(is.na(df$a),df$b,df$a-df$b))
Output:
df
a b c
1 2 1 1
2 2 NA 2
3 NA 3 3
4 NA NA NA
You may try using the coalesce function from the dplyr package:
library(dplyr)
dat <- data.frame(a=c(2, 2, NA, NA), b=c(1, NA, 3, NA))
dat$c <- coalesce(dat$a - coalesce(dat$b, 0), dat$b)
dat
a b c
1 2 1 1
2 2 NA 2
3 NA 3 3
4 NA NA NA
The idea here is to take a minus b, or a alone if b is NA. If that entire expression is still NA, then it implies that a is also NA, in which case we take b.
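Here is one option with base R where we replace the NA elements with 0, Reduce it to a single vector by taking the rowwise difference, and change the rows that have all NA elements to NA. Note that abs() is what turns the -3 in row 3 into 3, so this only matches a - b when the difference is non-negative.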
df1$c <- abs(Reduce(`-`, replace(df1, is.na(df1), 0))) *
NA^ (!rowSums(!is.na(df1)) )
df1$c
#[1] 1 2 3 NA
Or using similar method with data.table
library(data.table)
setDT(df1)[!is.na(a) | !is.na(b), c := abs(Reduce(`-`,
replace(.SD, is.na(.SD), 0)))]
data
df1 <- structure(list(a = c(2L, 2L, NA, NA), b = c(1L, NA, 3L, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
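To compare the approaches on a case the question's data does not cover, the sketch below wraps the nested ifelse logic in a small helper and adds one extra row where a < b. The helper name diff_keep_present and the fifth row are made up for illustration.

```r
# Hypothetical helper: a - b, falling back to whichever value is present
diff_keep_present <- function(a, b) {
  ifelse(is.na(a), b, ifelse(is.na(b), a, a - b))
}

# Data from the question plus one extra row (a = 1, b = 4) where a < b
a <- c(2, 2, NA, NA, 1)
b <- c(1, NA, 3, NA, 4)

res <- diff_keep_present(a, b)
res
```

The first four results match the desired column c from the question; the fifth row keeps its negative sign.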

How to subset data in R without losing NA rows?

I have some data that I am looking at in R. One particular column, titled "Height", contains a few rows of NA.
I am looking to subset my data-frame so that all Heights above a certain value are excluded from my analysis.
df2 <- subset ( df1 , Height < 40 )
However whenever I do this, R automatically removes all rows that contain NA values for Height. I do not want this. I have tried including arguments for na.rm
f1 <- function ( x , na.rm = FALSE ) {
df2 <- subset ( x , Height < 40 )
}
f1 ( df1 , na.rm = FALSE )
but this does not seem to do anything; the rows with NA still end up disappearing from my data-frame. Is there a way of subsetting my data as such, without losing the NA rows?
If we decide to use the subset function, we need to watch out. From ?subset:
For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.
So only elements for which the condition is TRUE (not FALSE and not NA) are retained.
If you want to keep NA cases, use logical or condition to tell R not to drop NA cases:
subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`
Don't index directly with the bare condition (explained below):
df2 <- df1[df1$Height < 40, ]
Example
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)
subset(df1, Height < 40 | is.na(Height))
# Height y
#1 NA 1
#2 2 2
#3 4 3
#4 NA 4
df1[df1$Height < 40, ]
# Height y
#1 NA NA
#2 2 2
#3 4 3
#4 NA NA
The reason the latter fails is that indexing by NA gives NA. Consider this simple example with a vector:
x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA 2 NA
We need to somehow replace those NA with TRUE. The most straightforward way is to add another "or" condition is.na(ind):
x[ind | is.na(ind)]
# [1] 1 2 3
This is exactly what will happen in your situation. If your Height contains NA, then the logical operation Height < 40 ends up a mix of TRUE / FALSE / NA, so we need to replace NA by TRUE as above.
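The replacement of NA by TRUE can be checked step by step on the example data. This is a small sketch; cond, keep and kept are names introduced here.

```r
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

# Comparison with NA yields NA, so cond mixes TRUE, FALSE and NA
cond <- df1$Height < 40

# NA | TRUE is TRUE and FALSE | FALSE stays FALSE, so OR-ing with is.na()
# turns every NA in the condition into TRUE without touching the rest
keep <- cond | is.na(df1$Height)

kept <- df1[keep, ]
kept$y

# For the opposite behaviour (silently drop the NA rows), which() discards NA
dropped_na <- df1[which(cond), ]
```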
You could also do:
df2 <- df1[(df1$Height < 40 | is.na(df1$Height)),]
For subsetting by character/factor variables, you can use %in% to keep NAs. Specify the data you wish to exclude.
# Create Dataset
library(data.table)
df=data.table(V1=c('Surface','Bottom',NA),V2=1:3)
df
# V1 V2
# 1: Surface 1
# 2: Bottom 2
# 3: <NA> 3
# Keep all but 'Bottom'
df[!V1 %in% c('Bottom')]
# V1 V2
# 1: Surface 1
# 2: <NA> 3
This works because %in% never returns an NA (see ?match)
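The difference between == and %in% on missing values can be seen on a plain character vector. A minimal sketch; by_eq and by_in are illustrative names.

```r
v <- c("Surface", "Bottom", NA)

by_eq <- v == "Bottom"     # comparison propagates NA for the missing element
by_in <- v %in% "Bottom"   # match-based test yields only TRUE or FALSE

kept <- v[!by_in]          # keeps "Surface" and the NA entry
```

Indexing with !by_eq instead would also return two elements, but the NA would come from an NA index rather than from a deliberate match, which is the behaviour that causes trouble with data frames.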

passing positive results from multiple columns into a single new column in r

I am trying to work out a way to create a single column from multiple columns in R. What I want to do is for R to go through all rows for multiple columns and if it finds a positive result in one of those columns, to pass that result into an 'amalgam' column (sorry I don't know a better word for it).
See the toy dataset below
x <- c(NA, NA, NA, NA, NA, 1)
y <- c(NA, NA, 1, NA, NA, NA)
z <- c(NA, 1, NA, NA, NA, NA)
df <- data.frame(cbind(x, y, z))
df[, "compCol"] <- NA
df
x y z compCol
1 NA NA NA NA
2 NA NA 1 NA
3 NA 1 NA NA
4 NA NA NA NA
5 NA NA NA NA
6 1 NA NA NA
I need to pass positive results from each of the columns into the compCol column while changing negative results to 0. So that it looks like this.
x y z compCol
1 NA NA NA 0
2 NA NA 1 3
3 NA 1 NA 2
4 NA NA NA 0
5 NA NA NA 0
6 1 NA NA 1
I know it probably requires an if else statement nested inside a for loop but all the ways I have tried result in errors that I don't understand.
I tried the following just for a single column
for (i in 1:length(x)) {
if (df$x[i] == 1) {
df$compCol[i] <- df$x[i]
}
}
But it didn't work at all.
I got the message 'Error in if (df$x[i] == 1) { : missing value where TRUE/FALSE needed'
And that makes sense but I can't see where to put the TRUE/FALSE statement
You can also use reshaping with NA removal
library(dplyr)
library(tidyr)
df.id = df %>% mutate(ID = 1:n() )
df.id %>%
gather(variable, value,
x, y, z,
na.rm = TRUE) %>%
left_join(df.id)
We can use max.col. Create a logical matrix by checking whether the selected columns are greater than 0 and are not NA ('ind'). We use max.col to get the column index for each row and multiply with rowSums of 'ind' so that if there are no TRUE values for a row, it will be 0.
ind <- df > 0 & !is.na(df)
df$compCol <- max.col(ind) *rowSums(ind)
df$compCol
#[1] 0 3 2 0 0 1
Or another option is pmax after multiplying with the col(df)
do.call(pmax,col(df)*replace(df, is.na(df), 0))
#[1] 0 3 2 0 0 1
NOTE: I used the dataset before creating the 'compCol' in the OP's post.
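For completeness, the loop from the question fails because df$x[i] == 1 evaluates to NA for missing values, and if() cannot branch on NA. One possible repair, sketched below on the toy data, wraps the comparison in isTRUE(); the variable cols and the choice to store the column index are assumptions based on the desired output shown in the question.

```r
x <- c(NA, NA, NA, NA, NA, 1)
y <- c(NA, NA, 1, NA, NA, NA)
z <- c(NA, 1, NA, NA, NA, NA)
df <- data.frame(x, y, z)

cols <- c("x", "y", "z")
df$compCol <- 0   # default for rows without any positive result
for (i in seq_len(nrow(df))) {
  for (j in seq_along(cols)) {
    value <- df[[cols[j]]][i]
    # isTRUE() turns an NA comparison into FALSE, so if() never sees NA
    if (isTRUE(value > 0)) {
      df$compCol[i] <- j
    }
  }
}
df$compCol
```

If a row had positives in several columns, this loop would keep the index of the last one; the vectorised answers above are preferable on large data.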

Select last non-NA value in a row, by row

I have a data frame where each row is a vector of values of varying lengths. I would like to create a vector of the last true value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!
Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1
Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call will select the position of the last non-NA value in each row (or the last position if they are all NA, in which case the selected value is itself NA).
Here's another version that removes all infinities, NA, and NaN's before taking the first element of the reversed input:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1
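One edge case worth checking is a row with no values at all: tail(x[!is.na(x)], 1) then returns a zero-length vector, and apply falls back to returning a list. The sketch below uses a modified version of the example data with an all-NA row; the helper name last_value is made up for illustration.

```r
# Example data modified to include a row that is entirely NA
df <- read.table(text = "
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
NA NA NA NA
1 NA NA NA", header = TRUE)

# Hypothetical helper: last non-NA value, or NA when the row has none
last_value <- function(x) {
  present <- x[!is.na(x)]
  if (length(present) == 0) NA else present[length(present)]
}

res <- apply(df, 1, last_value)
res
```

Because the helper always returns exactly one value per row, the result stays an atomic vector with one element per row.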

Resources