Select last non-NA value in a row, by row - r

I have a data frame where each row contains a varying number of values, with the rest filled in as NA. I would like to create a vector of the last non-NA value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!

Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1
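If any row could be entirely NA, tail() would return a zero-length vector for it and apply() would then return a list rather than a vector. A guarded variant (a sketch, not part of the original answer) keeps the result a plain vector:
lastValueSafe <- function(x) {
  y <- x[!is.na(x)]
  if (length(y) == 0) NA else tail(y, 1)
}
apply(df, 1, lastValueSafe)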

Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call selects the position of the last non-NA value in each row. If a row is all NA, max.col still returns a column index (the last column, given ties.method = "last"), so the extracted value is simply NA.
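A quick check of the all-NA case, using a hypothetical extra row appended to the example df:
df2 <- rbind(df, c(NA, NA, NA, NA))
max.col(!is.na(df2), "last")
# [1] 2 4 3 4 1 4
df2[cbind(1:nrow(df2), max.col(!is.na(df2), "last"))]
# [1]  2  6  3  4  1 NA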

Here's another version that removes all infinities, NAs, and NaNs before taking the first element of the reversed input:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1

Related

Select columns, excluding some which are all NA

Suppose I have this dataframe
df <- data.frame(keep = c(1, NA, 2),
also_want = c(NA, NA, NA),
maybe = c(1, 2, NA),
maybe_2 = c(NA, NA, NA))
Edit: In the actual dataframe there are many columns I'd like to keep, so spelling them all out isn't viable. These columns are all the columns that do not start with maybe. The maybe columns, instead, do have a common naming like maybe, maybe_1 etc. that could work with grep or stringr::str_detect
I want to select keep and also_want, plus any of the maybe columns that have values other than NA.
desired_df
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
I can use select_if to get all columns that have non-NA values, but then I lose also_want:
library(dplyr)
df %>%
select_if(~sum(!is.na(.)) > 0)
keep maybe
1 1 1
2 NA 2
3 2 NA
Thoughts?
With dplyr 1.0.0 you can use the where function inside a select call to test conditions that your variables have to satisfy; before that, you specify the variables you want to keep regardless.
EDIT: I've added the condition that only the "maybe" variables have to contain values other than NA; before that condition, we select every column that does not start with "maybe".
df %>%
select(!starts_with("maybe"), starts_with("maybe") & where(~sum(!is.na(.)) > 0))
Output
# keep also_want maybe
# 1 1 NA 1
# 2 NA NA 2
# 3 2 NA NA
Following your comments, in base R we can use:
df[, !apply(
  rbind(
    grepl("maybe", colnames(df)),
    !apply(df, 2, function(x) !all(is.na(x)))
  ),
  2, all)]
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
Or if you prefer seeing the same code all on 1 line:
df[,!apply(rbind(grepl("maybe",colnames(df)),!apply(df, 2, function(x) !all(is.na(x)))),2,all)]
I eventually figured this out: use str_detect to select all non-maybe columns, and a one-liner inside sapply to also select any other columns (i.e. any maybe columns) that have non-NA values.
library(dplyr)
library(stringr)
df %>%
  select_if(stringr::str_detect(names(.), "maybe", negate = TRUE) |
              sapply(., function(x) sum(!is.na(x)) > 0))
keep also_want maybe
1 1 NA 1
2 NA NA 2
3 2 NA NA
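A possibly shorter equivalent of the dplyr 1.0.0 answer above, assuming tidyselect's | union keeps the original column order: keep any column that either does not start with "maybe" or contains at least one non-NA value.
library(dplyr)
df %>%
  select(!starts_with("maybe") | where(~ !all(is.na(.x))))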

Getting wrong result while removing all NA value columns in R

I am getting the wrong result while removing all-NA columns in R.
data file : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
trainingData <- read.csv("D:\\pml-training.csv",na.strings = c("NA","", "#DIV/0!"))
Now I want to remove all the columns which contain only NAs.
Approach 1: here I mean keep every column whose count of non-NA values is greater than 0
aa <- trainingData[colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
154 columns
Approach 2: as per this query, I expected it to give the columns which are NA with sum = 0, but it is returning the columns which do not have any NA, and that gives the expected result
bb <- trainingData[,colSums(is.na(trainingData)) == 0]
length(colnames(bb))
60 columns (expected)
Can someone please help me understand what is wrong with the first statement and why the second one is right?
aa <- trainingData[,colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
You convert the data frame to a boolean matrix with !is.na(trainingData) and keep all columns where there is at least one TRUE (i.e. at least one non-NA value). So this returns all columns that have at least one non-NA value, which seems to be all but 6 columns.
bb <- trainingData[colSums(is.na(trainingData)) == 0]
length(colnames(bb))
You convert the data frame to a boolean matrix with is.na(trainingData) and keep all columns where there is no TRUE (i.e. no NA) in the column. This returns all columns that have no missing values at all.
Example as requested in comment:
df = data.frame(a=c(1,2,3),b=c(NA,1,1),c=c(NA,NA,NA))
bb <- df[colSums(is.na(df)) == 0]
> df
a b c
1 1 NA NA
2 2 1 NA
3 3 1 NA
> bb
a
1 1
2 2
3 3
So the statements are in fact different. If you want to remove all columns that are only NA's, you should use the first statement. Hope this helps.
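To see the contrast on the same toy data, applying the first statement (a small addition to the example above) keeps every column with at least one non-NA value, so only the all-NA column c is dropped:
aa <- df[colSums(!is.na(df)) > 0]
> aa
  a  b
1 1 NA
2 2  1
3 3  1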

Conditional row-wise average

I have a simple question. I have a data.frame that looks like this:
df
A  B  C Exclusion_criteria
3  4  5                  3
2 NA NA                  1
6  9 NA                  2
I simply would like to take the row-wise mean of columns A, B and C when Exclusion_criteria is different from 1 (i.e. for all rows except those with Exclusion_criteria == 1).
Can anyone help me please?
Kind regards
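For reference, one construction of a data frame consistent with the table above, assuming the blank cells are NA (this matches the results shown in the answer below):
df <- data.frame(A = c(3, 2, 6),
                 B = c(4, NA, 9),
                 C = c(5, NA, NA),
                 Exclusion_criteria = c(3, 1, 2))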
We can loop over the rows with apply, drop the element at the position given by the 4th column, and take the mean:
apply(df, 1, function(x) mean(x[1:3][-x[4]], na.rm = TRUE))
#[1] 3.5 NaN 6.0
Another option is to replace the values in 'df' at the row/column positions given by the 4th column with NA and then do rowMeans:
df[cbind(1:nrow(df), df[,4])] <- NA
rowMeans(df[1:3], na.rm = TRUE)
#[1] 3.5 NaN 6.0

RowSums NA + NA gives 0 [duplicate]

This question already has answers here:
There is pmin and pmax each taking na.rm, why no psum?
I'd just like to understand a (to me) weird behavior of the function rowSums. Imagine I have this super simple data frame:
a = c(NA, NA,3)
b = c(2,NA,2)
df = data.frame(a,b)
df
a b
1 NA 2
2 NA NA
3 3 2
and now I want a third column that is the sum of the other two. I cannot simply use + because of the NAs:
df$c <- df$a + df$b
df
a b c
1 NA 2 NA
2 NA NA NA
3 3 2 5
but if I use rowSums, the rows that are all NA come out as 0, while rows that have only some NAs work fine:
df$d <- rowSums(df, na.rm=T)
df
a b c d
1 NA 2 NA 2
2 NA NA NA 0
3 3 2 5 10
Am I missing something?
Thanks to all
One option with rowSums is to take the row sums with na.rm = TRUE and multiply by NA^!rowSums(!is.na(df)): rowSums(!is.na(df)) counts the non-NA values per row, the negation (!) is TRUE only for all-NA rows, and since NA^TRUE is NA while NA^FALSE is 1, the all-NA rows become NA.
rowSums(df, na.rm=TRUE) *NA^!rowSums(!is.na(df))
#[1] 2 NA 10
Because
sum(numeric(0))
# 0
Once you use na.rm = TRUE in rowSums, the second row reduces to numeric(0), and the sum of an empty vector is 0.
If you want to retain NA for the all-NA rows, it takes two stages. I recommend writing a small function for this purpose:
my_rowSums <- function(x) {
  if (is.data.frame(x)) x <- as.matrix(x)  # convert once, up front
  z <- base::rowSums(x, na.rm = TRUE)      # NA-ignoring row sums
  z[!base::rowSums(!is.na(x))] <- NA       # rows with no non-NA value become NA
  z
}
my_rowSums(df)
# [1] 2 NA 10
This can be particularly useful if the input x is a data frame (as in your case). base::rowSums first checks whether its input is a matrix; if it gets a data frame, it converts it to a matrix first. That type conversion is in fact more costly than the actual row-sum computation. Note that we call base::rowSums twice, so to reduce the conversion overhead we make sure x is a matrix beforehand.
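A way to check that claim yourself, as a sketch (this assumes the microbenchmark package is installed; actual timings will vary):
library(microbenchmark)
m <- matrix(rnorm(1e6), ncol = 10)  # a plain numeric matrix
d <- as.data.frame(m)               # the same values as a data frame
# rowSums(d) must convert d to a matrix first; rowSums(m) does not
microbenchmark(rowSums(m), rowSums(d))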
For #akrun's "hacking" answer, I suggest:
akrun_rowSums <- function (x) {
if (is.data.frame(x)) x <- as.matrix(x)
rowSums(x, na.rm=TRUE) *NA^!rowSums(!is.na(x))
}
akrun_rowSums(df)
# [1] 2 NA 10

Return Column Names when True in R

I am using R for a project and I have a data frame in the following format:
A B C
1 1 0 0
2 0 1 1
I want to return a data frame that gives the Column Name when the value is 1.
i.e.
Impair1 Impair2
1 A NA
2 B C
Is there a way to do this for thousands of records? The max impairment number is 4.
Note: There are more than 3 columns. Only 3 were listed to make it easier.
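For reference, a minimal data frame matching the example above (an assumed construction; the real data has more columns and thousands of rows):
df <- data.frame(A = c(1, 0), B = c(0, 1), C = c(0, 1))
dat <- df  # the first answer below refers to the data as `dat`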
You could loop through the rows of your data, returning the column names where the value is 1, padded at the end with the appropriate number of NA values:
`colnames<-`(t(apply(dat == 1, 1, function(x) c(colnames(dat)[x], rep(NA, 4 - sum(x))))),
             paste0("Impair", 1:4))
# Impair1 Impair2 Impair3 Impair4
# 1 "A" NA NA NA
# 2 "B" "C" NA NA
Using the apply family of functions, here is a general solution that should work for your larger dataset:
res <- apply(df, 1, function(x) {
  out <- character(4)                 # a length-4 vector of empty strings
  tmp <- colnames(df)[which(x == 1)]  # the column names where the value is 1
  out[1:length(tmp)] <- tmp           # overwrite the leading positions
  out
})
# transpose and turn it into a data.frame
> data.frame(t(res))
X1 X2 X3 X4
1 A
2 B C
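If you prefer the NA padding from the desired output over empty strings, one small follow-up (a sketch building on res from above) converts the blanks and renames the columns:
out <- data.frame(t(res), stringsAsFactors = FALSE)
out[out == ""] <- NA
names(out) <- paste0("Impair", 1:4)
out
#   Impair1 Impair2 Impair3 Impair4
# 1       A    <NA>    <NA>    <NA>
# 2       B       C    <NA>    <NA>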
