Getting wrong result while removing all-NA columns in R

I am getting a wrong result when removing the columns that contain only NA values in R.
Data file: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
trainingData <- read.csv("D:\\pml-training.csv",na.strings = c("NA","", "#DIV/0!"))
Now I want to remove all the columns that contain only NAs.
Approach 1: keep every column whose count of non-NA values is greater than 0:
aa <- trainingData[colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
154 columns
Approach 2: I thought this would select the columns that are NA (sum equal to 0), but it returns the columns that have no NA at all, which is the result I expected:
bb <- trainingData[,colSums(is.na(trainingData)) == 0]
length(colnames(bb))
60 columns (expected)
Can someone please help me understand what is wrong with the first statement and what is right with the second one?

aa <- trainingData[,colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
You convert the data frame to a logical matrix with !is.na(trainingData), and find all columns where there is at least one TRUE (i.e. at least one non-NA value) in the column. So this returns all columns that have at least one non-NA value, which seems to be all but 6 columns.
bb <- trainingData[colSums(is.na(trainingData)) == 0]
length(colnames(bb))
You convert the data frame to a logical matrix with is.na(trainingData) and return all columns where there is no TRUE (no NA) in the column. This returns all columns that have no missing values at all.
Example as requested in comment:
df = data.frame(a=c(1,2,3),b=c(NA,1,1),c=c(NA,NA,NA))
bb <- df[colSums(is.na(df)) == 0]
> df
a b c
1 1 NA NA
2 2 1 NA
3 3 1 NA
> bb
a
1 1
2 2
3 3
So the two statements are in fact different. If you want to remove only the columns that are entirely NA, you should use the first statement. Hope this helps.
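To see the difference side by side, here is a minimal sketch that runs both expressions on the same toy data frame as the example. The names keep_some and keep_complete are made up for illustration.

```r
# Toy data: 'a' has no NA, 'b' has one NA, 'c' is entirely NA
df <- data.frame(a = c(1, 2, 3), b = c(NA, 1, 1), c = c(NA, NA, NA))

# Columns with at least one non-NA value (drops only the all-NA columns)
keep_some <- colSums(!is.na(df)) > 0
aa <- df[keep_some]

# Columns with no NA at all (drops every column that contains any NA)
keep_complete <- colSums(is.na(df)) == 0
bb <- df[keep_complete]

names(aa)  # "a" "b"
names(bb)  # "a"
```

Column b survives the first filter but not the second, which is the same effect that produces 154 versus 60 columns on the full data set.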

Related

How to exclude missing data in specific columns in R

I have a df with 15,105 rows and 127 columns. I'd like to exclude the rows that have NA in some specific columns. I'm using the following command:
wave1b <- na.omit(wave1, cols=c("Bx", "Deq", "Gef", "Has", "Pla", "Ty"))
However, when I run it, it returns only 19 rows, when I expected 14,561 rows (if it had excluded only the rows with NA in the requested columns). I'm asserting this because I took a subset of the df to check the deletion of missing values.
Could anyone help me solve this issue? Thank you!
I think this code is not efficient but it could work:
df <- data.frame(A = rep(NA,3), B = c(NA,2,3),C=c(1,NA,2))
df
A B C
1 NA NA 1
2 NA 2 NA
3 NA 3 2
This removes only the rows that have a missing value in column B or column C:
df[-which(is.na(df$B)|is.na(df$C)),]
A B C
3 NA 3 2
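One caveat with the -which(...) form, shown here as a small sketch on made-up data: when no row matches, which() returns an empty vector, the negative index is empty too, and no rows are selected at all. Negating the logical condition directly avoids this.

```r
# Hypothetical toy data with no NA in B or C
df <- data.frame(A = rep(NA, 3), B = c(1, 2, 3), C = c(4, 5, 6))

has_na <- is.na(df$B) | is.na(df$C)

# which() finds no match, so the negative index is empty and zero rows come back
dropped_all <- df[-which(has_na), ]
nrow(dropped_all)  # 0

# Logical negation keeps all three rows, as intended
kept <- df[!has_na, ]
nrow(kept)  # 3
```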
You can use complete.cases
> df[complete.cases(df[, -1]), ]
A B C
3 NA 3 2
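For the data in the question, the same idea can be applied to columns selected by name. This is only a sketch: the wave1 data frame below is a made-up stand-in that contains just two of the named columns. As far as I know, base R's na.omit has no cols argument (data.table's method does), so the original call probably dropped every row with an NA in any column.

```r
# Hypothetical stand-in for wave1, with only two of the named columns
wave1 <- data.frame(
  id    = 1:4,
  Bx    = c(1, NA, 3, 4),
  Deq   = c(1, 2, NA, 4),
  other = c(NA, NA, NA, 1)  # NAs here should not cause rows to be dropped
)

cols <- c("Bx", "Deq")

# Keep the rows that are complete in the selected columns only
wave1b <- wave1[complete.cases(wave1[, cols]), ]
wave1b$id  # 1 4
```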

Delete column with NAs in first row

If I have a dataframe like so
a <- c(NA,1,2,NA,4)
b <- c(6,7,8,9,10)
c <- c(NA,12,13,14,15)
d <- c(16,NA,18,NA,20)
df <- data.frame(a,b,c,d)
How can I delete columns "a" and "c" by asking R to delete those columns that contain an NA in the first row?
My actual dataset is much bigger, and this is only by way of a reproducible example.
Please note that this isn't the same as asking to delete columns with any NAs in it. My columns may have other NA values in it. I'm looking to delete just the ones with an NA in the first row.
You can use a logical vector indicating whether the first-row value of each column is missing:
res <- df[,!is.na(df[1,])]
> res
b d
1 6 16
2 7 NA
3 8 18
4 9 NA
5 10 20
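A slightly more explicit variant, as a sketch: turn the first row into a plain vector before building the mask, and pass drop = FALSE so the result stays a data frame even when only one column survives. The names first_row and keep are illustrative.

```r
df <- data.frame(
  a = c(NA, 1, 2, NA, 4),
  b = c(6, 7, 8, 9, 10),
  c = c(NA, 12, 13, 14, 15),
  d = c(16, NA, 18, NA, 20)
)

# First row as a named vector, then a logical mask of the columns to keep
first_row <- unlist(df[1, ])
keep <- !is.na(first_row)

# drop = FALSE keeps a data frame even if a single column remains
res <- df[, keep, drop = FALSE]
names(res)  # "b" "d"
```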

Error counting non-NA entries in dataframe

I am trying to see if the amount of information that I have about a case is correlated to the duration of the user.
Currently, I have a dataframe, df, and I attempted to do the following:
df["amount_known"] <-df[rowSums(!is.na(df)),]
This resulted in the following error:
Error in `[<-.data.frame`(`*tmp*`, "amount_known", value = list(status = c(3L, :
replacement element 1 has 808047 rows, need 808247
What could cause this to happen (and of course, how do I fix it)?
If you want the number of non-NA entries in a new column amount_known in df you can do it like this:
df$amount_known <-rowSums(!is.na(df))
Here's a small example of what is happening:
df <- data.frame(x = 1:3, y = 66:68)
df$y[1] <- NA
df$x[3] <- NA
df
# x y
#1 1 NA
#2 2 67
#3 NA 68
rowSums(!is.na(df))
#[1] 1 2 1
This results in a vector with the number of non-NAs in df.
Now, if you do
df[rowSums(!is.na(df)),]
This will select the rows in the vector c(1,2,1) from df:
# x y
#1 1 NA
#2 2 67
#1.1 1 NA
So for example, row 1 is shown twice.
And in your code, you were then assigning that output to a new column in df.
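One thing to watch with the fix, sketched on the same small example: if the line is run a second time, the new amount_known column is itself counted as a non-NA entry. Restricting the count to the original columns avoids that. The name orig_cols is illustrative.

```r
df <- data.frame(x = c(1, 2, NA), y = c(NA, 67, 68))

# Remember the original columns so the new column is never counted itself
orig_cols <- names(df)

df$amount_known <- rowSums(!is.na(df[orig_cols]))
df$amount_known  # 1 2 1

# Running it again gives the same counts
df$amount_known <- rowSums(!is.na(df[orig_cols]))
```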

When trying to replace values, "missing values are not allowed in subscripted assignments of data frames"

I have a table that has two columns: whether you were sick (H01) and the number of days sick (H03). However, the number of days sick is NA if H01 == false, and I would like to set it to 0. When I do this:
test <- pe94.person[pe94.person$H01 == 12,]
test$H03 <- 0
It works fine. However, I'd like to replace the values in the original dataframe. This, however, fails:
pe94.person[pe94.person$H01 == 12,]$H03 <- 0
It returns:
> pe94.person[pe94.person$H01 == 12,]$H03 <- 0
Error in `[<-.data.frame`(`*tmp*`, pe94.person$H01 == 12, , value = list( :
missing values are not allowed in subscripted assignments of data frames
Any idea why this is? For what it's worth, here's a frequency table:
> table(pe94.person[pe94.person$H01 == 12,]$H03)
2 3 5 28
3 1 1 1
It is due to missingness in H01 variable.
> x <- data.frame(a=c(NA,2:5), b=c(1:5))
> x
a b
1 NA 1
2 2 2
3 3 3
4 4 4
5 5 5
> x[x$a==2,]$b <- 99
Error in `[<-.data.frame`(`*tmp*`, x$a == 2, , value = list(a = NA_integer_, :
missing values are not allowed in subscripted assignments of data frames
The assignment won't work because x$a has a missing value.
Subsetting first works, because [ accepts the NA in the index and returns an all-NA row for it:
> z <- x[x$a==2,]
> z$b <- 99
> z
a b
NA NA 99
2 2 99
The direct assignment fails because the [<- function apparently can't handle missing values in its indices, even though [ can:
> `[<-`(x,x$a==2,,99)
Error in `[<-.data.frame`(x, x$a == 2, , 99) :
missing values are not allowed in subscripted assignments of data frames
So instead, try adding a !is.na(x$a) condition when you're doing the assignment:
> `[<-`(x,!is.na(x$a) & x$a==2,'b',99)
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Or, more commonly:
> x[!is.na(x$a) & x$a==2,]$b <- 99
> x
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Note that this behavior is described in the documentation:
The replacement methods can be used to add whole column(s) by specifying non-existent column(s), in which case the column(s) are added at the right-hand edge of the data frame and numerical indices must be contiguous to existing indices. On the other hand, rows can be added at any row after the current last row, and the columns will be in-filled with missing values. Missing values in the indices are not allowed for replacement.
You can use ifelse, like so
pe94.person$foo <- ifelse(!is.na(pe94.person$H01) & pe94.person$H01 == 12, 0, pe94.person$H03)
Check that foo meets your criteria, then go ahead and assign it to pe94.person$H03 directly. I find it safer to assign it to a new variable and usually use that in subsequent analysis.
There might be an NA somewhere in the column that is causing the error. Run the index on a specific column instead of the entire data frame.
movies[movies$Actors == "N/A",] = NA #ERROR
movies$Actors[movies$Actors == "N/A"] = NA #Works
I realise the question is very old, but I think the most elegant solution is by using the which() function:
pe94.person[which(pe94.person$H01 == 12),]$H03 <- 0
This should do what the original poster asked for, because which() drops the NAs and keeps only the positions of the TRUE results.
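A minimal sketch of the which() approach on the small x data frame used earlier in this thread. The variable idx is illustrative.

```r
x <- data.frame(a = c(NA, 2:5), b = 1:5)

# The comparison yields NA for the first row
x$a == 2  # NA TRUE FALSE FALSE FALSE

# which() keeps only the positions that are TRUE, so the index contains no NA
idx <- which(x$a == 2)

x[idx, "b"] <- 99
x$b  # 1 99 3 4 5
```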
Simply use the subset() function, which drops the rows where the condition evaluates to NA.
It works like x[subset & !is.na(subset), ], where subset is the logical condition. Look at this data:
> x <- data.frame(a = c(T,F,T,F,NA,F,T, F, NA,NA,T,T,F),
+ b = c(F,T,T,F,T, T,NA,NA,F, T, T,F,F))
Subsetting with [ operator returns this:
> x[x$b == T & x$a == F, ]
a b
2 FALSE TRUE
NA NA NA
6 FALSE TRUE
NA.1 NA NA
NA.2 NA NA
And subset() does what we want:
> subset(x, b == T & a == F)
a b
2 FALSE TRUE
6 FALSE TRUE
To change the values of subsetted variables:
> ss <- subset(x, b == T & a == F)
> x[rownames(ss), 'a'] <- T
> x[c(2,6), ]
a b
2 TRUE TRUE
6 TRUE TRUE
The following works. Note that there is no comma in the subsetting, because the index is applied to the column vector x$a rather than to the data frame:
x <- data.frame(a=c(NA,2:5), b=c(1:5))
x$a[x$a==2] <- 99
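A different technique from the answers above, offered only as a sketch: %in% returns FALSE rather than NA when the left-hand value is missing, so it can be used as a row index in an assignment without an extra is.na() check.

```r
x <- data.frame(a = c(NA, 2:5), b = 1:5)

# %in% never produces NA, unlike ==
mask <- x$a %in% 2
mask  # FALSE TRUE FALSE FALSE FALSE

# The mask has no NA, so data frame assignment is allowed
x[mask, "b"] <- 99
x$b  # 1 99 3 4 5
```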

Select last non-NA value in a row, by row

I have a data frame where each row is a vector of values of varying lengths. I would like to create a vector of the last non-NA value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!
Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1
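If a row can be entirely NA, tail() returns a zero-length vector for it and apply() no longer simplifies to a plain vector. The sketch below is a variation on the function above, not the answer's exact code: it returns NA for such rows. The shortened data and the name last_value are made up for illustration.

```r
# Shortened example data with one all-NA row added
df <- read.table(text = "
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
NA NA NA NA", header = TRUE)

# Return the last non-NA element, or NA when there is none
last_value <- function(x) {
  v <- x[!is.na(x)]
  if (length(v) > 0) v[length(v)] else NA_real_
}

# vapply checks that every row yields exactly one numeric value
res <- vapply(
  seq_len(nrow(df)),
  function(i) last_value(as.numeric(unlist(df[i, ]))),
  numeric(1)
)
res  # 2 6 NA
```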
Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call selects the position of the last non-NA value in each row. If a row is entirely NA, every position ties and the selected element is NA anyway.
Here's another version that removes all infinities, NA, and NaN's before taking the first element of the reversed input:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1

Resources