To subset rows from a data frame, inserting the condition in the first part of [ , ] seems to be the reference method, and inserting this condition inside "which()" seems to be useless.
However, in the presence of missing data, why is the first method not working, while the "which method" does, as in the following example?
df <- data.frame(var1=c(1,2,3,NA,NA), var2=c(4,0,5,2,3), var3=c(1,2,3,0,6))
testvar1<-df[df$var1==3,]
testvar1.which<-df[which(df$var1==3),]
testvar1
var1
var2
var3
3
3
5
3
NA
NA
NA
NA
NA.1
NA
NA
NA
testvar1.which
var1
var2
var3
3
3
5
3
The simple answer is that which suppresses NA values by default, whereas a straightforward logical test will return a vector of the same length as the input with NA preserved. Compare:
df$var1 == 3
#> [1] FALSE FALSE TRUE NA NA
which(df$var1 == 3)
#> [1] 3
If you subset the data frame with the first result, the first two rows are dropped as expected (because they correspond to FALSE) and the third row is kept because it is TRUE, which is also expected. The last two rows are where the confusion comes in. If you subset a data frame with an NA, you don't get a NULL result, you get an NA result, which is different. The two rows at the bottom are NA rows, which you get if you subset a data frame with NA values.
Related
I have some problems with NA value cause my dataset from excel is not same column number so It showed NA. It deleted all row containing NA value when make calculation Similarity Index function Psicalc in RInSp package.
B F
4 7
5 6
6 8
7 5
NA 4
NA 3
NA 2
Do you know how to handle with NA or remove it but not delete all row or not affect to package?. Beside when I import.RinSP it has message
In if (class(filename) == "character") { :
the condition has length > 1 and only the first element will be used
Thank you so much
Many R functions ( specifically base R ) have an na.rm argument, which is FALSE by default. That means if you omit this argument, and your data has NA, your "calculation" will result in NA. To remove these in the calculations, include an na.rm argument and assign it to TRUE.
Example:
x <- c(4,5,6,7,NA,NA)
mean(x) # Oops!
[1] NA
mean(x, na.rm=TRUE)
[1] 5.5
I am getting wrong result while removing all NA value column in R
data file : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
trainingData <- read.csv("D:\\pml-training.csv",na.strings = c("NA","", "#DIV/0!"))
Now I want to remove all the column which only has NA's
Approach 1: here I mean read all the column which has more than 0 sum and not NA
aa <- trainingData[colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
154 columns
Approach 2: As per this query, it will give all the columns which is NA and sum = 0, but it is giving the result of column which does not have NA and gives expected result
bb <- trainingData[,colSums(is.na(trainingData)) == 0]
length(colnames(bb))
60 columns (expected)
Can someone please help me to understand what is wrong in first statement and what is right in second one
aa <- trainingData[,colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
You convert the dataframe to a boolean dataframe with !is.na(trainingData), and find all columns where there is more than one TRUE (so non-NA) in the column. So this returns all columns that have at least one non-NA value, which seem to be all but 6 columns.
bb <- trainingData[colSums(is.na(trainingData)) == 0]
length(colnames(bb))
You convert the dataframe to boolean with is.na(trainingData) and return all values where there is no TRUE (no NA) in the column. This returns all columns where there are no missing values (i.e. no NA's).
Example as requested in comment:
df = data.frame(a=c(1,2,3),b=c(NA,1,1),c=c(NA,NA,NA))
bb <- df[colSums(is.na(df)) == 0]
> df
a b c
1 1 NA NA
2 2 1 NA
3 3 1 NA
> bb
a
1 1
2 2
3 3
So the statements are in fact different. If you want to remove all columns that are only NA's, you should use the first statement. Hope this helps.
I want to split a dataset in R based on NA values from a variable, for example:
var1 var2
1 21
2 NA
3 NA
4 10
and make it like this:
var1 var2
1 21
4 10
var1 var2
2 NA
3 NA
See More Details:
Most statistical functions (e.g., lm()) have something like na.action which applies to the model, not to individual variables. na.fail() returns the object (the dataset) if there are no NA values, otherwise it returns NA (stopping the analysis). na.pass() returns the data object whether or not it has NA values, which is useful if the function deals with NA values internally. na.omit () returns the object with entire observations (rows) omitted if any of the variables used in the model are NA for that observation. na.exclude() is the same as na.omit(), except that it allows functions using naresid or napredict. You can think of na.action as a function on your data object, the result being the data object in the lm() function. The syntax of the lm() function allows specification of the na.action as a parameter:
lm(na.omit(dataset),y~a+b+c)
lm(dataset,y~a+b+c,na.omit) # same as above, and the more common usage
You can set your default handling of missing values with
options("na.actions"=na.omit)
You could just subset the data frame using is.na():
df1 <- df[!is.na(df$var2), ]
df2 <- df[is.na(df$var2), ]
Demo here:
Rextester
Hi try this
new_DF <- DF[rowSums(is.na(DF)) > 0,]
or in case you want to check a particular column, you can also use
new_DF <- DF[is.na(DF$Var),]
In case you have NA character values, first run
Df[Df=='NA'] <- NA
to replace them with missing values.
split function comes handily in this case.
data <- read.table(text="var1 var2
1 21
2 NA
3 NA
4 10", header=TRUE)
split(data, is.na(data$var2))
#
# $`FALSE`
# var1 var2
# 1 1 21
# 4 4 10
#
# $`TRUE`
# var1 var2
# 2 2 NA
# 3 3 NA
An alternative and more general approach is using the complete.cases command. The command spots rows that have no missing values (no NAs) and returns TRUE/FALSE values.
dt = data.frame(var1 = c(1,2,3,4),
var2 = c(21,NA,NA,10))
dt1 = dt[complete.cases(dt),]
dt2 = dt[!complete.cases(dt),]
dt1
# var1 var2
# 1 1 21
# 4 4 10
dt2
# var1 var2
# 2 2 NA
# 3 3 NA
If I have a dataframe like so
a <- c(NA,1,2,NA,4)
b <- c(6,7,8,9,10)
c <- c(NA,12,13,14,15)
d <- c(16,NA,18,NA,20)
df <- data.frame(a,b,c,d)
How can I delete columns "a" and "c" by asking R to delete those columns that contain an NA in the first row?
My actual dataset is much bigger, and this is only by way of a reproducible example.
Please note that this isn't the same as asking to delete columns with any NAs in it. My columns may have other NA values in it. I'm looking to delete just the ones with an NA in the first row.
You can use a vector of booleans indicating wether the first row is missing in this case.
res <- df[,!is.na(df[1,])]
> res
b d
1 6 16
2 7 NA
3 8 18
4 9 NA
5 10 20
I have a data frame where each row is a vector of values of varying lengths. I would like to create a vector of the last true value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!
Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1
Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call will select the position of the last non-NA value in each row (or select the first position if they are all NA).
Here's another version that removes all infinities, NA, and NaN's before taking the first element of the reversed input:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1