Why is there no NA_logical_?

From help("NA"):
There are also constants NA_integer_, NA_real_, NA_complex_ and
NA_character_ of the other atomic vector types which support missing
values: all of these are reserved words in the R language.
My question is why there is no NA_logical_ or similar, and what to do about it.
Specifically, I am creating several large very similar data.tables, which should be class compatible for later rbinding. When one of the data.tables is missing a variable, I am creating that column but with it set to all NAs of the particular type. However, for a logical I can't do that.
In this case, it probably doesn't matter too much (data.table dislikes coercing columns from one type to another, but it also dislikes adding rows, so I have to create a new table to hold the rbound version anyway), but I'm puzzled as to why the NA_logical_, which logically should exist, does not.
Example:
library(data.table)
Y <- data.table( a=NA_character_, b=rep(NA_integer_,5) )
Y[ 3, b:=FALSE ]
Y[ 2, a:="zebra" ]
> Y
a b
1: NA NA
2: zebra NA
3: NA 0
4: NA NA
5: NA NA
> class(Y$b)
[1] "integer"
Two questions:
Why doesn't NA_logical_ exist, when its relatives do?
What should I do about it in the context of data.table, or just to avoid coercion as much as possible? I assume using NA_integer_ buys me little in terms of coercion (it will coerce the logical I'm adding to 0L/1L, which isn't terrible, but isn't ideal).

NA is already logical so NA_logical_ is not needed. Just use NA in those situations where you need a missing logical. Note:
> typeof(NA)
[1] "logical"
Since the NA_*_ names are all reserved words there was likely a desire to minimize the number of them.
Example:
library(data.table)
X <- data.table( a=NA_character_, b=rep(NA,5) )
X[ 3, b:=FALSE ]
> X
a b
1: NA NA
2: NA NA
3: NA FALSE
4: NA NA
5: NA NA
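To confirm the question's rbind concern is addressed: a column built with plain NA stays logical and rbinds cleanly with real logical data. A quick sketch (the second table Z and its values are made up here):

```r
library(data.table)
X <- data.table(a = NA_character_, b = rep(NA, 5))
class(X$b)
# [1] "logical"
Z <- data.table(a = "zebra", b = TRUE)   # hypothetical second table
combined <- rbind(X, Z)
class(combined$b)                        # still logical: no coercion needed
# [1] "logical"
```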

I think based on this
#define NA_LOGICAL R_NaInt
from $R_HOME/R/include/R_ext/Arith.h we can suggest using NA_integer_ or NA_real_, and hence plain old NA, in R code:
R> as.logical(c(0,1,NA))
[1] FALSE TRUE NA
R>
R> as.logical(c(0L, 1L, NA_integer_))
[1] FALSE TRUE NA
R>
which has
R> class(as.logical(c(0,1,NA)))
[1] "logical"
R>
R> class(as.logical(c(0, 1, NA_real_)))
[1] "logical"
R>
Or am I misunderstanding your question? R's logical type is three-valued: yea, nay and missing. And we can use the NA from either integer or numeric to cast. Does that help?

Replacing values by index with data.table syntax

assume we have data.table d1 with 6 rows:
d1 <- data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5))
we add a column to d1 called test, and fill it with NA
d1$test <- NA
the external vector rows gives the index of rows we want to fill with values contained in vals
rows <- c(5,6)
vals <- c(6,3)
how do you do this in data table syntax? i have not been able to figure this out from the documentation.
it seems like this should work, but it does not:
d1[rows, test := vals]
the following error is returned:
Warning: 6.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical' (column 3 named 'test')
This is my desired outcome:
data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5), test = c(NA,NA,NA,NA,6,3))
Let's walk through this:
d1 <- data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5))
d1$test <- NA
rows <- c(5,6)
vals <- c(6,3)
d1[rows, test := vals]
# Warning in `[.data.table`(d1, rows, `:=`(test, vals)) :
# 6.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical' (column 3 named 'test')
class(d1$test)
# [1] "logical"
class(vals)
# [1] "numeric"
R can be quite "sloppy" in general, allowing one to coerce values from one class to another. Typically, this is from integer to floating point, sometimes from number to string, sometimes logical to number, etc. R does this freely, at times unexpectedly, and often silently. For instance,
13 > "2"
# [1] FALSE
The LHS is of class numeric, the RHS character. Because the classes differ, R silently converts 13 to "13" and then compares the two strings. String comparison is lexicographic, i.e. character by character: R first compares the "1" with the "2", determines that the result is unambiguously FALSE, and stops (no later character can change the outcome). Neither the fact that a numeric comparison would give a different answer, nor the fact that the RHS has no more characters to compare (lengths themselves are not compared), changes this.
So R can be quite sloppy about this; not all languages are this permissive (most are not, in my experience), and it can be risky in unsupervised (automated) situations, often producing unexpected results. Because of this, many (including the devs of data.table and dplyr, to name two) "encourage" (force) the user to be explicit about class coercion.
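The promotion hierarchy behind this sloppiness can be seen directly with c(), which silently promotes all elements to the most general type present. A base-R-only sketch, outputs shown as comments:

```r
c(TRUE, 2L)    # logical promoted to integer
# [1] 1 2
c(1L, 2.5)     # integer promoted to double
# [1] 1.0 2.5
c(1, "a")      # number promoted to character
# [1] "1" "a"
"10" < "9"     # lexicographic comparison, like 13 > "2" above
# [1] TRUE
```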
As a side note: R has at least 8 different classes of NA, and all of them look like NA:
str(list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_,
Sys.Date()[NA], Sys.time()[NA], as.POSIXlt(Sys.time())[NA]))
# List of 8
# $ : logi NA
# $ : int NA
# $ : num NA
# $ : chr NA
# $ : cplx NA
# $ : Date[1:1], format: NA
# $ : POSIXct[1:1], format: NA
# $ : POSIXlt[1:1], format: NA
There are a few ways to fix that warning.
Instantiate the test column as a "real" (numeric, floating-point) version of NA:
# starting with a fresh `d1` without `test` defined
d1$test <- NA_real_
d1[rows, test := vals] # works, no warning
Instantiate the test column programmatically, matching the class of vals without using the literal NA_real_:
# starting with a fresh `d1` without `test` defined
d1$test <- vals[1][NA]
d1[rows, test := vals] # works, no warning
Convert the existing test column in its entirety (not subsetted) to the desired class:
d1$test <- NA # this one is class logical
d1[, test := as.numeric(test)] # converts from NA to NA_real_
d1[rows, test := vals] # works, no warning
Things that work but are still being sloppy:
replace allows us to do this, but it is silently internally coercing from logical to numeric:
d1$test <- NA # logical class
d1[, test := replace(test, .I %in% rows, vals)]
This works because the internals of replace are simple:
function (x, list, values)
{
x[list] <- values
x
}
The reassignment to x[list] causes R to coerce the entire vector from logical to numeric, and it returns the whole vector at once. In data.table, assigning to the whole column at once allows this, since it is a common operation to change the class of a column.
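The same coercion-on-subassignment behavior is easy to see with a plain vector, outside of data.table entirely. A minimal base-R sketch:

```r
x <- rep(NA, 5)   # a logical vector of NAs
class(x)
# [1] "logical"
x[2] <- 6.5       # base R silently coerces the whole vector to hold 6.5
class(x)
# [1] "numeric"
x
# [1]  NA 6.5  NA  NA  NA
```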
As a side note, some might be tempted to use ifelse here, which further demonstrates the sloppiness of R (and more so of ifelse itself, which, while convenient, is broken in a few ways).
base::ifelse doesn't work here out of the box because we'd need vals to be the same length as the number of rows in d1. Even if that were the case, though, ifelse also silently coerces the class of one of its branches. Imagine these scenarios:
ifelse(c(TRUE, TRUE), pi, "pi")
# [1] 3.141593 3.141593
ifelse(c(TRUE, FALSE), pi, "pi")
# [1] "3.14159265358979" "pi"
The moment one of the conditions is false in this case, the whole result changes from numeric to character, and there was no message or warning to that effect. It is because of this that data.table::fifelse (and dplyr::if_else) will fail preemptively:
fifelse(c(TRUE, TRUE), pi, "pi")
# Error in fifelse(c(TRUE, TRUE), pi, "pi") :
# 'yes' is of type double but 'no' is of type character. Please make sure that both arguments have the same type.
(There are other issues with ifelse, not just this, caveat emptor.)

Create double object with NA

How can I create a double object with an NA value.
I am writing a test case where the output is NA:
> gt[2]$height
[1] NA
> typeof(gt[2])
[1] "double"
Question is how can I create an object of type "double" with an NA value.
By default, NA is a logical constant of length 1 used to represent missing values in data; its type can be changed by using one of the four typed constants NA_integer_, NA_real_, NA_complex_ and NA_character_.
For more info, please read the documentation page of ?NA
Try this:
x <- numeric()
typeof(x)
# [1] "double"
y <- NA_real_
typeof(y)
# [1] "double"
y
# [1] NA
mydata <- data.frame(height = NA)
mydata$height <- as.double(mydata$height)
typeof(mydata$height)
# [1] "double"

Replace values in list

I have a nested list, which could look something like this:
characlist<-list(list(c(1,2,3,4)),c(1,3,2,NA))
Next, I want to replace all values equal to one with NA. I tried the following, but it produces an error:
lapply(characlist,function(x) ifelse(x==1,NA,x))
Error in ifelse(x == 1, NA, x) :
(list) object cannot be coerced to type 'double'
Can someone tell me what's wrong with the code?
Use rapply instead:
> rapply(characlist,function(x) ifelse(x==1,NA,x), how = "replace")
#[[1]]
#[[1]][[1]]
#[1] NA 2 3 4
#
#
#[[2]]
#[1] NA 3 2 NA
The problem in your initial approach was that your first list element is itself a list. Hence you cannot directly apply the ifelse logic as you would on an atomic vector. By using ?rapply you can avoid that problem (rapply is a recursive version of lapply).
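A variant of the same idea, assuming one prefers replace() over ifelse(): test membership with %in%, which returns FALSE (not NA) for missing values, so no NA lands in the subscript used for replacement:

```r
characlist <- list(list(c(1, 2, 3, 4)), c(1, 3, 2, NA))
# %in% maps NA to FALSE, so replace() never sees an NA index
rapply(characlist, function(x) replace(x, x %in% 1, NA), how = "replace")
# [[1]]
# [[1]][[1]]
# [1] NA  2  3  4
#
# [[2]]
# [1] NA  3  2 NA
```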
Another option would be using relist after we replace the elements that are 1 to NA in the unlisted vector. We specify the skeleton as the original list to get the same structure.
v1 <- unlist(characlist)
relist(replace(v1, v1==1, NA), skeleton=characlist)
#[[1]]
#[[1]][[1]]
#[1] NA 2 3 4
#[[2]]
#[1] NA 3 2 NA

summary still shows NAs after using both na.omit and complete.cases

I am a grad student using R and have been reading the other Stack Overflow answers regarding removing rows that contain NA from dataframes. I have tried both na.omit and complete.cases. When using both it shows that the rows with NA have been removed, but when I write summary(data.frame) it still includes the NAs. Are the rows with NA actually removed or am I doing this wrong?
na.omit(Perios)
summary(Perios)
Perios[complete.cases(Perios),]
summary(Perios)
The problem is that you never assigned the output of na.omit back to Perios:
Perios <- na.omit(Perios)
If you know which column the NAs occur in, then you can just do
Perios[!is.na(Perios$Periostitis),]
or more generally:
Perios[!is.na(Perios$colA) & !is.na(Perios$colD) & ... ,]
Then as a general safety tip for R, throw in an na.fail to assert it worked:
na.fail(Perios) # trust, but verify! Paranoia is healthy.
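For more than a couple of columns, chaining !is.na() clauses gets unwieldy. A sketch of a more scalable variant using rowSums() on a toy data frame (the real Perios isn't shown in the question, and colA/colD are the placeholder names used above):

```r
Perios <- data.frame(colA = c(1, NA, 3), colD = c(4, 5, NA), other = NA)
cols <- c("colA", "colD")                      # only these columns must be complete
Perios[rowSums(is.na(Perios[, cols])) == 0, ]
#   colA colD other
# 1    1    4    NA
```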
is.na is not the proper function here. You want complete.cases, which is equivalent to function(x) !apply(is.na(x), 1, any), or na.omit to filter the data.
That is, you want all rows in which there are no NA values.
> x <- data.frame(a=c(1,2,NA), b=c(3,NA,NA))
> x
a b
1 1 3
2 2 NA
3 NA NA
> x[complete.cases(x),]
a b
1 1 3
> na.omit(x)
a b
1 1 3
Then assign the result back to x to keep the filtered data.
complete.cases returns a vector, one element per row of the input data frame. On the other hand, is.na returns a matrix. This is not appropriate for returning complete cases, but can return all non-NA values as a vector:
> is.na(x)
a b
[1,] FALSE FALSE
[2,] FALSE TRUE
[3,] TRUE TRUE
> x[!is.na(x)]
[1] 1 2 3

counting vectors with NA included

By accident, I found that R counts vectors containing NA in an interesting way:
> temp <- c(NA,NA,NA,1) # 4 items
> length(temp[temp>1])
[1] 3
> temp <- c(NA,NA,1) # 3 items
> length(temp[temp>1])
[1] 2
At first I assumed R would collapse all the NAs into a single NA, but this is not the case.
Can anyone explain? Thanks.
You were expecting only TRUE's and FALSE's (and the results to only be FALSE) but a logical vector can also have NA's. If you were hoping for a length zero result, then you had at least three other choices:
> temp <- c(NA,NA,NA,1) # 4 items
> length(temp[ which(temp>1) ] )
[1] 0
> temp <- c(NA,NA,NA,1) # 4 items
> length(subset( temp, temp>1) )
[1] 0
> temp <- c(NA,NA,NA,1) # 4 items
> length( temp[ !is.na(temp) & temp>1 ] )
[1] 0
You will find the last form in a lot of the internal code of well-established functions. I happen to think the first version is more economical and easier to read, but R Core seems to disagree: I have several times been advised on R-help not to wrap logical expressions in which(). I remain unconvinced. It is correct, however, that one should not combine which() with negative indexing.
EDIT: The reason not to use the "minus which" construct (negative indexing with which) is that when every item fails the which-test, and you would therefore expect all of them to be returned, it instead returns an unexpected empty vector:
temp <- c(1, 2, 3, 4, NA)
temp[!temp > 5]
# [1]  1  2  3  4 NA    (as expected)
temp[-which(temp > 5)]
# numeric(0)            (not as expected)
temp[!temp > 5 & !is.na(temp)]
# [1] 1 2 3 4           (a correct way to handle the negation)
I admit that the notion that NA's should select NA elements seems a bit odd, but it is rooted in the history of S and therefore R. There is a section in ?"[" about "NA's in indexing". The rationale is that each NA as an index should return an unknown result, i.e. another NA.
If you break down each command and look at the output, it's more enlightening:
> tmp = c(NA, NA, 1)
> tmp > 1
[1] NA NA FALSE
> tmp[tmp > 1]
[1] NA NA
So, when we next run length(tmp[tmp > 1]), it is as if we were executing length(c(NA, NA)). A vector full of NAs is perfectly fine: it has a well-defined length, and a two-element vector of NAs is, of course, different from a three-element one.
You can use 'sum':
> tmp <- c(NA, NA, NA, 3)
> sum(tmp > 1)
[1] NA
> sum(tmp > 1, na.rm=TRUE)
[1] 1
A bit of explanation: 'sum' expects numbers but 'tmp > 1' is logical. So it is automatically coerced to be numeric: TRUE => 1; FALSE => 0; NA => NA.
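That coercion can be made explicit (a quick sketch):

```r
as.numeric(c(TRUE, FALSE, NA))   # TRUE => 1, FALSE => 0, NA => NA
# [1]  1  0 NA
tmp <- c(NA, NA, NA, 3)
sum(tmp > 1, na.rm = TRUE)       # na.rm drops the NA terms before summing
# [1] 1
```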
I don't think there is anything precisely like this in 'The R Inferno' but this is definitely the sort of question that it is aimed at. http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
