How does subsetting with NA work? - r

Can someone please answer in layman terms how indexing (subsetting) with NA works. Even though there are some answers from google, I would like to understand it better in simple terms.
When indexing a vector (of length > 1) using a single NA, why does it yield five missing values?
> x <- 1:5
> x[NA]
[1] NA NA NA NA NA

From help("["):
When extracting, a numerical, logical or character NA index picks an
unknown element and so returns NA in the corresponding element of a
logical, integer, numeric, complex or character result, and NULL for a
list.
What does "corresponding element" mean? This can be understood if you know about recycling of vector elements. x[NA] (this is a logical NA per default) in your example is actually "interpreted" as x[c(NA, NA, NA, NA, NA)] since logical indices are recycled. So, each element of x has a corresponding NA during subsetting and thus (per the quote above) NA is returned for each element of x. In layman's language: For each element of x we don't know if we want it. Thus an unknown value is returned for each element.
As @r2evans points out: x[NA_integer_] returns only one NA because integer indices are not recycled. In layman's language: We want one value but don't know which one. Thus, one unknown value is returned.
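The recycling explanation can be checked directly in base R. This is only a small sketch; the names r_logical, r_explicit and r_integer are made up for illustration.

```r
x <- 1:5

# A bare NA is a logical value, so as an index it is recycled to length(x).
class(NA)                                # "logical"
r_logical  <- x[NA]
r_explicit <- x[c(NA, NA, NA, NA, NA)]
identical(r_logical, r_explicit)         # TRUE

# NA_integer_ is an integer index: one unknown position, no recycling.
r_integer <- x[NA_integer_]
length(r_integer)                        # 1
```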

Related

Logical Indexing with NA in R - How to set to FALSE or exclude rather than return NA? [duplicate]

Apologies if this is a common question, but it has caused some unexpected frustration in a script I am running. I have a dataset which roughly looks like the following (though much larger in practice):
df <- data.frame(A = c(1, 2, 3, NA, NA, 6),
B = c(10, 20, 30, 40 , 50, 60))
My script cycles through a list of values from column A and is supposed to take action based on whether the values in B are larger than 25. However, the corresponding values of B for missing values in A are ALWAYS returned, whereas I want them to always be excluded. For example,
df$B[df$A == 6]
Gives the output
NA NA 60
Rather than the expected
60
Thus, the code
df$B[df$A == 6] > 25
returns
NA NA TRUE
rather than just
TRUE
Could someone explain the reason for this and any simple solutions? The immediate solution that came to mind is to remove any rows with NA values in column A, but I would prefer a solution which is robust to missingness in A and will only return the single desired logical value from B.
Whenever you ask whether a Not Available (NA) value is equal to a number or to anything else, you get the only possible answer: Not Available (NA).
NA might be equal to 6, or to John the Baptist, or to ⛄, or to any other object. It is simply impossible to say whether it is, since the value is not available.
To get the answer you want, you can use na.omit() or na.exclude() on the results. Or you can apply yet another logical condition during subsetting:
with(df, B[A == 6 & !is.na(A)])
# [1] 60
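Another option, sketched here on the question's own data, is to wrap the comparison in which(). It returns only the positions where the condition is TRUE, so NA comparisons drop out. The names hit, b_at_6 and above_25 are illustrative only.

```r
df <- data.frame(A = c(1, 2, 3, NA, NA, 6),
                 B = c(10, 20, 30, 40, 50, 60))

# df$A == 6 is FALSE FALSE FALSE NA NA TRUE.
# which() keeps only the positions that are TRUE, so the NAs are skipped.
hit      <- which(df$A == 6)   # 6
b_at_6   <- df$B[hit]          # 60
above_25 <- b_at_6 > 25        # TRUE
```

This gives the single logical value the question asks for without removing any rows from df.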

Why does R return integer(0) for under-indexing but NA for over-indexing a vector? [duplicate]

Say I have a vector, for example, x <- 1:10, then x[0] returns a zero-length vector of the same class as x, here integer(0).
I was wondering if there is a reason behind that choice, as opposed to throwing an error, or returning NA as x[11] would? Also, if you can think of a situation where having x[0] return integer(0) is useful, thank you for including it in your answer.
As seen in ?"["
NA and zero values are allowed: rows of an index matrix containing a
zero are ignored, whereas rows containing an NA produce an NA in the
result.
So an index of 0 just gets ignored. We can see this in the following
x <- 1:10
x[c(1, 3, 0, 5, 0)]
#[1] 1 3 5
So if the only index we give it is 0 then the appropriate response is to return an empty vector.
My crack at it, as I am not a programmer and certainly do not contribute to the R source: I think it may be because you need some sort of placeholder to state that something was asked for here but nothing was returned. This becomes more apparent with things like table and split. For instance, when you tabulate values and a cell has zero entries, you still need to record that the cell exists and has no values. It would not be appropriate to have x[0] == 0, because the result is not the numeric value zero but the absence of any value.
So in the following splits we need a placeholder, and integer(0) holds the place of "no values returned", which is not the same as 0. Notice that the second one returns numeric(0), which is still a placeholder, stating that the empty result is numeric.
with(mtcars, split(as.integer(gear), list(cyl, am, carb)))
with(mtcars, split(gear, list(cyl, am, carb)))
So in a way my x[FALSE] retort is true in that it holds the place of the non existent zero spot in the vector.
All right this balonga I just spewed is true until someone disputes it and tears it down.
PS: page 19 of this guide (LINK) states that integer() and integer(0) are empty integer vectors.
Related SO post: How to catch integer(0)?
Since the array indices are 1-based, index 0 has no meaning. The value is ignored as a vector index.
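As for a situation where the empty result is useful, one plausible case (a sketch, not taken from the R documentation) is a search that finds nothing. The names hits and picked are illustrative.

```r
x <- 1:10

# No element is greater than 100, so which() returns integer(0).
hits   <- which(x > 100)
picked <- x[hits]              # integer(0), not an error and not NA

# Downstream code keeps working on the empty result.
length(picked)                 # 0
sum(picked)                    # 0

# For comparison: index 0 is ignored, index 11 is past the end.
identical(x[0], integer(0))    # TRUE
x[11]                          # NA
```

Because the result has length zero rather than being an error or a single NA, code that counts, sums or loops over the matches needs no special case.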

Add a specified number of blank rows to a data table without overwriting the heading

I'm trying to make a large blank data.table with a header row in order to add values in specific places once it is set up. I have been able to duplicate the first row and then clear every other row or every row, but what I'd like to do is clear every row after the header row. Some columns are numeric input and some are character input.
[input3]:
headers: header1 header2 header3..... header 60+
Values: NA NA NA ... NA
Duplicate row:
input3 <- input2[rep(1:nrow(input2), each = 2), ]
Clear every row:
input3[1:nrow(input3) %% 1 == 0, ] <- NA
But if I try to rewrite that as duplicating blank rows starting at row 2 (to preserve the header) I get this error:
input3[2:nrow(input3) %% 1 == 0, ] <- NA
"Error in [.data.table(x, i, which = TRUE) : i evaluates to a logical vector length 9 but there are 10 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle."
I need to be able to dynamically add rows while keeping the header as this is going to be a gigantic table I will export to another program.
Edit: this is different from this link in that I'm adding additional rows not specified originally in the data. Not just wiping rows.
Instead use
input3[c(FALSE, 2:nrow(input3) %% 1 == 0), ] <- NA
By using 2:nrow, you were explicitly giving a shortened vector. When that thing is a logical vector, it must be length 1 or the same as the number of rows. Period.
Though this has its problems and I discourage its use, perhaps you were expecting it to behave like this:
input3[which(2:nrow(input3) %% 1 == 0),] <- NA
The "good" of this is that the which(...) returns a vector of integer, so it does not need to be the same length as the number of rows in the frame/table.
From ?Extract (which includes [ and friends):
For '['-indexing only: 'i', 'j', '...' can be logical
vectors, indicating elements/slices to select. Such vectors
are recycled if necessary to match the corresponding extent.
'i', 'j', '...' can also be negative integers, indicating
elements/slices to leave out of the selection.
"Recycling" is why length 1 works: its logical value is used for all rows. If you use length 2 and there are an even number of rows (e.g., mtcars[c(T,F),]), then it will give every other row. In a similar vein, if you assume recycling and the number of rows is not a multiple of the index length (e.g., mtcars[c(T,F,F),]), then it becomes much less clear which rows you will get.
Add to that the behavior of data.table, which refuses to do this. Recycling can get you in trouble, so data.table doesn't encourage it.
library(data.table)
mt <- as.data.table(mtcars)
mt[c(T,F),] <- NA
# Error in `[.data.table`(x, i, which = TRUE) :
# i evaluates to a logical vector length 2 but there are 32 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
mt[c(1,3),] <- NA
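The length rule can be seen without data.table. The sketch below uses a plain data.frame with made-up columns (id, label), on the assumption that row 1 stands in for the row to preserve. It builds a logical index that is exactly nrow(df) long, so nothing has to be recycled.

```r
# Hypothetical stand-in for input3: one numeric and one character column.
df <- data.frame(id = 1:5, label = c("a", "b", "c", "d", "e"))

# The problematic index from the question is one element too short.
short_idx <- 2:nrow(df) %% 1 == 0
length(short_idx)                      # 4, but df has 5 rows

# A full-length index: FALSE for row 1, TRUE for every later row.
blank_rows <- seq_len(nrow(df)) >= 2
length(blank_rows) == nrow(df)         # TRUE

df[blank_rows, ] <- NA                 # row 1 is kept, rows 2..5 become NA
```

Comparing against seq_len(nrow(df)) states the intent ("every row from 2 onward") directly and always yields one logical value per row.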

Subsetting a vector with a condition (excluding NA)

vector1 = c(1,2,3,NA)
condition1 = (vector1 == 2)
vector1[condition1]
vector1[condition1==TRUE]
In the above code, condition1 is "FALSE TRUE FALSE NA",
and the 3rd and 4th lines both give me the result "2 NA",
which is not what I expected.
I wanted only the elements whose values are really 2, not including NA.
Could anybody explain why R is designed to work this way,
and how I can get the result I want with a simple command?
The subset vector[NA] will always be NA because the NA value is unknown and therefore the result of the subset is also unknown. %in% returns FALSE for NA, so it can be useful here.
vector1 = c(1,2,3,NA)
condition1 = (vector1 %in% 2)
vector1[condition1]
# [1] 2
If you are in RStudio and enter
?`[`
You will get the following explanation:
NAs in indexing
When extracting, a numerical, logical or character NA index picks an
unknown element and so returns NA in the corresponding element of a
logical, integer, numeric, complex or character result, and NULL for a
list. (It returns 00 for a raw result.)
When replacing (that is using indexing on the lhs of an assignment) NA
does not select any element to be replaced. As there is ambiguity as
to whether an element of the rhs should be used or not, this is only
allowed if the rhs value is of length one (so the two interpretations
would have the same outcome). (The documented behaviour of S was that
an NA replacement index ‘goes nowhere’ but uses up an element of
value: Becker et al p. 359. However, that has not been true of other
implementations.)
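The replacement rule quoted above can be tried out in a small sketch. The names idx and res are illustrative, and tryCatch is used only so the example runs to the end.

```r
idx <- c(TRUE, NA, FALSE, NA, TRUE)

# Length-one right-hand side: allowed. NA positions select nothing.
x <- 1:5
x[idx] <- 0L
x                                      # 0 2 3 4 0

# Longer right-hand side: ambiguous, so R signals an error instead.
y <- 1:5
res <- tryCatch({
  y[idx] <- c(10L, 20L)
  "assigned"
}, error = function(e) "error")
res                                    # "error"
```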
Try a logical operator in that case:
vector1 = c(1,2,3,NA)
condition1<-(vector1==2 & !is.na(vector1) )
condition1
# FALSE TRUE FALSE FALSE
vector1[condition1]
# 2
The & operation returns TRUE only when both of its operands are TRUE.
identical is "The safe and reliable way to test two objects for being exactly equal. It returns TRUE in this case, FALSE in every other case." (see ?identical)
Because it does not compare element by element, you can use it in sapply to compare each element of vector1 to 2, i.e.:
condition1 = sapply(vector1, identical, y = 2)
which will give:
vector1[condition1]
[1] 2
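One caveat with the identical approach, shown as a sketch: identical also compares types, so it only matches when the vector and the target value have the same storage type. The names int_vec, on_double, int_vs_double and int_vs_int are illustrative.

```r
vector1 <- c(1, 2, 3, NA)              # double vector, as in the question
on_double <- sapply(vector1, identical, y = 2)
on_double                              # FALSE TRUE FALSE FALSE

int_vec <- c(1L, 2L, 3L, NA)           # integer vector
int_vs_double <- sapply(int_vec, identical, y = 2)
int_vs_double                          # FALSE FALSE FALSE FALSE

int_vs_int <- sapply(int_vec, identical, y = 2L)
int_vs_int                             # FALSE TRUE FALSE FALSE
```

If the type of the data is not known in advance, the %in% or !is.na() approaches above may be the safer choice.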

Why does any() return NA when no true values [duplicate]

So we have this behaviour:
any(c(TRUE, FALSE, NA))
#> [1] TRUE
any(c(TRUE, NA))
#> [1] TRUE
any(c(FALSE, NA))
#> [1] NA
Anyone know the rationale for returning NA instead of FALSE? IMO the function should be testing for presence of non-FALSE values, which NA is not.
This behavior is explained in the values section of the help file:
The value returned is TRUE if at least one of the values in x is TRUE, and FALSE if all of the values in x are FALSE (including if there are no values). Otherwise the value is NA.
As you note, this seems to differ from the behavior of more commonly used functions such as sum and mean, which return NA whenever their vector argument contains NA values. This problem in perception is cleared up by joran's answer, which refers to the documentation from ?Logic, to requote:
NA is a valid logical object. Where a component of x or y is NA, the result will be NA if the outcome is ambiguous. In other words NA & TRUE evaluates to NA, but NA & FALSE evaluates to FALSE. See the examples below.
So in the case of ambiguity, for example, the calculation of a mean where the vector contains NA, or NA | FALSE where the missing value might be TRUE, NA will be the output. Whereas in other cases such as any(c(TRUE, NA)) or TRUE | NA, the outcome is unambiguous despite the presence of a missing value. This logic may be clearer in @Floo0's answer and in some of the comments to the question.
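The four combinations of NA with & and | can be listed in a short sketch. The named vector truth is just an illustrative container.

```r
# The result is NA only when the missing value could change the outcome.
truth <- c(
  "NA & TRUE"  = NA & TRUE,    # NA:    depends on the unknown value
  "NA & FALSE" = NA & FALSE,   # FALSE: FALSE regardless of the unknown value
  "NA | TRUE"  = NA | TRUE,    # TRUE:  TRUE regardless of the unknown value
  "NA | FALSE" = NA | FALSE    # NA:    depends on the unknown value
)
truth
```

any(c(FALSE, NA)) corresponds to the last row and any(c(TRUE, NA)) to the third.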
I might be mistaken but the logic here is:
NA means unknown value. So the question
Is any value of (FALSE, NA) true?
Is answered with "I don't know", aka NA, because NA could be TRUE but it is unknown at the moment you are asking.
Take the question
Is any value of (TRUE, NA) true?
This is answered with TRUE as certainly the first value is TRUE.
I would wrap the call in isTRUE, this yields the desired result:
> any(c(FALSE, NA))
[1] NA
> isTRUE(any(c(FALSE, NA)))
[1] FALSE
From the documentation:
‘isTRUE(x)’ is an abbreviation of ‘identical(TRUE, x)’, and so is
true if and only if ‘x’ is a length-one logical vector whose only
element is ‘TRUE’ and which has no attributes (not even names).
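Besides isTRUE, any() has an na.rm argument that drops missing values before testing. A small sketch comparing the options (the names flags, raw_result, dropped and wrapped are illustrative):

```r
flags <- c(FALSE, NA)

raw_result <- any(flags)                 # NA
dropped    <- any(flags, na.rm = TRUE)   # FALSE: the NA is removed first
wrapped    <- isTRUE(any(flags))         # FALSE: NA is not identical to TRUE

# Both give TRUE when a TRUE is actually present.
any(c(TRUE, NA), na.rm = TRUE)           # TRUE
isTRUE(any(c(TRUE, NA)))                 # TRUE
```

Either form avoids passing NA to if(), which would otherwise stop with an error.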
