Indexing integer vector with NA - r

I have problems understanding this. I have an integer vector of length 5:
x <- 1:5
If I index it with a single NA, the result is of length 5:
x[NA]
# [1] NA NA NA NA NA
My first idea was that R checks whether 1-5 is NA but
x <- c(NA, 2, 4)
x[NA]
# [1] NA NA NA
So this cannot be the explanation. My second idea is that x[NA] is ordinary indexing, but then I do not understand:
Why this gives me five NAs.
What NA as an index means. x[1] gives you the first value, but what should the result of x[NA] be?

Compare your code:
> x <- 1:5; x[NA]
[1] NA NA NA NA NA
with
> x <- 1:5; x[NA_integer_]
[1] NA
In the first case, NA is of type logical (as class(NA) shows), whereas in the second it's an integer. From ?"[" you can see that when i is logical, it is recycled to the length of x:
For [-indexing only: i, j, ... can be logical vectors, indicating
elements/slices to select. Such vectors are recycled if necessary to
match the corresponding extent. i, j, ... can also be negative
integers, indicating elements/slices to leave out of the selection.
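To make the recycling visible, here is a small sketch using the vector x from the question (the variable names by_logical, by_integer and mixed are just illustrative). It compares the logical NA with NA_integer_ and shows what a length-2 logical index does:

```r
x <- 1:5

# The bare NA constant is logical; NA_integer_ is its integer counterpart.
typeof(NA)            # "logical"
typeof(NA_integer_)   # "integer"

# A logical index is recycled to length(x), so x[NA] behaves like
# x[c(NA, NA, NA, NA, NA)] and returns one NA per element.
by_logical <- x[NA]
length(by_logical)    # 5

# An integer index is not recycled: one index in, one element out.
by_integer <- x[NA_integer_]
length(by_integer)    # 1

# Recycling a mixed logical index gives TRUE NA TRUE NA TRUE.
mixed <- x[c(TRUE, NA)]
mixed                 # 1 NA 3 NA 5
```

An NA in a logical index means "unknown whether to select this element", which is why the result at that position is NA rather than being dropped.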

Related

Understanding output of the seq function

When using the seq function, I get the following outputs:
>seq(1,4)
1 2 3 4
and this retrieves the second element from the sequence
>seq(1,4) [2]
2
These two I understand. However, I don't understand why the following yields four NA values
>seq(1,4) [NA]
NA NA NA NA
But the example below does not produce four values; it returns just one NA
>seq(1,4) ["ABC"]
NA
Why is this happening?
What is important here is that NA is logical:
class(NA)
## [1] "logical"
and logical indices are recycled to the length of the vector being indexed.
seq(1, 4)[c(TRUE, FALSE)]
## [1] 1 3
If you use an integer NA then this won't happen:
seq(1, 4)[NA_integer_]
## [1] NA
I don't think it has anything to do with the seq function. If you try to subset values using a bare (logical) NA, you get back a vector of NAs as long as the original vector.
a <- c(1, 2)
a[NA]
# [1] NA NA
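The "ABC" case from the question works differently, because a character index is matched against the names of the vector and is never recycled. A minimal sketch, where the names "a" to "d" are made up purely for illustration:

```r
s <- seq(1, 4)

# No names on s, so "ABC" matches nothing: one index in, one NA out.
no_names <- s["ABC"]
length(no_names)        # 1

# Hypothetical names, added only to show what character indexing looks up.
names(s) <- c("a", "b", "c", "d")
hit <- s["b"]           # the element named "b", i.e. 2
miss <- s["ABC"]        # still NA: no element has that name

# Two character indices give two results, matched name by name.
both <- s[c("b", "ABC")]
length(both)            # 2
```

So the number of results follows the length of the character index, whereas a logical index is stretched to the length of the vector.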

ifelse r - x and y lengths differ

I'm trying to use an ifelse on an array called "OutComes" but it's giving me some trouble.
     PersonNumber Risk_Factor OC_Death OnsetAge Clinical CS_Death Cure AC_Death
[1,]            1           1 99.69098       NA       NA       NA   NA       NA
[2,]            2           1 60.68009       NA       NA       NA   NA       NA
[3,]            3           0 88.67483       NA       NA       NA   NA       NA
[4,]            4           0 87.60846       NA       NA       NA   NA       NA
[5,]            5           0 78.23118       NA       NA       NA   NA       NA
I want to use apply to look at this table's Risk_Factor column and call one of two functions to replace the NAs in the OnsetAge column.
I've been using this apply call:
apply(OutComes, 1, function(x) ifelse(OutComes[,"Risk_Factor"] == 1,
HighOnsetFunction(x), OnsetFunction(x)))
However, this doesn't work, because the ifelse itself fails with the error:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
I'm not sure what's going on in this ifelse or what the x and y lengths are.
There is a mistake in your apply function. You are applying a function with argument x (one row of OutComes), but then within ifelse you use the vector OutComes[,"Risk_Factor"], which is a whole column of the original matrix, not a single number. One simple solution is to do
apply(OutComes, 1, function(x) ifelse(x["Risk_Factor"] == 1,
HighOnsetFunction(x), OnsetFunction(x)))
But when dealing with a scalar, there is no real need to use ifelse, so it may be more efficient to write
apply(OutComes, 1, function(x) if (x["Risk_Factor"] == 1) HighOnsetFunction(x) else OnsetFunction(x))
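As a rough, self-contained illustration of the row-wise fix: the matrix below copies three columns from the question, while HighOnsetFunction and OnsetFunction are hypothetical stand-ins that just return constants, since the real ones are not shown.

```r
# Part of the OutComes matrix from the question.
OutComes <- cbind(
  PersonNumber = 1:5,
  Risk_Factor  = c(1, 1, 0, 0, 0),
  OC_Death     = c(99.69098, 60.68009, 88.67483, 87.60846, 78.23118)
)

# Hypothetical placeholders for the asker's functions (assumption).
HighOnsetFunction <- function(x) 40
OnsetFunction     <- function(x) 60

# Inside apply(), x is ONE row, so x["Risk_Factor"] is a single value
# and a plain if/else is enough.
onset <- apply(OutComes, 1, function(x) {
  if (x["Risk_Factor"] == 1) HighOnsetFunction(x) else OnsetFunction(x)
})
onset   # 40 40 60 60 60

# ifelse() is the vectorised tool: give it the whole column at once,
# outside of apply().
onset_vec <- ifelse(OutComes[, "Risk_Factor"] == 1, 40, 60)
```

The two approaches agree here only because the stand-in functions ignore their argument; with real row-dependent functions the apply version is the one that passes each row through.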

Difference between intersect and match in R

I am trying to understand the difference between match and intersect in R. Both return the same output in a different format. Are there any functional differences between both?
match(names(set1), names(set2))
# [1] NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 11
intersect(names(set1), names(set2))
# [1] "Year" "ID"
match(a, b) returns an integer vector of length(a), with the i-th element giving the first position j such that a[i] == b[j]. NA is returned by default when there is no match (you can change this with the nomatch argument).
If you want to get the same result as intersect(a, b), use either of the following:
b[na.omit(match(a, b))]
a[na.omit(match(b, a))]
Example
a <- 1:5
b <- 2:6
b[na.omit(match(a, b))]
# [1] 2 3 4 5
a[na.omit(match(b, a))]
# [1] 2 3 4 5
I just wanted to know if there are any other differences between the two. I was able to understand the results myself.
Then let's read the source code:
intersect
#function (x, y)
#{
# y <- as.vector(y)
# unique(y[match(as.vector(x), y, 0L)])
#}
It turns out that intersect is written in terms of match!
Haha, looks like I forgot the unique on the outside. Also, setting nomatch = 0L removes the need for na.omit. Well, R core is more efficient than my guess.
Follow-up
We could also use
a[a %in% b] ## need a `unique`, too
b[b %in% a] ## need a `unique`, too
However, have a read on ?match. In "Details" we can see how "%in%" is defined:
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
So, yes, everything is written using match.
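The practical differences show up once the inputs contain duplicates or non-matching values. A small sketch with made-up vectors a and b (not the set1/set2 from the question):

```r
a <- c("x", "y", "y", "z")
b <- c("y", "z", "w")

# match() answers "where in b is each element of a?" and keeps length(a).
pos <- match(a, b)                    # NA 1 1 2
pos0 <- match(a, b, nomatch = 0L)     # 0 1 1 2

# intersect() answers "which values are in both?" with duplicates removed.
common <- intersect(a, b)             # "y" "z"

# Rebuilding intersect() from match(): index 0 selects nothing,
# and unique() drops the repeated "y".
rebuilt <- unique(b[pos0])

# The %in% route gives the same set of values.
via_in <- unique(a[a %in% b])
```

In short, match returns positions aligned with its first argument, while intersect returns the shared values themselves.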

RowSums NA + NA gives 0 [duplicate]

I'd like to understand a (to me) weird behavior of the function rowSums. Imagine I have this super simple data frame:
a = c(NA, NA,3)
b = c(2,NA,2)
df = data.frame(a,b)
df
   a  b
1 NA  2
2 NA NA
3  3  2
and now I want a third column that is the sum of the other two. I cannot use simply + because of the NA:
df$c <- df$a + df$b
df
   a  b  c
1 NA  2 NA
2 NA NA NA
3  3  2  5
but if I use rowSums, the rows where every value is NA are calculated as 0, while if there is only one NA everything works fine:
df$d <- rowSums(df, na.rm=T)
df
   a  b  c  d
1 NA  2 NA  2
2 NA NA NA  0
3  3  2  5 10
am I missing something?
Thanks to all
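One option is to take rowSums with na.rm=TRUE and multiply it by NA^!rowSums(!is.na(df)). Here rowSums(!is.na(df)) counts the non-NA values in each row, the leading ! turns that count into TRUE only for rows that are all NA, and NA^TRUE is NA while NA^FALSE is 1. So the all-NA rows become NA and every other row is left unchanged.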
rowSums(df, na.rm=TRUE) *NA^!rowSums(!is.na(df))
#[1] 2 NA 10
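Since that one-liner is dense, here is the same computation taken apart step by step. This is only an illustrative sketch: the matrix m holds just the a and b columns from the question, and the intermediate names are made up.

```r
m <- cbind(a = c(NA, NA, 3), b = c(2, NA, 2))

# Step 1: how many non-missing values does each row have?
n_present <- rowSums(!is.na(m))       # 1 0 2

# Step 2: ! on a number is TRUE only for 0, i.e. for all-NA rows.
all_missing <- !n_present             # FALSE TRUE FALSE

# Step 3: NA^FALSE is NA^0 = 1, NA^TRUE is NA^1 = NA.
mask <- NA^all_missing                # 1 NA 1

# Step 4: multiplying by the mask keeps ordinary rows and blanks all-NA rows.
result <- rowSums(m, na.rm = TRUE) * mask
result                                # 2 NA 5
```

The third value is 5 here rather than the 10 shown above because m contains only a and b; in the question, df already included column c (3 + 2 + 5) when rowSums was called.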
Because
sum(numeric(0))
# 0
Once you use na.rm = TRUE in rowSums, the NAs in the second row are dropped, which leaves numeric(0). The sum of that empty vector is 0.
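The empty-sum behaviour can be checked directly on a single all-NA row. A minimal sketch (row2 and kept are illustrative names):

```r
# The second row of the data frame: nothing but NA.
row2 <- c(NA_real_, NA_real_)

# Dropping the NAs leaves an empty numeric vector.
kept <- row2[!is.na(row2)]
length(kept)                      # 0

# The sum of nothing is defined as 0 (the additive identity) ...
empty_sum <- sum(kept)            # 0

# ... which is exactly what na.rm = TRUE produces.
na_rm_sum <- sum(row2, na.rm = TRUE)

# Without na.rm, the NA propagates instead.
plain_sum <- sum(row2)            # NA
```

The same convention is why prod(numeric(0)) is 1: an empty reduction returns the identity element of the operation.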
If you want to retain NA for rows that are all NA, it takes two steps. I recommend writing a small function for this purpose:
my_rowSums <- function(x) {
if (is.data.frame(x)) x <- as.matrix(x)
z <- base::rowSums(x, na.rm = TRUE)
z[!base::rowSums(!is.na(x))] <- NA
z
}
my_rowSums(df)
# [1] 2 NA 10
This is particularly useful if the input x is a data frame (as in your case). base::rowSums first checks whether its input is a matrix; if it gets a data frame, it converts it to a matrix first. That type conversion is in fact more costly than the actual row-sum computation. Since we call base::rowSums twice, we convert x to a matrix once beforehand so the overhead is not paid two times.
For @akrun's "hacking" answer, I suggest:
akrun_rowSums <- function (x) {
if (is.data.frame(x)) x <- as.matrix(x)
rowSums(x, na.rm=TRUE) *NA^!rowSums(!is.na(x))
}
akrun_rowSums(df)
# [1] 2 NA 10

What does x[is.na(x)] do in R?

I'm following the swirl tutorial, and one of the parts has a vector x defined as:
> x
[1] 1.91177824 0.93941777 -0.72325856 0.26998371 NA NA
[7] -0.17709161 NA NA 1.98079386 -1.97167684 -0.32590760
[13] 0.23359408 -0.19229380 NA NA 1.21102697 NA
[19] 0.78323515 NA 0.07512655 NA 0.39457671 0.64705874
[25] NA 0.70421548 -0.59875008 NA 1.75842059 NA
[31] NA NA NA NA NA NA
[37] -0.74265585 NA -0.57353603 NA
Then when we type x[is.na(x)] we get a vector of all NA's
> x[is.na(x)]
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Why does this happen? My confusion is that is.na(x) itself returns a vector of length 40 with TRUE or FALSE in each entry, depending on whether that entry is NA or not. Why does "wrapping" this vector with x[ ] suddenly subset to the NAs themselves?
This is called logical indexing. It's a very common and neat R idiom.
Yes, is.na(x) gives a boolean ("logical") vector of same length as your vector.
Using that logical vector for indexing is called logical indexing.
Obviously x[is.na(x)] accesses the vector of all NA entries in x, and is totally pointless unless you intend to reassign them to some other value, e.g. impute the median (or anything else)
x[is.na(x)] <- median(x, na.rm=T)
Notes:
x[!is.na(x)], by contrast, accesses all non-NA entries in x
compare also the na.omit(x) function, which is clunkier
the way R's built-in functions historically do (or don't) handle NAs (by default or customizably) is a patchwork-quilt mess; that's why the x[is.na(x)] idiom is so crucial
many useful functions (mean, median, sum, sd) are NA-aware, i.e. they support an na.rm=TRUE option to ignore NA values; cor has no na.rm argument and instead handles missing values through its use argument (e.g. use="complete.obs")
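To see the idiom end to end on something smaller than the 40-element swirl vector, here is a short sketch with a made-up five-element x:

```r
x <- c(1.5, NA, -0.7, NA, 0.3)

# is.na(x) is a logical vector with one TRUE/FALSE per element of x.
mask <- is.na(x)            # FALSE TRUE FALSE TRUE FALSE

# Indexing with it keeps exactly the positions where the mask is TRUE.
missing_part <- x[mask]     # NA NA
observed     <- x[!mask]    # 1.5 -0.7 0.3

# The same index on the left-hand side overwrites only those positions,
# here imputing the median of the observed values (0.3).
x[is.na(x)] <- median(x, na.rm = TRUE)
x                           # 1.5 0.3 -0.7 0.3 0.3
```

The selection on its own looks pointless, as the answer says; the payoff is that the identical expression works as an assignment target.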

Resources