What does x[is.na(x)] do in R?

I'm following the swirl tutorial, and one of the parts has a vector x defined as:
> x
[1] 1.91177824 0.93941777 -0.72325856 0.26998371 NA NA
[7] -0.17709161 NA NA 1.98079386 -1.97167684 -0.32590760
[13] 0.23359408 -0.19229380 NA NA 1.21102697 NA
[19] 0.78323515 NA 0.07512655 NA 0.39457671 0.64705874
[25] NA 0.70421548 -0.59875008 NA 1.75842059 NA
[31] NA NA NA NA NA NA
[37] -0.74265585 NA -0.57353603 NA
Then when we type x[is.na(x)] we get a vector of all NA's
> x[is.na(x)]
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Why does this happen? My confusion is that is.na(x) itself returns a vector of length 40 with TRUE or FALSE in each entry, depending on whether that entry is NA or not. Why does "wrapping" this vector with x[ ] suddenly subset to the NAs themselves?

This is called logical indexing. It's a very common and neat R idiom.
Yes, is.na(x) gives a boolean ("logical") vector of the same length as your vector.
Using that logical vector as an index keeps exactly the elements of x at the positions where the index is TRUE.
So x[is.na(x)] returns all the NA entries in x. On its own that is pointless, unless you intend to reassign them to some other value, e.g. impute the median (or anything else):
x[is.na(x)] <- median(x, na.rm=TRUE)
Notes:
x[!is.na(x)], by contrast, accesses all non-NA entries in x
compare also the na.omit(x) function, which is clunkier
the way R's built-in functions historically do (or don't) handle NAs (by default or customizably) is a patchwork-quilt mess, which is why the x[is.na(x)] idiom is so crucial
many useful functions (mean, median, sum, sd) are NA-aware, i.e. they support an na.rm=TRUE option to ignore NA values; cor has no na.rm argument and uses use="complete.obs" instead. The linked list also shows how to define table_, mode_, clamp_
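As a minimal sketch of the idiom described above (the toy values and the variable names mask, only_na, non_na and x_imputed are made up for illustration), each step can be inspected separately:

```r
# Toy vector; the values are invented for this illustration.
x <- c(1.5, NA, -0.7, NA, 2.0)

mask <- is.na(x)        # logical vector, same length as x

only_na <- x[mask]      # keeps the positions where mask is TRUE
non_na  <- x[!mask]     # keeps the positions where mask is FALSE

# Replace the NA positions with the median of the observed values.
x_imputed <- x
x_imputed[is.na(x_imputed)] <- median(x_imputed, na.rm = TRUE)
```

Printing mask next to x makes it easy to see that the TRUE positions line up with the NA entries.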

Related

ifelse r - x and y lengths differ

I'm trying to use an ifelse on an array called "OutComes" but it's giving me some trouble.
> PersonNumber Risk_Factor OC_Death OnsetAge Clinical CS_Death Cure AC_Death
>[1,] 1 1 99.69098 NA NA NA NA NA
>[2,] 2 1 60.68009 NA NA NA NA NA
>[3,] 3 0 88.67483 NA NA NA NA NA
>[4,] 4 0 87.60846 NA NA NA NA NA
>[5,] 5 0 78.23118 NA NA NA NA NA
Now I will try to use an apply to analyse this table's Risk_Factor Column and apply one of two functions to replace the OnsetAge column's NA's.
I've been using an apply function -
apply(OutComes, 1, function(x) ifelse(OutComes[,"Risk_Factor"] == 1,
HighOnsetFunction(x), OnsetFunction(x)))
However, this obviously won't work, as the ifelse itself won't work. The error is:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
I'm not sure what's going on in this ifelse or what the x and y lengths are.
There is a mistake in your apply function. You are applying a function with argument x (one row of OutComes), but then within ifelse you use the vector OutComes[,"Risk_Factor"], which is a whole column of the original matrix, not a single number. One simple solution is to do
apply(OutComes, 1, function(x) ifelse(x["Risk_Factor"] == 1,
HighOnsetFunction(x), OnsetFunction(x)))
But when dealing with a scalar, there is no real need to use ifelse, so it may be more efficient to write
apply(OutComes, 1, function(x) if (x["Risk_Factor"] == 1) HighOnsetFunction(x) else OnsetFunction(x))
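A small runnable sketch of the row-wise pattern follows. The bodies of HighOnsetFunction and OnsetFunction are hypothetical stand-ins (the question does not show them), and the matrix is a cut-down version of the one in the question. The last statement shows a vectorised alternative that works here only because the stand-in functions are simple arithmetic.

```r
# Hypothetical stand-ins for the asker's functions; each receives one row.
HighOnsetFunction <- function(row) row[["OC_Death"]] * 0.5
OnsetFunction     <- function(row) row[["OC_Death"]] * 0.8

OutComes <- cbind(PersonNumber = 1:4,
                  Risk_Factor  = c(1, 1, 0, 0),
                  OC_Death     = c(100, 60, 90, 80))

# Inside the function, 'row' is a named vector holding a single row,
# so row["Risk_Factor"] is one number and a plain if/else is enough.
onset <- apply(OutComes, 1, function(row) {
  if (row["Risk_Factor"] == 1) HighOnsetFunction(row) else OnsetFunction(row)
})

# Vectorised alternative: ifelse() works on whole columns at once.
onset_vec <- ifelse(OutComes[, "Risk_Factor"] == 1,
                    OutComes[, "OC_Death"] * 0.5,
                    OutComes[, "OC_Death"] * 0.8)
```

The result could then be written back with OutComes[, "OnsetAge"] <- onset, assuming that column exists as in the question.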

Indexing integer vector with NA

I have problems understanding this. I have an integer vector of length 5:
x <- 1:5
If I index it with a single NA, the result is of length 5:
x[NA]
# [1] NA NA NA NA NA
My first idea was that R checks whether 1-5 is NA but
x <- c(NA, 2, 4)
x[NA]
# [1] NA NA NA
So this cannot be the explanation. My second idea is that x[NA] is indexing, but then I do not understand:
Why this gives me five NAs
What NA as an index means. x[1] gives you the first value, but what should the result of x[NA] be?
Compare your code:
> x <- 1:5; x[NA]
[1] NA NA NA NA NA
with
> x <- 1:5; x[NA_integer_]
[1] NA
In the first case, NA is of type logical (as class(NA) shows), whereas in the second it's an integer. From ?"[" you can see that when i is logical, it is recycled to the length of x:
For [-indexing only: i, j, ... can be logical vectors, indicating
elements/slices to select. Such vectors are recycled if necessary to
match the corresponding extent. i, j, ... can also be negative
integers, indicating elements/slices to leave out of the selection.
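The recycling rule quoted above can be checked directly. This is only an illustrative sketch; the names recycled, by_lgl and by_int are invented here.

```r
x <- 1:5

na_type     <- typeof(NA)            # plain NA is a logical constant
na_int_type <- typeof(NA_integer_)   # NA_integer_ is an integer constant

# A short logical index is recycled along x: TRUE FALSE TRUE FALSE TRUE
recycled <- x[c(TRUE, FALSE)]

by_lgl <- x[NA]            # logical NA is recycled to length(x)
by_int <- x[NA_integer_]   # a single integer index selects a single element
```

The same reasoning explains the three NAs for the length-3 vector in the question: the logical NA is recycled to length 3.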

Subscript with matrix generated by assign()

I assigned a matrix to a name which varies with j:
j <- 2L
assign(paste0("pca", j,".FAVAR_fcst", sep=""), matrix(ncol=24, nrow=12))
This works nicely. Then I try to access a column of that matrix
paste0("pca", j,".FAVAR_fcst", sep="")[,2]
and get the following error:
Error in paste0("pca", j, ".FAVAR_fcst", sep = "")[, 2] :
incorrect number of dimensions
I've tried several variations and combinations with cat(), print() and capture.output(), but nothing seems to work. I'm not sure what I have to search exactly for and couldn't find a solution. Can you help me?
You can use get, which looks up the object whose name is given as a string:
get(paste0("pca", j,".FAVAR_fcst", sep="")) # for the matrix
get(paste0("pca", j,".FAVAR_fcst", sep=""))[,2] # for the column
# [1] NA NA NA NA NA NA NA NA NA NA NA NA
Another solution would be to combine eval and as.symbol:
eval(as.symbol(paste0("pca", j,".FAVAR_fcst", sep="")))[,2]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA
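The difference between indexing the name (a string) and indexing the object it refers to can be sketched as follows. The named-list variant at the end is not part of the answer; it is a commonly used alternative, shown here as an assumption about what might suit the asker, and the names nm, fcst, col_via_get and col_via_list are made up.

```r
j  <- 2L
nm <- paste0("pca", j, ".FAVAR_fcst")   # just a character string

assign(nm, matrix(0, nrow = 12, ncol = 24))

# nm[, 2] would fail: nm is a length-1 character vector with no dimensions.
# get() returns the object that the string names, and that can be indexed.
col_via_get <- get(nm)[, 2]

# Alternative sketch: store the matrices in a named list instead of
# creating variables whose names are built at run time.
fcst <- list()
fcst[[nm]] <- matrix(0, nrow = 12, ncol = 24)
col_via_list <- fcst[[nm]][, 2]
```

With a list, looping over j only requires changing the key, and no assign or get calls are needed.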

Replacing each element of any object

Is there any clever way to replace each part of any object with some value (for example NA)?
Let's take these objects:
obj1 <- t.test(1:10)
obj2 <- matrix(1:9, 3)
obj3 <- 1:10
obj4 <- list(a = 1:10, b = letters[1:5], c = as.factor(1:10))
the expected output would be similar to
for (i in 1:length(obj1)) obj1[[i]] <- rep(NA, length(obj1[[i]]))
obj2 <- matrix(rep(NA, 9), 3)
obj3 <- rep(NA, 10)
obj4 <- list(a = rep(NA, 10), b = rep(NA, 5), c = rep(NA, 10))
So no matter if an object is a list, matrix, data.frame, vector etc. each part of the object is to be replaced with NA.
Is there any clever way to do so that does not need multiple loops, checking for object type every time and lots of exceptions (if (is.list(part)) ... etc.)?
You can take advantage of the fact that using an empty extraction index during assignment (i.e., x[] <- NA) replaces all elements with the right-hand side value. In your case, you could do something like this using rapply to attack all elements of all objects:
> rapply(mget(ls()), function(x) x[] <- rep(NA, length(x)), how = "replace")
$obj1
$obj1$statistic
[1] NA
$obj1$parameter
[1] NA
$obj1$p.value
[1] NA
$obj1$conf.int
[1] NA NA
$obj1$estimate
[1] NA
$obj1$null.value
[1] NA
$obj1$alternative
[1] NA
$obj1$method
[1] NA
$obj1$data.name
[1] NA
$obj2
[1] NA NA NA NA NA NA NA NA NA
$obj3
[1] NA NA NA NA NA NA NA NA NA NA
$obj4
$obj4$a
[1] NA NA NA NA NA NA NA NA NA NA
$obj4$b
[1] NA NA NA NA NA
$obj4$c
[1] NA NA NA NA NA NA NA NA NA NA
That's a very simple solution, though. You could probably complicate the function being passed to rapply so that it used S3 method dispatch to identify what class of object it was seeing and possibly return a different data structure (e.g., data.frame or matrix) accordingly, rather than just a vector of NAs.
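One detail of the empty-index idiom is worth a sketch: an anonymous function whose body is only x[] <- value returns the right-hand side, not the modified x, which is why obj2 prints as a plain vector in the output above. Returning x explicitly keeps attributes such as dim. The helper names blank_keep and blank_drop are made up for this illustration.

```r
m <- matrix(1:9, 3)

blank_keep <- function(x) { x[] <- NA; x }   # returns the modified object
blank_drop <- function(x) x[] <- NA          # returns only the assigned value

kept    <- blank_keep(m)   # dimensions of m are preserved
dropped <- blank_drop(m)   # just the right-hand side of the assignment

# The same helper passed to rapply keeps the matrix shape of each leaf.
res <- rapply(list(m = m), blank_keep, how = "replace")
```

Whether preserving the shape is desirable depends on the expected output; the question's own expected result for obj2 is a matrix of NAs, which blank_keep reproduces.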

having trouble understanding why this R syntax would work

I'm trying to understand why this R code does a certain transformation.
Df[,"cutoff"] = as.numeric(levels(Df[,"cutoff"]))[Df[,"cutoff"]]
Previously, Df[,"cutoff"] is a factor with 49 levels and now after this operation, it's a vector. I just don't understand this syntax at all. Is there an explanation behind what having as.numeric(levels(Df[,"cutoff"])) does to a factor?
Thanks!
If for any reason your numbers are stored as a factor, some R functions do not treat them as numbers even though they print as numbers. For example, summary will count the cases per level instead of giving the usual six-number summary.
See:
Df=data.frame(cutoff=factor(rep(c(2:6),2)),y=runif(10,12,15))
str(Df)
summary(Df[,"cutoff"])
2 3 4 5 6
2 2 2 2 2
#If you want the levels as numbers
Df[,"cutoff"] = as.numeric(levels(Df[,"cutoff"]))[Df[,"cutoff"]]
summary(Df[,"cutoff"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 3 4 4 5 6
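The result is a vector of NAs if the factor's levels are not numbers stored as text.
# factor() is explicit here: since R 4.0, data.frame() no longer converts strings to factors by default
df <- data.frame(cutoff = factor(letters[1:26]))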
as.numeric(levels(df[,"cutoff"]))[df[,"cutoff"]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA ...
# Warning message:
# NAs introduced by coercion
Let's break it down. This shows you the levels of the factor, returned as a character vector:
levels(df[,"cutoff"])
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" ...
This tries to convert that character vector to numeric (which it can't for letters, and therefore returns NA for each one)
as.numeric(levels(df[,"cutoff"]))
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA ...
# Warning message:
# NAs introduced by coercion
Now, adding the last element [df[,"cutoff"]]: this indexes the converted levels by the factor df[,"cutoff"]. A factor used as an index is treated as its underlying integer codes, so each element picks out the converted value of its own level. Here every converted value is NA, so you wouldn't see any difference.
as.numeric(levels(df[,"cutoff"]))[df[,"cutoff"]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA ...
# Warning message:
# NAs introduced by coercion
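With levels that do look like numbers, the effect of the idiom is easier to see. The following is a small sketch with made-up values; codes, values and same are names chosen here for illustration.

```r
f <- factor(c("10", "2", "10", "5"))

lev <- levels(f)          # the distinct labels, sorted as text

# as.integer() on a factor gives the internal codes (positions in levels(f)),
# not the numbers that are printed.
codes <- as.integer(f)

# Convert each level once, then look up every element's value by its code.
values <- as.numeric(levels(f))[f]

# Equivalent route: convert every element's label individually.
same <- as.numeric(as.character(f))
```

Comparing codes with values shows why calling as.numeric directly on a factor is a common source of silent errors.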
