Replacing values by index with data.table syntax - r

assume we have data.table d1 with 6 rows:
d1 <- data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5))
we add a column to d1 called test, and fill it with NA
d1$test <- NA
the external vector rows gives the index of rows we want to fill with values contained in vals
rows <- c(5,6)
vals <- c(6,3)
how do you do this in data table syntax? i have not been able to figure this out from the documentation.
it seems like this should work, but it does not:
d1[rows, test := vals]
the following error is returned:
Warning: 6.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical' (column 3 named 'test')
This is my desired outcome:
data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5), test = c(NA,NA,NA,NA,6,3))

Let's walk through this:
d1 <- data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5))
d1$test <- NA
rows <- c(5,6)
vals <- c(6,3)
d1[rows, test := vals]
# Warning in `[.data.table`(d1, rows, `:=`(test, vals)) :
# 6.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical' (column 3 named 'test')
class(d1$test)
# [1] "logical"
class(vals)
# [1] "numeric"
R can be quite "sloppy" in general, allowing one to coerce values from one class to another. Typically, this is from integer to floating point, sometimes from number to string, sometimes logical to number, etc. R does this freely, at times unexpectedly, and often silently. For instance,
13 > "2"
# [1] FALSE
The LHS is of class numeric, the RHS character. Because the classes differ, R silently converts 13 to "13" and then does the comparison. A string comparison is lexicographic, letter by letter: it first compares the "1" with the "2", determines the result is unambiguously false, and stops the comparison (since no later letter can change the result). Neither the fact that the numeric comparison of the two would be different, nor the fact that the RHS has no more letters to compare (lengths themselves are not compared), matters.
So R can be quite sloppy about this; not all languages are this permissive (most are not, in my experience), and it can be risky in unsupervised (automated) situations. It often produces unexpected results. Because of this, many (including the devs of data.table and dplyr, to name two) "encourage" (force) the user to be explicit about class coercion.
As a side note: R has at least 8 different classes of NA, and all of them look like NA:
str(list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_,
Sys.Date()[NA], Sys.time()[NA], as.POSIXlt(Sys.time())[NA]))
# List of 8
# $ : logi NA
# $ : int NA
# $ : num NA
# $ : chr NA
# $ : cplx NA
# $ : Date[1:1], format: NA
# $ : POSIXct[1:1], format: NA
# $ : POSIXlt[1:1], format: NA
There are a few ways to fix that warning.
Instantiate the test column as a "real" (numeric, floating-point) version of NA:
# starting with a fresh `d1` without `test` defined
d1$test <- NA_real_
d1[rows, test := vals] # works, no warning
Instantiate the test column programmatically, matching the class of vals without using the literal NA_real_:
# starting with a fresh `d1` without `test` defined
d1$test <- vals[1][NA]
d1[rows, test := vals] # works, no warning
Convert the existing test column in its entirety (not subsetted) to the desired class:
d1$test <- NA # this one is class logical
d1[, test := as.numeric(test)] # converts from NA to NA_real_
d1[rows, test := vals] # works, no warning
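Putting fix #1 together as one runnable sketch, re-creating the question's data from scratch:

```r
library(data.table)

# Re-create the question's data, with `test` born numeric instead of logical
d1 <- data.table(v1 = c(1, 2, 3, 4, 5, 6), v2 = rep(5, 6))
d1$test <- NA_real_           # numeric NA, so no class clash below
rows <- c(5, 6)
vals <- c(6, 3)

d1[rows, test := vals]        # assigns silently, no warning
d1$test
# [1] NA NA NA NA  6  3
```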
Things that work but are still being sloppy:
replace allows us to do this, but it is silently internally coercing from logical to numeric:
d1$test <- NA # logical class
d1[, test := replace(test, .I %in% rows, vals)]
This works because the internals of replace are simple:
function (x, list, values)
{
    x[list] <- values
    x
}
The reassignment to x[list] causes R to coerce the entire vector from logical to numeric, and it returns the whole vector at once. In data.table, assigning to the whole column at once allows this, since it is a common operation to change the class of a column.
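A minimal base-R demonstration of that whole-vector coercion, outside of data.table:

```r
x <- rep(NA, 6)               # logical vector, like the `test` column
class(x)
# [1] "logical"

y <- replace(x, c(5, 6), c(6, 3))
class(y)                      # the x[list] <- values step coerced everything
# [1] "numeric"
y
# [1] NA NA NA NA  6  3
```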
As a side note, some might be tempted to use base::ifelse to fix things here. It can be made to work, but it further demonstrates the sloppiness of R here (and more so of ifelse, which, while convenient, is broken in a few ways).
base::ifelse doesn't work here out of the box because we'd need vals to be the same length as the number of rows in d1. Even if that were the case, though, ifelse also silently coerces the class of one argument or the other. Imagine these scenarios:
ifelse(c(TRUE, TRUE), pi, "pi")
# [1] 3.141593 3.141593
ifelse(c(TRUE, FALSE), pi, "pi")
# [1] "3.14159265358979" "pi"
The moment one of the conditions is false in this case, the whole result changes from numeric to character, and there was no message or warning to that effect. It is because of this that data.table::fifelse (and dplyr::if_else) will fail preemptively:
fifelse(c(TRUE, TRUE), pi, "pi")
# Error in fifelse(c(TRUE, TRUE), pi, "pi") :
# 'yes' is of type double but 'no' is of type character. Please make sure that both arguments have the same type.
(There are other issues with ifelse, not just this, caveat emptor.)
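One such other issue, as a brief aside: ifelse builds its result from the test vector, so attributes such as the "Date" class are silently dropped, leaving the bare day count.

```r
d <- as.Date("2020-01-01")
ifelse(TRUE, d, d)            # not a Date anymore: just the underlying number
# [1] 18262
```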

Related

Apply log2 transformation only to numeric columns of a data.frame

I am trying to run a log2 transformation on my data set but I keep getting an error that says "non-numeric variable(s) in data frame". My data has row.names = 1 and header = TRUE and is of class data.frame()
I tried adding lapply(na.strings) but this does not fix the problem
Shared_DEGs <- cbind(UT.Degs_heatmap[2:11], MT.Degs_heatmap[2:11], HT.Degs_heatmap[2:11])
Shared_DEGs1 <- `row.names<-`(Shared_DEGs, (UT.Degs_heatmap[,1]))
MyData.INF.log2 <- log2(Shared_DEGs1)
The data should be log2 transformed as an output
I always recommend using the 'tidyverse' to process data frames. Install it with install.packages('tidyverse')
library(tidyverse)
log2_transformed <- mutate_if(your_data, is.numeric, log2)
Yet another way, using base R's rapply, kindly using the data provided by @r2evans:
rapply(mydf, f = log2, classes = c("numeric", "integer"), how = "replace")
# num int chr lgl
#1 1.651496 2.321928 A TRUE
Do not try to run log2 (or other numeric computations) on a data.frame as a whole, instead you need to do it per column. Since we don't have your data, I'll generate something to fully demonstrate:
mydf <- data.frame(num = pi, int = 5L, chr = "A", lgl = TRUE, stringsAsFactors = FALSE)
mydf
# num int chr lgl
# 1 3.141593 5 A TRUE
isnum <- sapply(mydf, is.numeric)
isnum
# num int chr lgl
# TRUE TRUE FALSE FALSE
mydf[,isnum] <- lapply(mydf[,isnum], log2)
mydf
# num int chr lgl
# 1 1.651496 2.321928 A TRUE
What I'm doing here:
isnum flags the subset of columns that are numeric (integer or float). This logical indexing can be extended with conditions like "nothing negative" or "no NAs", completely up to you.
mydf[,isnum] subsets the data to just those columns.
lapply(mydf[,isnum], log2) runs log2 against each column of the sub-frame individually; what is passed to log2 is a vector of numbers, not a whole data.frame as in your attempt.
mydf[,isnum] <- lapply(...): normally, if we did mydf <- lapply(...), we would be storing a list, overwriting the previous instance (losing the non-numeric columns) and no longer a frame. By using the underlying replacement function [<- (which assigns to a subset), we replace just those components of the frame, (a) preserving the other columns and (b) without losing the "class" of the parent frame.
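The logical index is easy to extend. For instance, a sketch (with made-up column names) that also skips numeric columns containing non-positive values, where log2 would give NaN or -Inf:

```r
mydf <- data.frame(num = pi, int = -5L, chr = "A", stringsAsFactors = FALSE)

# keep a column only if it is numeric AND strictly positive throughout
ok <- sapply(mydf, function(col) is.numeric(col) && all(col > 0, na.rm = TRUE))
mydf[ok] <- lapply(mydf[ok], log2)
mydf
#        num int chr
# 1 1.651496  -5   A
```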

R: Applying function to DataFrame

I have following code:
library(Ecdat)
data(Fair)
Fair[1:5,]
x1 = function(x){
  mu = mean(x)
  l1 = list(s1 = table(x), std = sd(x))
  return(list(l1, mu))
}
mylist <- as.list(Fair$occupation,
Fair$education)
x1(mylist)
What I wanted is for x1 to output the result for the items selected in mylist. However, I get the warning In mean.default(x) : argument is not numeric or logical: returning NA.
You need to use lapply if you're passing a list to a function:
output<-lapply(mylist,FUN=x1)
This will process your function x1 for each element in mylist and return a list of results to output.
Here, mylist is not created in the correct way, and a list is not needed anyway, as a data.frame is already a list with columns of equal length. So, just subset the columns of interest and apply the function:
lapply(Fair[c("occupation", "education")], x1)
In the OP's code, as.list simply creates a list of length 601 with only a single element in each.
str(mylist)
#List of 601
#$ : int 7
#$ : int 6
#$ : int 1
#...
#...
Another problem in the code is that it is not even considering the 2nd argument. Using a simple example
as.list(1:3, 1:2)
#[[1]]
#[1] 1
#[[2]]
#[1] 2
#[[3]]
#[1] 3
The second argument is not at all considered. It could have been
list(1:3, 1:2)
#[[1]]
#[1] 1 2 3
#[[2]]
#[1] 1 2
But for data.frame columns, we don't need to explicitly call the list as it is a list of vectors that have equal length.
Regarding the error in OP's post, mean works on vectors and not on list or data.frame.
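To see the fixed pattern run end to end without needing the Ecdat package, here is the same idea against the built-in mtcars data, with cyl and gear standing in for occupation and education:

```r
x1 <- function(x) {
  mu <- mean(x)
  l1 <- list(s1 = table(x), std = sd(x))
  list(l1, mu)
}

# each column is passed to x1 as a plain numeric vector
out <- lapply(mtcars[c("cyl", "gear")], x1)
out$cyl[[2]]                 # the mean component for cyl
# [1] 6.1875
```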

Why does x[NA] yield an NA vector the same length as x?

The code is like this
x <- 1:5
x[NA]
Why does it produce 5 NAs?
The answer to this question has two sides:
How is NA interpreted when indexing matrices?
In one of the links provided by @alexis_laz, I found a very well structured explanation of how TRUE, FALSE and NA are interpreted when indexing matrices:
Logical indices tell R which elements to include or exclude.
You have three options: TRUE, FALSE and NA
They serve to indicate whether or not the index represented in that position should be included. In other words:
TRUE == "Include the element at this index"
FALSE == "Do not include the element at this index"
NA == "Return NA instead of this index" # loosely speaking
For example:
x <- 1:6
x[ c(TRUE, FALSE, TRUE, NA, TRUE, FALSE)]
# [1] 1 3 NA 5
An important detail is that the default storage mode for an isolated NA value is logical (try typeof(NA)). You can choose the storage mode of the NA by using NA_integer_, NA_real_ (for double), NA_complex_ or NA_character_.
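This storage-mode point is easy to verify at the console:

```r
typeof(NA)            # the bare literal NA is logical
# [1] "logical"
typeof(NA_integer_)
# [1] "integer"
typeof(NA_real_)
# [1] "double"
```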
Why 5 NA and not just 1?
When the length of the indices is smaller than the length of vector x, the indexing will start over to also index the values in x that have not been indexed yet. In other words, R will automatically "recycle" the indices:
(...) However, standard recycling rules apply. So in the previous example, if we drop the last FALSE, the index vector is recycled, the first element of the index is TRUE, and hence the 6th element of x is now included
x <- 1:6
x[c(TRUE, FALSE, TRUE, NA, TRUE)]
# [1] 1 3 NA 5 6
Recall the detail about the storage mode from the previous section. If you type x[NA_integer_], then you will find a different result.
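For comparison, a quick sketch of both indexing modes side by side:

```r
x <- 1:6
x[NA_integer_]   # integer index: a single NA, no recycling
# [1] NA
x[NA]            # logical index: recycled to the length of x
# [1] NA NA NA NA NA NA
```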

Unexpected behaviour of function table with "NaN" values

Recently, I've faced a behaviour in table function that was not what I was expected:
For example, let take the following vector:
ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)
If I check for NA values in my vector, "NaN" is not considered one (as expected):
is.na(ex_vec)
# [1] FALSE FALSE FALSE FALSE FALSE TRUE
But if I tried to get the different values frequencies:
table(ex_vec)
#ex_vec
#Nan Non Oui
# 1 2 1
"NaN" does not appear in the table.
However, if I "ask" table to show the NA values, I get this:
table(ex_vec, useNA="ifany")
#ex_vec
# Nan NaN Non Oui <NA>
# 1 1 2 1 1
So the character string "NaN" is treated as an NA value inside the table call, while being treated in the output as a non-NA value.
I know it would be better to convert my vector to a factor (and I could solve my problem that way), but nonetheless I'd really like to know what's going on here. Does anyone have an idea?
When factor matches up levels for a vector it converts its exclude list to the same type as the input vector:
exclude <- as.vector(exclude, typeof(x))
so if your exclude list has NaN and your vector is character, this happens:
as.vector(exclude, typeof(letters))
[1] NA "NaN"
Oh dear. Now the real "NaN" strings will be excluded.
To fix, use exclude=NA in table (and factor if you are making factors that might hit this).
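Applied to the example vector (checking the counts by name, since the level ordering can depend on your locale's collation):

```r
ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)
tb <- table(ex_vec, exclude = NA)   # only the real NA is dropped now
tb[["NaN"]]                         # the "NaN" string is counted this time
# [1] 1
sum(tb)                             # all 5 non-NA entries are tallied
# [1] 5
```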
I do love this in the docs for factor:
There are some anomalies associated with factors that have ‘NA’ as
a level. It is suggested to use them sparingly, e.g., only for
tabulation purposes.
Reassuring...
My first idea was to have a look at the definition of table, which starts with:
> table
function (..., exclude = if (useNA == "no") c(NA, NaN), useNA = c("no",
"ifany", "always"), dnn = list.names(...), deparse.level = 1)
{
Sounds logical: by default, table excludes NA and NaN.
Digging within table's code, we see that if x is not a factor, it coerces it to one (nothing new here; it's said in the doc):
else {
a <- factor(a, exclude = exclude)
I didn't find anything else which could have impacted the input to coerce "NaN" into NA values.
So looking into factor to get the why we find the root cause:
> factor
function (x = character(), levels, labels = levels, exclude = NA,
ordered = is.ordered(x), nmax = NA)
{
[...] # Snipped for brevity
exclude <- as.vector(exclude, typeof(x))
x <- as.character(x)
levels <- levels[is.na(match(levels, exclude))] # `levels`, defined in the snipped part above, holds the sorted unique values of the input vector, coerced to character
f <- match(x, levels)
[...]
f
}
Here we have it: the exclude parameter, even though it contains NA values, is coerced into a character vector.
So what happens is:
> ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)
> excludes<-c(NA,NaN)
> as.vector(excludes,"character")
[1] NA "NaN"
> match(ex_vec,as.vector(excludes,"character"))
[1] NA NA NA NA 2 1
We do match the character "NaN", as the exclude vector has been coerced to character before the comparison.

Why is there no NA_logical_

From help("NA"):
There are also constants NA_integer_, NA_real_, NA_complex_ and
NA_character_ of the other atomic vector types which support missing
values: all of these are reserved words in the R language.
My question is why there is no NA_logical_ or similar, and what to do about it.
Specifically, I am creating several large very similar data.tables, which should be class compatible for later rbinding. When one of the data.tables is missing a variable, I am creating that column but with it set to all NAs of the particular type. However, for a logical I can't do that.
In this case, it probably doesn't matter too much (data.table dislikes coercing columns from one type to another, but it also dislikes adding rows, so I have to create a new table to hold the rbound version anyway), but I'm puzzled as to why the NA_logical_, which logically should exist, does not.
Example:
library(data.table)
Y <- data.table( a=NA_character_, b=rep(NA_integer_,5) )
Y[ 3, b:=FALSE ]
Y[ 2, a:="zebra" ]
> Y
a b
1: NA NA
2: zebra NA
3: NA 0
4: NA NA
5: NA NA
> class(Y$b)
[1] "integer"
Two questions:
Why doesn't NA_logical_ exist, when its relatives do?
What should I do about it in the context of data.table, or just to avoid coercion as much as possible? I assume using NA_integer_ buys me little in terms of coercion (it will coerce the logical values I'm adding to 0L/1L, which isn't terrible, but isn't ideal).
NA is already logical so NA_logical_ is not needed. Just use NA in those situations where you need a missing logical. Note:
> typeof(NA)
[1] "logical"
Since the NA_*_ names are all reserved words there was likely a desire to minimize the number of them.
Example:
library(data.table)
X <- data.table( a=NA_character_, b=rep(NA,5) )
X[ 3, b:=FALSE ]
> X
a b
1: NA NA
2: NA NA
3: NA FALSE
4: NA NA
5: NA NA
I think based on this
#define NA_LOGICAL R_NaInt
from $R_HOME/R/include/R_ext/Arith.h we can suggest using NA_integer_ or NA_real_, and hence plain old NA in R code:
R> as.logical(c(0,1,NA))
[1] FALSE TRUE NA
R>
R> as.logical(c(0L, 1L, NA_integer_))
[1] FALSE TRUE NA
R>
which has
R> class(as.logical(c(0,1,NA)))
[1] "logical"
R>
R> class(as.logical(c(0, 1, NA_real_)))
[1] "logical"
R>
Or am I misunderstanding your question? R's logical type is three-valued: yay, nay, and missing. And we can use the NA from either integer or numeric to cast. Does that help?
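For the rbind use case in the question, a small sketch (column names invented) showing that a plain NA column stays logical and binds cleanly with a real logical column:

```r
library(data.table)

A <- data.table(id = 1:2, flag = NA)              # flag is born logical
B <- data.table(id = 3:4, flag = c(TRUE, FALSE))  # flag is logical here too

out <- rbind(A, B)                                # no coercion needed
class(out$flag)
# [1] "logical"
out$flag
# [1]    NA    NA  TRUE FALSE
```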
