Apply log2 transformation only to numeric columns of a data.frame - r

I am trying to run a log2 transformation on my data set but I keep getting an error that says "non-numeric variable(s) in data frame". My data has row.names = 1 and header = TRUE and is of class data.frame()
I tried adding lapply(na.strings), but this does not fix the problem.
Shared_DEGs <- cbind(UT.Degs_heatmap[2:11], MT.Degs_heatmap[2:11], HT.Degs_heatmap[2:11])
Shared_DEGs1 <- `row.names<-`(Shared_DEGs, (UT.Degs_heatmap[,1]))
MyData.INF.log2 <- log2(Shared_DEGs1)
The data should be log2 transformed as an output

I always recommend using the 'tidyverse' to process data frames. Install it with install.packages('tidyverse').
library(tidyverse)
log2_transformed <- mutate_if(your_data, is.numeric, log2)

Yet another way, using base R's rapply and the data kindly provided by @r2evans.
rapply(mydf, f = log2, classes = c("numeric", "integer"), how = "replace")
# num int chr lgl
#1 1.651496 2.321928 A TRUE

Do not try to run log2 (or other numeric computations) on a data.frame as a whole, instead you need to do it per column. Since we don't have your data, I'll generate something to fully demonstrate:
mydf <- data.frame(num = pi, int = 5L, chr = "A", lgl = TRUE, stringsAsFactors = FALSE)
mydf
# num int chr lgl
# 1 3.141593 5 A TRUE
isnum <- sapply(mydf, is.numeric)
isnum
# num int chr lgl
# TRUE TRUE FALSE FALSE
mydf[,isnum] <- lapply(mydf[,isnum], log2)
mydf
# num int chr lgl
# 1 1.651496 2.321928 A TRUE
What I'm doing here:
isnum is a logical vector flagging which columns are numeric (integer or float). This logical indexing can be extended to include things like "nothing negative" or "no NAs", completely up to you.
mydf[,isnum] subsets the data to just those columns
lapply(mydf[,isnum], log2) runs the function log2 against each column of the sub-frame, each column individually; what is passed to log2 is a vector of numbers, not a data.frame as in your attempt
mydf[,isnum] <- lapply(...): normally, if we did mydf <- lapply(...), we would be storing a list, which overwrites the previous instance (losing the non-numeric columns) and is no longer a frame. By using the underlying R function [<- (which assigns to a subset), we replace the components of the frame while (a) preserving the other columns and (b) keeping the "class" of the parent frame.
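The steps above can be wrapped in a small function. This is only a sketch: the name log2_numeric is hypothetical, and it swaps mydf[,isnum] for the list-style mydf[isnum], because mydf[,isnum] returns a plain vector rather than a frame when exactly one column is numeric.

```r
# Hypothetical helper: apply log2 to the numeric columns of a data frame only.
log2_numeric <- function(df) {
  # One TRUE/FALSE per column
  isnum <- vapply(df, is.numeric, logical(1))
  # List-style indexing (no comma) keeps a data frame even for a single column
  df[isnum] <- lapply(df[isnum], log2)
  df
}

# Example frame with exactly one numeric column
one_num <- data.frame(num = c(1, 2, 8), chr = c("A", "B", "C"),
                      stringsAsFactors = FALSE)
res <- log2_numeric(one_num)
res$num
# [1] 0 1 3
```

The non-numeric column chr is carried through untouched.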

Related

Replacing values by index with data.table syntax

assume we have data.table d1 with 6 rows:
d1 <- data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5))
we add a column to d1 called test, and fill it with NA
d1$test <- NA
the external vector rows gives the index of rows we want to fill with values contained in vals
rows <- c(5,6)
vals <- c(6,3)
how do you do this in data table syntax? i have not been able to figure this out from the documentation.
it seems like this should work, but it does not:
d1[rows, test := vals]
the following warning is returned:
Warning: 6.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical' (column 3 named 'test')
This is my desired outcome:
data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5), test = c(NA,NA,NA,NA,6,3))
Let's walk through this:
d1 <- data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5))
d1$test <- NA
rows <- c(5,6)
vals <- c(6,3)
d1[rows, test := vals]
# Warning in `[.data.table`(d1, rows, `:=`(test, vals)) :
# 6.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical' (column 3 named 'test')
class(d1$test)
# [1] "logical"
class(vals)
# [1] "numeric"
R can be quite "sloppy" in general, allowing one to coerce values from one class to another. Typically, this is from integer to floating point, sometimes from number to string, sometimes logical to number, etc. R does this freely, at times unexpectedly, and often silently. For instance,
13 > "2"
# [1] FALSE
The LHS is of class numeric, the RHS character. Because of the different classes, R silently converts 13 to "13" and then does the comparison. In this case, the string comparison is lexicographic, which is letter-by-letter, meaning that it first compares the "1" with the "2", determines that it is unambiguously not true, and stops the comparison (since no other letter will change the result). Neither the fact that the numeric comparison of the two would give a different answer, nor the fact that the RHS has no more letters to compare (lengths themselves are not compared), matters.
So R can be quite sloppy about this; not all languages are this permissive (most are not, in my experience), and this can be risky in unsupervised (automated) situations. It often produces unexpected results. Because of this, many (including devs of data.table and dplyr, to name two) "encourage" (force) the user to be explicit about class coercion.
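As a small illustration of the point, the same two values compare differently once the conversion is made explicit. The variable names here are only for the example.

```r
# Mixed comparison: the number is converted to the string "13" first
as_strings <- 13 > "2"
# Explicit conversion of the string keeps the comparison numeric
as_numbers <- 13 > as.numeric("2")

as_strings
# [1] FALSE
as_numbers
# [1] TRUE
```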
As a side note: R has at least 8 different classes of NA, and all of them look like NA:
str(list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_,
Sys.Date()[NA], Sys.time()[NA], as.POSIXlt(Sys.time())[NA]))
# List of 8
# $ : logi NA
# $ : int NA
# $ : num NA
# $ : chr NA
# $ : cplx NA
# $ : Date[1:1], format: NA
# $ : POSIXct[1:1], format: NA
# $ : POSIXlt[1:1], format: NA
There are a few ways to fix that warning.
Instantiate the test column as a "real" (numeric, floating-point) version of NA:
# starting with a fresh `d1` without `test` defined
d1$test <- NA_real_
d1[rows, test := vals] # works, no warning
Instantiate the test column programmatically, matching the class of vals without using the literal NA_real_:
# starting with a fresh `d1` without `test` defined
d1$test <- vals[1][NA]
d1[rows, test := vals] # works, no warning
Convert the existing test column in its entirety (not subsetted) to the desired class:
d1$test <- NA # this one is class logical
d1[, test := as.numeric(test)] # converts from NA to NA_real_
d1[rows, test := vals] # works, no warning
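For completeness, the first option can also be written entirely in data.table syntax, creating the column by reference instead of with $<-. This is a sketch that assumes the data.table package is installed and reuses the question's d1, rows and vals.

```r
library(data.table)

d1 <- data.table(v1 = c(1, 2, 3, 4, 5, 6), v2 = c(5, 5, 5, 5, 5, 5))
rows <- c(5, 6)
vals <- c(6, 3)

# Create the column by reference, already typed as double
d1[, test := NA_real_]
# Fill only the selected rows; the types match, so no warning is raised
d1[rows, test := vals]

d1$test
# [1] NA NA NA NA  6  3
```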
Things that work but are still being sloppy:
replace allows us to do this, but it is silently internally coercing from logical to numeric:
d1$test <- NA # logical class
d1[, test := replace(test, .I %in% rows, vals)]
This works because the internals of replace are simple:
function (x, list, values)
{
x[list] <- values
x
}
The reassignment to x[list] causes R to coerce the entire vector from logical to numeric, and it returns the whole vector at once. In data.table, assigning to the whole column at once allows this, since it is a common operation to change the class of a column.
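The silent coercion is easy to see on a plain vector, outside of data.table. A minimal sketch in base R:

```r
x <- c(NA, NA, NA)
before <- class(x)   # the bare NA literal is logical
x[2] <- 3            # sub-assignment of a double coerces the whole vector
after <- class(x)

before
# [1] "logical"
after
# [1] "numeric"

# replace() does the same thing internally
y <- replace(c(NA, NA, NA), 2:3, c(6, 3))
y
# [1] NA  6  3
```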
As a side note, some might be tempted to use ifelse instead of replace to fix things here. base::ifelse can be made to work, but it further demonstrates the sloppiness of R here (and more so in ifelse, which, while convenient, is broken in a few ways).
base::ifelse doesn't work here out of the box because we'd need vals to be the same length as the number of rows in d1. Even if that were the case, though, ifelse also silently coerces the class of one or the other. Imagine these scenarios:
ifelse(c(TRUE, TRUE), pi, "pi")
# [1] 3.141593 3.141593
ifelse(c(TRUE, FALSE), pi, "pi")
# [1] "3.14159265358979" "pi"
The moment one of the conditions is false in this case, the whole result changes from numeric to character, and there was no message or warning to that effect. It is because of this that data.table::fifelse (and dplyr::if_else) will fail preemptively:
fifelse(c(TRUE, TRUE), pi, "pi")
# Error in fifelse(c(TRUE, TRUE), pi, "pi") :
# 'yes' is of type double but 'no' is of type character. Please make sure that both arguments have the same type.
(There are other issues with ifelse, not just this, caveat emptor.)

Why does indexing with a single character index work on a data frame but not a matrix?

In data frames, [-indexing can be performed using a single character. E.g. mtcars["mpg"].
On the other hand, trying the same on a matrix results in NA, e.g.
m = cbind(A = 1:5, B = 1:5)
m["A"]
# NA
...implying that this is somehow an invalid way to subset a matrix.
Is this normal R behavior? If so, where is it documented?
cbind() creates a matrix, by default. mtcars is a data frame.
class(cbind(A = 1:5, B = 1:5))
# [1] "matrix" "array"
class(mtcars)
# [1] "data.frame"
Because data frames are built as lists of columns, dataframe["column_name"], using one argument in [, defaults to treating the data frame as a list, allowing you to select columns, mostly the same as dataframe[, "column_name"].
A matrix has no such list underpinnings, so if you use [ with one argument, it doesn't assume you want columns. Use matrix[, "column_name"] to select columns from a matrix.
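To make the difference concrete, here is a short side-by-side sketch. The names dat and m are just for the example, and the frame is converted with as.matrix so both objects hold the same values.

```r
dat <- data.frame(A = 1:5, B = 11:15)
m   <- as.matrix(dat)

col_df  <- dat["A"]                 # list-style: a one-column data frame
vec_df  <- dat[["A"]]               # the column itself, as a vector
vec_m   <- m[, "A"]                 # matrix-style: needs the comma
col_m   <- m[, "A", drop = FALSE]   # keeps the result as a 5 x 1 matrix
by_pos  <- m[7]                     # a single index counts down the columns
by_name <- m["A"]                   # no names attribute to match, so NA

by_pos
# [1] 12
```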
cbind is a bad way to create data frames from scratch. You can specify cbind.data.frame(A = 1:5, B = 1:5), but it's simpler and clearer to use data.frame(A = 1:5, B = 1:5). However, if you are adding multiple columns to an existing data frame then cbind(my_data_frame, A = 1:5, B = 1:5) is fine, and will result in a data frame as long as one of the arguments is already a data frame.
This behaviour is documented in ?"[", section "Matrices and arrays":
Matrices and arrays are vectors with a dimension attribute and so
all the vector forms of indexing can be used with a single index.
It means that if you use just a single index, the object to subset is treated as an object without dimensions and so if the index is a character vector, the method will look for the names attribute, which is absent in this case (try names(m) on your matrix to check this). What you did in the question is totally equivalent to (c(1:5, 1:5))["A"]. If you use a double index instead, the method will search for the dimnames attribute to subset. Even if confusing, a matrix may have both names and dimnames. Consider this:
m<-matrix(c(1:5,1:5), ncol = 2, dimnames = list(LETTERS[1:5], LETTERS[1:2]))
names(m)<-LETTERS[1:10]
#check whether the attributes are set
str(m)
# int [1:5, 1:2] 1 2 3 4 5 1 2 3 4 5
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# ..$ : chr [1:2] "A" "B"
# - attr(*, "names")= chr [1:10] "A" "B" "C" "D" ...
We have set rownames, colnames and names. Let's subset it:
#a column
m[,"A"]
#A B C D E
#1 2 3 4 5
#a row
m["A",]
# A B
#1 1
#an element
m["A"]
#A
#1
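There are two cases here. A matrix is stored as an atomic vector:
m = cbind(A = 1:5, B = 11:15)
typeof(m)
# [1] "integer"
whereas a data frame is stored as a list:
typeof(mtcars)
# [1] "list"
So the two are indexed differently. The matrix case needs a comma:
cbind(A = 1:5, B = 11:15)[,"A"]
# [1] 1 2 3 4 5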

return index of all factor variables that don't have a predefined name

I'm trying to write a function that will return the index of all binary variables in a data frame, with the exception of a predefined variable or list of variables supplied. You can generate example data with this:
data<-data.frame("RESPONSE" = sample(c("YES","NO"),100,replace = T),
"FACTOR" = sample(c("YES","NO","MAYBE"),100,replace = T),
"BINARY" = sample(c("YES","NO"),100,replace = T),
"NUMERIC" = sample(1:100,100,replace = T))
In this case the predefined variable to ignore is "RESPONSE"
response.variable.name<-"RESPONSE"
I can get the list of all the binary variables using:
sapply(data,function(x) nlevels(as.factor(x))==2)
and the list of all variables not named "RESPONSE" using:
!names(data) %in% response.variable.name
but the output I'm looking for ignores the predefined column or list of columns and would return the same output as you would get with:
names(data)=="BINARY"
I thought of using the two conditions inside the sapply function, but names(x) inside sapply returns NULL values. I know there's an easy fix for this problem.
## Desired result?
names(data)=="BINARY"
# [1] FALSE FALSE TRUE FALSE
## Desired method
response.variable.name<-"RESPONSE"
sapply(data,function(x) nlevels(as.factor(x))==2) & !names(data) %in% response.variable.name
# RESPONSE FACTOR BINARY NUMERIC
# FALSE FALSE TRUE FALSE
## same values, has names too (bonus!)
## wrap in `unname()` if you don't like names
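Since the question asks for a function that returns the index, the expression above could be wrapped as follows. This is a sketch: the name binary_col_index, its exclude argument and the small deterministic example frame are all made up for illustration.

```r
# Hypothetical helper: positions of columns with exactly two distinct levels,
# skipping any column named in `exclude`.
binary_col_index <- function(df, exclude = character(0)) {
  is_binary <- vapply(df, function(x) nlevels(factor(x)) == 2, logical(1))
  which(is_binary & !names(df) %in% exclude)
}

example_df <- data.frame(RESPONSE = c("YES", "NO", "YES", "NO"),
                         FACTOR   = c("YES", "NO", "MAYBE", "NO"),
                         BINARY   = c("NO", "YES", "YES", "NO"),
                         NUMERIC  = 1:4)
idx <- binary_col_index(example_df, exclude = "RESPONSE")
idx
# BINARY 
#      3 
```

which() turns the logical vector into positions; dropping it returns the TRUE/FALSE vector shown earlier.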
We can use Map from base R
unlist(Map(function(x, y) nlevels(factor(x)) == 2 &
y != response.variable.name, data, names(data)))
# RESPONSE FACTOR BINARY NUMERIC
# FALSE FALSE TRUE FALSE
Or using imap
library(tidyverse)
data %>%
imap_lgl(~ nlevels(factor(.x)) == 2 & .y != response.variable.name)
# RESPONSE FACTOR BINARY NUMERIC
# FALSE FALSE TRUE FALSE

Names of variables inside the 'for loop' [duplicate]

I am trying to create a function that allows the conversion of selected columns of a data frame to categorical data type (factor) before running a regression analysis.
Question is how do I slice a particular column from a data frame using a string (character).
Example:
strColumnNames <- "Admit,Rank"
strDelimiter <- ","
strSplittedColumnNames <- strsplit(strColumnNames, strDelimiter)
for( strColName in strSplittedColumnNames[[1]] ){
dfData$as.name(strColName) <- factor(dfData$get(strColName))
}
Tried:
dfData$as.name()
dfData$get(as.name())
dfData$get()
Error Msg:
Error: attempt to apply non-function
Any help would be greatly appreciated! Thank you!!!
You need to change
dfData$as.name(strColName) <- factor(dfData$get(strColName))
to
dfData[[strColName]] <- factor(dfData[[strColName]])
You may read ?"[[" for more.
In your case, column names are generated programmatically, so [[ ]] is the way to go. Maybe this example will be clear enough to illustrate the problem with $:
dat <- data.frame(x = 1:5, y = 2:6)
z <- "x"
dat$z
# NULL
dat[[z]]
# [1] 1 2 3 4 5
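Putting the fix back into the loop from the question gives something like the following. The contents of dfData are invented here so the snippet can run on its own; only the Admit and Rank column names come from the question.

```r
# Made-up stand-in for the asker's data
dfData <- data.frame(Admit = c(0, 1, 1), Rank = c(1, 2, 3), Gre = c(380, 660, 800))

strColumnNames <- "Admit,Rank"
strDelimiter <- ","

# [[ ]] accepts the column name held in a variable; $ does not
for (strColName in strsplit(strColumnNames, strDelimiter, fixed = TRUE)[[1]]) {
  dfData[[strColName]] <- factor(dfData[[strColName]])
}

col_classes <- sapply(dfData, class)
col_classes
#     Admit      Rank       Gre 
#  "factor"  "factor" "numeric" 
```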
Regarding the other answer
apply definitely does not work, because the function you apply is as.factor or factor. apply always works on a matrix (if you feed it a data frame, it will convert it into a matrix first) and returns a matrix, and a matrix cannot hold factors. Consider this example:
x <- data.frame(x1 = letters[1:4], x2 = LETTERS[1:4], x3 = 1:4, stringsAsFactors = FALSE)
x[, 1:2] <- apply(x[, 1:2], 2, as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: chr "a" "b" "c" "d"
# $ x2: chr "A" "B" "C" "D"
# $ x3: int 1 2 3 4
Note, you still have character variable rather than factor. As I said, we have to use lapply:
x[1:2] <- lapply(x[1:2], as.factor)
str(x)
#'data.frame': 4 obs. of 3 variables:
# $ x1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
# $ x2: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# $ x3: int 1 2 3 4
Now we see the factor class in x1 and x2.
Using apply for a data frame is never a good idea. If you read the source code of apply:
dl <- length(dim(X))
if (is.object(X))
X <- if (dl == 2L)
as.matrix(X)
else as.array(X)
You see that a data frame (which has 2 dimensions) will be coerced to a matrix first. This is very slow. If your data frame columns have multiple different classes, the resulting matrix will have only one class. Who knows what the result of such coercion would be.
Yet apply is written in R not C, with an ordinary for loop:
for (i in 1L:d2) {
tmp <- forceAndCall(1, FUN, newX[, i], ...)
if (!is.null(tmp))
ans[[i]] <- tmp
so it is no better than an explicit for loop you write yourself.
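A quick way to see the coercion is to ask for the class of each column both ways. A minimal sketch on a made-up two-column frame:

```r
mixed <- data.frame(n = 1:3, s = c("a", "b", "c"), stringsAsFactors = FALSE)

# apply() converts the frame to a character matrix before calling the function
via_apply <- apply(mixed, 2, class)
# sapply() walks the frame as a list, so each column keeps its own class
via_sapply <- sapply(mixed, class)

via_apply
#           n           s 
# "character" "character" 
via_sapply
#           n           s 
#   "integer" "character" 
```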
I would use a different method.
Create a vector of column names you want to change to factors:
factorCols <- c("Admit", "Rank")
Then extract these columns by index:
myCols <- which(names(dfData) %in% factorCols)
Finally, use lapply to change these columns to factors:
dfData[,myCols] <- lapply(dfData[,myCols],as.factor)

Error when adding columns with default values to dataframe with zero rows

Why does this work,
# add ONE column to dataframe with zero rows
x <- data.frame(a=character(0))
x["b"] <- character(0)
while this does not?
# add SEVERAL columns to dataframe with zero rows
x <- data.frame(a=character(0))
x[c("b", "c")] <- character(0)
error in value[[jvseq[[jjj]]]] : index out of limits [... freely translated]
Note, that this is perfectly okay, if we have non-zero rows.
x <- data.frame(a=1)
x["b"] <- NA
x <- data.frame(a=1)
x[c("b", "c")] <- NA
And what would be a simple alternative to add multiple columns to zero row dataframes?
From help("[.data.frame"):
Data frames can be indexed in several modes. When [ and [[ are used
with a single vector index (x[i] or x[[i]]), they index the data frame
as if it were a list.
From help("["):
Recursive (list-like) objects
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
Thus, you need to pass a list (or data.frame):
x <- data.frame(a=character(0))
x[c("b", "c")] <- list(character(0), character(0))
str(x)
#'data.frame': 0 obs. of 3 variables:
# $ a: Factor w/ 0 levels:
# $ b: chr
# $ c: chr
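When the number of new columns is not known in advance, the list can be built with rep(). The helper below is a sketch: add_empty_cols is a hypothetical name, and it assumes the frame has zero rows and that character columns are wanted.

```r
# Hypothetical helper: add several empty character columns to a zero-row frame.
add_empty_cols <- function(df, cols) {
  # One zero-length vector per new column, supplied as a list
  df[cols] <- rep(list(character(0)), length(cols))
  df
}

x <- add_empty_cols(data.frame(a = character(0)), c("b", "c"))
names(x)
# [1] "a" "b" "c"
nrow(x)
# [1] 0
```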

Resources