Change stringsAsFactors settings for data.frame - r

I have a function in which I define a data.frame that I use loops to fill with data. At some point I get the Warning message:
Warning messages:
1: In `[<-.factor`(`*tmp*`, iseq, value = "CHANGE") :
invalid factor level, NAs generated
Therefore, when I define my data.frame, I'd like to set the option stringsAsFactors to FALSE but I don't understand how to do it.
I have tried:
DataFrame = data.frame(stringsAsFactors=FALSE)
And also:
options(stringsAsFactors=FALSE)
What is the correct way to set the stringsAsFactors option?

It depends on how you fill your data frame, for which you haven't given any code. When you construct a new data frame, you can do it like this:
x <- data.frame(aName = aVector, bName = bVector, stringsAsFactors = FALSE)
In this case, if e.g. aVector is a character vector, then the data frame column x$aName will be a character vector as well, and not a factor. Combining that with an existing data frame (using rbind, cbind or similar) should preserve that type.
When you execute
options(stringsAsFactors = FALSE)
you change the global default setting. So every data frame you create after executing that line will not auto-convert to factors unless explicitly told to do so. If you only need to avoid conversion in a single place, then I'd rather not change the default. However if this affects many places in your code, changing the default seems like a good idea.
One more thing: if your vector is already a factor, then neither of the above will change it back into a character vector. To do so, you should explicitly convert it back using as.character or similar.
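A minimal sketch of the points above, assuming a made-up two-column data frame (the names id and status are illustrative and not from the question). stringsAsFactors is passed explicitly in both calls, so the behaviour should not depend on the default of the R version in use.

```r
# Factor column: assigning a value that is not an existing level
# raises the "invalid factor level" warning and stores NA instead.
df_factor <- data.frame(id = 1:2, status = c("OPEN", "CLOSED"),
                        stringsAsFactors = TRUE)
df_factor$status[1] <- "CHANGE"
is.na(df_factor$status[1])

# Character column: the same assignment simply stores the string.
df_char <- data.frame(id = 1:2, status = c("OPEN", "CLOSED"),
                      stringsAsFactors = FALSE)
df_char$status[1] <- "CHANGE"
df_char$status

# An existing factor column has to be converted back explicitly.
df_factor$status <- as.character(df_factor$status)
class(df_factor$status)
```

Converting with as.character does not bring back the value that was lost to NA; it only changes the column type for later assignments.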

Related

using setNames in ifelse statement in R

I noticed that if I call setNames() inside ifelse(), the returned object does not preserve the names from setNames().
x <- 1:10
#no names kept
ifelse(x <5, setNames(x+1,letters[1:4]), setNames(x^3, letters[5:10]))
#names kept
setNames(ifelse(x <5, x+1,x^3), letters[1:10])
After looking at the code I realize that the second way is more concise, but I would still like to know why the names are not preserved when setNames() is called inside ifelse(). The ifelse() documentation warns:
The mode of the result may depend on the value of test (see the examples), and the class attribute (see oldClass) of the result is taken from test and may be inappropriate for the values selected from yes and no.
Is the stripping of the names related to this warning?
It's not really specific to setNames. ifelse simply doesn't preserve names from the yes/no arguments. It would get confusing if your yes and no values had different names, so it just doesn't bother. However, according to the Value section of the help page:
A vector of the same length and attributes (including dimensions and "class") as test
Since names are stored as attributes, names are only preserved from the test parameter. Observe these simple examples:
ifelse(TRUE, c(a=1), c(x=4))
# [1] 1
ifelse(c(g=TRUE), c(a=1), c(x=4))
# g
# 1
So in your examples you need to move the names to the test condition
ifelse(setNames(x <5,letters[1:10]), x+1, x^3)
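As a quick check of that explanation, the sketch below compares the names of both results. The variable names unnamed and named are only illustrative.

```r
x <- 1:10

# Names attached to the yes/no arguments are dropped,
# because the result takes its attributes from test.
unnamed <- ifelse(x < 5,
                  setNames(x + 1, letters[1:10]),
                  setNames(x^3, LETTERS[1:10]))
names(unnamed)

# Naming the test vector carries the names through to the result.
named <- ifelse(setNames(x < 5, letters[1:10]), x + 1, x^3)
names(named)
```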

When creating new data.frame column, what is the difference between `df$NewCol=` and `df[,"NewCol"]=` methods?

Using the built-in "iris" data frame in R, why does the first of these two ways of creating a new column "NewCol" give a warning while the second does not?
iris[,'NewCol'] = as.POSIXlt(Sys.Date()) # throws Warning
BUT
iris$NewCol = as.POSIXlt(Sys.Date()) # is correct
This issue doesn't exist when assigning basic types such as character, integer, or double.
First, notice that, as @sindri_baldur pointed out, as.POSIXlt returns a list.
From R help ($<-.data.frame):
There is no data.frame method for $, so x$name uses the default method which treats x as a list (with partial matching of column names if the match is unique, see Extract). The replacement method (for $) checks value for the correct number of rows, and replicates it if necessary.
So if you try iris[, "NewCol"] <- as.POSIXlt(Sys.Date()), you get a warning that you are trying to assign a list to a single column, and only the first element of the list is used.
Again, from R help:
"For [ the replacement value can be a list: each element of the list is used to replace (part of) one column, recycling the list as necessary".
In your case only one column is specified, meaning that only the first element of the list returned by as.POSIXlt will be used, and you are warned about that.
With the $ syntax, the iris data.frame is treated as a list, and the result of as.POSIXlt (again a list) is appended to it. The result is still a data.frame, but look at the type of the new column: it is a list.
iris[, "NewCol"] <- as.POSIXlt(Sys.Date()) # warning
iris$NewCol2 <- as.POSIXlt(Sys.Date())
typeof(iris$NewCol) # double
typeof(iris$NewCol2) # list
Suggestion: maybe you wanted to use as.POSIXct()?
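A small sketch of that suggestion, working on a copy of the first rows of iris (the name df is an assumption made for the example). It only uses the $ form of assignment and compares the underlying storage of the two date-time classes.

```r
df <- head(iris)   # copy, so the built-in data set is left untouched

lt <- as.POSIXlt(Sys.Date())
ct <- as.POSIXct(Sys.Date())
typeof(lt)   # "list": one element per date-time component
typeof(ct)   # "double": seconds since the epoch

# A POSIXct value is an atomic vector, so it is recycled to one value per row.
df$NewCol <- ct
class(df$NewCol)
length(df$NewCol) == nrow(df)
```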

Is there a way in R to ignore a "." in my data when calculating mean/sd/etc

I have a large data set that I need to calculate mean/std dev/min/ and max on for several columns. The data set uses a "." to denote when a value is missing for a subject. When running the mean or sd function this causes R to return NA . Is there a simple way around this?
my code is just this
xCAL<-mean(longdata$CAL)
sdCAL<-sd(longdata$CAL)
minCAL<-min(longdata$CAL)
maxCAL<-max(longdata$CAL)
but R returns NA for all of these variables. I get the following warning:
Warning message:
In mean.default(longdata$CAL) :
argument is not numeric or logical: returning NA
You need to convert your data to numeric to be able to do any calculations on it. When you run as.numeric, your . will be converted to NA, which is what R uses for missing values. Then, all of the functions you mention take an argument na.rm that can be set to TRUE to remove (rm) missing values (na).
If your data is a factor, you need to convert it to character first to avoid loss of information as explained in this FAQ.
Overall, to be safe, try this:
longdata$CAL <- as.numeric(as.character(longdata$CAL))
xCAL <- mean(longdata$CAL, na.rm = TRUE)
sdCAL <- sd(longdata$CAL, na.rm = TRUE)
# etc
Do note that na.rm is an argument of each of these functions - it's not magic that works everywhere. If you look at the help pages ?mean, ?sd, ?min, etc., you'll see the na.rm argument documented. If you want to remove missing values in general, the na.omit() function works well.
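A minimal sketch of the conversion on made-up values (cal_raw stands in for longdata$CAL, and the inline CSV text is invented for illustration). The second part assumes you control how the file is read: the na.strings argument of read.csv turns "." into NA at import time, so the column arrives as numeric.

```r
# A factor column in which "." marks a missing value.
cal_raw <- factor(c("1.5", ".", "2.5", "3.5"))

# Convert via character; "." becomes NA (as.numeric warns about the coercion).
cal <- suppressWarnings(as.numeric(as.character(cal_raw)))

mean(cal)                 # NA, because of the missing value
mean(cal, na.rm = TRUE)   # 2.5
sd(cal, na.rm = TRUE)     # 1
range(cal, na.rm = TRUE)  # 1.5 3.5

# Alternative: declare "." as the missing-value marker while reading.
longdata <- read.csv(text = "id,CAL\n1,1.5\n2,.\n3,2.5\n4,3.5",
                     na.strings = ".")
is.numeric(longdata$CAL)
```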

Coercing a vector to numeric mode in R

So, I have a set of data, and what I'm trying to do is find all the local maxima on the resulting curve. I read in a CSV file, which has x-values in the first column and y-values in the second, first step done, easy.
To find the maxima, I tried to use the findpeaks() function from the pracma package. However, each time I tried to run it, I got the same error:
Error: is.vector(x, mode = "numeric") is not TRUE
So, I first tried just converting this to a vector. Still got the same issue, however is.vector(x, mode = "any") was now returning true. I found some other help threads (which I can no longer find, so I can't share them, sorry!), and decided to try using lapply to coerce each entry in the new vector using as.numeric. Didn't work. Looked into ?as.numeric, and it mentioned that as.double might be better suited. Didn't work. Now I'm at a loss and not sure what to do - current working code is shown below.
plot <- read_csv("AFGP60 UV-05-04-16.csv",
col_names = FALSE, na = "null", skip = 2,n_max = numrow)
diffplot <- c(plot[1:601,2])
diffplot <- lapply(diffplot,as.double)
findpeaks(diffplot)
Try diffplot <- as.numeric(plot[[2]][1:600]). Extracting the column with [[ returns a plain vector even when plot is a tibble, as it is after read_csv.
The problem was that the data was read as character or as factor. The above code should change that. However, there are multiple issues with your code. First, plot is a base function used for plotting. Naming a variable with such a name is bad practice.
Second, the diffplot variable is a vector (first 600 rows from the second column), so there is no need to change each element separately with the lapply function.
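The difference between a list and a numeric vector can be checked with the same test that appears in the error message. This is a sketch on a made-up data frame (dat and its values are assumptions, and findpeaks itself is not called so that no extra package is needed).

```r
# Made-up data: the second column was read as character.
dat <- data.frame(x = 1:5,
                  y = c("0.1", "0.9", "0.3", "0.7", "0.2"),
                  stringsAsFactors = FALSE)

# c() on a one-column data frame gives a list, and lapply returns a list too,
# so the check from the error message still fails.
as_list <- lapply(c(dat[1:5, 2, drop = FALSE]), as.double)
is.vector(as_list, mode = "numeric")

# Extracting the column with [[ and converting once gives a numeric vector.
y <- as.numeric(dat[[2]][1:5])
is.vector(y, mode = "numeric")
```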

Error while mapping SYMBOLS to ENTREZID

I am getting a strange error converting Gene Symbols to Entrez ID. Here is my code:
testData = read.delim("IL_CellVar.txt",head=T,row.names = 2)
testData[1:5,1:3]
# ClustID Genes.Symbol ChrLoc
# NM_001034168.1 4 Ank2 chrNA:-1--1
# NM_013795.4 4 Atp5l chrNA:-1--1
# NM_018770 4 Igsf4a chrNA:-1--1
# NM_146150.2 4 Nrd1 chrNA:-1--1
# NM_134065.3 4 Epdr1 chrNA:-1--1
clustNum = 5
filteredClust = testData[testData$ClustID == clustNum,]
any(is.na(filteredClust$Genes.Symbol))
# [1] FALSE
selectedEntrezIds <- unlist(mget(filteredClust$Genes.Symbol,org.Mm.egSYMBOL2EG))
# Error in unlist(mget(filteredClust$Genes.Symbol, org.Mm.egSYMBOL2EG)) :
# error in evaluating the argument 'x' in selecting a method for function
# 'unlist': Error in #.checkKeysAreWellFormed(keys) :
# keys must be supplied in a character vector with no NAs
Another approach fails too:
selectedEntrezIds = select(org.Mm.eg.db,filteredClust$Genes.Symbol, "ENTREZID")
# Error in .select(x, keys, columns, keytype = extraArgs[["kt"]], jointype = jointype) :
# 'keys' must be a character vector
Just to rule it out, removing NAs doesn't help either:
a <- filteredClust$Genes.Symbol[!is.na(filteredClust$Genes.Symbol)]
selectedEntrezIds <- unlist(mget(a,org.Mm.egSYMBOL2EG))
# Error in unlist(mget(a, org.Mm.egSYMBOL2EG)) :
# error in evaluating the argument 'x' in selecting a method for function
# 'unlist': Error in # .checkKeysAreWellFormed(keys) :
# keys must be supplied in a character vector with no NAs
I am not sure why I am getting this error, as the master file from which the gene symbols were extracted for testData gives no problem when converting to Entrez IDs. I would appreciate help on this.
Since you didn't provide a minimal reproducible example for us to replicate the error you've experienced, I'm speculating here based on the error message. This is most likely caused by the default behavior of read.delim and similar functions (read.csv, read.table etc.), which convert strings in your data file to factors.
You need to add an extra parameter to read.delim, specifically stringsAsFactors = FALSE (in R versions before 4.0.0 the default is TRUE).
That is,
testData = read.delim("IL_CellVar.txt", header = TRUE, row.names = 2, stringsAsFactors = FALSE)
If you read the documentation:
stringsAsFactors
logical: should character vectors be converted to factors? Note that this is overridden by as.is and colClasses, both of which allow finer control.
You can check the class of your Genes.Symbol column by:
class(testData$Genes.Symbol)
and I guess it would be "factor".
This leads to the error you had:
# Error in .select(x, keys, columns, keytype = extraArgs[["kt"]], jointype = jointype) :
# 'keys' must be a character vector
You can also manually convert the factors to strings/characters by:
testData$Genes.Symbol <- as.character(testData$Genes.Symbol)
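The effect can be reproduced without the original file. The sketch below reads a tiny tab-separated snippet from a string (the text and the RefSeq column name are made up for illustration) and passes stringsAsFactors explicitly in both calls, so it does not rely on the default of any particular R version.

```r
# Two rows of made-up tab-separated data, read from a string instead of a file.
txt <- "RefSeq\tGenes.Symbol\nNM_001034168.1\tAnk2\nNM_013795.4\tAtp5l\n"

as_factor <- read.delim(text = txt, stringsAsFactors = TRUE)
class(as_factor$Genes.Symbol)   # "factor": not accepted as keys

as_char <- read.delim(text = txt, stringsAsFactors = FALSE)
class(as_char$Genes.Symbol)     # "character"

# Converting the factor afterwards yields the same character keys.
keys <- as.character(as_factor$Genes.Symbol)
identical(keys, as_char$Genes.Symbol)
```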
You can read more about this peculiar behavior in this chapter of Hadley's book "Advanced R". And I'm quoting the relevant paragraph here:
... Unfortunately, most data loading functions in R automatically convert character vectors to factors. This is suboptimal, because there’s no way for those functions to know the set of all possible levels or their optimal order. Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data. A global option, options(stringsAsFactors = FALSE), is available to control this behaviour, but I don’t recommend using it. Changing a global option may have unexpected consequences when combined with other code (either from packages, or code that you’re source()ing), and global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave. ...
