Converting factor values with levels to numeric in R

I'm trying to convert factor values in R to numeric. I have tried various methods, but no matter what I do I get the message "NAs introduced by coercion". Here is a sample of the code I run and the output I get:
> demand <- read.csv("file.csv" )
> demand[3,3]
[1] 5,185
25 Levels: 2/Jan/2011 3,370 4,339 4,465 4,549 4,676 4,767 4,844 5,055 5,139 5,185 5,265 5,350 5,434 ... dam
> a <- demand[3,3]
> as.numeric(as.character(a))
[1] NA
Warning message:
NAs introduced by coercion
How can I get numeric values?

You should replace
as.numeric(as.character(a))
in your code with
as.numeric( gsub("[,]", "", as.character(a) ) )
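The replacement above strips the commas before converting. Here is a minimal sketch of the effect, using a made-up factor `a` whose labels are borrowed from the levels printed in the question. It also shows that labels which are not numbers at all (the date and "dam") still end up as NA.

```r
# Hypothetical factor mimicking the column in the question.
a <- factor(c("5,185", "3,370", "2/Jan/2011", "dam"))

# Remove the commas from the labels, then convert.
# Labels that are not numbers even without commas still become NA;
# suppressWarnings() only keeps the output quiet.
cleaned <- gsub(",", "", as.character(a), fixed = TRUE)
values <- suppressWarnings(as.numeric(cleaned))

values
# The first two entries convert; the date and "dam" are NA.
```

If rows such as the date or "dam" are not real data, it is probably better to fix how the file is read than to clean the column afterwards.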

I have two comments:
Your file may use a comma as the decimal separator (',' instead of '.'), as Excel does in some European locales. If so, use the read.csv2() function, which expects ',' as the decimal mark and ';' as the field separator.
The first observation is probably the header? I guess the observations below it are somehow connected via this date (2/Jan/2011). I suggest using the header=TRUE argument.
Summarizing:
Try read.csv2("file.csv", header=TRUE)
If for any reason you still need to convert factors to numeric values, I suggest:
f = as.factor(1:10)
as.numeric(levels(f))[f]
Best,
Adii_
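A common pitfall when converting factors is that as.numeric() applied directly to a factor returns its internal integer codes, not its labels. With as.factor(1:10) the two happen to coincide, so the problem stays hidden. The sketch below uses a made-up factor where they differ.

```r
# A factor whose labels differ from its internal integer codes.
# Levels are sorted as text: "10", "20", "5".
f <- factor(c("10", "20", "5"))

codes  <- as.numeric(f)                 # internal codes, not the labels
labels <- as.numeric(as.character(f))   # the labels as numbers
same   <- as.numeric(levels(f))[f]      # equivalent; converts each level only once

codes
labels
```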

Related

Error in cor(mydata) : 'x' must be numeric

In R, I have been having trouble trying to create a correlation matrix for my data. I keep running into this problem: "Error in cor(mydata) : 'x' must be numeric" and I don't know how to fix it.
> mydata <- Combo[, c(1,2,3,4,5,6,7)]
> head(mydata, 13)
> #computing matrix
> corrmax = cor(mydata)
Error in cor(mydata) : 'x' must be numeric
I believe not all the data in mydata are numeric. You can test this by running: str(mydata) or sapply(mydata, is.numeric).
If there are variables in mydata that are chr or other non-numeric formats or return FALSE in the case of sapply, you will need to convert them to numeric before running the command or be more selective about the set of variables for which you calculate a correlation. I see strings and percent signs in what you posted. The strings will need to be removed and the formatted percents (%) converted to a numeric representation (decimals).
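As a rough sketch of both options, assume a small made-up data frame (the column names `sales`, `growth` and `region` are hypothetical and only stand in for the real `mydata`). The percent column is converted to decimals, and only the numeric columns are passed to cor().

```r
# Hypothetical data frame standing in for mydata.
mydata <- data.frame(
  sales  = c(10, 12, 15, 11),
  growth = c("5%", "12%", "7.5%", "3%"),   # formatted percents, stored as text
  region = c("N", "S", "E", "W"),          # plain strings, not usable in cor()
  stringsAsFactors = FALSE
)

# Convert the formatted percents to decimals.
mydata$growth <- as.numeric(sub("%", "", mydata$growth, fixed = TRUE)) / 100

# Keep only the numeric columns and compute the correlation matrix.
numeric_cols <- sapply(mydata, is.numeric)
corrmax <- cor(mydata[, numeric_cols, drop = FALSE])
```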

new to R, and getting this error message, how do I omit NA in my cohort to analyze my data?

I'm new to R and getting this error message. How do I omit NAs in my cohort to analyze my data?
mean(cohort5$"age.at.diagnosis")
[1] NA
Warning message:
In mean.default(cohort5$age.at.diagnosis) :
argument is not numeric or logical: returning NA
All you need to do to handle NAs is add na.rm = TRUE:
mean(cohort5$age.at.diagnosis, na.rm = TRUE)
However, the error message you received suggests that the problem is actually in the data format. You should make sure that the variable in your dataframe is, actually, numeric and doesn't contain non-numeric values (for example some unusual character used to indicate missing values). class(cohort5$age.at.diagnosis) will tell you the data type.
cohort5$age.at.diagnosis <- as.numeric(cohort5$age.at.diagnosis) # if currently character
cohort5$age.at.diagnosis <- as.numeric(as.character(cohort5$age.at.diagnosis)) # if currently factor
Both of these lines will coerce non-numeric values into NAs, so be careful because you may be throwing away information by doing that.
There are also ways to omit missing data prior to running any sort of analysis, such as the na.omit function:
na.omit(cohort5)
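Before coercing, it can help to see which raw values would be turned into NA. The sketch below uses a made-up character vector `age` in place of the real column; the placeholder strings "unknown" and "." are assumptions for illustration.

```r
# Hypothetical column in which missing ages were entered as "unknown" or ".".
age <- c("54", "61", "unknown", "47", ".")

converted <- suppressWarnings(as.numeric(age))

# Values that are present in the raw data but become NA after conversion.
lost <- unique(age[is.na(converted) & !is.na(age)])
lost

# Mean of the values that did convert.
mean(converted, na.rm = TRUE)
```

If `lost` contains anything other than recognised missing-value markers, the conversion is discarding real information and the data should be cleaned first.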

Error in huge R package when criterion "stars"

I am trying to build an association network from some expression data. The data set is really large: about 300 samples and ~30,000 genes. I would like to apply a Gaussian graphical model to my data using the huge R package.
Here is the code I am using
dim(data)
#[1] 317 32291
huge.out <- huge.npn(data)
huge.stars <- huge.select(huge.out, criterion="stars")
However in this last step I got an error:
Error in cor(x) : ling....in progress:10%
Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'
Any help would be very appreciated.
You posted this exact question on Rhelp today. Both SO and Rhelp deprecate cross-posting but if you do choose to switch venues it is at the very least courteous to inform the readership.
You responded to the suggestion here on SO that there were missing data in your data-object named 'data' by claiming there were no missing data. So what does this code return:
lapply(data , function(x) sum(is.na(x)))
That would be a first-level check, but the error could also be caused by a later step that encountered a missing value in the matrix of correlation coefficients in the matrix 'huge.out'. That could happen if there were a) infinities in the calculations or b) a constant column:
> cor(c(1:10,Inf), 1:11)
[1] NaN
> cor(rep(2,7), rep(2,7))
[1] NA
Warning message:
In cor(rep(2, 7), rep(2, 7)) : the standard deviation is zero
So the next check is:
sum( is.na(huge.out) )
That will at least give you some basis for defending your claim of no missing values and will also give you a plausible theory as to the source of the error. To locate a column that is entirely constant you might do something like this (assuming it were a dataframe):
which(sapply(sapply(data, unique), length) == 1)
If it's a matrix, you need to use apply.
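For the matrix case, a minimal sketch with a small made-up matrix `m` (not the real expression data) could look like this. A constant column is one with exactly one distinct value.

```r
# Small made-up matrix; the second column is constant.
m <- cbind(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5), c = c(2, 1, 2, 1))

# Number of distinct values in each column (MARGIN = 2 means "by column").
n_distinct <- apply(m, 2, function(col) length(unique(col)))
constant_cols <- which(n_distinct == 1)
constant_cols

# Count entries that are NA, NaN or infinite in one pass.
n_bad <- sum(!is.finite(m))
```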

mean.default argument not numerical on R

I'm trying to use R for the first time. I have never taken any courses and have some questions. The first is this:
When I try to compute the mean of some temperature values (they are all between 18.15 and 18.40),
I get this answer:
"Warning message:
In mean.default(d_Temp_Experiment$value) :
argument is not numeric or logical: returning NA"
I don't have the same problem with the values of PAR, which are all integers, or with the values of pH, which are all decimal numbers like 8.831...
Can you tell me what I am doing wrong?
As Arun hints, it could be that the column is character rather than numeric.
If you are sure that all the values are correct you could coerce the values with
d_Temp_Experiment$value <- as.numeric(d_Temp_Experiment$value)
You might have the below sort of business going on.
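myvector <- c(0,1,2,3,4,5,"6","7")  # the quoted values make the whole vector character
mv <- as.numeric(myvector)
mean(myvector)  # NA, with the "argument is not numeric or logical" warning
mean(mv)        # 3.5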

Error with knn function

I am trying to run this line:
knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)
but I always get this error:
Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[, :
NAs introduced by coercion
Any ideas, please?
PS: mydades.training and mydades.test are defined as follows:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:
NA/NaN/Inf in foreign function call (arg 6)
makes me suspect that the knn-function call to the C language implementation fails. Many functions in R actually call underlying, more efficient C implementations, instead of having an algorithm implemented in just R. If you type just 'knn' in your R console, you can inspect the R implementation of 'knn'. There exists the following line:
Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr),
as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)),
as.double(test), res = integer(nte), pr = double(nte),
integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))
where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you get the warning
NAs introduced by coercion
twice, I think two of the as.double/as.integer calls fail and introduce NA values. If we start counting the parameters, the 6th argument is:
as.double(train)
that may fail in cases such as:
# as.double can not translate text fields to doubles, they are coerced to NA-values:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while the following text is cast to double without an error:
> as.double("1.23")
[1] 1.23
You get two coercion warnings, which probably come from 'as.double(train)' and 'as.double(test)'. Since you did not provide exact details of what 'mydades' looks like, here are some of my best guesses (using artificial data from a multivariate normal distribution):
library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))
# This breaks knn
mydades[3,4] <- Inf
# This breaks knn
mydades[4,3] <- -Inf
# The Inf values above break knn but do not trigger the "NAs introduced by coercion" warning
# Raw text like this breaks knn and gives the same warning as in the question
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"
# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"
# ... or wrong decimal symbol?
mydades[3,3] <- "1,23"
# should be 1.23, as R uses '.' as decimal symbol and not ','
# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)
I would not keep both the numeric data and the class labels in a single matrix. Perhaps you could split the data as:
mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]
Calls such as
str(mydades); summary(mydades)
may also help you/us locate the problematic data entries, so that they can be corrected to numeric entries or the non-numeric fields omitted.
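To pin down the exact cells, one option is to convert the feature columns and ask which positions are not finite afterwards. This is only a sketch on a made-up character matrix `feats`; for the real data, the six feature columns of 'mydades' would take its place.

```r
# Hypothetical character matrix: features stored as text (filled column by column).
feats <- matrix(c("1.5", "foo", "2.0", "1,23", "Inf", "3.1"), ncol = 2)

num <- suppressWarnings(as.numeric(feats))   # as.numeric() returns a plain vector
dim(num) <- dim(feats)                       # restore the matrix shape

# Row/column positions of entries that are NA, NaN or infinite after conversion.
bad <- which(!is.finite(num), arr.ind = TRUE)
bad

# The offending raw values at those positions.
feats[bad]
```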
The rest of the run code (after breaking the data), as provided by you:
N <- nrow(mydades)
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
# 7th column seems to be the class labels
knn(train=mydades.training[,-7],test=mydades.test[,-7],mydades.training[,7],k=5)
Great answer by @Teemu.
As this is a well-read question, I will give the same answer from an analytics perspective.
The KNN function classifies data points by calculating the Euclidean distance between the points. That's a mathematical calculation requiring numbers. All variables in KNN must therefore be coerce-able to numerics.
The data preparation for KNN often involves three tasks:
(1) Fix all NA or "" values
(2) Convert all factors into a set of booleans, one for each level in the factor
(3) Normalize the values of each variable to the range 0:1 so that no variable's range has an unduly large impact on the distance measurement.
I would also point out that the function seems to fail when using integers. I needed to convert everything into "num" type prior to calling the knn function. This includes the target feature, for which most methods in R use the factor type. Thus, as.numeric(my_frame$target_feature) is required.
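The three data-preparation tasks listed above can be sketched in base R as follows. The data frame `df`, its column names and the helper `rescale01` are all made up for illustration, and dropping incomplete rows is just one possible way to "fix" missing values.

```r
# Hypothetical data: two numeric features, one factor feature, one class label.
df <- data.frame(
  height = c(150, 160, NA, 180, 170),
  weight = c(50, 65, 70, 90, 80),
  colour = factor(c("red", "blue", "red", "green", "blue")),
  label  = factor(c("A", "B", "A", "B", "A"))
)

# (1) Drop rows with missing values (imputation is another option).
df <- na.omit(df)

# (2) Expand the factor into one 0/1 column per level.
#     "~ colour - 1" removes the intercept so every level gets its own column.
dummies <- model.matrix(~ colour - 1, data = df)

# (3) Rescale each numeric feature to the range [0, 1].
#     Note: this divides by zero if a column is constant.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
features <- cbind(
  height = rescale01(df$height),
  weight = rescale01(df$weight),
  dummies
)
labels <- df$label
```

The resulting `features` matrix is entirely numeric, so it (split into training and test rows) and `labels` are in the shape that a call such as knn(train, test, cl, k) expects.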

Resources