R data.frame strange behavior when converting characters to numeric

I am dealing with a dataset containing US States FIPS codes coded as characters, where codes from 1 to 9 sometimes have a 0 prefix (01, 02,...). While trying to clean it up I came across the following issue:
library(dplyr)
test <- data.frame(fips = c(1,"01")) %>%
  mutate(fips = as.numeric(fips))
> test
fips
1 2
2 1
where "1" is converted to 2 and "01" to 1. This annoying behavior disappears with a tibble:
test <- tibble(fips = c(1,"01")) %>%
mutate(fips = as.numeric(fips))
> test
# A tibble: 2 x 1
fips
<dbl>
1 1
2 1
Does anyone know what is going on?
Thanks

This is a difference in the defaults for tibbles and data.frames. When you mix together strings and numbers as in c(1, "01"), R converts everything to a string.
c(1, "01")
[1] "1" "01"
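The default behavior for data.frame in R versions before 4.0.0 is to make strings into factors. If you look at the help page for data.frame in those versions you will see the argument:
stringsAsFactors: ... The ‘factory-fresh’ default is TRUE
So data.frame makes c(1, "01") into a factor with two levels "1" and "01". (Since R 4.0.0 the default is FALSE, so a current installation no longer does this unless you ask for it.)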
T1 = data.frame(fips = c(1,"01"))
str(T1)
'data.frame': 2 obs. of 1 variable:
$ fips: Factor w/ 2 levels "01","1": 2 1
Now factors are stored as integers for efficiency. That is why you see 2 1 at the end of the above output of str(T1). So if you directly convert the factor to a number, you get its integer codes, 2 and 1.
You can get the behavior that you want, either by making the data.frame more carefully with
T1 = data.frame(fips = c(1,"01"), stringsAsFactors=FALSE)
or you can convert the factor to a string before converting to a number
fips = as.numeric(as.character(fips))
Tibbles do not have this problem because they do not convert the strings to factors.
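A minimal sketch of the difference described above. It passes stringsAsFactors explicitly, so the result should not depend on which default your R version uses; the names T1 and T2 are only for illustration.

```r
# Force the old default explicitly so the factor is created in any R version.
T1 <- data.frame(fips = c(1, "01"), stringsAsFactors = TRUE)

levels(T1$fips)                     # "01" "1" (levels are sorted as strings)
as.integer(T1$fips)                 # 2 1: the internal level codes
as.numeric(as.character(T1$fips))   # 1 1: the numbers the labels spell out

# With strings kept as strings, direct conversion is safe.
T2 <- data.frame(fips = c(1, "01"), stringsAsFactors = FALSE)
as.numeric(T2$fips)                 # 1 1
```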

Related

A problem when releveling a factor to the default order?

I have this df
df = data.frame(x = 1:3)
converted to a factor
df$x = factor(df$x)
the levels by default are
str(df)
now let's make level 2 as the reference level
df$x = relevel(df$x,ref=2)
everything till now is ok. but when deciding to make the level 1 again as the default level it's not working
df$x = relevel(df$x,ref=2)
str(df)
df$x = relevel(df$x,ref=1)
str(df)
Appreciate the help.
From ?relevel,
ref: the reference level, typically a string.
I'll key off of "typically". Looking at the code of stats:::relevel.factor, one key part is
if (is.character(ref))
ref <- match(ref, lev)
This means to me that after this expression, ref is now (assumed to be) an integer that corresponds to the index within the levels. In that context, your ref=1 is saying to use the first level by its index (which is already first).
Try using a string.
relevel(df$x,ref=1)
# [1] 1 2 3
# Levels: 2 1 3
relevel(df$x,ref="1")
# [1] 1 2 3
# Levels: 1 2 3
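To make the index-versus-label distinction concrete, here is a small round trip on a standalone factor (x and x2 are illustrative names, not from the question):

```r
x  <- factor(1:3)              # levels "1" "2" "3"
x2 <- relevel(x, ref = "2")    # levels "2" "1" "3"

# A number is treated as a position among the current levels,
# so ref = 1 picks "2" (already first) and nothing changes.
levels(relevel(x2, ref = 1))   # "2" "1" "3"

# A string is matched against the level labels.
levels(relevel(x2, ref = "1")) # "1" "2" "3"
```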

BNlearn R error “variable Variable1 must have at least two levels.”

Trying to create a BN using BNlearn, but I keep getting an error;
Error in check.data(data, allowed.types = discrete.data.types) : variable Variable1 must have at least two levels.
It gives me that error for every one of my variables, even though they are all factors and have more than one level. As you can see, in this case my variable "model" has 4 levels.
As I can't share the variables and dataset, I've created a small set and belonging code to the data set. I get the same problem. I know I've only shared 2 variables, but I get the same error for all the variables.
library(tidyverse)
library (bnlearn)
library(openxlsx)
DataFull <- read.xlsx("(.....)/test.xlsx", sheet = 1, startRow = 1, colNames = TRUE)
set.seed(600)
DataFull <- as_tibble(DataFull)
DataFull$Variable1 <- as.factor(DataFull$Variable1)
DataFull$TargetVar <- as.factor(DataFull$TargetVar)
DataFull <- na.omit(DataFull)
DataFull <- droplevels(DataFull)
DataFull <- DataFull[sample(nrow(DataFull)),]
Data <- DataFull[1:as.integer(nrow(DataFull)*0.70)-1,]
Datatest <- DataFull[as.integer(nrow(DataFull)*0.70):nrow(DataFull),]
nrow(Data)+nrow(Datatest)==nrow(DataFull)
FocusVar <- as.character("TargetVar")
BN.naive <- naive.bayes(Data, FocusVar)
Using str(data), I can see that the variable has 2 or more levels already:
str(Data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 27586 obs. of 2 variables:
$ Variable1: Factor w/ 3 levels "Small","Medium",..: 2 2 3 3 3 3 3 3 3 3 ...
$ TargetVar: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
Link to data set: https://drive.google.com/open?id=1VX2xkPdeHKdyYqEsD0FSm1BLu1UCtOj9eVIVfA_KJ3g
bnlearn expects a data.frame and does not work with tibbles, so keep your data as a data.frame by omitting the line DataFull <- as_tibble(DataFull).
Example
library(tibble)
library (bnlearn)
d <- as_tibble(learning.test)
hc(d)
Error in check.data(x) : variable A must have at least two levels.
In particular, it is the line from bnlearn:::check.data
if (nlevels(x[, col]) < 2)
stop("variable ", col, " must have at least two levels.")
In a standard data.frame, learning.test[,"A"] returns a vector, so nlevels(learning.test[,"A"]) works as expected. By design, however, you cannot extract vectors like this from tibbles: d[,"A"] is still a tbl_df and not a vector, hence nlevels(d[,"A"]) doesn't work as expected and returns zero.
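The same effect can be reproduced in base R without loading tibble. This sketch assumes that drop = FALSE on a plain data.frame is a fair stand-in for how a tibble keeps a one-column selection as a data frame; df, col_vec and col_df are illustrative names.

```r
df <- data.frame(A = factor(c("a", "b", "c", "a")))

col_vec <- df[, "A"]                # plain data.frame: drops to a factor
col_df  <- df[, "A", drop = FALSE]  # stays a one-column data frame, as a tibble would

nlevels(col_vec)     # 3
nlevels(col_df)      # 0: a data frame has no "levels" attribute
nlevels(df[["A"]])   # 3: [[ ]] always extracts the column itself
```

If the data is already a tibble, as.data.frame() converts it back before it is passed to bnlearn.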

R check character values for numeric and change var datatype automatically

I have many dataframes where all the data is character. I can guess that a var containing a number should be changed to a numeric data type. I have 100's of columns though, so I don't want to type out each one in order to change it.
Is there another way to automate this process and to scan a column of data check if the character has a numeric value and change it into a numeric type from character type?
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c("21000", "23400", "26800")
gender <- c("M", "M", "F")
rank <- c("5", "109", "2")
df <- data.frame(employee, salary, gender, rank)
I don't want to have to do this for each column/var
df$rank <- as.numeric(df$rank)
I would like to do something like this
i <- sapply(df, is.vector.of.columns.contaning.numeric.values)
df[i] <- lapply(df[i], as.numeric)
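We can write a function with the number condition. It works by trying as.numeric on the character values and checking whether it returns NA; if it does, the value cannot be coerced to an unambiguous numeric, and the function keeps the column as is. Note that the conversion goes through as.character() in both branches: the columns here are factors, and calling as.numeric() directly on a factor would return the level codes rather than the values.
smartConvert <- function(x) {
  # go through character so factor columns yield their labels, not level codes
  num <- suppressWarnings(as.numeric(as.character(x)))
  if (anyNA(num)) x else num
}
df[] <- lapply(df, smartConvert)
str(df)
# 'data.frame': 3 obs. of 4 variables:
# $ employee: Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
# $ salary : num 21000 23400 26800
# $ gender : Factor w/ 2 levels "F","M": 2 2 1
# $ rank : num 5 109 2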
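As an alternative sketch (not the answerer's method), base R's type.convert() can make the same per-column decision. The example rebuilds the question's data with stringsAsFactors = FALSE and applies type.convert() column by column with as.is = TRUE, so non-numeric columns stay character.

```r
df <- data.frame(
  employee = c("John Doe", "Peter Gynn", "Jolie Hope"),
  salary   = c("21000", "23400", "26800"),
  gender   = c("M", "M", "F"),
  rank     = c("5", "109", "2"),
  stringsAsFactors = FALSE
)

# as.character() makes this safe for factor columns too;
# as.is = TRUE keeps text columns as character instead of factor.
df[] <- lapply(df, function(col) type.convert(as.character(col), as.is = TRUE))

sapply(df, class)   # employee and gender stay character, salary and rank become integer
```

One caveat of type.convert(): a column made up only of values such as "T" and "F" would be read as logical, so check columns like gender if they could contain a single letter throughout.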

Convert factor to integer in a data frame

I have the following code
anna.table<-data.frame (anna1,anna2)
write.table(anna.table, file="anna.file.txt",sep='\t', quote=FALSE)
my table in the end contains numbers such as the following
chr start end score
chr2 41237927 41238801 151
chr1 36976262 36977889 226
chr8 83023623 83025129 185
and so on......
after that i am trying to get only the values which fit some criteria such as score less than a specific value
so i am doing the following
anna3<-"data/anna/anna.file.txt"
anna.total<-read.table(anna3,header=TRUE)
significant.anna<-subset(anna.total,score <=0.001)
Error: In Ops.factor(score, 0.001) <= not meaningful for factors
so i guess the problem is that my table has factors and not integers
I guess that my anna.total$score is a factor and i must make it an integer
If i read correctly the as.numeric might solve my problem
i am reading about the as.numeric function but i cannot understand how i can use it
Hence could you please give me some advices?
thank you in advance
best regards
Anna
PS : i tried the following
anna3<-"data/anna/anna.file.txt"
anna.total<-read.table(anna3,header=TRUE)
anna.total$score.new<-as.numeric (as.character(anna.total$score))
write.table(anna.total,file="peak.list.numeric.v3.txt",append = FALSE ,quote = FALSE,col.names =TRUE,row.names=FALSE, sep="\t")
anna.peaks<-subset(anna.total,fdr.new <=0.001)
Warning messages:
1: In Ops.factor(score, 0.001) : <= not meaningful for factors
again i have the same problem......
With anna.table (it is a data frame by the way, a table is something else!), the easiest way will be to just do:
anna.table2 <- data.matrix(anna.table)
as data.matrix() will convert factors to their underlying numeric (integer) levels. This will work for a data frame that contains only numeric, integer, factor or other variables that can be coerced to numeric, but any character strings (character) will cause the matrix to become a character matrix.
If you want anna.table2 to be a data frame, not a matrix, then you can subsequently do:
anna.table2 <- data.frame(anna.table2)
Other options are to coerce all factor variables to their integer levels. Here is an example of that:
## dummy data
set.seed(1)
dat <- data.frame(a = factor(sample(letters[1:3], 10, replace = TRUE)),
b = runif(10))
## sapply over `dat`, converting factor to numeric
dat2 <- sapply(dat, function(x) if(is.factor(x)) {
as.numeric(x)
} else {
x
})
dat2 <- data.frame(dat2) ## convert to a data frame
Which gives:
> str(dat)
'data.frame': 10 obs. of 2 variables:
$ a: Factor w/ 3 levels "a","b","c": 1 2 2 3 1 3 3 2 2 1
$ b: num 0.206 0.177 0.687 0.384 0.77 ...
> str(dat2)
'data.frame': 10 obs. of 2 variables:
$ a: num 1 2 2 3 1 3 3 2 2 1
$ b: num 0.206 0.177 0.687 0.384 0.77 ...
However, do note that the above will work only if you want the underlying numeric representation. If your factor has essentially numeric levels, then we need to be a bit cleverer in how we convert the factor to a numeric whilst preserving the "numeric" information coded in the levels. Here is an example:
## dummy data
set.seed(1)
dat3 <- data.frame(a = factor(sample(1:3, 10, replace = TRUE), levels = 3:1),
b = runif(10))
## sapply over `dat3`, converting factor to numeric
dat4 <- sapply(dat3, function(x) if(is.factor(x)) {
as.numeric(as.character(x))
} else {
x
})
dat4 <- data.frame(dat4) ## convert to a data frame
Note how we need to do as.character(x) first before we do as.numeric(). The extra call encodes the level information before we convert that to numeric. To see why this matters, note what dat3$a is
> dat3$a
[1] 1 2 2 3 1 3 3 2 2 1
Levels: 3 2 1
If we just convert that to numeric, we get the wrong data as R converts the underlying level codes
> as.numeric(dat3$a)
[1] 3 2 2 1 3 1 1 2 2 3
If we coerce the factor to a character vector first, then to a numeric one, we preserve the original information not R's internal representation
> as.numeric(as.character(dat3$a))
[1] 1 2 2 3 1 3 3 2 2 1
If your data are like this second example, then you can't use the simple data.matrix() trick as that is the same as applying as.numeric() directly to the factor and as this second example shows, that doesn't preserve the original information.
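The three conversions discussed above can be compared side by side on one small factor. The last form, as.numeric(levels(f))[f], is the variant suggested in ?factor; f is an illustrative name.

```r
f <- factor(c(1, 2, 2, 3, 1), levels = 3:1)   # levels "3" "2" "1"

as.numeric(f)                 # 3 2 2 1 3: the internal level codes
as.numeric(as.character(f))   # 1 2 2 3 1: the original values
as.numeric(levels(f))[f]      # 1 2 2 3 1: same result, converting each level only once

# data.matrix() behaves like the first form and returns the codes.
as.vector(data.matrix(data.frame(a = f)))     # 3 2 2 1 3
```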
I know this is an older question, but I just had the same problem and maybe this helps:
In this case, your score column seems like it should not have become a factor column. That usually happens after read.table when it is a text column. Depending on which country you are from, maybe you separate decimals with a "," and not with a ".". Then R thinks that is a character column and makes it a factor. In that case Gavin's answer won't work, because R won't turn "123,456" into 123.456. You can easily fix that in a text editor by replacing "," with ".", though.
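The comma case can also be handled inside R instead of a text editor. This is a sketch with made-up score values (the real file is not available): read.table() has a dec argument for the decimal mark, and a column that has already been read as a factor can be repaired with sub().

```r
# Hypothetical file contents with "," as the decimal mark.
raw <- "chr start end score
chr2 41237927 41238801 0,0005
chr1 36976262 36977889 0,2260"

# dec = "," tells read.table how decimals are written, so score is numeric.
anna.total <- read.table(text = raw, header = TRUE, dec = ",")
is.numeric(anna.total$score)            # TRUE
subset(anna.total, score <= 0.001)      # keeps only the chr2 row

# Repairing a factor that was already read with the wrong decimal mark.
f <- factor(c("123,456", "0,5"))
fixed <- as.numeric(sub(",", ".", as.character(f), fixed = TRUE))
fixed                                   # 123.456 0.5
```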
