I'm trying to create a BN (Bayesian network) using bnlearn, but I keep getting this error:
Error in check.data(data, allowed.types = discrete.data.types) : variable Variable1 must have at least two levels.
I get that error for every one of my variables, even though they are all factors with more than one level. As you can see, in this case my variable "model" has 4 levels.
As I can't share the real variables and dataset, I've created a small dataset and the code that goes with it, and I get the same problem. I've only shared 2 variables, but I get the same error for all of them.
library(tidyverse)
library (bnlearn)
library(openxlsx)
DataFull <- read.xlsx("(.....)/test.xlsx", sheet = 1, startRow = 1, colNames = TRUE)
set.seed(600)
DataFull <- as_tibble(DataFull)
DataFull$Variable1 <- as.factor(DataFull$Variable1)
DataFull$TargetVar <- as.factor(DataFull$TargetVar)
DataFull <- na.omit(DataFull)
DataFull <- droplevels(DataFull)
DataFull <- DataFull[sample(nrow(DataFull)),]
Data <- DataFull[1:as.integer(nrow(DataFull)*0.70)-1,]
Datatest <- DataFull[as.integer(nrow(DataFull)*0.70):nrow(DataFull),]
nrow(Data)+nrow(Datatest)==nrow(DataFull)
FocusVar <- as.character("TargetVar")
BN.naive <- naive.bayes(Data, FocusVar)
Using str(data), I can see that the variable has 2 or more levels already:
str(Data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 27586 obs. of 2 variables:
$ Variable1: Factor w/ 3 levels "Small","Medium",..: 2 2 3 3 3 3 3 3 3 3 ...
$ TargetVar: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
Link to data set: https://drive.google.com/open?id=1VX2xkPdeHKdyYqEsD0FSm1BLu1UCtOj9eVIVfA_KJ3g
bnlearn expects a data.frame and doesn't work with tibbles, so keep your data as a data.frame by omitting the line DataFull <- as_tibble(DataFull).
Example
library(tibble)
library (bnlearn)
d <- as_tibble(learning.test)
hc(d)
Error in check.data(x) : variable A must have at least two levels.
In particular, the error comes from this check in bnlearn:::check.data:
if (nlevels(x[, col]) < 2)
stop("variable ", col, " must have at least two levels.")
In a standard data.frame, learning.test[,"A"] returns a vector, so nlevels(learning.test[,"A"]) works as expected. By design, however, you cannot extract vectors like this from tibbles: d[,"A"] is still a tbl_df and not a vector, hence nlevels(d[,"A"]) doesn't work as expected and returns zero.
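The difference can be reproduced without bnlearn or tibble. The sketch below uses only base R and relies on drop = FALSE to mimic the way a tibble's `[` keeps the data frame shape; the toy data frame df is made up for illustration.

```r
# Toy data frame with one factor column (invented for this example)
df <- data.frame(A = factor(c("a", "b", "c", "a")))

# Plain data.frame subsetting drops to a factor vector
as_vector <- df[, "A"]

# drop = FALSE keeps a one-column data frame, like d[, "A"] on a tibble
as_one_col <- df[, "A", drop = FALSE]

nlevels(as_vector)   # counts the factor levels
nlevels(as_one_col)  # 0: a data frame carries no levels attribute

# `[[` extracts the column itself for data frames and tibbles alike
nlevels(df[["A"]])
```

Code that has to accept both kinds of input can use `[[` or `$` to pull out a column, since neither depends on the drop behaviour.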
Related
I am working with a modest survey dataset (190 x 2162). The platform we are using exports to csv, but when imported every column is a factor. I could assign each column a class on import, but a recent change of hats at work has put many more surveys in my future. So, out of respect for my future self's sanity, I am looking to build a small library of functions to convert ranges of columns as needed.
Goals of the initial function:
Convert from factor to numeric
Take column names as input, rather than numbers
The core of the function appears to work fine:
startNum <- match("Q7_1", names(rawNum))
endNum <- match("Q7_19", names(rawNum))
for(i in c(startNum:endNum)){
rawNum[,i] <- as.character(rawNum[,i])
rawNum[,i] <- as.numeric(rawNum[,i])
}
However, when I attempt to wrap this in a function, it falls apart. I believe the issue is in passing the *_col_name arguments into the function, but I can't seem to find where it is going wrong.
facToNum <- function(frame_name, start_col_name, end_col_name){
startNum <- match(start_col_name, names(frame_name))
endNum <- match(end_col_name, names(frame_name))
for(i in c(startNum:endNum)){
frame_name[,i] <- as.character(frame_name[,i])
frame_name[,i] <- as.numeric(frame_name[,i])
}
}
What am I missing here? I'm sure it's something obvious, but I've had to soldier on with my partial solution, and it rankles.
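I don't see any obvious mistakes in your function, except that you need to return the changed data frame at the end. The function works on a local copy of frame_name, so without a return value the caller's data frame is left unchanged.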
facToNum <- function(frame_name, start_col_name, end_col_name){
startNum <- match(start_col_name, names(frame_name))
endNum <- match(end_col_name, names(frame_name))
for(i in c(startNum:endNum)){
frame_name[,i] <- as.character(frame_name[,i])
frame_name[,i] <- as.numeric(frame_name[,i])
}
return(frame_name)
}
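You can also simplify your approach using the dplyr library. Wrapping the column names in all_of() tells dplyr that they are strings held in variables rather than column names typed directly.
library(dplyr)
facToNum <- function(frame_name, start_col_name, end_col_name){
  frame_name %>%
    mutate(across(all_of(start_col_name):all_of(end_col_name), ~as.numeric(as.character(.x))))
}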
df <- data.frame(a = factor(rnorm(5)), b = factor(runif(5)), c = 1:5)
str(df)
#'data.frame': 5 obs. of 3 variables:
# $ a: Factor w/ 5 levels "-1.64324479436782",..: 3 1 5 4 2
# $ b: Factor w/ 5 levels "0.11049344507046",..: 1 4 3 2 5
# $ c: int 1 2 3 4 5
result <- facToNum(df, "a", "b")
str(result)
#'data.frame': 5 obs. of 3 variables:
# $ a: num -0.11733 -1.64324 1.26266 0.00823 -0.9531
# $ b: num 0.11 0.614 0.545 0.228 0.902
# $ c: int 1 2 3 4 5
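For comparison, a base R variant can avoid the explicit loop by converting the whole column range with lapply(). This is only a sketch: the function name fac_to_num and the toy survey data frame are made up for illustration.

```r
# Convert a range of factor columns, given by name, to numeric
fac_to_num <- function(frame, start_col, end_col) {
  # Positions of the first and last column in the range
  idx <- match(start_col, names(frame)):match(end_col, names(frame))
  # Go through character so the labels, not the level codes, are converted
  frame[idx] <- lapply(frame[idx], function(x) as.numeric(as.character(x)))
  # Return the modified copy; the caller must assign it
  frame
}

# Toy data standing in for the survey export
survey <- data.frame(
  Q7_1 = factor(c("1", "2")),
  Q7_2 = factor(c("10", "20")),
  id   = c("a", "b")
)

survey <- fac_to_num(survey, "Q7_1", "Q7_2")
str(survey)
```

The result has to be assigned back (survey <- ...), because the function changes its own copy of the data frame, not the original.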
I have created a function, which works well on dummy data. But, when I run this function on real data, I've got back an error
Error in wilcox.test.formula(tab[[dependent]] ~ as.factor(tab$group), :
grouping factor must have exactly 2 levels
and warning messages:
In wilcox.test.default(x = c(11.2558701380866, 31.8401548036613, : cannot compute exact p-value with ties
So the thresholding in my function does not seem to split the real data into two groups correctly, and the subsetting of the real data is not correct either. But I don't understand why. The structure of the dummy and real tables seems the same:
Structure of dummy and real data:
Dummy:
> str(tab)
'data.frame': 80 obs. of 3 variables:
$ infGrad : num 14.15 12.53 3.03 9.21 16.36 ...
$ distance : int 1 1 1 1 1 1 1 1 1 1 ...
$ uniqueGroup: Factor w/ 2 levels "x","y": 1 2 1 2 1 2 1 2 1 2 ...
Real:
> str(tab)
'data.frame': 142 obs. of 10 variables:
$ distance : num 100 100 100 100 100 100 100 100 100 100 ...
$ infGrad : num 11.3 17.4 31.8 11.1 47.8 ...
$ uniqueGroup: Factor w/ 6 levels "x",..: 5 2 5 2 5 5 5 5 3 6 ...
I have found that NAs might cause these problems, or specification of formula of the wilcox.test(y ~ x).
So, I tried to add na.omit to my function, and instead of wilcox.test(y~x) use wilcox.test(y, x). None of these have worked.
Do you have any ideas how to make my function work or how to make it more robust to accept my real data? Your help is highly appreciated.
What the code does:
classify data in two groups by "moving threshold"
test statistical differences between those groups.
I run the function with nested lapply to vary my thresholds and different data subsets.
My dummy data:
set.seed(10)
infGrad <- c(rnorm(20, mean=14, sd=8),
rnorm(20, mean=13, sd=5),
rnorm(20, mean=8, sd=2),
rnorm(20, mean=7, sd=1))
distance <- rep(c(1:4), each = 20)
uniqueGroup <- rep(c("x", "y"), 40)
tab<-data.frame(infGrad, distance, uniqueGroup)
# Create moving threshold function
movThreshold <- function(th, tab, dependent, ...) {
tab<-na.omit(tab)
# Classify data
tab$group<- ifelse(tab$distance < th, "a", "b") # does not WORK on REAL data
# Calculate Wilcoxon test
test<-wilcox.test(tab[[dependent]] ~ as.factor(tab$group), # specify column name
data = tab)
# Put results in a vector
c(th, dependent, round(test$p.value, 3))
}
# Define two vectors to run through
# unique group
gr.list<-unique(tab$uniqueGroup)
# unique threshold
th.list<-c(2,3,4)
# apply function over thresholds and subsets
res<-lapply(gr.list, function(x) lapply(th.list,
movThreshold,
tab = tab[uniqueGroup == x,], # does not work on REAL data
dependent = "infGrad"))
What seems not working on real data:
Groups classification within the function
tab$group<- ifelse(tab$distance < th, "a", "b")
Data subsetting in nested lapply loop
subsetting: tab = tab[uniqueGroup == x,]
The issue probably arises when a threshold puts every observation into a single group.
You can reproduce the error, for instance, by adding a high value to th.list.
# unique threshold
th.list<-c(2,3,4,100)
The easiest way to avoid this is to check how many distinct values tab$group has before performing the test.
This change in the function should suffice:
movThreshold <- function(th, tab, dependent, ...) {
  tab <- na.omit(tab)
  # Classify data into two groups around the threshold
  tab$group <- ifelse(tab$distance < th, "a", "b")
  # Check there are two groups; skip the test otherwise
  if (length(unique(tab$group)) < 2) { return(NA) }
  # Calculate Wilcoxon test
  test <- wilcox.test(tab[[dependent]] ~ as.factor(tab$group), # specify column name
                      data = tab)
  # Put results in a vector
  c(th, dependent, round(test$p.value, 3))
}
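One further thing that may be worth checking on the real data: tab[uniqueGroup == x,] looks up uniqueGroup in the global environment rather than inside tab, so it only works while a standalone uniqueGroup vector of matching length exists, as it does in the dummy example. The sketch below subsets with tab$uniqueGroup instead and combines that with the guard. The name mov_threshold, the simulated data, and the choice to return a vector ending in NA (rather than a bare NA) are assumptions made for illustration.

```r
set.seed(10)

# Simulated data in the same shape as the dummy example
tab <- data.frame(
  infGrad     = rnorm(80, mean = 10, sd = 3),
  distance    = rep(1:4, each = 20),
  uniqueGroup = rep(c("x", "y"), 40)
)

mov_threshold <- function(th, tab, dependent) {
  tab <- na.omit(tab)
  # Classify rows into two groups around the threshold
  tab$group <- factor(ifelse(tab$distance < th, "a", "b"))
  # Guard: with fewer than two groups the test cannot run
  if (nlevels(tab$group) < 2) {
    return(c(th, dependent, NA))
  }
  test <- wilcox.test(tab[[dependent]] ~ tab$group)
  c(th, dependent, round(test$p.value, 3))
}

th.list <- c(2, 3, 4, 100)  # 100 puts every row into group "a"

res <- lapply(unique(tab$uniqueGroup), function(g) {
  # Refer to the column inside tab, not a global vector
  sub <- tab[tab$uniqueGroup == g, ]
  lapply(th.list, mov_threshold, tab = sub, dependent = "infGrad")
})
```

Returning a vector of the same length in both branches keeps the results easy to bind into a table afterwards.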
I am using pamr to run a prediction analysis of microarrays on my data. I tried an example from the package and it worked well, as follows.
x <- matrix(rnorm(1000*20),ncol=20)
y <- sample(c(1:4),size=20,replace=TRUE)
mydata <- list(x=x,y=y)
mytrain <- pamr.train(mydata)
123456789101112131415161718192021222324252627282930
mycv <- pamr.cv(mytrain,mydata)
1234Fold 1 :123456789101112131415161718192021222324252627282930
Fold 2 :123456789101112131415161718192021222324252627282930
Fold 3 :123456789101112131415161718192021222324252627282930
pamr.predict(mytrain, mydata$x , threshold=1)
[1] 1 3 1 2 1 3 2 2 4 3 2 1 4 2 3 1 2 1 2 4
Levels: 1 2 3 4
However, when I run the same code on my own data, I receive the following error:
z=read.table("shishi.txt",sep="\t",header=T)
mytrain <- pamr.train(z)
Error in 1:ncol(data$x) : argument of length 0
My data is arranged in the same format as the example in the package, as follows:
Does the error mean that there is nothing in a column? How can I deal with it? Thanks.
From "pamr" manual:
The input data. A list with components: x- an expression genes in the
rows, samples in the columns), and y- a vector of the class labels for
each sample. Optional components- genenames, a vector of gene names,
and geneid- a vector of gene identifiers.
In your example you created a list with these characteristics using mydata <- list(x=x,y=y), but you did not do so for your actual data.
After reading the table into R, with z=read.table("shishi.txt",sep="\t",header=T), you should create a list with mydata <- list(x=z,y=samplegroups), where samplegroups is your sample group vector.
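The error itself follows from this: z is a data frame with no column named x, so data$x is NULL, ncol(NULL) is NULL, and 1:NULL fails with "argument of length 0". A minimal sketch of building the list is shown below. The simulated z stands in for the contents of shishi.txt, and the labels in samplegroups are invented; it also assumes every column of z is a numeric sample column (a gene-name column, if present, would have to be removed first).

```r
set.seed(1)

# Invented stand-in for z <- read.table("shishi.txt", sep = "\t", header = TRUE):
# 50 genes in rows, 6 samples in columns, all numeric
z <- as.data.frame(matrix(rnorm(50 * 6), ncol = 6))

# Assumed class labels, one per sample (column) of z
samplegroups <- c(1, 1, 1, 2, 2, 2)

# pamr.train() looks for data$x and data$y, so wrap both in a list
mydata <- list(x = as.matrix(z), y = samplegroups)

# The number of labels should match the number of samples
stopifnot(ncol(mydata$x) == length(mydata$y))

# Train only if pamr is installed
if (requireNamespace("pamr", quietly = TRUE)) {
  mytrain <- pamr::pamr.train(mydata)
}
```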
I have the following code
anna.table<-data.frame (anna1,anna2)
write.table(anna.table, file="anna.file.txt",sep='\t', quote=FALSE)
my table in the end contains numbers such as the following
chr start end score
chr2 41237927 41238801 151
chr1 36976262 36977889 226
chr8 83023623 83025129 185
and so on......
After that I am trying to keep only the rows that meet some criterion, such as a score below a specific value,
so I am doing the following:
anna3<-"data/anna/anna.file.txt"
anna.total<-read.table(anna3,header=TRUE)
significant.anna<-subset(anna.total,score <=0.001)
Error: In Ops.factor(score, 0.001) <= not meaningful for factors
So I guess the problem is that my table has factors and not integers.
I guess that anna.total$score is a factor and I must make it an integer.
If I read correctly, as.numeric might solve my problem.
I am reading about the as.numeric function, but I cannot understand how to use it.
Could you please give me some advice?
thank you in advance
best regards
Anna
PS : i tried the following
anna3<-"data/anna/anna.file.txt"
anna.total<-read.table(anna3,header=TRUE)
anna.total$score.new<-as.numeric (as.character(anna.total$score))
write.table(anna.total,file="peak.list.numeric.v3.txt",append = FALSE ,quote = FALSE,col.names =TRUE,row.names=FALSE, sep="\t")
anna.peaks<-subset(anna.total,fdr.new <=0.001)
Warning messages:
1: In Ops.factor(score, 0.001) : <= not meaningful for factors
again i have the same problem......
With anna.table (it is a data frame by the way, a table is something else!), the easiest way will be to just do:
anna.table2 <- data.matrix(anna.table)
as data.matrix() will convert factors to their underlying integer codes. This works for a data frame that contains only numeric, integer, or factor variables, or others that can be coerced to numeric. Character columns do not survive as text either: according to the data.matrix() documentation they are first converted to factors and then to integers.
If you want anna.table2 to be a data frame, not a matrix, then you can subsequently do:
anna.table2 <- data.frame(anna.table2)
Other options are to coerce all factor variables to their integer levels. Here is an example of that:
## dummy data
set.seed(1)
dat <- data.frame(a = factor(sample(letters[1:3], 10, replace = TRUE)),
b = runif(10))
## sapply over `dat`, converting factor to numeric
dat2 <- sapply(dat, function(x) if(is.factor(x)) {
as.numeric(x)
} else {
x
})
dat2 <- data.frame(dat2) ## convert to a data frame
Which gives:
> str(dat)
'data.frame': 10 obs. of 2 variables:
$ a: Factor w/ 3 levels "a","b","c": 1 2 2 3 1 3 3 2 2 1
$ b: num 0.206 0.177 0.687 0.384 0.77 ...
> str(dat2)
'data.frame': 10 obs. of 2 variables:
$ a: num 1 2 2 3 1 3 3 2 2 1
$ b: num 0.206 0.177 0.687 0.384 0.77 ...
However, do note that the above will work only if you want the underlying numeric representation. If your factor has essentially numeric levels, then we need to be a bit cleverer in how we convert the factor to a numeric whilst preserving the "numeric" information coded in the levels. Here is an example:
## dummy data
set.seed(1)
dat3 <- data.frame(a = factor(sample(1:3, 10, replace = TRUE), levels = 3:1),
b = runif(10))
## sapply over `dat3`, converting factor to numeric
dat4 <- sapply(dat3, function(x) if(is.factor(x)) {
as.numeric(as.character(x))
} else {
x
})
dat4 <- data.frame(dat4) ## convert to a data frame
Note how we need to do as.character(x) first before we do as.numeric(). The extra call recovers the level labels as strings before we convert them to numeric. To see why this matters, note what dat3$a is
> dat3$a
[1] 1 2 2 3 1 3 3 2 2 1
Levels: 3 2 1
If we just convert that to numeric, we get the wrong data as R converts the underlying level codes
> as.numeric(dat3$a)
[1] 3 2 2 1 3 1 1 2 2 3
If we coerce the factor to a character vector first, then to a numeric one, we preserve the original information not R's internal representation
> as.numeric(as.character(dat3$a))
[1] 1 2 2 3 1 3 3 2 2 1
If your data are like this second example, then you can't use the simple data.matrix() trick as that is the same as applying as.numeric() directly to the factor and as this second example shows, that doesn't preserve the original information.
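Applied to the question, the conversion and the subset might look like the sketch below. The data frame is a made-up stand-in for anna.total with score stored as a factor, and the cutoff of 200 is chosen only so that the toy data returns a row; neither comes from the original post. as.numeric(levels(f))[f] is an equivalent idiom that converts the levels once and then indexes them by the codes.

```r
# Made-up stand-in for anna.total, with score read in as a factor
anna.total <- data.frame(
  chr   = c("chr2", "chr1", "chr8"),
  start = c(41237927, 36976262, 83023623),
  end   = c(41238801, 36977889, 83025129),
  score = factor(c("151", "226", "185"))
)

# Convert via the level labels, not the internal integer codes
anna.total$score <- as.numeric(as.character(anna.total$score))

# Equivalent idiom on a factor f: convert the levels, then index by code
f <- factor(c("151", "226", "185"))
same <- as.numeric(levels(f))[f]

# Comparisons now work because score is numeric
high.scores <- subset(anna.total, score >= 200)
```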
I know this is an older question, but I just had the same problem and maybe this helps:
In this case, it looks as if your score column should not have become a factor at all. That usually happens after read.table when the column contains text. Depending on which country you are from, maybe you write decimals with a "," and not with a ".". R then treats the column as character and makes it a factor. In that case Gavin's answer won't work, because R won't turn "123,456" into 123.456. You can easily fix that in a text editor by replacing "," with ".", though.
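If the decimal comma is indeed the cause, read.table can also handle it at import time through its dec argument, which avoids editing the file by hand. A minimal sketch, assuming a tab-separated file; the inline text below is invented to stand in for it:

```r
# Invented file contents with decimal commas in the score column
txt <- "chr\tscore\nchr2\t0,0005\nchr1\t0,2\n"

# dec = "," tells read.table to treat the comma as the decimal mark
peaks <- read.table(text = txt, header = TRUE, sep = "\t", dec = ",")

str(peaks$score)

# The column is numeric, so the comparison works
significant <- subset(peaks, score <= 0.001)
```

With a real file, the text = txt argument would be replaced by the file path.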