Procedure of the OneR algorithm in R

I have used the OneR algorithm from the FSelector package to find the attribute with the lowest error rate. My class attribute takes the values yes and no, and the values of the predictor attributes are also yes and no.
The result of the OneR algorithm is:

Ranking no. 1, attribute name: OR1
            0 (attribute value)   1 (attribute value)
0 (class)   25243                 0
1 (class)   1459                  18
Error rate: 1459 (0 + 1459)

Ranking no. 2, attribute name: OR2
            0 (attribute value)   1 (attribute value)
0 (class)   25243                 0
1 (class)   1460                  17
Error rate: 1460 (0 + 1460)
However, if I use the correlation function on the same data frame, the best attributes have a lower error rate than the attributes I get from the oneR function.

Attribute name: CO4
            0 (attribute value)   1 (attribute value)
0 (class)   25204                 39
1 (class)   1348                  129
Error rate: 1387 (39 + 1348)

Can anybody tell me why the OneR algorithm does not show the CO4 attribute as the best attribute (based on the error rate)?
Which criteria does the OneR algorithm use?
--- Addition to better understand my question ---
The complete data set is too big to show here, so I have constructed a small example data pool that shows the same effect:

DELAYED  OR1  CO4  ..
1        1    1
0        0    0
0        0    1
1        0    1
0        0    0
1        0    1
0        0    0
1        0    1
The code to show the error rate for a single attribute:
print(table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1))
The code for the OneR function:
library(FSelector)
oneR_stackoverflow <- oneR(DELAYED~., datapool_stackoverflow)
subset_stackoverflow <- cutoff.k(oneR_stackoverflow, 2)
print(subset_stackoverflow)
The code for the correlation:
cor(as.numeric(datapool_stackoverflow$DELAYED), as.numeric(datapool_stackoverflow$OR1))
In this case the results are:

Error rate OR1:
            0 (attribute value)   1 (attribute value)
0 (class)   4                     0
1 (class)   3                     1
Manually calculated error rate: 3 (0 + 3)

Error rate CO4:
            0 (attribute value)   1 (attribute value)
0 (class)   3                     1
1 (class)   0                     4
Error rate: 1 (1 + 0)

Correlation:
Attribute OR1: 0.377
Attribute CO4: 0.77
OneR: "OR1", "CO4"

Why does the OneR function report the OR1 attribute as the best attribute for classification?

You haven't given the types of your data, but I'm assuming you have numerical values. FSelector discretizes these values before using them in oneR, and it seems that something goes wrong there (which may be a bug in RWeka's Discretize function). However, you probably want factor variables anyway rather than numeric data, since you only have 0/1 values. Then everything works fine for me:
> df = data.frame(delayed=factor(c(1,0,0,1,0,1,0,1)), or1 = factor(c(1,0,0,0,0,0,0,0)), co4 = factor(c(1,0,1,1,0,1,0,1)))
> library(FSelector)
> oneR(delayed~., df)
    attr_importance
or1       0.2000000
co4       0.4285714
As you can see, co4 now has a much higher importance than or1, as it should have.
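If your real data frame also stores the 0/1 values as numerics, converting the columns to factors before calling oneR() avoids the discretization step. A minimal sketch, using the column names from your example:
library(FSelector)
# convert every column of the data pool to a factor, then rank as before
datapool_stackoverflow[] <- lapply(datapool_stackoverflow, factor)
oneR_stackoverflow <- oneR(DELAYED ~ ., datapool_stackoverflow)
cutoff.k(oneR_stackoverflow, 2)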

OK, I have the solution. The algorithm calculates the sum of the error rates of the individual values of an attribute (each error rate relative to the number of cases with that value).
In this example:
Attribute OR1: 3/7 + 0/1 = 3/7
Attribute CO4: 0/3 + 1/5 = 0.2
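For reference, a small sketch of that calculation as described above (the helper name is made up, and whether FSelector computes exactly this quantity internally is not verified here):
one_r_error_sum <- function(class, attribute) {
  tab <- table(attribute, class)                            # rows: attribute values, columns: class
  per_value_error <- 1 - apply(tab, 1, max) / rowSums(tab)  # error rate within each attribute value
  sum(per_value_error)                                      # sum over the attribute values
}
one_r_error_sum(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1)  # 3/7 + 0/1 = 0.4285714
one_r_error_sum(datapool_stackoverflow$DELAYED, datapool_stackoverflow$CO4)  # 0/3 + 1/5 = 0.2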

No, CO4 should be chosen; choosing the other attribute is wrong. See what the OneR package (available on CRAN) gives:
> library(OneR)
> DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
> OR1 <- c(1, rep(0, 7))
> CO4 <- c(1, 0, 1, 1, 0, 1, 0, 1)
>
> data <- data.frame(DELAYED, OR1, CO4)
>
> model <- OneR(formula = DELAYED ~., data = data, verbose = T)
    Attribute Accuracy
1 * CO4       87.5%
2   OR1       62.5%
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'
> summary(model)
Rules:
If CO4 = 0 then DELAYED = 0
If CO4 = 1 then DELAYED = 1
Accuracy:
7 of 8 instances classified correctly (87.5%)
Contingency table:
        CO4
DELAYED     0   1 Sum
    0     * 3   1   4
    1       0 * 4   4
    Sum     3   5   8
---
Maximum in each column: '*'
Pearson's Chi-squared test:
X-squared = 2.1333, df = 1, p-value = 0.1441
>
> model_2 <- OneR(formula = DELAYED ~ OR1, data = data)
> summary(model_2)
Rules:
If OR1 = 0 then DELAYED = 0
If OR1 = 1 then DELAYED = 1
Accuracy:
5 of 8 instances classified correctly (62.5%)
Contingency table:
        OR1
DELAYED     0   1 Sum
    0     * 4   0   4
    1       3 * 1   4
    Sum     7   1   8
---
Maximum in each column: '*'
Pearson's Chi-squared test:
X-squared = 0, df = 1, p-value = 1
You can find more information about the OneR package here: https://github.com/vonjd/OneR
(full disclosure: I am the author of this package)

Why am I getting such strange t statistics and p values in the nls() output?

I'm running a nonlinear multiple regression with the nls() function with one dependent variable (Gross Primary Production (GPP)) and three independent variables (solar irradiance (RAD), Green Fractional Cover (GFC) and Volumetric Water Content (VWC)).
I'm trying to follow the model of Magnani et al. (2022) which is:
GPP = (F * α0 * RAD / (F + α0 * RAD)) * (A0 + A1 * GFC + A2 * VWC) + ε, where F, α0, A0, A1, A2 are the parameters to estimate.
This is the code I used:
nls.3 <- nls(GPP ~ (F * α0 * RAD / (F + α0 * RAD)) * (A0 + (A1 * GFC) + (A2 * VWC)),
             data = SCALED,
             start = list(F = -2.16, α0 = -0.031, A0 = 0.021, A1 = 7.31, A2 = 0.0024),
             control = nls.control(minFactor = 2^-148, warnOnly = TRUE, maxiter = 10000))
In this attempt I took as starting values the estimated parameters of the cited model (my data are from the same site, but the year is different).
This is the output I got:
Formula: GPP ~ (F * α0 * RAD/(F + α0 * RAD)) *
(A0 + (A1 * GFC) + (A2 * VWC))
Parameters:
Estimate Std. Error t value Pr(>|t|)
F -4.063e+00 2.488e+08 0 1
α0 -5.831e-02 3.571e+06 0 1
A0 2.508e-03 1.536e+05 0 1
A1 8.720e-01 5.341e+07 0 1
A2 2.864e-04 1.754e+04 0 1
Residual standard error: 1.003 on 278 degrees of freedom
Number of iterations till stop: 10000
Achieved convergence tolerance: 0.5849
Reason stopped: the number of iterations exceeded the maximum of 10000
------
Residual sum of squares: 279
------
t-based confidence interval:
------
Correlation matrix:
a b L d e
a 1 1 1 1 1
b 1 1 1 1 1
L 1 1 1 1 1
d 1 1 1 1 1
e 1 1 1 1 1
I have never seen output like this, with all the t statistics equal to 0 and all the p values equal to 1.
Can someone tell me what I'm doing wrong? Or is there another way to run this model?
Below is a sample of the head of my data (all the variables are standardized):
RAD GFC VWC GPP
1 -0.2491831 -1.0107985 1.4436443 0.3294411
2 -0.2171896 -0.8891009 -1.2268249 0.8456750
3 -0.1498026 0.9968661 -0.8714393 -0.4678534
4 0.2738084 -1.0062102 -1.6228261 0.3982723
5 -0.5789165 -0.6060990 -0.9932858 0.6449174
6 0.1203928 -0.6509521 -0.4957459 0.1057398
This is not a software problem. It is a fundamental problem with the model, as the model is not identifiable. If (F, a0, A0, A1, A2) is a solution, then replacing it with (F/s, a0/s, s*A0, s*A1, s*A2) gives the same fitted values for any non-zero scalar s. The answer obtained with the other software is somewhat misleading, as there are an infinite number of solutions.
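A quick numerical check of that invariance, using arbitrary illustrative parameter values (not fitted estimates):
rhs <- function(F, a0, A0, A1, A2, RAD, GFC, VWC) {
  (F * a0 * RAD / (F + a0 * RAD)) * (A0 + A1 * GFC + A2 * VWC)
}
RAD <- 0.3; GFC <- -1.0; VWC <- 1.4; s <- 5
rhs(-2.16, -0.031, 0.021, 7.31, 0.0024, RAD, GFC, VWC)
rhs(-2.16 / s, -0.031 / s, s * 0.021, s * 7.31, s * 0.0024, RAD, GFC, VWC)
# both calls return the same value; the scaling by s cancels out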
For example, make this transformation: a0 = b0*F and Ai = Bi/(b0*F) for i = 0, 1, 2. Then the right-hand side becomes the following expression in 4, rather than 5, parameters:
(RAD/(1 + b0 * RAD)) * (B0 + (B1 * GFC) + (B2 * VWC))
With the plinear algorithm of nls we only need to provide starting values for the parameters that enter non-linearly, so only b0 needs a starting value. To use it, provide a matrix on the right-hand side with one column per linear parameter, as shown below. You may need a different starting value.
nls(GPP ~ RAD/(1 + b0 * RAD) * cbind(B0 = 1, B1 = GFC, B2 = VWC),
data = SCALED, start = list(b0 = 1), algorithm = "plinear")

Ranger Predicted Class Probability of each row in a data frame

With regard to this link, Predicted probabilities in R ranger package, I have a question.
Imagine I have a mixed data frame df (comprising factor and numeric variables) and I want to do classification using ranger. I am splitting this data frame into train and test sets, Train_Set and Test_Set. BiClass is my target factor variable and has two levels, 0 and 1.
I want to calculate and attach class probabilities to the data frame using ranger using the following commands:
Biclass.ranger <- ranger(BiClass ~ ., data = Train_Set, num.trees = 500, importance = "impurity", save.memory = TRUE, probability = TRUE)
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 200, type='response', verbose = TRUE)$predictions)
The data frame probabilities is a data frame consisting of 2 columns (0 and 1) with number of rows equal to the number of rows in Test_Set.
Does it mean, if I append or attach this data frame, namely, probabilities to the Test_Set as the last two columns, it shows the probability of each row being either 0 or 1? Is my understanding correct?
My second question: when I attempt to calculate the confusion matrix with
pred = predict(Biclass.ranger, data=Test_Set, num.trees = 500, type='response', verbose = TRUE)
table(Test_Set$BiClass, pred$predictions)
I get the following error:
Error in table(Test_Set$BiClass, pred$predictions) :
all arguments must have the same length
What am I doing wrong?
For your first question: yes, it shows the probability of each row being 0 or 1. Using the example below:
library(ranger)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor",1,0))
Train_Set = data[idx,]
Test_Set = data[-idx,]
mdl <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)
probabilities <- as.data.frame(predict(mdl, data = Test_Set,type='response', verbose = TRUE)$predictions)
We can always check whether they agree:
par(mfrow=c(1,2))
boxplot(probabilities[,"0"] ~ Test_Set$Species,ylab="Prob 0",xlab="Actual label")
boxplot(probabilities[,"1"] ~ Test_Set$Species,ylab="Prob 1",xlab="Actual label")
Not the best plot, but sometimes if the labels are flipped you will see something weird. We need to find the column that has the maximum probability and assign the label; for this we do:
max.col(probabilities) - 1
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0
This goes through each row of probabilities and returns 1 or 2 depending on which column has the maximum probability; we simply subtract 1 to get 0/1 labels. (This is also why your table() call failed: with probability = TRUE, pred$predictions is a two-column matrix of class probabilities rather than a vector of class labels, so it does not have the same length as Test_Set$BiClass.) For the confusion matrix:
caret::confusionMatrix(table(max.col(probabilities) - 1,Test_Set$Species))
Confusion Matrix and Statistics
0 1
0 31 2
1 0 17
Accuracy : 0.96
95% CI : (0.8629, 0.9951)
No Information Rate : 0.62
P-Value [Acc > NIR] : 2.048e-08
In your case, you can just do:
confusionMatrix(table(max.col(probabilities)-1,Test_Set$BiClass))
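And for the first part of your question, attaching the probability columns to the test set is just a cbind(). A minimal sketch (the column names "0" and "1" come from the levels of the class factor):
Test_Set_with_probs <- cbind(Test_Set,
                             prob_0 = probabilities[, "0"],   # probability that the row is class 0
                             prob_1 = probabilities[, "1"])   # probability that the row is class 1
head(Test_Set_with_probs)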

Custom function to compute contrasts in emmeans

I want to create a custom contrast function in emmeans which could remove a given list of levels from the input vector and apply the built-in contrast method ("trt.vs.ctrl") on the remaining levels. An example dataset is available here. I am using the following R code for computing ANOVA and post hoc comparisons:
library(car)       # for Anova()
library(emmeans)   # for emmip(), emmeans(), contrast()

options(contrasts = c("contr.sum", "contr.poly"))
my_lm <- lm(D1 ~ C*R, data = df)
Anova(my_lm, type = "III")

# show interaction effects using emmeans
emmip(my_lm, C ~ R)
emm <- emmeans(my_lm, ~ C * R)
emm
contrast(emmeans(my_lm, ~ C * R), "consec", by = "C")

# compare the 1st level with the next 3 levels (how do I remove the other three levels?)
contrast(emmeans(my_lm, ~ C * R), "trt.vs.ctrl", by = "R")
The built-in contrast option ("trt.vs.ctrl") compares the first level with every level that follows it (there are 7 factor levels in C, and I want to remove the last 3 of them and compute the contrasts for the remaining 4). The official documentation provides an example of writing a custom contrast function:
skip_comp.emmc <- function(levels, skip = 1, reverse = FALSE) {
  if ((k <- length(levels)) < skip + 1)
    stop("Need at least ", skip + 1, " levels")
  coef <- data.frame()
  coef <- as.data.frame(lapply(seq_len(k - skip - 1), function(i) {
    sgn <- ifelse(reverse, -1, 1)
    sgn * c(rep(0, i - 1), 1, rep(0, skip), -1, rep(0, k - i - skip - 1))
  }))
  names(coef) <- sapply(coef, function(x)
    paste(which(x == 1), "-", which(x == -1)))
  attr(coef, "adjust") = "fdr"  # default adjustment method
  coef
}
However, due to my limited understanding, I am not sure where to apply the modifications needed to customise this example. Any ideas?
Is this something you are going to want to do lots of times in the future? My guess is not; you probably only want to do this once, or a few times at most, in which case it is way too much trouble to write a custom contrast function. Just get the contrast coefficients you need and use them as the second argument to contrast().
Now, consider these results:
> con <- emmeans:::trt.vs.ctrl.emmc(1:7)
> con
  2 - 1 3 - 1 4 - 1 5 - 1 6 - 1 7 - 1
1    -1    -1    -1    -1    -1    -1
2     1     0     0     0     0     0
3     0     1     0     0     0     0
4     0     0     1     0     0     0
5     0     0     0     1     0     0
6     0     0     0     0     1     0
7     0     0     0     0     0     1
From the description, I think you just want the first 3 sets of contrast coefficients. So use those columns:
contrast(emm, con[, 1:3], by = "R")
Update
StackOverflow can occasionally inspire developers to add software features. In this case, I decided it could be useful to add an exclude argument to most built-in .emmc functions in emmeans (all except poly.emmc()). This was fairly straightforward to do, and those features are now incorporated in the latest push to GitHub -- https://github.com/rvlenth/emmeans. These features will be included in the next CRAN update as well.
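With a version of emmeans that includes this change, the original goal should reduce to something like the following sketch (check ?contrast and the .emmc documentation in your installed version for the exact argument name; the indices 5:7 assume the last three levels of C are the ones to drop):
# drop the last three levels of C, then compare level 1 with levels 2-4 within each R
contrast(emmeans(my_lm, ~ C * R), "trt.vs.ctrl", by = "R", exclude = 5:7)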

How to use genetic algorithm for prediction correctly

I'm trying to use a genetic algorithm for a classification problem. However, I didn't succeed in getting a summary of the model or a prediction for a new data frame. How can I get the summary and the prediction for the new dataset?
Here is my toy example:
library(genalg)
dat <- read.table(text = " cats birds wolfs snakes
0 3 9 7
1 3 8 7
1 1 2 3
0 1 2 3
0 1 2 3
1 6 1 1
0 6 1 1
1 6 1 1 ", header = TRUE)
evalFunc <- function(x) {
if (dat$cats < 1)
return(0) else return(1)
}
iter = 100
GAmodel <- rbga.bin(size = 7, popSize = 200, iters = iter, mutationChance = 0.01,
elitism = T, evalFunc = evalFunc)
###########summary try#############
cat(summary.rbga(GAmodel))
# Error in cat(summary.rbga(GAmodel)) :
# could not find function "summary.rbga"
############# prediction try###########
dat$pred<-predict(GAmodel,newdata=dat)
# Error in UseMethod("predict") :
# no applicable method for 'predict' applied to an object of class "rbga"
Update:
After reading the answer given and reading this link:
Pattern prediction using Genetic Algorithm
I wonder how I can programmatically use the GA as part of a prediction mechanism. According to the link's text, one can use the GA to optimize a regression model or a neural network and then use the predict function they provide.
Genetic Algorithms are for optimization, not for classification. Therefore, there is no prediction method. Your summary statement was close to working.
cat(summary(GAmodel))
GA Settings
Type = binary chromosome
Population size = 200
Number of Generations = 100
Elitism = TRUE
Mutation Chance = 0.01
Search Domain
Var 1 = [,]
Var 0 = [,]
GA Results
Best Solution : 1 1 0 0 0 0 1
Some additional information is available from Imperial College London
Update in response to the updated question:
I see from the paper you mentioned how this makes sense. The idea is to use the genetic algorithm to optimize the weights of a neural network, then use the neural network for classification. This would be a big task, too big to cover here.
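That said, just to make the general idea concrete at toy scale (this only illustrates coupling a GA to a classifier, not the GA-plus-neural-network pipeline from the paper): let the chromosome encode which columns to use, let the fitness be the training error of a simple logistic regression on those columns, and then predict with the model built from the best chromosome. All helper names below are made up for this sketch.
library(genalg)

features <- c("birds", "wolfs", "snakes")

fitnessFunc <- function(x) {                 # x is a 0/1 vector: which features to include
  if (sum(x) == 0) return(1)                 # an empty subset gets the worst possible error
  f <- reformulate(features[x == 1], response = "cats")
  fit <- glm(f, data = dat, family = binomial)
  mean(round(fitted(fit)) != dat$cats)       # training misclassification rate (genalg minimizes this)
}

GAmodel <- rbga.bin(size = length(features), popSize = 50, iters = 20,
                    mutationChance = 0.05, elitism = TRUE, evalFunc = fitnessFunc)

best <- GAmodel$population[which.min(GAmodel$evaluations), ]   # best chromosome in the final population
final_fit <- glm(reformulate(features[best == 1], response = "cats"),
                 data = dat, family = binomial)
dat$pred <- as.numeric(fitted(final_fit) > 0.5)                # "predictions" from the selected feature subset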

lpsolveAPI in RStudio

I am using the lpSolveAPI package in RStudio. When I type the name of a model with only a few decision variables, I get a printout of the current constraints in the model. For example:
> lprec
Model name:
            COLONE  COLTWO  COLTHREE  COLFOUR
Minimize         1       3      6.24      0.1
THISROW          0   78.26         0      2.9  >=  92.3
THATROW       0.24       0     11.31        0  <=  14.8
LASTROW      12.68       0      0.08      0.9  >=     4
Type          Real    Real      Real     Real
Upper          Inf     Inf       Inf    48.98
Lower         28.6       0         0       18
But when I make a model that has more than 9 decision variables, it no longer gives the full summary and I instead see:
> lprec
Model name:
a linear program with 13 decision variables and 258 constraints
Does anyone know how I can see the same detailed summary of the model when there are large numbers of decision variables?
Bonus Question: Is RStudio the best console for working with R?
Here is an example:
>lprec <- make.lp(0,5)
This makes a new model called lprec, with 0 constraints and 5 variables. Even if you call the name now you get:
> lprec
Model name:
            C1    C2    C3    C4    C5
Minimize     0     0     0     0     0
Kind       Std   Std   Std   Std   Std
Type      Real  Real  Real  Real  Real
Upper      Inf   Inf   Inf   Inf   Inf
Lower        0     0     0     0     0
The C columns correspond to the 5 variables. Right now there are no constraints and the objective function is 0.
You can add a constraint with
>add.constraint(lprec, c(1,3,4,2,-8), "<=", 0)
This is the constraint C1 + 3*C2 + 4*C3 + 2*C4 - 8*C5 <= 0. Now the print out is:
Model name:
            C1    C2    C3    C4    C5
Minimize     0     0     0     0     0
R1           1     3     4     2    -8  <=  0
Kind       Std   Std   Std   Std   Std
Type      Real  Real  Real  Real  Real
Upper      Inf   Inf   Inf   Inf   Inf
Lower        0     0     0     0     0
Anyway the point is that no matter how many constraints, if there are more than 9 variables then I don't get the full print out.
>lprec <- make.lp(0,15)
>lprec
Model name:
a linear program with 15 decision variables and 0 constraints
Write it out to a file for examination
When I work with LPs using lpSolveAPI, I prefer to write them out to a file. The lp format works fine for my needs. I then examine the LP model using any text editor. If you click on the output file in the "Files" panel in RStudio, it will open it too, and you can inspect it.
write.lp(lprec, "lpfilename.lp", "lp") #write it to a file in LP format
You can also write it out as MPS format if you so choose.
Here's the help file on write.lp().
Hope that helps.
Since it is an S3 object of class lpExtPtr, the function called to display it is print.lpExtPtr. If you check its code, you will see that it displays the object differently depending on its size -- details for very big objects would not be very useful. Unfortunately, the threshold cannot be changed.
class(r)
# [1] "lpExtPtr"
print.lpExtPtr
# function (x, ...)
# {
#   (...)
#   if (n > 8) {
#     cat(paste("Model name: ", name.lp(x), "\n", "  a linear program with ",
#               n, " decision variables and ", m, " constraints\n",
#               sep = ""))
#     return(invisible(x))
#   }
#   (...)
You can access the contents of the object with the various get.* functions, as the print method does.
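For example, using the lprec model from the question (accessor names as listed in the lpSolveAPI reference manual; double-check them against your installed version):
dim(lprec)               # number of constraints and number of decision variables
get.bounds(lprec)        # lower and upper bounds of the variables
get.constr.type(lprec)   # direction of each constraint (<=, >=, =)
get.type(lprec)          # variable types (real, integer, ...)
get.column(lprec, 1)     # the nonzero coefficients in column 1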
Alternatively, you can just change the print method:
# A function to modify functions
patch <- function( f, before, after ) {
f_text <- capture.output(dput(f))
g_text <- gsub( before, after, f_text )
g <- eval( parse( text = g_text ) )
environment(g) <- environment(f)
g
}
# Sample data
library(lpSolveAPI)
r <- make.lp(0,5)
r # Shows the details
r <- make.lp(0,20)
r # Does not show the details
# Set the threshold to 800 variables instead of 8
print.lpExtPtr <- patch( print.lpExtPtr, "8", "800" )
r # Shows the details
