So I generated a random dataset online and I need to apply the C4.5 algorithm on it.
I installed the RWeka package and all its dependencies but I do not know how to execute it.
Can somebody help me with links to tutorials (anything apart from the RWeka documentation), or a sample C4.5 example in R so I can understand how it works?
Thank you
I think it would be worth your time to check out the caret package. It standardizes the syntax for most machine learning packages in R, including RWeka.
It also has a ton of really useful helper functions and a great tutorial on its website.
Here's the syntax for predicting Species on the iris dataset using the RWeka package with C4.5-like trees:
library(caret)

# createDataPartition() makes a stratified split (default p = 0.5)
set.seed(123)
train_rows <- createDataPartition(iris$Species, list = FALSE)
train_set <- iris[train_rows, ]
test_set  <- iris[-train_rows, ]

# method = 'J48' is Weka's C4.5 implementation, provided by RWeka
fit.rweka <- train(Species ~ ., data = train_set, method = 'J48')
pred <- predict(fit.rweka, newdata = test_set)
Then, if you want to try a gradient boosting machine or some other algorithm, just change the method argument, e.g. method='gbm'.
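For intuition, the stratified split that createDataPartition() performs is easy to sketch in base R. This is an illustration of the idea only, not caret's internals; the function name stratified_split is made up here:

```r
# A base-R sketch of a stratified split like createDataPartition():
# sample a proportion p of the row indices within each class, so the
# class proportions in the training set match the full data.
set.seed(42)
stratified_split <- function(y, p = 0.5) {
  unlist(lapply(split(seq_along(y), y), function(idx) {
    sample(idx, size = round(length(idx) * p))
  }), use.names = FALSE)
}

train_rows <- stratified_split(iris$Species, p = 0.5)
train_set <- iris[train_rows, ]
test_set  <- iris[-train_rows, ]

table(train_set$Species)  # 25 of each of the three species
```

Because the sampling happens per class, rare classes are represented in both halves, which matters for small or imbalanced datasets.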
I'm trying to run a boosted robust regression on Caret (with the Huber family), however I get an error when training the model:
library(caret)
X <- rnorm(300, 0, 100)
Y <- rnorm(300, 0, 100000)
data <- data.frame(X, Y)  # a data frame (not a matrix) for the formula interface
model <- train(Y~X, method="glmboost", data=data, family=Huber())
I get the error "could not find function Huber", even though this function is explicitly included in the mboost package (the one on which glmboost is based).
Any help would be really appreciated.
When you run train() with method="glmboost", caret loads the mboost package, but it does not attach mboost to your search path. Packages are discouraged from automatically attaching other packages, since doing so can introduce functions that conflict with ones you already have loaded, so most packages load their dependencies privately. If you fully qualify the function name with the package name, you can use it in your model:
model <- train(Y~X, method="glmboost", data=data, family=mboost::Huber())
Or you could also run library(mboost) to attach the package to your search path, so that you don't have to include the package-name prefix.
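The load-versus-attach distinction is easy to see with any package. A minimal demonstration using tools, which ships with base R and behaves the same way mboost does here:

```r
# Loading a namespace makes its functions reachable via '::' but does
# NOT put the package on the search path; attaching does both.
loadNamespace("tools")               # load only (what caret does for mboost)
"tools" %in% loadedNamespaces()      # TRUE: the namespace is loaded
"package:tools" %in% search()        # FALSE: not on the search path
tools::file_ext("model.rds")         # fully qualified call still works

library(tools)                       # now attach it
"package:tools" %in% search()        # TRUE: file_ext() is callable directly
```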
I want to build a bagged logistic regression model in R. My dataset is really imbalanced, with only 0.007% positive occurrences.
My thoughts to solve this was to use Bagged Logistic Regression. I came across the hybridEnsemble package in R. Does anyone have an example of how this package can be used? I searched online, but unfortunately did not find any examples.
Any help will be appreciated.
The way I would try to solve this is to use the h2o.stackedEnsemble() function in the h2o R package. You can automatically create more balanced classifiers by setting balance_classes = TRUE in all of the base learners. More information about how to use this function to create ensembles is in the Stacked Ensemble section of the H2O docs.
Also, using H2O will be way faster than anything that's written in native R.
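For reference, the core idea of bagged logistic regression can be sketched in plain base R: fit glm() on bootstrap resamples and average the predicted probabilities. This is a minimal illustration of the technique on simulated data, not the hybridEnsemble or h2o API:

```r
# Simulated binary outcome (illustrative data, not the asker's dataset)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + 1.5 * x))
d <- data.frame(x = x, y = y)

# Fit B logistic regressions, each on a bootstrap resample of the rows
bagged_glm <- function(data, B = 25) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    glm(y ~ x, family = binomial, data = boot)
  })
}

# Bagged prediction = average of the per-model predicted probabilities
predict_bagged <- function(models, newdata) {
  probs <- sapply(models, predict, newdata = newdata, type = "response")
  rowMeans(probs)
}

models <- bagged_glm(d)
p_hat <- predict_bagged(models, d)
```

With severe class imbalance you would typically also resample the minority class up (or the majority class down) inside each bootstrap draw, which is roughly what balance_classes = TRUE automates in H2O.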
I am working with the R package 'zoib' for performing beta regression in R. I am trying to replicate the example included on page 41 in the paper the package authors published in The R Journal:
Liu F and Kong Y. 2015. zoib: An R Package for Bayesian Inference for Beta Regression and Zero/One Inflated Beta Regression. The R Journal 7(2)
I believe I am using the exact same data and code that they use:
library(zoib)
data("GasolineYield", package="zoib")
GasolineYield$batch <- as.factor(GasolineYield$batch)
d <- GasolineYield
eg1.fixed <- zoib(yield ~ temp + as.factor(batch) | 1, data = GasolineYield,
                  joint = FALSE, random = 0, EUID = 1:nrow(d),
                  zero.inflation = FALSE, one.inflation = FALSE,
                  n.iter = 1050, n.thin = 5, n.burn = 50)
sample1 <- eg1.fixed$coeff
traceplot(sample1)
autocorr.plot(sample1)
gelman.diag(sample1)
However, I am getting an error when I try to do the diagnostic plots on the samples. This is the error message:
Error in ts(seq(from = start(x), to = end(x), by = thin(x)), start = start(x), :
invalid time series parameters specified
I cannot understand why the code isn't working or what I can do to fix the problem. I can trace the error to the time function which is called by zoib, and it seems like maybe it is a problem that the sample object does not have a tsp attribute, but the zoib package authors make it clear that their model output is meant to be used with coda, so I am very confused. I don't have much experience working with MCMC or time series objects, so maybe I am just missing something obvious. Can anyone explain why the example provided by the package authors is failing, and what the solution is?
I e-mailed the package author (Fang Liu) and she informed me that there was in fact a bug in the version of the package I have, but that the bug is fixed in the most recent version of zoib (Version 1.4.2). Using the most recent version, the code now works.
I am trying to fit a model to my training data with the caret package's classifiers, but it does not respond for a very long time (I have waited for two hours). On the other hand, it works for other datasets.
Here is the link to my training data: http://www.htmldersleri.org/train.csv (it is the well-known Reuters-21578 data set)
And the command I am using is:
model <- train(class ~ ., data = train, method = "knn")
Note: it gets stuck for any other method as well (e.g. svm, naive Bayes, etc.).
Note 2: With the e1071 package, the naiveBayes classifier works, but with 0.08% accuracy!
Can anyone tell me what can be the problem? Thanks in advance.
This looks like a multiclass classification problem. I'm not sure whether caret supports that, but I can show you how you would do the same thing with the mlr package:
library(mlr)
x <- read.csv("http://www.htmldersleri.org/train.csv")
tsk <- makeClassifTask(data = x, target = 'class')
# Assess the performance with 10-fold cross-validation
crossval('classif.knn', tsk)
If you want to know which learners are integrated in mlr that support this kind of task, type
listLearners(tsk)
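For intuition, the 10-fold cross-validation that crossval() performs can itself be sketched in base R. This is illustrative only, not mlr's internals; it uses iris and a hand-rolled 1-nearest-neighbour classifier so that no extra packages are needed:

```r
# 10-fold cross-validation from scratch: assign each row to a fold,
# then repeatedly train on 9 folds and score on the held-out fold.
set.seed(7)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))

acc <- sapply(1:k, function(f) {
  train <- iris[folds != f, ]
  test  <- iris[folds == f, ]
  # 1-nearest-neighbour by squared Euclidean distance on the 4 features
  pred <- apply(as.matrix(test[, 1:4]), 1, function(row) {
    dists <- colSums((t(as.matrix(train[, 1:4])) - row)^2)
    as.character(train$Species[which.min(dists)])
  })
  mean(pred == as.character(test$Species))
})

mean(acc)  # cross-validated accuracy, averaged over the 10 folds
```

crossval('classif.knn', tsk) does essentially this, plus proper task handling and performance measures.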
I've been using randomForest.getTree to give me a text representation of each tree in my forest but have recently switched to the caret package to train the forests (method='rf'). How can I either get an object that randomForest.getTree understands (since caret is allegedly using the same underlying code) or print out the trees in some other analogous way?
just figured it out:
library(caret)
library(randomForest)  # provides getTree()
.... #load data
model <- train(x, y, method = 'rf')
getTree(model$finalModel, k = 1, labelVar = TRUE)  # text view of tree 1