Error when using cv.tree - r

Hi, I tried using the function cv.tree from the package tree. I have a binary categorical response (called Label) and 30 predictors. I fit a tree object using all predictors.
I got the following error message that I don't understand:
Error in as.data.frame.default(data, optional = TRUE) :
cannot coerce class "function" to a data.frame
The data is the file 'training' taken from this site.
This is what I did:
x <- read.csv("training.csv")
attach(x)
library(tree)
Tree <- tree(Label~., x, subset=sample(1:nrow(x), nrow(x)/2))
CV <- cv.tree(Tree,FUN=prune.misclass)

The error occurs when cv.tree calls model.frame. The 'call' element of the tree object must refer to a data frame by a name that is not also the name of a loaded function.
Thus, not only will subsetting in the call to tree generate the error when cv.tree later uses the 'call' element of the tree object, but using a data frame with a name like "df" would give an error as well, because model.frame will take this to be the name of an existing function (i.e. df, the density of the F distribution, from the stats package).
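The name clash is easy to check in a fresh session. This is a minimal sketch in base R only, assuming you have not defined an object called df yourself; safe_name is a hypothetical helper introduced here, not part of the tree package.

```r
# In a fresh R session, `df` is the F-distribution density from stats,
# so code that looks up the name `df` can find the function instead of your data.
clash <- is.function(stats::df)

# Coercing a function to a data frame fails; this is the kind of error reported above.
msg <- tryCatch(as.data.frame(stats::df),
                error = function(e) conditionMessage(e))

# Hypothetical helper: TRUE when `name` is not already the name of a visible function.
safe_name <- function(name) !exists(name, mode = "function")

safe_name("df")          # clashes with stats::df
safe_name("train_half")  # fine, assuming no function of that name is loaded
```

Picking a data frame name for which safe_name() returns TRUE, and doing any subsetting before the call to tree, avoids both problems described above.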

I think the problem is in the list of predictor variables. The following works, but I think you need to read the problem description more carefully. First, set up the formula without Weight.
x <- read.csv("training.csv")
vars<-setdiff(names(x),c("EventId","Label","Weight"))
fmla <- paste("Label", "~", vars[1], "+",
paste(vars[-c(1)], collapse=" + "))
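As an aside, the same formula can probably be built with reformulate() from the stats package, which returns a formula object rather than a character string. A small sketch on a made-up stand-in for training.csv (the column names follow the ones used in this thread, the values are invented):

```r
# Toy stand-in for training.csv; only the column roles matter here.
x <- data.frame(EventId      = 1:4,
                DER_mass_MMC = c(138.5, 160.9, 125.2, 89.7),
                DER_mass_vis = c(97.8, 103.2, 80.9, 36.5),
                Weight       = c(0.002, 2.2, 0.001, 5.4),
                Label        = factor(c("s", "b", "s", "b")))

# Predictors: everything except the id, the response and the weight
vars <- setdiff(names(x), c("EventId", "Label", "Weight"))

# reformulate() builds Label ~ var1 + var2 + ... as a formula object
fmla <- reformulate(vars, response = "Label")
fmla
```

A formula object can be passed anywhere the pasted string is used below.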
Here's what you've been running:
Tree <- tree(fmla, x, subset=sample(1:nrow(x), nrow(x)/2))
plot(Tree)
Instead, subset the data frame first and fit the tree on the subset:
urows = sample(1:nrow(x), nrow(x)/2)
x_sub <- x[urows,]
Tree <- tree(fmla, x_sub)
plot(Tree)
CV <- cv.tree(Tree,FUN=prune.misclass)
CV
$size
[1] 6 5 4 3 1
$dev
[1] 25859 25859 27510 30075 42725
$k
[1] -Inf 0.0 1929.0 2791.0 6188.5
$method
[1] "misclass"
attr(,"class")
[1] "prune" "tree.sequence"
You may want to consider package rpart also:
library(rpart)
tr <- rpart(fmla, data=x_sub, method="class")
printcp(tr)
Classification tree:
rpart(formula = fmla, data = x_sub, method = "class")
Variables actually used in tree construction:
[1] DER_mass_MMC DER_mass_transverse_met_lep
[3] DER_mass_vis
Root node error: 42616/125000 = 0.34093
n= 125000
        CP nsplit rel error  xerror      xstd
1 0.153733      0   1.00000 1.00000 0.0039326
2 0.059274      2   0.69253 0.69479 0.0035273
3 0.020016      3   0.63326 0.63582 0.0034184
4 0.010000      5   0.59323 0.59651 0.0033393
If you include Weight as a predictor, then that is the only split.
vars<-setdiff(names(x),c("EventId","Label"))

Related

R keras model eval/predict error: Error in do.call(object$evaluate, args) : 'what' must be a function or character string

I have a keras model (using R) with TensorFlow as the backend, fitted as follows:
history <- model %>% fit( train_X1,train_y,batch_size=100,
epochs=80,validation_split = 0.2,
shuffle=TRUE)
> class(model)
[1] "keras.models.Sequential" "keras.engine.training.Model"
[3] "keras.engine.topology.Container" "keras.engine.topology.Layer"
[5] "python.builtin.object"
The dimensions are as follows:
> dim(train_X1)
[1] 4893 64 64 1
> dim(train_y)
[1] 4893 12
I am trying to evaluate and predict, first on the training set and then on the test set.
model %>% evaluate(train_X1,train_y)
pred <- model %>% predict_classes(train_X1)
However, I get the following error when running the evaluate command:
Error in do.call(object$evaluate, args) : 'what' must be a function or character string
Any help will be appreciated. Thanks

R extract terminal node info from partykit decision tree

I have created a constparty decision tree (customized split rules) and print out the tree result. The result looks like this:
Fitted party:
[1] root
| [2] value.a < 1651: 0.067 (n = 1419, err = 88.6)
| [3] value.a >= 1651: 0.571 (n = 7, err = 1.7)
I am trying to extract the terminal node info
(the yval: 0.067 and 0.571;
the n in each node: 1419 and 7;
and the err: 88.6 and 1.7) and put it into a list together with the corresponding node IDs (2 and 3) so that I can use it later.
I have been looking into the partykit functions for a while and could not find one that would help me extract the info I just listed.
Could someone help me please? Thank you!
As usual there are several approaches to obtain the information you are looking for. The technical way for extracting the info stored in a particular node is to use nodeapply(object, ids, info_node) where info_node returns a list of information stored in the respective node.
However, nothing is stored in the terminal nodes of constparty objects. Instead, the whole distribution of the response by fitted node is stored and can be extracted with fitted(object). This returns a data frame with the observed (response), the (fitted) node, and the observation (weights) (if any). You can then use tapply() or aggregate() or something similar to compute node-wise means etc.
Alternatively, you can convert the constparty object to a simpleparty object which stores the printed information in the nodes and extract it.
A worked example for both strategies is a simple regression tree for the cars data:
library("partykit")
data("cars", package = "datasets")
ct <- ctree(dist ~ speed, data = cars)
Then you can easily compute node-wise means by
with(fitted(ct), tapply(`(response)`, `(fitted)`, mean))
## 3 4 5
## 18.20000 39.75000 65.26316
Of course, you can replace mean by any other summary statistic you are interested in.
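To collect the three quantities asked for (prediction, n and error) in a list keyed by node id, the same fitted() data frame can be split by node. This is only a sketch: it assumes an unweighted regression tree and takes "error" to be the within-node sum of squared deviations from the node mean; node_info is a name introduced here.

```r
library("partykit")
data("cars", package = "datasets")
ct <- ctree(dist ~ speed, data = cars)

fit <- fitted(ct)  # columns "(fitted)" (node id) and "(response)"

# One list entry per terminal node, named by node id
node_info <- lapply(
  split(fit[["(response)"]], fit[["(fitted)"]]),
  function(y) list(prediction = mean(y),
                   n          = length(y),
                   error      = sum((y - mean(y))^2))
)

node_info[["3"]]  # prediction 18.2, n 15, error 1176.4 for node 3
```

For weighted trees the (weights) column would have to enter the mean and the error, which this sketch does not do.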
The nodeapply() output for the simpleparty can be obtained with:
nodeapply(as.simpleparty(ct), ids = nodeids(ct, terminal = TRUE), info_node)
## $`3`
## $`3`$prediction
## [1] 18.2
##
## $`3`$n
## n
## 15
##
## $`3`$error
## [1] 1176.4
##
## $`3`$distribution
## NULL
##
## $`3`$p.value
## NULL
##
##
## $`4`
## $`4`$prediction
## [1] 39.75
## ...

User Based Recommendation in R

I am trying to do user-based recommendation in R using the recommenderlab package, but the model always returns zero (no) predictions.
My code is:
library("recommenderlab")
# Loading to pre-computed affinity data
movie_data<-read.csv("D:/course/Colaborative filtering/data/UUCF Assignment Spreadsheet_user_row.csv")
movie_data[is.na(movie_data)] <- 0
rownames(movie_data) <- movie_data$X
movie_data$X <- NULL
# Convert it as a matrix
R<-as.matrix(movie_data)
# Convert R into realRatingMatrix data structure
# realRatingMatrix is a recommenderlab sparse-matrix like data-structure
r <- as(R, "realRatingMatrix")
r
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
recom <- predict(rec, r["1648"], n=5)
recom
as(recom, "list")
I always get output like this:
as(recom, "list")
$`1648`
character(0)
I am using user-row data from this link:
https://drive.google.com/file/d/0BxANCLmMqAyIQ0ZWSy1KNUI4RWc/view
In that data, column A contains the user ID and every other column holds the ratings for one movie.
Thanks.
The line of code movie_data[is.na(movie_data)] <- 0 is the source of the error. For a realRatingMatrix (unlike a binaryRatingMatrix), movies that a user has not rated are expected to be NA values, not zeros. For example, the following code gives the correct predictions:
library("recommenderlab")
movie_data<-read.csv("UUCF Assignment Spreadsheet_user_row.csv")
rownames(movie_data) <- movie_data$X
movie_data$X <- NULL
R<-as.matrix(movie_data)
r <- as(R, "realRatingMatrix")
rec=Recommender(r,method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
recom <- predict(rec, r["1648"], n=5)
as(recom, "list")
# [[1]]
# [1] "X13..Forrest.Gump..1994." "X550..Fight.Club..1999."
# [3] "X77..Memento..2000." "X122..The.Lord.of.the.Rings..The.Return.of.the.King..2003."
# [5] "X1572..Die.Hard..With.a.Vengeance..1995."
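If the zeros have already been written into the ratings matrix, they can be turned back into NA before the conversion. A base R sketch on an invented 3 x 4 matrix, assuming 0 is never a genuine rating (the data here are on a 1 to 5 scale):

```r
# Hypothetical ratings: 3 users x 4 movies; 0 marks a cell that was NA before the fill
R <- matrix(c(5, 0, 3, 0,
              0, 4, 0, 2,
              1, 0, 0, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("u1", "u2", "u3"), paste0("m", 1:4)))

# Turn the placeholder zeros back into missing values
R[R == 0] <- NA

# Number of movies each user has actually rated
rated <- rowSums(!is.na(R))
rated
```

With recommenderlab loaded, as(R, "realRatingMatrix") can then be applied to the restored matrix as in the code above.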

ctree() - How to get the list of splitting conditions for each terminal node when the response variable is Categorical variable [duplicate]

I am trying to extract the tree information from the output of ctree. I tried the Class "BinaryTree" info but with no success. Any input is appreciated.
Thank You
The ctree objects are S4 objects, at least at the top level, and the tree information is in the "tree" slot. The "tree" slot can be accessed with the @ operator. If you take the first example on the help(ctree) page, you can get a graphical display with:
plot(airct)
You can then look at the branches of the tree by traversing it with list operations. The "leaves" of the tree are the nodes with "terminal"==TRUE:
> airct@tree$right$terminal
[1] FALSE
> airct@tree$left$terminal
[1] FALSE
> airct@tree$right$right$terminal
[1] TRUE
> airct@tree$right$left$terminal
[1] TRUE
> airct@tree$left$left$terminal
[1] TRUE
> airct@tree$left$right$terminal
[1] FALSE
Information at nodes above the leaves can also be recovered:
> airct@tree$left$right
4) Temp <= 77; criterion = 0.997, statistic = 11.599
5)* weights = 48
4) Temp > 77
6)* weights = 21
This is the same information that the nodes function will recover if you know the number of the node:
> nodes(airct,4)
[[1]]
4) Temp <= 77; criterion = 0.997, statistic = 11.599
5)* weights = 48
4) Temp > 77
6)* weights = 21
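The manual traversal can be automated with a small recursive function. The sketch below runs on a hand-built mock list that only mirrors the terminal, left and right fields used above; the nodeID field and the mock tree itself are assumptions for illustration, so check names(airct@tree) before relying on them.

```r
# Constructors for a mock node list (not party objects)
leaf  <- function(id) list(nodeID = id, terminal = TRUE, left = NULL, right = NULL)
inner <- function(id, left, right) {
  list(nodeID = id, terminal = FALSE, left = left, right = right)
}

# Same shape as the tree walked through above: leaves 3, 5, 6, 8, 9
tree <- inner(1,
              inner(2, leaf(3), inner(4, leaf(5), leaf(6))),
              inner(7, leaf(8), leaf(9)))

# Depth-first walk returning the ids of all terminal nodes
terminal_ids <- function(node) {
  if (isTRUE(node$terminal)) return(node$nodeID)
  c(terminal_ids(node$left), terminal_ids(node$right))
}

terminal_ids(tree)
```

For a fitted model the walk would start at the tree slot, e.g. terminal_ids(airct@tree), provided the node lists really carry those fields.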
The mlmeta R package converts fitted ctree models to SAS code. It can be easily adapted to other languages and is generally instructive on the internals of the object.
Let's say your ctree model is named ct. Then
print(ct)
worked for me to see the tree structure.

Get the most expressed genes from one .CEL file in R

In R the Limma package can give you a list of differentially expressed genes.
How can I simply get all the probesets whose signal intensity is highest relative to a threshold?
Can I get only the most expressed genes in a healthy experiment, for example from one .CEL file? Or the most expressed genes from a set of .CEL files of the same group (all of the control group, or all of the sample group)?
If you run the following script, everything is fine. You have many .CEL files and they all work.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GEOquery","affy","limma","gcrma"))
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
COMPRESSED_CELS_DIRECTORY <- gse_number
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "[gz]")
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)
But if you manually delete all the .CEL files except one and run the script from scratch, so that the celData object holds just 1 sample:
> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=
Then you'll get the error:
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
How can I get the most expressed genes from 1 .CEL sample file?
I've found a library that could be useful for my purpose: the panp package.
But, if you run the following script:
if(!require(panp)) { biocLite("panp") }
library(panp)
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)
you'll get an error:
> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : the argument has length zero
even though the platform of the GDS is the one expected by the library.
If you run pa.calls() with gcrma.ExpressionSet as the argument, then everything works:
my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.
In summary, if you run the script you'll get an error while executing:
my_pa <- pa.calls(eset)
and not while executing
my_pa <- pa.calls(gcrma.ExpressionSet)
Why, if they are both ExpressionSet objects?
> is(gcrma.ExpressionSet)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
> is(eset)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette
vignette("ExpressionSetIntroduction")
also available on the Biobase landing page. In particular the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So
> eset = gcrma.ExpressionSet ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
row col
213477_x_at 22779 24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"
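The same matrix can be filtered against a threshold for a single sample. A sketch on a mock matrix standing in for exprs(eset); the probe names, the values and the cutoff are all invented for illustration:

```r
# Mock expression matrix: probesets in rows, samples in columns
expr <- matrix(c(12.1, 3.4,  9.8, 7.2, 14.6,
                 11.7, 2.9, 10.3, 6.8, 13.9),
               ncol = 2,
               dimnames = list(paste0("probe_", 1:5),
                               c("GSM349767.CEL", "other.CEL")))

threshold <- 9  # hypothetical cutoff on the log2 scale

# Take one sample, keep the probesets above the cutoff, highest first
one_sample <- expr[, "GSM349767.CEL"]
top <- sort(one_sample[one_sample > threshold], decreasing = TRUE)
names(top)
```

For a group of samples, the same filter could be applied to rowMeans() over the columns of that group instead of a single column.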
Use justGCRMA() rather than ReadAffy as a faster and more memory efficient way to get to an ExpressionSet.
Consider asking questions about Bioconductor packages on the Bioconductor support site, where you'll get fast responses from knowledgeable members.
