I'm trying to build a simple decision tree using C5.0 in R.
The data has 3 columns (including the target) and 14 rows.
This is my 'jogging' data. The target variable is 'CLASSIFICATION':
WEATHER JOGGED_YESTERDAY CLASSIFICATION
C N +
W Y -
Y Y -
C Y -
Y N -
W Y -
C N -
W N +
C Y -
W Y +
W N +
C N +
Y N -
W Y -
or as dput result:
structure(list(WEATHER = c("C", "W", "Y", "C", "Y", "W", "C",
"W", "C", "W", "W", "C", "Y", "W"), JOGGED_YESTERDAY = c("N",
"Y", "Y", "Y", "N", "Y", "N", "N", "Y", "Y", "N", "N", "N", "Y"
), CLASSIFICATION = c("+", "-", "-", "-", "-", "-", "-", "+",
"-", "+", "+", "+", "-", "-")), class = "data.frame", row.names = c(NA,
-14L))
jogging <- read.csv("Jogging.csv")
jogging #training data
library(C50)
jogging$CLASSIFICATION <- as.factor(jogging$CLASSIFICATION)
jogging_model <- C5.0(jogging[-3], jogging$CLASSIFICATION)
jogging_model
summary(jogging_model)
plot(jogging_model)
but it does not build any decision tree.
I thought it would produce 2 nodes (because there are 2 columns besides the target variable).
I want to know what's wrong :(
For this answer I will use a different tree-building package, partykit, simply because I am more used to it. Let's do the following:
jogging <- read.table(header = TRUE, text = "WEATHER JOGGED_YESTERDAY CLASSIFICATION
C N +
W Y -
Y Y -
C Y -
Y N -
W Y -
C N -
W N +
C Y -
W Y +
W N +
C N +
Y N -
W Y -",
stringsAsFactors = TRUE)
library(partykit)
ctree(CLASSIFICATION ~ WEATHER + JOGGED_YESTERDAY, data = jogging,
minsplit = 1, minbucket = 1, mincriterion = 0) |> plot()
That will plot the following tree:
That is a tree that uses up to three levels of splits and still does not find a perfect fit. The first split has a p-value of .2, indicating that there is not nearly enough data to justify even this first split, let alone those following it. A tree like this is very likely to massively overfit the data, so it would generalize poorly. That is why tree algorithms usually come with safeguards against overfitting, and in your case those safeguards prevent any tree from being grown. I disabled them with the minsplit, minbucket and mincriterion arguments in the ctree call.
So in short: you do not have enough data. Just predicting - all the time is the most reasonable thing a classification tree can do.
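To make the "just predict - all the time" point concrete, here is a minimal base-R sketch that computes the majority-class baseline on the 14 rows from the question. The variable names `class_counts`, `majority` and `baseline_accuracy` are introduced only for this illustration.

```r
# The 14 training rows from the question, typed in directly.
jogging <- data.frame(
  WEATHER = c("C", "W", "Y", "C", "Y", "W", "C",
              "W", "C", "W", "W", "C", "Y", "W"),
  JOGGED_YESTERDAY = c("N", "Y", "Y", "Y", "N", "Y", "N",
                       "N", "Y", "Y", "N", "N", "N", "Y"),
  CLASSIFICATION = factor(c("+", "-", "-", "-", "-", "-", "-",
                            "+", "-", "+", "+", "+", "-", "-"))
)

# How often does each class occur?
class_counts <- table(jogging$CLASSIFICATION)
print(class_counts)

# A model with no splits predicts the most frequent class for every row.
majority <- names(class_counts)[which.max(class_counts)]
baseline_accuracy <- max(class_counts) / nrow(jogging)
print(majority)
print(baseline_accuracy)

# Class counts within each level of the two predictors, for comparison.
print(table(jogging$WEATHER, jogging$CLASSIFICATION))
print(table(jogging$JOGGED_YESTERDAY, jogging$CLASSIFICATION))
```

With 9 of the 14 rows labelled -, the split-free model is already right 9 times out of 14, and any split has to justify itself against that baseline using only a handful of rows per branch.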
I am running this R script from Java (I am using Renjin):
getCoefficients <- function(x, y, Regions) {
nbRegions <- length(Regions)
lengthY <- length(y)
colRegions <- NULL
from <- 1
to <- lengthY / nbRegions
for (i in 1:nbRegions) {
region <- Regions[i]
c <- cbind(region, from:to)
colRegions <- rbind(colRegions, c)
}
indexCols <- as.data.frame(colRegions)
x_ <- t(x)
data_ <- cbind(indexCols, y, x_)
dataFrame <- data.frame(data_)
colnames(dataFrame) <- c("region", "date", "y", "a", "b", "c", "d", "e", "f", "r", "o")
print(colnames(dataFrame))
model_RE_S <- try(plm(y ~ x_, data = dataFrame, model = "random", index = c("region", "date"), effect = "twoways"))
summryModel <- summary(model_RE_S)
coeff <- as.numeric(summryModel$coefficients)
#print(summryModel)
return(coeff)
}
I am getting the following error and I have no idea how to resolve it:
Error : replacement has 3648 rows, data has 456
x is a 456×8 matrix, y is 456×1, and there are 3 regions.
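The wording of the error matches what base R reports when a value with more rows than the data frame is assigned to one of its columns, and 3648 is 456 times 8, which would fit the whole predictor matrix being treated as one long column. The sketch below reproduces that mechanism with made-up data and then builds the model formula from column names instead of putting a matrix in the formula. Whether this is what actually happens inside plm under Renjin is an assumption, not something verified here; `n_obs`, `pred_names` and `model_formula` are names introduced for the illustration.

```r
# Hypothetical stand-ins for the real inputs: 456 observations, 8 predictors.
set.seed(1)
n_obs <- 456
x_ <- matrix(rnorm(n_obs * 8), nrow = n_obs, ncol = 8)
df <- data.frame(y = rnorm(n_obs))

# Assigning all 3648 matrix entries as a single column triggers the message.
msg <- tryCatch({
  df$x <- as.vector(x_)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Alternative: give the predictor columns names and build the formula
# from those names, so no matrix appears inside the formula.
pred_names <- c("a", "b", "c", "d", "e", "f", "r", "o")
colnames(x_) <- pred_names
df <- cbind(df, as.data.frame(x_))
model_formula <- reformulate(pred_names, response = "y")
print(model_formula)
```

A formula built this way could then be tried in place of `y ~ x_` in the plm call, since the data frame in the question already carries the same eight column names.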
**Update: an alternative:**
I got rid of Renjin and used Rserve instead.
I am trying to simulate data and fit a model for choice-based conjoint analysis (mlogit). However, I am getting the error (Error in solve.default (H, g [! Fixed])). I believe it happens because I sometimes have more than one choice for the same respondent, but as I understand it this is one of the possibilities the model allows, so I could not find a solution. Can someone who has worked with discrete choice models and the mlogit package help me? Below is my code:
rm(list = ls())
cat("\014")
library(dplyr)
library(conjoint)
set.seed(0)
n <- 1000
#- create dummy data
data = expand.grid(Cor = c("black", "white"),
Brand = c("X", "Y"),
Price = c("low", "high"))
levn <- rbind("black", "white","X", "Y", "low", "high")
data$trat <- c("A", "B", "C", "D", "E", "F", "G", "H")
UA <- 7
UB <- 6.5
UC <- 6
UD <- 5.5
UE <- 5
UF <- 4.5
UG <- 4
UH <- 3.5
data$utility <- c(UA, UB, UC, UD, UE, UF, UG, UH)
data <- bind_rows(replicate(n, data, simplify = FALSE))
erro <- rnorm(n)
data$erro <- erro
data$determinist <- floor(rowSums(data[,5:6]))
data$id <- rep(1:n, each = 8)
data <- data %>% group_by(id) %>% mutate(determinist = (determinist == max(determinist)))
data$choice <- ifelse(data$determinist=="TRUE",1,0)
library(mlogit)
cbc.mlogit <- mlogit.data(data=data, choice="choice", shape="long", varying=1:3, alt.levels=paste("pos", 1:8), id.var="id")
cbc.ml <- mlogit(choice ~ 0 + Cor + Brand + Price, data = cbc.mlogit)
Using the nnet package I got a solution. However, I don't think it's correct, because in choice-based conjoint analysis the intercept is zero.
library("nnet")
model <- multinom(choice ~ Cor + Brand + Price, data = cbc.mlogit)
summary(model)
I have already searched other posts and could not solve it. Any help is welcome!
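The suspicion that some respondents end up with more than one chosen alternative can be checked directly before fitting anything. Below is a small base-R sketch on toy long-format data (the values are hypothetical, not the simulated data above) that counts the chosen alternatives per respondent and lists the ids where the count is not exactly one.

```r
# Toy long-format data: 3 respondents, 4 alternatives each (made-up values).
toy <- data.frame(
  id     = rep(1:3, each = 4),
  choice = c(1, 0, 0, 0,   # respondent 1: exactly one alternative chosen
             1, 1, 0, 0,   # respondent 2: a tie, two alternatives marked
             0, 0, 0, 1)   # respondent 3: exactly one alternative chosen
)

# Number of alternatives marked as chosen by each respondent.
n_chosen <- tapply(toy$choice, toy$id, sum)
print(n_chosen)

# Respondents whose choice set does not contain exactly one chosen alternative.
bad_ids <- names(n_chosen)[n_chosen != 1]
print(bad_ids)
```

Running the same two lines on `data$choice` and `data$id` from the simulation would show how many respondents have ties after the `floor()` step.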
I found this awesome tutorial on changing the x and y axes labels in ggplot and used it successfully on bar charts and scatterplots:
http://www.moeding.net/archives/32-Metric-prefixes-for-ggplot2-scales.html#comments
I would like to extend this capability to this 'heat map' that I have developed as well. Here is the code:
library(plot3D)
format_si <- function(x,...) {
# Format a vector of numeric values according
# to the Semi-International System of Units.
#
# Based on code by Ben Tupper
# https://stat.ethz.ch/pipermail/r-help/2012-January/299804.html
# Args:
# ...: Args passed to format()
function(x) {
limits <- c(1e-24, 1e-21, 1e-18, 1e-15, 1e-12,
1e-9, 1e-6, 1e-3, 1e0, 1e3,
1e6, 1e9, 1e12, 1e15, 1e18,
1e21, 1e24)
prefix <- c("y", "z", "a", "f", "p",
"n", "µ", "m", "", "k",
"M", "B", "T", "P", "E",
"Z", "Y")
# Vector with array indices according to position in intervals
i <- findInterval(abs(x), limits)
# Set prefix to "" for very small values < 1e-24
i <- ifelse(i==0, which(limits == 1e0), i)
return(paste(format(round(x/limits[i], 1),
trim=TRUE, scientific=FALSE, ...),
prefix[i], sep=""))
}
}
#Generate some data
xs<-rnorm(50000)*100000
ys<-rnorm(50000)*100000
zs<-sin(xs)*10000000
df<-as.data.frame(cbind(xs,ys,zs))
#Heat Map
df$x<-cut(df$xs, breaks=50, labels=FALSE)
df$y<-cut(df$ys, breaks=50, labels=FALSE)
df.max<-expand.grid(x=1:50, y=1:50)
df.max<-merge(df.max, aggregate(zs~x+y, df, max), all.x=TRUE)
z<-t(matrix(df.max$zs, nr=50, nc=50))
x.values <- min(df$xs)+(0:49)*diff(range(df$xs))/50
y.values <- min(df$ys)+(0:49)*diff(range(df$ys))/50
image2D(x=x.values, y=y.values, z, rasterImage = TRUE, contour = list(lwd = 2, col = jet.col(11)),
main="Random Data", xlab="random xs",ylab="random ys", clab="zs=sin(xs)", ylim=c(0,max(df$ys)*1.05),
xlim=c(0,max(df$xs)*1.05))
In ggplot, I would just add scale_y_continuous(labels=format_si()) to my plot and the axis labels would be 1M instead of 1.0e6 or 1000000. I have tried turning the labels off inside image2D and then using axes(1, format_si()), but that doesn't work either. Has anybody tried this before?
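In base graphics, the usual pattern is to suppress the default axes and draw them again with `axis()`, passing the tick positions and pre-formatted label strings. The sketch below shows that pattern with base `image()` and a shortened, hypothetical formatter `si_label()`; that `image2D()` from plot3D accepts `axes = FALSE` in the same way is an assumption that should be checked against its documentation.

```r
# Shortened SI-style formatter (hypothetical helper, positive prefixes only).
si_label <- function(x) {
  limits <- c(1e0, 1e3, 1e6, 1e9)
  prefix <- c("", "k", "M", "B")
  i <- findInterval(abs(x), limits)
  i[i == 0] <- 1  # values below 1 get no prefix
  paste0(format(round(x / limits[i], 1), trim = TRUE, scientific = FALSE),
         prefix[i])
}

# Made-up gridded data on a large numeric scale.
x <- seq(0, 3e5, length.out = 50)
y <- seq(0, 3e5, length.out = 50)
z <- outer(x, y, function(a, b) sin(a / 5e4) + cos(b / 5e4))

# Draw the heat map without axes, then add axes with formatted labels.
image(x, y, z, axes = FALSE, xlab = "random xs", ylab = "random ys")
x_ticks <- pretty(x)
y_ticks <- pretty(y)
axis(1, at = x_ticks, labels = si_label(x_ticks))
axis(2, at = y_ticks, labels = si_label(y_ticks))
box()
```

The key difference from the ggplot approach is that `axis()` expects label strings rather than a labelling function, so the formatter has to be applied to the tick positions first.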
My question "multiplying numbers and symbols in R" was answered, and here I would like to give an example of using this for quaternion multiplication. Actually, I am using this on a much larger set (a group of 256 elements), but the principle is the same. I'm very new to working with data.tables, so any additional tips are appreciated.
library(data.table)
groupMult = data.table(
e = c("i","j","k", "e"),
i = c("-e","-k","j", "i"),
j = c("k","-e","-i", "j"),
k = c("-j","i","-e", "k")
);
row.names(groupMult) = c("i", "j", "k", "e");
setkey(groupMult);
# Find X*Y with X = 2i - 3j, Y = k - 4e
X = data.table(i = 2, j = -3);
Y = data.table(k = 1, e = -4);
# reduce groupMult to the vectors we need for multiplication
multMa = groupMult[names(X), names(Y), with = F];
# repeat values of Y ncol(X) times
multY = Y[rep(seq_len(nrow(Y)), each=ncol(X)),];
# repeat values of X ncol(Y) times
multX = t(X[rep(seq_len(nrow(X)), each=ncol(Y)),]);
# coefficient matrix
multMaNum = multY*multX;
row.names(multMaNum) = names(X);
# elementwise multiplication of multMaNum with multMa
res = mapply(paste, multMaNum, multMa, MoreArgs=list(sep='*') )
res[] <- sapply(res , function(x) sub("(.*)([-])(.*)", "\\2\\1\\3", x));
# collapse all elements of the data.table to get final result
res = paste(lapply(res, paste, collapse = " "), collapse = " + ");
> res
[1] "-2*j + -3*i + -8*i + 12*j"
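As one possible alternative to pasting strings, the same product can be computed numerically by storing the multiplication table as two matrices: one with the resulting basis element and one with its sign. The sketch below uses base R only and is not the data.table approach from the post; the names `prod_elem`, `prod_sign` and `quat_mult` are introduced for illustration.

```r
basis <- c("e", "i", "j", "k")

# prod_elem[a, b] is the basis element of a * b, prod_sign[a, b] is its sign.
prod_elem <- matrix(c("e", "i", "j", "k",
                      "i", "e", "k", "j",
                      "j", "k", "e", "i",
                      "k", "j", "i", "e"),
                    nrow = 4, byrow = TRUE, dimnames = list(basis, basis))
prod_sign <- matrix(c(1,  1,  1,  1,
                      1, -1,  1, -1,
                      1, -1, -1,  1,
                      1,  1, -1, -1),
                    nrow = 4, byrow = TRUE, dimnames = list(basis, basis))

# Multiply two elements given as named coefficient vectors.
quat_mult <- function(x, y) {
  out <- setNames(numeric(length(basis)), basis)
  for (a in names(x)) {
    for (b in names(y)) {
      target <- prod_elem[a, b]
      out[target] <- out[target] + x[[a]] * y[[b]] * prod_sign[a, b]
    }
  }
  out
}

# X = 2i - 3j, Y = k - 4e, as in the example above.
X <- c(i = 2, j = -3)
Y <- c(k = 1, e = -4)
res_num <- quat_mult(X, Y)
print(res_num)
```

This collects like terms automatically: the four string terms -2*j, -3*i, -8*i and 12*j become the coefficients -11 for i and 10 for j. For a larger group the two matrices grow, but the function stays the same.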
I want to assign subjects to classes based on probabilities that I provide. I will be doing this in a variety of cases, with different values. Sometimes, I want the probability of a particular class to be 0. I've been using
classlist <- cut(runif(p), c(0, pdrop, ptitrate, pcomplete, pnoise, 1), labels = c("D", "T", "C", "N", "O"))
but this fails when two of the p variables are the same. I could make them different by minimal amounts, e.g. pdrop = .2 and ptitrate = .200001, but is there some better way?
Thanks
Peter
I suggest sample():
> p <- 100
> groups <- c("D", "T", "C", "N", "O")
> probVec <- c(0.2, 0.2, 0.3, 0.25, 0.05)
> classlist <- factor(sample(groups, size=p, replace=TRUE, prob=probVec))
> table(classlist)
classlist
C D N O T
26 16 28 5 25
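The case from the question, where one class should have probability 0, works with the same call. A minimal sketch, with a seed and probability values chosen only for illustration; fixing `levels = groups` keeps the empty class visible in the table.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

p <- 1000
groups <- c("D", "T", "C", "N", "O")
# Class "T" gets probability 0, so it should never be drawn.
probVec <- c(0.2, 0, 0.3, 0.45, 0.05)

classlist <- factor(
  sample(groups, size = p, replace = TRUE, prob = probVec),
  levels = groups  # keep all five classes as levels, even the empty one
)
print(table(classlist))
```

Unlike the `cut()` approach, this needs no cumulative break points, so equal or zero probabilities cause no trouble.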