I have a problem using the apply function in R. I wrote the following function:
TrainSupportVectorMachines <- function(trainingData,kernel,G,C){
#### train the model
fit<-svm(Device~.,data=trainingData,kernel=kernel,probability=TRUE,
gamma =G, costs=C)
return(fit);
}
I want to train the model with different values of cost (C). Therefore, I tried the following command:
cst = matrix(2^(-4:-2),ncol=3)
kernl = "sigmoid"
fitSVMBP <- apply(cst,2,function(x)TrainSupportVectorMachines(dtr1,kernl,0.625,x))
I expected fitSVMBP to become a list of SVM models fitted with the different cost values. Instead, I get a list of SVM models that all have a cost of 1.
Does anybody know what I am doing wrong?
EDIT:
I use the e1071 package.
And the dataset looks like:
> head(dtr1)
  Device Geslacht Leeftijd Invultijd Type Maanden.geleden
1     pc        M       45      16.0    A              15
2     pc        V       43      27.5    A               3
3     pc        V       28      16.0    A              15
4     pc        V       17      10.0    A              13
5     pc        M       56      16.0    A              15
6     pc        M       50      27.5    A               3
You have called the argument costs, not cost. Here's an example using the sample data from ?svm so you can try it:
model <- svm(Species ~ ., data = iris, cost=.6)
model$cost
# [1] 0.6
model <- svm(Species ~ ., data = iris, costs=.6)
model$cost
# [1] 1
R does partial matching (so in this case cos=.6 would work), but an overspecified argument like costs doesn't match.
Nor will it always complain if you give it an argument it doesn't expect:
> model <- svm(Species ~ ., data = iris, costs=.6, asjkdakjshd=1)
>
That's because unmatched arguments get caught by the ... argument.
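You can watch this happen with a toy function (f here is just for illustration, not part of e1071):
f <- function(cost = 1, ...) list(cost = cost, dots = names(list(...)))
f(costs = 0.6)
# $cost
# [1] 1
#
# $dots
# [1] "costs"
costs is not a prefix of cost, so it silently lands in ... and cost keeps its default.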
If you take the abbreviation too far, you get:
> model <- svm(Species ~ ., data = iris, c=.122)
Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
argument 4 matches multiple formal arguments
because c matches cost, coef0, class.weights and cachesize.
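So the fix for the original function is simply cost = C. Here's a sketch of the corrected version, using lapply over a plain vector instead of apply over a one-row matrix (dtr1 is the asker's data frame, so I haven't run this exact call):
library(e1071)
TrainSupportVectorMachines <- function(trainingData, kernel, G, C) {
  # 'cost', not 'costs', so the value is actually passed through
  svm(Device ~ ., data = trainingData, kernel = kernel,
      probability = TRUE, gamma = G, cost = C)
}
fitSVMBP <- lapply(2^(-4:-2), function(x)
  TrainSupportVectorMachines(dtr1, "sigmoid", 0.625, x))
sapply(fitSVMBP, function(m) m$cost)
# should now be 0.0625 0.1250 0.2500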
I have a dataset of 17 columns and 500000 rows. I want to predict values for 250000 rows of one of these columns, so my training dataset has 250000 rows. After dividing into training and testing sets, I ran "gbm" and "lm" models on the training set:
modellm <- train(DARAMAD ~ ., data = training, method = "lm", na.action = na.pass)
modelgbm <- train(DARAMAD ~ ., data = training, method = "gbm", na.action = na.omit)
The problem is that when I predict, I only receive a vector of 9976 elements, while I am trying to predict 250000 elements.
z <- predict(modelgbm, newdata = forPredict)
z <- predict(modellm, newdata = forPredict)
The forPredict and training datasets both have 250000 rows.
Your code didn't work for me, but I counted the NAs as follows:
naCountFunc <- function(x) sum(is.na(x))
naCount <- sapply(trainData, naCountFunc)
as.data.frame(table(naCount))
  naCount Freq
1       0   12
2       1    1
3     100    2
4  187722    1
5  188664    1
These two columns with high NA counts are not the one I want to predict; the "DARAMAD" column doesn't have any NAs.
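A likely culprit, assuming caret's default behaviour, is that predict.train drops rows containing NAs (na.action = na.omit by default), which would explain getting far fewer predictions than rows. As a sketch, passing na.pass instead keeps all 250000 rows, with NA predictions where inputs are missing:
z <- predict(modelgbm, newdata = forPredict, na.action = na.pass)
length(z)  # should now match nrow(forPredict)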
I have been trying to apply logistic regression (or any other ML algorithm) to this simple data set, but I have failed miserably and got many errors.
dim(data)
[1] 11580 12
head(data)
    ReturnJan   ReturnFeb   ReturnMar   ReturnApr    ReturnMay  ReturnJune
1  0.08067797  0.06625000  0.03294118  0.18309859  0.130333952 -0.01764234
2 -0.01067989  0.10211539  0.14549595 -0.08442804 -0.327300392 -0.35926605
3  0.04774193  0.03598972  0.03970223 -0.16235294 -0.147426982  0.04858934
4 -0.07404022 -0.04816956  0.01821862 -0.02467917 -0.006036217 -0.02530364
5 -0.03104575 -0.21267723  0.09147609  0.18933823 -0.153846154 -0.10611511
6  0.57980016  0.33225225 -0.40546095 -0.06000000  0.060732113 -0.21536106
And the 12th column, the one I am trying to predict, looks like this:
PositiveDec
0
0
0
1
1
1
Here is my attempt
new.data <- data[,-12] #Remove labels' column
index <- sample(1:nrow(new.data), size = 0.8*nrow(new.data))#Split data
train.data <- new.data[index,]
test.data <- new.data[-index,]
fit.glm <- glm(data[,12]~.,data = data, family = "binomial")
You are getting there, but you have several syntax errors and, as pointed out in the comments, you need to leave your outcome variable in. This should work:
index <- sample(1:nrow(data), size = 0.8 * nrow(data))
train.data <- data[index, ]
fit.glm <- glm(PositiveDec ~ ., data = train.data, family = "binomial")
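If you also want to score the held-out rows, here is a minimal sketch continuing from that split (the 0.5 cutoff is an arbitrary choice):
test.data <- data[-index, ]
probs <- predict(fit.glm, newdata = test.data, type = "response")
pred <- ifelse(probs > 0.5, 1, 0)
mean(pred == test.data$PositiveDec)  # simple accuracy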
I have a problem controlling the object types feeding into the predict function. Here's my simplified function that generates the glm object.
fitOneSample <- function(x, data, sampleSet)
{
  # how big of a set are we going to analyze? Pick a number between
  # 5,000 & 30,000, then select that many rows to study
  sampleIndices <- 1:5000
  # now randomly pick which columns to study
  colIndices <- 1:10
  xnames <- paste(names(data[, colIndices]), sep = "")
  formula <- as.formula(paste("target ~ ", paste(xnames, collapse = "+")))
  glm(formula, family = binomial(link = logit), data[sampleIndices, ])
}
myFit <- fitOneSample(1,data,sampleSet)
fits <- sapply(1:2,fitOneSample,data,sampleSet)
all.equal(myFit,fits[,1]) #different object types
#this works
probability <- predict(myFit,newdata = data)
#this doesn't
probability2 <- predict(fits[,1],newdata = data)
# Error in UseMethod("predict") :
# no applicable method for 'predict' applied to an object of class "list"
How do I access the element in fits[, 1] so that I can use the predict function to get the same result that I did with myFit?
I think I can now reproduce your situation.
fits <- sapply(names(trees),
function (y) do.call(lm, list(formula = paste0(y, " ~ ."), data = trees)))
This uses the built-in dataset trees as an example, fitting three linear models:
Girth ~ Height + Volume
Height ~ Girth + Volume
Volume ~ Height + Girth
Since we used sapply, and each iteration returns the same kind of lm object (a length-12 list), the results are simplified into a 12 * 3 matrix:
class(fits)
# "matrix"
dim(fits)
# 12 3
Matrix indexing fits[, 1] is valid.
If you check str(fits[, 1]), it almost looks like a normal lm object. But if you further check:
class(fits[, 1])
# "list"
Hmm, it does not have the "lm" class! As a result, S3 method dispatch fails when you call the generic function predict:
predict(fits[, 1])
#Error in UseMethod("predict") :
# no applicable method for 'predict' applied to an object of class "list"
This is a good example of how sapply can be destructive. We want lapply, or at least sapply(..., simplify = FALSE):
fits <- lapply(names(trees),
function (y) do.call(lm, list(formula = paste0(y, " ~ ."), data = trees)))
The result of lapply is easier to understand: it is a length-3 list, where each element is an lm object. We can access the first model via fits[[1]]. Now everything works:
class(fits[[1]])
# "lm"
predict(fits[[1]])
# 1 2 3 4 5 6 7 8
# 9.642878 9.870295 9.941744 10.742507 10.801587 10.886282 10.859264 10.957380
# 9 10 11 12 13 14 15 16
#11.588754 11.289186 11.946525 11.458400 11.536472 11.835338 11.133042 11.783583
# 17 18 19 20 21 22 23 24
#13.547349 12.252715 12.603162 12.765403 14.002360 13.364889 14.535617 15.016944
# 25 26 27 28 29 30 31
#15.628799 17.945166 17.958236 18.556671 17.229448 17.131858 21.888147
You can fix your code with:
fits <- lapply(1:2, fitOneSample, data, sampleSet)
probability2 <- predict(fits[[1]], newdata = data)
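And because fits is now an ordinary list of glm objects, you can score all of them in one go (a sketch reusing the question's data; type = "response" returns probabilities):
probabilities <- lapply(fits, predict, newdata = data, type = "response")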
What is the difference between type="class" and type="response" in the predict function?
For instance between:
predict(modelName, newdata=testData, type = "class")
and
predict(modelName, newdata=testData, type = "response")
Response gives you the numerical result while class gives you the label assigned to that value.
Response lets you determine your own threshold. For instance,
glm.fit = glm(Direction~., data=data, family = binomial, subset = train)
glm.probs = predict(glm.fit, test, type = "response")
In glm.probs we have numerical values between 0 and 1. Now we can choose the threshold value, say 0.6. Direction has two possible outcomes, "Up" or "Down".
glm.pred = rep("Down",length(test))
glm.pred[glm.probs>.6] = "Up"
type = "response" is used in glm models and type = "class" is used in rpart models(CART).
See:
predict.glm
predict.rpart
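For instance, here is a minimal rpart sketch on the built-in iris data showing both prediction types:
library(rpart)
fit <- rpart(Species ~ ., data = iris)
head(predict(fit, type = "class"))  # factor of predicted labels
head(predict(fit, type = "prob"))   # matrix of class probabilities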
See ?predict.lm:
predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. For type = "terms" this is a matrix with a column per term and may have an attribute "constant".
> d <- data.frame(x1=1:10,x2=rep(1:5,each=2),y=1:10+rnorm(10)+rep(1:5,each=2))
> l <- lm(y~x1+x2,d)
> predict(l)
        1         2         3         4         5         6         7         8         9        10
 2.254772  3.811761  4.959634  6.516623  7.664497  9.221486 10.369359 11.926348 13.074222 14.631211
> predict(l,type="terms")
x1 x2
1 -7.0064511 0.8182315
2 -5.4494620 0.8182315
3 -3.8924728 0.4091157
4 -2.3354837 0.4091157
5 -0.7784946 0.0000000
6 0.7784946 0.0000000
7 2.3354837 -0.4091157
8 3.8924728 -0.4091157
9 5.4494620 -0.8182315
10 7.0064511 -0.8182315
attr(,"constant")
[1] 8.442991
i.e. predict(l) equals the row sums of predict(l, type = "terms") plus the constant.
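A quick check of that identity, continuing from the code above:
tt <- predict(l, type = "terms")
all.equal(predict(l), rowSums(tt) + attr(tt, "constant"))
# [1] TRUE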
I want to fit a mixed model using the nlme package in R that is equivalent to the following SAS code:
proc mixed data = one;
class var1 var2 year loc rep;
model yld = var1 * var2;
random loc year(loc) rep*year(loc);
EDIT: Explanation of what the experiment is about
The same combinations of var1 and var2 were tested in replicates (rep, numbered 1:3). The replicates (rep) are considered random. This set of experiments is repeated over locations (loc) and years (year). Although replicates are numbered 1:3 within each location and year for convenience (they do not have names of their own), replicate 1 within one location and year has no correlation with replicate 1 within another location and year.
I tried the following code:
require(nlme)
fm1 <- lme(yld ~ var1*var2, data = one, random = loc + year / loc + rep * year / loc)
Is my code correct?
EDIT: Data and model based on suggestions
You can download the example data file from the following link:
https://sites.google.com/site/johndatastuff/mydata1.csv
data$var1 <- as.factor(data$var1)
data$var2 <- as.factor(data$var2)
data$year <- as.factor(data$year)
data$loc <- as.factor(data$loc)
data$rep <- as.factor(data$rep)
Following suggestions from the comments below:
fm1 <- lme(yld ~ var1*var2, data = data, random = ~ loc + year / loc + rep * year / loc)
Error in getGroups.data.frame(dataMix, groups) :
Invalid formula for groups
EXPECTED BASED ON SAS OUTPUT
Type 3 tests of fixed effects:
Effect      Num DF   Den DF   F value   Pr > F
var1*var2       14      238     16.12   <0.0001
Covariance parameters:
loc = 0, year(loc) = 922161, year*rep(loc) = 2077492, residual = 1109238
EDIT: Just for information, I tried the following model and am still getting errors:
require(lme4)
fm1 <- lmer(yld ~ var1*var2 + (1|loc) + (1|year / loc) + (1|rep : (year / loc)),
data = data)
Error in rep:`:` : NA/NaN argument
In addition: Warning message:
In rep:`:` : numerical expression has 270 elements: only the first used
Thanks for the more detailed information. I stored the data in d to avoid confusion with the data function and parameter; the commands work either way, but avoiding the name data is generally considered good practice.
Note that the interaction is hard to fit because of the lack of balance between var1 and var2; for reference, here are the crosstabs:
> xtabs(~var1 + var2, data=d)
var2
var1 1 2 3 4 5
1 18 18 18 18 18
2 0 18 18 18 18
3 0 0 18 18 18
4 0 0 0 18 18
5 0 0 0 0 18
Normally to just fit the interaction (and no main effects) you'd use : instead of *, but here it works best to make a single factor, like this:
d$var12 <- factor(paste(d$var1, d$var2, sep=""))
Then with nlme, try
fm1 <- lme(yld ~ var12, random = ~ 1 | loc/year/rep, data = d)
anova(fm1)
and with lme4, try
fm1 <- lmer(yld ~ var12 + (1 | loc/year/rep), data=d)
anova(fm1)
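To compare with the SAS covariance parameter estimates, VarCorr should work on either fit (a sketch; I haven't run it on this data):
VarCorr(fm1)  # estimated variance components plus residual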
Also note that because nlme and lme4 have overlapping function names, you should load only one of them at a time into your R session; to switch, close R and restart. (Other ways exist, but that's the simplest to explain.)
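(For completeness, one of those other ways is to detach the loaded package before loading the other, though this can misbehave if other attached packages depend on it:)
detach("package:lme4", unload = TRUE)  # assumes lme4 is the one loaded
library(nlme)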