Hi everyone.
I have a problem: I need to implement kNN classification in R using leave-one-out (LOO) cross-validation. I've found the packages "knncat" and "loo" for this, and I've written the following code (without LOO):
library(knncat)
x <- c(1, 2, 3, 4)
y <- c(5, 6, 7, 8)
train <- data.frame(x, y)
x1 <- c(9, 10, 11, 12)
y1 <- c(13, 14, 15, 16)
test <- data.frame(x1, y1)
answer <- knncat(train, test, classcol = 2)
And I've got the error "Some "train" columns aren't present in "test"". I don't understand what I'm doing wrong. How can I fix this error?
Sorry if something's wrong with my English; I'm from Russia :)
Well, there are some problems with your approach and knncat:
You have to specify class labels for both the train and test data sets, and set classcol accordingly.
test must contain only class labels that also appear in train.
The column names of train and test must be the same, or knncat will throw the error you mentioned: "Some "train" columns aren't present in "test"".
Moreover, if you are using integer values as class labels, they have to start from zero, or knncat will throw another error: "Number in class 0 is 0! Abort!".
Here is a working example:
library(knncat)
train <- data.frame(x1 = 1:4, x2 = 5:8, y = c(0, 0, 1, 1))
test <- data.frame(x1 = 9:12, x2 = 13:16, y = c(1, 0, 0, 1))
knncat(train, test, classcol = 3)
With the result:
Test set misclass rate: 50%
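As a side note on the original LOO goal: if your features are all numeric, you don't need knncat at all; knn.cv from the class package (which ships with R) performs leave-one-out kNN directly. A minimal sketch using the training data above:

library(class)
train <- data.frame(x1 = 1:4, x2 = 5:8)
labels <- factor(c(0, 0, 1, 1))
# knn.cv classifies each row using all the other rows (leave-one-out)
loo_pred <- knn.cv(train, cl = labels, k = 1)
mean(loo_pred != labels)  # LOO misclassification rate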
I'm trying to construct a variogram cloud in R using the variogram function from the gstat package. I'm not sure if there's something about the topic that I've misunderstood, but surely I should get more than one observation, right? Here's my code:
library(dplyr)  # rename()
library(sp)     # coordinates()
library(gstat)  # variogram()
data = data.frame(matrix(c(2, 4, 8, 7, 6, 4, 7, 9, 4, 4, -1.01, .05, .47, 1.36, 1.18), nrow=5, ncol=3))
data = rename(data, X=X1, Y=X2, Z=X3)
coordinates(data) = c("X", "Y")
var.cld = variogram(Z ~ 1, data=data, cloud = TRUE)
And here's the output:
> var.cld
dist gamma dir.hor dir.ver id left right
1 1 0.0162 0 0 var1 5 4
I found the problem! Apparently the default value of the cutoff argument was too low for my specific set of data. Specifying a higher value resulted in additional observations.
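For reference, passing a cutoff large enough to cover all pairwise distances in the small data set above looks like this (the value 10 is just an illustrative upper bound, not from the original post):

var.cld = variogram(Z ~ 1, data = data, cloud = TRUE, cutoff = 10)
nrow(var.cld)  # now one row per point pair within the cutoff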
I want to iterate over a list of linear models and apply "clustered" standard errors to each model using the vcovCL function. My goal is to do this as efficiently as possible (I am running a linear model across many columns of a dataframe). My problem is trying to specify additional arguments inside of the anonymous function. Below I simulate some fake data. Precincts represent my cross-sectional dimension; months represent my time dimension (5 units observed across 4 months). The variable int is a dummy for when an intervention takes place.
df <- data.frame(
  precinct = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)
df[1:10, ]
library(sandwich)  # for vcovCL()
outcome <- df[3]
est <- lapply(outcome, FUN = function(x) { lm(x ~ as.factor(precinct) + as.factor(month) + int, data = df) })
se <- lapply(est, function(x) { sqrt(diag(vcovCL(x, cluster = ~ precinct + month))) })
I receive the following error message when adding the cluster argument inside of the vcovCL function.
Error in eval(expr, envir, enclos) : object 'x' not found
The only way around it, as far as I can tell, would be to index the dataframe directly (df$...) when specifying the clustering variables. Could this be achieved by passing df as an additional argument inside the function call? And is this code efficient?
Maybe specifying the model formula explicitly is a better way to go.
Any thoughts/comments are always helpful :)
Here is one approach that would retrieve clustered standard errors for multiple models:
library(sandwich)
# I am going to use the same model three times to get the "sequence" of linear models.
mod <- lm(crime ~ as.factor(precinct) + as.factor(month) + int, data = df)
# define function to retrieve standard errors:
robust_se <- function(mod) {sqrt(diag(vcovCL(mod, cluster = list(df$precinct, df$month))))}
# apply function to all models:
se <- lapply(list(mod, mod, mod), robust_se)
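If you'd rather not hard-code the cluster specification inside the function, you can also pass it through lapply's ... mechanism; a small variation on the above (robust_se2 is just an illustrative name):

# cl is forwarded by lapply to every call of robust_se2
robust_se2 <- function(mod, cl) sqrt(diag(vcovCL(mod, cluster = cl)))
se <- lapply(list(mod, mod, mod), robust_se2, cl = list(df$precinct, df$month))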
If you want to get the entire output adjusted, the following might be helpful:
library(lmtest)
adj_stats <- function(mod) {coeftest(mod, vcovCL(mod, cluster = list(df$precinct, df$month)))}
adjusted_models <- lapply(list(mod, mod, mod), adj_stats)
To address the multiple column issue:
In case you are struggling with running linear models over several columns, the following might be helpful. All the above would stay the same, except that you are passing your list of models to lapply.
First, let's use this dataframe here:
df <- data.frame(
  precinct = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  crime2 = rnorm(20, 10, 5),
  crime3 = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)
Let's define the outcome columns:
outcome_columns <- c("crime", "crime2", "crime3")
Now, let's run a regression with each outcome:
models <- lapply(outcome_columns,
                 function(outcome) {
                   # reformulate() builds the formula cleanly, without eval(parse(...))
                   lm(reformulate(c("as.factor(precinct)", "as.factor(month)", "int"),
                                  response = outcome), data = df)
                 })
And then you would just call
adjusted_models <- lapply(models, adj_stats)
Regarding efficiency:
The above code is efficient in that it is easily adjustable and quick to write up. For most use cases, it will be perfectly fine. For computational efficiency, note that your design matrix is the same in all cases, i.e. by precomputing the common elements (e.g. inv(X'X)*X'), you could save some computations. You would however lose out on the convenience of many built-in functions.
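As one concrete illustration of that last point (a sketch using the simulated df from above): base R's lm() accepts a matrix response, so all outcomes share a single decomposition of the common design matrix. Note you would still compute the clustered errors per outcome afterwards.

# One shared decomposition of the design matrix for all three outcomes
fit_all <- lm(cbind(crime, crime2, crime3) ~ as.factor(precinct) + as.factor(month) + int,
              data = df)
coef(fit_all)  # one coefficient column per outcome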
For the following minimization with objective function and constraints, the simplex function from the boot package returns an error:
Error in tab[-pr, ] <- tab[-pr, ] - (tab[-pr, pc]/pv) %o% tab[pr, ] :
NAs are not allowed in subscripted assignments
The code is here:
library(boot)
a = c(1, 1, 1)
A2 = rbind(c(2, 7.5, 3), c(20, 5, 10))
b2 = c(10000, 30000)
simplex(a=a, A2=A2, b2=b2, maxi=FALSE)
Note that the constraints are only of the greater-than-equal-to type.
But, if a bogus less-than-or-equal-to constraint is added as shown below, a valid result is returned.
A1 = c(1, 0, 0)
b1 = 1000000
simplex(a=a, A1=A1, b1=b1, A2=A2, b2=b2, maxi=FALSE)
The optimal solution to the second attempt is below:
Optimal solution has the following values
x1 x2 x3
1250 1000 0
The optimal value of the objective function is 2250.
Why is there an error with the first attempt? Is it not allowed to only use >= type of constraints? What is a good way to approach this problem?
I changed the value of the default eps, and it works for me.
The default is eps = 1e-10; try increasing it:
simplex(a=a, A2=A2, b2=b2, maxi=FALSE, eps=1e-6)
I'm having a lot of trouble figuring out how to correctly set num_class for xgboost.
I've got an example using the iris data:
library(xgboost)
df <- iris
y <- df$Species
num.class = length(levels(y))
levels(y) = 1:num.class
head(y)
df <- df[, 1:4]
y <- as.matrix(y)
df <- as.matrix(df)
param <- list("objective" = "multi:softprob",
"num_class" = 3,
"eval_metric" = "mlogloss",
"nthread" = 8,
"max_depth" = 16,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 1,
"colsample_bytree" = 1,
"min_child_weight" = 12)
model <- xgboost(param=param, data=df, label=y, nrounds=20)
This returns an error
Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) :
SoftmaxMultiClassObj: label must be in [0, num_class), num_class=3 but found 3 in label
If I change the num_class to 2 I get the same error. If I increase the num_class to 4 then the model runs, but I get 600 predicted probabilities back, which makes sense for 4 classes.
I'm not sure if I'm making an error or whether I'm failing to understand how xgboost works. Any help would be appreciated.
label must be in [0, num_class)
In your script, shift the labels to start at zero, e.g. add y <- as.numeric(y) - 1 before the model <- ... line.
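Spelled out with the iris example from the question (a sketch; param is the same list defined above):

# Corrected label construction: integer codes shifted to start at 0
y <- as.integer(iris$Species) - 1   # labels are now 0, 1, 2
df <- as.matrix(iris[, 1:4])
model <- xgboost(param = param, data = df, label = y, nrounds = 20)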
I ran into this rather weird problem as well. In my case it seemed to be a result of not properly encoding the labels.
First, using a string vector with N classes as the labels, I could only get the algorithm to run by setting num_class = N + 1. However, this result was useless, because I only had N actual classes and N+1 buckets of predicted probabilities.
I re-encoded the labels as integers and then num_class worked fine when set to N.
# Convert classes to integers for xgboost (requires data.table)
library(data.table)
class <- data.table(interest_level = c("low", "medium", "high"), class = c(0, 1, 2))
t1 <- merge(t1, class, by = "interest_level", all.x = TRUE, sort = FALSE)
and
param <- list(booster = "gbtree",
              objective = "multi:softprob",
              eval_metric = "mlogloss",
              # nthread = 13,
              num_class = 3,
              eta_decay = .99,
              eta = .005,
              gamma = 1,
              max_depth = 4,
              min_child_weight = .9,  # previously 1
              subsample = .7,
              colsample_bytree = .5
)
For example.
I was seeing the same error; my issue was that I was using an eval_metric that is only meant for multiclass labels while my data had binary labels. See eval_metric in the Learning Task Parameters section of the XGBoost docs for a list of all the options.
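For instance, for binary labels a consistent objective/metric pair would look something like this (values illustrative, not from the original post):

param <- list(objective = "binary:logistic",  # binary objective
              eval_metric = "logloss")        # matching binary metric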
I had this problem too, and it turned out I was trying to subtract 1 from my response, which was already coded as 0 and 1. Probably a novice mistake, but in case anyone else runs into this with a binary response variable that is already 0 and 1, it's something to note.
The tutorial said:
label = as.integer(iris$Species)-1
What worked for me (response is high_end):
label = as.integer(high_end)
This is probably pretty easy, but I'd like to know: how do I grab coefficients when using the zeroinfl command?
treatment <- factor(rep(c(1, 2), c(43, 41)),
                    levels = c(1, 2), labels = c("placebo", "treated"))
improved <- factor(rep(c(1, 2, 3, 1, 2, 3), c(29, 7, 7, 13, 7, 21)),
                   levels = c(1, 2, 3), labels = c("none", "some", "marked"))
numberofdrugs <- rpois(84, 2)
healthvalue <- rpois(84, 0.5)
y <- data.frame(healthvalue, numberofdrugs, treatment, improved)
require(pscl)
ZIP <- zeroinfl(healthvalue ~ numberofdrugs + treatment + improved, data = y)
summary(ZIP)
I usually use ZIP$coef[1] to grab a coefficient, but unfortunately here that grabs a whole bunch. So how can I grab one single coefficient from a ZIP model?
Use the coef extraction function to list all coefficients in one long vector, and then you can use single index notation to select them:
coef(ZIP)[1]
count_(Intercept)
0.1128742
Alternatively, first select which part of the model (count or zero-inflation) you want the coefficients from:
ZIP$coef$count[1]
(Intercept)
0.1128742
ZIP$coef[[1]][1]
(Intercept)
0.1128742
If you wanted to get fancy you could split the coefficients into a list:
clist <- function(m) {
  cc <- coef(m)
  # the model part is encoded as a name prefix, e.g. "count_" or "zero_"
  ptype <- gsub("_.+$", "", names(cc))
  ss <- split(cc, ptype)
  # strip the prefix from the names and return the renamed vectors
  lapply(ss, function(x) { names(x) <- gsub("^.*_", "", names(x)); x })
}
> clist(ZIP)
$count
(Intercept) numberofdrugs treatmenttreated improvedsome
-1.16112045 0.16126724 -0.07200549 -0.34807344
improvedmarked
0.23593220
$zero
(Intercept) numberofdrugs treatmenttreated improvedsome
7.509235 -14.449669 -58.644743 -8.060501
improvedmarked
58.034805
c1 <- clist(ZIP)
c1$count["numberofdrugs"]