I'm attempting to implement XGBoost on a Methyl450k data set. The data has approximately 480,000 CpG sites with beta values between 0 and 1. Here is a look at the data (a sample of 10 columns plus the response):
cg13869341 cg14008030 cg12045430 cg20826792 cg00381604 cg20253340 cg21870274 cg03130891 cg24335620 cg16162899 response
1 0.8612869 0.6958909 0.07918330 0.10816711 0.03484078 0.4875475 0.7475878 0.11051578 0.7120003 0.8453396 0
2 0.8337106 0.6276754 0.09811698 0.08934333 0.03348864 0.6300766 0.7753453 0.08652890 0.6465146 0.8137132 0
3 0.8516102 0.6575332 0.13310207 0.07990076 0.04195286 0.4325115 0.7257208 0.14334007 0.7384455 0.8054013 0
4 0.8970384 0.6955810 0.08134887 0.08950676 0.03578006 0.4711689 0.7214661 0.08299838 0.7718571 0.8151683 0
5 0.8562323 0.7204416 0.08078766 0.14902533 0.04274820 0.4769631 0.8034706 0.16473891 0.7143823 0.8475410 0
6 0.8613325 0.6527599 0.10158672 0.15459204 0.04839691 0.4805285 0.8004808 0.12598627 0.8218743 0.8222552 0
7 0.9168869 0.5963966 0.11457045 0.13245761 0.03720798 0.5067649 0.6806004 0.13601034 0.7063457 0.8509160 0
8 0.9002366 0.6898320 0.07029171 0.07158694 0.03875135 0.7065322 0.8167016 0.15394095 0.7226098 0.8310477 0
9 0.8876504 0.6172154 0.13511072 0.15276686 0.06149520 0.5642073 0.7177438 0.14752285 0.6846876 0.8360360 0
10 0.8992898 0.6361644 0.15423780 0.19111275 0.05700406 0.4941239 0.7819968 0.10109936 0.6680640 0.8504023 0
11 0.8997905 0.5906462 0.10411472 0.15006796 0.04157008 0.4931531 0.7857664 0.13430963 0.6946644 0.8326747 0
12 0.9009607 0.6721858 0.09081460 0.11057752 0.05824153 0.4683763 0.7655608 0.01755990 0.7113345 0.8346149 0
13 0.9036750 0.6313643 0.07477824 0.12089404 0.04738597 0.5502747 0.7520128 0.16332395 0.7036665 0.8564414 0
14 0.8420276 0.6265071 0.15351674 0.13647090 0.04901864 0.5037902 0.7446693 0.10534171 0.7727812 0.8317943 0
15 0.8995276 0.6515500 0.09214429 0.08973162 0.04231420 0.5071999 0.7484940 0.21822470 0.6859165 0.7775508 0
16 0.9071643 0.7945852 0.15809474 0.11264440 0.04793316 0.5256078 0.8425513 0.17150603 0.7581367 0.8271037 0
17 0.8691358 0.6206902 0.11868549 0.15944891 0.03523320 0.4581166 0.8058461 0.11557264 0.6960848 0.8579109 1
18 0.8330247 0.7030860 0.12832663 0.12936172 0.03534059 0.4687507 0.7630222 0.12176819 0.7179690 0.8775521 1
19 0.9015574 0.6592869 0.12693119 0.14671845 0.03819418 0.4395692 0.7420882 0.10293369 0.7047038 0.8435531 1
20 0.8568249 0.6762936 0.18220218 0.10123198 0.04963466 0.5781550 0.6324743 0.06676272 0.6805745 0.8291353 1
21 0.8799152 0.6736554 0.15056617 0.16070673 0.04944037 0.4015415 0.4587438 0.10392791 0.7467060 0.7396137 1
22 0.8730770 0.6663321 0.10802390 0.14481460 0.04448009 0.5177664 0.6682854 0.16747621 0.7161234 0.8309462 1
23 0.9359656 0.7401368 0.16730300 0.11842173 0.03388908 0.4906018 0.5730439 0.15970761 0.7904663 0.8136450 1
24 0.9320397 0.6978085 0.10474803 0.10607080 0.03268366 0.5362214 0.7832729 0.15564091 0.7171350 0.8511477 1
25 0.8444256 0.7516799 0.16767449 0.12025258 0.04426417 0.5040725 0.6950104 0.16010829 0.7026808 0.8800469 1
26 0.8692707 0.7016945 0.10123979 0.09430876 0.04037325 0.4877716 0.7053603 0.09539885 0.8316933 0.8165352 1
27 0.8738410 0.6230674 0.12793232 0.14837137 0.04878595 0.4335648 0.6547601 0.13714725 0.6944921 0.8788708 1
28 0.9041870 0.6201079 0.12490195 0.16227251 0.04812720 0.4845896 0.6619842 0.13093443 0.7415606 0.8479339 1
29 0.8618622 0.7060291 0.09453812 0.14068246 0.04799782 0.5474036 0.6088231 0.23338428 0.6772588 0.7795908 1
30 0.8776350 0.7132561 0.12100425 0.17367148 0.04399987 0.5661632 0.6905305 0.12971867 0.6788903 0.8198201 1
31 0.9134456 0.7249370 0.07144695 0.08759897 0.04864476 0.6682650 0.7445900 0.16374150 0.7322691 0.8071598 1
32 0.8706637 0.6743936 0.15291891 0.11422262 0.04284591 0.5268217 0.7207478 0.14296945 0.7574967 0.8609048 1
33 0.8821504 0.6845216 0.12004074 0.14009196 0.05527732 0.5677475 0.6379840 0.14122421 0.7090634 0.8386022 1
34 0.9061180 0.5989445 0.09160787 0.14325261 0.05142950 0.5399465 0.6718870 0.08454002 0.6709083 0.8264233 1
35 0.8453511 0.6759766 0.13345672 0.16310764 0.05107034 0.4666146 0.7343603 0.12733287 0.7062292 0.8471812 1
36 0.9004188 0.6114532 0.11837118 0.14667433 0.05050403 0.4975502 0.7258132 0.14894363 0.7195090 0.8382364 1
37 0.9051729 0.6652954 0.15153241 0.14571184 0.05026702 0.4855397 0.7226850 0.12179138 0.7430388 0.8342340 1
38 0.9112012 0.6314450 0.12681305 0.16328649 0.04076789 0.5382251 0.7404122 0.13971506 0.6607798 0.8657917 1
39 0.8407927 0.7148585 0.12792107 0.15447060 0.05287096 0.6798039 0.7182050 0.06549068 0.7433669 0.7948445 1
40 0.8554747 0.7356683 0.22698080 0.21692162 0.05365043 0.4496654 0.7353112 0.13341649 0.8032266 0.7883068 1
41 0.8535359 0.5729331 0.14392737 0.16612463 0.04651752 0.5228045 0.7397588 0.09967424 0.7906682 0.8384434 1
42 0.8059968 0.7148594 0.16774123 0.19006840 0.04990847 0.5929818 0.7011064 0.17921090 0.8121909 0.8481069 1
43 0.8856906 0.6987405 0.19262137 0.18327412 0.04816967 0.4340002 0.6569263 0.13724290 0.7600389 0.7788117 1
44 0.8888717 0.6760166 0.17025712 0.21906969 0.04812641 0.4173613 0.7927178 0.17458413 0.6806101 0.8297604 1
45 0.8691575 0.6682723 0.11932277 0.13669098 0.04014911 0.4680455 0.6186511 0.10002737 0.8012731 0.7177891 1
46 0.9148742 0.7797494 0.13313955 0.15166151 0.03934042 0.4818276 0.7484973 0.16354624 0.6979735 0.8164431 1
47 0.9226736 0.7211714 0.08036409 0.10395457 0.04063595 0.4014187 0.8026643 0.17762644 0.7194800 0.8156545 1
I've attempted to implement the algorithm in R, but I keep getting errors.
Attempt:
> train <- beta_values1_updated[training1, ]
> test <- beta_values1_updated[-training1, ]
> labels <- train$response
> ts_label <- test$response
> new_tr <- model.matrix(~.+0,data = train[,-c("response"),with=F])
Error in `[.data.frame`(train, , -c("response"), with = F) :
unused argument (with = F)
> new_ts <- model.matrix(~.+0,data = test[,-c("response"),with=F])
Error in `[.data.frame`(test, , -c("response"), with = F) :
unused argument (with = F)
I am attempting to follow the tutorial here:
https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/
Any insight as to how I could correctly implement the XGBoost algorithm would be greatly appreciated.
Edit:
I'm adding additional code to show the point in the tutorial where I get stuck:
train<-data.table(train)
test<-data.table(test)
new_tr <- model.matrix(~.+0,data = train[,-c("response"),with=F])
new_ts <- model.matrix(~.+0,data = test[,-c("response"),with=F])
#convert factor to numeric
labels <- as.numeric(labels)-1
ts_label <- as.numeric(ts_label)-1
#preparing matrix
dtrain <- xgb.DMatrix(data = new_tr,label = labels)
dtest <- xgb.DMatrix(data = new_ts,label=ts_label)
params <- list(booster = "gbtree", objective = "binary:logistic", eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1)
xgbcv <- xgb.cv( params = params, data = dtrain, nrounds = 100, nfold = 5, showsd = T, stratified = T, print.every.n = 10, early.stop.round = 20, maximize = F)
[1] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 20 rounds.
[11] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[21] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Stopping. Best iteration:
[1] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Warning messages:
1: 'print.every.n' is deprecated.
Use 'print_every_n' instead.
See help("Deprecated") and help("xgboost-deprecated").
2: 'early.stop.round' is deprecated.
Use 'early_stopping_rounds' instead.
See help("Deprecated") and help("xgboost-deprecated").
The author of the tutorial is using the data.table package. As described in the data.table documentation, with = FALSE is used to select (or drop) columns by name, which is why train[,-c("response"),with=F] fails with an "unused argument" error on a plain data.frame. Make sure you have installed and loaded data.table and the other packages the tutorial uses, and that your data set is a data.table object.
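A minimal sketch of the fix (assuming the beta_values1_updated and training1 objects from the question; the as.data.table() conversion is the key step, the rest mirrors the tutorial):
library(data.table)
library(xgboost)
# convert the plain data.frames to data.tables so the with = FALSE syntax works
train <- as.data.table(beta_values1_updated[training1, ])
test <- as.data.table(beta_values1_updated[-training1, ])
new_tr <- model.matrix(~.+0, data = train[, -c("response"), with = FALSE])
new_ts <- model.matrix(~.+0, data = test[, -c("response"), with = FALSE])
# equivalent base-R alternative with no data.table dependency:
# new_tr <- model.matrix(~.+0, data = as.data.frame(train)[, setdiff(names(train), "response")])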
Related
I am currently trying to build a neural network to predict what rank people within the data will achieve.
The rank system is: A, B, C, D, E.
Everything runs smoothly until I get to my confusion matrix, where I get the error "Error: data and reference should be factors with the same levels.". I have tried many different methods from other posts, but none seem to work.
The levels are the same in NNPredictions and test$Rank; I checked them both with table().
library(readxl)
library(caret)
library(neuralnet)
library(forecast)
library(tidyverse)
library(ggplot2)
Indirect <-read_excel("C:/Users/Abdulazizs/Desktop/Projects/Indirect/FIltered Indirect.xlsx",
n_max = 500)
Indirect$Direct_or_Indirect <- NULL
Indirect$parentaccount <- NULL
sum(is.na(Indirect))
counts <- table(Indirect$Rank)
barplot(counts)
summary(counts)
part2 <- createDataPartition(Indirect$Rank, times = 1, p = .8, list = FALSE, groups = min(5, length(Indirect$Rank)))
train <- Indirect[part2, ]
test <- Indirect[-part2, ]
set.seed(1234)
TrainingParameters <- trainControl(method = "repeatedcv", number = 10, repeats=10)
train <- as.data.frame(train)  # assign the result; as.data.frame() alone does not modify train
test <- as.data.frame(test)
NNModel <- train(train[,-7], train$Rank,
method = "nnet",
trControl= TrainingParameters,
preProcess=c("scale","center"),
na.action = na.omit
)
NNPredictions <-predict(NNModel, test, type = "raw")
summary(NNPredictions)
confusionMatrix(NNPredictions, test$Rank)
length(NNPredictions)
[1] 98
length(test$Rank)
[1] 98
table(NNPredictions, test$Rank, useNA="ifany")
NNPredictions A B C D E
A 1 0 0 0 0
B 0 6 0 0 0
C 0 0 11 0 0
D 0 0 0 18 0
E 0 0 0 0 62
Also, change type = "prob" to type = "raw" in the predict() call.
Table1 <- table(NNPredictions, test$Rank, useNA = "ifany")
cnf1 <- confusionMatrix(Table1)
Answer provided by dclarson
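As an additional sketch (not from the accepted answer): explicitly aligning the factor levels of the predictions and the reference also avoids the "same levels" error, assuming the NNPredictions and test objects from the question:
# coerce both vectors to factors sharing one level set
lvls <- sort(union(unique(NNPredictions), unique(test$Rank)))
pred_f <- factor(NNPredictions, levels = lvls)
ref_f <- factor(test$Rank, levels = lvls)
confusionMatrix(pred_f, ref_f)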
I have a data frame like so:
    log.comb  CDEM_TWI  Gruber_Ruggedness      dNBR  TC_Change_Sexton_Rel  Pct_Pima  Sand._15cm
0   8.714914  10.70240           0.626106  0.701591             -27.12220  75.62120     44.6667
1   6.501334  10.65650           1.146360  0.693891             -35.52890  69.30690     41.8333
2   8.946111  13.58910           1.146360  0.513136               7.00000  59.47490     41.8333
3   8.955151   9.85036           1.126980  0.673891              13.81380  66.08800     41.5000
4   7.751379   7.28264           0.000000  0.256136              10.06940  34.31250     39.6667
5   8.895197   8.36555           0.000000  0.506000             -27.61340  35.04750     39.2424
6   8.676571  12.92650           0.000000  0.600627             -44.48400  62.32120     41.6667
7   8.562267  12.76980           0.519255  0.747009             -29.84790  57.14320     43.3333
8   9.052766  11.81580           0.519255  0.808336             -29.00900  57.35020     43.3333
9   9.133744   9.42046           0.484616  0.604891             -18.53550  72.90980     41.0000
10  8.221441   9.53682           0.484616  0.817336             -21.39920  57.61790     38.8333
11  8.398913  12.32050           0.519255  0.814745             -18.12080  57.35020     39.8333
12  7.587468  11.08880           1.274430  0.590282              92.85710  69.30690     47.8333
13  7.983136   8.95073           1.274430  0.316000             -10.34480  69.30690     47.3333
14  9.044404  11.18440           0.698818  0.608600             -14.77000  76.58910     42.8333
15  8.370293  11.96980           0.687634  0.323000              -9.60452  75.62120     45.3333
16  7.938134  12.42380           0.709549  0.374027              36.53140  76.69440     41.7727
17  8.183456  12.73490           1.439180  0.679627             -12.94420  59.47090     37.8333
18  8.322246   9.61600           0.551689  0.642900              37.50000  61.10130     42.8333
19  7.934997   7.77564           0.519255  0.690936             -25.29880  72.67650     38.1818
20  9.049387  11.16000           0.519255  0.789064             -35.73880  57.35020     40.6667
21  8.071323   6.17036           0.432980  0.574355             -22.43590  23.15380     48.0000
22  6.418345   5.98927           0.432980  0.584991               4.34783  17.15050     51.5000
23  7.950516   5.49527           0.422882  0.689009              25.22520   0.00000     47.5000
24  6.355529   7.35982           0.432980  0.419045             -18.81920   6.67001     58.0000
25  8.043683   5.18300           0.763596  0.582555              50.56180  15.18050     54.8333
26  6.013468   5.34018           0.493781  0.241155              -3.01205   5.89344     49.0000
27  7.961675   5.43264           0.493781  0.421527             -21.72290   5.89344     49.1667
28  8.074614  11.94630           0.493781  0.451800              11.61620  13.18900     48.5000
29  8.370570   6.34100           0.492384  0.550127             -12.50000  13.30450     49.0000
I want to run a linear model through 10-fold cross-validation repeated 10 times (100 model fits in total).
In python I do this:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import r2_score

X = df[['CDEM_TWI', 'Gruber_Ruggedness', 'dNBR', 'TC_Change_Sexton_Rel', 'Pct_Pima', 'Sand._15cm']].copy()
y = df[['log.comb']].copy()

all_r2 = []
rskf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=42)
for train_index, test_index in rskf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    lm = LinearRegression(fit_intercept=True)
    lm.fit(X_train, y_train)
    pred = lm.predict(X_test)
    r2 = r2_score(y_test, pred)
    all_r2.append(r2)
avg = np.mean(all_r2)
and here avg returns -0.11
In R I do this:
library(caret)
library(klaR)
train_control <- trainControl(method="repeatedcv", number=10, repeats=10)
model <- train(log.comb~., data=df, trControl=train_control, method="lm")
and model returns:
RMSE Rsquared MAE
0.7868838 0.6132806 0.7047198
I am curious why these results are so inconsistent with each other. I realize the folds differ between the two languages, but since I am repeating the process so many times, I don't understand why the numbers aren't more similar.
I also tried a nested grid search in sklearn like so:
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score

inner_cv = KFold(n_splits=10, shuffle=True, random_state=10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=10)

param_grid = {'fit_intercept': [True, False],
              'normalize': [True, False]}

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=LinearRegression(), param_grid=param_grid, cv=inner_cv)
clf.fit(X, y)
non_nested_score = clf.best_score_

# Pass the GridSearch estimator to cross_val_score
clf = GridSearchCV(estimator=LinearRegression(), param_grid=param_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv).mean()
but the nested_score and non_nested_score are both still negative.
The Python code returns the average of the per-fold results; the R code returns the performance of the best model found.
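One way to see this in R, as a sketch (assuming the caret model object fitted above): caret keeps the per-resample metrics, so you can average them yourself and compare against the single summary it prints.
head(model$resample)                           # one row of RMSE / Rsquared / MAE per fold x repeat
mean(model$resample$Rsquared, na.rm = TRUE)    # the Rsquared caret reports is this resample average
Note also that caret's Rsquared is the squared correlation between predictions and observations, whereas sklearn's r2_score is the coefficient of determination (which can be negative), so the two metrics are not directly comparable even on identical folds.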
I want to use the function createDataPartition to define some training and testing data, but the program always puts the first 2 lines of my table in "training" and all other lines in "testing". I thought changing the percentage from 0.75 to 0.5 would change something, but it didn't.
Table:
t<-read.csv2("test.csv")
> test
Spalte1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
First run with p=.75:
> inTraining <- createDataPartition(a1, p = .75, list = FALSE)
Warning messages:
1: In createDataPartition(a1, p = 0.75, list = FALSE) :
Some classes have no records ( ) and these will be ignored
2: In createDataPartition(a1, p = 0.75, list = FALSE) :
Some classes have a single record ( ) and these will be selected for the sample
> training <- test[ inTraining,]
> testing <- test[-inTraining,]
> traing
Error: object 'traing' not found
> training
[1] 1 2
> testing
[1] 3 4 5 6 7 8 9 10
Second run with p=.5:
> inTraining <- createDataPartition(a1, p = .5, list = FALSE)
Warning messages:
1: In createDataPartition(a1, p = 0.5, list = FALSE) :
Some classes have no records ( ) and these will be ignored
2: In createDataPartition(a1, p = 0.5, list = FALSE) :
Some classes have a single record ( ) and these will be selected for the sample
> training <- test[ inTraining,]
> testing <- test[-inTraining,]
> training
[1] 1 2
> testing
[1] 3 4 5 6 7 8 9 10
What do I have to change to be able to tell the program how many lines I want for testing and how many for training?
If you are just looking for a solution to the apparent problem, you could use the base R function sample.int() to the same end:
# Dummy data
data <- data.frame(Spalte1 = 1:10)
# Simple way to partition without any external package:
p <- 0.75
set.seed(1)
inTraining <- sample.int(n = nrow(data),
size = floor(p * nrow(data)),
replace = FALSE)
train <- data[inTraining, ]
test <- data[-inTraining, ]
train
3 4 5 7 2 8 9
test
1 6 10
Note that here the number of lines in the training data is equal to floor(p * nrow(data)).
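For completeness, a sketch of calling createDataPartition on the outcome column itself (assuming the dummy data frame above; with a numeric outcome caret first bins the values into quantile groups before sampling, which is what the warnings about "classes" refer to):
library(caret)
set.seed(1)
inTraining <- createDataPartition(data$Spalte1, p = 0.75, list = FALSE)
training <- data[inTraining, ]
testing <- data[-inTraining, ]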
I am using h2o version 3.10.4.8.
library(magrittr)
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "6g")
data.url <- "https://raw.githubusercontent.com/DarrenCook/h2o/bk/datasets/"
iris.hex <- paste0(data.url, "iris_wheader.csv") %>%
h2o.importFile(destination_frame = "iris.hex")
y <- "class"
x <- setdiff(names(iris.hex), y)
model.glm <- h2o.glm(x, y, iris.hex, family = "multinomial")
preds <- h2o.predict(model.glm, iris.hex)
h2o.confusionMatrix(model.glm)
h2o.table(preds["predict"])
This is the output of h2o.confusionMatrix(model.glm):
Confusion Matrix: vertical: actual; across: predicted
Iris-setosa Iris-versicolor Iris-virginica Error Rate
Iris-setosa 50 0 0 0.0000 = 0 / 50
Iris-versicolor 0 48 2 0.0400 = 2 / 50
Iris-virginica 0 1 49 0.0200 = 1 / 50
Totals 50 49 51 0.0200 = 3 / 150
Since it says across:predicted, I interpret this to mean that the model made 50 (0 + 48 + 2) predictions that are Iris-versicolor.
This is the output of h2o.table(preds["predict"]):
predict Count
1 Iris-setosa 50
2 Iris-versicolor 49
3 Iris-virginica 51
This tells me that the model made 49 predictions that are Iris-versicolor.
Is the confusion matrix incorrectly labelled or did I make a mistake in interpreting the results?
Row names (vertical) are the actual labels.
Column names (across) are the predicted labels.
You did not make a mistake; the labels are confusing (and causing people to think that the rows and columns were switched). This was fixed recently and will be included in the next release of H2O.
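A quick sketch to verify the orientation yourself (assuming the iris.hex and preds objects above): pull the actual and predicted labels into R and tabulate them, with actuals as rows and predictions as columns.
actual <- as.data.frame(iris.hex[, y])[[1]]     # true class labels
predicted <- as.data.frame(preds)$predict       # model predictions
table(actual, predicted)                        # rows = actual, columns = predicted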
I have the R iris dataset which I am using for a PNN. The 3 species have been recoded as levels 0 to 2 as follows: 0 is setosa, 1 is versicolor, 2 is virginica. The training set is 75% of the data.
Q1. I don't understand the function pred_pnn; if anyone is good in R, perhaps you can explain how it works.
Q2. The output for the test set (the prediction) is shown below. I don't understand the output, because it is supposed to be something close to 0, 1, or 2.
library(pnn)       # provides learn(), smooth(), guess()
library(foreach)   # provides foreach() and %dopar%
data = read.csv("c:/iris-recoded.csv", header = T)
size = nrow(data)
length = ncol(data)
index <- 1:size
positions <- sample(index, trunc(size * 0.75))
training <- data[positions,]
testing <- data[-positions, 1:(length - 1)]   # parentheses needed: 1:length-1 means 0:(length-1)
result = data[-positions,]
result$actual = result[,length]
result$predict = -1
nn1 <- smooth(learn(training), sigma = 0.9)
pred_pnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))   # split the test set into a list, one element per row
  # loop over rows (in parallel if a foreach backend is registered)
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    # guess() returns the predicted category and the class probabilities;
    # only the first probability is kept here, which is why you see raw probabilities
    data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names = NULL)
  }
}
print(pred_pnn(testing, nn1))
prob
1 1.850818e-03
2 9.820653e-03
3 6.798603e-04
4 7.421435e-03
5 2.168817e-03
6 3.277354e-03
7 6.541173e-03
8 1.725332e-04
9 2.081845e-03
10 2.491388e-02
11 7.679823e-03
12 1.291811e-03
13 2.197234e-06
14 1.316366e-03
15 1.421219e-05
16 4.639239e-05
17 3.671907e-04
18 1.460001e-04
19 4.382849e-05
20 2.387543e-05
21 1.011196e-05
22 2.719982e-04
23 4.445472e-04
24 1.281762e-04
25 5.931106e-09
26 9.741870e-08
27 9.236434e-09
28 8.384690e-08
29 3.311667e-07
30 6.045306e-11
31 2.949265e-08
32 2.070014e-10
33 8.043735e-06
34 2.136666e-08
35 5.604398e-08
36 2.455841e-07
37 3.445977e-07
38 7.314647e-07
I'm assuming you're using the pnn package. The documentation for ?guess suggests it does something similar to what predict does for other models; in other words, it predicts which class the observation belongs to. Everything else in the function is bookkeeping. Why do you get only the probabilities? Because the person who wrote the function made it that way, by extracting guess(x)$probabilities and returning only that. If you look at the raw output, you would also find the predicted class tucked away in the $category list element.
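As a sketch of how to return the class instead of just a probability (assuming the nn1 and testing objects from the question, and the $category element described above):
pred_pnn_class <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  foreach(i = xlst, .combine = rbind) %dopar% {
    g <- guess(nn, as.matrix(i))
    data.frame(category = g$category,        # predicted class label
               prob = max(g$probabilities),  # probability of that class
               row.names = NULL)
  }
}
print(pred_pnn_class(testing, nn1))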