Ensembling Classification Models


I have made some classification models where 1 means it is the same person, and 0 means they are different.
If I print the head of my predictions, it looks like this:
> head(PredictCTree)
[1] 0 0 0 0 0 0
Levels: 0 1
> head(PredictSVM)
1 1.1 1.2 1.3 1.7 1.14
0 0 0 0 0 0
Levels: 0 1
> head(PredictForest)
1.212 1.839 1.906 1.951 1.1011 1.1151
1 1 1 0 1 1
Levels: 0 1
To average them or add them up I have to make them numeric, and this is where I am struggling:
Example:
> PredictForest[1]
1.212
1
Levels: 0 1
Basically I want to add 1 + 0 (for PredictForest and PredictSVM), but the conversion already returns the wrong value:
> as.numeric(PredictForest[1])
[1] 2
so I end up getting this answer:
> as.numeric(PredictForest[1]) + as.numeric(fitted.results[1] + as.numeric(PredictCTree[1] ))
[1] 4
Any suggestions?
My expected output would be:
> as.numeric(PredictForest[1]) + as.numeric(fitted.results[1] + as.numeric(PredictCTree[1] ))
[1] 1
So later on I could divide or give weights in order to test and get the most probable class.
Thank you!

If you convert a factor directly with as.numeric, you get the internal integer code of each level, not the label. To get the labels as numbers, first run as.character, which safely turns the factor into a format that you can run as.numeric on.
test <- as.factor(c(0, 1))
as.numeric(test)
# [1] 1 2
as.numeric(as.character(test))
# [1] 0 1
The R FAQ recommends a different approach for speed:
7.10 How do I convert factors to numeric?
It may happen that when reading numeric data into R (usually, when reading in a file), they come in as factors. If f is such a factor object, you can use
as.numeric(as.character(f))
to get the numbers back. More efficient, but harder to remember, is
as.numeric(levels(f))[as.integer(f)]
In any case, do not call as.numeric() or their likes directly for the task at hand (as as.numeric() or unclass() give the internal codes).
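Applied to the ensembling question, the conversion above could be used roughly as follows. This is a minimal sketch, not the asker's actual data: the three prediction vectors, the helper to_num, and the equal weights are all made up for illustration.

```r
# Hypothetical predictions from three classifiers, stored as factors
pred_tree   <- factor(c(0, 0, 1, 0), levels = c(0, 1))
pred_svm    <- factor(c(0, 1, 1, 0), levels = c(0, 1))
pred_forest <- factor(c(1, 1, 1, 0), levels = c(0, 1))

# Hypothetical helper: convert factor labels (not internal codes) to numbers,
# using the faster form quoted from the R FAQ
to_num <- function(f) as.numeric(levels(f))[as.integer(f)]

# One column of 0/1 votes per model
votes <- cbind(tree   = to_num(pred_tree),
               svm    = to_num(pred_svm),
               forest = to_num(pred_forest))

# Assumed equal weights; change these to favour a model
weights <- c(tree = 1, svm = 1, forest = 1)

# Weighted average vote per observation, then pick the most probable class
score    <- as.vector(votes %*% weights) / sum(weights)
ensemble <- factor(as.integer(score > 0.5), levels = c(0, 1))
print(data.frame(votes, score, ensemble))
```

Because the votes are true 0/1 numbers here, sums and weighted averages behave as expected, which is not the case when as.numeric is applied to the factors directly.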

Related

How to divide my dataset for using PERMANOVA

Hello everyone :) I have a data set of individuals belonging to 5 different species (one column), with their presence/absence in different landscapes (7 other columns).
'data.frame': 1212 obs. of 10 variables:
$ latitude : num 34.5 34.7 34.7 34.8 34.8 ...
$ longitude : num 127 127 127 127 127 ...
$ species : Factor w/ 5 levels "Bufo gargarizans",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Built : int 0 0 0 0 0 0 0 0 0 0 ...
$ Agriculture: int 1 1 0 1 0 0 1 0 0 0 ...
$ Forested : int 0 0 1 0 0 0 0 1 1 1 ...
$ Grassland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Wetland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Bare : int 0 0 0 0 1 0 0 0 0 0 ...
$ Water : int 1 0 0 0 0 1 0 0 0 0 ...
I am trying to use PERMANOVA followed by a Tukey test to see whether the species use the landscape differently. My supervisor did it in SPSS and it worked very well, and now I have to do it in R.
I saw that I need 2 csv files to run PERMANOVA in R, but I have only one. Here is the script that I found on the internet and want to use for my analysis.
library(vegan)
data(dune)
data(dune.env)
# default test by terms
adonis2(dune ~ Management*A1, data = dune.env)
In my case, if I understand correctly, I should have 1 dataframe with the species and 1 dataframe with the environmental variables.
However, my presence/absence values are inside the environmental categories (see the str of my table above). So if I create 1 dataframe with the species only, it will not contain any numerical values.
So I am totally lost and don't know how to proceed. Can someone help me please? Thank you!

I will split my answer into two parts. The one where I know what I am talking about and one where I am brainstorming ;)
Here is the first part on how to split your data into two data.frames
# Load dplyr so that the %>% pipe is available
library(dplyr)
# Set seed
set.seed(1312)
# Some sample data
sample_data <- dplyr::tibble(species=sample(LETTERS[1:5],size=500,replace=T),
Built=sample(c(0,1),size=500,replace=T),
Agriculture=sample(c(0,1),size=500,replace=T),
Forested=sample(c(0,1),size=500,replace=T),
Grassland=sample(c(0,1),size=500,replace=T),
Wetland=sample(c(0,1),size=500,replace=T),
Bare=sample(c(0,1),size=500,replace=T),
Water=sample(c(0,1),size=500,replace=T))
# Split data
df1 <- sample_data %>%
dplyr::select(species)
df2 <- sample_data %>%
dplyr::select(Built:Water)
Now part two: as it says in ?adonis2, the first argument of adonis2 is a formula whose left-hand side must be a community data matrix or a dissimilarity matrix.
Even though I am not sure whether it makes sense, I went wild and followed the instructions :D
df2_dist <- dist(df2)
vegan::adonis2(df2_dist~species, data=df1)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
vegan::adonis2(formula = d2 ~ species, data = d1)
Df SumOfSqs R2 F Pr(>F)
species 4 5.333 0.15802 0.7038 0.88
Residual 15 28.417 0.84198
Total 19 33.750 1.00000
Of course this might be nonsense in terms of content as I took a purely technical approach here, but maybe it helps you to shape your data as required
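For 0/1 presence/absence columns, one possible variation of the steps above is to build the dissimilarity matrix with a binary distance instead of the default Euclidean one. The sketch below is only an illustration under assumptions: the obs data frame is invented, base R's dist(method = "binary") is swapped in for the plain dist() call, and the adonis2 call only runs if vegan is installed.

```r
# Hypothetical presence/absence data: one row per observed individual
obs <- data.frame(
  species     = factor(rep(c("A", "B"), each = 4)),
  Built       = c(1, 0, 0, 1, 0, 0, 1, 0),
  Agriculture = c(0, 1, 1, 0, 0, 1, 0, 0),
  Forested    = c(0, 0, 1, 1, 1, 1, 0, 1)
)

# Split into the grouping variable and the numeric landscape matrix
species_df <- obs["species"]
landscape  <- as.matrix(obs[, c("Built", "Agriculture", "Forested")])

# Guard against rows without any presence before computing binary distances
stopifnot(all(rowSums(landscape) > 0))

# Asymmetric binary distance between rows (base R)
landscape_dist <- dist(landscape, method = "binary")

# Dissimilarity matrix on the left, grouping variable on the right
if (requireNamespace("vegan", quietly = TRUE)) {
  print(vegan::adonis2(landscape_dist ~ species, data = species_df,
                       permutations = 199))
}
```

Whether this distance is appropriate for the real data is a statistical decision that the sketch does not settle; it only shows the shape of the two inputs.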

So I made this code :
setwd("C:/Users/Johan/Downloads/memoire Johanna (1)/memoire Johanna")
xdata=read.csv(file="all10_reduce_focal species1.csv", header=T, sep=",")
str(xdata)
xdata$species <- as.factor(xdata$species)
sample_data <- dplyr::tibble(species=sample(LETTERS[1:5],size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),
Agriculture=sample(c(0,1),size=1212,replace=T),
Forested=sample(c(0,1),size=1212,replace=T),
Grassland=sample(c(0,1),size=1212,replace=T),
Wetland=sample(c(0,1),size=1212,replace=T),
Bare=sample(c(0,1),size=1212,replace=T),
Water=sample(c(0,1),size=1212,replace=T))
library(dplyr)
df1 <- sample_data %>%
dplyr::select(species)
df2 <- sample_data %>%
dplyr::select(Built:Water)
df1_dist <- dist(df1)
vegan::adonis2(df1_dist~Built+Agriculture+Grassland+Forested+Wetland+Bare+Water, data=df2)
Species should be the response, as I am trying to see the effect of the landscape on the species. When I do this I get:
Error in vegdist(as.matrix(lhs), method = method, ...) : input data must be numeric
That is because the "species" variable contains only characters, so I changed it to numeric:
sample_data <- dplyr::tibble(species=sample(c(1:5),size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),......
But the result I get is different from the SPSS result: I don't have any significant variable, whereas in SPSS Built, Agriculture, Forested and Water are significant.
I think my code is wrong.

confusion matrix of bstTree predictions, Error: 'The data must contain some levels that overlap the reference.'

I am trying to train a model using bstTree method and print out the confusion matrix. adverse_effects is my class attribute.
set.seed(1234)
splitIndex <- createDataPartition(attended_num_new_bstTree$adverse_effects, p = .80, list = FALSE, times = 1)
trainSplit <- attended_num_new_bstTree[ splitIndex,]
testSplit <- attended_num_new_bstTree[-splitIndex,]
ctrl <- trainControl(method = "cv", number = 5)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit, method = "bstTree", trControl = ctrl)
predictors <- names(trainSplit)[names(trainSplit) != 'adverse_effects']
pred_bstTree <- predict(model_bstTree$finalModel, testSplit[,predictors])
plot.roc(auc_bstTree)
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
But I get the error 'Error in confusionMatrix.default(pred_bstTree, testSplit$adverse_effects) :
The data must contain some levels that overlap the reference.'
max(pred_bstTree)
[1] 1.03385
min(pred_bstTree)
[1] 1.011738
> unique(trainSplit$adverse_effects)
[1] 0 1
Levels: 0 1
How can I fix this issue?
> head(trainSplit)
type New_missed Therapytypename New_Diesease gender adverse_effects change_in_exposure other_reasons other_medication
5 2 1 14 13 2 0 0 0 0
7 2 0 14 13 2 0 0 0 0
8 2 0 14 13 2 0 0 0 0
9 2 0 14 13 2 1 0 0 0
11 2 1 14 13 2 0 0 0 0
12 2 0 14 13 2 0 0 0 0
uvb_puva_type missed_prev_dose skintypeA skintypeB Age DoseB DoseA
5 5 1 1 1 22 3.000 0
7 5 0 1 1 22 4.320 0
8 5 0 1 1 22 4.752 0
9 5 0 1 1 22 5.000 0
11 5 1 1 1 22 5.000 0
12 5 0 1 1 22 5.000 0

I had a similar problem with this error. I used the confusionMatrix function:
confusionMatrix(actual, predicted, cutoff = 0.5)
And I got the following error: Error in confusionMatrix.default(actual, predicted, cutoff = 0.5) : The data must contain some levels that overlap the reference.
I checked a couple of things:
class(actual) -> numeric
class(predicted) -> integer
unique(actual) -> plenty values, since it is probability
unique(predicted) -> 2 levels: 0 and 1
I concluded that the problem was in the cutoff part of the function, so I applied the cutoff beforehand:
predicted<-ifelse(predicted> 0.5,1,0)
and then ran the confusionMatrix function, which now works just fine:
cm<- confusionMatrix(actual, predicted)
cm$table
which generated correct outcome.
One note on argument order, which differs between packages and affects how you read the table once your code works: caret's confusionMatrix takes the predictions first and the reference (true) classes second, so for caret your call already has the right order:
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
Functions that document an (actual, predicted) signature expect the opposite order:
conf_bstTree= confusionMatrix(testSplit$adverse_effects,pred_bstTree)
Check the documentation of the package you are using before interpreting the confusion matrix.
Hope it helps.

max(pred_bstTree) [1] 1.03385
min(pred_bstTree) [1] 1.011738
and the error tell it all. Plotting a ROC curve simply checks the effect of different threshold points. Predictions are rounded based on the threshold: with a threshold of 0.5, for example, 0.7 is converted to 1 (TRUE class) and 0.3 to 0 (FALSE class). Threshold values lie in the range (0,1).
In your case, regardless of the threshold, every observation ends up in the TRUE class, because even the minimum prediction is greater than 1. (That's why @phiver was wondering whether you are doing regression instead of classification.) With no zero among the predictions, no level in the predictions coincides with the zero level in adverse_effects, hence this error.
PS: It will be difficult to tell the root cause of the error without you posting your data.
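The thresholding idea discussed in this thread can be sketched in base R as follows. The scores and true classes are invented for illustration, and the table is built with table() rather than a package function; the key step is turning the numeric predictions into a factor with the same levels as the reference.

```r
# Hypothetical true classes and numeric prediction scores in (0, 1)
actual <- factor(c(0, 0, 1, 1, 0, 1), levels = c(0, 1))
scores <- c(0.10, 0.40, 0.80, 0.65, 0.55, 0.30)

# Apply the cutoff and reuse the reference levels, so both factors overlap
cutoff    <- 0.5
predicted <- factor(ifelse(scores > cutoff, 1, 0), levels = levels(actual))

# Rows are predictions, columns are the true classes
cm <- table(Prediction = predicted, Reference = actual)
print(cm)
```

With both vectors stored as factors sharing the levels 0 and 1, caret's confusionMatrix(predicted, actual) should also accept them, since it takes the predictions as its first argument and the reference as its second.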

R predict - How to format output

I use the predict function to predict the results based on a model.
What I get is a vector of the predicted classes. I want to retrieve the same results but instead of the form
1 class_1
2 class_1
3 class_4
4 class_2
I want to have the results in the form
class_1 class_2 class_3 class_4
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 0 1 0 0
I have tried passing type=class and type=response but the results are the same.
I am completely new to R and I am still trying to find my way around R's documentation but I think that this is something trivial that I should be able to figure out, though I am pretty stuck.

After viewing the docs on predict.randomForest at
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
It appears the valid choices for type are response, prob, or votes.
Using the code below, I was able to reproduce your format, but using probabilities.
> predict(model, x, type='prob')
0 1
1 1.000 0.000
2 0.180 0.820
3 0.138 0.862
attr(,"class")
To obtain booleans, another option is to one-hot encode the response results:
# Predict once and reuse the result for every level
pred <- predict(model, x, type='response')
result_classes <- list()
for (level in levels(y)) {
  result_classes[[level]] <- pred == level
}
data.frame(result_classes)
Result:
X0 X1
1 TRUE FALSE
2 FALSE TRUE
3 FALSE TRUE
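To get the 0/1 layout asked for in the question instead of TRUE/FALSE, the same comparison can be converted to integers. This is a small base R sketch; the pred vector is a made-up stand-in for the output of predict(model, x, type='response').

```r
# Hypothetical predicted classes, as a factor with all four levels declared
pred <- factor(c("class_1", "class_1", "class_4", "class_2"),
               levels = paste0("class_", 1:4))

# One column per level: 1 where the prediction equals that level, else 0
one_hot <- sapply(levels(pred), function(lv) as.integer(pred == lv))
rownames(one_hot) <- seq_along(pred)
print(one_hot)
```

Declaring the levels explicitly keeps a column for class_3 even though no observation was predicted as that class.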

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
For G.test(x) to work when x is a matrix, it must have 2 columns of numbers, with 1 row per population. If I use the apply() function I can run G.test on each row, provided my data set contains only two columns of numbers. I want to look only at Treated and Control, for example, but I'm not sure how to omit the other columns so that G.test ignores them and the headers. I've tried the following but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this, when you compare two numbers.
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question: is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df and p-values in a table to be summed, rather than printing them as above for each row?
Any help is hugely appreciated.

You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).
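For the second part of the question (collecting G, df and p-values in a table and summing them per treatment), a rough base R sketch follows. The helper g_row is hypothetical: it computes a G-test of goodness of fit against equal expected proportions by hand, as a stand-in for RVAideMemoire::G.test, so its numbers are not guaranteed to match that package. The bio data frame copies only the Trt, Treated and Control values of the six rows printed in the question.

```r
# Hypothetical helper: G-test against equal expected proportions for one row
g_row <- function(counts) {
  expected <- rep(sum(counts) / length(counts), length(counts))
  # Zero counts contribute nothing to the statistic
  terms <- ifelse(counts > 0, counts * log(counts / expected), 0)
  G  <- 2 * sum(terms)
  df <- length(counts) - 1
  c(G = G, df = df, p.value = pchisq(G, df, lower.tail = FALSE))
}

# Treated/Control counts from the head(bio) output in the question
bio <- data.frame(
  Trt     = c("citol", "cital", "gerol", "mix", "cital", "cella"),
  Treated = c(1, 1, 0, 0, 0, 0),
  Control = c(3, 5, 3, 5, 5, 5)
)

# One row of G, df and p-value per original row
per_row <- t(apply(bio[, c("Treated", "Control")], 1, g_row))
results <- cbind(bio["Trt"], per_row)
print(results)

# Sum G and df within each treatment
totals <- aggregate(cbind(G, df) ~ Trt, data = results, FUN = sum)
print(totals)
```

The summed G can then be compared against a chi-squared distribution with the summed df, for example with pchisq(totals$G, totals$df, lower.tail = FALSE).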

using graph.adjacency() in R

I have a sample code in R as follows:
library(igraph)
rm(list=ls())
dat=read.csv(file.choose(),header=TRUE,row.names=1,check.names=T) # read .csv file
m=as.matrix(dat)
net=graph.adjacency(adjmatrix=m,mode="undirected",weighted=TRUE,diag=FALSE)
where I used csv file as input which contain following data:
23732 23778 23824 23871 58009 58098 58256
23732 0 8 0 1 0 10 0
23778 8 0 1 15 0 1 0
23824 0 1 0 0 0 0 0
23871 1 15 0 0 1 5 0
58009 0 0 0 1 0 7 0
58098 10 1 0 5 7 0 1
58256 0 0 0 0 0 1 0
After this I used following command to check weight values:
E(net)$weight
Expected output is somewhat like this:
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
But I'm getting weird values (and every time different):
> E(net)$weight
[1] 2.121996e-314 2.121996e-313 1.697597e-313 1.291034e-57 1.273197e-312 5.092790e-313 2.121996e-314 2.121996e-314 6.320627e-316 2.121996e-314 1.273197e-312 2.121996e-313
[13] 8.026755e-316 9.734900e-72 1.273197e-312 8.027076e-316 6.320491e-316 8.190221e-316 5.092790e-313 1.968065e-62 6.358638e-316
I'm unable to find where and what I am doing wrong?
Please help me get the expected result, and please explain why the output is so weird and different every time I run it.
Thanks,
Nitin

Just a small working example below, much clearer than CSV input.
library('igraph');
adjm1<-matrix(sample(0:1,100,replace=TRUE,prob=c(0.9,0.1)),nc=10);
g1<-graph.adjacency(adjm1);
plot(g1)
P.s. ?graph.adjacency has a lot of good examples (remember to run library('igraph')).
Related threads
Creating co-occurrence matrix
Co-occurrence matrix using SAC?

The problem seems to be due to the data type of the matrix elements. graph.adjacency expects elements of type numeric. Not sure if it's a bug.
After you do,
m <- as.matrix(dat)
set its mode to numeric by:
mode(m) <- "numeric"
And then do:
net <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
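The type difference behind this fix can be checked in base R without igraph. The sketch below uses a made-up three-node excerpt of the CSV, passed inline through the text argument: a data frame of whole numbers becomes an integer matrix, and the conversion turns it into a double matrix while leaving the values unchanged.

```r
# Hypothetical inline excerpt of the adjacency CSV
csv_text <- "id,23732,23778,23824
23732,0,8,0
23778,8,0,1
23824,0,1,0"

dat <- read.csv(text = csv_text, header = TRUE, row.names = 1,
                check.names = TRUE)
m <- as.matrix(dat)

# Whole-number columns are stored as integers after read.csv
type_before   <- typeof(m)
values_before <- as.vector(m)

# Same effect as mode(m) <- "numeric": convert the storage type to double
storage.mode(m) <- "double"
type_after <- typeof(m)

print(c(before = type_before, after = type_after))
```

That the garbage weights came from integer storage being read as doubles is an interpretation consistent with the answer above, not something this sketch proves.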
