R - Compare performance of two types while controlling for interaction

I have been programming in R and have a dataset containing the results (success or not) of two machine learning algorithms, each tried out with different parameter amounts. An example is provided below:
type  success  parameter_amount
a1    0        15639
a1    0        18623
a1    1        19875
a2    1        12513
a2    1        10256
a2    0        12548
I now want to compare both algorithms to see which one has the better overall performance. But there is a catch: it is known that the higher the parameter_amount, the higher the chance of success. Looking at the parameter amounts both algorithms were tested on, one can also notice that a1 was tested with higher parameter amounts than a2. This would make simply counting the number of successes of each algorithm unfair.
What would be a good approach to handle this scenario?

I will give you an answer, but without any guarantee that what I am telling you is true; for anything more precise you should give more information about the algorithms and the setup. I would also propose migrating this question to Cross Validated, since it is really a statistical question.
In statistics we look for sparsity: at a given level of performance we prefer a simpler model over a very complex one, because we worry about overfitting: https://statisticsbyjim.com/regression/overfitting-regression-models/.
One way to do what you want is to compare performance with respect to the complexity of the model, as in this toy example:
library(tidyverse) # loads ggplot2 and dplyr
set.seed(123)

# number of estimations for each model
n <- 1000
performance_1 <- round(runif(n))
complexity_1  <- round(rnorm(n, mean = n, sd = 50))
performance_2 <- round(runif(n, min = 0, max = 0.6))
complexity_2  <- round(rnorm(n, mean = n, sd = 50))

df <- data.frame(performance = c(performance_1, performance_2),
                 complexity  = c(complexity_1, complexity_2),
                 models      = as.factor(c(rep(1, n), rep(2, n))))

# total successes at each observed complexity level, per model
temp <- df %>%
  group_by(complexity, models) %>%
  summarise(perf = sum(performance))

ggplot(temp, aes(x = complexity, y = perf, group = models, fill = models)) +
  geom_smooth() +
  theme_classic()
This only works if you have many data points. Complexity, for you, is the number of parameters fitted. In this toy example the first model seems better, because it is better at every level of complexity.
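Another option, offered here as a hedged sketch rather than as part of the answer above: fit a logistic regression with parameter_amount as a covariate, so that the coefficient on type compares the two algorithms at equal parameter amounts. The data frame below simply re-types the question's example rows.
# hedged sketch: logistic regression controlling for parameter_amount
dat <- data.frame(
  type = c("a1", "a1", "a1", "a2", "a2", "a2"),
  success = c(0, 0, 1, 1, 1, 0),
  parameter_amount = c(15639, 18623, 19875, 12513, 10256, 12548)
)
fit <- glm(success ~ type + parameter_amount, data = dat, family = binomial)
summary(fit) # the 'typea2' coefficient is the a2 vs a1 log-odds gap at a
             # fixed parameter_amount (needs far more rows in practice)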

Related

Binary outcome, different trial #s across low N?

I have a sample of 4 individuals, all who have a varying number of trials (I work with a special population so what I get is what I get!)
The outcome is a binary yes/no
I want to know:
did the total sample select yes more often than chance?
did each individual select yes more often than chance?
Here is dummy data in R.
SbjEL <- data.frame(Sbj = 'EL',
                    TrialNum = 1:12,
                    Choice = c(0,0,1,1,1,1,1,1,1,1,1, NA))
SbjKZ <- data.frame(Sbj = 'KZ',
                    TrialNum = 1:12,
                    Choice = c(0,1,1,1,1,1,1,1,1,1,1, 1))
SbjMA <- data.frame(Sbj = 'MA',
                    TrialNum = 1:12,
                    Choice = c(0,0,1,1,1,1,1,1,1,1,1, 1))
SbjTC <- data.frame(Sbj = 'TC', # was 'EL'; presumably a copy-paste slip
                    TrialNum = 1:12,
                    Choice = c(1,1,1,1,1,1,1,1, NA,NA,NA, NA))
For a different experiment with the same sample I had more trials, so I did a one-sample t-test for the group and used the binomial distribution to see what number of Yes trials would be higher than chance.
# Did group select YES more than chance? --> 43 yes/48
Response_v <- c(21,22)
t.test(Response_v, mu = 12, alternative = "two.sided")
# How many YES selections would be more often than chance?
# 24 trials were completed --> 21 yes / 24
binom.test(21, 24, 1/2)
My issue is this starts to fall apart when I get down to 8-12 trials.
Any ideas? I am lost
A t-test is not appropriate here for either Q1 or Q2. With large samples you can use some approximations, but your counts are very small. So you're on the right track with the binomial test, but not the t-test.
For your Q1: you first ought to decide how the subjects are assumed to relate to each other. Are you pretty confident that each is providing an estimate of the same Bernoulli probability p? Or, a priori, do you want to allow the possibility that subjects have different p's? There are further questions to answer, overlapping with those you need to consider for Q2.
For your Q2: the exact method of choice depends on a number of things. For example, do you want to incorporate prior information (e.g. using historical data as a reference)? If not, there are purely frequentist methods to use off the shelf. Next, do you expect the yes/no's to be independent, or are they more like a 'signal' in which the order matters? Third, is it possible that there is a mixture of Bernoullis for any of the subjects? And so on. These questions can be considered through software such as that found at www.datatrie.com/advisor.
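For what it's worth, a minimal sketch of the binomial route on the dummy data above, assuming independent trials within each subject (NA trials dropped; the pooled test additionally assumes a common p across subjects, which is exactly the Q1 decision described above):
choices <- list(
  EL = c(0,0,1,1,1,1,1,1,1,1,1),   # final NA trial dropped
  KZ = c(0,1,1,1,1,1,1,1,1,1,1,1),
  MA = c(0,0,1,1,1,1,1,1,1,1,1,1),
  TC = c(1,1,1,1,1,1,1,1)          # four NA trials dropped
)
# Q2: exact binomial test per subject against chance (p = 0.5)
lapply(choices, function(ch) binom.test(sum(ch), length(ch), p = 0.5))
# Q1: pooled across subjects, defensible only if a common p is plausible
binom.test(sum(unlist(choices)), length(unlist(choices)), p = 0.5)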

DEA analysis: variables are excluded in analysis?

I'm working on a DEA (Data Envelopment Analysis) to analyze the relative efficiencies of different banks.
The packages I'm using are rDEA and kableExtra.
What this analysis is doing is measuring the relative effect of the input and output variables I use to examine the efficiency of each individual bank.
The problem is that my code only includes two out of four output variables, and I can't find anywhere in the code where I ask it to do so.
Can some of you identify the problem?
Thank you in advance!
I have tried formatting the data in several different ways, and assigning the created "inp_var" and "out_var" as matrices.
#install.packages('rDEA')
#install.packages('dplyr')
#install.packages('kableExtra')
library(kableExtra)
library(rDEA)
library(dplyr)

dea <- tbl_df(PANELDATA) # tbl_df() is deprecated; as_tibble() is the modern equivalent
head(dea)

# two inputs, four outputs
inp_var <- select(dea, 'IE', 'NIE')
out_var <- select(dea, 'L', 'D', 'II', 'NII')
inp_var <- as.matrix(inp_var)
out_var <- as.matrix(out_var)

model <- dea(XREF = inp_var, YREF = out_var, X = inp_var, Y = out_var,
             model = "output", RTS = "constant")
model
I want a number between 0 and 1 for every observation, where the most efficient one receives a 1. What I get now is the same result whether or not I include the two extra output variables L and II.
L stands for loans to the public and II for interest income, and it would be weird if these variables had NO effect on the efficiency of banks.
I think you could type this:
result <- cbind(round(model$thetaOpt, 3), round(model$lambda, 3))
rownames(result) <- dea[[1]]
colnames(result) <- c("Efficiency", rownames(result))
kable(result)
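One further check, offered as a hedged sketch built on the objects above: refit the model with and without L and II and compare the scores directly. In DEA, an output that never receives positive weight in any bank's optimal solution leaves the efficiency scores unchanged, so identical thetaOpt values can be a legitimate outcome rather than a coding error.
out_full    <- as.matrix(select(dea, 'L', 'D', 'II', 'NII'))
out_reduced <- as.matrix(select(dea, 'D', 'NII'))
m_full <- dea(XREF = inp_var, YREF = out_full, X = inp_var, Y = out_full,
              model = "output", RTS = "constant")
m_red  <- dea(XREF = inp_var, YREF = out_reduced, X = inp_var, Y = out_reduced,
              model = "output", RTS = "constant")
# TRUE here would mean L and II are never binding for these banks
all.equal(m_full$thetaOpt, m_red$thetaOpt)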

Clustering using categorical and continuous data together

I am trying to create an unsupervised model with categorical and continuous data together. I think I have worked it out, but is this the correct way to do it?
Load Libraries
library(fastDummies) # provides dummy_cols(); 'tidyr' and 'dummies' were loaded but unused
library(cluster)
library(dplyr)
create sample data set
set.seed(3)
n <- 50 # the original mixed lengths 10 and 50 and relied on recycling
sampleData <- data.frame(id = 1:n,
                         gender = sample(c("Male", "Female"), n, replace = TRUE),
                         age_bracket = sample(c("0-10", "11-30", "31-60", ">60"),
                                              n, replace = TRUE),
                         income = rnorm(n, 40, 10),
                         volume = rnorm(n, 40, 100))
Create sparse matrix and scale
sd1 <- sampleData %>%
  dummy_cols(select_columns = c("gender", "age_bracket")) %>%
  select(-c(id, gender, age_bracket)) %>%       # drop id too: pam() needs all-numeric input
  mutate_if(is.numeric, ~ as.numeric(scale(.)))
glimpse(sd1)
Generate a k-means model using the pam() function with a k = 3
sd2 <- pam(sd1, k = 3)
Extract the vector of cluster assignments from the model
sd3 <- sd2$cluster
Build the segment_customers dataframe
sd4 <- mutate(sd1, cluster = sd3)
Calculate the size of each cluster
count(sd4, cluster)
Dummy coding of variables is fairly standard, but I am not a fan of it. In many cases it causes, IMHO, large bias and hinders interpretability.
In your case you are additionally standardizing the dummy variables, which makes the variable bias even worse.
Your text claims to use k-means, but the code uses PAM; these are not the same. PAM is IMHO the better choice here, because of interpretability and the ability to use other metrics such as Manhattan: the resulting cluster "centers" are actual data points, not means.
I recommend going down to the mathematical level. PAM tries to minimize the sum of distances to the centers. Now put in the distance you use, e.g., Manhattan. Now substitute the standardization and dummy encoding in there, and you get the actual problem your approach tries to solve. Now have a critical look at this (probably quite large) term: is that helpful for your problem, or are you optimizing the wrong function?
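As a hedged sketch of the direction this points in (my own illustration using the question's sampleData, not code from the answer): skip the dummy coding and standardization entirely, and let a mixed-data dissimilarity such as Gower feed pam() directly.
library(cluster)
library(dplyr)
mixed <- sampleData %>%
  select(-id) %>%                                # id is a label, not a feature
  mutate(across(where(is.character), as.factor)) # daisy() expects factors for nominal columns
gower <- daisy(mixed, metric = "gower")          # handles factors and numerics natively
fit <- pam(gower, k = 3, diss = TRUE)            # medoids are actual observations
count(mutate(sampleData, cluster = fit$clustering), cluster)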

Data perturbation - How to perform it?

I am doing some projects related to statistics simulation in R, based on "Introduction to Scientific Programming and Simulation Using R". In the student projects section (chapter 24) I am working on "The pipe spiders of Brunswick" problem, but I am stuck on one part of an evolutionary algorithm, where you need to perform some data perturbation according to the sentence below:
"With probability 0.5 each element of the vector is perturbed, independently
of the others, by an amount normally distributed with mean 0 and standard
deviation 0.1"
What does being "perturbed" really mean here? I don't know which operation I should be doing to my vector to make this perturbation happen, and I'm not finding any answers to this problem.
Thanks in advance!
# using the most important features, we create an ML model:
m1 <- lm(PREDICTED_VALUE ~ PREDICTER_1 + PREDICTER_2 + PREDICTER_N)
#summary(m1)
#anova(m1)

# after creating the model, we perturb as follows
# (note: the 'perturb' package is a collinearity diagnostic, which may
# not be the kind of perturbation the book exercise has in mind):
#install.packages("perturb") # install the package
library(perturb)
set.seed(1234) # for the same results each time you run the code
p1_new <- perturb(m1, pvars = c("PREDICTER_1", "PREDICTER_N"),
                  prange = c(1, 1), niter = 200)
# you can change the number of iterations to any value n;
# the total number of iterations will then be n + 1
p1_new # check the values of p1_new
summary(p1_new)
Perturbing just means adding a small, noisy shift to a number. Your code might look something like this:
x <- sample(10, 10)
ind <- rbinom(length(x), 1, 0.5) == 1      # select each element with probability 0.5
x[ind] <- x[ind] + rnorm(sum(ind), 0, 0.1) # add N(0, 0.1) noise to the selected elements
rbinom picks the elements to be modified with probability 0.5 and rnorm adds the perturbation.

Visualization on Cluster for Mixed Data

So, I'm working with fuzzy clustering for mixed data, and I want to visualize the clustering result.
Here is my data
> head(x)
x1 x2 x3 x4
A C 8.461373 27.62996
B C 10.962334 27.22474
A C 9.452127 27.57246
B D 8.196687 27.29332
A D 8.961367 26.72793
B C 8.009029 27.97227
I followed the steps from https://www.r-bloggers.com/clustering-mixed-data-types-in-r/
library(cluster) # daisy()
library(Rtsne)
library(ggplot2)

gower_dist <- daisy(x, metric = "gower")
                    # type = list(logratio = 1))
tsne_obj <- Rtsne(gower_dist, dims = 2, is_distance = TRUE)
tsne_data <- data.frame(tsne_obj$Y, factor(g1$clusters)) # g1 holds the fuzzy clustering result
colnames(tsne_data)[3] <- "cluster"
ggplot(aes(x = X1, y = X2), data = tsne_data) +
  geom_point(aes(color = cluster))
Based on the website, the first step transforms the data using the Gower distance (I guess), and then applies R-tsne.
So my question is:
Is it good to use Rtsne for mixed data (as representative points)? I have my doubts: with the Gower distance in the first step, it feels like forcing your categorical data to be numeric.
But one thing that amazed me is that my method always gives a better result than the classic method, based on the plot. So it is important for me to understand this better: can I use the plot as a tool to measure the goodness of a clustering result? Based on the plots it is not difficult to determine which method is better; I give the plot images below, and I am really impressed by them.
[Plot: Classic Method]
[Plot: My Method]
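One hedged note on the last question: t-SNE distorts distances in order to produce a 2-D map, so visual separation in these plots is weak evidence that one clustering is better than another. An internal index computed on the original dissimilarities is a fairer yardstick; below is a minimal sketch, assuming gower_dist and g1$clusters from the code above.
library(cluster)
sil <- silhouette(as.integer(factor(g1$clusters)), gower_dist)
summary(sil)$avg.width # closer to 1 means better-separated clusters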
