Clustering using categorical and continuous data together - r

I am trying to build an unsupervised model that uses categorical and continuous data together. I think I have worked it out, but is this the correct way to do it?
Load Libraries
library(tidyr)
library(dummies)
library(fastDummies)
library(cluster)
library(dplyr)
Create sample data set
set.seed(3)
sampleData <- data.frame(id = 1:50,
                         gender = sample(c("Male", "Female"), 10, replace = TRUE),
                         age_bracket = sample(c("0-10", "11-30", "31-60", ">60"),
                                              10, replace = TRUE),
                         income = rnorm(10, 40, 10),
                         volume = rnorm(50, 40, 100))
Create sparse matrix and scale
sd1 <- sampleData %>%
  dummy_cols(select_columns = c("gender", "age_bracket")) %>%
  mutate(id = factor(id)) %>%
  select(-c(gender, age_bracket)) %>%
  mutate_if(is.numeric, scale)
glimpse(sd1)
Generate a k-means model using the pam() function with a k = 3
sd2 <- pam(sd1, k =3)
Extract the vector of cluster assignments from the model
sd3 <- sd2$clustering  # full component name; sd2$cluster only works through partial matching
Build the segment_customers dataframe
sd4 <- mutate(sd1, cluster = sd3)
Calculate the size of each cluster
count(sd4, cluster)

Dummy coding of variables is fairly standard, but I am not a fan of it. IMHO it often introduces a large bias and hinders interpretability.
In your case, mutate_if(is.numeric, scale) also standardizes the dummy columns (they are numeric too), which makes this variable bias even worse.
Your text claims to use k-means, but the code uses PAM (k-medoids). These are not the same. PAM is IMHO the better choice here, because of interpretability and the ability to use other metrics such as Manhattan. The resulting cluster "centers" are data points, not means.
I recommend going down to the mathematical level. PAM tries to minimize the sum of distances to the centers. Now put in the distance you use, e.g., Manhattan. Now substitute the standardization and dummy encoding in there, and you get the actual problem your approach tries to solve. Now have a critical look at this (probably quite large) term: is that helpful for your problem, or are you optimizing the wrong function?
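To make the last two points concrete, here is a minimal sketch on made-up data (the toy columns, their names, and k = 3 are assumptions for illustration, not the question's data). It shows that standardizing a dummy column turns a 0/1 mismatch into a contribution of 1/sd to the Manhattan distance, so a rare category weighs more than a balanced one, and that the PAM medoids are actual rows of the input.

```r
library(cluster)

set.seed(3)
n <- 50
# Hypothetical toy data: one balanced dummy, one rare dummy, one continuous column
toy <- data.frame(
  gender_Male = rep(c(1, 0), times = n / 2),
  rare_level  = rep(c(1, 0), times = c(5, n - 5)),
  income      = rnorm(n, 40, 10)
)

scaled <- scale(toy)

# After scaling, a 0/1 mismatch in a dummy column adds 1 / sd to the
# Manhattan distance, so the rare dummy counts more than the balanced one.
mismatch_weight <- 1 / apply(toy[, c("gender_Male", "rare_level")], 2, sd)
print(mismatch_weight)

# PAM (k-medoids) with Manhattan distance
fit <- pam(scaled, k = 3, metric = "manhattan")

# The cluster "centers" are rows of the input, not means
print(fit$id.med)
print(fit$medoids)
print(table(fit$clustering))
```

Whether that implicit weighting is what you want depends on the problem; the point is only that it is part of the function being optimized.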

Related

How to generate a multivariate spline basis in R?

I want to obtain a multivariate spline basis using R. I do not know how to do it properly or the best approach for this. According to my limited research on the Internet, I think that the package that can help me is mgcv and the functions ti and smooth.construct.tensor.smooth.spec but I am not sure.
The structure of my data is simple. I have two vectors, xdata and alphadata (called tau in the code), generated as
n = 200
T = 2
xdata = as.matrix(rnorm(T*n),T*n,1)
tau = seq(-2,2,by=0.1)
tau = as.matrix(tau,length(tau),1)
So basically I have two vectors, xdata and alphadata (tau), of length n*T and 41, respectively. My goal is then to obtain a spline basis (for example a cubic spline) that is a function of both, b(alphadata, xdata).
What I have tried so far is something like this
xdata_data <- data.frame("xdata" = xdata[,1])
tau_data <- data.frame("tau" = tau[,1])
basisobj1 <- ti(tau_data, xdata_data, bs = 'cr', k = c(6, 6), fx = TRUE) #cr:cubic regression splines
xdata_data <- data.frame("xdata_data" = xdata[,1])
tau_data <- data.frame("tau_data" = tau[,1])
basisobj2 <- smooth.construct.tensor.smooth.spec(basisobj1, data = c(tau_data,xdata_data), knots = NULL)
basis <- basisobj2[["X"]]
Note that I manipulated my data, otherwise I get some errors with smooth.construct.tensor.smooth.spec.
My questions are:
(1) Does the approach above do what I want?
(2) Is this a sensible way to do it?
(3) When I do the above, basis has 41 rows. Shouldn't the number of rows equal the product of the lengths of xdata and alphadata, since the basis is a function of both vectors?
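One possible way to get a basis with one row per (tau, xdata) pair is sketched below. It is only a sketch under assumptions: it first builds the full grid with expand.grid() and then lets mgcv's smoothCon() evaluate a te() tensor-product smooth on it. Using te() rather than ti() is a swap made here, since ti() is meant for interactions with the main effects excluded; the name n_periods stands in for the question's T.

```r
library(mgcv)

set.seed(1)
n <- 200
n_periods <- 2                 # called T in the question; renamed so TRUE's shorthand is not masked
xdata <- rnorm(n_periods * n)
tau <- seq(-2, 2, by = 0.1)

# One row per (tau, xdata) combination: 41 * 400 rows
grid <- expand.grid(tau = tau, xdata = xdata)

# Tensor product of two cubic regression spline bases, 6 basis functions each;
# smoothCon() constructs the smooth and evaluates its basis on `grid`
sm <- smoothCon(te(tau, xdata, bs = "cr", k = c(6, 6)), data = grid)[[1]]
basis <- sm$X

print(dim(basis))
```

With this layout the number of rows is the product of the two vector lengths, and the columns are the products of the marginal basis functions.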

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project, using a kind of variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what its distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the gaussian distribution, draw 50 random samples of 5 individuals from this distribution with replacement, calculate the means and standard deviation of x for each sample, create an interaction variable named y which is calculated by multiplying the mean and standard deviation of x for each sample, plot the distribution of y.
Beginners version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 samples from population (without replacement)
# I assume you want to replace for each turn of taking 5
# If you want to replace between drawing each of the 5,
# I think it should be obvious how to adapt the following code
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
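A quick way to convince yourself of this (the seed value is arbitrary and only there so that the three calls draw the same numbers):

```r
set.seed(42)
a <- rnorm(5, mean = 0, sd = 1)  # named arguments

set.seed(42)
b <- rnorm(5, 0, 1)              # same arguments, matched by position

set.seed(42)
d <- rnorm(5)                    # defaults: mean = 0, sd = 1

print(identical(a, b) && identical(b, d))
```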
Somewhat more efficient version
In R, explicit loops that fill an object row by row are often slower than vectorized alternatives, so it is worth avoiding them where that is easy. In the case of your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
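If you did mean sampling with replacement within each draw of 5, a compact variant could look like the sketch below. The seed and the name stats are additions for illustration; everything else follows the steps in the question.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N <- 50  # number of samples
n <- 5   # individuals per sample

# Each replicate draws n individuals with replacement and returns mean and sd;
# replicate() gives a 2 x N matrix, so transpose to one row per sample
stats <- t(replicate(N, {
  smpl <- sample(stat_pop, n, replace = TRUE)
  c(mean = mean(smpl), sd = sd(smpl))
}))

y <- stats[, "mean"] * stats[, "sd"]
hist(y)
```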
General note
Usually, you should do some research on the problem before posting here. Then you either work out the solution yourself, or you can provide an example of what you tried. To this end, you can simply search for each of the steps you outlined (e.g., search for "generate random standard distribution R" to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

How to downsample within recursive feature elimination using caret?

Consider the data frame data created here:
set.seed(123)
num = sample(5:20, replace = T, 20)
id = letters[1:20]
loc <- rep(id, num)
data <- data.frame(Location = loc)
data[paste0('var', seq_along(1:10))] <- rnorm(length(id) * sum(num))
Assume data is my training data; each row represents measurements taken on randomly sampled individuals from populations identified by the grouping variable Location. I want to use recursive feature elimination to identify the best subset of predictors for predicting Location. Analogously, I want to understand how much variation in Location each of the predictors explains (i.e., which ones are most important, and how much more important they are). I have read that this can be done with the caret package using something like this:
library(caret)
subsets <- 1:9
ctrl <- rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 10, verbose = F)
lmProfile <- rfe(data[,2:10], data[,1], sizes = subsets, rfeControl = ctrl)
In my data example, considering the unbalanced number of samples in each Location, I want to use down sampling to ensure that the same number of samples is being considered across the levels of Location upon each iteration. Could someone demonstrate how I might do this?
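As a starting point, the down-sampling step itself can be sketched in base R as below. The helper name down_sample is hypothetical, and the data construction is a simplified stand-in for the example above. caret also ships a downSample() function for the same purpose; applying either one inside each resampling iteration of rfe() would, as far as I can tell, require a custom fit function passed through rfeControl(functions = ...), which is not shown here.

```r
set.seed(123)
num <- sample(5:20, 20, replace = TRUE)
loc <- rep(letters[1:20], num)
vars <- matrix(rnorm(10 * length(loc)), ncol = 10,
               dimnames = list(NULL, paste0("var", 1:10)))
data <- cbind(data.frame(Location = factor(loc)), as.data.frame(vars))

# Hypothetical helper: keep the same number of rows (the size of the
# smallest group) for every level of the grouping variable
down_sample <- function(df, group) {
  idx_by_group <- split(seq_len(nrow(df)), df[[group]])
  n_min <- min(lengths(idx_by_group))
  keep <- unlist(lapply(idx_by_group, function(i) {
    i[sample.int(length(i), n_min)]  # safe even when a group has one row
  }))
  df[keep, , drop = FALSE]
}

balanced <- down_sample(data, "Location")
print(table(balanced$Location))
```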

R - Compare performance of two types while controlling for interaction

I have been programming in R and have a dataset containing the results (success or not) of two Machine Learning algorithms which have been tried out using different amounts of parameters. An example is provided below:
type  success  parameter_amount
a1    0        15639
a1    0        18623
a1    1        19875
a2    1        12513
a2    1        10256
a2    0        12548
I now want to compare both algorithms to see which one has the best overall performance. But there is a catch. It is known that the higher the parameter_amount, the higher the chances of success. Looking at the parameter amounts both algorithms were tested on, one can also see that a1 has been tested with higher parameter amounts than a2. This would make simply counting the number of successes of both algorithms unfair.
What would be a good approach to handle this scenario?
Here is an answer, though I cannot guarantee that it is correct; a more precise answer would need more information about the algorithms and the setup. I also suggest migrating this question to Cross Validated.
Indeed, your question is a statistical one. In statistics we look for parsimony: at a given level of performance we prefer a simpler model to a very complex one, because we are worried about over-fitting: https://statisticsbyjim.com/regression/overfitting-regression-models/.
One way to do what you want is to compare performance as a function of model complexity, as in this toy example:
library(tidyverse)
library(ggplot2)
set.seed(123)
# number of estimation for each models
n <- 1000
performance_1 <- round(runif(n))
complexity_1 <- round(rnorm(n, mean = n, sd = 50))
performance_2 <- round(runif(n, min = 0, max = 0.6))
complexity_2 <- round(rnorm(n, mean = n, sd = 50))
df <- data.frame(performance = c(performance_1, performance_2),
                 complexity = c(complexity_1, complexity_2),
                 models = as.factor(c(rep(1, n), rep(2, n))))
temp <- df %>% group_by(complexity, models) %>% summarise(perf = sum(performance))
ggplot(temp, aes(x = complexity, y = perf, group = models, fill = models)) +
  geom_smooth() +
  theme_classic()
This only works if you have many data points. Complexity, in your case, is the number of parameters fitted. In this toy example, the first model seems better because it performs better at each level of complexity.
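Another common way to control for parameter_amount, which is not the approach of the toy example above, is a logistic regression with both type and parameter_amount as predictors. The sketch below uses simulated data; all numbers in the simulation are assumptions chosen so that a1 is tested at higher parameter amounts while a2 has the higher baseline.

```r
set.seed(123)
n <- 500

# Simulated stand-in for the real data: a1 is tested at higher parameter amounts
sim <- data.frame(
  type = factor(rep(c("a1", "a2"), each = n)),
  parameter_amount = c(rnorm(n, 18000, 2000), rnorm(n, 12000, 2000))
)

# Assumed truth: success gets more likely with parameter_amount,
# and a2 is better than a1 at any given parameter_amount
lin <- -6 + 0.0004 * sim$parameter_amount + 1.5 * (sim$type == "a2")
sim$success <- rbinom(nrow(sim), 1, plogis(lin))

# Raw success rates ignore the difference in parameter amounts
raw_rates <- tapply(sim$success, sim$type, mean)
print(raw_rates)

# The coefficient of type compares the algorithms at equal parameter_amount
fit <- glm(success ~ type + parameter_amount, data = sim, family = binomial)
print(summary(fit)$coefficients)
```

The raw rates and the adjusted type coefficient can point in opposite directions, which is exactly the unfairness described in the question. The estimate is only trustworthy where the parameter amounts of the two algorithms overlap.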

DEA analysis: variables are excluded in analysis?

I’m working on a DEA (Data Envelopment Analysis) to analyze the relative efficiencies of different banks.
The packages I’m using are rDEA and kableExtra.
What this analysis is doing is measuring the relative effect of the input and output variables that I use to examine the efficiency of each individual bank.
The problem is that my code only seems to include two of the four output variables, and I can’t find anywhere in the code where I ask it to do so.
Can anyone identify the problem?
Thank you in advance!
I have tried formatting the data in several different ways, including converting the created "inp_var" and "out_var" to matrices.
#install.packages('rDEA')
#install.packages('dplyr')
#install.packages('kableExtra')
library(kableExtra)
library(rDEA)
library(dplyr)
dea <- as_tibble(PANELDATA)  # tbl_df() is deprecated in current dplyr
head(dea)
inp_var <- select(dea, 'IE', 'NIE')
out_var <- select(dea, 'L', 'D', 'II','NII')
inp_var <- as.matrix(inp_var)
out_var <- as.matrix(out_var)
model <- dea(XREF= inp_var, YREF = out_var, X = inp_var, Y = out_var, model= "output", RTS = "constant")
model
I want a number between 0 and 1 for every observation, where the most efficient one receives a 1. What I get now is the same result whether or not I include the two extra output variables L and II.
L stands for loans to the public and II for interest income, and it would be weird if these variables had no effect on the efficiency of banks.
I think you could type this:
result <- cbind(round(model$thetaOpt, 3), round(model$lambda, 3))
rownames(result)<-dea[[1]]
colnames(result)<-c("Efficiency", rownames(result))
kable(result[,])
