Generate random distribution - math

I need to generate the following random distribution:
200 persons and 250 entities of types A and B:
75% of persons have entities of type A
10% of persons have entities of type B
15% of persons have 2 entities of type B
Is it possible to generate such a random distribution?

Perhaps what you meant to say is this: there are three groups of persons.
Group 1: 75% of persons associate only with type A.
Group 2: 10% of persons associate only with type B.
Group 3: 15% of persons associate with both types.
Total number of type A: 200 * (0.75 + 0.15) = 180
Total number of type B: 200 * (0.10 + 0.15) = 50
That makes 230 entities of type A and type B in total, not the 250 stated.
Anyway, let’s forget about the number of entities for now.
import numpy as np

# two-dimensional indicator features per person: [has type A, has type B]
persons = np.zeros((200, 2))
persons[:180, 0] = 1   # persons 0-179 get type A (groups 1 and 3: 75% + 15%)
persons[150:, 1] = 1   # persons 150-199 get type B (groups 2 and 3: 10% + 15%)
                       # rows 150-179 overlap: the 15% associated with both types
np.random.shuffle(persons)
If the distribution is fixed, you can essentially just shuffle the person IDs.
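For reference, here is a minimal sketch of the same construction in R (the matrix layout and column names are my own choices):
# indicator columns for "has type A" and "has type B"
persons <- matrix(0, nrow = 200, ncol = 2, dimnames = list(NULL, c("A", "B")))
persons[1:180, "A"]   <- 1   # groups 1 and 3: 75% + 15% of 200 persons
persons[151:200, "B"] <- 1   # groups 2 and 3: 10% + 15% of 200 persons
persons <- persons[sample(nrow(persons)), ]   # shuffle the person order
colSums(persons)   # 180 type-A and 50 type-B associations, 230 in total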

Related

Random Serial Dictatorship algorithm vs linear programming in solving an assignment problem

I just came across this post, which left me quite baffled.
If I had to assign N objects to N users, knowing that each user has expressed a rank preference for the objects, with the goal to achieve the lowest possible average rank (that is what the post states), my automatic choice would be linear programming.
Instead, this Random Serial Dictatorship is suggested, and it is stated that it achieves some form of optimality.
I was not sure how what looked to me like a greedy stochastic algorithm could guarantee optimality, so I tried it out.
Suppose that 3 users A, B, C have expressed rank preferences for 3 houses H1, H2, H3:
      user
house  A  B  C
  H1   3  2  1
  H2   1  1  2
  H3   2  3  3
We want to assign a house to each user, so that the average (or sum) of ranks is minimal.
If I understood it correctly, the Random Serial Dictatorship algorithm requires choosing a random order for the users and then letting each of them, in turn, select the house they prefer among those still available.
For me it's clear that this kind of strategy may (1) not produce the same sum of ranks each time, and (2) consequently not guarantee optimality.
Imagine that A chooses first, B second and C third. A will select H2 (rank 1). But that's also B's preferred house, so B will have to go for H1 (rank 2), and C will be left with their least preferred house H3 (rank 3). Ranks = 1, 2, 3. Sum = 6, average = 2.
If, on the other hand, B chooses first, A second and C third: B = H2 (1), A = H3 (2), C = H1 (1). Ranks = 1, 2, 1. Sum = 4, average = 4/3.
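Since there are only 3! = 6 possible priority orders, one can also check them all directly. A minimal sketch (the prefs matrix just encodes the rank table above):
prefs <- matrix(c(3, 1, 2, 2, 1, 3, 1, 2, 3), nrow = 3,
                dimnames = list(c("H1", "H2", "H3"), c("A", "B", "C")))
orders <- list(c("A","B","C"), c("A","C","B"), c("B","A","C"),
               c("B","C","A"), c("C","A","B"), c("C","B","A"))
sums <- sapply(orders, function(ord) {
  taken <- character(0); total <- 0
  for (u in ord) {
    avail <- setdiff(rownames(prefs), taken)     # houses still on the market
    pick  <- avail[which.min(prefs[avail, u])]   # dictator takes their best available house
    taken <- c(taken, pick)
    total <- total + prefs[pick, u]
  }
  total
})
sums   # the sum of ranks ranges from 4 to 6 depending on the order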
As an R simulation:
# 3 users A, B, C want to buy a house each, chosen from H1, H2, H3.
# Their preferences are expressed by 'rank' (1 = first choice, 2 = second choice, etc).
d0 <- data.frame("user"  = rep(c("A", "B", "C"), each = 3),
                 "house" = rep(c("H1", "H2", "H3"), 3),
                 "rank"  = c(3, 1, 2, 2, 1, 3, 1, 2, 3))

# 1. Assignment by random serial dictatorship
set.seed(232425)
all_ranks <- numeric(0)
for (i in 1:1000) {
  d <- d0
  # Create a random order of priority for the users.
  o <- setNames(sample(1:3), c("A", "B", "C"))
  # Let users choose their preferred house in turn, according to the created order.
  d["order"] <- o[d$user]
  d <- d[order(d$order, d$rank), ]
  for (j in 1:2) {
    h <- d[j, "house"]
    d <- rbind(d[1:j, ], d[(d$order > j) & (d$house != h), ])
  }
  ranks <- d$rank
  ranks <- ranks[order(d$user)]
  all_ranks <- rbind(all_ranks, ranks)
  #print(d)
}
all_ranks <- setNames(as.data.frame(all_ranks), c("A", "B", "C"))
all_ranks_summary <- cbind("ID" = 1, setNames(stack(all_ranks), c("rank", "user")))
all_ranks_summary <- aggregate(ID ~ user + rank, all_ranks_summary, length)
barplot(ID ~ rank + user, all_ranks_summary, beside = TRUE, col = 2:4, legend.text = TRUE)
boxplot(rowMeans(all_ranks), main = "average rank")
Frequency of each rank for the 3 users and the average rank:
Clearly the average rank is not minimal in all cases.
Using a linear programming assignment method instead:
# 2. Assignment by linear programming
require(lpSolve)
cm <- xtabs(rank ~ house + user, d0)
lp.out <- lp.assign(cm)
lp.out$solution * cm
yields a guaranteed optimal solution:
      user
house  A  B  C
  H1   0  0  1
  H2   0  1  0
  H3   2  0  0
My questions are:
1. Am I misunderstanding the Random Serial Dictatorship algorithm? Can it actually be written in a way that guarantees optimality?
2. Is a linear programming assignment method almost as bad, computationally, as the brute force enumeration of all combinatorial possibilities?
Perhaps I am wrong, but I'm just puzzled that one should start from a goal like:
"assign each user an option so that the average rank of the assigned option in that user's ranked list is minimized across all users"
and then settle for an algorithm that, as 'fair' and as 'fast' as it may be, does not seem to achieve that goal at all.
Unless I am completely missing the point, which is possible.
It's "Pareto optimal", which is to say, not optimal in general. Pareto optimal just means that you would need to harm one or more of the participants to improve the objective.

Calculate 'Ranking' based on 'weights' - what's the formula, given different ranges of values

Given a list of cars with their top speed, MPG and cost, I want to rank them, with speed given a 'weight' of 50%, MPG a weight of 30%, and cost a weight of 20%.
The faster the car, the better. The higher the MPG, the better. The lower the cost, the better.
What math formula can I use to rank the cars in order, based on these criteria?
So, given this list, how can I rank the cars when the range of each criterion is different?
CAR SPEED MPG COST
A 135 20 50,000
B 150 15 60,000
C 170 18 80,000
D 120 30 40,000
A more general term for your problem is 'Multi-Criteria Decision Analysis', a well-studied subject, and you will be able to find different models for different use cases.
Let's take a simple model for your case, where we will create a score based on weights and calculate it for each car:
import pandas as pd

data = pd.DataFrame({
    'CAR': ['A', 'B', 'C', 'D'],
    'SPEED': [135, 150, 170, 120],
    'MPG': [20, 15, 18, 30],
    'COST': [50000, 60000, 80000, 40000]
})

def Score(df):
    # weighted sum of the raw values: 50% speed, 30% MPG, 20% cost
    return 0.5*df['SPEED'] + 0.3*df['MPG'] + 0.2*df['COST']

data['SCORE'] = data.apply(Score, axis=1)
data = data.sort_values(by=['SCORE'], ascending=False)
print(data)
This would give us:
CAR SPEED MPG COST SCORE
2 C 170 18 80000 16090.4
1 B 150 15 60000 12079.5
0 A 135 20 50000 10073.5
3 D 120 30 40000 8069.0
As you can see, the Score function simply multiplies each value by its weight and sums them, producing a new value by which we order the items.
The important consideration here is whether you are happy with the formula used in Score. You can change it however you want, for whatever purpose you are building your model.
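Note that with raw values the COST term dominates the score simply because of its scale, and a higher cost raises the score even though you said lower cost is better. One common refinement in MCDA, sketched below in R purely as an illustration, is to rescale each criterion to 0-1 with min-max normalisation and flip COST before applying the 50/30/20 weights:
cars <- data.frame(CAR   = c("A", "B", "C", "D"),
                   SPEED = c(135, 150, 170, 120),
                   MPG   = c(20, 15, 18, 30),
                   COST  = c(50000, 60000, 80000, 40000))
rescale <- function(x) (x - min(x)) / (max(x) - min(x))    # map each criterion to [0, 1]
cars$SCORE <- 0.5 * rescale(cars$SPEED) +
              0.3 * rescale(cars$MPG) +
              0.2 * (1 - rescale(cars$COST))               # lower cost -> higher score
cars[order(-cars$SCORE), ]                                 # rank from best to worst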

Does this problem have an "exact" solution?

I am working with R.
Suppose you have the following data:
#generate data
set.seed(123)
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,10)
c1 = rnorm(1000,5,1)
train_data = data.frame(a1,b1,c1)
#view data
a1 b1 c1
1 94.39524 90.04201 4.488396
2 97.69823 89.60045 5.236938
3 115.58708 99.82020 4.458411
4 100.70508 98.67825 6.219228
5 101.29288 74.50657 5.174136
6 117.15065 110.40573 4.384732
We can visualize the data as follows:
#visualize data
par(mfrow=c(2,2))
plot(train_data$a1, train_data$b1, col = train_data$c1, main = "plot of a1 vs b1, points colored by c1")
hist(train_data$a1)
hist(train_data$b1)
hist(train_data$c1)
Here is the problem:
From the data, take only variables "a1" and "b1": using only 2 "logical conditions", split this data into 3 regions (e.g. Region 1 WHERE 0 < a1 < 20 AND 0 < b1 < 25).
In each region, you want the "average value of c1" within that region to be as small as possible, but each region must contain at least some minimum number of data points, e.g. 100 data points (to prevent trivial solutions).
Goal: is it possible to determine the "boundaries" of these 3 regions that minimize:
the mean value of "c1" for region 1
the mean value of "c1" for region 2
the mean value of "c1" for region 3
the average "mean value of c1 for all 3 regions" (i.e. c_avg = (region1_c1_avg + region2_c1_avg + region3_c1_avg) / 3)
In the end, for a given combination, you would find the following (made-up numbers):
Region 1: WHERE 0 < a1 < 20 AND 0 < b1 < 25; region1_c1_avg = 4
Region 2: WHERE 20 < a1 < 50 AND 25 < b1 < 60; region2_c1_avg = 2.9
Region 3: WHERE a1 > 50 AND b1 > 60; region3_c1_avg = 1.9
c_avg = (4 + 2.9 + 1.9) / 3 = 2.93
And hope that (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg) are minimized
My Question:
Does this kind of problem have an "exact solution"? The only thing I can think of is performing a "random search" that considers many different definitions of (Region 1, Region 2 and Region 3) and compares the corresponding values of (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg), until a minimum value is found. Is this an application of linear programming or of multi-objective optimization (e.g. a genetic algorithm)? Has anyone worked on something like this before?
I have done a lot of research and haven't found a similar problem to this. I decided to formulate it as a "multi-objective constrained optimization problem", and figured out how to implement algorithms like "random search" and "genetic algorithm".
Thanks
Note 1: In the context of multi-objective optimization, to judge collectively whether the values of (region1_c1_avg, region2_c1_avg, region3_c1_avg and c_avg) produced by a given set of definitions of (Region1, Region2 and Region3) are satisfactory, the concept of "Pareto optimality" (https://en.wikipedia.org/wiki/Multi-objective_optimization#Visualization_of_the_Pareto_front) is often used to compare different candidate sets of {(Region1, Region2, Region3) and their (region1_c1_avg, region2_c1_avg, region3_c1_avg, c_avg)}.
Note 2: Ultimately, these 3 regions can be defined by any set of 4 numbers. If each of these 4 numbers can be between 0 and 100 in 0.1 increments (e.g. 12, 12.1, 12.2, 12.3, etc.), there are 1000^4 = 1e12 possible solutions (roughly one trillion) to compare. There are simply far too many solutions to verify and compare individually. I am thinking that a mathematically based search/optimization procedure could be used to search strategically for an optimal solution.
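As a purely illustrative sketch of the random-search idea mentioned above, assuming the two "logical conditions" are a single cut on a1 and a single cut on b1 (one possible reading; the four-number version generalises the same way), and taking c_avg as the single objective:
set.seed(123)
a1 <- rnorm(1000, 100, 10); b1 <- rnorm(1000, 100, 10); c1 <- rnorm(1000, 5, 1)
train_data <- data.frame(a1, b1, c1)

# scalar objective: the average of the three regional means of c1,
# with Inf returned when any region has fewer than min_n points
objective <- function(cut_a, cut_b, min_n = 100) {
  r1 <- train_data$a1 <  cut_a
  r2 <- train_data$a1 >= cut_a & train_data$b1 <  cut_b
  r3 <- train_data$a1 >= cut_a & train_data$b1 >= cut_b
  if (sum(r1) < min_n || sum(r2) < min_n || sum(r3) < min_n) return(Inf)
  mean(c(mean(train_data$c1[r1]), mean(train_data$c1[r2]), mean(train_data$c1[r3])))
}

best <- list(value = Inf)
for (k in 1:5000) {
  cut_a <- runif(1, 80, 120); cut_b <- runif(1, 80, 120)
  val <- objective(cut_a, cut_b)
  if (val < best$value) best <- list(value = val, cut_a = cut_a, cut_b = cut_b)
}
best   # best cuts found and the corresponding average of the three region means
A genetic algorithm or a grid search would slot into the same objective function.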

Plotting a mixed effects model with three fixed effects

I am trying to find a way to visualize a mixed effects model for a project I am working on, but am unsure how to do this when using multiple fixed and random effects.
The project I am working on is an attempt to estimate the helpfulness of online reviews, based on several different factors. A sample of the data looks like this:
Participant  Product Type  Star Rating  Anonymous  Product  Helpfulness
1            Exp           Extr         Yes        12       8
1            Search        Extr         Yes        6        6
1            Search        Mid          Yes        13       7
...
30           Exp           Mid          No         11       2
30           Exp           Mid          No         14       4
30           Search        Extr         No         9        5
The data is significantly longer than this (30 participants, who each saw roughly two dozen reviews, resulting in approx. 700 entries). Each participant sees a mix of products, product types, and star ratings, but all of the reviews they see will either be anonymous or not anonymous (no mix).
As a result, I tried to fit a maximal mixed model, with the following:
mixed(helpfulness ~ product_type * star_rating * anonymity
      + (product_type * star_rating | participant)
      + (star_rating * anonymity | product))
What I would like to do now is to find a way of visually representing the data, likely color-coding the 8 different "groups" (essentially, the unique combinations of the 3 binary independent variables: 2 product types * 2 star-rating types * 2 anonymity conditions), to show how they relate to the helpfulness rating.
Try something like this:
library(ggplot2)

# make notional data
df <- data.frame(participant  = seq(1, 30, 1),
                 product_type = sample(x = c('Exp', 'Search'), size = 30, replace = T),
                 star_rating  = sample(x = c('Extr', 'Mid'), size = 30, replace = T),
                 anonymous    = sample(x = c('Yes', 'No'), size = 30, replace = T),
                 product      = rnorm(n = 30, mean = 10, sd = 5),
                 helpfullness = rnorm(n = 30, mean = 5, sd = 3))

ggplot(df) +
  geom_col(aes(x = participant, y = helpfullness, fill = product, color = anonymous)) +
  facet_grid(product_type ~ star_rating)
This captures all six variables in your data. You can also work with alpha and other aesthetics to include the variables. Using alpha might be better if you do not want to use a facet_grid. If you include alpha as an aesthetic, I would recommend using facet_wrap instead of facet_grid.
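If the goal is to show the 8 condition cells directly rather than individual reviews, one simple option (a sketch re-using the notional df built above, not the fitted model) is to plot the mean helpfulness per combination:
# mean helpfulness for every product_type x star_rating x anonymous combination
cell_means <- aggregate(helpfullness ~ product_type + star_rating + anonymous,
                        data = df, FUN = mean)
ggplot(cell_means, aes(x = anonymous, y = helpfullness, fill = star_rating)) +
  geom_col(position = "dodge") +
  facet_wrap(~ product_type)
For the fitted model itself, packages such as emmeans or effects can produce estimated marginal means per cell, which can be plotted in the same way.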

NaNs produced and system is exactly singular error while using mnlogit R package

I am currently working on a behavior modelling project that involves estimating a multinomial logit model. After searching the internet I came across the mnlogit package, which seems very suitable for my purpose.
The problem I am trying to model can be described as follows: a customer is offered 5 products, from which he can pick 1 or decide not to pick any. These products differ in price and delivery time. The prices and delivery times of these products are fixed across all customers.
So, a customer can pick from 6 alternatives: 1, 2, 3, 4, 5 and 0. Alternative 1 represents product 1, while alternative 0 represents the option of not picking any product. Products 1 and 2 cost $1, products 3 and 4 cost $2, and product 5 also costs $1. Alternative 0, on the other hand, costs $0.
Raw Data
In order to simulate the customers' decisions I self-generated 7 parameters. I defined 'Price' as an alternative-independent (generic) variable, meaning that every alternative's price has the same weight in the product's utility. In addition, I defined 'Alternative' as an alternative-specific variable, which yields another 6 parameters. My goal was to simulate the attractiveness of a product due to its delivery time, since each alternative has a fixed delivery time. I calculated the utility of a product using the following expression:
product_utility = (B_alternative[alternativeNum] * alternativeNum) + (B_price * productPrice)
where B_alternative is the vector of alternative parameters [0, 0.6, 0.5, 0.45, 0.3, 0.3], each index of this vector representing one alternative number (B_alternative[0] is the parameter for alternative 0),
and B_price is my price parameter: -0.5.
So the utilities I calculated are 0.00, 0.10, 0.50, 0.35, 0.20 and 1.00, the first number being the utility of alternative 0 and the last that of product 5.
After calculating these utilities, I calculated the probability of a customer choosing the nth-product with the following expression:
Pn = exp(Un) / sum(exp(U))
where sum(exp(U)) is the sum of the exponentiated utilities of all alternatives.
The calculated probabilities (which add up to 1) were 0.1097376, 0.1212788, 0.1809268, 0.1557251, 0.1340338 and 0.2982978, for alternatives 0 to 5 respectively.
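As a quick sanity check, a few lines of R (a sketch, with the price vector taken from the product description above) reproduce these numbers:
B_alt   <- c(0, 0.6, 0.5, 0.45, 0.3, 0.3)   # parameters for alternatives 0 to 5
B_price <- -0.5
alt     <- 0:5
price   <- c(0, 1, 1, 2, 2, 1)              # cost of alternatives 0, 1, 2, 3, 4, 5
U <- B_alt[alt + 1] * alt + B_price * price # 0.00 0.10 0.50 0.35 0.20 1.00
P <- exp(U) / sum(exp(U))                   # multinomial logit choice probabilities
round(P, 7)                                 # 0.1097376 0.1212788 ... 0.2982978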
Using these probabilities and a random function, I generated a 'Mode' column in my table, representing the customer choice:
Data with choice column
Finally, following the documentation I found on CRAN, I made this code to estimate the model:
artificialData <- read.csv(PathToData, sep = ";")

# define model description (formula)
fm <- formula(MODE ~ PRICE - 1 | 1 | ALT)

# define the mlogit data
TestData <- mlogit::mlogit.data(artificialData,
                                choice = "MODE", shape = "long",
                                alt.levels = c(1, 2, 3, 4, 5, 0),
                                id.var = "CUSTOMER_ID")

# estimate the MNL model
fit <- mnlogit::mnlogit(fm, TestData)
print(summary(fit))
However, no matter what parameters I set, I always get one of these two error messages:
Error in solve.default(hessian, gradient, tol = 1e-24) : Lapack routine dgesv: system is exactly singular: U[7,7] = 0
or
In sqrt(diag(vcov(object))) : NaNs produced
