Choose optimal subset of size k given certain constraints in R

I have in R a data.table of size 100K rows and 6 columns (let's say x_1, ... x_6).
I am looking for a subset of 1K rows that maximizes a*sum(x_1) + ... + f*sum(x_6), where a, ..., f are given numbers. It does not have to be the true optimum, but it should at least beat random selection or simple sorting.
Any suggestion of using an algorithm/library to solve this problem?
Thank you!
Reproducible Example:
# Creation of synthetic data
set.seed(123)
total <- data.frame(id = 1:1000000, x1 = runif(1000000,0,1), x2 = 60*runif(100000,0,1),
x3 = runif(100000,0,1), x4 = runif(1000000,0,1), Last_interaction = sample(1:35, 1000000, replace= T))
total$x3 <- -total$x2 * total$x3 * runif(100000,0.7,1)
head(total)
# We are looking for a subset of 1000 rows such that
Cost_function <- function(x1,x2,x3,x4)
{
0.2*max(x1) + 0.4*sum(x2) - 0.3*sum(x2) - 0.1*max(x4)
}
# is maximized.
# We rank the dataset by weights in cost function
total <- total[with(total, order(-x2, x3,-x1,-x4)), ]
head(total)
# Want to improve (best choice by just ranking and getting top1000)
result_1 <- total[1:1000,]
# And of course random selection
result_2 <- total[sample(1:nrow(total), 1000,
replace=FALSE),]
# Wanna improve sorting and random selection if possible
Cost_function(result_1$x1,result_1$x2,result_1$x3,result_1$x4)
# [1] 5996.787
# (high value, but improvable)
Cost_function(result_2$x1,result_2$x2,result_2$x3,result_2$x4)
# [1] 3000
# low performance

This is a strange cost function: 0.2*max(x1) + 0.4*sum(x2) - 0.3*sum(x2) - 0.1*max(x4). I don't think the proposed method of calculating Ax and then sorting corresponds to it. Combining max and sum makes the cost function non-separable across rows, so we cannot just sort. The only thing I can come up with is a MIP formulation with one binary variable per row indicating whether that row is selected.
The model is not completely trivial:
See here for details.
For a small data set it does the following:
Note that the LP model given in the other answer (now deleted) is not correct (even for all positive values). Modeling the max correctly for the non-convex case is not trivial.
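As an illustration only, here is one way such a MIP could be written down. This is a sketch under my own assumptions and not necessarily the model referred to above: \delta_i = 1 when row i is selected, \eta_i = 1 for the single selected row whose x1 is counted as the maximum, y_4 is a continuous variable standing for max(x4) over the selected rows, and k is the subset size (1000 in the question).

```latex
\begin{aligned}
\max_{\delta,\,\eta,\,y_4}\quad
  & 0.2\sum_{i} x_{1,i}\,\eta_i
    + (0.4-0.3)\sum_{i} x_{2,i}\,\delta_i
    - 0.1\,y_4 \\
\text{s.t.}\quad
  & \sum_{i}\delta_i = k, \\
  & \sum_{i}\eta_i = 1, \qquad \eta_i \le \delta_i \quad \forall i, \\
  & y_4 \ge x_{4,i} - M_i\,(1-\delta_i) \quad \forall i,
    \qquad M_i = x_{4,i} - \min_j x_{4,j}, \\
  & \delta_i,\ \eta_i \in \{0,1\}.
\end{aligned}
```

The max(x4) term is the easy direction: the objective pushes y_4 down, so lower bounds are enough. The max(x1) term is rewarded by the objective, so an upper bound alone would be unbounded; the extra binaries \eta_i tie it to one selected row, and the solver then has an incentive to pick the selected row with the largest x1.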

Related

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project, using a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what its distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a set of 1000 individuals with a variable x following a Gaussian distribution. Draw 50 random samples of 5 individuals from this set with replacement. Calculate the mean and standard deviation of x for each sample. Create an interaction variable y by multiplying the mean and standard deviation of x for each sample. Plot the distribution of y.
Beginners version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 samples from population (without replacement)
# I assume you want to replace for each turn of taking 5
# If you want to replace between drawing each of the 5,
# I think it should be obvious how to adapt the following code
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, explicit loops that fill an object one row at a time are often slower than vectorized alternatives, so they are best avoided where performance matters. In case of your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
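With only 50 samples the histogram is fairly noisy, so it can help to wrap the steps in a small function and use a larger N to see the shape of the distribution. The following is a minimal sketch under my own assumptions: the helper name sim_y is made up for illustration, and replace = TRUE follows the "with replacement" wording of the question.

```r
set.seed(1)  # only to make the sketch reproducible

# Hypothetical helper: draw N samples of size n from pop and
# return mean * sd for each sample
sim_y <- function(pop, N, n, replace = TRUE) {
  # replicate() returns an n x N matrix: one sample per column
  smpls <- replicate(N, sample(pop, n, replace = replace))
  colMeans(smpls) * apply(smpls, 2, sd)
}

stat_pop <- rnorm(1000, mean = 0, sd = 1)
y <- sim_y(stat_pop, N = 5000, n = 5)

# more samples give a smoother picture of the distribution of y
hist(y, breaks = 50)
```

Setting N back to 50 reproduces the scale of the original task; the larger value is only there to make the shape easier to see.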
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

2-sample independent t-test where each of two columns is in different data frame

I need to run a 2-sample independent t-test, comparing Column1 to Column2. But Column1 is in DataframeA, and Column2 is in DataframeB. How should I do this?
Just in case relevant (feel free to ignore): I am a true beginner. My experience with R so far has been limited to running 2-sample matched t-tests within the same data frame by doing the following:
t.test(response ~ Column1,
data = (Dataframe1 %>%
gather(key = "Column1", value = "response", "Column1", "Column2")),
paired = TRUE)
TL;DR
t_test_result = t.test(DataframeA$Column1, DataframeB$Column2, paired=TRUE)
Explanation
If the data is paired, I assume that both dataframes will have the same number of observations (same number of rows). You can check this with nrow(DataframeA) == nrow(DataframeB).
You can think of each column of a dataframe as a vector (an ordered list of values). The way that you have used t.test is by using a formula (y~x), and you were essentially saying: Given the dataframe specified in data, perform a t test to assess the significance in the difference in means of the variable response between the paired groups in Column1.
Another way of thinking about this is by grabbing the data in data and separating it into two vectors: the vector with observations for the first group of Column1, and the one for the second group. Then, for each vector, you compute the mean and stdev and apply the appropriate formula that will give you the t statistic and hence the p value.
Thus, you can just extract those 2 vectors separately and provide them as arguments to the t.test() function. I hope it was beginner-friendly enough ^^ otherwise let me know
EDIT: a few additions
(I was going to reply in the comments but realized I did not have space hehe)
Regarding what @Ashish did in order to turn it into a Welch's test, I'd say it was setting var.equal = FALSE. The paired parameter controls whether the t-test is run on paired samples or not, and since your data frames have unequal numbers of rows, I suspect the observations are not matched.
As for the Cohen's d effect size, you can check this stats exchange question, from which I copy the code:
For context, m1 and m2 are the groups' means (which you can get with m1 = mean(DataframeA$Column1)), s1 and s2 are the standard deviations (s2 = sd(DataframeB$Column2)) and n1 and n2 the sample sizes (n2 = length(DataframeB$Column2)).
lx <- n1 - 1 # degrees of freedom of group 1 (observations minus 1)
ly <- n2 - 1 # degrees of freedom of group 2 (observations minus 1)
md <- abs(m1-m2) ## mean difference (numerator)
csd <- lx * s1^2 + ly * s2^2
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d
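The same computation can be wrapped in a function that takes the two vectors directly. This is a sketch only: cohens_d is a made-up helper name, and the vectors a and b are invented data standing in for DataframeA$Column1 and DataframeB$Column2.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation
cohens_d <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  # pooled variance, weighted by degrees of freedom
  pooled_var <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
  abs(mean(x) - mean(y)) / sqrt(pooled_var)
}

# made-up data for illustration
a <- c(1, 2, 3, 4, 5)
b <- c(2, 3, 4, 5, 6)

d <- cohens_d(a, b)
d  # both variances are 2.5 and the means differ by 1, so d = 1 / sqrt(2.5)
```

The two vectors do not need to have the same length, which matches the unpaired setting discussed above.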
This should work for you
res = t.test(DataframeA$Column1, DataframeB$Column2, alternative = "two.sided", var.equal = FALSE)

simulation of normal distribution data contaminated with outliers

I need to simulate 1000 sets of normally distributed data (each with 60 subgroups, n=5) using R. Each set is contaminated with 4 outliers (more than 1.5 IQR). Can anyone help?
Thanks in advance
A very simple approach to create a data.frame with a few outliers :
# Create a vector with normally distributed values and a few outliers
# N - Number of random values
# num.out - number of outliers
my.rnorm <- function(N, num.out, mean=0, sd=1){
x <- rnorm(N, mean = mean, sd = sd)
ind <- sample(1:N, num.out, replace=FALSE )
# push the chosen values a further 3*sd away from the mean
dev <- x[ind] - mean
x[ind] <- mean + (abs(dev) + 3*sd) * sign(dev)
x
}
N=60
num.out = 4
df <- data.frame( col1 = my.rnorm(N, num.out),
col2 = my.rnorm(N, num.out),
col3 = my.rnorm(N, num.out),
col4 = my.rnorm(N, num.out),
col5 = my.rnorm(N, num.out))
Please note that I used mean=0 and sd=1, since the values mean=1, sd=0 that you provided in the comments do not make much sense (with sd=0 every generated value equals the mean).
The above approach does not guarantee that there will be exactly 4 outliers. There will be at least 4, but in some rare cases there could be more as rnorm() function does not guarantee that it never produces outliers.
Another note is that data.frames might not be the best objects to store numeric values. If all your 1000 data.frames are numeric, it is better to store them in matrices.
Depending on the final goal and the type of the object you store your data in (list, data.frame or matrix) there are faster ways to create 1000 objects filled with random values.
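One possible way to get all 1000 sets is to store each one as a 60 x 5 matrix in a list. This is only a sketch under my own assumptions: make_col is a self-contained variant of the function above (renamed so the block runs on its own), the 60 x 5 layout mirrors the data.frame shown earlier, and boxplot.stats() is used to see how many values the 1.5 * IQR rule actually flags.

```r
# Self-contained variant of the outlier generator (name is illustrative)
make_col <- function(N, num.out, mean = 0, sd = 1) {
  x <- rnorm(N, mean = mean, sd = sd)
  ind <- sample(N, num.out)
  dev <- x[ind] - mean
  # move the chosen values a further 3 * sd away from the mean
  x[ind] <- mean + (abs(dev) + 3 * sd) * sign(dev)
  x
}

set.seed(42)
n.sets <- 1000

# inner replicate(): 5 columns of 60 values -> a 60 x 5 matrix
# outer replicate(): repeat that n.sets times, kept as a list
sets <- replicate(n.sets,
                  replicate(5, make_col(60, 4)),
                  simplify = FALSE)

# how many values does the 1.5 * IQR rule flag in one column?
n.flagged <- length(boxplot.stats(sets[[1]][, 1])$out)
n.flagged
```

Checking n.flagged for a few columns is a quick way to see how often the count differs from the intended 4.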

MCMC in R Modify Proposal

I've been working with MCMC for population genetics and I have some doubts.
I'm not experienced in statistics and because of that I have difficulty.
I have code to run MCMC, 1000 iterations. I start by creating a matrix with 0's (50 columns = 50 individuals and 1000 lines for 1000 iterations).
Then I create a random vector to substitute the first line of the matrix. This vector has 1's and 2's, representing population 1 or population 2.
I also have genotype frequencies and the genotypes of the 50 individuals.
What I want is to, according to the genotype frequencies and genotypes, determine to what population an individual belongs.
Then, I'll keep changing the population assigned to a random individual and checking if the new value should be accepted.
niter <- 1000
z <- matrix(0,nrow=niter,ncol=ncol(targetinds))
z[1,] <- sample(1:2, size=ncol(z), replace=T)
lhood <- numeric(niter)
lhood[1] <- compute_lhood_K2(targetinds, z[1,], freqPops)
accepted <- 0
priorz <- c(1e-6, 0.999999)
for(i in 2:niter) {
z[i,] <- z[i-1,]
# propose new vector z, by selecting a random individual, proposing a new zi value
selind <- sample(1:nind, size=1)
# proposal probability of selecting individual at random
proposal_ratio_ind <- log(1/nind)-log(1/nind)
# propose a new index for the selected individual
if(z[i,selind]==1) {
z[i,selind] <- 2
} else {
z[i,selind] <- 1
}
# proposal probability of changing the index of individual is 1/2
proposal_ratio_cluster <- log(1/2)-log(1/2)
propratio <- proposal_ratio_ind+proposal_ratio_cluster
# compute f(x_i|z_i*, p)
# the probability of the selected individual given the two clusters
probindcluster <- compute_lhood_ind_K2(targetinds[,selind],freqPops)
# likelihood ratio f(x_i|z_i*,p)/f(x_i|z_i, p)
lhoodratio <- probindcluster[z[i,selind]]-probindcluster[z[i-1,selind]]
# prior ratio pi(z_i*)/pi(z_i)
priorratio <- log(priorz[z[i,selind]])-log(priorz[z[i-1,selind]])
# accept new value according to the MH ratio
mh <- lhoodratio+propratio+priorratio
# reject if the random value is larger than the MH ratio
if(runif(1)>exp(mh)) {
z[i,] <- z[i-1,] # keep the same z
lhood[i] <- lhood[i-1] # keep the same likelihood
} else { # if accepted
lhood[i] <- lhood[i-1]+lhoodratio # update the likelihood
accepted <- accepted+1 # increase the number of accepted
}
}
It is asked that I have to change the proposal probability so that the new proposed values are proportional to the likelihood. This leads to a Gibbs sampling MCMC algorithm, supposedly.
I don't know what to change in the code to do this. I also don't understand very well the concept of proposal probability and how to choose the prior.
Grateful if someone knows how to clarify my doubts.
Your current proposal is done here:
# propose a new index for the selected individual
if(z[i,selind]==1) {
z[i,selind] <- 2
} else {
z[i,selind] <- 1
}
If the individual is assigned to cluster 1, you deterministically propose switching them to cluster 2 (and vice versa).
You didn't show us what freqPops is, but if you want to propose according to freqPops then I believe the above code has to be replaced by
z[i,selind] <- sample(c(1,2),size=1,prob=freqPops)
(at least that is what I understand when you say you want to propose based on the likelihood - however, that statement of yours is unclear).
For this now to be a valid MCMC Gibbs sampling algorithm, you also need to change the next line of code:
proposal_ratio_cluster <- log(freqPops[z[i-1,selind]])-log(freqPops[z[i,selind]])
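To see why proposing proportionally to likelihood times prior turns the update into a Gibbs step, here is a small sketch with made-up numbers. The helper cluster_posterior is hypothetical, and loglik stands in for the two per-cluster log-likelihoods of the selected individual (in the question's code that would presumably be probindcluster, assuming it holds log values, as the subtraction there suggests).

```r
# Hypothetical helper: conditional probability of each cluster for one
# individual, proportional to likelihood * prior
cluster_posterior <- function(loglik, prior) {
  # subtract the max before exponentiating to avoid underflow
  w <- exp(loglik - max(loglik)) * prior
  w / sum(w)
}

# made-up inputs: cluster 2 is three times as likely as cluster 1
loglik <- log(c(1, 3))
prior <- c(0.5, 0.5)
p <- cluster_posterior(loglik, prior)
p  # 0.25 0.75

set.seed(1)
z_old <- 1
# propose the new cluster directly from the conditional distribution
z_new <- sample(1:2, size = 1, prob = p)

# log MH ratio: the target ratio and the proposal ratio cancel exactly
log_target_ratio <- log(p[z_new]) - log(p[z_old])
log_proposal_ratio <- log(p[z_old]) - log(p[z_new])
mh <- log_target_ratio + log_proposal_ratio
mh  # 0, so the proposal is always accepted
```

Because the log MH ratio is always 0, the accept/reject step becomes redundant, which is what distinguishes a Gibbs update from a general Metropolis-Hastings update.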

Plot a table of binomial distributions in R

For a game design issue, I need to better inspect binomial distributions. Using R, I need to build a two dimensional table that - given a fixed parameters 'pool' (the number of dice rolled), 'sides' (the number of sides of the die) has:
In rows --> minimum for a success (ranging from 0 to sides, it's a discrete distribution)
In columns --> number of successes (ranging from 0 to pool)
I know how to calculate it as a single task, but I'm not sure on how to iterate to fill the entire table
EDIT: I forgot to say that I want to calculate the probability p of gaining at least the number of successes.
OK, I think this could be a simple solution. It has the ratio of successes (out of the pool) in rows and the success threshold on the die roll in columns, with probabilities given as percentages.
poolDistribution <- function(n, sides=10, digits=2, roll.Under=FALSE){
  # m: success thresholds on a single die
  m <- 1:sides
  names(m) <- paste(m,ifelse(roll.Under,"-", "+"),sep="")
  # s: number of successes out of n dice
  s <- 1:n
  names(s) <- paste(s,n,sep="/")
  sapply(m, function(m.value) {
    # probability that a single die counts as a success
    p <- if(roll.Under) m.value/sides else (sides - m.value + 1)/sides
    # P(at least s successes), in percent
    round((1 - pbinom(s - 1, n, p))*100, digits=digits)
  })
}
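For the layout asked for in the question (thresholds in rows, number of successes from 0 to pool in columns), a compact alternative is to let outer() build the whole table at once. This is a sketch under my own assumptions: success_table is a made-up name, a die counts as a success when it shows at least the threshold, and the entries are plain probabilities rather than percentages.

```r
# Hypothetical helper: P(at least s successes) for every threshold m
# and every success count s
success_table <- function(pool, sides) {
  thresholds <- 1:sides   # a die succeeds when it shows >= threshold
  successes <- 0:pool
  tab <- outer(thresholds, successes, function(m, s) {
    p <- (sides - m + 1) / sides   # per-die success probability
    # P(X >= s) = P(X > s - 1) for X ~ Binomial(pool, p)
    pbinom(s - 1, size = pool, prob = p, lower.tail = FALSE)
  })
  dimnames(tab) <- list(paste0(thresholds, "+"), as.character(successes))
  tab
}

tab <- success_table(pool = 3, sides = 6)
round(tab, 3)
```

The column for 0 successes is always 1, which is a quick sanity check that the "at least" direction is right.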
This gets you half of the way.
If you are new to R, you might miss out on the fact that a very powerful feature is that you can use a vector of values as an index to another vector. This makes part of the problem trivially easy:
pool <- 3
sides <- 20 # <cough>D&D<cough>
# you need to store the values somewhere, use a vector
NumberOfRollsPerSide <- rep(0, sides)
names(NumberOfRollsPerSide) <- 1:sides # this will be useful in table
## Repeat so long as there are still zeros
## ie, so long as there is a side that has not come up yet
while (any(NumberOfRollsPerSide == 0)) {
# roll once
oneRoll <- sample(1:sides, pool, TRUE)
# add (+1) to each sides' total rolls
# note that you can use the roll outcome to index the vector. R is great.
NumberOfRollsPerSide[oneRoll] <- NumberOfRollsPerSide[oneRoll] + 1
}
# These are your results:
NumberOfRollsPerSide
All you have left to do now is count, for each side, in which roll number it first came up.
