I want to simulate a sequence of random non-negative integer values in R.
However, those values should not follow any particular probability distribution; they could be empirically distributed.
How do I go about doing it?
You will need a distribution; there is no alternative, philosophically. There's no such thing as a "random number," only numbers randomly distributed according to some distribution.
To sample from an empirical distribution stored as my_dist, you can use sample():
my_dist <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)  # first 10 Fibonacci numbers
sample(my_dist, 100, replace = TRUE)  # draw 100 numbers from my_dist with replacement
Or, for some uniformly-distributed numbers between (for instance) 1 and 10, you could do:
sample(1:10, 100, replace = TRUE)
There are, of course, specific distributions implemented as functions in base R and various packages, but I'll avoid those since you said you weren't interested in them.
Editing per Rui's good suggestion: If you want non-uniform variables, you can specify the prob parameter:
sample(1:3, 100, replace = TRUE, prob = c(6, 3, 1))
# draws a 1 with 60% probability, a 2 with 30% probability, and a 3 with 10% probability
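A quick sanity check in base R: with a large enough sample, the empirical proportions land near the requested weights (the exact numbers will vary slightly with the seed):

```r
set.seed(42)
draws <- sample(1:3, 1e5, replace = TRUE, prob = c(6, 3, 1))
round(prop.table(table(draws)), 2)  # proportions close to 0.60, 0.30, 0.10
```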
I have been trying to do unsupervised feature selection using LASSO (by removing class column). The dataset includes categorical (factor) and continuous (numeric) variables. Here is the link. I built a design matrix using model.matrix() which creates dummy variables for each level of the categorical variables.
library(openxlsx)  # for read.xlsx()

dataset <- read.xlsx("./hepatitis.data.xlsx", sheet = "hepatitis", na.strings = "")
names_df <- names(dataset)
formula_LASSO <- as.formula(paste("~ 0 +", paste(names_df, collapse = " + ")))
LASSO_df <- model.matrix(
  object = formula_LASSO,
  data = dataset,
  contrasts.arg = lapply(dataset[, sapply(dataset, is.factor)], contrasts, contrasts = FALSE)
)
### Group LASSO using gglasso package
library(gglasso)

gglasso_group <- c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 15, 16, 17, 17)
fit <- gglasso(x = LASSO_df, y = y_k, group = gglasso_group, loss = "ls", intercept = FALSE, nlambda = 100)
# Cross validation
fit.cv <- cv.gglasso(x = LASSO_df, y = y_k, group = gglasso_group, nfolds = 10)
# Best lambda
best_lambda_fit.cv <- fit.cv$lambda.1se
# Final coefficients of variables
coefs <- coef(object = fit, s = best_lambda_fit.cv)
### Group LASSO with grpreg package
library(grpreg)

group_lasso <- grpreg(X = LASSO_df, y = y_k, group = gglasso_group, penalty = "grLasso")
plot(group_lasso)
cv_group_lasso <- cv.grpreg(X = LASSO_df, y = y_k, group = gglasso_group, penalty = "grLasso", se = "quick")
# Best lambda
best_lambda_group_lasso <- cv_group_lasso$lambda.min
coef_mat_group_lasso <- as.matrix(coef(cv_group_lasso))
If you check coefs and coef_mat_group_lasso, you will realize that they are not the same. Also, the best lambda values are not the same. I am not sure which one to choose for feature selection.
Any idea how to remove the intercept in the grpreg() function? Setting intercept = FALSE is not working.
Any help is appreciated. Thanks in advance.
Please refer to the gglasso paper and the grpreg paper.
Different objective functions. On page 175 of the grpreg paper, the authors perform a step called group standardization, which normalizes the feature matrix within each group by right-multiplying it by an orthonormal matrix and a non-negative diagonal matrix. After the group-lasso step with group standardization, the estimated coefficients are left-multiplied by the same matrices to recover the coefficients of the original linear model. As a result, however, the group-lasso penalty is no longer equivalent to the one used without group standardization; see page 175 for the detailed discussion.
Different algorithms. grpreg uses block coordinate descent, while gglasso uses an algorithm called groupwise-majorization-descent. Small numerical differences are to be expected when the algorithms differ.
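As a rough illustration of the group standardization step, here is a hypothetical base-R sketch (the function name and the use of SVD are my assumptions, following the description on p. 175): each group's columns are replaced by an orthonormal basis scaled by sqrt(n), which rescales the within-group coefficients and therefore the group-lasso penalty applied to them.

```r
# Hypothetical sketch of group standardization: decompose each group's block
# X_g = U D V' and keep U * sqrt(n), so that t(X_g) %*% X_g = n * I within the group.
group_standardize <- function(X, group) {
  n <- nrow(X)
  for (g in unique(group)) {
    cols <- which(group == g)
    decomp <- svd(X[, cols, drop = FALSE])
    X[, cols] <- decomp$u * sqrt(n)  # orthonormal columns, rescaled
  }
  X
}
```

Fitting the group lasso on a matrix transformed this way and then mapping the coefficients back is what makes the two packages' penalties (and hence their selected lambdas and coefficients) differ.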
I have a vector of numbers, and I would like to sample a number which is between a given position in the vector and its neighbors such that the two closest neighbors have the largest impact, and this impact is decreasing according to the distance from the reference point.
For example, let's say I have the following vector:
vec <- c(15, 16, 18, 21, 24, 30, 31)
and my reference is the number 16 in position #2. I would like to sample a number which will be with a high probability between 15 and 16 or (with the same high probability) between 16 and 18. The sampled numbers can be floats. Then, with a decreasing probability to sample a number between 16 and 21, and with a yet lower probability between 16 and 24, and so on.
The position of the reference is not known in advance, it can be anywhere in the vector.
I tried playing with runif and quantiles, but I'm not sure how to design the scores of the neighbors.
Specifically, I wrote the following function but I suspect there might be a better/more efficient way of doing this:
GenerateNumbers <- function(Ind, N) {
  dist <- 1 / abs(Ind - 1:length(N))
  dist <- dist[!is.infinite(dist)]
  dist <- dist / sum(dist)
  sum(dist)  # sanity check --> 1
  V <- numeric(length(N) - 1)
  for (i in 1:(length(N) - 1)) {
    V[i] <- runif(1, N[i], N[i + 1])
  }
  sample(V, 1, prob = dist)
}
where Ind is the position of the reference number (16 in this case) and N is the vector. dist weights the sampling probabilities so that closer neighbors have a higher impact.
Improvements upon this code would be highly appreciated!
I would go with a truncated Gaussian random sample generator, such as in the truncnorm package. On your example:
# To install it: install.packages("truncnorm")
library(truncnorm)
vec <- c(15, 16, 18, 21, 24, 30, 31)
x <- rtruncnorm(n=100, a=vec[1], b=vec[7], mean=vec[2], sd=1)
The histogram of the generated sample fulfills the given prerequisites.
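Since the question notes that the reference position is not known in advance, a small wrapper can recenter the truncated Gaussian at any position; sample_near and its defaults are my own hypothetical additions, not part of truncnorm:

```r
library(truncnorm)

# Hypothetical wrapper: truncate to the range of vec, center at vec[ind].
sample_near <- function(vec, ind, n = 100, sd = 1) {
  rtruncnorm(n, a = min(vec), b = max(vec), mean = vec[ind], sd = sd)
}

vec <- c(15, 16, 18, 21, 24, 30, 31)
x <- sample_near(vec, ind = 2)  # samples concentrated around 16
```

A larger sd spreads more probability toward the distant neighbors.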
I'm trying to draw samples from a runif(100,900,1100) population. Now I want to draw 25 samples of size n = 5 from this population with replacement, but it seems that sample() outputs only scalar samples. What is the best approach for this?
This gives you a 5×25 matrix (each column corresponds to one sample) filled with numbers generated from a uniform distribution:
matrix(runif(5 * 25, 900, 1100), nrow = 5, ncol = 25)
Or, if you instead want to first generate the population runif(100, 900, 1100) and then draw 25 samples from the resulting vector:
pop <- runif(100, 900, 1100)
sapply(1:25, function(x) sample(pop, 5, replace = TRUE))
(Note that calling runif() inside sapply() would regenerate the population for every sample, which is not the same thing.)
I have a two large datasets in R, one of actual measurements and one of the predictions I made for these measurements. I found that the trends of my predictions were accurate, but the amplitude was off. I am wondering if there is a way to find a constant in R that, when the predictions are multiplied by the constant, minimizes the error between the actuals and the predictions.
For example:
predictions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
actuals <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
The constant I would want to generate in this case would be 2.
I have looked into using the optim() function, but get the warning message that "one-dimensional optimization by Nelder-Mead is unreliable: use 'Brent' or optimize() directly."
f <- function(p) cor(p * predictions, actuals)
optim(
  c(1),
  f,
  control = list(fnscale = -1)
)
I am not familiar with optimization, so it is probable that I am approaching this problem the wrong way. I appreciate the help!
First let's define an error function to minimize:
MultError <- function(constant, predictions, actuals) {
  return(sum((constant * predictions - actuals)^2))
}
This is the sum of squared errors; you could use a different error function if you prefer.
optimize() expects a function, an interval to search over (which you can get by inspecting the min and max of predictions / actuals), and any extra parameters. It minimizes by default.
optimize(MultError, interval=c(0, 5), predictions=predictions, actuals=actuals)
This returns
$minimum
[1] 2
$objective
[1] 0
Which is the value of the minimum and the value of the error function, respectively.
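For this particular error function there is also a closed form: setting the derivative of sum((c * predictions - actuals)^2) with respect to c to zero gives c = sum(predictions * actuals) / sum(predictions^2), which you can check against optimize():

```r
predictions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
actuals <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
sum(predictions * actuals) / sum(predictions^2)
# [1] 2
```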
Presumably, your match is not perfect, so I also tried it with artificial noise
set.seed(1)
actuals <- rnorm(length(predictions), 2, 0.4) * predictions
Then it returns
$minimum
[1] 2.087324
$objective
[1] 22.21434
Pretty good!
EDIT:
I answered this question using optimize() because of the title and the direction the OP had gone, but on reflection it might be overkill. What's wrong with simply taking mean(actuals / predictions)?
So I decided to test them both...
set.seed(1)
arithmetic <- opt <- numeric(10000)
for (trial in 1:10000) {
  actuals <- rnorm(length(predictions), 2, 0.4) * predictions
  arithmetic[trial] <- mean(actuals / predictions)
  opt[trial] <- optimize(MultError, interval = c(0, 5),
                         predictions = predictions, actuals = actuals)$minimum
}
For 10,000 possible datasets, we've recovered the constant using the average and by minimizing sum of squared errors. What are the mean and variance of our estimators?
> mean(arithmetic)
[1] 1.999102
> mean(opt)
[1] 1.998695
Both do pretty well on average.
> var(arithmetic)
[1] 0.0159136
> var(opt)
[1] 0.02724814
The arithmetic mean estimator has a tighter spread, however. So I would argue that you should just take the average!
You might get a pretty good approximation using linear regression, via the lm() function.
m <- lm(actuals ~ predictions)
m is the object where the fitted linear regression model is stored.
coef(m) will give you the constant to multiply by, plus an intercept (offset).
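If you want a pure scaling constant with no offset, you can force the regression through the origin; on the example data:

```r
predictions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
actuals <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
m <- lm(actuals ~ 0 + predictions)  # "0 +" drops the intercept term
coef(m)
# predictions
#           2
```

For a no-intercept model, the fitted coefficient is exactly the least-squares constant sum(predictions * actuals) / sum(predictions^2).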
This excerpt from the CRAN documentation for the adagio function knapsack() behaves as expected: it solves the knapsack problem with profit vector p, weight vector w, and capacity cap, selecting the subset of elements with maximum total profit subject to the constraint that the total weight of the selected elements does not exceed the capacity.
library(adagio)
p <- c(15, 100, 90, 60, 40, 15, 10, 1)
w <- c( 2, 20, 20, 30, 40, 30, 60, 10)
cap <- 102
(is <- knapsack(w, p, cap))
How can I add a vector length constraint to the solution and still get an optimal answer? For example, the above exercise, but the selected subset must include exactly three elements.
One approach would be to explicitly model the problem as a mixed integer linear programming problem; the advantage of explicitly modeling it in this way is that linear constraints like "pick exactly three objects" are simple to model. Here is an example with the lpSolve package in R, where each element in the knapsack problem is represented by a binary variable in a mixed integer linear programming formulation. The requirement that we select exactly three elements is captured by the constraint requiring the decision variables to sum to exactly 3.
library(lpSolve)
p <- c(15, 100, 90, 60, 40, 15, 10, 1)
w <- c( 2, 20, 20, 30, 40, 30, 60, 10)
cap <- 102
exact.num.elt <- 3
mod <- lp(direction = "max",
          objective.in = p,
          const.mat = rbind(w, rep(1, length(p))),
          const.dir = c("<=", "="),
          const.rhs = c(cap, exact.num.elt),
          all.bin = TRUE)
# Solution
which(mod$solution >= 0.999)
# [1] 2 3 4
# Profit
mod$objval
# [1] 250
Subsetting the optimal solution of adagio::knapsack down to the desired size is a reasonable heuristic when the desired subset size is smaller than the cardinality of the unconstrained optimum. However, there are cases where the optimal solution to the standard knapsack problem and the optimal solution to the size-constrained problem are disjoint. For instance, consider the following problem data:
p <- c(2, 2, 2, 2, 3, 3)
w <- c(1, 1, 1, 1, 2, 2)
cap <- 4
exact.num.elt <- 2
With capacity 4 and no size constraint, the standard knapsack problem will select the four elements with profit 2 and weight 1, getting total profit 8. However, with size limit 2 the optimal solution is instead to select the two elements with profit 3 and weight 2, getting total profit 6.
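Plugging this data into the same lpSolve formulation shown above confirms the claim:

```r
library(lpSolve)
p <- c(2, 2, 2, 2, 3, 3)
w <- c(1, 1, 1, 1, 2, 2)
mod <- lp(direction = "max",
          objective.in = p,
          const.mat = rbind(w, rep(1, length(p))),
          const.dir = c("<=", "="),
          const.rhs = c(4, 2),  # capacity 4, exactly 2 elements
          all.bin = TRUE)
which(mod$solution >= 0.999)  # [1] 5 6
mod$objval                    # [1] 6
```

The size-constrained optimum picks the two profit-3 elements, sharing no element with the unconstrained optimum.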