I'm writing code for bank loan optimization. In this problem I import a CSV with all the parameters (customer age, number of installments, loan amount, among others). First I need to model the multivariable function (there are 6 variables in total) and then apply traditional methods such as quasi-Newton and Nelder-Mead to it (it's for a university subject). For this I need to build a mathematical model in order to find this function. Can you help me?
This is what my .csv looks like when I open it with pandas: [image: CSV]
Thanks in advance folks!
I just did it.
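from scipy.optimize import minimize

# Assumes df (the DataFrame read from the CSV), the predictor columns y1..y5
# and the target z are already defined.
linhas = len(df)  # number of rows (assumed; not defined in the original snippet)

x1 = [1, 1, 1, 1, 1]  # initial guess for the five coefficients

def func(x1):
    # Mean squared error of the linear model against z
    a = 0
    for i in range(len(df)):
        w = 1/linhas*((y1[i]*x1[0]+y2[i]*x1[1]+y3[i]*x1[2]+y4[i]*x1[3]+y5[i]*x1[4]-z[i])**2)
        a = a + w
    return a

b = func(x1)
print(b)
result = minimize(func, x1, method="nelder-mead")
print(result)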
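For reference, the loop above is the mean squared error of a linear model. Here is a minimal pure-Python sketch of the same objective, using a small made-up data set in place of the CSV. The names `rows`, `targets`, `true_coef` and `mse` are hypothetical and not part of the original code.

```python
# Hypothetical stand-in for the CSV: each row holds the five predictors
# (y1..y5 in the code above); targets plays the role of z.
rows = [
    [1.0, 2.0, 0.0, 1.0, 3.0],
    [0.0, 1.0, 1.0, 2.0, 1.0],
    [2.0, 0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 2.0, 1.0, 0.0],
    [3.0, 1.0, 0.0, 2.0, 2.0],
    [0.0, 2.0, 3.0, 1.0, 1.0],
]
true_coef = [0.5, -1.0, 2.0, 0.0, 1.5]  # made-up coefficients used to build targets
targets = [sum(c * v for c, v in zip(true_coef, row)) for row in rows]


def mse(coef):
    """Mean squared error of the linear model sum(coef[j] * row[j]) against targets."""
    total = 0.0
    for row, target in zip(rows, targets):
        prediction = sum(c * v for c, v in zip(coef, row))
        total += (prediction - target) ** 2
    return total / len(rows)


print(mse([1, 1, 1, 1, 1]))  # error at the starting guess
print(mse(true_coef))        # 0.0 at the coefficients that generated the data
```

A function of this shape can be handed to scipy.optimize.minimize with method="Nelder-Mead" or, for a quasi-Newton method, method="BFGS".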
Let me preface this by saying that I do think this question is a coding question, not a statistics question. It would almost surely be closed over at Stats.SE.
The leaps package in R has a useful function for model selection called regsubsets which, for any given size of a model, finds the variables that produce the minimum residual sum of squares. Now I am reading the book Linear Models with R, 2nd Ed., by Julian Faraway. On pages 154-5, he has an example of using the AIC for model selection. The complete code to reproduce the example runs like this:
data(state)
statedata = data.frame(state.x77, row.names=state.abb)
require(leaps)
b = regsubsets(Life.Exp~.,data=statedata)
rs = summary(b)
rs$which
AIC = 50*log(rs$rss/50) + (2:8)*2
plot(AIC ~ I(1:7), ylab="AIC", xlab="Number of Predictors")
The rs$which command produces the output of the regsubsets function and allows you to select the model once you've plotted the AIC and found the number of parameters that minimizes the AIC. But here's the problem: while the typed-up example works fine, I'm having trouble with the wrong number of elements in the array when I try to use this code and adapt it to other data. For example:
require(faraway)
data(odor, package='faraway')
b=regsubsets(odor~temp+gas+pack+
I(temp^2)+I(gas^2)+I(pack^2)+
I(temp*gas)+I(temp*pack)+I(gas*pack),data=odor)
rs=summary(b)
rs$which
AIC=50*log(rs$rss/50) + (2:10)*2
produces a warning message:
Warning message:
In 50 * log(rs$rss/50) + (2:10) * 2 :
longer object length is not a multiple of shorter object length
Sure enough, length(rs$rss)=8, but length(2:10)=9. Now what I need to do is model selection, which means I really ought to have an RSS value for each model size. But if I choose b$rss in the AIC formula, it doesn't work with the original example!
So here's my question: what is summary() doing to the output of the regsubsets() function? The number of RSS values is not only not the same, but the values themselves are not the same.
Ok, so you know the help page for regsubsets says
regsubsets returns an object of class "regsubsets" containing no
user-serviceable parts. It is designed to be processed by
summary.regsubsets.
You're about to find out why.
The code in regsubsets calls Alan Miller's Fortran 77 code for subset selection. That is, I didn't write it and it's in Fortran 77. I do understand the algorithm. In 1996 when I wrote leaps (and again in 2017 when I made a significant modification) I spent enough time reading the code to understand what the variables were doing, but regsubsets mostly followed the structure of the Fortran driver program that came with the code.
The rss field of the regsubsets object has that name because it stores a variable called RSS in the Fortran code. This variable is not the residual sum of squares of the best model. RSS is computed in the setup phase, before any subset selection is done, by the subroutine SSLEAPS, which is commented 'Calculates partial residual sums of squares from an orthogonal reduction from AS75.1.' That is, RSS describes the RSS of the models with no selection, fitted from left to right in the design matrix: the model with just the leftmost variable, then the leftmost two variables, and so on. There's no reason anyone would need to know this if they're not planning to read the Fortran, so it's not documented.
The code in summary.regsubsets extracts the residual sum of squares in the output from the $ress component of the object, which comes from the RESS variable in the Fortran code. This is an array whose [i,j] element is the residual sum of squares of the j-th best model of size i.
All the model criteria are computed from $ress in the same loop of summary.regsubsets, which can be edited down to this:
for (i in ll$first:min(ll$last, ll$nvmax)) {
    for (j in 1:nshow) {
        vr <- ll$ress[i, j]/ll$nullrss
        rssvec <- c(rssvec, ll$ress[i, j])
        rsqvec <- c(rsqvec, 1 - vr)
        adjr2vec <- c(adjr2vec, 1 - vr * n1/(n1 + ll$intercept - i))
        cpvec <- c(cpvec, ll$ress[i, j]/sigma2 - (n1 + ll$intercept - 2 * i))
        bicvec <- c(bicvec, (n1 + ll$intercept) * log(vr) +
            i * log(n1 + ll$intercept))
    }
}
cpvec gives you the same information as AIC, but if you want AIC it would be straightforward to do the same loop and compute it.
regsubsets has an nvmax parameter to control the "maximum size of subsets to examine". By default this is 8. If you increase it to 9 or higher, your code works.
Please note, though, that the 50 in your AIC formula is the sample size (i.e. the 50 states in statedata). So for your second example, both occurrences of 50 should be nrow(odor), which is 15.
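The AIC formula from the question, with the sample size passed in instead of a hard-coded 50, can be sketched in Python as follows. This is only an illustration: the function name `aic_from_rss` and the RSS values are made up, and the formula is the one used in the question (AIC up to an additive constant).

```python
import math


def aic_from_rss(rss_values, n):
    """AIC (up to an additive constant) for the best model of each size.

    rss_values[k] is the RSS of the best model with k + 1 predictors.
    The parameter count adds one for the intercept, matching (2:8) in the
    R code. n is the sample size.
    """
    return [
        n * math.log(rss / n) + 2 * (k + 2)
        for k, rss in enumerate(rss_values)
    ]


# Made-up RSS values for models with 1, 2 and 3 predictors, n = 15 rows.
example_rss = [30.0, 22.0, 21.5]
aic = aic_from_rss(example_rss, n=15)
best_size = aic.index(min(aic)) + 1  # number of predictors minimizing AIC
print(aic, best_size)
```

Because the penalty vector is built from the length of the RSS vector, the two can never disagree in length, which is what triggered the warning in the question.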
I'm currently struggling with some mixed models for repeated measures in R. I have read a lot of posts and requests about converting code from SAS to R, and I have found some elements, but I am not sure about what I have done so far.
I am trying to model the effect of some products on subjects with different sequence patterns and different visits (following the pattern of 1 product per visit).
I have some SAS code which is the "ground truth", and I would like to reproduce the SAS results in R (with the nlme package or equivalent) to display them through a Shiny app.
I've tried some models in R, with some results close to the ones from SAS, but some parts are still different, especially AIC, BIC and LogLik.
Below is the SAS code that I am trying to convert to R, and my R implementation:
SAS Code
Proc mixed data = data method = reml;
Class A B C D;
Model variable = B C D / solution ddfm = kenwardroger ;
Random A(B);
etc.
etc.
Run;
R Code
library(nlme)
model <- test.lme <- lme(variable~ B + C+ D,
random = ~ 1| A / B, data = data, na.action = na.omit)
A : Subject ID
B : sequence pattern
C : visit number
D : product used
Is my conversion to R correct? If it is, why do I get different results for AIC, BIC and LogLik?
Thanks in advance
I have an experiment being simulated. This experiment has 3 parameters a, b, c (variables?), but the result, r, cannot be "predicted" as it has a stochastic component. In order to minimize the stochastic component I've run this experiment several times (n). So, in summary, I have n 4-tuples a, b, c, r where a, b, c are the same but r varies. Each batch of experiments is run with different values for a, b, c (k batches), so the complete data set has k times n 4-tuples.
I would like to find out the best polynomial fit for this data and how to compare them like:
fit1: with
fit2: with
fit3: some 3rd degree polynomial function and corresponding error
fit4: another 3rd degree (simpler) polynomial function and corresponding error
and so on...
This could be done with R or Matlab®. I've searched and found many examples, but none handled the same input values with different outputs.
I considered doing the multivariate polynomial regression n times, adding some small delta to each parameter, but I'd rather find a cleaner solution before that.
Any help would be appreciated.
Thanks in advance,
Jacques
Polynomial regression should be able to handle stochastic simulations just fine. Just simulate r, n times, and perform a multivariate polynomial regression across all points you've simulated (I recommend polyfitn()).
You'll have multiple r values for the same [a,b,c] but a well-fit curve should be able to estimate the true distribution.
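In polyfitn it will look something like this:
n = 1000;                 % number of repeated simulations per parameter set
a = rand(500,1);
b = rand(500,1);
c = rand(500,1);
r = zeros(length(a), n);  % one row per parameter set, one column per repeat
for k = 1:n
    for i = 1:length(a)
        r(i,k) = foo(a(i),b(i),c(i));  % foo is the stochastic simulation
    end
end
% candidate model specifications (remaining entries elided)
my_functions = {'a^2 b^2 c^2 a b c',...};
for fun_id = 1:length(my_functions)
    % r(:) stacks the columns of r, matching the n stacked copies of [a,b,c]
    p{fun_id} = polyfitn(repmat([a,b,c],[n,1]),r(:),my_functions{fun_id});
end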
It's not hard to iteratively/recursively generate a set of polynomial equations from a basis function, but for three variables there might not be a need to. Unless you have a specific reason for fitting higher-order polynomials (planetary physics, particle physics, and so on), you shouldn't have too many functions to fit. It is generally not good practice to use higher-order polynomials to explain data unless you have a specific reason for doing so: there is a risk of overfitting, the data may be sparse, inter-variable noise gets amplified, and more accurate non-linear methods exist.
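As a rough illustration of generating the candidate terms, the following Python sketch enumerates every monomial in three variables up to a given total degree. The helper names `monomial_exponents` and `term_to_string` are hypothetical, and the printed strings are for display only; they are not claimed to be the model-specification format polyfitn expects.

```python
from itertools import product


def monomial_exponents(n_vars, max_degree):
    """All exponent tuples (e1, ..., en) with e1 + ... + en <= max_degree."""
    return [
        exps
        for exps in product(range(max_degree + 1), repeat=n_vars)
        if sum(exps) <= max_degree
    ]


def term_to_string(exps, names=("a", "b", "c")):
    """Render one exponent tuple as text, e.g. (2, 0, 1) -> 'a^2*c'."""
    factors = [
        name if e == 1 else f"{name}^{e}"
        for name, e in zip(names, exps)
        if e > 0
    ]
    return "*".join(factors) if factors else "1"


quadratic = monomial_exponents(3, 2)
cubic = monomial_exponents(3, 3)
print(len(quadratic), len(cubic))  # 10 and 20 terms, constant included
print([term_to_string(e) for e in quadratic])
```

Simpler candidate models are then subsets of these term lists, which keeps the number of fits to compare small for three variables.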
I am doing a regression with several categorical and continuous variables mixed together. To simplify my question: I want to create a regression model that predicts the driving time for a given driver in different zones, given the miles driven. Let's say I have 5 different drivers and 2 zones in my training data.
I know I probably need to build 5*2=10 regression models for prediction. What I am using in R is
m <- lm(driving_time ~ factor(driver)+factor(zone)+miles)
But it seems like R doesn't expand the combinations. My question is whether there is a smart way to do the expansion automatically in R, or whether I have to write the 10 regression models one by one. Thank you.
Please read ?formula. + in a formula means include that variable as a main effect. You seem to be looking for an interaction term between driver and zone. You create an interaction term using the : operator. There is also a shortcut to get both the main and interaction effects via the * operator.
There is some confusion as to whether you want miles to also interact, but I'll assume not here as you only mention 2 x 5 terms.
foo <- transform(foo, driver = factor(driver), zone = factor(zone))
m <- lm(driving_time ~ driver * zone + miles, data = foo)
Here I assume your data are in data frame foo. The first line separates the data processing from the model specification/fitting by converting the variables of interest to factors before fitting.
The formula then specifies main and interaction effects for driver and zone, plus a main effect for miles.
If you want interactions between all three then:
m <- lm(driving_time ~ driver * zone * miles, data = foo)
or
m <- lm(driving_time ~ (driver + zone + miles)^3, data = foo)
would do that for you.
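To see what driver * zone + miles expands to, here is a small Python sketch that lists the resulting coefficient names, assuming R's default treatment (dummy) contrasts, under which the first level of each factor is the baseline. The level labels d1..d5 and z1, z2 are hypothetical.

```python
from itertools import product

# Hypothetical factor levels: 5 drivers and 2 zones, as in the question.
drivers = ["d1", "d2", "d3", "d4", "d5"]
zones = ["z1", "z2"]

# Treatment (dummy) coding drops the first level of each factor as the baseline.
driver_terms = [f"driver{d}" for d in drivers[1:]]
zone_terms = [f"zone{z}" for z in zones[1:]]
interaction_terms = [f"{d}:{z}" for d, z in product(driver_terms, zone_terms)]

columns = ["(Intercept)"] + driver_terms + zone_terms + ["miles"] + interaction_terms
print(len(columns))  # 11 coefficients
print(columns)
```

The intercept, the 4 driver terms, the 1 zone term and the 4 interaction terms together give each of the 10 driver/zone combinations its own intercept, while the single miles coefficient is shared by all of them.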
I've fitted a VECM model in R and converted it to a VAR representation. I would like to use this model to predict the future value of a response variable based on different scenarios for the explanatory variables.
Here is the code for the model:
library(urca)
library(vars)
input <-read.csv("data.csv")
ts <- ts(input[16:52,],c(2000,1),frequency=4)
dat1 <- cbind(ts[,"dx"], ts[,"u"], ts[,"cci"],ts[,"bci"],ts[,"cpi"],ts[,"gdp"])
args('ca.jo')
vecm <- ca.jo(dat1, type = 'trace', K = 2, season = NULL,spec="longrun",dumvar=NULL)
vecm.var <- vec2var(vecm,r=2)
Now what I would like to do is predict "dx" into the future by varying the others. I am not sure if something like "predict dx if u=30,cpi=15,bci=50,gdp=..." in the next period would work. So what I have in mind is something along the lines of: increase "u" by 15% in the next period (which would obviously have an impact on all the other variables as well, including "dx") and predict the impact into the future.
Also, I am not sure if the "vec2var" step is necessary, so please ignore it if you think it is redundant.
Thanks
Karl
This subject is covered very nicely in Chapters 4 and 8 of Bernhard Pfaff's book, "Analysis of Integrated and Cointegrated Time Series with R", for which the vars and urca packages were written.
The vec2var step is necessary if you want to use the predict functionality that's available.
A more complete answer was provided on the R-Sig-Finance list. See also this related thread.
Here you go - ??forecast gave vars::predict, Predict method for objects of class varest and vec2var as an answer, which looks precisely as you want it. Increasing u looks like impulse response analysis, so look it up!