I'm doing a Gaussian process simulation in Julia. I have x and y, and I want to split them into 85% training and 15% testing sets, then fit a model and predict. How should I write the code? I know that in Python the function I would use is train_test_split().
using Distributions, LinearAlgebra
x = rand(100)
dis = [abs(i - j) for i in x, j in x] # pairwise distance matrix
σ2 = 1
g = 1
l = Matrix(I, 100, 100) # identity matrix, used as a jitter term
μ = zeros(100)
Σ = σ2 * exp.(-dis ./ g) + 0.1l # elementwise exponential kernel plus jitter on the diagonal
y = MvNormal(μ, Σ)
Y = rand(y, 100)
Use the partition function from MLJ:
using MLJ
MLJ.partition((x, Y), 0.85, multi=true)
Here is its documentation https://alan-turing-institute.github.io/MLJ.jl/dev/preparing_data/#Splitting-data.
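For example, you can unpack both splits in one line (a quick sketch; shuffle and rng are optional keywords of partition, and I'm assuming x and Y share their first dimension):
using MLJ
(xtrain, xtest), (Ytrain, Ytest) = partition((x, Y), 0.85, multi=true, shuffle=true, rng=123)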
You can use TrainTestSplit from the Lathe package:
using Lathe.preprocess: TrainTestSplit
traindf, testdf = TrainTestSplit(df,.85);
Check this link for more: https://github.com/emmettgb/Lathe-Books/tree/main/preprocess
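Note that TrainTestSplit expects a DataFrame, so if x and Y are plain arrays you would wrap them first; an untested sketch, assuming Y is a vector of the same length as x:
using DataFrames
using Lathe.preprocess: TrainTestSplit
df = DataFrame(x = x, y = Y)
traindf, testdf = TrainTestSplit(df, .85)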
Or use partition from BetaML:
using BetaML
((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.85,0.15])
This generalises to n arrays (e.g. train/validation/test splits), and you can also choose the dimension along which to partition and whether or not to randomise the partition.
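For instance, a three-way split without shuffling might look like this (a sketch; I'm assuming shuffle is the keyword name and that the fractions must sum to one):
using BetaML
((xtrain, xval, xtest), (ytrain, yval, ytest)) = partition([x, y], [0.7, 0.15, 0.15], shuffle=false)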
I am trying to solve an economic problem using the SymPy package in Julia. In this problem I have exogenous variables and endogenous variables, and I am indexing them all. I have two questions:
How do I access the indexed variables to pass in calibrated values (for the exogenous variables, calibrated in another environment) or formulas (for the endogenous variables, determined by the first-order conditions of the agents' maximization problem, worked out with pencil and paper)? This will also allow me to study the behavior of the equilibrium when I perturb the exogenous variables. First, consider my attempt to pass calibrated values to the exogenous variables.
using SymPy
# To index
n, N = sympy.symbols("n N", integer=true)
N = 3 # It can change
# Household
#exogenous variables
α = sympy.IndexedBase("α")
@syms γ
α2 = sympy.Sum(α[n], (n, 1, N))
equation_1 = Eq(α2 + γ, 1)
equation_1 says that the α's plus γ sum to one. So I would like to pass values to the α vector according to another vector, alpha3, of calibrated parameters.
# Suppose
alpha3 = [1,2,3]
for n in 1:N
α[n]= alpha3[n]
end
MethodError: no method matching setindex!(::Sym, ::Int64, ::Int64)
I will certainly do this step once the system is solved. Now, I want to pass formulas or expressions as functions of prices. Prices are endogenous and unknown variables in the system. (As said before, the expressions were calculated with pencil and paper.)
# Price vector, Endogenous, unknown in the system equations
P = sympy.IndexedBase("P")
# Other exogenous variables to be calibrated.
z = sympy.IndexedBase("z")
s = sympy.IndexedBase("s")
Y = sympy.IndexedBase("Y")
# S[n] and D[n], supply and demand, are endogenous, but determined by the first-order conditions of the agents' maximization problem
# Supply and Demand
S = sympy.IndexedBase("S")
D = sympy.IndexedBase("D")
# (Hypothetical functions that I have to pass)
# S[n] = s[n]*P[n]
# D[n] = z[n]/P[n]
Once I can write the formulas for S[n] and D[n], consider the second question:
How do I specify the indexed endogenous variables (all prices, in their indexed format P[n]) as the unknowns in the system of non-linear equations? I will ignore the possibility that the system cannot be solved. Suppose my system has a single solution or infinitely many (a manifold). So let's assume that I have more equations than variables:
# For all n, I want to determine N indexed equations (looping?)
Eq_n = Eq(S[n] - D[n],0)
# Some other equations relating the P[n]'s
Eq0 = Eq(sympy.Sum(P[n]*Y[n] , (n, 1, N)), 0 )
# Equations system
eq_system = [Eq_n,Eq0]
# Solving
solveset(eq_system,P[n])
Many thanks
There isn't any direct support for the IndexedBase feature of SymPy. As such, the syntax alpha[n] is not available. You can call the method __getitem__ directly, as with
alpha.__getitem__(n)
I don't see a corresponding __setitem__ documented, so I'm not sure whether
α[n]= alpha3[n]
is valid in SymPy itself. But if there is some other assignment method, you would likely just call that instead of using [ for assignment.
As for the last question about equations, I'm not sure, but you would presumably find the size of the IndexedBase object and use that to loop.
If possible, using native Julia constructs would be preferred. For this example, you might just consider an array of variables. The recently changed @syms macro makes this easy to generate.
For example, I think the following mostly replicates what you are trying to do:
@syms n::integer, N::integer
# exogenous variables
N = 3
@syms α[1:3] # hard code 3 here or use `α = [Sym("αᵢ$i") for i ∈ 1:N]`
@syms γ
α2 = sum(α[i] for i ∈ 1:N)
equation_1 = Eq(α2 + γ, 1)
alpha3 = [1,2,3]
for n in 1:N
α[n]= alpha3[n]
end
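Note that assigning numbers into α mutates the Julia vector, but equation_1 was built from the original symbolic entries and will not update automatically. One option (a sketch) is to save the symbols before the loop and substitute the calibrated values into the equation:
α_syms = copy(α) # do this before overwriting α in the loop
equation_1_cal = subs(equation_1, (α_syms .=> alpha3)...)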
@syms P[1:3], z[1:3], s[1:3], Y[1:3], S[1:3], D[1:3] # Y rather than γ, since Y appears in Eq0
Eq_n = [Eq(S[n], D[n]) for n ∈ 1:N]
Eq0 = Eq(sum(P .* Y), 0)
eq_system = [Eq_n; Eq0] # flatten into a single vector of equations
solve(eq_system, P)     # solve the system for the price vector
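To close the loop on the hypothetical supply and demand forms from the question (S[n] = s[n]*P[n] and D[n] = z[n]/P[n]), S and D can simply be vectors of expressions, and the market-clearing system can be solved for the prices; a sketch:
Seq = [s[i] * P[i] for i ∈ 1:N] # hypothetical supply, from the question
Deq = [z[i] / P[i] for i ∈ 1:N] # hypothetical demand, from the question
market_clearing = [Eq(Seq[i] - Deq[i], 0) for i ∈ 1:N]
sol = solve(market_clearing, [P[i] for i ∈ 1:N])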
I have the following code that evaluates the likelihood function for a spatial autoregressive model in Julia, like so:
using SparseArrays, LinearAlgebra

function like_sar2(betas,rho,sige,y,x,W)
n = length(y)
A = sparse(1.0I, n, n) - rho*W # speye(n) in pre-1.0 Julia
e = y-x*betas-rho*sparse(W)*y
epe = e'*e
tmp2 = 1/(2*sige)
llike = -(n/2)*log(pi) - (n/2)*log(sige) + log(det(A)) - tmp2*epe
end
I am trying to maximize this function but I'm not sure how to pass the different sized function inputs so that the Optim.jl package will accept it. I have tried the following:
optimize(like_sar2,[betas;rho;sige;y;x;W],BFGS())
and
optimize(like_sar2,tuple(betas,rho,sige,y,x,W),BFGS())
In the first case, the matrix in brackets does not conform due to dimension mismatch and in the second, the Optim package doesn't allow tuples.
I'd like to try and maximize this likelihood function so that it can return the numerical Hessian matrix (using the Optim options) so that I can compute t-statistics for the parameters.
If there were an easier way to obtain the numerical Hessian for such a function I'd use that, but it appears that packages like ForwardDiff only accept functions of a single (vector) argument.
Any help would be greatly appreciated!
Not 100% sure I correctly understand how your function works, but it seems to me like you're using the likelihood to estimate the coefficient vector beta, with the other input variables fixed. The way to do this would be to amend the function as follows:
using Optim, SparseArrays, LinearAlgebra
# Initialize some parameters
coeffs = rand(10)
rho = 0.1
ys = rand(10)
xs = rand(10,10)
Wmat = rand(10,10)
sige=0.5
# Construct likelihood with parameters fixed at pre-defined values
function like_sar2(β::Vector{Float64},ρ=rho,σε=sige,y=ys,x=xs,W=Wmat)
n = length(y)
A = sparse(1.0I, n, n) - ρ*W # speye(n) in pre-1.0 Julia
ε = y-x*β-ρ*sparse(W)*y
epe = ε'*ε
tmp2 = 1/(2*σε)
llike = -(n/2)*log(π) - (n/2)*log(σε) + log(det(A)) - tmp2*epe
end
# Optimize, with starting value zero for all beta coefficients
optimize(like_sar2, zeros(10), NelderMead())
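One caveat: optimize minimizes, so to maximize the log-likelihood you would pass its negative, e.g.
res = optimize(b -> -like_sar2(b), zeros(10), NelderMead())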
If you need to optimize more than your beta parameters (in the general autoregressive models I've used, the autocorrelation parameter was often estimated jointly with the other coefficients), you can pack them in with the beta vector and unpack within the function, like so:
append!(coeffs,rho)
function like_sar3(coeffs::Vector{Float64},σε=sige,y=ys,x=xs,W=Wmat)
β = coeffs[1:10]; ρ = coeffs[11]
n = length(y)
A = sparse(1.0I, n, n) - ρ*W # speye(n) in pre-1.0 Julia
ε = y-x*β-ρ*sparse(W)*y
epe = ε'*ε
tmp2 = 1/(2*σε)
llike = -(n/2)*log(π) - (n/2)*log(σε) + log(det(A)) - tmp2*epe
end
The key is that you end up with one vector of inputs to pass into your function.
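Since the goal was a numerical Hessian for t-statistics: once everything is packed into a single vector, ForwardDiff can differentiate the likelihood directly. A sketch, assuming the ::Vector{Float64} annotation on like_sar3 is dropped so ForwardDiff's dual numbers can pass through, and negating again because optimize minimizes:
using Optim, ForwardDiff, LinearAlgebra
negll(θ) = -like_sar3(θ)
res = optimize(negll, vcat(zeros(10), 0.1), NelderMead())
θhat = Optim.minimizer(res)
H = ForwardDiff.hessian(negll, θhat) # Hessian of the negative log-likelihood at the optimum
se = sqrt.(diag(inv(H))) # asymptotic standard errors
tstats = θhat ./ se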
I'm trying to learn a little Julia by doing some Bayesian analysis. In Peter Hoff's textbook, he describes a process of sampling from the posterior predictive distribution of a Poisson-Gamma model in which he:
Samples values from the gamma distribution
Samples values from the Poisson distribution, passing a vector of lambdas
Here is what this looks like in R:
a <- 2
b <- 1
sy1 <- 217; n1 <- 111
theta1.mc <- rgamma(1000, a+sy1, b+n1)
y1.mc <- rpois(1000, theta1.mc)
In Julia, I see that distributions can't take a vector of parameters. So, I end up doing something like this:
using Distributions
a = 2
b = 1
sy1 = 217; n1 = 111
theta_mc = rand(Gamma(a + sy1, 1/(b + n1)), 5000)
y1_mc = map(x -> rand(Poisson(x)), theta_mc)
While I was initially put off by the distribution functions not taking a vector and working Just Like R™, I like that I don't need to set my number of samples more than once. That said, I'm not sure I'm doing this idiomatically, either in terms of how people work with the Distributions package or, more generally, how to compose functions.
Can anyone suggest a better, more idiomatic approach than my example code?
I would usually do something like the following, which uses an array comprehension:
a, b = 2, 1
sy1, n1 = 217, 111
theta_mc = rand(Gamma(a + sy1, 1 / (b + n1)), 1000)
y1_mc = [rand(Poisson(theta)) for theta in theta_mc]
One source of confusion may be that Poisson isn't really a function, it's a type constructor and it returns an object. So vectorization over theta doesn't really make sense, since that wouldn't construct one object, but many -- which would then require another step to call rand on each generated object.
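For what it's worth, dot broadcasting gives an R-like one-liner: it constructs one Poisson object per theta and then draws from each:
y1_mc = rand.(Poisson.(theta_mc))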
I'm adapting MATLAB code to R, trying to generate a waveform using the ARMA formula. Is there a simple R equivalent of MATLAB's filter function that takes AR/MA coefficients and builds a waveform?
npts = 100;
a = [1 0.6]; % AR coeffs
b = [1 0.25 3]; % MA coeffs
e = randn(npts,1); % generate gaussian white noise
waveform = filter(b,a,e); % generate waveform
Hmm, can't you achieve that with the filter function in the signal package?
require(signal)
a = c(1,0.6)
b = c(1,0.25,3)
e = rnorm(100)
waveform = filter(b,a,e)
Yeah, you can do this using arima.sim, e.g.
arima.sim(n = npts, model = list(ar = a, ma = b), rand.gen = rnorm)
Note that the model is checked for stationarity and the model you have above is not stationary. If you want something integrated you can specify the order of integration in the model.
Thanks to R's lazy evaluation of function arguments, it is possible to specify a consistent subset of the input parameters and have the others automagically calculated.
Consider the following function, linking the concentration, mass, volume and molar weight for a dilution in chemistry,
concentration <- function(c = m / (M*V), m = c*M*V, V = m / (M*c), M = 417.84){
cat(c("c=", c*1e6, "micro.mol/L\n",
"m=", m*1e3, "mg\n",
"M=", M, "g/mol\n",
"V=", V*1e3, "mL\n"))
## mol/L, g, g/mol, L
invisible(list(c=c, m=m, M=M, V=V))
}
Is there a way to specify only one of the equations and have R figure out the others by inversion? I realise this is limited to simple linear relationships, as the inversion cannot generally be expressed analytically.
concentration <- function(c = m / (M*V), m, V, M = 417.84){
## { magic.incantation }
## mol/L, g, g/mol, L
invisible(list(c=c, m=m, M=M, V=V))
}
You might want to look at the BB package, and in particular the function BBsolve(). BBsolve does a Newton-Raphson backsolve of the equation(s) you feed it. As it happens :-), I wrote and published a function "ktsolve" which allows you to enter a set of equations and some subset of the variables, and it'll return the values of the other variables. (It's named in honor of the commercial TK!Solver package.) If you want to try it out, you can get it at http://witthoft.com/ktsolve.R (or http://witthoft.com/rtools.html and click on the link there).