How to perform cross-validation (K-fold) in Julia? - julia

Suppose I have a dataset with two columns. I have built a linear regression model on this dataset; now my question is how to check the accuracy of the model.
I found that the answer is to apply K-fold cross-validation to my dataset. I know how K-fold works, but I have no idea how to implement it in my Julia program.
#suppose I have two columns x and y in my dataset
x= [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y=[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]
# now how do I use K-fold to split dataset and also evaluate my algorithm?

As mentioned in the comments, it is easier to set up some code once a base example is given. In this case, K-fold cross-validation needs preparation like the following (Julia 0.6 syntax; in Julia >= 0.7, linspace(1, N, K+1) becomes range(1, stop = N, length = K+1) and randperm requires using Random):
julia> x= [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20];
julia> y=[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21];
julia> K = 5 # number of folds in validation
5
julia> N = length(x) # number of samples in dataset
20
julia> stops = round.(Int,linspace(1,N,K+1))
6-element Array{Int64,1}:
1
5
9
12
16
20
julia> vsets = [s:e-(e<N)*1 for (s,e) in zip(stops[1:end-1],stops[2:end])]
5-element Array{UnitRange{Int64},1}:
1:4
5:8
9:11
12:15
16:20
julia> tsets1 = [1:s-1 for (s,e) in zip(stops[1:end-1],stops[2:end])]
5-element Array{UnitRange{Int64},1}:
1:0
1:4
1:8
1:11
1:15
julia> tsets2 = [e+(e<=N)*1:N for (s,e) in zip(stops[1:end-1],stops[2:end])]
5-element Array{UnitRange{Int64},1}:
6:20
10:20
13:20
17:20
21:20
julia> σ = randperm(N);
julia> [x[σ[vsets[i]]] for i=1:K] # validation sets
5-element Array{Array{Int64,1},1}:
[5, 13, 6, 10]
[16, 4, 2, 3]
[9, 19, 20]
[17, 12, 14, 11]
[8, 1, 18, 7, 15]
julia> [x[vcat(σ[tsets1[i]],σ[tsets2[i]])] for i=1:K] # training sets
5-element Array{Array{Int64,1},1}:
[4, 2, 3, 9, 19, 20, 17, 12, 14, 11, 8, 1, 18, 7, 15]
[5, 13, 6, 10, 19, 20, 17, 12, 14, 11, 8, 1, 18, 7, 15]
[5, 13, 6, 10, 16, 4, 2, 3, 12, 14, 11, 8, 1, 18, 7, 15]
[5, 13, 6, 10, 16, 4, 2, 3, 9, 19, 20, 1, 18, 7, 15]
[5, 13, 6, 10, 16, 4, 2, 3, 9, 19, 20, 17, 12, 14, 11]
This may be satisfactory. For more details on K-fold cross-validation, see Wikipedia: https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation
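To actually evaluate the linear model with these folds, here is a minimal sketch (my addition, assuming the σ, vsets, tsets1 and tsets2 variables from above, an ordinary least-squares fit via \, and Julia 0.6, where mean is in Base):
rmse = zeros(K)
for i in 1:K
    tr = vcat(σ[tsets1[i]], σ[tsets2[i]])  # training indices for fold i
    va = σ[vsets[i]]                       # validation indices for fold i
    A  = [ones(length(tr)) x[tr]]          # design matrix with intercept column
    β  = A \ y[tr]                         # least-squares coefficients
    ŷ  = β[1] .+ β[2] .* x[va]             # predictions on the held-out fold
    rmse[i] = sqrt(sum((ŷ .- y[va]).^2) / length(va))
end
mean(rmse)  # cross-validated RMSE estimate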

You can use kfolds from MLDataUtils.jl:
kfolds([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],5)
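If you also want ready-made train/validation pairs, a sketch along the lines of the MLDataUtils documentation (the exact signature may differ between versions, so verify against your installed release):
using MLDataUtils
for ((xtrain, ytrain), (xval, yval)) in kfolds((x, y), 5)
    # fit on (xtrain, ytrain), evaluate on (xval, yval)
end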

Related

How to get average marginal effects (AMEs) with standard errors of a multinomial logit model?

I want to get the average marginal effects (AME) of a multinomial logit model with standard errors. I have tried different methods for this, but so far none has led to the goal.
Best attempt
My best attempt was to get the AMEs by hand using mlogit which I show below.
library(mlogit)
ml.d <- mlogit.data(df1, choice="Y", shape="wide") # shape data for `mlogit()`
ml.fit <- mlogit(Y ~ 1 | D + x1 + x2, reflevel="1", data=ml.d) # fit the model
# coefficient names
c.names <- all.vars(ml.fit$call)[2:4]
# get marginal effects
ME.mnl <- sapply(c.names, function(x)
    stats::effects(ml.fit, covariate=x, data=ml.d),
    simplify=FALSE)
# get AMEs
(AME.mnl <- t(sapply(ME.mnl, colMeans)))
# 1 2 3 4 5
# D -0.03027080 -0.008806072 0.0015410569 0.017186531 0.02034928
# x1 -0.02913234 -0.015749598 0.0130577842 0.013240212 0.01858394
# x2 -0.02724650 -0.005482753 0.0008575982 0.005331181 0.02654047
I know these values are the correct ones. However, I could not get the correct standard errors by simply taking the columns' standard deviations:
# standard errors - WRONG!
(AME.mnl.se <- t(sapply(ME.mnl, colSdColMeans)))
(Note: colSdColMeans() for columns' SD is provided here.)
Accordingly this also led me to the wrong t-values:
# t values - WRONG!
AME.mnl / AME.mnl.se
# 1 2 3 4 5
# D -0.7110537 -0.1615635 0.04013228 0.4190057 0.8951484
# x1 -0.7170813 -0.2765212 0.33325968 0.3656893 0.8907836
# x2 -0.7084573 -0.1155825 0.02600653 0.1281190 0.8559794
Whereas I know the correct t-values for this case are these:
# D -9.26 -1.84 0.31 4.29 8.05
# x1 -6.66 -2.48 1.60 1.50 3.22
# x2 -2.95 -0.39 0.06 0.42 3.21
I learned that there should be a "delta method", but I only found code for a very special case with interactions on Cross Validated.
Failed attempts
1.) Package margins doesn't seem to be able to handle "mlogit"
objects:
library(margins)
summary(margins(ml.fit))
2.) There's another package that fits multinomial logits, nnet,
library(nnet)
ml.fit2 <- multinom(Y ~ D + x1 + x2, data=df1)
summary(ml.fit2)
but margins can't handle this correctly either:
> summary(margins(ml.fit2))
factor AME SE z p lower upper
D -0.0303 NA NA NA NA NA
x1 -0.0291 NA NA NA NA NA
x2 -0.0272 NA NA NA NA NA
3.) There's also a package around that claims to calculate "Average Effects for Multinomial Logistic Regression Models",
library(DAMisc)
mnlChange2(ml.fit2, varnames="D", data=df1)
but I couldn't get a drop of milk out of it: the function yields just nothing (not even with the function's own example).
How, then, can we get AMEs with standard errors / t-statistics of a multinomial logit model in R?
Data
df1 <- structure(list(Y = c(3, 4, 1, 2, 3, 4, 1, 5, 2, 3, 4, 2, 1, 4,
1, 5, 3, 3, 3, 5, 5, 4, 3, 5, 4, 2, 5, 4, 3, 2, 5, 3, 2, 5, 5,
4, 5, 1, 2, 4, 3, 1, 2, 3, 1, 1, 3, 2, 4, 2, 2, 4, 1, 5, 3, 1,
5, 2, 3, 4, 2, 4, 5, 2, 4, 1, 4, 2, 1, 5, 3, 2, 1, 4, 4, 1, 5,
1, 1, 1, 4, 5, 5, 3, 2, 3, 3, 2, 4, 4, 5, 3, 5, 1, 2, 5, 5, 1,
2, 3), D = c(12, 8, 6, 11, 5, 14, 0, 22, 15, 13, 18, 3, 5, 9,
10, 28, 9, 16, 17, 14, 26, 18, 18, 23, 23, 12, 28, 14, 10, 15,
26, 9, 2, 30, 18, 24, 27, 7, 6, 25, 13, 8, 4, 16, 1, 4, 5, 18,
21, 1, 2, 19, 4, 2, 16, 17, 23, 15, 13, 21, 24, 14, 27, 6, 20,
6, 19, 8, 7, 23, 11, 11, 1, 22, 21, 4, 27, 6, 2, 9, 18, 30, 26,
22, 10, 1, 4, 7, 26, 15, 26, 18, 30, 1, 11, 29, 25, 3, 19, 15
), x1 = c(13, 12, 4, 3, 16, 16, 15, 13, 1, 15, 10, 16, 1, 17,
7, 13, 12, 6, 8, 16, 16, 11, 7, 16, 5, 13, 12, 16, 17, 6, 16,
9, 14, 16, 15, 5, 7, 2, 8, 2, 9, 9, 15, 13, 9, 4, 16, 2, 11,
13, 11, 6, 4, 3, 7, 4, 12, 2, 16, 14, 3, 13, 10, 11, 10, 4, 11,
16, 8, 12, 14, 9, 4, 16, 16, 12, 9, 10, 6, 1, 3, 8, 7, 7, 5,
16, 17, 10, 4, 15, 10, 8, 3, 13, 9, 16, 12, 7, 4, 11), x2 = c(12,
19, 18, 19, 15, 12, 15, 16, 15, 11, 12, 16, 17, 14, 12, 17, 17,
16, 12, 20, 11, 11, 15, 14, 18, 10, 14, 13, 10, 14, 18, 18, 18,
17, 18, 14, 16, 19, 18, 16, 18, 14, 17, 10, 16, 12, 16, 15, 11,
18, 19, 15, 19, 11, 16, 10, 20, 14, 10, 12, 10, 15, 13, 15, 11,
20, 11, 12, 16, 16, 11, 15, 11, 11, 10, 10, 16, 11, 20, 17, 20,
17, 16, 11, 18, 19, 18, 14, 17, 11, 16, 11, 18, 14, 15, 16, 11,
14, 11, 13)), class = "data.frame", row.names = c(NA, -100L))
We can do something very similar to what is done in your linked answer. In particular, first we want a function that computes the AMEs at a given vector of coefficients. For that we can define
AME.fun <- function(betas) {
  tmp <- ml.fit
  tmp$coefficients <- betas
  ME.mnl <- sapply(c.names, function(x)
    effects(tmp, covariate = x, data = ml.d), simplify = FALSE)
  c(sapply(ME.mnl, colMeans))
}
where the second half is yours, while in the first half I use a trick: I take the same ml.fit object and replace its coefficients. Next we find the Jacobian with
require(numDeriv)
grad <- jacobian(AME.fun, ml.fit$coef)
and apply the delta method: the square roots of the diagonal of grad %*% vcov(ml.fit) %*% t(grad) are what we want. Hence,
(AME.mnl.se <- matrix(sqrt(diag(grad %*% vcov(ml.fit) %*% t(grad))), nrow = 3, byrow = TRUE))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.003269320 0.004788536 0.004995723 0.004009762 0.002527462
# [2,] 0.004375795 0.006348496 0.008168883 0.008844684 0.005763966
# [3,] 0.009233616 0.014048212 0.014713090 0.012702188 0.008261734
AME.mnl / AME.mnl.se
# 1 2 3 4 5
# D -9.259050 -1.8389907 0.30847523 4.2861720 8.051269
# x1 -6.657611 -2.4808393 1.59847852 1.4969683 3.224159
# x2 -2.950794 -0.3902812 0.05828811 0.4197057 3.212458
which coincides with Stata's results.
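For reference, this is the standard first-order delta method: if the coefficient estimates β̂ have covariance matrix V = vcov(ml.fit) and g is the (differentiable) map from coefficients to AMEs, then
Var(g(β̂)) ≈ J V Jᵀ, where J = ∂g/∂β evaluated at β̂,
which is exactly what jacobian(AME.fun, ml.fit$coef) and vcov(ml.fit) supply in the code above.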
If you use vce = "bootstrap" within the margins() function, it provides SEs with confidence intervals as well:
summary(margins(ml.fit2, vce = "bootstrap"))
The terminology for "marginal effects" is very inconsistent across disciplines. Since you refer to the margins package, I assume that you use the expression "Average Marginal Effects" in the same sense that the margins developers use it, which is the result of this procedure:
1. Compute the slope of the outcome with respect to D for every row in the original dataset (unit-level marginal effects).
2. Take the average of the unit-level slopes (average marginal effect).
In models like nnet::multinom, the slopes will be different for every level of the outcome variable. There will thus be one average marginal effect per level, per regressor, as the formula below makes explicit.
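In symbols, with n observations, outcome level k and regressor x, the reported quantity is
AME_k(x) = (1/n) · Σᵢ ∂Pr(Yᵢ = k)/∂xᵢ, for i = 1, …, n,
one number per (level, regressor) pair, matching the Group/Term rows in the output below.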
Using the marginaleffects package and the data you supplied, we get:
library(nnet)
library(marginaleffects)
mod <- nnet::multinom(Y ~ D + x1*x2, data=df1, trace = FALSE)
marginaleffects(mod) |> summary()
Group Term Effect Std. Error z value Pr(>|z|) 2.5 % 97.5 %
1 1 D -0.027558 0.004183 -6.5878 4.4625e-11 -3.576e-02 -0.019359
2 1 x1 -0.026789 0.003916 -6.8411 7.8596e-12 -3.446e-02 -0.019114
3 1 x2 -0.026542 0.009812 -2.7051 0.00682871 -4.577e-02 -0.007311
4 2 D -0.012115 0.004702 -2.5766 0.00997729 -2.133e-02 -0.002899
5 2 x1 -0.018223 0.006017 -3.0287 0.00245619 -3.002e-02 -0.006430
6 2 x2 -0.007045 0.013101 -0.5377 0.59078427 -3.272e-02 0.018633
7 3 D 0.001536 0.005877 0.2614 0.79380433 -9.982e-03 0.013054
8 3 x1 0.012451 0.008775 1.4189 0.15592516 -4.748e-03 0.029650
9 3 x2 0.002193 0.015573 0.1408 0.88801728 -2.833e-02 0.032715
10 4 D 0.016300 0.004325 3.7689 0.00016399 7.823e-03 0.024776
11 4 x1 0.018111 0.008789 2.0606 0.03934167 8.845e-04 0.035338
12 4 x2 0.013543 0.013266 1.0208 0.30733424 -1.246e-02 0.039544
13 5 D 0.021837 0.003387 6.4479 1.1343e-10 1.520e-02 0.028475
14 5 x1 0.014449 0.005402 2.6749 0.00747469 3.862e-03 0.025037
15 5 x2 0.017851 0.009072 1.9677 0.04909878 7.048e-05 0.035631
Model type: multinom
Prediction type: probs

Julia - the way of kings (generator performance)

I had some Python code that I tried to port to Julia to learn this lovely language. I used generators in Python. After porting, it seems to me (at this moment) that Julia is really slow in this area!
I made part of my code simplified to this exercise:
Think of a 4x4 chessboard. Find every path of N moves that a chess king could make. In this exercise, the king is not allowed to visit the same position twice within one path. Don't waste memory -> make a generator of every path.
Algorithm is pretty simple:
If we label every position with a number:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
point 0 has 3 neighbors (1, 4, 5). We can build a table of the neighbors of every point:
NEIG = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]]
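(As an aside, a table like this does not have to be typed by hand; here is a small Julia sketch, my addition, that generates the same 0-based lists by keeping every square at Chebyshev distance 1, i.e. one king move away:
NEIG0 = [[4r + c for r in 0:3 for c in 0:3
          if max(abs(r - p ÷ 4), abs(c - p % 4)) == 1] for p in 0:15]
# NEIG0 reproduces the NEIG table above, e.g. NEIG0[1] == [1, 4, 5]
)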
PYTHON
A recursive function (generator) that enlarges a given path, taking either a list of points or a generator of (generators of ...) points:
def enlarge(path):
    if isinstance(path, list):
        for i in NEIG[path[-1]]:
            if i not in path:
                yield path[:] + [i]
    else:
        for i in path:
            yield from enlarge(i)
A function (generator) that yields every path of a given length:
def paths(length):
    steps = ([i] for i in range(16))  # first steps on every point on the board
    for _ in range(length-1):
        nsteps = enlarge(steps)
        steps = nsteps
    yield from steps
We can see that there are 905776 paths of length 10:
sum(1 for i in paths(10))
Out[89]: 905776
JULIA
(this code was created by @gggg during our discussion here)
const NEIG_py = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
const NEIG = [n.+1 for n in NEIG_py]
function enlarge(path::Vector{Int})
    (push!(copy(path), loc) for loc in NEIG[path[end]] if !(loc in path))
end

collect(enlarge([1]))

function enlargepaths(paths)
    Iterators.Flatten(enlarge(path) for path in paths)
end

collect(enlargepaths([[1],[2]]))

function paths(targetlen)
    paths = ([i] for i=1:16)
    for newlen in 2:targetlen
        paths = enlargepaths(paths)
    end
    paths
end
p = sum(1 for path in paths(10))
benchmark
In IPython we can time it:
Python 3.6.3:
%timeit sum(1 for i in paths(10))
1.25 s ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
julia 0.6.0
julia> @time sum(1 for path in paths(10))
2.690630 seconds (41.91 M allocations: 1.635 GiB, 11.39% gc time)
905776
Julia 0.7.0-DEV.0
julia> @time sum(1 for path in paths(10))
4.951745 seconds (35.69 M allocations: 1.504 GiB, 4.31% gc time)
905776
Question(s):
We Julians say this: "It is important to note that the benchmark codes are not written for absolute maximal performance (the fastest code to compute recursion_fibonacci(20) is the constant literal 6765). Instead, the benchmarks are written to test the performance of identical algorithms and code patterns implemented in each language."
In this benchmark, we are using the same idea: just simple for loops over arrays, wrapped in generators. (Nothing from numpy, numba, pandas or other C-written and compiled Python packages.)
Is the assumption that Julia's generators are terribly slow right?
What can we do to make them really fast?
const NEIG_py = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
const NEIG = [n.+1 for n in NEIG_py];
function expandto(n, path, targetlen)
    length(path) >= targetlen && return n+1
    for loc in NEIG[path[end]]
        loc in path && continue
        n = expandto(n, (path..., loc), targetlen)
    end
    n
end

function npaths(targetlen)
    n = 0
    for i = 1:16
        path = (i,)
        n = expandto(n, path, targetlen)
    end
    n
end
Benchmark (after executing once for JIT-compilation):
julia> @time npaths(10)
0.069531 seconds (5 allocations: 176 bytes)
905776
which is considerably faster.
Julia's "better performance" than Python isn't magical. Most of it stems directly from the fact that Julia can figure out what each variable's type within a function will be, and then compile highly specialized code for those specific types. This even applies to the elements in many containers and iterables like generators; Julia often knows ahead of time what type the elements will be. Python isn't able to do this analysis nearly as easily (or at all, in many cases), so its optimizations have focused on improving the dynamic behaviors.
In order for Julia's generators to know ahead of time what kinds of types they might produce, they encapsulate information about both the operation they perform and the object they iterate over in the type:
julia> (1 for i in 1:16)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##27#28"))}(getfield(Main, Symbol("##27#28"))(), 1:16)
That weird ##27#28 thing is the type of an anonymous function that simply returns 1. By the time the generator gets to LLVM, it knows enough to perform quite a large number of optimizations:
julia> function naive_sum(c)
           s = 0
           for elt in c
               s += elt
           end
           s
       end

@code_llvm naive_sum(1 for i in 1:16)
; Function naive_sum
; Location: REPL[1]:2
define i64 @julia_naive_sum_62385({ { i64, i64 } } addrspace(11)* nocapture nonnull readonly dereferenceable(16)) {
top:
; Location: REPL[1]:3
%1 = getelementptr inbounds { { i64, i64 } }, { { i64, i64 } } addrspace(11)* %0, i64 0, i32 0, i32 0
%2 = load i64, i64 addrspace(11)* %1, align 8
%3 = getelementptr inbounds { { i64, i64 } }, { { i64, i64 } } addrspace(11)* %0, i64 0, i32 0, i32 1
%4 = load i64, i64 addrspace(11)* %3, align 8
%5 = add i64 %4, 1
%6 = sub i64 %5, %2
; Location: REPL[1]:6
ret i64 %6
}
It may take a minute to parse through the LLVM IR there, but you should be able to see that it's just extracting the endpoints of the UnitRange (getelementptr and load), subtracting them from each other (sub) and adding one to compute the sum without a single loop.
In this case, though, it works against Julia: paths(10) has a ridiculously complicated type! You're iteratively wrapping that one generator in filters and flattens and yet more generators. It becomes so complicated, in fact, that Julia just gives up trying to figure it out and decides to live with the dynamic behavior. At this point, it no longer has an inherent advantage over Python; in fact, specializing on so many different types as it recursively walks through the object would be a distinct handicap. You can see this in action by looking at @code_warntype start(1 for i in paths(10)).
My rule of thumb for Julia's performance is that type-stable, devectorized code that avoids allocations is typically within a factor of 2 of C, and dynamic, unstable, or vectorized code is within an order of magnitude of Python/MATLAB/other higher-level languages. Often it's a bit slower simply because the other higher-level languages have pushed very hard to optimize their case, whereas the majority of Julia's optimizations have been focused on the type-stable side of things. This deeply nested construct puts you squarely in the dynamic camp.
So are Julia's generators terribly slow? Not inherently so; it's just when they become so deeply nested like this that you hit this bad case.
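A quick way to see this for yourself (my illustration, in Julia 0.6 syntax, where the iteration protocol is start/next/done):
julia> typeof(paths(10))  # a deeply nested Generator/Flatten type
julia> @code_warntype start(1 for i in paths(10))  # look for Any/Union annotations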
Not following the same algorithm (and I don't know how fast Python would be doing it this way), but with the following code Julia is basically the same for solutions of length 10, and much better for solutions of length 16:
In [48]: %timeit sum(1 for path in paths(10))
1.52 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
julia> @time sum(1 for path in pathsr(10))
1.566964 seconds (5.54 M allocations: 693.729 MiB, 16.24% gc time)
905776
In [49]: %timeit sum(1 for path in paths(16))
19.3 s ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
julia> @time sum(1 for path in pathsr(16))
6.491803 seconds (57.36 M allocations: 9.734 GiB, 33.79% gc time)
343184
Here is the code. I just learned about tasks/channels yesterday, so it can probably be done better:
const NEIG = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14],
              [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];

function enlarger(num::Int, len::Int, pos::Int, sol::Array{Int64,1}, c::Channel)
    if pos == len
        put!(c, copy(sol))
    elseif pos == 0
        for j = 0:num
            sol[1] = j
            enlarger(num, len, pos+1, sol, c)
        end
        close(c)
    else
        for i in NEIG[sol[pos]+1]
            if !in(i, sol[1:pos])
                sol[pos+1] = i
                enlarger(num, len, pos+1, sol, c)
            end
        end
    end
end

function pathsr(len)
    c = Channel(0)
    sol = [0 for i = 1:len]
    @schedule enlarger(15, len, 0, sol, c)
    (i for i in c)
end
Following @tholy's answer: since tuples seem to be very fast, this is like my previous code but with the tuple approach, and it gets substantially better results:
julia> @time sum(1 for i in pathst(10))
1.155639 seconds (1.83 M allocations: 97.632 MiB, 0.75% gc time)
905776
julia> @time sum(1 for i in pathst(16))
1.963470 seconds (1.39 M allocations: 147.555 MiB, 0.35% gc time)
343184
The code:
const NEIG = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
function enlarget(path, len, c::Channel)
    if length(path) >= len
        put!(c, path)
    else
        for loc in NEIG[path[end]+1]
            loc in path && continue
            enlarget((path..., loc), len, c)
        end
        if length(path) == 1
            path[1] == 15 ? close(c) : enlarget((path[1]+1,), len, c)
        end
    end
end

function pathst(len)
    c = Channel(0)
    path = (0,)
    @schedule enlarget(path, len, c)
    (i for i in c)
end
Since everybody is writing an answer... here is another version, this time using iterators, which are kind of more idiomatic than generators in current Julia (0.6.1). Iterators offer many of the benefits generators have. The iterator definition is in the following lines:
import Base: start, next, done, eltype, iteratoreltype, iteratorsize

struct SAWsIterator
    neigh::Vector{Vector{Int}}
    pathlen::Int
    pos::Int
end

SAWs(neigh, pathlen, pos) = SAWsIterator(neigh, pathlen, pos)

start(itr::SAWsIterator) =
    ([itr.pos ; zeros(Int, itr.pathlen-1)], Vector{Int}(itr.pathlen-1),
     2, Ref{Bool}(false), Ref{Bool}(false))

@inline next(itr::SAWsIterator, s) =
    ( s[4][] ? s[4][] = false : calc_next!(itr, s) ;
      (s[1], (s[1], s[2], itr.pathlen, s[4], s[5])) )

@inline done(itr::SAWsIterator, s) = ( s[4][] || calc_next!(itr, s) ; s[5][] )

function calc_next!(itr::SAWsIterator, s)
    s[4][] = true ; s[5][] = false
    curindex = s[3]
    pathlength = itr.pathlen
    path, options = s[1], s[2]
    @inbounds while curindex <= pathlength
        curindex == 1 && ( s[5][] = true ; break )
        startindex = path[curindex] == 0 ? 1 : options[curindex-1]+1
        path[curindex] = 0
        i = findnext(x->!(x in path), neigh[path[curindex-1]], startindex)
        if i == 0
            path[curindex] = 0 ; options[curindex-1] = 0 ; curindex -= 1
        else
            path[curindex] = neigh[path[curindex-1]][i]
            options[curindex-1] = i ; curindex += 1
        end
    end
    return nothing
end

eltype(::Type{SAWsIterator}) = Vector{Int}
iteratoreltype(::Type{SAWsIterator}) = Base.HasEltype()
iteratorsize(::Type{SAWsIterator}) = Base.SizeUnknown()
Cut-and-pasting the definition above works. The term SAW is an acronym for Self-Avoiding Walk, which is sometimes used in mathematics for such a path.
Now, to use/test this iterator, the following code can be executed:
allSAWs(neigh, pathlen) =
    Base.Flatten(SAWs(neigh, pathlen, k) for k in eachindex(neigh))

iterlength(itr) = mapfoldl(x->1, +, 0, itr)

using Base.Test

const neigh = [[2, 5, 6], [1, 3, 5, 6, 7], [2, 4, 6, 7, 8], [3, 7, 8],
    [1, 2, 6, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 4, 6, 8, 10, 11, 12],
    [3, 4, 7, 11, 12], [5, 6, 10, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15],
    [6, 7, 8, 10, 12, 14, 15, 16], [7, 8, 11, 15, 16], [9, 10, 14],
    [9, 10, 11, 13, 15], [10, 11, 12, 14, 16], [11, 12, 15]]

@test iterlength(allSAWs(neigh, 10)) == 905776

for (i, path) in enumerate(allSAWs(neigh, 10))
    if i % 100_000 == 0
        @show i, path
    end
end

@time iterlength(allSAWs(neigh, 10))
It is relatively readable, and the output looks like this:
(i, path) = (100000, [2, 5, 10, 14, 9, 6, 7, 12, 15, 11])
(i, path) = (200000, [4, 3, 8, 7, 6, 10, 14, 11, 16, 15])
(i, path) = (300000, [5, 10, 11, 16, 15, 14, 9, 6, 7, 3])
(i, path) = (400000, [8, 3, 6, 5, 2, 7, 11, 14, 15, 10])
(i, path) = (500000, [9, 14, 10, 5, 2, 3, 8, 11, 6, 7])
(i, path) = (600000, [11, 16, 15, 14, 10, 6, 3, 8, 7, 12])
(i, path) = (700000, [13, 10, 15, 16, 11, 6, 2, 1, 5, 9])
(i, path) = (800000, [15, 11, 12, 7, 2, 3, 6, 1, 5, 9])
(i, path) = (900000, [16, 15, 14, 9, 5, 10, 7, 8, 12, 11])
0.130755 seconds (4.16 M allocations: 104.947 MiB, 11.37% gc time)
905776
0.13s is not too bad considering this is not as optimized as @tholy's answer, or some of the others. Some tricks used in the other answers are deliberately not used here, specifically:
- Recursion basically uses the stack as a quick way to allocate.
- Using tuples combined with specialization hides some run-time complexity in the first compile of methods for each tuple signature.
An optimization not seen in the answers yet, which could be important, is using an efficient Bool array or Dict to speed up the check of whether a vertex was already used in the path; a sketch follows. In this answer, the findnext call triggers an allocation, which can be avoided, and then this answer would be closer to the minimal memory allocation count.
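For illustration, a minimal sketch (my addition, not taken from any of the answers) of that Bool-mask idea, applied to the recursive counter from @tholy's answer and the same 1-based NEIG table used there:
function count_paths(n, last, used::Vector{Bool}, remaining)
    remaining == 0 && return n + 1
    for loc in NEIG[last]
        used[loc] && continue            # O(1) membership test via the mask
        used[loc] = true
        n = count_paths(n, loc, used, remaining - 1)
        used[loc] = false                # backtrack
    end
    n
end

function npaths_masked(targetlen)
    used = fill(false, 16)               # one Bool flag per square
    n = 0
    for i in 1:16
        used[i] = true
        n = count_paths(n, i, used, targetlen - 1)
        used[i] = false
    end
    n
end

npaths_masked(10)  # should again give 905776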
This is my quick and dirty cheating experiment (I promised in a comment to add it here) where I try to speed up Angel's code:
const NEIG_py = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
const NEIG = [n.+1 for n in NEIG_py]
function enlargetc(path, len, c::Function)
    if length(path) >= len
        c(path)
    else
        for loc in NEIG[path[end]]
            loc in path && continue
            enlargetc((path..., loc), len, c)
        end
        if length(path) == 1
            if path[1] == 16 return
            else enlargetc((path[1]+1,), len, c)
            end
        end
    end
end

function get_counter()
    let helper = 0
        function f(a)
            helper += 1
            return helper
        end
        return f
    end
end
counter = get_counter()
@time enlargetc((1,), 10, counter) # 0.481986 seconds (2.62 M allocations: 154.576 MiB, 5.12% gc time)
counter.helper.contents # 905776
EDIT: the time in the comment above is without recompilation! After recompilation it was 0.201669 seconds (2.53 M allocations: 150.036 MiB, 10.77% gc time).

Quade test in R

I would like to perform a Quade test with more than one covariate in R. I know the quade.test command, and I have seen the example below:
## Conover (1999, p. 375f):
## Numbers of five brands of a new hand lotion sold in seven stores
## during one week.
y <- matrix(c( 5, 4, 7, 10, 12,
1, 3, 1, 0, 2,
16, 12, 22, 22, 35,
5, 4, 3, 5, 4,
10, 9, 7, 13, 10,
19, 18, 28, 37, 58,
10, 7, 6, 8, 7),
nrow = 7, byrow = TRUE,
dimnames =
list(Store = as.character(1:7),
Brand = LETTERS[1:5]))
y
quade.test(y)
My question is as follows: how could I introduce more than one covariate? In this example, the covariate is the Store variable.

Error in data.frame() arguments imply differing number of rows: 1, 11, 10, 3, 5, 4, 9, 2, 6, 7, 8, 12, 22, 13, 16, 14, 15, 19, 17, 20, 18, 28, 2

I am using this command in RStudio to split the data present in one column:
CTE.info <- data.frame(strsplit(as.character(CTE$V11),'|',fixed=TRUE))
But I am getting the error:
Error in data.frame("orderItems", "79542;2;24.000;24.000;5.310", "Credit;1;-15.000;-15.000;.000", :
arguments imply differing number of rows: 1, 11, 10, 3, 5, 4, 9, 2, 6, 7, 8, 12, 22, 13, 16, 14, 15, 19, 17, 20, 18, 28, 24
Could someone assist and let me know how this can be sorted out?
You can make the lengths of the list elements the same and it should work:
lst <- strsplit(as.character(CTE$V11),'|',fixed=TRUE)
d1 <- data.frame(lapply(lst, `length<-`, max(lengths(lst))))
colnames(d1) <- paste0('V', seq_along(d1))
data
CTE <- data.frame(V11= c('a|b|c', 'a|b', 'a|b|c|d'))

How to get same results of Wilcoxon sign rank test in R and SAS

R code:
x <- c(9, 5, 9 ,10, 13, 8, 8, 13, 18, 30)
y <- c(10, 6, 9, 8, 11, 4, 1, 3, 3, 10)
library(exactRankTests)
wilcox.exact(y,x, paired = TRUE, alternative = "two.sided")
The results: V = 3, p-value = 0.01562
SAS code:
data aaa;
set aaa;
diff=x-y;
run;
proc univariate;
var diff;
run;
The results: S=19.5 Pr >= |S| 0.0156
How can I get the statistic S in R?
If n <= 20, the exact P was the same in SAS and R, but if n > 20 the results were different:
x <- c(9, 5, 9 ,10, 13, 8, 8, 13, 18, 30,9, 5, 9 ,10, 13, 8, 8, 13, 18, 30,9,11,12,10)
y <- c(10, 6, 9, 8, 11, 4, 1, 3, 3, 10,10, 6, 9, 8, 11, 4, 1, 3, 3, 10,10,12,11,12)
wilcox.exact(y,x,paired=TRUE, alternative = "two.sided",exact = FALSE)
The results: V = 34, p-value = 0.002534
The SAS results:S=92.5 Pr >= |S| 0.0009
How can I get the same statistic S and P value in SAS and R? Thank you!
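A worked connection between the two statistics (my note, using the documented definitions): R's V is the sum of the positive ranks, W⁺, while SAS's signed rank statistic is
S = W⁺ − n(n+1)/4,
where n counts the nonzero differences. For the first dataset, one zero difference is dropped, so n = 9 and all ranks sum to n(n+1)/2 = 45; wilcox.exact(y, x, paired = TRUE) gives W⁺ = 3 for y − x, hence W⁺ = 45 − 3 = 42 for x − y and S = 42 − 22.5 = 19.5, matching SAS. For n > 20 the p-values differ because, with exact = FALSE, R uses a normal approximation, while SAS (per its documentation) switches to a Student-t approximation of the signed rank statistic.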
