How to round the results of a GLM predict in Julia - julia

I am trying to do a very simple Logistic regression in Julia. But Julia's typing system seems to be causing me problems. Basically, glm predict gives me an array of probabilities. I want to do a simple round so that if the probability >= 0.5, it is a 1, otherwise a 0. I would like those labels to also be integers.
No matter what I do, I can't convert the DataArray returned by predict to Int64. If I create an ad hoc DataArray, I can round it just fine, even though both show a type of DataArrays.DataArray{Float64,1}. I've also tried things like pred>0.5, but that fails similarly. Clearly there is some magic with the return value from predict, beyond the type, that makes it different from the other DataArray in my short program.
using DataFrames;
using GLM;
df = readtable("./data/titanic-dataset.csv");
delete!(df, :PassengerId);
delete!(df, :Name);
delete!(df, :Ticket);
delete!(df, :Cabin);
pool!(df, [:Sex]);
pool!(df, [:Embarked]);
df[isna.(df[:Age]),:Age] = median(df[ .~isna.(df[:Age]),:Age])
model = glm(@formula(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked), df, Binomial(), LogitLink());
pred = predict(model,df);
z = DataArray([1.0,2.0,3.0]);
println(typeof(z));
println(typeof(pred));
println(round.(Int64,z)); # Why does this work?
println(round.(Int64,pred)); # But this does not?
The output is:
DataArrays.DataArray{Float64,1}
DataArrays.DataArray{Float64,1}
[1, 2, 3]
MethodError: no method matching round(::Type{Int64}, ::DataArrays.NAtype)
Closest candidates are:
round(::Type{T<:Integer}, ::Integer) where T<:Integer at int.jl:408
round(::Type{T<:Integer}, ::Float16) where T<:Integer at float.jl:338
round(::Type{T<:Union{Signed, Unsigned}}, ::BigFloat) where T<:Union{Signed, Unsigned} at mpfr.jl:214
...
Stacktrace:
[1] macro expansion at C:\Users\JHeaton\.julia\v0.6\DataArrays\src\broadcast.jl:32 [inlined]
[2] macro expansion at .\cartesian.jl:64 [inlined]
[3] macro expansion at C:\Users\JHeaton\.julia\v0.6\DataArrays\src\broadcast.jl:111 [inlined]
[4] _broadcast!(::DataArrays.##116#117{Int64,Base.#round}, ::DataArrays.DataArray{Int64,1}, ::DataArrays.DataArray{Float64,1}) at C:\Users\JHeaton\.julia\v0.6\DataArrays\src\broadcast.jl:67
[5] broadcast!(::Function, ::DataArrays.DataArray{Int64,1}, ::Type{Int64}, ::DataArrays.DataArray{Float64,1}) at C:\Users\JHeaton\.julia\v0.6\DataArrays\src\broadcast.jl:169
[6] broadcast(::Function, ::Type{T} where T, ::DataArrays.DataArray{Float64,1}) at .\broadcast.jl:434
[7] include_string(::String, ::String) at .\loading.jl:515

You can't create integers when the array contains NAs. You can apply round. to it (in which case you'll get a DataArray of floats), but asking for Int64 directly fails because NA can't be an Int64; that is the round(::Type{Int64}, ::DataArrays.NAtype) MethodError shown above.
Instead, do:
convert(DataArray{Int}, round.(z))
Also, it is nicer to post an example using data available in a package rather than a local dataset on your computer.

Related

Julia's DifferentialEquations package fails when using fortran-wrapped right-hand-side

I'm trying to solve a system of ordinary differential equations with Julia's DifferentialEquations method. The right-hand-side of my ODEs is wrapped Fortran 90. Here is my Julia code:
using DifferentialEquations
function rhs(dNdt,N,p,t)
ccall((:__atmos_MOD_rhs, "./EvolveAtmFort.so"), Cvoid,(Ref{Float64}, Ref{Float64},Ref{Float64}),t,N,dNdt)
end
N0 = [0.0,298.9,0.0562,22.9,0.0166,35.96,0.0,0.0,0.0,0.0]*6.022e23
tspan = [0.0,1.0e6*365.0*24.0*60.0*60.0]
prob = ODEProblem(rhs,N0,tspan)
sol = solve(prob,Rodas5());
This produces the following long error, that has to do with calculating the derivative/jacobian of the right-hand-side. Below, I only include some portions of the Stacktrace that seem important.
MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.TimeGradientWrapper{ODEFunction{true,typeof(rhs),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Array{Float64,1},DiffEqBase.NullParameters},Float64},Float64,1})
Closest candidates are:
Float64(::Real, !Matched::RoundingMode) where T<:AbstractFloat at rounding.jl:200
Float64(::T) where T<:Number at boot.jl:715
Float64(!Matched::Int8) at float.jl:60
...
Stacktrace:
[1] convert(::Type{Float64},...
[2] Base.RefValue{Float64}...
[3] convert(::Type{Ref{Float64}}, ...
[4] cconvert(::Type{T} where T,
[5] rhs(::Array{ForwardDiff.Dual{ForwardDiff.Tag{DiffEqBase.TimeGradientWrapper{...
[6] (::ODEFunction{true,typeof(rhs),LinearAlgebra...
[7] (::DiffEqBase.TimeGradientWrapper{ODEFunction{true,typeof(rhs),LinearAlgebra...
[8] derivative!(::Array{Float64,1},...
[9] calc_tderivative!(::OrdinaryDiffEq...
[10] calc_rosenbrock_differentiation! at...
[etc...]
When I use a jacobian-free method, like Tsit5(), the integration works just fine. Only methods that require jacobian calculations fail. What am I doing wrong? How can I adjust my Fortran wrapper so that I can use implicit methods? Thanks!
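This is due to the automatic differentiation that some implicit solvers use by default: as the stack trace shows, rhs is called with ForwardDiff.Dual numbers, and ccall cannot convert those to Float64 for the Fortran routine. You'll need to turn that off, i.e. Rodas5(autodiff=false).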

Method dispatch when mixing S3 and S4

I'd like to understand the steps R goes through to find the appropriate function when mixing S3 and S4. Here's an example:
set.seed(1)
d <- data.frame(a=rep(c('a', 'b'), each=15),
b=rep(c('x', 'y', 'z'), times=5),
y=rnorm(30))
m <- lme4::lmer(y ~ b + (1|a), data=d)
l <- lsmeans::lsmeans(m, 'b')
multcomp::cld(l)
I don't fully understand what happens when the final line gets executed.
multcomp::cld prints UseMethod("cld"), so S3 method dispatch.
isS4(l) shows that l is an S4 class object.
It seems that, despite calling an S3 generic, the S3 dispatch system is completely ignored. Creating a function print.lsmobj <- function(obj) print('S3') (since class(l) is lsmobj) and running cld(l) does not print "S3".
showMethods(lsmobj) and showMethods(ref.grid) (the superclass) do not list anything that resembles a cld function.
Using debugonce(multcomp::cld) shows that the function that is called eventually is cld.ref.grid from lsmeans.
I was wondering, however, how to realise that cld.ref.grid will eventually be called without any "tricks" like debugonce. That is, what steps does R perform to get to cld.ref.grid?
In order for S3 methods to be registered, the generic has to be available. Here, I write a simple foo method for merMod objects:
> library(lme4)
> foo.merMod = function(object, ...) { "foo" }
> showMethods(class = "merMod")
Function ".DollarNames":
<not an S4 generic function>
Function "complete":
<not an S4 generic function>
Function "formals<-":
<not an S4 generic function>
Function "functions":
<not an S4 generic function>
Function: getL (package lme4)
x="merMod"
Function "prompt":
<not an S4 generic function>
Function: show (package methods)
object="merMod"
> methods(class = "merMod")
[1] anova as.function coef confint cooks.distance
[6] deviance df.residual drop1 extractAIC family
[11] fitted fixef formula getL getME
[16] hatvalues influence isGLMM isLMM isNLMM
[21] isREML logLik model.frame model.matrix ngrps
[26] nobs plot predict print profile
[31] ranef refit refitML rePCA residuals
[36] rstudent show sigma simulate summary
[41] terms update VarCorr vcov weights
Neither list includes foo. But if we define the generic, then it shows up in methods() results:
> foo = function(object, ...) UseMethod("foo")
> methods(class = "merMod")
[1] anova as.function coef confint cooks.distance
[6] deviance df.residual drop1 extractAIC family
[11] fitted fixef foo formula getL
[16] getME hatvalues influence isGLMM isLMM
[21] isNLMM isREML logLik model.frame model.matrix
[26] ngrps nobs plot predict print
[31] profile ranef refit refitML rePCA
[36] residuals rstudent show sigma simulate
[41] summary terms update VarCorr vcov
[46] weights
Now it includes foo.
Similarly, in your example, methods() will reveal the existence of cld if you do library(multcomp), because that is where the generic for cld sits.
The older R documentation (pre-2016) contained more detail than the current documentation, but roughly speaking, the process is as follows, in descending order of priority:
1) if the function is a standard S4 generic and any of the arguments in the signature are S4 (according to isS4), then the best S4 method is chosen according to the usual rules.
2) if the function is a nonstandard S4 generic, then its body is executed, which at some point calls S4 dispatch itself.
3) if the function is an S3 generic function, then S3 dispatch takes place on the first argument (except for internal generic binary operators).
4) if the function isn't a generic at all, then it is evaluated in the usual way with lazy evaluation for all its arguments.
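As a minimal sketch of step 3 with toy classes (Base, Derived and describe are hypothetical names, not part of lsmeans or multcomp): an S3 generic called on an S4 object dispatches on the object's S4 class and its superclasses, which appears to be how cld(l) ends up in cld.ref.grid.

```r
library(methods)

# Hypothetical S4 classes standing in for ref.grid (parent) and lsmobj (child)
setClass("Base", representation(x = "numeric"))
setClass("Derived", contains = "Base")

# Ordinary S3 generic, playing the role of multcomp::cld
describe <- function(object, ...) UseMethod("describe")

# S3 method named after the S4 superclass, playing the role of cld.ref.grid
describe.Base <- function(object, ...) "S3 method for Base"

obj <- new("Derived", x = 1)
isS4(obj)                # TRUE: obj is an S4 object
result <- describe(obj)  # S3 dispatch walks Derived, then Base
result                   # "S3 method for Base"
```

Note that an S3 method has to be named generic.class, so a print.lsmobj function could only be reached through a call to print(), never through cld().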
Note that from the help page from setGeneric:
"Functions that dispatch S3 methods by calling UseMethod are ordinary functions, not objects from the "genericFunction" class. They are made generic like any other function, but some special considerations apply to ensure that S4 and S3 method dispatch is consistent (see Methods_for_S3)."

Random Forest in R

If x is a Random Forest in R, for example,
x <- cforest (y~ a+b+c, data = football),
what does x[[9]] mean?
You can't subset this object, so in some sense x[[9]] is nothing; it is not accessible as such.
x is an object of S4 class "RandomForest-class". This class is documented on help page ?'RandomForest-class'. The slots of this object are named and described there. You can also get the slot names via slotNames()
library("party")
foo <- cforest(ME ~ ., data = mammoexp, control = cforest_unbiased(ntree = 50))
> slotNames(foo)
[1] "ensemble" "where" "weights"
[4] "initweights" "data" "responses"
[7] "cond_distr_response" "predict_response" "prediction_weights"
[10] "get_where" "update"
If by x[[9]] you meant the 9th slot, then that is prediction_weights and ?'RandomForest-class' tells us that this is
‘prediction_weights’: a function for extracting weights from
terminal nodes.
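A minimal sketch with a toy S4 class (Toy and its slots are made-up names, not the party classes), showing that slots are reached by name with @ or slot(), while [[ signals an error:

```r
library(methods)

# Hypothetical class with two slots; real RandomForest objects have eleven
setClass("Toy", representation(ensemble = "list", weights = "numeric"))
toy <- new("Toy", ensemble = list(), weights = c(0.2, 0.8))

slot_names <- slotNames(toy)         # "ensemble" "weights"
by_at   <- toy@weights               # access a slot by name
by_slot <- slot(toy, slot_names[2])  # same slot, found via its position in slotNames()

# [[ is not defined for this class, so the call raises an error
subset_fails <- inherits(tryCatch(toy[[2]], error = function(e) e), "error")
subset_fails                         # TRUE
```

For the forest above, the equivalent would presumably be foo@prediction_weights or slot(foo, slotNames(foo)[9]).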

What is the datatype of "formula" for the e1071 package in R?

I am learning R's e1071 package to perform naive Bayes analysis. According to this tutorial, the package's naiveBayes method takes an input called "formula" -- which it says is "A formula of the form class ~ x1 + x2 + .."
How do I create such a formula? I have a dataset with gender, job and income columns and want to perform analysis on each of those dimensions/factors. Do I need to somehow turn them into a formula object? (I'm pretty new to R so I am unclear if R even supports specific data types like formula).
Just typing ~ x1 + x2 will create an object of the class formula.
Look at ?lm for examples of the basic idea. The domain-specific language is pretty flexible, so different models use it in different ways.
For instance:
dat <- data.frame(x=runif(10), y=runif(10))
lm( y ~ x, data=dat)
f <- y ~ x
class(f)
lm( f, data=dat )
Formulas are created by the use of the ~ function. It can be used either as a prefix operator or an infix operator:
form <- ~ atom
form <- atomA ~ atomB
is.function(`~`)
[1] TRUE
class(form[[2]])
#[1] "name"
is.language(form)
#[1] TRUE
is.language(form[[2]])
#[1] TRUE
Just like functions, formula objects get created with an environment:
str(form)
#Class 'formula' length 3 atomA ~ atomB
# ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
Different functions have different assumptions about which form (prefix or infix) to use. You can extract the components by positional number using list indexing. The first element will always be the tilde operator.
form[[1]]
# returns: `~`
form[[2]]
# atomA
form[[3]]
# atomB
You can demote a formula to a character object with as.character or promote a character object to a formula with as.formula. The formula function does not create formulas, but is rather a generic function that has specific behavior with different classed objects, generally serving to extract a formula from a regression object.
methods(formula)
#-----------
[1] formula.character*
[2] formula.data.frame*
[3] formula.default*
[4] formula.formula*
[5] formula.glm*
[6] formula.lm*
[7] formula.nls*
[8] formula.quantmod*
[9] formula.summary.formula.cross
[10] formula.terms*
Your results for methods(formula) will differ depending on which packages you have loaded at the time.
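For the asker's case, a sketch assuming a data frame with columns named gender, job and income (the column names come from the question; treating income as the response is an assumption made only for illustration):

```r
predictors <- c("gender", "job")

# Promote a character string to a formula with as.formula
f <- as.formula(paste("income ~", paste(predictors, collapse = " + ")))
class(f)       # "formula"
deparse(f)     # "income ~ gender + job"
all.vars(f)    # "income" "gender" "job"

# reformulate() builds the same kind of object from a character vector
f2 <- reformulate(predictors, response = "income")
all.vars(f2)   # "income" "gender" "job"
```

Either object can then be passed as the formula argument, together with data=, to a modelling function.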

How to retrieve value of a function in R

When calling a function in R, how can I retrieve the result values? For example, I used the 'roc' function and I need to extract the AUC value and the CI (0.6693 and 0.6196-0.7191 respectively in the following example).
> roc(tmpData[,lenCnames], fitted(model), ci=TRUE)
Call:
roc.default(response = tmpData[, lenCnames], predictor = fitted(model), ci = TRUE)
Data: fitted(model) in 127 controls (tmpData[, lenCnames] 0) < 3248 cases (tmpData[, lenCnames] 1).
Area under the curve: 0.6693
95% CI: 0.6196-0.7191 (DeLong)
I can use the following to fetch these values with associated texts.
> z$auc
Area under the curve: 0.6693
> z$ci
95% CI: 0.6196-0.7191 (DeLong)
Is there a way to get only the values and not the text?
I do know how to get these using 'regular expression' or the 'strsplit' function, but I suspect there should be some other way to directly access these values.
It's helpful to use reproducible examples when asking a question. Also best to refer to the library you're asking about ("pROC"), since it is not loaded with base R. pROC has functions that extract auc and ci.auc objects from the roc object.
>library("pROC")
>data(aSAH)
# Basic example
>z <- roc(aSAH$outcome, aSAH$s100b,
levels=c("Good", "Poor"))
# Examining the class of 'auc' output shows us that it is also of class 'numeric'
> class(auc(z))
[1] "auc" "numeric"
# calling 'as.numeric' will extract the value
> as.numeric(auc(z))
[1] 0.7313686
# calling 'as.numeric' on the 'ci.auc' object extracts three values.
> as.numeric(ci(z))
[1] 0.6301182 0.7313686 0.8326189
# The ones we want are 1 and 3
> as.numeric(ci(z))[c(1,3)]
[1] 0.6301182 0.8326189
Using the functions str, class, and attributes will often help you figure out how to get what you want out of an object.
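A sketch of that kind of inspection on a stand-in object (auc_like is built by hand here to mimic the class vector shown above; it is not a real pROC object, and the note attribute is hypothetical):

```r
# Hand-made stand-in: a number carrying a class vector and one extra attribute
auc_like <- structure(0.7313686, class = c("auc", "numeric"), note = "stand-in")

class(auc_like)                # "auc" "numeric"
names(attributes(auc_like))    # which attributes travel with the value
str(unclass(auc_like))         # the underlying data without the class

value <- as.numeric(auc_like)  # plain number, attributes dropped
value
```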
