lmRob() on mostly linear data: errors and defense - r

I am working with a legacy R script using robust::lmRob().
The documentation for lmRob() is quite clear, and the example runs fine (for docs and example, see here).
Based on that, I would have thought this simple script would work, but it fails with 'msg.UCV' not found:
library(robust)
xx = c(c(2.1111,3.1111,4.1111),seq(1,7,by=0.3))
yy = 0.5+1.5*xx
yy[1] = -21; yy[2] = 0; yy[3] = -10
df = data.frame(x=xx,y=yy)
mf = lmRob(y~x, data=df)
Note that the data is 21 exactly linear rows, plus three outliers.
If instead one uses
yy = 0.5+1.5*xx + 0.01*xx^2
then no errors arise.
I would switch to robustbase::lmrob() but lmRob.fit.compute() seems to do some fairly nontrivial things.
What is a good defensive programming technique to prevent near-linear data from causing errors in my program?
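One common defensive pattern, sketched here under the assumption that an ordinary least-squares fit is an acceptable fallback when the robust fit errors out, is to wrap the call in tryCatch(). The helper name try_fits and the always-failing stand-in are hypothetical; with the robust package loaded, the primary closure would call lmRob(y ~ x, data = df) instead. This only contains the failure, it does not explain why lmRob() fails on exactly linear data.

```r
# Hypothetical helper: run a primary fit and, if it signals an error,
# report the message and run a fallback fit instead.
try_fits <- function(primary, fallback) {
  tryCatch(
    primary(),
    error = function(e) {
      message("Primary fit failed (", conditionMessage(e), "); using fallback.")
      fallback()
    }
  )
}

# Data from the question: 21 exactly linear rows plus three outliers.
xx <- c(2.1111, 3.1111, 4.1111, seq(1, 7, by = 0.3))
yy <- 0.5 + 1.5 * xx
yy[1:3] <- c(-21, 0, -10)
df <- data.frame(x = xx, y = yy)

# Stand-in for a fitter that errors, so the sketch runs without extra packages.
# With library(robust) loaded, use: function() lmRob(y ~ x, data = df)
failing_fit <- function() stop("simulated fitting error")

fit <- try_fits(failing_fit, function() lm(y ~ x, data = df))
class(fit)
```

Passing the fits as zero-argument functions keeps the helper independent of any particular model interface; whether a silent switch to lm() is acceptable depends on how much the downstream code relies on robustness.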

Related

ksmooth function doesn't work with parameters via ellipsis

I am currently working with R due to a course at university, so I am still quite inexperienced.
We use R for exploratory data analysis. In a data analysis we are supposed to apply different regression models to the data and generate the same plots for each. Additionally, we are supposed to play a bit with the parameters for learning purposes. To avoid copy-pasting the same code 10-20 times, I wrote a function that takes the regression function as one argument and the parameters for it as an ellipsis (...). Inside this function I call the passed function with the ellipsis as its arguments.
library("astsa")
data_glob <- globtemp
plot.data.and.reg <- function(data, reg.func, ...){
  model <- reg.func(...)
  par(mfrow = c(1, 2))
  plot(data)
  abline(model, col = "orange", lwd = 3)
  qqnorm(data)
}
This works for the simple lm function, but unfortunately not for the ksmooth function.
When I pass this function I get the error message: "numeric y must be supplied. For density estimation use density()".
plot.data.and.reg(
  data_glob,
  lm,
  list(
    formula = as.formula("data_glob ~ time(data_glob)"),
    data = data_glob
  )
)
plot.data.and.reg(
  data_glob,
  ksmooth,
  list(
    x = as.numeric(time(data_glob)),
    y = as.numeric(data_glob),
    kernel = "box",
    bandwidth = 0.25
  )
)
I then looked at the source code of ksmooth. It shows that this error message is raised when the check missing(y) is TRUE. Apparently the problem is that I passed the parameters through the ellipsis as a single list, and the list is not "unpacked" into individual arguments.
For simplicity, I wrote a dummy function to test if I can add this "unpack" myself.
test.wrapper <- function(func, ...){
  func(...)
}
test <- function(x, y){
  match.call()
  if(missing(y))
    print("Unfortunately I was right")
  print(x)
  print(y)
}
test.wrapper(test, list(x = 10, y = 20))
Unfortunately I have not found a solution yet.
In Python, a dictionary can be unpacked into keyword arguments (kwargs) with the ** operator. Is there an equivalent in R? Or how can I make sure in R that the parameters from the ellipsis are used correctly?
Since it worked with the lm function without errors, I also looked at its source code. Unfortunately, with my little experience in R, I can't see exactly where the essential difference is.
Overall, I would attribute the error to the fact that the ksmooth function is not yet designed for use with an ellipsis, but I am not sure. How would I need to adjust the ksmooth code to make it work with ...?
(For my Uni task, I will resort to the copy-paste (anti) pattern if in doubt. After searching for so long, I would still be interested in the solution and it may be useful in the future).
Thanks a lot for your help!
The closest equivalent of the */** splat in Python is the do.call function.
However, you don’t need this here. The actual issue is that you’re passing the extra arguments as a list rather than individually. Once you flatten the list, it works [1]:
plot.data.and.reg(
data_glob,
ksmooth,
x = as.numeric(time(data_glob)),
y = as.numeric(data_glob),
kernel = "box",
bandwidth = 0.25
)
I’m actually surprised that it works with a list for lm; that’s not intentional, it’s essentially an accident caused by how lm is currently implemented.
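For reference, the do.call route mentioned at the start of this answer could look like the following sketch. The names add_xy and call_with_dots are made up for illustration; add_xy merely imitates the missing(y) check in ksmooth.

```r
# Toy function that, like ksmooth(), complains when y is missing.
add_xy <- function(x, y) {
  if (missing(y)) stop("y must be supplied")
  x + y
}

# Wrapper that forwards its extra arguments unchanged.
call_with_dots <- function(func, ...) func(...)

args <- list(x = 10, y = 20)

# Passing the list as one argument binds the whole list to x and leaves y missing.
failed <- tryCatch(call_with_dots(add_xy, args),
                   error = function(e) conditionMessage(e))

# Passing the arguments individually through ... works.
direct <- call_with_dots(add_xy, x = 10, y = 20)

# do.call() spreads a list into individual arguments, much like ** in Python.
splatted <- do.call(add_xy, args)

# The two can be combined: the function first, then the list elements.
combined <- do.call(call_with_dots, c(list(add_xy), args))
```

Here failed holds the error message, while direct, splatted and combined all evaluate to 30.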
[1] I say it “works” because there’s no error and it plots something, but with your example data there’s no visible regression line (abline is inappropriate for the output of ksmooth), and the smoothing parameters do nothing: the result is identical to the unsmoothed input.
To get this to work, use lines instead of abline. And as for the smoothing, for your example data a bandwidth of 10 works fine.
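A minimal sketch of the lines-based plot follows. It uses the built-in Nile series as a stand-in for globtemp so that it runs without the astsa package; that substitution is an assumption of this example, not part of the question. Both series are yearly, which is presumably why a bandwidth of 0.25 leaves every point with only itself inside the box kernel.

```r
# Nile (from the built-in datasets package) stands in for globtemp here.
series <- Nile
x <- as.numeric(time(series))
y <- as.numeric(series)

# Box kernel with a bandwidth of 10, measured in units of x (years).
fit <- ksmooth(x, y, kernel = "box", bandwidth = 10)

plot(x, y, type = "l", xlab = "Year", ylab = "Value")
# ksmooth() returns a list with components x and y, so draw it with lines().
lines(fit$x, fit$y, col = "orange", lwd = 3)
```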

r Nomad categorical optimisation (snomadr)

I am trying to use the Nomad technique for blackbox optimisation from the crs package (C implementation), which is called via the snomadr function. The method works for purely numerical optimisation, but errors when categorical features are included. However, categorical optimisation is not well documented, so I am struggling to see where I am going wrong. Reproducible code below:
library(crs)
library(randomForest)
Illustrating this on randomForest & the iris dataset.
Creating the randomForest model (leaving the last row out as starting points for the optimizer)
rfIris <- randomForest(x=iris[-150,-c(1)], y=unlist(iris[-150,1]))
The objective function (functions we want to optimize)
objFn <- function(x0,model){
preds <- predict(object = model, newdata = x0)
as.numeric(preds)
}
Test to see if the objective function works (should return ~6.37)
objOut <- objFn(x0=unlist(iris[150,-c(1)]),model = rfIris)
Creating initial conditions, options list, and upper/lower bounds for Nomad
x0 <- iris[150,-c(1)]
x0 <- unlist(x0)
options <- list("MAX_BB_EVAL"=10000,
"MIN_MESH_SIZE"=0.001,
"INITIAL_MESH_SIZE"=1,
"MIN_POLL_SIZE"=0.001,
"NEIGHBORS_EXE" = c(1,2,3),
"EXTENDED_POLL_ENABLED" = 'yes',
"EXTENDED_POLL_TRIGGER" = 'r0.01',
"VNS_SEARCH" = '1')
up <- c(10,10,10,10)
low <- c(0,0,0,0)
Calling the optimizer
opt <- snomadr(eval.f = objFn, n = 4, bbin = c(0,0,0,2), bbout = 0, x0= x0 ,model = rfIris, opts=options,
ub = up, lb = low)
and I get an error about the NEIGHBORS_EXE parameter in the options list. It seems as if I need to supply NEIGHBORS_EXE with a file corresponding to a set of 'extended poll' coordinates; however, it is not clear what exactly these are.
The method works by setting "EXTENDED_POLL_ENABLED" = 'no' in the options list, as it then ignores the categorical variables and defaults to numerical optimisation, but this is not what I want.
I also managed to pull up some additional information for NEIGHBORS_EXE using
snomadr(information=list("help"="-h NEIGHBORS_EXE"))
and again, do not understand what the 'neighbours.exe' is meant to be.
Any help would be much appreciated!
This is the response from Zhenghua who coded the R interface:
The issue is that he did not configure the parameter “NEIGHBORS_EXE” properly. He needs to prepare an executable file that defines the neighbors, put it in the folder from which R is called, and then set the parameter “NEIGHBORS_EXE” to the executable's file name.
You can contact us at nomad@gerad.ca if you wish to continue the discussion.
For details on the NEIGHBORS_EXE parameter, refer to section 7.1 of the NOMAD user guide:
https://www.gerad.ca/nomad/Downloads/user_guide.pdf

bnlearn package: unexpected cpdist (prediction) behaviour

I encounter a problem that goes beyond my understanding.
I have made a simple reproducible example for you to test it out.
Basically I create a Bayesian network with two strongly correlated variables that are linked together. One would expect that if one of them is high, the other one should also be (Since they are directly linked).
library(bnlearn)
Learning.set4 = cbind(c(1,2,1,8,9,9),c(2,0,1,10,10,10))
Learning.set4 = as.data.frame(Learning.set4)
colnames(Learning.set4) = c("Cause","Cons")
b.network = empty.graph(colnames(Learning.set4))
struct.mat = matrix(0,2,2)
colnames(struct.mat) = colnames(Learning.set4)
rownames(struct.mat) = colnames(struct.mat)
struct.mat[1,2] = 1
bnlearn::amat(b.network) = struct.mat
haha = bn.fit(b.network,Learning.set4)
# Here we get a mean that is close to 10
seems_logic_to_me=cpdist(haha, nodes="Cons",
evidence=list("Cause"=10), method="lw")
# Here I get a mean that is close to 5, so a high value
# of Cons wouldn't mean anything for Cause?
very_low_cause_values = cpdist(haha, nodes="Cause",
evidence=list("Cons"=10), method="lw")
Could anyone enlighten me here on the reason why it doesn't work with the lw method? (You can try with ls and it seems to work fine).
lw stands for likelihood weighting
UPDATE:
Got the solution from the maintainer.
Adding the following at the end will print the expected prediction:
print (sum(very_low_cause_values[, 1] * attr(very_low_cause_values, "weights")) / sum(attr(very_low_cause_values, "weights")))
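The expression above is a weighted mean of the samples. The reason a plain mean is misleading can be illustrated with a base-R imitation of likelihood weighting; this is a sketch of the general technique under a simple Gaussian model, not bnlearn's implementation. Since Cause has no parents, it is drawn from its marginal distribution, and the evidence Cons = 10 enters only through the weights.

```r
set.seed(1)

# Data from the question.
cause <- c(1, 2, 1, 8, 9, 9)
cons  <- c(2, 0, 1, 10, 10, 10)

# Gaussian model for Cause -> Cons: a marginal for Cause, a regression for Cons.
reg   <- lm(cons ~ cause)
beta  <- unname(coef(reg))
sigma <- summary(reg)$sigma

# Imitation of likelihood weighting with evidence Cons = 10:
# draw Cause from its marginal, then weight each draw by the density of the
# evidence given that draw.
n     <- 10000
draws <- rnorm(n, mean = mean(cause), sd = sd(cause))
w     <- dnorm(10, mean = beta[1] + beta[2] * draws, sd = sigma)

unweighted <- mean(draws)              # ignores the evidence, stays near 5
weighted   <- weighted.mean(draws, w)  # same as sum(draws * w) / sum(w)
```

The unweighted mean stays close to the marginal mean of Cause, while the weighted mean moves towards the high values that are consistent with Cons = 10.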

Export Linear Mixed Effects Model Outputs in csv using Julia Language

I am new to Julia programming language, however, I am fitting a Linear Mixed Effects Model and I find it difficult to save the fixed and random effects estimates in .csv files.
Example code:
using MixedModels
@time modelOutput = fit(lmm(Y ~ A + B + (0 + A | group), data))
There is a reference on how to obtain the fixed (fixef(modelOutput)) and random (ranef(modelOutput)) effects, but I run into errors when I try to put them into a DataFrame.
Any advice is appreciated.
Okay, I actually took the time to do this for you. A CoefTable is a type defined in statmodels here. Given this information, we can extract the relevant information from the CoefTable instance as follows:
ct = coeftable(modelOutput)  # CoefTable for the fitted model from the question
df = DataFrame(variable = ct.rownms,
               Estimate = ct.mat[:,1],
               StdError = ct.mat[:,2],
               z_val = ct.mat[:,3])
This will give an nvar-by-4 DataFrame which you can then write to csv as described earlier using writetable("output.csv",df)
I had a number of problems getting the accepted answer to work; Julia has evolved a lot since then. I rewrote it based primarily on code from the jglmm R package, with some adaptation/cobbling-together from other sources ...
using CSV, DataFrames, MixedModels
"""
    outfun(m, outfn="output.csv")
Write the coefficient table of a fitted model to a file.
"""
function outfun(m, outfn="output.csv")
    ct = coeftable(m)                               # coefficient table of the fitted model
    coef_df = DataFrame(ct.cols)
    rename!(coef_df, ct.colnms, makeunique = true)  # use the table's column names
    coef_df[!, :term] = ct.rownms                   # add the term names as a column
    CSV.write(outfn, coef_df)
end

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a general linear model on some data and, when I run it through the confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double and triple checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two values in the confusionMatrix are disagreeing. I've run almost the exact same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the confusionMatrix predict. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure as I don't know your data structure, but I wonder if this is due to the way you set up your modelFit, using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
Which specifies y = training2$hold1yes0no and x = trainPC2.
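A possible cause of the five missing rows, not confirmed by the question or by the answer above: caret's predict() method for train objects has an na.action argument that defaults to na.omit, so rows of the new data that contain missing values would be dropped from the predictions. A quick check, sketched here on a made-up data frame (new_data is hypothetical; in the question it would be trainTR2), is to count the incomplete rows with complete.cases().

```r
# Made-up stand-in for the preprocessed test set.
new_data <- data.frame(
  a = c(1.2, NA, 3.1, 4.8, 5.0),
  b = c(0.4, 0.9, NA, 1.7, 2.2)
)

# Rows with at least one missing value.
incomplete_rows <- which(!complete.cases(new_data))
n_incomplete <- length(incomplete_rows)

# What na.omit would leave behind.
kept <- na.omit(new_data)
n_kept <- nrow(kept)
```

If the number of incomplete rows in the real test set equals the gap between 1543 and 1538, missing values after the Box-Cox step would be the first thing to look at.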

Resources