What type should the returned scores from an R scoring script? - azure-machine-learning-studio

I am attempting to develop an Azure ML experiment that uses R to perform predictions of a continuous response variable. The initial experiment is relatively simple, incorporating only a few experiment items, including "Create R Model", "Train Model" and "Score Model", along with some data input.
I have written a training script and a scoring script, both of which appear to execute without errors when I run the experiment within ML Studio. However, when I examine the scored dataset, the score values are all missing values. So I am concerned that my scoring script could be returning scores incorrectly. Can anyone advise what type I should be returning? Is it meant to be a single column data.frame, or something else?
It is also possible that my scores are not being properly calculated within the scoring script, although I have run the training and scoring scripts within R Studio, which shows the expected results. It would also be helpful if someone could suggest how to perform debugging of my scoring script in some way, so that I could determine whereabouts the code is failing to behave as expected.
Thanks, Paul

Try using this sample and compare with yours - https://gallery.cortanaintelligence.com/Experiment/Compare-Sample-5-in-R-vs-Azure-ML-1

My suggestion is do data preprocessing before you do the data input. Clear the missing values and outliers. Use relevant data preprocessing techniques to perform those operations.

Related

How can I get My.stepwise.glm to return the model outside the console?

I asked this question on RCommunity but haven't had anyone bite... so I'm here!
My current project involves me predicting whether some trees will survive given future climate change scenarios. Against better judgement (like using Maxent) I've decided to pursue this with a GLM, which requires presence and absence data. Everytime I generate my absence data (as I was only given presence data) using randomPoints from dismo, the resulting GLM model has different significant variables. I found a package called My.stepwise that has a My.stepwise.glm function (here: My.stepwise.glm: Stepwise Variable Selection Procedure for Generalized Linear... in My.stepwise: Stepwise Variable Selection Procedures for Regression Analysis) , and this goes through a forward/backward selection process to find the best variables and returns a model ready for you.
My problem is that I don't want to run My.stepwise.glm just once and use the model it spits out for me. I'd like to run it roughly 100 times with different pseudo-absence data and see which variables it returns, then take the most frequent variables and move forward with building my model using those. The issue is that the My.stepwise.glm function ends by 'print(summary(initial.model))' and I would like to be able to access the output similar to how step() returns a list, where you can then say 'step$coefficients' and have the function coefficients return as numerics. Can anyone help me with this?

Comparison of good vs bad dataset using R

Stuck in a problem. There are two datasets A and B. Say they're datasets of two factories. Factory A is performing really well whereas Factory B is not. I have the data-set of Factory A (data being output from the manufacturing units) as well as Factory B, both having the same variables. How can I identify the problematic variable in Factory B which needs to be fixed so that Factory B starts performing well too? Therefore, I need to identify the problematic variable which needs immediate attention.
Looking forward to your response.
p.s: coding language being used is R
Well this is shameless plug for the dataMaid package which I helped write and which sort of does what you are asking. The idea of the dataMaid package is to run a battery of tests on the variables in a data frame and produce a report that a human investigator (preferably someone with knowledge about the context) can look through in order to identify potential problems.
A super simple way to get started is to load the package and use the
clean function on a data frame (if you try to clean the same data
frame several times then it may be necessary to add the replace=TRUE
argument to overwrite the existing report).
devtools::install_github("ekstroem/dataMaid")
library(dataMaid)
data(trees)
clean(trees)
This will create a report with summaries and error checks for each
variable in the trees data frame. A summary of all the variables is provided and for the trees data it looks like this
while the information from each variable may look like this
Here we get a status about the variable type, summary statistics, a plot and - in this case - an indicator that there might be a problem with outliers.
The dataMaid package can also be used interactively by running checks for the individual variables or for all variables in the dataset
data(toyData)
check(toyData$var2) # Individual check of var2
check(toyData) # Check all variables at once
By default the standard battery of tests is run depending on the
variable type, but it is possible to extend the package by providing your own checks.
In your case I would run the package on both datasets to get two reports, and any major differences in those would raise a flag about what could be problematic.

R Neural Network Forecasting - Aspect of Randomness?

I have been trying different methods of forecasting and stumbled upon the
nnetar()
function in the forecast package of R. I soon quickly realized that while this does work to forecast, it gives me something different every time I run it. Could anybody help to explain why this happens? I thought I had a decent understanding of neural nets and I don't see what could make drastic differences in forecasts, unless the nnetar() function randomly selects the number of nodes or something. Any help?
20, by default, networks are trained with random starting values and then their predictions are averaged when you use the function.
Because the function uses random starting values for each run, the forecasts will be different for each call too.
EDIT: new question from OP in the comments
In order to control the function and get the same random starting values each time, you can simple use the function set.seed() with the value of your choice.
For example:
set.seed(666)
forecast(nnetar(...),...)
set.seed(666)
forecast(nnetar(...),...)
set.seed(666)
forecast(nnetar(...),...)
will give the same results every time you run it with this "seed" value (666). You have to run set.seed(666) before every run of the rest of you code of course.
EDIT 2: new new question from OP in the comments
In order to have 100 different networks to fit with random starting weights:
nnetar(...,repeats=100,...)

R - How to get one "summary" prediction map instead for 5 when using 5-fold cross-validation in maxent model?

I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) function in R, through the dismo package. I have used the argument "replicates = 5" which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics will be made, including the prediction maps. In R, an html file is also made, but the prediction maps have to be extracted afterwards, using the function "predict" in the dismo package in r. When I do this, I get 5 maps, due to the 5-fold cross-validation setting. However, (and this is the problem) I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that results from the cross-validation?
I hope someone can help me with this, I am really stuck and haven't found any answers anywhere on the internet. Not even a discussion about this. Hope my question is clear. This is the R-script that I use:
model1<-maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map<-predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin
Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.
I may have found this a couple of years later. But you could do something like this:
xm <- maxent(predictors, pres_train) # basically the maxent model
px <- predict(predictors, xm, ext=ext, progress= '' ) #prediction
px2 <- predict(predictors, xm2, ext=ext, progress= '' ) #prediction #02
models <- stack(px,px2) # create a stack of prediction from all the models
final_map <- mean(px,px2) # Take a mean of all the prediction
plot(final_map) #plot the averaged map
xm1,xm2,.. would be the maxent models for each partitions in cross-validation, and px, px2,.. would be the predicted maps.

Cluster analysis in R: How can I get deterministic results from pvclust?

pvclust is great for cluster analysis in R. However, when running it as part of a batch operation, it is annoying to get different results for the same data. Obviously, there are many "correct" clusterings of the same data, and it seems that pvclust uses some randomness to determine the clusters of a specific run. But is there any way to get deterministic results?
I want to be able to present a minimal, repeatable analysis package: the data plus an R script, and a separate written document that contains my interpretations of the clustering. It is then possible for others to add to the analysis, e.g. by changing the aesthetic appearance of plots. Now, the interpretations will always be out of sync with what someone else gets when they run the script containing pvclust.
Not only for cluster analysis, but when there is randomness involved, you can fix the random number generator so you always get the same results.
Try:
set.seed(seed=123)
# your code here
The seed can be any integer, or something that can be converted to integer. And that's all.
i've only used k means. There I had to set the number of 'runs' or iterations to a higher value than default to get the same custers at consecutive runs.

Resources