I have created a regression model and want to update it with some new data using the Prediction.update API, for machine learning purposes.
The Java API doc for the Update class's setOutput function is not very clear to me. It describes the output value as "regression or class label". Does that mean that for a regression model I should call setOutput("regression")?
Also, how can I verify that my update was successful? Thanks.
Call prediction.trainedmodels.get and check modelInfo/numberInstances: has the number increased by 1?
It should be the output value of your training instance.
For example, if you have a movie rating instance like:
Rating Genre Title
10 Crime Godfather
The output should be 10 and the rest should go in the csvInstance.
I'm trying to make a table for my STM model just like this.
I am new to the R programming language and to STM.
I have searched the documentation and cannot tell
whether there is a function that produces this format
or whether I have to build it manually.
Can I get an example of how to make a table like this, and where can I get
the Topic Proportions in % and whether the topic has Appeared in Literate or not?
As for expected topic proportion tables, the STM package gives you a few options. For example, once you've fitted your stm topic model, you can call plot.STM(mod, type = "summary", main = "title"), where mod is your stored model object, type = "summary" is the default and shows the expected proportion of the text corpus that belongs to each topic (STM package, Roberts et al., 2018), and main = "title" is an optional argument for labelling the figure. Alternatively, plot(mod, type = "summary") is equivalent, since plot() dispatches to plot.STM for stm model objects.
As for extracting the exact % expected proportion by topic, I've had a similar question myself. Below is a custom workaround that I've recently implemented using the make.dt function from STM which outputs topic proportions by document.
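# custom proportion extraction
library(stm)    # make.dt()
library(dplyr)  # summarize_all()
proportions_table <- make.dt(mod)       # one row per document, one column per topic
summarize_all(proportions_table, mean)  # column means = expected topic proportions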
Yes, you need to make it manually. Topic labels are manually defined. You can find theta which is the topic proportions matrix from the STM output.
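To make the manual route concrete, here is a minimal sketch of building a proportions table from theta in base R. The small theta matrix and the topic labels are made-up placeholders so that the snippet runs on its own; with a fitted model you would presumably use mod$theta and your own labels instead.

```r
# Hypothetical stand-in for mod$theta: rows are documents, columns are topics,
# and each row sums to 1. With a fitted model, use `theta <- mod$theta`.
theta <- matrix(
  c(0.70, 0.20, 0.10,
    0.10, 0.60, 0.30,
    0.40, 0.40, 0.20),
  nrow = 3, byrow = TRUE
)

# Manually chosen labels, one per topic (placeholders).
topic_labels <- c("Topic 1", "Topic 2", "Topic 3")

# Expected topic proportion = mean of each column of theta, as a percentage.
proportions <- data.frame(
  Topic = topic_labels,
  Proportion_pct = round(100 * colMeans(theta), 1)
)

# Sort so the most prevalent topics come first.
proportions <- proportions[order(-proportions$Proportion_pct), ]
print(proportions)
```

Any extra column, such as whether a topic appears in the literature, would have to be added by hand, since the model has no way of knowing it.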
I used the nowcast function from the nowcasting R package to fit a dynamic factor model and nowcast GDP using the extracted factors. I have tried multiple combinations of the initial variables and finally obtained this model, in which all variables seem significant and the values obtained for my variable of interest are acceptable.
But I can't find any reference on which residual tests I need to run in order to validate this model.
I am really struggling and have been stuck on this for a month. I need to submit my graduation project this weekend and I really need this model to work, so any help will be very much appreciated. Thank you.
Update 1:
This is the ACF plot of the residuals suggested by the same nowcasting package. I think my model passes that test and therefore I can use it, right?
[Image: ACF plot of the residuals]
I am attempting to classify various bodies of text using Azure ML Studio, and everything produces the expected output until I deploy and test a web service. Once I deploy my web service and attempt to test it, I get the following error:
Error 0035: Features for The vocabulary is empty. Please check the Minimum n-gram document frequency. required but not provided., Error code: ModuleExecutionError, Http status code: 400
The vocabularies for the extract n-gram modules are not empty. The only aspect that changes from the working model to the Web service error is the web service input.
Training Model
Predictive Model
You need to create 2 N-GRAMS modules in your "Training experiment", as shown in the screenshot at the link below.
Follow the steps from the link:
COPY your NGRAMS module.
SET one to "CREATE" (for training) and the other to "READ" (for testing).
CONNECT the vocabulary output of the training NGRAMS module to the input vocabulary of the testing NGRAMS module.
SAVE the output vocabulary from the testing part through a "Convert to Dataset" block.
You can then use this dataset to feed your deployment's NGRAMS step in "READ" mode.
See steps here =>
https://learn.microsoft.com/fr-fr/azure/machine-learning/algorithm-module-reference/extract-n-gram-features-from-text
The MS documentation is misleading here.
What Azure ML Studio is looking for is a second input for the "extract n-grams" pill in the predictive experiment.
The desired second input is a dataset. The one you want is produced by the "extract n-grams" pill in the training experiment. To get and use this dataset, go to your training experiment and add a "Convert to CSV" pill on the second output node of the "Extract N-Grams" pill. Then save this as a dataset.
Now you use it as a second input in your predictive model's "n-grams" pill. You should be good to go!
From the Azure GitHub: https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/machine-learning/algorithm-module-reference/extract-n-gram-features-from-text.md
Score or publish a model that uses n-grams
Copy the Extract N-Gram Features from Text module from the training dataflow to the scoring dataflow.
Connect the Result Vocabulary output from the training dataflow to Input Vocabulary on the scoring dataflow.
In the scoring workflow, modify the Extract N-Gram Features from Text module and set the Vocabulary mode parameter to ReadOnly. Leave all else the same.
To publish the pipeline, save Result Vocabulary as a dataset.
Connect the saved dataset to the Extract N-Gram Features from Text module in your scoring graph.
I am very new to machine learning and R, so my question might be unclear or need more information. I have tried to explain as much as possible. Please correct me if I have used the wrong terminology. Any help on this will be greatly appreciated.
Context - I am trying to build a model to predict "when" an event is going to happen.
I have a dataset with the structure below. This is not the actual data; it is dummy data created to explain the scenario. The actual data cannot be shared due to confidentiality.
About data -
A customer buys a subscription under which he is allowed to use $x worth of the service provided.
A customer can have multiple subscriptions. Subscriptions can overlap in time or follow one another.
Each subscription has a usage limit of $x.
Each subscription has a start date and an end date.
A subscription is no longer used after its end date.
Each customer has his own pattern of using the service. This is described by other derived variables: monthly utilization, average monthly utilization, etc.
A customer can use the service beyond $x. This is indicated by the column "ExceedanceMonth" in the table above. A value of 1 means the customer went above $x in the first month of the subscription; a value of 5 means the customer went above $x in the 5th month of the subscription. A value of NULL means the limit $x has not been reached yet. This could be either because
the subscription ended and the customer didn't overuse
or
the subscription is yet to end and the customer might overuse in future.
The second scenario is the one I want to predict: among the subscriptions which are yet to end and where the customer hasn't overused, WHEN will the limit be reached? That is, predict the ExceedanceMonth column in the above table.
Before reaching this model - I have a classification model, built using a decision tree, which predicts whether the customer is going to cross the limit amount, i.e. whether LimitReached = 1 or 0 in the next 2 months. I am not sure whether I should train the model discussed here (predicting time to event) on all the data and then test/use it on the customers/subscriptions with LimitReached = 1, or train it only on the customers/subscriptions that have LimitReached = 1.
I have researched survival models. I understand that a survival model like Cox can be used to estimate the hazard function and to understand how each variable affects the time to event. I tried to use the predict function with Cox, but I did not understand whether any of the values passed to the "type" parameter can be used to predict the actual time, i.e. how I can predict the actual value for "WHEN" the limit will be crossed.
Maybe a survival model isn't the right approach for this scenario, so please advise me on the best way to approach this problem.
library(survival)
#define survival object
recsurv <- Surv(time=df$ExceedanceMonth, event=df$LimitReached)
#only for testing the code
train = subset(df,df$SubStartDate>="20150301" & df$SubEndDate<="20180401")
test = subset(df,df$SubStartDate>"20180401") #only for testing the code
#bare column names + data = train: the fit uses only train, and predict() can find the same columns in newdata
fit <- coxph(Surv(ExceedanceMonth, LimitReached) ~ SubDurationInMonths+`#subs`+LimitAmount+Monthlyutitlization+AvgMonthlyUtilization, data = train, model = TRUE)
#default type is "lp": the linear predictor, not a time
predicted <- predict(fit, newdata = test)
head(predicted)
1 2 3 4 5 6
0.75347328 0.23516619 -0.05535162 -0.03759123 -0.65658488 -0.54233043
Thank you in advance!
Survival models are fine for what you're trying to do. (I'm assuming you've estimated the model correctly from this point on.)
The key is understanding what comes out of the model. For a Cox, the default quantity out of predict() is the linear combination (b0 + b1x1 + b2x2..., though the Cox doesn't estimate a b0). That alone won't tell you anything about when.
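Specifying type="expected" for predict() gives the expected number of events for each customer over the follow-up time (how long you watch the customer), with the follow-up time set equal to the customer's actual duration (retrieved from the coxph model object). That is still not a duration. To get at when, work from the predicted survival curve: survfit(fit, newdata = ...) returns one curve per customer, and the median of that curve (the first time predicted survival falls to 0.5) is a usable point estimate of how long until the customer reaches his/her limit.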
The coxed package will also give you expected durations, calculated using a different method, without the need to worry about follow-up time. It's also a little more forgiving when it comes to inputting a newdata argument, particularly if you have a specific covariate profile in mind. See the package vignette here.
See also this thread for more on coxph.predict().
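As a rough illustration of turning a Cox fit into a "when", the sketch below takes the median of each predicted survival curve. It uses the lung dataset bundled with the survival package as a stand-in for the subscription data, and the two covariate profiles in new_customers are made up for the example; none of this comes from the original question.

```r
library(survival)  # recommended package, ships with standard R installs

# Stand-in data: the `lung` dataset from survival (status == 2 marks the event).
fit <- coxph(Surv(time, status == 2) ~ age + sex, data = lung)

# Hypothetical covariate profiles to predict for.
new_customers <- data.frame(age = c(50, 70), sex = c(1, 2))

# One predicted survival curve per row of newdata.
curves <- survfit(fit, newdata = new_customers)

# Median time-to-event per profile: the first time the curve reaches 0.5.
median_time <- quantile(curves, probs = 0.5, conf.int = FALSE)
print(median_time)
```

If a predicted curve never drops to 0.5, which can happen with heavy censoring, the median comes back as NA; in that case a lower quantile or a restricted mean may be a more practical summary.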
I am attempting to develop an Azure ML experiment that uses R to perform predictions of a continuous response variable. The initial experiment is relatively simple, incorporating only a few experiment items, including "Create R Model", "Train Model" and "Score Model", along with some data input.
I have written a training script and a scoring script, both of which appear to execute without errors when I run the experiment within ML Studio. However, when I examine the scored dataset, the score values are all missing values. So I am concerned that my scoring script could be returning scores incorrectly. Can anyone advise what type I should be returning? Is it meant to be a single column data.frame, or something else?
It is also possible that my scores are not being properly calculated within the scoring script, although I have run the training and scoring scripts within R Studio, which shows the expected results. It would also be helpful if someone could suggest how to perform debugging of my scoring script in some way, so that I could determine whereabouts the code is failing to behave as expected.
Thanks, Paul
Try using this sample and compare with yours - https://gallery.cortanaintelligence.com/Experiment/Compare-Sample-5-in-R-vs-Azure-ML-1
My suggestion is to preprocess the data before feeding it into the experiment: handle the missing values and outliers first, using whichever preprocessing techniques suit your data.