Is there a standard (or available) way to export a time series model in R? PMML would work, but when I try to use the pmml library (perhaps incorrectly), I get an error.
For example, my code looks similar to this:
require(fpp)
library(forecast)
library(pmml)
data <- ts(livestock, start = 1970, end = 2000,frequency=3)
model <- ses(data , h=10 )
export <- pmml(model)
And the error I get is:
Error in UseMethod("pmml") : no applicable method for 'pmml' applied to an object of class "forecast"
Here is what I can tell:
When you use ses(), you're not creating a model; you're using a model to find a prediction (in particular, making a forecast via exponential smoothing for a time series). Your result is not a predictive model, but rather a particular prediction of a model for a particular data set. While I'm not that familiar with PMML, from what I can tell, it's not meant for the job you are trying to use it for.
If you want to export the time series and the result, your best bet is to export a .csv file with the data; just about anything can read .csv files. A ts object is little more than a vector with time attributes, so you can export the data and the times. Additionally, model is just a table of forecast values. So try this:
write.csv(model, file="forecast.csv")
If you want to write the ts object, try one of the following:
write.csv(data, file="ts1.csv") # No dates for index
write.csv(cbind("time" = time(data), "val" = data), file = "ts2.csv") # Adds dates
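If the goal is to reuse the fitted model from R later, rather than hand it to another tool, one option is to fit the model explicitly and serialize it. The following is a minimal sketch, not the answer's method: it assumes that ets() with model = "ANN" (simple exponential smoothing) is an acceptable stand-in for ses(), and it uses the built-in Nile series and a temporary file path purely for illustration. The .rds format is R-specific, so this does not replace PMML for cross-tool exchange.

```r
library(forecast)

# Nile is a built-in annual series, used here only as a stand-in for the real data.
fit <- ets(Nile, model = "ANN")   # simple exponential smoothing as a fitted model object

# Save the fitted model to disk and read it back (base R serialization).
path <- file.path(tempdir(), "ses_model.rds")   # hypothetical file location
saveRDS(fit, path)
fit2 <- readRDS(path)

# The restored object can produce forecasts without refitting.
fc <- forecast(fit2, h = 10)
print(fc)
```

Unlike the forecast object returned by ses(), the object saved here is the model itself, so the forecast horizon can be chosen after loading.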
I have a data set that includes t(time) which ranges from 1-243 and 5 other variables which are separate company stock prices each also containing 243 data points. I want to run exponential smoothing on my variable "HD". I am trying to run the following command:
library(smooth)
smoothhd <- es(mydata$HD, h=10, holdout=TRUE, silent=FALSE, cfTYPE=MSE)
However, when I do I receive the following error:
The provided data is not ts object. Only non-seasonal models are available.
Forming the pool of models based on... ANN, AAN, Estimation progress: 100%... Done!
Error in .External.graphics(C_layout, num.rows, num.cols, mat, as.integer(num.figures), :
invalid graphics state.
Does anyone have any insight as to what is wrong with my command or what might need to be changed with my data file in order for this command to give me the smoothed data?
It seems that your mydata$HD is not a time series object.
Try running is.ts(mydata$HD); if it returns FALSE, coerce the column to a time series with as.ts(mydata$HD).
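The check and the coercion can be sketched as follows. The data frame here is a hypothetical stand-in for mydata (243 simulated prices), since the original file is not available.

```r
# Hypothetical stand-in for mydata: a time index and one simulated price column.
set.seed(1)
mydata <- data.frame(t = 1:243, HD = 100 + cumsum(rnorm(243)))

is.ts(mydata$HD)          # FALSE: a data frame column is a plain numeric vector

hd <- as.ts(mydata$HD)    # coerce; the defaults are start = 1 and frequency = 1
is.ts(hd)                 # TRUE
tsp(hd)                   # start, end, and frequency of the new series
```

The coerced hd can then be passed to es() in place of mydata$HD. With frequency = 1 the series is still non-seasonal; if the data have a known seasonal period, ts() with an explicit frequency would be needed instead.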
I am applying a linear regression model to data, and using the relaimpo package to find the most significant factors.
When running the following code in R, it works fine
library(readxl)
nba <- read_excel("XXXX")
View(nba)
library(relaimpo)
rec = lm(won ~ o_fgm + o_ftm + o_pts , data = nba)
x= calc.relimp(rec, type = c("lmg"), rela = TRUE, rank = TRUE)
x$lmg
I get output of:
o_fgm o_ftm o_pts
0.3374366 0.2628543 0.3997091
When connecting via Tableau I use the following code:
SCRIPT_REAL("
won=.arg1
o_fgm=.arg2
o_ftm=.arg3
o_pts=.arg4
library(relaimpo)
rec = lm(won ~ o_fgm + o_ftm + o_pts)
x= calc.relimp(rec, type = c('lmg'), rela = TRUE, rank = TRUE)
"
,MEDIAN([Won]),MEDIAN([O Fgm]),MEDIAN([O Ftm]),MEDIAN([O Pts]))
I am getting the following error:
An error occurred while communicating with the RServe service.
Error in calc.relimp.default.intern(object = structure(list(won = 39, : Too few complete observations for estimating this model
I have run it with just the regression and it runs fine; so it seems the issue is with the relaimpo package. There is limited documentation online on this package so I cannot find a fix; any help is really appreciated thanks!
Data is from kaggle at https://www.kaggle.com/open-source-sports/mens-professional-basketball
(the "basketball_teams.csv" file)
When Tableau calls R or Python using the SCRIPT_REAL() function, or any SCRIPT_XXX() function, it is using what Tableau calls a table calculation. This has the effect of passing R one or more vectors -- and receiving back vector results -- instead of calling the function once for each scalar cell.
However, you are responsible for specifying how to partition your aggregate results into vectors, and how to order the rows in the vectors you send to R or Python. You do that by specifying the "partitioning" and "addressing" of each table calc via the Edit Table Calc command (right click on a calc field).
So the most likely issue is that you are sending R less data than you expect, perhaps many short vectors instead of the one long vector you intend. Read about table calcs, partitioning, and addressing in the online help. You specify partitioning by choosing which dimensions are not set to "compute using" (a synonym for addressing dimensions). The Table Calc editor gives you some visible feedback as you try different settings; I recommend using specific dimensions in most cases.
For table calcs, the choice of partitioning and addressing is as important as the actual formula.
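To see why partitioning matters, it can help to reproduce on the R side what a one-row partition looks like. This is a rough sketch in base R only: the data are simulated, the names mirror the question, and relaimpo itself is not called. It only shows that a regression on a single row has nothing left to estimate from, which is consistent with the "Too few complete observations" message.

```r
# Simulated data standing in for the vectors Tableau passes as .arg1 ... .arg4.
set.seed(42)
n <- 30
d <- data.frame(
  o_fgm = rnorm(n, 3000, 200),
  o_ftm = rnorm(n, 1500, 150),
  o_pts = rnorm(n, 8000, 400)
)
d$won <- 0.005 * d$o_fgm + 0.004 * d$o_ftm + 0.002 * d$o_pts + rnorm(n, sd = 3)

# One partition holding all rows: the model has residual degrees of freedom.
full <- lm(won ~ o_fgm + o_ftm + o_pts, data = d)
df.residual(full)

# One partition per row: each call to R sees length-1 vectors.
single <- lm(won ~ o_fgm + o_ftm + o_pts, data = d[1, ])
df.residual(single)   # no degrees of freedom left
coef(single)          # the slopes cannot be estimated and come back as NA
```

A quick way to check what Tableau is actually sending is to return length(.arg1) from a SCRIPT_REAL call and confirm that it matches the number of rows you expect in each partition.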
I am using the ts function for time series prediction in R. I tried the same analysis in SSAS (SQL) and it gives a pretty good analysis for my data set. But when I try it in R, it seems I can't pass multiple input variables. This is the call I used:
m<-ts(myt$amount, start = c(2010,1), end = c(2016,12),frequency = 12)
Can anyone tell me where I can pass other input variables? For example, in my case I predict future amounts using a monthly time series. But I have additional variables like Sales_Cateory, Sales_subcategory, etc., which could be used as input variables.
I tried passing them as the last parameters but didn't see any change in my results.
I am trying to build random forest models by group (School_ID, more than 3,000 groups) on a large model-input CSV file using the Spark Scala API. Each group contains about 3000-4000 records. The resources I have at my disposal are 20-30 AWS m3.2xlarge instances.
In R, I can construct models by group and save them to a list like this:
library(dplyr);library(randomForest);
Rf_model <- train %>% group_by(School_ID) %>%
do(school= randomForest(formula=Rf_formula, data=., importance = TRUE))
The list can be stored somewhere, and I can load the models when I need to use them, as below:
save(Rf_model.school,file=paste0(Modelpath,"Rf_model.dat"))
load(file=paste0(Modelpath,"Rf_model.dat"))
pred <- predict(Rf_model.school$school[school_index][[1]], newdata=test)
I was wondering how to do that in Spark, whether I need to split the data by group first, and how to do so efficiently if it is necessary.
I was able to split the file by School_ID with the code below, but it seems to create a separate job for each subset and takes a long time to finish. Is there a way to do it in one pass?
model_input.cache()
val schools = model_input.select("School_ID").distinct.collect.flatMap(_.toSeq)
val bySchoolArray = schools.map(School_ID => model_input.where($"School_ID" <=> School_ID))
for (i <- schools.indices) {
bySchoolArray(i).
write.format("com.databricks.spark.csv").
option("header", "true").
save("model_input_bySchool/model_input_"+ schools(i))
}
Source:
How can I split a dataframe into dataframes with same column values in SCALA and SPARK
Edit 8/24/2015
I'm trying to convert my dataframe into a format that is accepted by the random forest model. I'm following the instructions in this thread:
How to create correct data frame for classification in Spark ML
Basically, I create a new variable "label" and store my class as a Double. Then I combine all my features using VectorAssembler and transform my input data as follows:
val assembler = new VectorAssembler().
setInputCols(Array("COL1", "COL2", "COL3")).
setOutputCol("features")
val model_input = assembler.transform(model_input_raw).
select("SCHOOL_ID", "label", "features")
Partial error message (let me know if you need the complete log message):
scala.MatchError: StringType (of class org.apache.spark.sql.types.StringType$)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2.apply(VectorAssembler.scala:57)
This is resolved after converting all the variables to numeric types.
Edit 8/25/2015
The ml model doesn't accept the label I coded manually, so I need to use StringIndexer to work around the problem as indicated here. According to the official documentation, the most frequent label gets index 0. This causes inconsistent labels across School_ID values. I was wondering if there's a way to create the labels without resetting the order of the values.
val indexer = new StringIndexer().
setInputCol("label_orig").
setOutputCol("label")
Any suggestions or directions would be helpful and feel free to raise any questions. Thanks!
Since you already have a separate data frame for each school, there is not much to be done here. Since you use data frames, I assume you want to use ml.classification.RandomForestClassifier. If so, you can try something like this:
1. Extract the pipeline logic. Adjust the RandomForestClassifier parameters and transformers according to your requirements.
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}
def trainModel(df: DataFrame): PipelineModel = {
val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf))
pipeline.fit(df)
}
2. Train models on each subset
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
3. Save models
import java.io._
def saveModel(name: String, model: PipelineModel) = {
val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
oos.writeObject(model)
oos.close
}
schools.zip(bySchoolArrayModels).foreach {
  // schools was built with flatMap(_.toSeq), so each name is typed as Any
  case (name, model) => saveModel(name.toString, model)
}
4. Optional: Since the individual subsets are rather small, you can try an approach similar to the one I've described here to submit multiple tasks at the same time.
If you use mllib.tree.model.RandomForestModel you can omit step 3 and use model.save directly. Since there seem to be some problems with deserialization (How to deserialize Pipeline model in spark.ml? - as far as I can tell it works just fine, but better safe than sorry, I guess), it could be the preferred approach.
Edit
According to the official documentation:
VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.
Since the error indicates that your column is a String, you should transform it first, for example using StringIndexer.
I'm currently struggling with creating a nice output file for performed regressions.
I stored all my regressions (25), fitted with the lm function, in a list. It is very inconvenient to analyze the output in R (at least I don't know how to do it efficiently). For that reason I'd like to export all my results to an Excel sheet.
The list called "myregressions" has 25 entries, one per regression, named after the year of the regression: myregression$year_1988 - myregression$year_2011, with all the information on coefficients, residuals, effects, rank, etc.
Is there an easy way to accomplish that? Thank you all in advance.
You could convert the list into a data frame, and then do something like this:
dataFrameToCSV = function(dataFrame, path)
{
write.table(dataFrame, path, sep=",", row.names=FALSE)
}
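The step of converting the list into a data frame could look roughly like the following. This is a sketch under assumptions: the list of models is a stand-in built from the built-in mtcars data (one fit per cyl group), the helper name coefTable and the output path are made up for illustration, and only the coefficient tables and R-squared are exported, not the residuals.

```r
# Hypothetical stand-in for the list of yearly regressions:
# one lm fit per value of cyl in the built-in mtcars data.
myregressions <- lapply(split(mtcars, mtcars$cyl),
                        function(d) lm(mpg ~ wt + hp, data = d))
names(myregressions) <- paste0("cyl_", names(myregressions))

# Turn one fitted model into a data frame holding its coefficient table.
coefTable <- function(name, fit) {
  tab <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)
  out <- data.frame(model = name,
                    term = rownames(tab),
                    tab,
                    r.squared = summary(fit)$r.squared,
                    check.names = FALSE)
  rownames(out) <- NULL
  out
}

# Stack the per-model tables into one long data frame and write it out.
results <- do.call(rbind, Map(coefTable, names(myregressions), myregressions))
path <- file.path(tempdir(), "myregressions.csv")   # hypothetical output location
write.csv(results, file = path, row.names = FALSE)
```

The resulting file has one row per model and coefficient, which opens directly in Excel and can be filtered or pivoted by the model column.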