How do I feed a TFRecord dataset into a keras model if the dataset has a dictionary with multiple inputs and a single target?

I have a TFRecord dataset I would like to feed to a keras model. It consists of a dictionary with multiple input features and a single output feature. The TensorFlow documentation says, for the input x: "A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights)."
If I simply feed it the way it is, I get "ValueError: Missing data for input "input_2". Expected the following keys: ['input_2']". How do I separate the inputs from the target in the dataset and create a tuple to feed the model, as stated in the docs? I've looked everywhere and only found cases where the features were not a time series or where there was only a single input.
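One approach that should work is to map over the dataset and split each element's feature dictionary into an (inputs, target) tuple. A minimal sketch, assuming the target is stored under a key named "target" in each example's dictionary (use your actual feature and target key names):

import tensorflow as tf

def split_features_and_target(example):
    # Copy the parsed feature dict and pull the target out of it, so the
    # dataset yields the (inputs, targets) tuples that model.fit expects.
    inputs = dict(example)
    target = inputs.pop("target")  # hypothetical key name for the single target
    return inputs, target

dataset = dataset.map(split_features_and_target)  # 'dataset' is the parsed TFRecord dataset
model.fit(dataset.batch(32))

The keys remaining in inputs must match the names of the model's Input layers (e.g. "input_2") so Keras can route each feature to the right input.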

Related

Handle a string return from R to Tableau and SPLIT it

I connect Tableau to R and execute an R function that recommends products. When the R call returns, the result is a single string containing all the product details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line separated by \n indicating end of line)
In Tableau, I display the calculated field, which looks like this:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
The data above reaches Tableau through a calculated field as a single string, which I want to split on the pipe character ('|') into three columns.
I used the SPLIT function on the calculated field:
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self-explanatory. Are there any alternatives to solve this? I googled for best practices for integrating R and Tableau, and all I could find were simple k-means clustering examples.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc and clicking specific dimensions. The fields that are not checked determine the partitioning, and thus the cardinality of the arguments you send to and receive from R.
This means it might be tricky to map your problem onto this infrastructure. Not necessarily impossible. It was designed to send a series of vector arguments with one cell per partitioning dimension, say, Manufacturer and get back one vector with one result per Manufacturer (or whatever combination of fields partition your data for the table calc). Sounds like you are expecting an arbitrary length list of recommendations. It shouldn’t be too hard to have your R script turn the string into a vector before returning, but the size of the vector has to make sense.
As an example of an approach that fits this model more easily, say you had a Tableau view that had one row per Product (and you had N products) - and some other aggregated measure fields in the view per Product. (In Tableau speak, the view’s level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R - each vector having N values, and then have R return a vector of reals of length N where the value returned at each location was a recommender score for the product at that position. (Which is why the ordering aka addressing of the vectors also matters)
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.

Integration of Time series model of R in Tableau

I am trying to integrate an R time-series model with Tableau, and I am new to this integration. Please help me resolve the error mentioned below. Here is my Tableau calculation that calls R; the calculation is valid, but I am getting an error.
SCRIPT_REAL(
"library(forecast);
cln_count_ts <- ts(.arg1,frequency = 7);
arima.fit <- auto.arima(log10(cln_count_ts));
forecast_ts <- forecast(arima.fit, h =10);",
SUM([Count]))
Error : Error in auto.arima(log10(cln_count_ts)) : No suitable ARIMA model found
When Tableau calls R, Python, or another tool, it does so as a "table calc". That means it sends the external system one or more vectors as arguments and expects a single vector in response.
Depending on your data and calculation, you may want to send all your data to R in a single call, passing a very large vector, or call it several times with different vectors - say forecasting each region separately. Or even call R multiple times with many vectors of size one (aka scalars).
So with table calcs, you have other decisions to make beyond just choosing the function to invoke. Chiefly, you have to decide how to partition your data for analysis. And in some cases, you also need to determine the order that the data appears in the vectors you send to R - say if the order implies a time series.
The Tableau terms for specifying how to divide and order data for table calculations are "partitioning and addressing". See the section on that topic in the online help. You can change those settings by using the "Edit Table Calc" menu item.

keras embedding vector back to one-hot

I am using keras on an NLP problem. A question about word embeddings comes up when I try to predict the next word from the previous words. I have already turned the one-hot word into a word vector via the keras Embedding layer like this:
word_vector = Embedding(input_dim=2000,output_dim=100)(word_one_hot)
I use this word_vector to do something, and the model produces another word_vector at the end. But I have to see what the predicted word really is. How can I turn the word_vector back into word_one_hot?
This question is old but seems to be linked to a common point of confusion about what embeddings are and what purpose they serve.
First off, you should never convert to one-hot if you're going to embed afterwards. This is just a wasted step.
Starting with your raw data, you need to tokenize it. This is simply the process of assigning a unique integer to each element in your vocabulary (the set of all possible words/characters [your choice] in your data). Keras has convenience functions for this:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
max_words = 100 # just a random example,
# it is the number of most frequently occurring words in your data set that you want to use in your model.
tokenizer = Tokenizer(num_words=max_words)
# This builds the word index
tokenizer.fit_on_texts(df['column'])
# This turns strings into lists of integer indices.
train_sequences = tokenizer.texts_to_sequences(df['column'])
# This is how you can recover the word index that was computed
print(tokenizer.word_index)
Embeddings generate a representation. Later layers in your model use earlier representations to generate more abstract representations. The final representation is used to generate a probability distribution over the number of possible classes (assuming classification).
When your model makes a prediction, it provides a probability estimate for each of the integers in the word_index. So, if 'cat' were the most likely next word and your word_index had something like {cat: 666}, ideally the model would assign a high probability to 666 (not to the string 'cat'). Does this make sense? The model never predicts an embedding vector; the embedding vectors are intermediary representations of the input data that are (hopefully) useful for predicting an integer associated with a word/character/class.
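For illustration, a minimal sketch of that last step (model and x are placeholders here, assuming the model ends in a softmax over the max_words vocabulary indices and uses the tokenizer built above):

import numpy as np

# Reverse the tokenizer's mapping: integer index -> word.
reverse_word_index = {index: word for word, index in tokenizer.word_index.items()}

# Hypothetical prediction for a single input sequence x;
# probs has shape (1, max_words): one probability per vocabulary index.
probs = model.predict(x)

predicted_index = int(np.argmax(probs[0]))             # e.g. 666
predicted_word = reverse_word_index[predicted_index]   # e.g. 'cat'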

Run 3000+ Random Forest Models By Group Using Spark MLlib Scala API

I am trying to build random forest models by group (School_ID, more than 3,000 groups) on a large model-input CSV file using the Spark Scala API. Each group contains about 3000-4000 records. The resources I have at my disposal are 20-30 AWS m3.2xlarge instances.
In R, I can construct models by group and save them to a list like this-
library(dplyr);library(randomForest);
Rf_model <- train %>% group_by(School_ID) %>%
do(school= randomForest(formula=Rf_formula, data=., importance = TRUE))
The list can be stored somewhere, and I can load the models when I need to use them, like below:
save(Rf_model.school,file=paste0(Modelpath,"Rf_model.dat"))
load(file=paste0(Modelpath,"Rf_model.dat"))
pred <- predict(Rf_model.school$school[school_index][[1]], newdata=test)
I was wondering how to do that in Spark: whether I need to split the data by group first, and how to do it efficiently if that's necessary.
I was able to split up the file by School_ID with the code below, but it seems to create one individual subsetting job for each iteration, and it takes a long time to finish. Is there a way to do it in one pass?
model_input.cache()

val schools = model_input.select("School_ID").distinct.collect.flatMap(_.toSeq)
val bySchoolArray = schools.map(School_ID => model_input.where($"School_ID" <=> School_ID))

for (i <- 0 until schools.length) {
  bySchoolArray(i).
    write.format("com.databricks.spark.csv").
    option("header", "true").
    save("model_input_bySchool/model_input_" + schools(i))
}
Source:
How can I split a dataframe into dataframes with same column values in SCALA and SPARK
Edit 8/24/2015
I'm trying to convert my dataframe into a format that is accepted by the random forest model. I'm following the instructions in this thread:
How to create correct data frame for classification in Spark ML
Basically, I create a new variable "label" and store my class as a Double. Then I combine all my features using the VectorAssembler function and transform my input data as follows:
val assembler = new VectorAssembler().
setInputCols(Array("COL1", "COL2", "COL3")).
setOutputCol("features")
val model_input = assembler.transform(model_input_raw).
select("SCHOOL_ID", "label", "features")
Partial error message (let me know if you need the complete log message):
scala.MatchError: StringType (of class
org.apache.spark.sql.types.StringType$)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2.apply(VectorAssembler.scala:57)
This is resolved after converting all the variables to numeric types.
Edit 8/25/2015
The ml model doesn't accept the label I coded manually, so I need to use StringIndexer to work around the problem, as indicated here. According to the official documentation, the most frequent label gets index 0. This causes inconsistent labels across School_IDs. I was wondering if there's a way to create the labels without reordering the values.
val indexer = new StringIndexer().
setInputCol("label_orig").
setOutputCol("label")
Any suggestions or directions would be helpful and feel free to raise any questions. Thanks!
Since you already have a separate data frame for each school, there is not much to be done here. Since you use data frames, I assume you want to use ml.classification.RandomForestClassifier. If so, you can try something like this:
1. Extract the pipeline logic. Adjust the RandomForestClassifier parameters and transformers according to your requirements:
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}

def trainModel(df: DataFrame): PipelineModel = {
  val rf = new RandomForestClassifier()
  val pipeline = new Pipeline().setStages(Array(rf))
  pipeline.fit(df)
}
2. Train models on each subset:
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
3. Save the models:
import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close()
}

// schools was collected as Array[Any], so convert each ID to a String for the file name
schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name.toString, model)
}
Optional: since individual subsets are rather small, you can try an approach similar to the one I've described here to submit multiple tasks at the same time.
If you use mllib.tree.model.RandomForestModel, you can omit step 3 and use model.save directly. Since there seem to be some problems with deserialization of the PipelineModel (How to deserialize Pipeline model in spark.ml? - as far as I can tell it works just fine, but better safe than sorry, I guess), this could be the preferred approach.
Edit
According to the official documentation:
VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.
Since the error indicates your column is a String, you should transform it first, for example using StringIndexer.

Accessing class values in R's poLCA

I am trying my hand at learning Latent Component Analysis, while also learning R. I'm using the poLCA package, and am having a bit of trouble accessing the attributes. I can run the sample code just fine:
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds = within(ds, (cesdcut = ifelse(cesd>20, 1, 0)))
library(poLCA)
res2 = poLCA(cbind(homeless=homeless+1,
cesdcut=cesdcut+1, satreat=satreat+1,
linkstatus=linkstatus+1) ~ 1,
maxiter=50000, nclass=3,
nrep=10, data=ds)
but in order to make this more useful, I'd like to access the attributes within the objects created by the poLCA class as such:
attr(res2, 'Nobs')
attr(res2, 'maxiter')
but they both come up as 'Null'. I expect Nobs to be 453 (determined by the function) and maxiter to be 50000 (dictated by my input value).
I'm sure I'm just being naive, but I could use any help available. Thanks a lot!
Welcome to R. You've got the model-fitting syntax right, in that you can get a model out (I don't know how latent component analysis works, so I can't speak to the statistical validity of your result). However, you've mixed up the different ways in which R can store information pertaining to a model.
poLCA returns an object of class poLCA, which is a list containing the following elements:
(...)
Nobs: number of fully observed cases (less than or equal to N).
maxiter: maximum number of iterations through which the estimation algorithm was set to run.
Since it's a list, you can extract individual elements from your model object using the $ operator:
res2$Nobs # number of observations
res2$maxiter # maximum iterations
In some cases, there might be extractor functions to get this information without having to do low-level indexing. For example, many model-fitting functions will have a fitted method, which pulls out the vector of fitted values on the training data; and similarly residuals pulls out the vector of residuals. You should check whether there are such extractor functions provided by the poLCA package and use them if possible; that way, you're not making assumptions about the structure of the model object that might be broken in the future.
This is distinct from getting the attributes of an object, which is what attr is for. Attributes in R are what you might call metadata: they contain R-specific information about an object itself, rather than information about whatever the object relates to. Examples of common attributes include class (the class of an object), dim (the dimensions of an array or matrix), names (the names of individual elements of a vector/list/array), and so on.
