Confused about how to make an mlmodel updatable using coremltools 3 - coreml

I have a regressor mlmodel trained using MobileNetV2. The last several layers are as follows:
I want to make this mlmodel updatable and train the innerProduct layer (the fully-connected layer in PyTorch). I converted the mlmodel by following this blog:
https://machinethink.net/blog/coreml-training-part4/. But I found that the updatable mlmodel's second training input defaults to "score_true", and it is just a single value (datatype: int32).
However, the output of the softmax layer is a vector of 10 float values. So how can I set the second training input to a vector, given that the ground truth is a vector of 10 float values?
I also looked up the API of CrossEntropyLoss in coremltools 3.3; its input parameter can accept a vector of length N. So how can I change the default generated score_true from an intVal to a vector?
Thanks very much.

What you pass into the score_true MLMultiArray is the index of the class. You don't need to one-hot encode this yourself, i.e. no need to turn it into a vector of length N.
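For reference, a minimal coremltools 3.x sketch of that conversion step; the layer name 'fullyconnected0' and the softmax output name 'scores' are placeholders for whatever your model actually uses:
import coremltools
from coremltools.models.neural_network import NeuralNetworkBuilder, SgdParams

# Load the converted model and wrap its spec in a builder.
spec = coremltools.utils.load_spec('Regressor.mlmodel')
builder = NeuralNetworkBuilder(spec=spec)

# Mark the innerProduct layer as trainable (placeholder layer name).
builder.make_updatable(['fullyconnected0'])

# Categorical cross-entropy compares the softmax output against a class
# *index*, which is why the auto-generated "_true" training input is a
# scalar int32 rather than a length-10 vector.
builder.set_categorical_cross_entropy_loss(name='lossLayer', input='scores')

builder.set_sgd_optimizer(SgdParams(lr=0.01, batch=8))
builder.set_epochs(10)

coremltools.utils.save_spec(spec, 'UpdatableRegressor.mlmodel')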

Related

keras embedding vector back to one-hot

I am using Keras on an NLP problem. A question about word embeddings comes up when I try to predict the next word from the previous words. I have already turned each one-hot word into a word vector via the Keras Embedding layer, like this:
word_vector = Embedding(input_dim=2000,output_dim=100)(word_one_hot)
I use this word_vector to do something, and at the end the model produces another word_vector. But I have to see what the predicted word really is. How can I turn the word_vector back into word_one_hot?
This question is old but seems to be linked to a common point of confusion about what embeddings are and what purpose they serve.
First off, you should never convert to one-hot if you're going to embed afterwards. This is just a wasted step.
Starting with your raw data, you need to tokenize it. This is simply the process of assigning a unique integer to each element in your vocabulary (the set of all possible words/characters [your choice] in your data). Keras has convenience functions for this:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
max_words = 100  # just an example: the number of most frequently occurring
# words in your data set that you want to use in your model
tokenizer = Tokenizer(num_words=max_words)
# This builds the word index
tokenizer.fit_on_texts(df['column'])
# This turns strings into lists of integer indices.
train_sequences = tokenizer.texts_to_sequences(df['column'])
# This is how you can recover the word index that was computed
print(tokenizer.word_index)
Embeddings generate a representation. Later layers in your model use earlier representations to generate more abstract representations. The final representation is used to generate a probability distribution over the number of possible classes (assuming classification).
When your model makes a prediction, it provides a probability estimate for each of the integers in the word_index. So if 'cat' were the most likely next word and your word_index had something like {cat: 666}, ideally the model would assign a high likelihood to 666 (not to the string 'cat'). Does this make sense? The model never predicts an embedding vector; the embedding vectors are intermediate representations of the input data that are (hopefully) useful for predicting the integer associated with a word/character/class.
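Concretely, a small sketch of recovering the predicted word; here model is your trained network and x is an input sequence, both assumed for illustration:
import numpy as np

# Invert the tokenizer's word index so a predicted integer can be
# mapped back to the word it stands for.
reverse_word_index = {index: word for word, index in tokenizer.word_index.items()}

# The model outputs a probability distribution over the vocabulary;
# take the most likely index and look the word up.
probabilities = model.predict(x)  # shape: (1, vocabulary_size)
predicted_index = int(np.argmax(probabilities[0]))
predicted_word = reverse_word_index.get(predicted_index, '<unknown>')
print(predicted_word)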

netcdf dimension variable interpretation

I'm trying to understand if this is allowed by NetCDF standards. It does not make sense to me, but maybe there is a reason why it is not forbidden at the library level. Ncdump:
netcdf tt {
dimensions:
one = 2 ;
two = 1 ;
variables:
int64 one(two) ;
data:
one = 1 ;
}
And the code to produce this file in Python:
from netCDF4 import Dataset
rr = Dataset('tt.nc', 'w')
rr.createDimension('one', 2)
rr.createDimension('two', 1)
var1 = rr.createVariable('one', 'i8', ('two',))
var1[:] = 1
rr.close()
Note the variable with the same name as a dimension, but declared on a different dimension than its namesake?!
So two questions:
is this allowed by standard?
if not, should it be restricted by libraries?
It's valid because the names of attributes, names of dimensions, and names of variables all exist in different namespaces.
It's valid, but it obviously makes for confusing code and output, and it would not be acceptable in a professional setting. Note, though, that single-dimension arrays that have the same name and size as the dimension they are assigned to are called "coordinate variables."
For example, you'll often see a variable named latitude that is 1D and has a dimension named latitude. ncks or ncdump should reveal a (CRD) next to that variable display, indicating that it is indeed coordinated to the array of latitudes.

How to find out which seed the MICE R-package chose for multiple imputation when using seed=NA?

I'm doing a multiple imputation for a dataframe named "mydata" with this code:
library(mice)
imp <- mice(mydata, pred = pred, method = "pmm", m = 10)
Because the default argument for this function is seed = NA, the seed number is chosen randomly. I would like to keep it like this, because I don't know which number I should choose as a seed. But for replication I would like to know which seed the function chose for me. Is there a way to inspect the mids object "imp" for the seed value? Or should I just use a random number generator and set the seed to a generated value?
If you look at the documentation, there is no such thing as a set.seed argument for the mice function. There is, however, a seed argument which takes an integer. If left alone, the integer is generated randomly.
An integer that is used as argument by the `set.seed()` for offsetting the random
number generator. Default is to leave the random number generator alone
You can choose your own integer. If you're stuck on what to choose, try your lucky number or some random integer, with the sky (or the integer limit of your system's architecture) being the limit.
The function sets the seed in the following manner, which translates to "set the seed only if specified, otherwise leave it alone", as described in the documentation:
if (!is.na(seed))
    set.seed(seed)  ## FEH 1apr02

Markov Algorithm for Random Writing

I have a little problem conceptually understanding the structure of a random-writing program that takes input in the form of a text file and uses a Markov algorithm to create somewhat sensible output.
The data structure I am using has cases ranging from 0-10. At case 0, I count the number of times each letter/symbol/digit appears and base my new text on this to simulate the input. I have already implemented this using a Map type that holds each unique letter in the input text and an array of how many times it occurs, so I can simply ask for the size of the array for a specific letter and easily create output text that way.
But now I need to create case 1/2/3 and so on... case 1 also holds which letter is most likely to appear after any given letter. Do I need to create 10 separate arrays for these cases, or is there an easier way?
There are a lot of ways to model this. One approach is as you describe, with a multi-dimensional array where each index is the next character in the chain and the stored value is the count.
// Two-character sample: counts[i][j] = number of times character j
// followed character i, with 'a' => 0, 'b' => 1, ..., 'z' => 25.
int[][] counts = new int[26][26];  // Java initializes all entries to zero

// For example, for the string "apple" (unrolled here to show the result;
// in practice this would be a loop over adjacent character pairs):
counts['a' - 'a']['p' - 'a']++;
counts['p' - 'a']['p' - 'a']++;
counts['p' - 'a']['l' - 'a']++;
counts['l' - 'a']['e' - 'a']++;
Then, to randomly generate text, you would count the total number of outcomes for a given character (e.g. two outcomes for 'p' in the previous example) and make a weighted random pick among the possible outcomes; see the sketch below.
For smaller sizes (say up to 4 characters) that should work fine. For anything larger you may start to run into memory issues, since (assuming you're using A-Z) that means 26^N entries for an N-length chain.
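A small Python sketch of that counting-plus-weighted-pick idea (the names and the 26-letter alphabet mirror the array above and are purely illustrative):
import random
import string

# counts[prev][nxt] = how often character nxt followed character prev.
counts = [[0] * 26 for _ in range(26)]
text = "apple"
for a, b in zip(text, text[1:]):
    counts[ord(a) - ord('a')][ord(b) - ord('a')] += 1

def next_char(prev):
    # Weighted pick: each candidate is chosen in proportion to its count.
    row = counts[ord(prev) - ord('a')]
    return random.choices(string.ascii_lowercase, weights=row, k=1)[0]

print(next_char('p'))  # 'p' or 'l', each with probability 1/2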
I wrote something like this a couple of years ago. I think I used random pages from Wikipedia as seed data to generate the weights.

Generating a SequenceFile

Given data in the following format (tag_uri image_uri image_uri image_uri ...), I need to turn it into Hadoop SequenceFile format for further processing by Mahout (e.g. clustering):
http://flickr.com/photos/tags/100commentgroup http://flickr.com/photos/34254318#N06/4019040356 http://flickr.com/photos/46857830#N03/5651576112
http://flickr.com/photos/tags/100faves http://flickr.com/photos/21207178#N07/5441742937
...
Before this, I would turn the input into CSV (or ARFF) as follows:
http://flickr.com/photos/tags/100commentgroup,http://flickr.com/photos/tags/100faves,...
0,1,...
1,1,...
...
with each row describing one tag. The ARFF file is then converted into a vector file that Mahout uses for further processing. I am trying to skip the ARFF generation and generate a SequenceFile instead. If I am not mistaken, to represent my data as a SequenceFile I would need to store each row of the data with $tag_uri as the key and $image_vector as the value. What is the proper way of doing this (and, if possible, can the tag_uri for each row be included somewhere in the SequenceFile)?
Some references that I found, but I am not sure whether they are relevant:
Writing a SequenceFile
Formatting input matrix for svd matrix factorization (can I store my matrix in this form?)
RandomAccessSparseVector (considering I only list images that are assigned with a given tag instead of all the images in a line, is it possible to represent it using this vector?)
SequenceFile write
SequenceFile explanation
You just need a SequenceFile.Writer, which is explained in your link #4. This lets you write key-value pairs to the file. What the key and value are depends on your use case, of course. It's not at all the same for clustering versus matrix decomposition versus collaborative filtering. There's not one SequenceFile format.
Chances are that the key or value will be a Mahout Vector. The thing that knows how to write a Vector is VectorWritable. This is the class you would use to wrap a Vector and write it with SequenceFile.Writer.
You would need to look at the job that will consume it to make sure you're passing what it expects. For clustering, for example, I think the key is ignored and the value is a Vector.
