Can I update spaCy's Entity Linking knowledge base after training? - spacy-3

Let's suppose I have successfully trained an Entity Linking model, and it is working just fine. But eventually I'm going to update some aliases of the knowledge base. Just some aliases, not the descriptions, and no new entities.
I know spaCy has a method for this: kb.add_alias(alias="Emerson", entities=qids, probabilities=probs). But what if I have to do that after the training process? Should I re-run everything, or will updating the KB suffice?

The best thing is to try it and see.
If you're just adding new aliases, it really depends on how much they overlap with existing aliases. If there's no overlap, it won't make any difference. If there is overlap, it could have resulted in different evaluations during training, which could have changed the model; whether those differences are significant is hard to say.
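Mechanically, updating a saved pipeline's KB is straightforward. Here is a minimal sketch of what it might look like in spaCy 3, assuming your trained pipeline contains an entity_linker component; the paths and the QID are placeholders:

    import spacy

    # Placeholder paths and QID -- substitute your own.
    nlp = spacy.load("./my_trained_el_pipeline")
    entity_linker = nlp.get_pipe("entity_linker")
    kb = entity_linker.kb

    # Register a new alias in the existing knowledge base. The entities
    # must already exist in the KB, and the prior probabilities for a
    # single alias may not sum to more than 1.0.
    kb.add_alias(alias="Emerson", entities=["Q312545"], probabilities=[1.0])

    # Saving the pipeline serializes the updated KB along with it.
    nlp.to_disk("./my_updated_el_pipeline")

Note that add_alias is for aliases the KB has not seen before; to extend an existing alias with one more candidate there is kb.append_alias(alias=..., entity=..., prior_prob=...).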

Related

Can you resolve identity queries whilst Face API is undergoing training?

I am investigating solutions for identifying people utilising facial recognition and I am interested in using Microsoft's Face API.
I have noted that when adding new people the model needs to be trained again before those people will be recognised.
For our application it is crucial that, whilst training is happening, the model continues to resolve identify requests so that the service runs uninterrupted.
It seems to make sense that the old model would continue to respond to identify requests whilst the new model is being trained up, but I am not sure if this assumption is correct.
I would be grateful if someone with knowledge of the API could advise whether this is the case or, if not, whether there is another way to ensure continuous resolution of identify requests. I have thought about creating a whole new person group with all the new images, but this involves copying a lot of data and seems an inefficient way to go.
From the same documentation link in the previous answer:
During the training period, it is still possible to perform Identification and FindSimilar if a successful training is done before. However, the drawback is that the new added persons/faces will not appear in the result until a new post migration to large-scale training is completed.
My understanding is that this would work with LargePersonGroups (hence "post migration to large-scale training") but it is unclear whether it would work for the legacy PersonGroups.
I tried a few experiments on my own projects using the Face API, but my collection of faces is too small and training completes too quickly to check. I think training does not block the previous model, but I cannot guarantee it.
Anyway, you will be interested in the following part of the documentation addressing the problems of training latency: https://learn.microsoft.com/en-us/azure/cognitive-services/face/face-api-how-to-topics/how-to-use-large-scale#buffer
It shows how you can avoid the problem you describe by using a "buffer" group.
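To sketch that buffer-group idea concretely: keep a big master group serving identify calls while it retrains, enroll new people into a small buffer group that trains in seconds, and query both. The following is an illustrative sketch against the REST Identify endpoint, not production code; the endpoint, key, and group IDs are placeholders, and the merge policy (highest confidence wins) is just one reasonable choice:

    import requests

    # Placeholder endpoint, key, and group IDs -- substitute your own.
    ENDPOINT = "https://westus.api.cognitive.microsoft.com"
    HEADERS = {
        "Ocp-Apim-Subscription-Key": "<subscription-key>",
        "Content-Type": "application/json",
    }

    def identify(face_ids, group_id):
        """One Identify call against a single LargePersonGroup."""
        body = {
            "largePersonGroupId": group_id,
            "faceIds": face_ids,
            "maxNumOfCandidatesReturned": 1,
        }
        resp = requests.post(f"{ENDPOINT}/face/v1.0/identify",
                             headers=HEADERS, json=body)
        resp.raise_for_status()
        return resp.json()

    def identify_with_buffer(face_ids):
        """Query master and buffer groups; keep the best candidate per face."""
        best = {}
        for group in ("master-group", "buffer-group"):
            for item in identify(face_ids, group):
                for cand in item["candidates"]:
                    cur = best.get(item["faceId"])
                    if cur is None or cand["confidence"] > cur["confidence"]:
                        best[item["faceId"]] = cand
        return best

Once the master group finishes retraining with the buffered people folded in, the buffer group can be emptied.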

Transforming h2o model into non-h2o one

I know that it is possible to export/import an h2o model that was previously trained.
My question is: is there a way to transform an h2o model into a non-h2o one (one that works in plain R)?
I mean that I don't want to launch the h2o environment (JVM), since I know that predicting with a trained model is simply multiplying matrices, applying activation functions, etc.
Of course it would be possible to extract the weights manually and so on, but I want to know if there is a better way to do it.
I do not see any previous posts on SO about this problem.
No.
Remember that R is just the client, sending API calls: the algorithms (those matrix multiplications, etc.) are all implemented in Java.
What they do offer is a POJO, which is what you are asking for, but in Java. (POJO stands for Plain Old Java Object.) If you call h2o.download_pojo() on one of your models you will see it is quite straightforward. It may even be possible to write a script to convert it to R code? (Though it might be better, if you were going to go to that trouble, to convert it to C++ code, and then use Rcpp!)
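For a quick look at what the POJO gives you, here is a sketch using the Python client (the R client has an equivalent h2o.download_pojo() call); the training-data path is a placeholder:

    import h2o
    from h2o.estimators import H2ODeepLearningEstimator

    h2o.init()

    # Placeholder data: any frame with the response in the last column.
    frame = h2o.import_file("train.csv")
    model = H2ODeepLearningEstimator()
    model.train(x=frame.columns[:-1], y=frame.columns[-1],
                training_frame=frame)

    # Writes <model_id>.java to the given directory: a self-contained
    # scoring class with the learned weights baked in, which needs no
    # running h2o cluster to score.
    h2o.download_pojo(model, path=".")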
Your other option, in the case of deep learning, is to export the weights and biases, implement your own activation function, and use them directly.
But, personally, I've never found the Java side to be a bottleneck, either from the point of view of dev ops (install is easy) or computation (the Java code is well optimized).

ECS with Go - circular imports

I'm exploring both Go and Entity-Component-Systems. I understand how ECS works, and I'm trying to replicate what seems to be the go-to document of ECS, namely http://cowboyprogramming.com/2007/01/05/evolve-your-heirachy/
For performance, the document recommends using static arrays of every component type. That is, not arrays of component interfaces (arrays of pointers). The problem with this in Go is circular imports.
I have one package, ecs, which contains the definitions for Entity, Component and System types/interfaces as well as an EntityManager. Another package, ecs/components, contains the various components. Obviously, the ecs/components package depends on ecs. But, to declare arrays of specific components in EntityManager, ecs would depend on ecs/components, therefore creating a circular import.
Is there any way of avoiding this? I am aware that normally a high-level system should not depend on lower-level systems. I also want to point out that using an array of pointers is probably fast enough for my purposes, but I'm interested in possible workarounds (for future reference).
Thank you for your help!
For performance, the document recommends using static arrays of every component type.
I'm just going to start off saying that I may be blind, but I Ctrl+F'd and read that document multiple times and didn't see anything close to that. (Certainly some optimizations could be made this way with regard to things like avoiding cache misses, but I'm dubious that it in any way outweighs the clerical overhead.)
There's the easy-looking answer to the exact question you asked first: the . import. An import statement like import . "some/other/package" treats that package's exported contents as its own, though note it does not actually lift Go's ban on import cycles. Either way: don't do this.
Unfortunately, without merging the packages, you won't be able to do this (without using interfaces, I mean). Don't fear, though. The article you posted explicitly says this under "implementation details".
Giving each component a common interface means deriving from a base class with virtual functions. This introduces some additional overhead. Do not let this turn you against the idea, as the additional overhead is small, compared to the savings due to simplification of objects.
It's outright telling you to use interfaces (okay, C++ virtual inheritance, but close enough). It's okay; it's necessary. And if you ever want two slightly different AI components, for example, it's a godsend.

R-neuralnet: does it randomize the data?

I need to know whether the data for training that is passed in the neuralnet call is randomized by the routine, or whether the routine uses the data in the same order it is given. I really need to know this for a project I am working on, and I have not been able to figure it out by looking at the source.
Thanks!
Look into the code - that's one of the most important advantages of FOSS: you can actually check what it is doing. (neuralnet is pure R, so you don't even need to fear having to dig into FORTRAN or C code, and you can use debug() to step through the code with example data to get an overview.)
Moreover, if necessary, you can even introduce e.g. a new parameter that allows you to switch off randomization if needed.
Possibly the package maintainer (see maintainer("neuralnet")) would be willing to help you as well (and be able to answer much faster than about anyone else here on SE).

'make'-like dependency-tracking library?

There are many nice things to like about Makefiles, and many pains in the butt.
In the course of doing various projects (I'm a research scientist, "data scientist", or whatever) I often find myself starting out with a few data objects on disk, generating various artifacts from those, generating artifacts from those artifacts, and so on.
It would be nice if I could just say "this object depends on these other objects", and "this object is created in the following manner from these objects", and then ask a Make-like framework to handle the details of actually building them, figuring out which objects need to be updated, farming out work to multiple processors (like Make's -j option), and so on. Makefiles can do all this - but the huge problem is that all the actions have to be written as shell commands. This is not convenient if I'm working in R or Perl or another similar environment. Furthermore, a strong assumption in Make is that all targets are files - there are some exceptions and workarounds, but if my targets are e.g. rows in a database, that would be pretty painful.
To be clear, I'm not after a software-build system. I'm interested in something that (more generally?) deals with dependency webs of artifacts.
Anyone know of a framework for these kinds of dependency webs? It seems like it could be a nice tool for doing data science, visually showing how results were generated, etc.
One extremely interesting example I saw recently was IncPy, but it looks like it hasn't been touched in quite a while, and it's very closely coupled with Python. It's probably also much more ambitious than I'm hoping for, which is why it has to be so closely coupled with Python.
Sorry for the vague question, let me know if some clarification would be helpful.
A new system called "Drake" was announced today that targets this exact situation: http://blog.factual.com/introducing-drake-a-kind-of-make-for-data . Looks very promising, though I haven't actually tried it yet.
This question is several years old, but I thought adding a link to remake here would be relevant.
From the GitHub repository:
The idea here is to re-imagine a set of ideas from make but built for R. Rather than having a series of calls to different instances of R (as happens if you run make on R scripts), the idea is to define pieces of a pipeline within an R session. Rather than being language agnostic (like make must be), remake is unapologetically R focussed.
It is not on CRAN yet, and I haven't tried it, but it looks very interesting.
I would give Bazel a try for this. It is primarily a software build system, but with its genrule rule type it can perform pretty arbitrary file generation, too.
Bazel is very extensible, using its Python-like Starlark language, which should be far easier to use for complicated tasks than make. You can start by writing simple genrule steps by hand, then refactor common patterns into macros, and, if things become more complicated, even write your own rules. So you should be able to express your individual transformations at a high level that models how you think about them, then turn that representation into lower-level constructs using something that feels like a proper programming language.
Where make depends on timestamps, Bazel checks fingerprints. So if any one step produces the same output even though one of its inputs changed, then subsequent steps won't need to be re-computed. If some of your data-processing steps project or filter data, there might be a high probability of this kind of thing happening.
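As a toy illustration of the principle (this is not Bazel's API, just the idea in a few lines of Python): a step re-runs only when the content hash of its inputs changes, so an upstream edit that leaves an intermediate file byte-identical stops the rebuild from propagating.

    import hashlib
    import json
    import os

    CACHE_FILE = ".fingerprints.json"

    def fingerprint(paths):
        """Content hash of the input files, independent of timestamps."""
        h = hashlib.sha256()
        for p in sorted(paths):
            with open(p, "rb") as f:
                h.update(f.read())
        return h.hexdigest()

    def run_step(name, inputs, action):
        """Run `action` only if the inputs' fingerprint changed since last time."""
        cache = {}
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                cache = json.load(f)
        fp = fingerprint(inputs)
        if cache.get(name) == fp:
            print(f"{name}: inputs unchanged, skipping")
            return
        action()
        cache[name] = fp
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)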
I see your question is tagged for R, even though it doesn't mention it much. Under the hood, R computations in Bazel would still boil down to R CMD invocations on the shell, but you could have complex multi-line commands assembled in sophisticated ways to read your inputs, process them, and store the outputs. If the cost of initializing the R binary is a concern, Rserve might help, although using it would make the setup depend on a locally accessible Rserve instance, I believe. Even with that, I see nothing that would avoid the cost of storing the data to file and loading it back from file. If you want something that avoids that cost by keeping things in memory between steps, then you'd be looking at a very R-specific tool, not a generic tool like you requested.
In terms of “visually showing how results were generated”, bazel query --output graph can be used to generate a graphviz dot file of the dependency graph.
Disclaimer: I'm currently working at Google, which internally uses a variant of Bazel called Blaze. Actually Bazel is the open-source released version of Blaze. I'm very familiar with using Blaze, but not with setting up Bazel from scratch.
Red-R has a concept of data flow programming. I have not tried it yet.

Resources