spaCy: adding a pointer to another token in a custom component

I am trying to find out how token.head and token.children are implemented. I want to replicate this implementation as I add a custom component for SRL to my spaCy pipeline.
That is, each token should be able to point to the predicates for which it is an argument. Intuitively, I think this should work much like token.children, which (I believe) returns a generator over the actual dependent child Token objects.
I assume I should not simply store this as an attribute on the token, since that seems neither memory-efficient nor free of redundancy. Does anyone know the correct way to implement this? Or is this handled implicitly by spaCy's Underscore.set method?
Thanks!

The Token object is only a view -- it's sort of like holding a reference to the Doc object, and an index to the token. The Span object is like this too. This ensures there's a single source of truth, and only one copy of the data.
You can find the definition of the key structs in the spacy/structs.pxd file. This defines the attributes of the TokenC struct. The Doc object then holds an array of these, and a length. The Token objects are created on the fly when you index into the Doc. The data definition for the Doc object can be found in spacy/tokens/doc.pxd, and the implementation of the token access is in spacy/tokens/doc.pyx.
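At the Python level you can see this view behaviour when you index into a Doc (a minimal illustration; spacy.blank is used here so no trained model is needed):
import spacy

nlp = spacy.blank("en")            # tokenizer only, no trained components required
doc = nlp("The cat sat on the mat")

token = doc[1]                     # Token is created on the fly when you index the Doc
print(token.text, token.i)         # cat 1
print(token.doc is doc)            # True -- the Token is just a view back into this Doc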
The way the parse tree is encoded in spaCy is a bit unsatisfying. I've made an issue about this on the tracker --- it feels like there should be a better solution.
What we do is encode the offset of the head relative to the token. So if you do &doc.c[i] + doc.c[i].head you'll get a pointer to the head. That part is okay. The part that's a bit weirder is that we track the left and right edges of the token's subtree, and the number of direct left and right children. To get the rightmost or leftmost child, we navigate around within this region. In practice this actually works pretty well because we're dealing with a contiguous block of memory, and loops in Cython are fast. But it still feels a bit janky.
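At the Python API level the same relationship is visible as an index offset (a small sketch; it assumes the en_core_web_sm model is installed so the parser can set heads):
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes: python -m spacy download en_core_web_sm
doc = nlp("The cat sat on the mat")
for token in doc:
    # The C struct stores the head as a signed offset from the token's own index;
    # token.head resolves that offset back to a Token view.
    offset = token.head.i - token.i
    print(f"{token.text:>5} -> head {token.head.text!r} at relative offset {offset:+d}")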
As far as what you'll be able to do as a user...If you run your own fork of spaCy you can happily define your own data on the structs. But then you're running your own fork.
There's no way to attach "real" attributes to the Doc or Token objects, as these are defined as C-level types --- their structure is defined statically, not dynamically. You could subclass Doc, but this is quite ugly: you end up needing to subclass other objects as well.
This is why we have the underscore attributes, and the doc.user_data dictionary. It's really the only way to extend the objects. Fortunately you shouldn't really face a data redundancy problem. Nothing is stored on the Token objects. The definitions of your extensions are stored globally, within the Underscore class. Data is stored on the Doc object, even if it applies to a token --- again, the Token is a view. It can't own anything. So the Doc has to note that we have some value assigned to token i.
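As a small sketch of how that looks in practice (the attribute name here is made up for illustration):
import spacy
from spacy.tokens import Token

# The extension definition is registered once, globally, on the Underscore machinery --
# nothing is added to individual Token objects.
Token.set_extension("srl_note", default=None)

nlp = spacy.blank("en")
doc = nlp("The cat chased the mouse")
doc[2]._.srl_note = "predicate"

# The value itself is stored on the Doc (token-level extension data lives in doc.user_data),
# because the Token is only a view and owns no data of its own.
print(doc[2]._.srl_note)        # predicate
print(len(doc.user_data) > 0)   # True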
If you're defining a tree-navigation system, I'd recommend considering defining it as your own Cython class, so you can use structs. If you use native Python types it'll be pretty slow and pretty large. If you pack the data into numpy arrays the representation will be more compact, but writing the code will be a pretty miserable experience, and the performance is likely to be not great.
In short:
Define your own types in Cython. Put the data into a struct owned by a cdef class, and give the class accessor methods.
Use the underscore attributes to access the data from spaCy's Doc, Span and Token objects.
If you come up with a compelling API for SRL and the data can be coded compactly into the TokenC struct, we'd consider adding it as native support.
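Putting the underscore advice together with the original SRL use case, here is a rough pure-Python sketch (so without the Cython speed recommended above; all names are hypothetical) that stores predicate links as token indices on the Doc and exposes them on tokens much like token.children:
import spacy
from spacy.tokens import Doc, Token

# Store only token indices on the Doc; hand out Token views lazily via a getter.
Doc.set_extension("srl_edges", default=None)      # e.g. {argument_index: [predicate_index, ...]}
Token.set_extension(
    "predicates",
    getter=lambda token: (token.doc[i]
                          for i in (token.doc._.srl_edges or {}).get(token.i, [])),
)

nlp = spacy.blank("en")
doc = nlp("The cat chased the mouse")
doc._.srl_edges = {1: [2], 4: [2]}   # pretend SRL output: "cat" and "mouse" are arguments of "chased"
print([t.text for t in doc[1]._.predicates])      # ['chased']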

Related

Json.Net - Partial Deserialization with Anonymous Types

After briefly examining this source code, it seems that Json.Net's JsonConvert.DeserializeAnonymousType method parses the entire JSON input from beginning to end, even if the specified anonymous type only has a single property that matches at the beginning of the input, and even if the input is 5GB long (to exaggerate the problem).
Furthermore, the CheckAdditionalContent property seems to only apply to content that appears after the object ends; i.e., bad JSON syntax and/or unexpected data.
I did not find a setting that will cause a partial read; i.e., stop reading when all of the properties have been discovered in the input.
To be clear, I'm expecting a behavior similar to this partial deserialization example to be the default when specifying an anonymous type.
The reasons for this are simple:
I have no control over the source of the JSON in my service (it comes from a third-party).
The content is typically 100 times larger than what I actually need from it (e.g., I need to read only 1 property, but there are 100 properties in the content).
Performance is the most important factor to me. I understand that the worst-case performance is still similar to reading the entire object, but in many cases you can be quite confident that you won't need to read most of it; e.g., if I only need to read an ID property, and in every JSON input my service receives, the ID property is the first one in the object!
Of course, the obvious alternative is to not use the Anonymous Type overload; however, I'm asking this question because it's about EXPECTED behavior. And anyway, I greatly prefer this method over the alternatives, syntactically speaking.
Am I wrong, or does Json.Net in fact always deserialize the entire input when specifying an anonymous type as the contract?

Identify the ObjectFactory method used to create the object

Given an object o which was or could have been created with a JAXB ObjectFactory, what's the best way to find the method in that ObjectFactory which would be used to create the object?
My objective is to be able to generate Java code sufficient to recreate that object (ie one or more createXYZ statements).
Does the answer change if I commit to a specific JAXB implementation, say MOXy, for example?
Suppose I just know the object is from some JAXBContext (so one of several ObjectFactory classes could have been used to create it). Does this change the answer at all?
Where the object is a JAXBElement, @XmlElementDecl comes into play. @XmlElementDecl can have a scope; my JAXB objects know their parent, so hopefully this matches the scope.
I've written some proof-of-concept code which uses getGenericReturnType and getAnnotation(XmlElementDecl.class) to find the method, but I'm guessing there is likely to be something in one of the JAXB implementations that could be reused to do this more effectively/elegantly.

Determining type of CollectionBase via Reflections (or Microsoft.Cci)

Question:
Is there a static way to reliably determine the type contained by a type derived from CollectionBase, using Reflection or Microsoft.Cci?
Background:
I am working on a code generator that copies types, makes customized versions of those types, and creates converters between them. It walks the types in the source assembly via Microsoft.Cci. It prints out source code using textual templates. It does a lot of conversion and customization, and tosses out code that I don't care about.
In my resulting code, I intend to use List<T> everywhere that a CollectionBase, IEnumerable<T>, or T[] was previously used. I want to use List<T> because I am pretty sure I can serialize it without extra work, which is important for my application. T is concrete in every case. I am trying not to copy the CollectionBase classes, because I'd have to copy over their custom implementations, and I'd like to avoid having to do that in my code generator.
The only part I'm having a problem with is determining T for List<T> when replacing a custom CollectionBase.
What I've done so far:
I have briefly looked at the MSDN docs and samples for CollectionBase, and they mention creating a custom Add method on your derived type. I don't think this is in any way enforced, so I'm not sure I can rely on that. An implementor could name it something else, or worse, have a collection that supports multiple types, with Object as their only common ancestor.
Alternatives I have considered:
Maybe the default serialization does some tricks that I can take advantage of. Is there a default serialization for CollectionBase collections, or do you generally have to implement it yourself? If you have to do it yourself, is there some reliable metadata I could look at in order to determine the types? If it supports default serialization, does it rely on the runtime types of the items in the collection?
I could make a mapping in my code generator of known CollectionBase types, mapped to their corresponding T for List<T>. If a given CollectionBase type that I encounter isn't in the list, throw an exception. This is probably what I'll go with if there isn't a reliable alternative.
I'm still not sure enough about what you want to do to give advice. Still, do your CollectionBase-derived classes all implement Add(T)? If so, you could look for an Add method with a single parameter of a type other than object, and use that type for T.

Flash: AMF3 with reference tables?

AMF3 specification defines use of so called "reference tables" (see Section 2.2 of this specification).
I implemented this behavior in the AMF3 encoder/decoder I developed in Erlang, but since I am not very experienced with the Flash API, I cannot find an easy way to force Flash to use these reference tables when serializing objects to AMF3. For example, if I use ByteArray, it seems to just repeat the full object encoding:
var ba:ByteArray = new ByteArray();
ba.writeObject("some string1");
ba.writeObject("some string1");
# =>
# <<6,25,115,111,109,101,32,115,116,114,105,110,103,49,
# 6,25,115,111,109,101,32,115,116,114,105,110,103,49>>
(which is clearly a repetition).
However, if these two strings are in a one single writeObject call, it does seem to use references:
ba.writeObject(["some string1", "some string1"]);
# => <<9,5,1,6,25,115,111,109,101,32,115,116,114,105,110,103,49,6,0>>
Socket seems to behave the same way.
So, can I make use of reference tables in Flash code (given that I might have a non-standard protocol between the Flash application and the server)?
Thank you!
I think the difference is that in the first example you're writing two string literals, while in the second example you're writing an array (a Complex Object in Adobe's spec) that holds references to two strings. So if you reference the string from an object or an array, it will be written to the reference table.
This isn't necessarily a way to enforce it, but it seems logical that the AMF serializer built into Flash would serialize objects this way, so it is probably a reliable way to get the behavior you want (reference-table strings).
I hope that is helpful to you!
As per the final sentence of the AMF3 specification (AMF 3.0 Spec at Adobe.com):
Also note that ByteArray uses a new set of implicit reference tables for objects, object traits and strings for each readObject and writeObject call.
It appears that the intention with ByteArray.writeObject is to create a serialization which could be stored or recovered on a per-object basis.
The NetConnection object's behavior is similar to what you had hoped for.
When updating the string-reference table, it is important not to add empty strings to it.
When maintaining the object-reference table, it may help to program defensively: the table is constructed recursively, and at times it contains objects whose traits are not yet completely known. If the table indices are not allocated in advance, the numbering will be inconsistent across applications. An AMF3 decoder should not use the traits from a partially-constructed object -- such input should be flagged as erroneous.
The string-reference table is implemented at the encoder by 'tagging' in-memory string objects as they are serialized. Encoding two different string objects with the same content (matching strings) does not seem to produce a reference: both strings are written out in full, and a string-by-reference is not used.
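To make the string-table rule concrete, here is a rough language-agnostic sketch in Python of how an AMF3 encoder writes a string either inline or as a back-reference (simplified to strings shorter than 128 bytes, so the U29 length fits in a single byte):
def encode_string(value, string_refs):
    # AMF3 string rule: first occurrence is written inline and added to the table,
    # later occurrences are written as a reference; empty strings are never added.
    out = bytearray([0x06])                         # AMF3 string type marker
    if value and value in string_refs:
        out.append(string_refs.index(value) << 1)   # low bit 0 -> reference to table index
        return bytes(out)
    if value:
        string_refs.append(value)
    data = value.encode("utf-8")
    out.append((len(data) << 1) | 1)                # low bit 1 -> inline, upper bits = byte length
    out += data
    return bytes(out)

refs = []
print(list(encode_string("some string1", refs)))  # [6, 25, 115, ...] -- the inline bytes shown above
print(list(encode_string("some string1", refs)))  # [6, 0] -- the back-reference seen in the array example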
There may be a solution to your original question. If you have a number of objects all belonging to the same class, and you would like to store them all in one place, I suggest the following: create a "parent object" with references to all the objects you intend to store, then use ByteArray.writeObject to persist that parent object. AMF will encode all of the referenced objects and will represent the traits of repeated object classes in an efficient way.
Look at the last page of the official AMF3 spec and you will see that ByteArray is pretty much worthless. You will have to write your own AMF3 serializer/deserializer.

Is there a tool to capture an object's state to disk?

What I would like to do is capture an object that's in memory to disk for testing purposes. Since it takes many steps to get to this state, I would like to capture it once and skip the steps.
I realize that I could mock these objects up manually but I'd rather "record" and "replay" real objects because I think this would be faster.
Edit: The question is about this entire process, not just the serialization of the object (file operations included), and about my hope that a tool exists to do this for standard objects.
I am interested in ActionScript specifically for this application, but...
Are there examples of this in other programming languages?
What is this process commonly called?
How would this be done in ActionScript?
Edit:
Are there tools that make serialization and file operations automatic (i.e. no special interfaces)?
Would anybody else find the proposed tool useful (if it doesn't exist)?
Use case of what I am thinking of:
ObjectSaver.save(objZombie,"zombie"); //save the object
var zombieClone:Zombie = ObjectSaver.get("zombie"); // get the object
and the disk location being configurable somewhere.
Converting objects to bytes (so that they can be saved to disk or transmitted over a network, etc.) is called serialization.
But in your case, I don't think serialization is that useful for testing purposes. When the test creates all of its test data every time the test is run, you can always trust that the test data is what you expect it to be, and that there are no side effects leaking from previous test runs.
I asked the same question for Flex a few days ago. ActionScript specifically doesn't have much support for serialization, though the JSON libraries mentioned in one of the responses looked promising.
Serialize Flex Objects to Save Restore Application State
I think you are talking about "object serialization".
It's called Serialization
Perl uses the Storable module to do this; I'm not sure about ActionScript.
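For another "other language" example, here is a rough Python analogue of the ObjectSaver use case from the question, using the standard pickle module (ObjectSaver and the save directory are the asker's hypothetical names, not a real library):
import pickle
from pathlib import Path

SAVE_DIR = Path("saved_objects")        # the "configurable disk location" from the question

class ObjectSaver:
    @staticmethod
    def save(obj, name):
        SAVE_DIR.mkdir(exist_ok=True)
        with open(SAVE_DIR / (name + ".pkl"), "wb") as f:
            pickle.dump(obj, f)         # serialize the whole object graph to disk

    @staticmethod
    def get(name):
        with open(SAVE_DIR / (name + ".pkl"), "rb") as f:
            return pickle.load(f)       # rebuild an equivalent object ("replay")

# Usage mirroring the question: ObjectSaver.save(zombie, "zombie"); clone = ObjectSaver.get("zombie")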
This used to be called "checkpointing" (although that usually means saving the state of the entire system). Have you considered serializing your object to some intermediate format, and then creating a constructor that can accept an object in that format and re-create the object based on that? That might be a more straightforward way to go.
What is this process commonly called?
Serializing / deserializing
Marshalling / unmarshalling
Deflating / inflating
Check out the flash.utils.IExternalizable interface. It can be used to serialize ActionScript objects into a ByteArray. The resulting data could easily be written to disk or used to clone objects.
Note that this is not "automatic". You have to manually implement the interface and write the readExternal() and writeExternal() functions for each class you want to serialize. You'll be hard pressed to find a way to serialize custom classes "automatically" because private members are only accessible within the class itself. You'll need to make everything that you need serialized public if you want to create an external serialization method.
The closest I've come to this is using the appcorelib ClassUtil to create XML objects from existing objects (saving the XML manually) and to create objects from that XML. For objects containing arrays of custom types, it takes configuring the ArrayElementType metadata tags and compiler options correctly, as described in the docs.
ClassUtil.createXMLfromObject(obj);
CreateClassFromXMLObject(obj,targetClass);
If you're using AIR, you can store Objects in the included local database.
Here's a simple example using a local SQLite database on the Adobe site, and more info on how data is stored in the database.