Optimizing Relation Catalog Use
===============================

There are several best practices and optimization opportunities regarding
the catalog.

- Use integer-keyed BTree sets when possible.  They can use the BTrees'
  `multiunion` for a speed boost.  Integers' __cmp__ is reliable, and in C.

- Never use persistent objects as keys.  They will cause a database load
  every time you need to look at them, they take up memory and object
  caches, and they (as of this writing) disable conflict resolution.  Intids
  (or similar) are your best bet for representing objects, other immutables
  such as strings are the next-best bet, and zope.app.keyreferences (or
  similar) come after that.

- Use multiple-token values in your queries when possible, especially in
  your transitive query factories.

- Use the cache when you are loading and dumping tokens, and in your
  transitive query factories.

- When possible, don't load or dump tokens (the values themselves may be
  used as tokens).  This is especially important when you have multiple
  tokens: store them in a BTree structure from the same BTrees module that
  the catalog uses for the value.

For some operations, particularly with hundreds or thousands of members in a
single relation value, some of these optimizations can speed up some
common-case reindexing work by around 100 times.

The easiest (and perhaps least useful) optimization is that all dump calls
and all load calls generated by a single operation share a cache dictionary
per call type (dump/load), per indexed relation value.  Therefore, for
instance, we could stash an intids utility, so that we only had to do a
utility lookup once, and thereafter it was only a single dictionary lookup.
This is what the default `generateToken` and `resolveToken` functions in
zc.relationship's index.py do: look at them for an example.

A further optimization is to not load or dump tokens at all, but use values
that may be tokens.  This will be particularly useful if the tokens have
__cmp__ (or equivalent) in C, such as built-in types like ints.  To specify
this behavior, you create an index with the 'load' and 'dump' values for the
indexed attribute descriptions explicitly set to None.

    >>> import zope.interface
    >>> class IRelation(zope.interface.Interface):
    ...     subjects = zope.interface.Attribute(
    ...         'The sources of the relation; the subject of the sentence')
    ...     relationtype = zope.interface.Attribute(
    ...         '''unicode: the single relation type of this relation;
    ...         usually contains the verb of the sentence.''')
    ...     objects = zope.interface.Attribute(
    ...         '''the targets of the relation; usually a direct or
    ...         indirect object in the sentence''')
    ...
    >>> import BTrees
    >>> relations = BTrees.family32.IO.BTree()
    >>> relations[99] = None # just to give us a start

    >>> @zope.interface.implementer(IRelation)
    ... class Relation(object):
    ...     def __init__(self, subjects, relationtype, objects):
    ...         self.subjects = subjects
    ...         assert relationtype in relTypes
    ...         self.relationtype = relationtype
    ...         self.objects = objects
    ...         self.id = relations.maxKey() + 1
    ...         relations[self.id] = self
    ...     def __repr__(self):
    ...         return '<%r %s %r>' % (
    ...             self.subjects, self.relationtype, self.objects)

    >>> def token(rel, self):
    ...     return rel.token
    ...
    >>> def children(rel, self):
    ...     return rel.children
    ...
    >>> def dumpRelation(obj, index, cache):
    ...     return obj.id
    ...
    >>> def loadRelation(token, index, cache):
    ...     return relations[token]
    ...

    >>> relTypes = ['has the role of']
    >>> def relTypeDump(obj, index, cache):
    ...     assert obj in relTypes, 'unknown relationtype'
    ...     return obj
    ...
    >>> def relTypeLoad(token, index, cache):
    ...     assert token in relTypes, 'unknown relationtype'
    ...     return token
    ...

    >>> import zc.relation.catalog
    >>> catalog = zc.relation.catalog.Catalog(
    ...     dumpRelation, loadRelation)
    >>> catalog.addValueIndex(IRelation['subjects'], multiple=True)
    >>> catalog.addValueIndex(
    ...     IRelation['relationtype'], relTypeDump, relTypeLoad,
    ...     BTrees.family32.OI, name='reltype')
    >>> catalog.addValueIndex(IRelation['objects'], multiple=True)
    >>> import zc.relation.queryfactory
    >>> factory = zc.relation.queryfactory.TransposingTransitive(
    ...     'subjects', 'objects')
    >>> catalog.addDefaultQueryFactory(factory)

    >>> rel = Relation((1,), 'has the role of', (2,))
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 1}))
    [2]

If you have single relations that relate hundreds or thousands of objects,
it can be a huge win if the value is a 'multiple' of the same type as the
stored BTree for the given attribute.  The default BTree family for
attributes is IFBTree; IOBTree is also a good choice, and may be preferable
for some applications.

    >>> catalog.unindex(rel)
    >>> rel = Relation(
    ...     BTrees.family32.IF.TreeSet((1,)), 'has the role of',
    ...     BTrees.family32.IF.TreeSet())
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 1}))
    []
    >>> list(catalog.findValueTokens('subjects', {'objects': None}))
    [1]

Reindexing is where some of the big improvements can happen.  The following
gyrations exercise the optimization code.

    >>> rel.objects.insert(2)
    1
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 1}))
    [2]
    >>> rel.subjects = BTrees.family32.IF.TreeSet((3,4,5))
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 3}))
    [2]

    >>> rel.subjects.insert(6)
    1
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 6}))
    [2]

    >>> rel.subjects.update(range(100, 200))
    100
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 100}))
    [2]

    >>> rel.subjects = BTrees.family32.IF.TreeSet((3,4,5,6))
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 3}))
    [2]

    >>> rel.subjects = BTrees.family32.IF.TreeSet(())
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 3}))
    []

    >>> rel.subjects = BTrees.family32.IF.TreeSet((3,4,5))
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 3}))
    [2]

tokenizeValues and resolveValueTokens work correctly without loaders and
dumpers--that is, they do nothing.

    >>> catalog.tokenizeValues((3,4,5), 'subjects')
    (3, 4, 5)
    >>> catalog.resolveValueTokens((3,4,5), 'subjects')
    (3, 4, 5)
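
The shared cache dictionary described near the top of this document (the "stash an intids utility" idea) can be sketched without any Zope machinery.  This is only an illustration under stated assumptions: `ToyIntIds`, `expensive_utility_lookup`, `dump` and `load` are hypothetical stand-ins written for this example, not the actual `generateToken` and `resolveToken` code from zc.relationship.  Only the `(obj, index, cache)` and `(token, index, cache)` signatures follow the dump and load functions shown earlier.

```python
class ToyIntIds(object):
    """Hypothetical stand-in for an intids utility (not the real one)."""

    def __init__(self):
        self._by_obj = {}
        self._by_id = {}

    def register(self, obj):
        # Return the existing integer id, or assign the next free one.
        if obj not in self._by_obj:
            token = len(self._by_id) + 1
            self._by_obj[obj] = token
            self._by_id[token] = obj
        return self._by_obj[obj]

    def getObject(self, token):
        return self._by_id[token]


_registry = ToyIntIds()
lookup_count = 0


def expensive_utility_lookup():
    # Stands in for a component-registry lookup; counts how often it runs.
    global lookup_count
    lookup_count += 1
    return _registry


def dump(obj, index, cache):
    # Stash the utility in the shared cache on first use.
    intids = cache.get('intids')
    if intids is None:
        intids = cache['intids'] = expensive_utility_lookup()
    return intids.register(obj)


def load(token, index, cache):
    intids = cache.get('intids')
    if intids is None:
        intids = cache['intids'] = expensive_utility_lookup()
    return intids.getObject(token)


class Thing(object):
    pass


things = [Thing() for _ in range(5)]

# One cache dictionary per call type, shared by every call in an operation.
dump_cache = {}
tokens = [dump(thing, None, dump_cache) for thing in things]
load_cache = {}
loaded = [load(token, None, load_cache) for token in tokens]
```

In this sketch ten dump and load calls trigger only two utility lookups, one per cache dictionary; every later call is a single dictionary lookup.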