Optimizing Relation Catalog Use
===============================

Several best practices and optimization opportunities apply to the
catalog.

- Use integer-keyed BTree sets when possible.  They can use the BTrees'
  `multiunion` for a speed boost.  Integers' __cmp__ is reliable and is
  implemented in C.

- Never use persistent objects as keys.  They will cause a database load every
  time you need to look at them, they take up memory and object cache space,
  and they (as of this writing) disable conflict resolution.  Intids (or
  similar) are your best bet for representing objects; other immutable values
  such as strings are the next-best bet; and zope.app.keyreferences (or
  similar) come after that.

- Use multiple-token values in your queries when possible, especially in your
  transitive query factories.

- Use the cache when you are loading and dumping tokens, and in your
  transitive query factories.

- When possible, don't load or dump tokens (the values themselves may be used
  as tokens).  This is especially important when you have multiple tokens:
  store them in a BTree structure from the same BTrees module that the
  zc.relation catalog uses for the value.

For some operations, particularly those with hundreds or thousands of members
in a single relation value, some of these optimizations can speed up
common-case reindexing work by around 100 times.

The easiest (and perhaps least useful) optimization relies on the fact
that all dump calls and all load calls generated by a single operation
share a cache dictionary per call type (dump/load), per indexed relation
value.  Therefore, for instance, we could stash an intids utility in the
cache, so that the utility lookup happens only once and every later
access is a single dictionary lookup.  This is what the default
`generateToken` and `resolveToken` functions in zc.relationship's
index.py do: look at them for an example.

A further optimization is to not load or dump tokens at all, and instead to
use values that can themselves serve as tokens.  This is particularly useful
if the tokens have __cmp__ (or equivalent) implemented in C, as built-in types
such as ints do.  To specify this behavior, create an index with the 'load'
and 'dump' values for the indexed attribute descriptions set to None.  In the
example below, the 'subjects' and 'objects' indexes get this behavior by
leaving both arguments out.


    >>> import zope.interface
    >>> class IRelation(zope.interface.Interface):
    ...     subjects = zope.interface.Attribute(
    ...         'The sources of the relation; the subject of the sentence')
    ...     relationtype = zope.interface.Attribute(
    ...         '''unicode: the single relation type of this relation;
    ...         usually contains the verb of the sentence.''')
    ...     objects = zope.interface.Attribute(
    ...         '''the targets of the relation; usually a direct or
    ...         indirect object in the sentence''')
    ...

    >>> import BTrees
    >>> relations = BTrees.family32.IO.BTree()
    >>> relations[99] = None # just to give us a start

    >>> @zope.interface.implementer(IRelation)
    ... class Relation(object):
    ...     def __init__(self, subjects, relationtype, objects):
    ...         self.subjects = subjects
    ...         assert relationtype in relTypes
    ...         self.relationtype = relationtype
    ...         self.objects = objects
    ...         self.id = relations.maxKey() + 1
    ...         relations[self.id] = self
    ...     def __repr__(self):
    ...         return '<%r %s %r>' % (
    ...             self.subjects, self.relationtype, self.objects)

    >>> def token(rel, self):
    ...     return rel.token
    ...
    >>> def children(rel, self):
    ...     return rel.children
    ...
    >>> def dumpRelation(obj, index, cache):
    ...     return obj.id
    ...
    >>> def loadRelation(token, index, cache):
    ...     return relations[token]
    ...

    >>> relTypes = ['has the role of']
    >>> def relTypeDump(obj, index, cache):
    ...     assert obj in relTypes, 'unknown relationtype'
    ...     return obj
    ...
    >>> def relTypeLoad(token, index, cache):
    ...     assert token in relTypes, 'unknown relationtype'
    ...     return token
    ...

    >>> import zc.relation.catalog
    >>> catalog = zc.relation.catalog.Catalog(
    ...     dumpRelation, loadRelation)
    >>> catalog.addValueIndex(IRelation['subjects'], multiple=True)
    >>> catalog.addValueIndex(
    ...     IRelation['relationtype'], relTypeDump, relTypeLoad,
    ...     BTrees.family32.OI, name='reltype')
    >>> catalog.addValueIndex(IRelation['objects'], multiple=True)
    >>> import zc.relation.queryfactory
    >>> factory = zc.relation.queryfactory.TransposingTransitive(
    ...     'subjects', 'objects')
    >>> catalog.addDefaultQueryFactory(factory)

    >>> rel = Relation((1,), 'has the role of', (2,))
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 1}))
    [2]

If you have single relations that relate hundreds or thousands of
objects, it can be a huge win if the value is a 'multiple' of the same
type as the stored BTree for the given attribute.  The default BTree
family for attributes is IFBTree; IOBTree is also a good choice, and may
be preferable for some applications.

    >>> catalog.unindex(rel)
    >>> rel = Relation(
    ...     BTrees.family32.IF.TreeSet((1,)), 'has the role of',
    ...     BTrees.family32.IF.TreeSet())
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 1}))
    []
    >>> list(catalog.findValueTokens('subjects', {'objects': None}))
    [1]

Reindexing is where some of the big improvements can happen.  The following
gyrations exercise the optimization code.

    >>> rel.objects.insert(2)
    1
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 1}))
    [2]
    >>> rel.subjects = BTrees.family32.IF.TreeSet((3,4,5))
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 3}))
    [2]

    >>> rel.subjects.insert(6)
    1
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 6}))
    [2]

    >>> rel.subjects.update(range(100, 200))
    100
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 100}))
    [2]

    >>> rel.subjects = BTrees.family32.IF.TreeSet((3,4,5,6))
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 3}))
    [2]

    >>> rel.subjects = BTrees.family32.IF.TreeSet(())
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 3}))
    []

    >>> rel.subjects = BTrees.family32.IF.TreeSet((3,4,5))
    >>> catalog.index(rel)
    >>> list(catalog.findValueTokens('objects', {'subjects': 3}))
    [2]

tokenizeValues and resolveValueTokens work correctly without loaders and
dumpers; that is, they return the values unchanged.

    >>> catalog.tokenizeValues((3,4,5), 'subjects')
    (3, 4, 5)
    >>> catalog.resolveValueTokens((3,4,5), 'subjects')
    (3, 4, 5)