=================================================================
Working with Search Indexes: zc.relation Catalog Extended Example
=================================================================

.. contents::
   :local:

Introduction
============

This document assumes you have read the README.txt document, and want to learn
a bit more by example. In it, we will explore a set of relations that
demonstrates most of the aspects of working with search indexes and listeners.
It will not explain the topics that the other documents already addressed. It
also describes an advanced use case.

As we have seen in the other documents, the relation catalog supports
search indexes.  These can return the results of any search, as desired.
 Of course, the intent is that you supply an index that optimizes the
particular searches it claims.

The searchindex module supplies a few search indexes, optimizing
specified transitive and intransitive searches.  We have seen them working
in other documents.  We will examine them more in depth in this document.

Search indexes update themselves by receiving messages via a "listener"
interface.  We will also look at how this works.

The example described in this file examines a use case similar to that in
the zc.revision or zc.vault packages: a relation describes a graph of
other objects.  Therefore, this is our first concrete example of purely
extrinsic relations.

Let's build the example story a bit.  Imagine we have a graph, often a
hierarchy, of tokens--integers.  Relations specify that a given integer
token relates to other integer tokens, with a containment denotation or
other meaning.

The integers may also have relations that specify that they represent an
object or objects.

This allows us to have a graph of objects in which changing one part of the
graph does not require changing the rest.  zc.revision and zc.vault thus
are able to model graphs that can have multiple revisions efficiently and
with quite a bit of metadata to support merges.

Let's imagine a simple hierarchy.  The relation has a `token` attribute
and a `children` attribute; children point to tokens. Relations will
identify themselves with ids.

    >>> import BTrees
    >>> relations = BTrees.family64.IO.BTree()
    >>> relations[99] = None # just to give us a start
    
    >>> class Relation(object):
    ...     def __init__(self, token, children=()):
    ...         self.token = token
    ...         self.children = BTrees.family64.IF.TreeSet(children)
    ...         self.id = relations.maxKey() + 1
    ...         relations[self.id] = self
    ...

    >>> def token(rel, self):
    ...     return rel.token
    ...
    >>> def children(rel, self):
    ...     return rel.children
    ...
    >>> def dumpRelation(obj, index, cache):
    ...     return obj.id
    ...
    >>> def loadRelation(token, index, cache):
    ...     return relations[token]
    ...

The standard TransposingTransitiveQueriesFactory will be able to handle this
quite well, so we'll use that for our index.

    >>> import zc.relation.queryfactory
    >>> factory = zc.relation.queryfactory.TransposingTransitive(
    ...     'token', 'children')
    >>> import zc.relation.catalog
    >>> catalog = zc.relation.catalog.Catalog(
    ...     dumpRelation, loadRelation, BTrees.family64.IO, BTrees.family64)
    >>> catalog.addValueIndex(token)
    >>> catalog.addValueIndex(children, multiple=True)
    >>> catalog.addDefaultQueryFactory(factory)

Now let's quickly create a hierarchy and index it.

    >>> for token, children in (
    ...     (0, (1, 2)), (1, (3, 4)), (2, (10, 11, 12)), (3, (5, 6)),
    ...     (4, (13, 14)), (5, (7, 8, 9)), (6, (15, 16)), (7, (17, 18, 19)),
    ...     (8, (20, 21, 22)), (9, (23, 24)), (10, (25, 26)),
    ...     (11, (27, 28, 29, 30, 31, 32))):
    ...     catalog.index(Relation(token, children))
    ...

[#queryFactory]_ That hierarchy is arbitrary.  Here's what we have, in terms of tokens
pointing to children::

                                  _____________0_____________
                                 /                           \
                        ________1_______                ______2____________
                       /                \              /          |        \
                ______3_____            _4_          10       ____11_____   12
               /            \          /   \         / \     / /  |  \ \ \
       _______5_______       6       13     14     25  26  27 28 29 30 31 32
      /       |       \     / \
    _7_      _8_       9   15 16
   / | \    / | \     / \
 17 18 19  20 21 22  23 24

Twelve relations, with tokens 0 through 11, and a total of 33 tokens,
including children.  The ids for the 12 relations are 100 through 111,
corresponding with the tokens of 0 through 11.

Without a transitive search index, we can get all transitive results. 
The results are iterators.

    >>> res = catalog.findRelationTokens({'token': 0})
    >>> getattr(res, 'next') is None
    False
    >>> print getattr(res, '__len__', None)
    None
    >>> sorted(res)
    [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111]
    >>> list(res)
    []
    
    >>> res = catalog.findValueTokens('children', {'token': 0})
    >>> sorted(res) == range(1, 33)
    True
    >>> list(res)
    []

[#findValuesUnindexed]_ `canFind` also can work transitively, and will
use transitive search indexes, as we'll see below.

    >>> catalog.canFind({'token': 1}, targetQuery={'children': 23})
    True
    >>> catalog.canFind({'token': 2}, targetQuery={'children': 23})
    False
    >>> catalog.canFind({'children': 23}, targetQuery={'token': 1})
    True
    >>> catalog.canFind({'children': 23}, targetQuery={'token': 2})
    False

`findRelationTokenChains` won't change, but we'll include it in the
discussion and examples to show that.

    >>> res = catalog.findRelationTokenChains({'token': 2})
    >>> chains = list(res)
    >>> len(chains)
    3
    >>> len(list(res))
    0

Transitive Search Indexes
=========================

Now we can add a couple of transitive search index.  We'll talk about
them a bit first.

There is currently one variety of transitive index, which indexes
relation and value searches for the transposing transitive query
factory.

The index can only be used under certain conditions.

    - The search is not a request for a relation chain.
    
    - It does not specify a maximum depth.
    
    - Filters are not used.

If it is a value search, then specific value indexes cannot be used if a
target filter or target query are used, but the basic relation index can
still be used in that case.

The usage of the search indexes is largely transparent: set them up, and
the relation catalog will use them for the same API calls that used more
brute force previously.  The only difference from external uses is that
results that use an index will usually be a BTree structure, rather than
an iterator.

When you add a transitive index for a relation, you must specify the
transitive name (or names) of the query, and the same for the reverse. 
That's all we'll do now.

    >>> import zc.relation.searchindex
    >>> catalog.addSearchIndex(
    ...     zc.relation.searchindex.TransposingTransitiveMembership(
    ...         'token', 'children', names=('children',)))

Now we should have a search index installed.

Notice that we went from parent (token) to child: this index is primarily
designed for helping transitive membership searches in a hierarchy. Using it to
index parents would incur a lot of write expense for not much win.

There's just a bit more you can specify here: static fields for a query
to do a bit of filtering.  We don't need any of that for this example.

Now how does the catalog use this index for searches?  Three basic ways,
depending on the kind of search, relations, values, or `canFind`.
Before we start looking into the internals, let's verify that we're getting
what we expect: correct answers, and not iterators, but BTree structures.

    >>> res = catalog.findRelationTokens({'token': 0})
    >>> list(res)
    [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111]
    >>> list(res)
    [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111]
    
    >>> res = catalog.findValueTokens('children', {'token': 0})
    >>> list(res) == range(1, 33)
    True
    >>> list(res) == range(1, 33)
    True

    >>> catalog.canFind({'token': 1}, targetQuery={'children': 23})
    True
    >>> catalog.canFind({'token': 2}, targetQuery={'children': 23})
    False

[#findValuesIndexed]_ Note that the last two `canFind` examples from
when we went through these examples without an index do not use the
index, so we don't show them here: they look the wrong direction for
this index.

So how do these results happen?

The first, `findRelationTokens`, and the last, `canFind`, are the most
straightforward.  The index finds all relations that match the given
query, intransitively. Then for each relation, it looks up the indexed
transitive results for that token.  The end result is the union of all
indexed results found from the intransitive search.  `canFind` simply
casts the result into a boolean.

`findValueTokens` is the same story as above with only one more step.  After
the union of relations is calculated, the method returns the union of the
sets of the requested value for all found relations.

It will maintain itself when relations are reindexed.

    >>> rel = list(catalog.findRelations({'token': 11}))[0]
    >>> for t in (27, 28, 29, 30, 31):
    ...     rel.children.remove(t)
    ...
    >>> catalog.index(rel)

    >>> catalog.findValueTokens('children', {'token': 0})
    ... # doctest: +NORMALIZE_WHITESPACE
    LFSet([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
           20, 21, 22, 23, 24, 25, 26, 32])
    >>> catalog.findValueTokens('children', {'token': 2})
    LFSet([10, 11, 12, 25, 26, 32])
    >>> catalog.findValueTokens('children', {'token': 11})
    LFSet([32])
    
    >>> rel.children.remove(32)
    >>> catalog.index(rel)

    >>> catalog.findValueTokens('children', {'token': 0})
    ... # doctest: +NORMALIZE_WHITESPACE
    LFSet([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
           20, 21, 22, 23, 24, 25, 26])
    >>> catalog.findValueTokens('children', {'token': 2})
    LFSet([10, 11, 12, 25, 26])
    >>> catalog.findValueTokens('children', {'token': 11})
    LFSet([])
    
    >>> rel.children.insert(27)
    1
    >>> catalog.index(rel)

    >>> catalog.findValueTokens('children', {'token': 0})
    ... # doctest: +NORMALIZE_WHITESPACE
    LFSet([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
           20, 21, 22, 23, 24, 25, 26, 27])
    >>> catalog.findValueTokens('children', {'token': 2})
    LFSet([10, 11, 12, 25, 26, 27])
    >>> catalog.findValueTokens('children', {'token': 11})
    LFSet([27])

When the index is copied, the search index is copied.

    >>> new = catalog.copy()
    >>> res = list(new.iterSearchIndexes())
    >>> len(res)
    1
    >>> new_index = res[0]
    >>> res = list(catalog.iterSearchIndexes())
    >>> len(res)
    1
    >>> old_index = res[0]
    >>> new_index is old_index
    False
    >>> old_index.index is new_index.index
    False
    >>> list(old_index.index.keys()) == list(new_index.index.keys())
    True
    >>> for key, value in old_index.index.items():
    ...     v = new_index.index[key]
    ...     if v is value or list(v) != list(value):
    ...         print 'oops', key, value, v
    ...         break
    ... else:
    ...     print 'good'
    ...
    good
    >>> old_index.names is not new_index.names
    True
    >>> list(old_index.names) == list(new_index.names)
    True
    >>> for name, old_ix in old_index.names.items():
    ...     new_ix = new_index.names[name]
    ...     if new_ix is old_ix or list(new_ix.keys()) != list(old_ix.keys()):
    ...         print 'oops'
    ...         break
    ...     for key, value in old_ix.items():
    ...         v = new_ix[key]
    ...         if v is value or list(v) != list(value):
    ...             print 'oops', name, key, value, v
    ...             break
    ...     else:
    ...         continue
    ...     break
    ... else:
    ...     print 'good'
    ...
    good

Helpers
=======

When writing search indexes and query factories, you often want complete
access to relation catalog data.  We've seen a number of these tools already:

- `getRelationModuleTools` gets a dictionary of the BTree tools needed to
  work with relations.

    >>> sorted(catalog.getRelationModuleTools().keys())
    ... # doctest: +NORMALIZE_WHITESPACE
    ['BTree', 'Bucket', 'Set', 'TreeSet', 'difference', 'dump',
     'intersection', 'load', 'multiunion', 'union']

    'multiunion' is only there if the BTree is an I* or L* module.
    Use the zc.relation.catalog.multiunion helper function to do the
    best union you can for a given set of tools.

- `getValueModuleTools` does the same for indexed values.

    >>> tools = set(('BTree', 'Bucket', 'Set', 'TreeSet', 'difference',
    ...              'dump', 'intersection', 'load', 'multiunion', 'union'))
    >>> tools.difference(catalog.getValueModuleTools('children').keys())
    set([])

    >>> tools.difference(catalog.getValueModuleTools('token').keys())
    set([])

- `getRelationTokens` can return all of the tokens in the catalog.

    >>> len(catalog.getRelationTokens()) == len(catalog)
    True

    This also happens to be equivalent to `findRelationTokens` with an empty
    query.
    
    >>> catalog.getRelationTokens() is catalog.findRelationTokens({})
    True

    It also can return all the tokens that match a given query, or None if
    there are no matches.

    >>> catalog.getRelationTokens({'token': 0}) # doctest: +ELLIPSIS
    <BTrees.LOBTree.LOTreeSet object at ...>
    >>> list(catalog.getRelationTokens({'token': 0}))
    [100]

    This also happens to be equivalent to `findRelationTokens` with a query,
    a maxDepth of 1, and no other arguments.

    >>> catalog.findRelationTokens({'token': 0}, maxDepth=1) is (
    ...     catalog.getRelationTokens({'token': 0}))
    True
    
    Except that if there are no matches, `findRelationTokens` returns an empty
    set (so it *always* returns an iterable).
    
    >>> catalog.findRelationTokens({'token': 50}, maxDepth=1)
    LOSet([])
    >>> print catalog.getRelationTokens({'token': 50})
    None

- `getValueTokens` can return all of the tokens for a given value name in
    the catalog.

    >>> list(catalog.getValueTokens('token')) == range(12)
    True

    This is identical to catalog.findValueTokens with a name only (or with
    an empty query, and a maxDepth of 1).

    >>> list(catalog.findValueTokens('token')) == range(12)
    True
    >>> catalog.findValueTokens('token') is catalog.getValueTokens('token')
    True

    It can also return the values for a given token.

    >>> list(catalog.getValueTokens('children', 100))
    [1, 2]

    This is identical to catalog.findValueTokens with a name and a query of
    {None: token}.

    >>> list(catalog.findValueTokens('children', {None: 100}))
    [1, 2]
    >>> catalog.getValueTokens('children', 100) is (
    ...     catalog.findValueTokens('children', {None: 100}))
    True
    
    Except that if there are no matches, `findValueTokens` returns an empty
    set (so it *always* returns an iterable); while getValueTokens will
    return None if the relation has no values (or the relation is unknown).
    
    >>> catalog.findValueTokens('children', {None: 50}, maxDepth=1)
    LFSet([])
    >>> print catalog.getValueTokens('children', 50)
    None
    
    >>> rel.children.remove(27)
    >>> catalog.index(rel)
    >>> catalog.findValueTokens('children', {None: rel.id}, maxDepth=1)
    LFSet([])
    >>> print catalog.getValueTokens('children', rel.id)
    None

- `yieldRelationTokenChains` is a search workhorse for searches that use a
    query factory.  TODO: describe.

.. ......... ..
.. Footnotes ..
.. ......... ..

.. [#queryFactory] The query factory knows when it is not needed--not only
    when neither of its names are used, but also when both of its names are
    used.
    
    >>> list(catalog.findRelationTokens({'token': 0, 'children': 1}))
    [100]

.. [#findValuesUnindexed] When values are the same as their tokens,
    `findValues` returns the same result as `findValueTokens`.  Here
    we see this without indexes.
    
    >>> list(catalog.findValueTokens('children', {'token': 0})) == list(
    ...     catalog.findValues('children', {'token': 0}))
    True

.. [#findValuesIndexed] Again, when values are the same as their tokens,
    `findValues` returns the same result as `findValueTokens`.  Here
    we see this with indexes.
    
    >>> list(catalog.findValueTokens('children', {'token': 0})) == list(
    ...     catalog.findValues('children', {'token': 0}))
    True