================================================================= Working with Search Indexes: zc.relation Catalog Extended Example ================================================================= .. contents:: :local: Introduction ============ This document assumes you have read the README.txt document, and want to learn a bit more by example. In it, we will explore a set of relations that demonstrates most of the aspects of working with search indexes and listeners. It will not explain the topics that the other documents already addressed. It also describes an advanced use case. As we have seen in the other documents, the relation catalog supports search indexes. These can return the results of any search, as desired. Of course, the intent is that you supply an index that optimizes the particular searches it claims. The searchindex module supplies a few search indexes, optimizing specified transitive and intransitive searches. We have seen them working in other documents. We will examine them more in depth in this document. Search indexes update themselves by receiving messages via a "listener" interface. We will also look at how this works. The example described in this file examines a use case similar to that in the zc.revision or zc.vault packages: a relation describes a graph of other objects. Therefore, this is our first concrete example of purely extrinsic relations. Let's build the example story a bit. Imagine we have a graph, often a hierarchy, of tokens--integers. Relations specify that a given integer token relates to other integer tokens, with a containment denotation or other meaning. The integers may also have relations that specify that they represent an object or objects. This allows us to have a graph of objects in which changing one part of the graph does not require changing the rest. zc.revision and zc.vault thus are able to model graphs that can have multiple revisions efficiently and with quite a bit of metadata to support merges. Let's imagine a simple hierarchy. The relation has a `token` attribute and a `children` attribute; children point to tokens. Relations will identify themselves with ids. >>> import BTrees >>> relations = BTrees.family64.IO.BTree() >>> relations[99] = None # just to give us a start >>> class Relation(object): ... def __init__(self, token, children=()): ... self.token = token ... self.children = BTrees.family64.IF.TreeSet(children) ... self.id = relations.maxKey() + 1 ... relations[self.id] = self ... >>> def token(rel, self): ... return rel.token ... >>> def children(rel, self): ... return rel.children ... >>> def dumpRelation(obj, index, cache): ... return obj.id ... >>> def loadRelation(token, index, cache): ... return relations[token] ... The standard TransposingTransitiveQueriesFactory will be able to handle this quite well, so we'll use that for our index. >>> import zc.relation.queryfactory >>> factory = zc.relation.queryfactory.TransposingTransitive( ... 'token', 'children') >>> import zc.relation.catalog >>> catalog = zc.relation.catalog.Catalog( ... dumpRelation, loadRelation, BTrees.family64.IO, BTrees.family64) >>> catalog.addValueIndex(token) >>> catalog.addValueIndex(children, multiple=True) >>> catalog.addDefaultQueryFactory(factory) Now let's quickly create a hierarchy and index it. >>> for token, children in ( ... (0, (1, 2)), (1, (3, 4)), (2, (10, 11, 12)), (3, (5, 6)), ... (4, (13, 14)), (5, (7, 8, 9)), (6, (15, 16)), (7, (17, 18, 19)), ... (8, (20, 21, 22)), (9, (23, 24)), (10, (25, 26)), ... (11, (27, 28, 29, 30, 31, 32))): ... catalog.index(Relation(token, children)) ... [#queryFactory]_ That hierarchy is arbitrary. Here's what we have, in terms of tokens pointing to children:: _____________0_____________ / \ ________1_______ ______2____________ / \ / | \ ______3_____ _4_ 10 ____11_____ 12 / \ / \ / \ / / | \ \ \ _______5_______ 6 13 14 25 26 27 28 29 30 31 32 / | \ / \ _7_ _8_ 9 15 16 / | \ / | \ / \ 17 18 19 20 21 22 23 24 Twelve relations, with tokens 0 through 11, and a total of 33 tokens, including children. The ids for the 12 relations are 100 through 111, corresponding with the tokens of 0 through 11. Without a transitive search index, we can get all transitive results. The results are iterators. >>> res = catalog.findRelationTokens({'token': 0}) >>> getattr(res, 'next') is None False >>> print getattr(res, '__len__', None) None >>> sorted(res) [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111] >>> list(res) [] >>> res = catalog.findValueTokens('children', {'token': 0}) >>> sorted(res) == range(1, 33) True >>> list(res) [] [#findValuesUnindexed]_ `canFind` also can work transitively, and will use transitive search indexes, as we'll see below. >>> catalog.canFind({'token': 1}, targetQuery={'children': 23}) True >>> catalog.canFind({'token': 2}, targetQuery={'children': 23}) False >>> catalog.canFind({'children': 23}, targetQuery={'token': 1}) True >>> catalog.canFind({'children': 23}, targetQuery={'token': 2}) False `findRelationTokenChains` won't change, but we'll include it in the discussion and examples to show that. >>> res = catalog.findRelationTokenChains({'token': 2}) >>> chains = list(res) >>> len(chains) 3 >>> len(list(res)) 0 Transitive Search Indexes ========================= Now we can add a couple of transitive search index. We'll talk about them a bit first. There is currently one variety of transitive index, which indexes relation and value searches for the transposing transitive query factory. The index can only be used under certain conditions. - The search is not a request for a relation chain. - It does not specify a maximum depth. - Filters are not used. If it is a value search, then specific value indexes cannot be used if a target filter or target query are used, but the basic relation index can still be used in that case. The usage of the search indexes is largely transparent: set them up, and the relation catalog will use them for the same API calls that used more brute force previously. The only difference from external uses is that results that use an index will usually be a BTree structure, rather than an iterator. When you add a transitive index for a relation, you must specify the transitive name (or names) of the query, and the same for the reverse. That's all we'll do now. >>> import zc.relation.searchindex >>> catalog.addSearchIndex( ... zc.relation.searchindex.TransposingTransitiveMembership( ... 'token', 'children', names=('children',))) Now we should have a search index installed. Notice that we went from parent (token) to child: this index is primarily designed for helping transitive membership searches in a hierarchy. Using it to index parents would incur a lot of write expense for not much win. There's just a bit more you can specify here: static fields for a query to do a bit of filtering. We don't need any of that for this example. Now how does the catalog use this index for searches? Three basic ways, depending on the kind of search, relations, values, or `canFind`. Before we start looking into the internals, let's verify that we're getting what we expect: correct answers, and not iterators, but BTree structures. >>> res = catalog.findRelationTokens({'token': 0}) >>> list(res) [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111] >>> list(res) [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111] >>> res = catalog.findValueTokens('children', {'token': 0}) >>> list(res) == range(1, 33) True >>> list(res) == range(1, 33) True >>> catalog.canFind({'token': 1}, targetQuery={'children': 23}) True >>> catalog.canFind({'token': 2}, targetQuery={'children': 23}) False [#findValuesIndexed]_ Note that the last two `canFind` examples from when we went through these examples without an index do not use the index, so we don't show them here: they look the wrong direction for this index. So how do these results happen? The first, `findRelationTokens`, and the last, `canFind`, are the most straightforward. The index finds all relations that match the given query, intransitively. Then for each relation, it looks up the indexed transitive results for that token. The end result is the union of all indexed results found from the intransitive search. `canFind` simply casts the result into a boolean. `findValueTokens` is the same story as above with only one more step. After the union of relations is calculated, the method returns the union of the sets of the requested value for all found relations. It will maintain itself when relations are reindexed. >>> rel = list(catalog.findRelations({'token': 11}))[0] >>> for t in (27, 28, 29, 30, 31): ... rel.children.remove(t) ... >>> catalog.index(rel) >>> catalog.findValueTokens('children', {'token': 0}) ... # doctest: +NORMALIZE_WHITESPACE LFSet([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 32]) >>> catalog.findValueTokens('children', {'token': 2}) LFSet([10, 11, 12, 25, 26, 32]) >>> catalog.findValueTokens('children', {'token': 11}) LFSet([32]) >>> rel.children.remove(32) >>> catalog.index(rel) >>> catalog.findValueTokens('children', {'token': 0}) ... # doctest: +NORMALIZE_WHITESPACE LFSet([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]) >>> catalog.findValueTokens('children', {'token': 2}) LFSet([10, 11, 12, 25, 26]) >>> catalog.findValueTokens('children', {'token': 11}) LFSet([]) >>> rel.children.insert(27) 1 >>> catalog.index(rel) >>> catalog.findValueTokens('children', {'token': 0}) ... # doctest: +NORMALIZE_WHITESPACE LFSet([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]) >>> catalog.findValueTokens('children', {'token': 2}) LFSet([10, 11, 12, 25, 26, 27]) >>> catalog.findValueTokens('children', {'token': 11}) LFSet([27]) When the index is copied, the search index is copied. >>> new = catalog.copy() >>> res = list(new.iterSearchIndexes()) >>> len(res) 1 >>> new_index = res[0] >>> res = list(catalog.iterSearchIndexes()) >>> len(res) 1 >>> old_index = res[0] >>> new_index is old_index False >>> old_index.index is new_index.index False >>> list(old_index.index.keys()) == list(new_index.index.keys()) True >>> for key, value in old_index.index.items(): ... v = new_index.index[key] ... if v is value or list(v) != list(value): ... print 'oops', key, value, v ... break ... else: ... print 'good' ... good >>> old_index.names is not new_index.names True >>> list(old_index.names) == list(new_index.names) True >>> for name, old_ix in old_index.names.items(): ... new_ix = new_index.names[name] ... if new_ix is old_ix or list(new_ix.keys()) != list(old_ix.keys()): ... print 'oops' ... break ... for key, value in old_ix.items(): ... v = new_ix[key] ... if v is value or list(v) != list(value): ... print 'oops', name, key, value, v ... break ... else: ... continue ... break ... else: ... print 'good' ... good Helpers ======= When writing search indexes and query factories, you often want complete access to relation catalog data. We've seen a number of these tools already: - `getRelationModuleTools` gets a dictionary of the BTree tools needed to work with relations. >>> sorted(catalog.getRelationModuleTools().keys()) ... # doctest: +NORMALIZE_WHITESPACE ['BTree', 'Bucket', 'Set', 'TreeSet', 'difference', 'dump', 'intersection', 'load', 'multiunion', 'union'] 'multiunion' is only there if the BTree is an I* or L* module. Use the zc.relation.catalog.multiunion helper function to do the best union you can for a given set of tools. - `getValueModuleTools` does the same for indexed values. >>> tools = set(('BTree', 'Bucket', 'Set', 'TreeSet', 'difference', ... 'dump', 'intersection', 'load', 'multiunion', 'union')) >>> tools.difference(catalog.getValueModuleTools('children').keys()) set([]) >>> tools.difference(catalog.getValueModuleTools('token').keys()) set([]) - `getRelationTokens` can return all of the tokens in the catalog. >>> len(catalog.getRelationTokens()) == len(catalog) True This also happens to be equivalent to `findRelationTokens` with an empty query. >>> catalog.getRelationTokens() is catalog.findRelationTokens({}) True It also can return all the tokens that match a given query, or None if there are no matches. >>> catalog.getRelationTokens({'token': 0}) # doctest: +ELLIPSIS >>> list(catalog.getRelationTokens({'token': 0})) [100] This also happens to be equivalent to `findRelationTokens` with a query, a maxDepth of 1, and no other arguments. >>> catalog.findRelationTokens({'token': 0}, maxDepth=1) is ( ... catalog.getRelationTokens({'token': 0})) True Except that if there are no matches, `findRelationTokens` returns an empty set (so it *always* returns an iterable). >>> catalog.findRelationTokens({'token': 50}, maxDepth=1) LOSet([]) >>> print catalog.getRelationTokens({'token': 50}) None - `getValueTokens` can return all of the tokens for a given value name in the catalog. >>> list(catalog.getValueTokens('token')) == range(12) True This is identical to catalog.findValueTokens with a name only (or with an empty query, and a maxDepth of 1). >>> list(catalog.findValueTokens('token')) == range(12) True >>> catalog.findValueTokens('token') is catalog.getValueTokens('token') True It can also return the values for a given token. >>> list(catalog.getValueTokens('children', 100)) [1, 2] This is identical to catalog.findValueTokens with a name and a query of {None: token}. >>> list(catalog.findValueTokens('children', {None: 100})) [1, 2] >>> catalog.getValueTokens('children', 100) is ( ... catalog.findValueTokens('children', {None: 100})) True Except that if there are no matches, `findValueTokens` returns an empty set (so it *always* returns an iterable); while getValueTokens will return None if the relation has no values (or the relation is unknown). >>> catalog.findValueTokens('children', {None: 50}, maxDepth=1) LFSet([]) >>> print catalog.getValueTokens('children', 50) None >>> rel.children.remove(27) >>> catalog.index(rel) >>> catalog.findValueTokens('children', {None: rel.id}, maxDepth=1) LFSet([]) >>> print catalog.getValueTokens('children', rel.id) None - `yieldRelationTokenChains` is a search workhorse for searches that use a query factory. TODO: describe. .. ......... .. .. Footnotes .. .. ......... .. .. [#queryFactory] The query factory knows when it is not needed--not only when neither of its names are used, but also when both of its names are used. >>> list(catalog.findRelationTokens({'token': 0, 'children': 1})) [100] .. [#findValuesUnindexed] When values are the same as their tokens, `findValues` returns the same result as `findValueTokens`. Here we see this without indexes. >>> list(catalog.findValueTokens('children', {'token': 0})) == list( ... catalog.findValues('children', {'token': 0})) True .. [#findValuesIndexed] Again, when values are the same as their tokens, `findValues` returns the same result as `findValueTokens`. Here we see this with indexes. >>> list(catalog.findValueTokens('children', {'token': 0})) == list( ... catalog.findValues('children', {'token': 0})) True