Unit 3: Reading and Writing Notes
chapter 4 we learned how to process
Index construction and create inverted index.
process index construction or indexing; the process or machine that performs it the indexer.
1 blocked sort-based indexing
2 single-pass in-memory indexing
3 dynamic indexing
- Hardware basics
we have to cache data in main memory in
order to not waste time.(a few clock cycles VS. transfer it from
disk)
- Blocked sort-based indexing
We then sort the pairs with the term as
the dominant key and docID as the secondary key.
We can build the mapping from terms to
termIDs and external sorting algorithm.
The block is then inverted and written
to disk. Inversion involves two steps. First, we
sort the termID–docID pairs. Next, we collect all termID–docID pairs
with the same termID into a postings list, POSTING where a posting is simply a
docID.
- Single-pass in-memory indexing
SPIMI uses terms instead of termIDs, IN-MEMORY INDEXING writes each block’s
dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections
of any size as long as there is enough disk space available.
A difference between BSBI and SPIMI is
that SPIMI adds a posting directly to its postings list (line 10). Instead
of first collecting all termID–docID pairs and then sorting them (as we did
in BSBI), each postings list is dynamic (i.e., its size is adjusted as it
grows) and it is immediately available to collect postings.
- Distributed indexing
A master node directs the process of
assigning and reassigning tasks to individual worker nodes
A simple solution is to maintain a (perhaps precomputed) mapping for
frequent terms that is copied to all nodes and to use terms directly
(instead of termIDs) for infrequent terms.
- Dynamic indexing
If there is a requirement that new
documents be included quickly, one solution is to maintain two indexes: a large
main index and a small auxiliary index that stores new documents.
chapter 5: Indexers compress and decompress intermediate files and the final index
Index compression for dictionary
1. The first is increased use of
caching
2.advantage of compression is faster
transfer of data from disk to memory.
- Heaps’ law: Estimating the number of
terms
allow to understand how terms are distributed across documents. This helps us to characterize the properties of the algorithms for compressing postings lists.
- Zipf’s law: Modeling the distribution
of terms
One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation.
- Dictionary compression
One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation.
Dictionary As A String: There are two methods of dictionary compassion one is fix width entries of term and other one is by sorting dictionary term as a one long sting of characters.
Blocked Searched: The dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block .
One source of redundancy in the dictionary we have not exploited yet is the fact that consecutive entries in an alphabetically sorted list share common prefixes. This observation leads to front coding.
-Postings file compression
To devise a more efficient representation of the postings file, one that uses fewer than 20 bits per document, we observe that the postings for frequent terms are close together.
Variable byte (VB) encoding uses an integral number of bytes to encode a gap. The last 7 bits of a byte are “payload” and encode part of the gap. The first bit of the byte is a continuation bit.It is set to 1 for the last byte of the encoded gap and to 0 otherwise. To decode a variable byte code, we read a sequence of bytes with continuation bit 0 terminated by a byte with continuation bit 1. We then extract and concatenate the 7-bit parts.
Comments
Post a Comment