Unit 3: Reading and Writing Notes

chapter 4 we learned how to process Index construction and create inverted index.

process index construction or indexing; the process or machine that performs it the indexer.

1 blocked sort-based indexing 
2 single-pass in-memory indexing 
3 dynamic indexing 

- Hardware basics
we have to cache data in main memory in order to not waste time.(a few clock cycles VS. transfer it from disk)

- Blocked sort-based indexing
We then sort the pairs with the term as the dominant key and docID as the secondary key.
We can build the mapping from terms to termIDs and external sorting algorithm.

The block is then inverted and written to disk. Inversion involves two steps. First, we sort the termID–docID pairs. Next, we collect all termID–docID pairs with the same termID into a postings list, POSTING where a posting is simply a docID.

- Single-pass in-memory indexing
SPIMI uses terms instead of termIDs, IN-MEMORY INDEXING writes each block’s dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.

A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list (line 10). Instead of first collecting all termID–docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows) and it is immediately available to collect postings.  

- Distributed indexing
A master node directs the process of assigning and reassigning tasks to individual worker nodes
A simple solution is to maintain a (perhaps precomputed) mapping for frequent terms that is copied to all nodes and to use terms directly (instead of termIDs) for infrequent terms.

- Dynamic indexing
If there is a requirement that new documents be included quickly, one solution is to maintain two indexes: a large main index and a small auxiliary index that stores new documents.

chapter 5: Indexers compress and decompress intermediate files and the final index

Index compression for dictionary  

1. The first is increased use of caching
2.advantage of compression is faster transfer of data from disk to memory.

- Heaps’ law: Estimating the number of terms
allow to understand how terms are distributed across documents. This helps us to characterize the properties of the algorithms for compressing postings lists. 

- Zipf’s law: Modeling the distribution of terms
One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation. 

- Dictionary compression
One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation. 

Dictionary As  A String: There are two methods of dictionary compassion one is fix width entries of term and other one is by sorting dictionary term as a one long sting of characters. 

Blocked Searched: The dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block .

One source of redundancy in the dictionary we have not exploited yet is the fact that consecutive entries in an alphabetically sorted list share common prefixes. This observation leads to front coding.

-Postings file compression
To devise a more efficient representation of the postings file, one that uses fewer than 20 bits per document, we observe that the postings for frequent terms are close together.

Variable byte (VB) encoding uses an integral number of bytes to encode a gap. The last 7 bits of a byte are “payload” and encode part of the gap. The first  bit of the byte is a continuation bit.It is set to 1 for the last byte of the encoded gap and to 0 otherwise. To decode a variable byte code, we read a sequence of bytes with continuation bit 0 terminated by a byte with continuation bit 1. We then extract and concatenate the 7-bit parts.



Comments

Popular posts from this blog

Unit 7 Muddiest Points

Unit 10 Muddiest Points

Unit 5 Muddiest Points