Unit 2: Reading and Writing Notes

This unit is talking about the concept of document and query processing.

- A first take at building an inverted index.
These are four steps to build an inverted index. When we collect the documents to be indexed, we can tokenize the text and become a list of token. And then we produced a list of normalized tokens by doing linguistic preprocessing. Finally, adding to a dictionary and postings would create an inverted index. However, a fixed length array would be wasteful as some words occur in many documents, and others in very few. The solutions are singly linked lists and variable length arrays.

- Document delineation and character sequence decoding
If the document is too large, it is difficult to index them. We have to choose the unit of the document. Moreover, character sequence is usually a byte but the document would provide different encoding style.

- Determining the vocabulary of terms
Tokenization is that is the task of chopping it up into pieces.
A token is an instance of a sequence of characters in some particular document that are grouped TYPE together as a useful semantic unit for processing. 

A type is the class of all TERM tokens containing the same character sequence. 

A term is a (perhaps normalized) type that is included in the IR system’s dictionary. 

some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded STOP WORDS. These words are called stop words.

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base LEMMA or dictionary form of a word, which is known as the lemma suffix the word.

- Faster postings list intersection via skip pointers
Skip pointers are effectively shortcuts that allow us to avoid processing parts of the postings list that will not figure in the search results. Building effective skip pointers is easy if an index is relatively static; it is harder if a postings list keeps changing because of updates.

- Positional postings and phrase queries
In this section we consider two approaches to supporting phrase queries and their combination. A related but distinct concept is term proximity weighting. Biword indexes is a phrase index. Positional indexes: to process a phrase query, you still need to access the inverted index entries for each distinct term. a partial next word index.

Dictionaries and tolerant retrieval
- Search structures for dictionaries
search structures for dictionaries has two broad classes of solutions: hashing, and search trees.

Wildcard queries: using uncertain way: sid*

Permuterm indexes
Notice the close interplay between the B-tree and the permuterm index above. Indeed, it suggests that the structure should perhaps be viewed as a permuterm B-tree. However, we follow traditional terminology here in describing the permuterm index as distinct from the B-tree that allows us to select the rotations with a given prefix.

k-gram indexes for wildcard queries

- Spelling correction

We focus on two specific forms of spelling correction that we refer to as isolated-term correction and context-sensitive correction.We begin by examining two techniques for addressing isolated-term correction: edit distance, and k-gram overlap. We then proceed to context sensitive correction.

edit distance:
, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string and (iii) replace a character of a string by another character

k-gram indexes for spelling correction: use the k-gram index

-Phonetic correction:

Our final technique for tolerant retrieval has to do with phonetic correction: misspellings that arise because the user types a query that sounds like the target term.  

Comments

Popular posts from this blog

Unit 7 Muddiest Points

Unit 10 Muddiest Points

Unit 5 Muddiest Points