Unit 2: Reading and Writing Notes
This unit covers the concepts of document and query processing.
- A first take at building an inverted
index.
There are four steps to build an inverted index. First, collect the documents to be indexed. Second, tokenize the text, turning each document into a list of tokens. Third, do linguistic preprocessing, producing a list of normalized tokens. Finally, index the documents: adding each term to a dictionary together with its postings creates the inverted index. However, a fixed-length array for the postings would be wasteful, since some words occur in many documents and others in very few. The usual solutions are singly linked lists or variable-length arrays.
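The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the tokenizer is a naive whitespace split, and "linguistic preprocessing" is reduced to lowercasing and stripping punctuation.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of doc_id -> raw text. Returns term -> sorted postings list."""
    index = defaultdict(list)  # dictionary of terms, each with a postings list
    for doc_id in sorted(docs):               # step 1: the collected documents
        tokens = docs[doc_id].split()         # step 2: tokenize (naively)
        terms = {t.lower().strip(".,") for t in tokens}  # step 3: normalize
        for term in sorted(terms):            # step 4: add to dictionary/postings
            index[term].append(doc_id)
    return dict(index)

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in new sales"}
index = build_inverted_index(docs)
# e.g. index["sales"] == [1, 2, 3]
```

Because documents are processed in increasing doc_id order, each postings list comes out sorted, which is what the later intersection algorithms rely on.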
- Document delineation and character
sequence decoding
If the documents are too large, they are difficult to index, so we have to choose the unit of indexing. Moreover, the input is usually a byte sequence, and documents may arrive in different character encodings, which must be decoded first.
- Determining the vocabulary of terms
Tokenization is the task of chopping a character sequence up into pieces, called tokens.
A token is an instance of a
sequence of characters in some particular document that are grouped
together as a useful semantic unit for processing.
A type
is the class of all tokens containing the same character
sequence.
A term is a (perhaps normalized) type that is
included in the IR system’s dictionary.
Some extremely common words
that appear to be of little value in helping select documents
matching a user's need are excluded from the vocabulary. These words are called stop words.
Stemming usually refers to a crude
heuristic process that chops off the ends of words in the hope of
reducing inflected forms to a common base correctly most of the time.
Lemmatization usually refers to doing
things properly with the use of a vocabulary and morphological
analysis of words, normally aiming to remove inflectional endings
only and to return the base or dictionary form of a word, which
is known as the lemma.
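The contrast can be made concrete with two toy functions. Neither is a real algorithm from the book: the stemmer is a made-up suffix chopper (much cruder than Porter's), and the lemmatizer's lookup table is a hypothetical stand-in for real morphological analysis.

```python
def toy_stem(word):
    """A crude heuristic stemmer: chop off common English suffixes
    with no vocabulary knowledge at all (illustrative only)."""
    for suffix in ("ational", "ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization instead consults a vocabulary; sketched here as a
# hypothetical lookup table mapping inflected forms to their lemmas.
LEMMAS = {"saw": "see", "better": "good", "operational": "operate"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

print(toy_stem("operational"))   # crude chopping yields "oper"
print(toy_lemmatize("saw"))      # dictionary form: "see"
```

The key difference shows in the output: stemming may produce a non-word stem like "oper", while lemmatization returns an actual dictionary form.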
- Faster postings list intersection via
skip pointers
Skip pointers are effectively shortcuts
that allow us to avoid processing parts of the postings list that
will not figure in the search results. Building effective skip
pointers is easy if an index is relatively static; it is harder if a
postings list keeps changing because of updates.
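A sketch of the intersection with skip pointers, under the common heuristic of placing roughly sqrt(n) evenly spaced skips; the function names are my own, not the book's pseudocode:

```python
import math

def add_skips(postings):
    """Evenly spaced skip pointers, about sqrt(n) apart: index -> target index."""
    n = len(postings)
    step = int(math.sqrt(n)) or 1
    return {i: i + step for i in range(0, n - step, step)}

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following a skip pointer
    whenever its target does not overshoot the other list's current doc ID."""
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                i = s1[i]          # take the shortcut past useless entries
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                j = s2[j]
            else:
                j += 1
    return answer
```

Skipped entries are safe to ignore: everything jumped over is strictly smaller than the skip target, which is itself no larger than the doc ID we are trying to catch up to.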
- Positional postings and phrase
queries
In this section we consider two
approaches to supporting phrase queries, and their combination. A
related but distinct concept is term proximity weighting. A biword
index is a phrase index over every pair of consecutive terms. With a positional index, to process a phrase
query you still need to access the inverted index entries for each
distinct term, then check that the terms occur at adjacent positions. The two schemes can be combined, for example with a partial next word index.
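A minimal sketch of the positional-index approach (my own illustrative functions, with naive whitespace tokenization): the index stores, for each term and document, the list of positions, and a phrase matches when consecutive query terms occur at consecutive positions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}; a minimal positional index."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return doc IDs containing the phrase's words at adjacent positions."""
    terms = phrase.lower().split()
    # candidate docs must contain every distinct term
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    results = []
    for doc_id in sorted(candidates):
        # check adjacency: positions p, p+1, p+2, ... for successive terms
        if any(all(p + k in index[terms[k]][doc_id]
                   for k in range(1, len(terms)))
               for p in index[terms[0]][doc_id]):
            results.append(doc_id)
    return results

docs = {1: "to be or not to be", 2: "to be is to do"}
idx = build_positional_index(docs)
# phrase_query(idx, "not to be") matches only document 1
```

A production system would merge position lists in sorted order rather than use membership tests, but the adjacency condition is the same.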
Dictionaries and tolerant retrieval
- Search structures for dictionaries
Search structures for dictionaries fall into
two broad classes of solutions: hashing, and search trees.
Wildcard queries express uncertainty about a term's spelling, e.g.
sid*
Permuterm indexes
Notice the close interplay between the
B-tree and the permuterm index above. Indeed, it suggests that the
structure should perhaps be viewed as a permuterm B-tree. However, we
follow traditional terminology here in describing the permuterm index
as distinct from the B-tree that allows us to select the rotations
with a given prefix.
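The permuterm idea can be sketched directly: index every rotation of term$ (where $ marks the end of the term), then rotate a single-* wildcard query so the * falls at the end and match rotations by prefix. The function names are mine; a real system would store the rotations in a B-tree rather than a flat dict.

```python
def permuterm_rotations(term):
    """All rotations of term$, where '$' marks the end of the term."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def build_permuterm_index(vocab):
    """rotation -> original term, for every rotation of every vocabulary term."""
    return {rot: term for term in vocab for rot in permuterm_rotations(term)}

def wildcard_lookup(index, query):
    """Handle a single-'*' wildcard like 'm*n': rotate it to 'n$m' so the
    wildcard sits at the end, then match stored rotations by prefix."""
    before, after = query.split("*")
    prefix = after + "$" + before   # e.g. 'sid*' -> '$sid'
    return sorted({term for rot, term in index.items() if rot.startswith(prefix)})
```

For example, `s*s` becomes the prefix `s$s`, which matches the rotation `s$sid` of `sids$` but no rotation of `sid$`; the prefix scan is exactly the "select the rotations with a given prefix" step mentioned above.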
k-gram indexes for wildcard queries
- Spelling correction
We focus on two specific forms of
spelling correction that we refer to as isolated-term correction and
context-sensitive correction. We begin by examining two techniques for
addressing isolated-term correction: edit distance, and k-gram
overlap. We then proceed to context sensitive correction.
Edit distance: the edit operations allowed for this
purpose are (i) insert a character into a string, (ii) delete a
character from a string, and (iii) replace a character of a string by
another character.
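With exactly those three operations (each at unit cost), edit distance is the classic Levenshtein dynamic program:

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and replacements
    needed to turn s1 into s2 (Levenshtein distance)."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between the prefixes s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j              # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from s1
                           dp[i][j - 1] + 1,         # insert into s1
                           dp[i - 1][j - 1] + cost)  # replace (or match)
    return dp[m][n]
```

Each cell takes constant time, so the whole table costs O(|s1| * |s2|) time and space.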
k-gram indexes for spelling
correction: use the k-gram index to find vocabulary terms that share many k-grams with the misspelled query term.
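A small sketch of the overlap idea, scoring candidates by Jaccard similarity of their k-gram sets; the functions, the threshold value, and the `$` boundary padding are illustrative choices, and a real system would look candidates up through the k-gram index rather than scan the whole vocabulary:

```python
def kgrams(term, k=2):
    """Set of k-grams of a term, with '$' padding at the boundaries."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def kgram_candidates(query, vocab, k=2, threshold=0.3):
    """Rank vocabulary terms by k-gram overlap with the (mis)spelled query,
    keeping only those above a similarity threshold."""
    q = kgrams(query, k)
    scored = [(jaccard(q, kgrams(t, k)), t) for t in vocab]
    return [t for score, t in sorted(scored, reverse=True) if score >= threshold]
```

For the misspelling "bord", the bigram sets of "board" and "border" share most of their k-grams with the query, so they rank ahead of weaker matches and terms below the threshold are dropped entirely.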
- Phonetic correction
Our final technique for tolerant
retrieval has to do with phonetic correction: misspellings that arise
because the user types a query that sounds like the target term.
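The classic example of phonetic hashing is Soundex: map each term to a short code so that similar-sounding words collide. This is one common Soundex variant (keep the first letter, map the rest to digit classes, collapse repeats, drop vowels and h/w/y, pad to four characters); details vary slightly between formulations.

```python
# Standard Soundex digit classes; vowels and h, w, y map to the "0" class.
SOUNDEX = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
           **dict.fromkeys("dt", "3"), "l": "4",
           **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(term):
    """Phonetic hash: first letter + up to three digits, zero-padded."""
    term = term.lower()
    digits = [SOUNDEX.get(ch, "0") for ch in term]
    collapsed = [digits[0]]
    for d in digits[1:]:          # collapse runs of the same digit
        if d != collapsed[-1]:
            collapsed.append(d)
    code = term[0].upper() + "".join(d for d in collapsed[1:] if d != "0")
    return (code + "000")[:4]
```

At query time, the misspelled query term and all vocabulary terms are hashed the same way, and terms whose code matches the query's code become correction candidates.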