This unit covers document and query processing.

- A first take at building an inverted index. Building an inverted index takes four steps. First, we collect the documents to be indexed. Second, we tokenize the text, turning each document into a list of tokens. Third, we apply linguistic preprocessing to produce a list of normalized tokens. Finally, we add the normalized tokens to a dictionary and postings, which together form the inverted index. For storing the postings, a fixed-length array would be wasteful, because some words occur in many documents and others in very few. Two alternatives are singly linked lists and variable-length arrays.
- Document delineation and character sequence decoding. A very large document is difficult to index as a single unit, so we have to choose the document unit for indexing. Moreover, a document is usually stored as a sequence of bytes, and different documents may use different encodings, so the bytes must first be decoded into a character sequence.
- Determining the vocabulary of terms. Tokenization is the task of chopping a character sequence up into pieces, called tokens. ...
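The steps summarized in this unit (decoding, tokenization, normalization, and building the dictionary and postings) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any particular system: the function names `tokenize`, `normalize`, and `build_inverted_index` are hypothetical, the encoding of each document is assumed to be known in advance, tokens are taken to be runs of ASCII letters and digits, and lowercasing stands in for the full linguistic preprocessing step. Postings are kept in Python lists, which behave as variable-length arrays.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Chop a character sequence into tokens (assumed: runs of ASCII letters/digits)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(tokens):
    """Stand-in for linguistic preprocessing: lowercasing only."""
    return [token.lower() for token in tokens]


def build_inverted_index(docs):
    """Map each normalized term to a sorted list of the docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(tokenize(text)):
            postings[term].add(doc_id)
    # Sort the dictionary terms and each postings list (a variable-length array).
    return {term: sorted(doc_ids) for term, doc_ids in sorted(postings.items())}


# Step 1: collect the documents. Each is stored as bytes together with its
# encoding (assumed known here); the bytes are decoded into characters first.
raw_docs = {
    1: ("I did enact Julius Caesar".encode("utf-8"), "utf-8"),
    2: ("So let it be with Caesar".encode("latin-1"), "latin-1"),
}
docs = {doc_id: data.decode(encoding) for doc_id, (data, encoding) in raw_docs.items()}

# Steps 2 to 4: tokenize, normalize, and build the dictionary and postings.
index = build_inverted_index(docs)
print(index["caesar"])  # [1, 2]
```

Each postings list grows only as large as the number of documents containing its term, which is the reason a variable-length structure is preferred over a fixed-length array.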