Information Storage and Retrieval

Showing posts from September, 2017

Unit 4 Muddiest Points

September 30, 2017

1. In a search engine, can a user type "," as an OR operator between keywords (e.g., New York City, Buffalo)? 2. In the Vector Space Model, if the ranking calculation relies on term counts taken from the user's query, would that bias the ranking result?

Unit 3 Muddiest Points

September 22, 2017

1. I know some users put * in their queries to express an information need, but would ordinary users also use $ in a query? 2. If users type more than one * (e.g., sa**), would the results still be correct and relevant? 3. Does something like an 80/20 rule apply to stemming in Information Retrieval?

Unit 4: Reading and Writing Notes

September 22, 2017

1.3 Processing Boolean queries: The intersection operation is the crucial one: we need to efficiently intersect postings lists so as to be able to quickly find documents that contain both terms. There is a simple and effective method of intersecting postings lists using the merge algorithm: we maintain pointers into both lists and walk through the two postings lists simultaneously, in time linear in the total number of postings entries. Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. A major element of this for Boolean queries is the order in which postings lists are accessed. If we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. For arbitrary Boolean queries, we have to evaluate and temporarily store the answers for i...
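The merge-based intersection and the "smallest lists first" ordering described in this excerpt can be sketched in Python. This is a minimal illustration, not the textbook's own code: the function names `intersect` and `intersect_many` and the sample docIDs are assumptions made up for the example, and postings lists are assumed to be sorted lists of docIDs.

```python
from typing import List


def intersect(p1: List[int], p2: List[int]) -> List[int]:
    """Intersect two sorted postings lists by walking both with pointers."""
    answer: List[int] = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID appears in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings: List[List[int]]) -> List[int]:
    """AND several postings lists together, processing the shortest first."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = list(ordered[0])
    for plist in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, plist)
    return result


# Hypothetical postings lists (docIDs invented for illustration).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]

print(intersect(brutus, calpurnia))                 # [2, 31]
print(intersect_many([brutus, caesar, calpurnia]))  # [2]
```

Starting from the shortest list keeps every intermediate result no longer than that list, which is the reason the ordering is likely to reduce total work.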

Unit 3: Reading and Writing Notes

September 15, 2017

In Chapter 4 we learned how index construction works and how to create an inverted index. The process of building the index is called index construction or indexing; the process or machine that performs it is the indexer. The chapter covers: 1) blocked sort-based indexing, 2) single-pass in-memory indexing, 3) dynamic indexing. - Hardware basics: We have to cache data in main memory in order not to waste time (accessing data in memory takes a few clock cycles, versus much longer to transfer it from disk). - Blocked sort-based indexing: We sort the pairs with the term as the dominant key and docID as the secondary key. For large collections, we build a mapping from terms to termIDs and use an external sorting algorithm. Each block is then inverted and written to disk. Inversion involves two steps. First, we sort the termID–docID pairs. Next, we collect all termID–docID pairs with the same termID into a postings list, where a posting is simply a docID. - Single-pass in-memory indexing: SPIMI uses terms instead of termIDs, writes each block’s dictiona...
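The two inversion steps for a single block can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: the helper names `term_to_id` and `invert_block` and the three tiny documents are hypothetical, and the sketch leaves out writing blocks to disk and merging them.

```python
from typing import Dict, List, Tuple

# Hypothetical term -> termID mapping, filled in as new terms are seen.
term_ids: Dict[str, int] = {}


def term_to_id(term: str) -> int:
    """Return the termID for a term, assigning the next free ID if it is new."""
    return term_ids.setdefault(term, len(term_ids))


def invert_block(pairs: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Invert one block of (termID, docID) pairs into postings lists."""
    # Step 1: sort with termID as the dominant key and docID as the secondary key.
    sorted_pairs = sorted(pairs)
    # Step 2: collect all pairs with the same termID into one postings list.
    index: Dict[int, List[int]] = {}
    for term_id, doc_id in sorted_pairs:
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:  # skip duplicate docIDs
            postings.append(doc_id)
    return index


# Invented documents standing in for one block of the collection.
docs = {1: "new home sales", 2: "home sales rise", 3: "new home"}
pairs = [(term_to_id(t), doc_id) for doc_id, text in docs.items() for t in text.split()]

block_index = invert_block(pairs)
print(term_ids)     # {'new': 0, 'home': 1, 'sales': 2, 'rise': 3}
print(block_index)  # {0: [1, 3], 1: [1, 2, 3], 2: [1, 2], 3: [2]}
```

In a full blocked sort-based indexer, each inverted block would be written to disk and the blocks merged afterwards; here everything stays in memory to keep the two steps visible.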

Unit 2 Muddiest Points

September 15, 2017

Here are some questions about Unit 2: 1. Can people fully trust automatic indexing? Maybe we still have to rely on people to review the content (e.g., inappropriate content) before the index is created. 2. How does an IR system detect and find the correct words if users mistype or use the wrong grammar? 3. Some people search with queries that combine English words with words in another language (e.g., a brand name). Can the system process the two languages separately, or does it process the whole query at once?

Unit 2: Reading and Writing Notes

September 08, 2017

This unit covers the concepts of document and query processing. - A first take at building an inverted index: There are four steps to build an inverted index. First, we collect the documents to be indexed. Second, we tokenize the text, turning each document into a list of tokens. Third, we do linguistic preprocessing to produce a list of normalized tokens. Finally, we add the terms to a dictionary and postings, which creates the inverted index. For storing postings, a fixed-length array would be wasteful, as some words occur in many documents and others in very few. The solutions are singly linked lists and variable-length arrays. - Document delineation and character sequence decoding: If a document is too large, it is difficult to index as a single unit, so we have to choose the document unit. Moreover, documents are stored as byte sequences that must be decoded into character sequences, and different documents may use different encodings. - Determining the vocabulary of terms: Tokenization is the task of chopping a character sequence up into pieces. ...
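The four steps for building an inverted index can be sketched in a few lines of Python. This is a minimal illustration with assumptions: the function names, the regular-expression tokenizer, lowercasing as the only normalization step, and the three sample documents are all invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Step 2: chop the text into tokens (here: runs of letters and digits)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(tokens: List[str]) -> List[str]:
    """Step 3: linguistic preprocessing, reduced here to lowercasing."""
    return [token.lower() for token in tokens]


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Step 4: map each term in the dictionary to its sorted postings list."""
    index: Dict[str, List[int]] = defaultdict(list)
    for doc_id in sorted(docs):
        # Each term is recorded once per document, so postings stay duplicate-free.
        for term in set(normalize(tokenize(docs[doc_id]))):
            index[term].append(doc_id)
    return dict(sorted(index.items()))


# Step 1: collect the documents to be indexed (sample texts made up for illustration).
docs = {1: "New York City", 2: "Buffalo, New York", 3: "The city of Buffalo"}

index = build_inverted_index(docs)
print(index["new"])   # [1, 2]
print(index["city"])  # [1, 3]
```

Python lists grow on demand, so they play the role of the variable-length arrays mentioned above: a frequent term gets a long postings list and a rare term a short one, with no fixed-size allocation.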

Unit 1: Muddiest Points

September 02, 2017

When I studied the overview of IR, some questions came up: 1. If HTML counts as unstructured information, do XML and JSON also count as unstructured information? 2. How can we ensure that users give the IR system accurate relevance assessments, since some people may give the wrong feedback? 3. Sometimes developers store unstructured data in a database and create an index on the table. Is that unstructured or structured information?

Unit 1: Reading and Writing Notes

September 02, 2017

This is an overview of information retrieval. Several articles discuss how people and experts think about information retrieval. - Finding Out About: People are always seeking useful information about their environment. For humans, language is a complex system for carrying information. Language can change as needed, because people use it to express things and to react to them. Finding Out About focuses on meaning and looks for related information. The process of Finding Out About: 1) Asking the question: Depending on their cognitive state, people may be aware of the question they want to ask, but sometimes they do not know what the problem is. The question expresses an information need, which we may also call a query. 2) Constructing the answer: The main difficulty is how to translate what is in the human mind into a form the computer system can work with, so that we can receive a result. Here we view a passage as a document and a set of documents as a corpus. 3) Assessing the answer: It is all about the related feedback and may give the p...