Information Retrieval: Data Structures and Algorithms (英語) ペーパーバック – Facsimile, 1992/6/12
Kindle 端末は必要ありません。無料 Kindle アプリのいずれかをダウンロードすると、スマートフォン、タブレットPCで Kindle 本をお読みいただけます。
Information retrieval is a sub-field of computer science that deals with the automated storage and retrieval of documents. Providing the latest information retrieval techniques, this guide discusses Information Retrieval data structures and algorithms, including implementations in C. Aimed at software engineers building systems with book processing components, it provides a descriptive and evaluative explanation of storage and retrieval systems, file structures, term and query operations, document operations and hardware. Contains techniques for handling inverted files, signature files, and file organizations for optical disks. Discusses such operations as lexical analysis and stoplists, stemming algorithms, thesaurus construction, and relevance feedback and other query modification techniques. Provides information on Boolean operations, hashing algorithms, ranking algorithms and clustering algorithms. In addition to being of interest to software engineering professionals, this book will be useful to information science and library science professionals who are interested in text retrieval technology.
An edited volume containing data structures and algorithms for information retrieved including a disk with examples written in C. For programmers and students interested in parsing text, automated indexing, its the first collection in book form of the basic data structures and algorithms that are critical to the storage and retrieval of documents.
Luckily, there is an sequel (unofficially labeled by me) to this book: Baeza-Yates and Ribiero (eds.), Modern Information Retrieval, ACM Press and Addison-Wesley, 1999. The 1999 book covers Internet-related IR technologies as best as possible, considering that anything associated with the Web is out-of-date as soon as the ink is dry. Folks genuinely interested in technical details should consider looking at the proceedings of Web/Internet-centric conferences, such as the WWW Conferences.
In short, this 1992 classic is a great book for all serious IR practitioners to have. It has clear and concise definitions of basic terms and comes in handy when reading papers written by others and writing one's own.
The introduction covers hashing and automata for string matching in detail, but doesn't mention vector-based techniques other than Hamming distance (!) and in one paragraph provides the only mention of edit distance (aka Levenstein distance) in the book. The chapter on PAT trees and the one on optical disks seem out of place due to their depth and obscurity. On the other hand, there's no mention of caching anywhere. The chapter on lexical analysis and stoplists by Fox has a nice introduction, but then devolves into page after page of C code. Ditto for Frakes' chapter on stemming -- good introduction, but we didn't need ten pages of code. Same for the thesaurus chapter -- a few pages of introduction, and then 40+ pages of code for some kind of hierarchical clustering. Baeza-Yates' chapter on string searching covers Knuth-Morris-Pratt and Boyer-Moore briefly and even contains some interesting empirical data, but again, we didn't really need the C code. Harman's chapter on relevance feedback (query modification) stands out as being entirely sensible, high level and informative, but is a decade behind the times. The chapter on boolean operations provides a few pages of info and then mysteriously spends 10 pages on bit vector code and then another handful on hashing. Then the following chapter on hashing has 40 pages of C code for perfect hashing! Harman's later chapter on ranking algorithms is a useful overview of scoring (though very high level). Rasmussen's chapter on clustering is also thoughtful, but rather non-standard -- you don't even get k-means, everyone's favorite clustering algorithm, and it also recaps the definitions of many of the other chapters.
Unfortunately, you don't get any higher-order graph analysis techniques that power web search engines like Google. You won't get any kind of help for load balancing servers or databases, which is critical. You also don't get any dimensionality reducing and smoothing techniques like latent semantic analysis or principal components analysis. There's also no analysis from a users' perspective on usability and the different kinds of tasks that peopel might be using information retrieval for. And of course, there's no discussion of natural language understanding techniques or crosslingual or multilingual retrieval techniques. Finally, it's all text based and you won't get any information on retrieving audio or images.
If you're serious about information retrieval, this book lacks the depth and recency to leave you feeling like an expert. The statistical language processing book by Manning and Schuetze contains an excellent introduction to information retrieval algorithms, as well as reams of background on statistical language processing you'll want to understand before getting into information retrieval. For more details on information retrieval itself, check out the collection of primary source papers edited by Karen Sparck-Jones: Readings in Information Retrieval.
However, the book turned out yet more useful to me as, during my M.A. studies (in CS) I had to write a work on "Suffix Trees" and "Suffix Arrays" and I found that Gonnet, Baeza-Yates and Snider describe equivalent ideas they call "PAT trees" and "PAT arrays".
I found this book useful too for working on computational linguistics related projects as well.
In short - I like keeping this book always in reach, as a reference, though, I found this book not so friendly as an introduction book to the subject ("Managing Gigabytes", might turn out to be a more welcomming).
If you're a hipster dilettante brogrammer...stay away, this will only hurt your head.