Lucene internals
This article introduces Lucene internals.
Apache Lucene is a Java full-text search engine. Lucene provides a core library and API that can easily be used to add search capabilities to applications.
API
Lucene is divided into several packages:
- analysis defines an abstract Analyzer API for converting text from a Reader into a TokenStream. A TokenStream can be composed by applying TokenFilters to the output of a Tokenizer. Tokenizers and TokenFilters are strung together and applied with an Analyzer. analysis-common provides a number of Analyzer implementations.
- codecs provides an abstraction over the encoding and decoding of the inverted index structure, as well as different implementations that can be chosen depending upon application needs.
- document provides a simple Document class. A document is a set of named Fields, whose values may be strings or instances of Reader.
- index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
- search provides data structures to represent queries (TermQuery for individual words, PhraseQuery for phrases, BooleanQuery for boolean combinations of queries) and the IndexSearcher, which turns queries into TopDocs. A number of QueryParsers are provided for producing query structures from strings or XML.
- store defines an abstract class for storing persistent data, the Directory, which is a collection of named files written by an IndexOutput and read by an IndexInput. Multiple implementations are provided, but FSDirectory is generally recommended as it tries to use operating system disk buffer caches efficiently.
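The way a tokenizer and token filters are strung together can be sketched without the Lucene dependency. The example below is a toy model in plain Java, not Lucene's API: the class ToyAnalysis, its methods, and the stop-word list are all hypothetical names chosen for illustration, and real Lucene token streams carry more information than plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of an analysis chain. These are NOT Lucene's classes;
// they only mirror the tokenizer -> filter -> filter composition.
public class ToyAnalysis {

    // Tokenizer role: split raw text on runs of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Token filter role: lowercase every token.
    static List<String> lowerCase(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Token filter role: drop tokens that appear in the stop-word set.
    static List<String> removeStopWords(List<String> in, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Analyzer role: apply the tokenizer, then each filter in order.
    // The stop-word list here is an arbitrary assumption for the example.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)), Set.of("the", "a", "of"));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Internals of Lucene")); // prints [internals, lucene]
    }
}
```

The order of the filters matters in this sketch: stop words are removed after lowercasing, so "The" is matched against the lowercase entry "the".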
Typical usage
- Create Documents by adding Fields.
- Create an IndexWriter and add documents to it with addDocument().
- Call QueryParser.parse() to build a query from a string.
- Create an IndexSearcher and pass the query to its search() method.
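To show what these steps accomplish internally, here is a minimal sketch of an in-memory inverted index in plain Java. It deliberately avoids Lucene's classes so that it runs on a bare JDK: ToyIndex and its methods are hypothetical names, the text analysis is a crude lowercase-and-split, and the query handling (every term must match) is a simplification made for this example, not a description of how QueryParser or IndexSearcher behave.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index. NOT Lucene's API; it only mirrors the
// "add documents, then search" flow listed above.
public class ToyIndex {

    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Writer role: analyze the text and record each term against a new document id.
    public int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Searcher role: return the ids of documents containing every query term.
    public List<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : analyze(query)) {
            SortedSet<Integer> docs =
                postings.getOrDefault(term, Collections.emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its postings
            } else {
                result.retainAll(docs);         // later terms: intersect
            }
        }
        return result == null ? List.of() : new ArrayList<>(result);
    }

    // Crude analysis: lowercase, then split on non-letter, non-digit characters.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("Lucene is a search library");                // doc 0
        index.addDocument("An inverted index maps terms to documents"); // doc 1
        index.addDocument("Lucene stores an inverted index");           // doc 2
        System.out.println(index.search("inverted index")); // prints [1, 2]
        System.out.println(index.search("lucene index"));   // prints [2]
    }
}
```

The same analysis function is applied to documents and to queries in this sketch, so that a query term such as "Lucene" matches the lowercase term stored in the postings map.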