Lucene internals
This article introduces Lucene internals.
Apache Lucene is a Java full-text search engine. Lucene provides a core library and API that can easily be used to add search capabilities to applications.
API
Lucene is divided into several packages:
analysis
defines an abstract Analyzer API for converting text from a Reader into a TokenStream. A TokenStream can be composed by applying TokenFilters to the output of a Tokenizer. Tokenizers and TokenFilters are strung together and applied with an Analyzer.
analysis-common
provides a number of Analyzer implementations.
codecs
provides an abstraction over the encoding and decoding of the inverted index structure, as well as different implementations that can be chosen depending upon application needs.
document
provides a simple Document class. A document is a set of named Fields, whose values may be strings or instances of Reader.
index
provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses data in the index.
search
provides data structures to represent queries (TermQuery for individual words, PhraseQuery for phrases, BooleanQuery for boolean combinations of queries) and the IndexSearcher, which turns queries into TopDocs. A number of QueryParsers are provided for producing query structures from strings or XML.
store
defines an abstract class for storing persistent data, the Directory, which is a collection of named files written by an IndexOutput and read by an IndexInput. Multiple implementations are provided, but FSDirectory is generally recommended as it tries to use operating system disk buffer caches efficiently.
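The analysis chain described above (a Tokenizer whose output is wrapped by TokenFilters, all assembled by an Analyzer) can be sketched as follows. This is a minimal illustration, assuming Lucene 8 or later with lucene-core on the classpath; the class name AnalysisSketch, its helper methods, and the field name "body" are hypothetical and not part of Lucene.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical class name, used only for this illustration.
class AnalysisSketch {

    // Builds an Analyzer that strings one Tokenizer and one TokenFilter together.
    static Analyzer lowercasingAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                // The Tokenizer splits the incoming text into tokens.
                Tokenizer source = new StandardTokenizer();
                // The TokenFilter wraps the Tokenizer and lowercases each token.
                TokenStream filtered = new LowerCaseFilter(source);
                return new TokenStreamComponents(source, filtered);
            }
        };
    }

    // Runs the analyzer over the text and collects the terms it emits.
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        // "body" is an assumed field name; this analyzer ignores it.
        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // must be called before the first incrementToken()
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            stream.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = lowercasingAnalyzer()) {
            System.out.println(analyze(analyzer, "Lucene Indexes TEXT"));
        }
    }
}
```

Further filters can be added by wrapping the previous stream in another TokenFilter inside createComponents(); the ready-made Analyzer implementations in analysis-common are built the same way.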
Typical usage
- Create Documents by adding Fields.
- Create an IndexWriter and add documents to it with addDocument().
- Call QueryParser.parse() to build a query from a string.
- Create an IndexSearcher and pass the query to its search() method.
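The four steps above might look like this in practice. This is a hedged sketch rather than a reference implementation: it assumes Lucene 8 or later with the lucene-core and lucene-queryparser modules on the classpath, and it uses the in-memory ByteBuffersDirectory only to stay self-contained (an on-disk index would typically use FSDirectory.open(path) instead). The class name TypicalUsageSketch, the method indexAndSearch, the field name "body", and the sample texts are all made up for illustration.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical class name, used only for this illustration.
class TypicalUsageSketch {

    // Indexes two sample documents, runs the query, and returns the number of hits.
    static int indexAndSearch(String queryText) throws IOException, ParseException {
        try (Analyzer analyzer = new StandardAnalyzer();
             Directory directory = new ByteBuffersDirectory()) {

            // Steps 1 and 2: build Documents from Fields and add them with an IndexWriter.
            try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
                String[] texts = {
                    "Lucene is a search library",
                    "Java is a programming language"
                };
                for (String text : texts) {
                    Document doc = new Document();
                    doc.add(new TextField("body", text, Field.Store.YES));
                    writer.addDocument(doc);
                }
            } // closing the writer commits the added documents

            // Step 3: build a Query from a string; "body" is the default field.
            Query query = new QueryParser("body", analyzer).parse(queryText);

            // Step 4: open a reader on the index and pass the query to search().
            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs top = searcher.search(query, 10); // at most 10 hits
                for (ScoreDoc hit : top.scoreDocs) {
                    System.out.println("doc=" + hit.doc + " score=" + hit.score);
                }
                return top.scoreDocs.length;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("hits: " + indexAndSearch("search"));
    }
}
```

DirectoryReader is the IndexReader subclass used here to read an index from a Directory. The same Analyzer is passed to both the IndexWriter and the QueryParser so that query terms are normalized the same way as the indexed text.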