Links
*****

- http://lucene.apache.org/
- http://lucenebook.com/
- .NET_
- `Search Engine Watch`_

Issues
======

- http://issues.apache.org/jira/browse/LUCENE

Alternatives
============

Also see *Competitors* (below)...

- `MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java`_.
- http://sphinxsearch.com/
- `Sphinx search introduction`_
- http://xapian.org/

Articles
========

- `Delve inside the Lucene indexing mechanism`_ including
  *Improving the indexing performance*.

Competitors
===========

Also see *Alternatives* (above)...

- `IBM OmniFind Yahoo! Edition`_
- `IBM OmniFind Enterprise Edition`_

Did you mean
============

- org.apache.lucene.search.didyoumean_
- `Did You Mean: Lucene`_?
- `Spelling Checker using Lucene`_

Faceted
=======

- `Faceted Metadata Search and Browse`_

History
=======

- `The Lucene Search Engine, Doug Cutting, 16 June, 2000`_
- `Lucene - Doug Cutting, November 24, 2004`_

Index Accessor
==============

- :doc:`lucene-index-accessor`

Monitor
=======

- `LucidGaze for Lucene`_
  Monitor and improve your Lucene search performance.

StopWords
=========

- `Stopword List`_
- `Key to Effective Searches, Dealing with Stopwords`_
- `To Stopword or Not to Stopword?`_

Snowball
========

- `RE: [Snowball-discuss] Stop word lists`_

Text Extractor
==============

- `Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries`_.
- `Aperture, Extract full-text and metadata from many common file formats`_
  `Getting started (Appears to use Java 1.5)`_
- `OpenXML4j is a complete Java framework supporting Open Package Convention`_,

  - Office Open XML (WordProcessingML, SpreadsheetML, PresentationML and
    shared specs like DrawingML).

html
----

- `Parsing HTML with java`_:

  - `HTMLEditorKit`_
  - `JTidy`_
  - `HTML Parser`_
  - `HTML Cleaner`_

For Maven instructions: `Maven repository notes`_.

Microsoft
---------

- `wv is a library which allows access to Microsoft Word files`_.
- `catdoc is program which reads one or more Microsoft word files and outputs text`_
- `catdoc, xls2csv and catppt`_
- `Antiword is a free MS Word reader for Linux`_
- `Using Java to Crack Office 2007`_

OpenOffice
----------

- `JOOConverter automates conversions between office document formats using OpenOffice.org`_
- `file2xliff4j is a set of Java classes to convert HTML, Word, Excel, OpenOffice.org Text, PowerPoint, RTF and MIF documents to XLIFF File Format`_.

pdf
---

- http://www.jpedal.org/

Projects
========

- `The Compass Framework`_ is a first class open source Java framework,
  enabling the power of Search Engine semantics to your application stack
  declaratively.
- `Enhydra Snapper - Fulltext Indexing and Search`_
- `Hibernate Search brings the power of full text search engines to the persistence domain model`_
  and Hibernate experience, through transparent configuration (Hibernate
  Annotations) and a common API.
  Might be here now... http://www.hibernate.org/410.html
- `Hibernate Annotations`_ includes a package of annotations that allows you
  to mark any domain model object as indexable and have Hibernate maintain a
  Lucene index of any instances persisted via Hibernate.
- `Kowari is an Open Source`_, massively scalable, transaction-safe,
  purpose-built database for the storage, retrieval and analysis of metadata.
- DBSight_ is a highly customisable full-text search platform for any
  relational database.
- `NetSearch - the Enterprise Search Solution from Ardentia`_
- Solr_ is an open source enterprise search server based on the Lucene Java
  search library, with XML/HTTP APIs, caching, replication, and a web
  administration interface.
- `LIUS is an indexing Java framework based on the Jakarta Lucene project`_.
  The LIUS framework indexes: MsWord, MsExcel, MsPowerPoint, RTF, PDF, XML,
  HTML, TXT, OpenOffice suite, ZIP files, MP3, VCard, LaTeX and JavaBeans.
- `Tika, a generic document parsing framework`_

Sample
======

- sample-lucene-did-you-mean_
- sample-lucene-count-unique-terms_

Upgrade
=======

- `Lucene 2.4 in 60 seconds`_

Word List
=========

- `Kevin's Word List Page`_


.. _.NET: http://sourceforge.net/projects/dotlucene/
.. _`Search Engine Watch`: http://searchenginewatch.com/
.. _`MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java`: http://mg4j.dsi.unimi.it/
.. _`Sphinx search introduction`: http://komunitasweb.com/2009/03/sphinx-search-introduction/
.. _`Delve inside the Lucene indexing mechanism`: http://www-128.ibm.com/developerworks/library/wa-lucene/
.. _`IBM OmniFind Yahoo! Edition`: http://omnifind.ibm.yahoo.net/
.. _`IBM OmniFind Enterprise Edition`: http://www-306.ibm.com/software/data/enterprise-search/omnifind-enterprise/
.. _org.apache.lucene.search.didyoumean: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html
.. _`Did You Mean: Lucene`: http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
.. _`Spelling Checker using Lucene`: http://sujitpal.blogspot.com/2007/12/spelling-checker-with-lucene.html
.. _`Faceted Metadata Search and Browse`: http://www.searchtools.com/info/faceted-metadata.html
.. _`The Lucene Search Engine, Doug Cutting, 16 June, 2000`: http://lucene.sourceforge.net/talks/inktomi/
.. _`Lucene - Doug Cutting, November 24, 2004`: http://lucene.sourceforge.net/talks/pisa/
.. _`LucidGaze for Lucene`: http://www.lucidimagination.com/Downloads/LucidGaze-for-Lucene
.. _`Stopword List`: http://www.unine.ch/info/clef/
.. _`Key to Effective Searches, Dealing with Stopwords`: http://www.informit.com/articles/article.asp?p=412909&seqNum=9&rl=1
.. _`To Stopword or Not to Stopword?`: http://www.ultraseek.com/articles/archives/2005/09/to_stopword_or.html
.. _`RE: [Snowball-discuss] Stop word lists`: http://www.snowball.tartarus.org/archives/snowball-discuss/0320.html
.. _`Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries`: http://lucene.apache.org/tika/
.. _`Aperture, Extract full-text and metadata from many common file formats`: http://aperture.sourceforge.net/
.. _`Getting started (Appears to use Java 1.5)`: ../aperture/getting-started.html
.. _`OpenXML4j is a complete Java framework supporting Open Package Convention`: http://sourceforge.net/projects/openxml4j/
.. _`Parsing HTML with java`: http://jtoee.blogspot.com/2007/11/parsing-html-with-htmleditorkitparserca.html
.. _`HTMLEditorKit`: http://java.sun.com/j2se/1.4.2/docs/api/javax/swing/text/html/HTMLEditorKit.html
.. _`JTidy`: http://jtidy.sourceforge.net/
.. _`HTML Parser`: http://htmlparser.sourceforge.net/
.. _`HTML Cleaner`: http://htmlcleaner.sourceforge.net/
.. _`Maven repository notes`: ../../info/computers/slinky/maven-repository.html
.. _`wv is a library which allows access to Microsoft Word files`: http://wvware.sourceforge.net/
.. _`catdoc is program which reads one or more Microsoft word files and outputs text`: http://www.45.free.net/~vitus/software/catdoc/
.. _`catdoc, xls2csv and catppt`: http://www.wagner.pp.ru/~vitus/software/catdoc/
.. _`Antiword is a free MS Word reader for Linux`: http://www.winfield.demon.nl/
.. _`Using Java to Crack Office 2007`: http://www.infoq.com/articles/cracking-office-2007-with-java
.. _`JOOConverter automates conversions between office document formats using OpenOffice.org`: http://sourceforge.net/projects/joott/
.. _`file2xliff4j is a set of Java classes to convert HTML, Word, Excel, OpenOffice.org Text, PowerPoint, RTF and MIF documents to XLIFF File Format`: http://file2xliff4j.sourceforge.net/
.. _`The Compass Framework`: http://www.compassframework.org
.. _`Enhydra Snapper - Fulltext Indexing and Search`: http://www.enhydra.org/apps/snapper/index.html
.. _`Hibernate Search brings the power of full text search engines to the persistence domain model`: http://search.hibernate.org/
.. _`Hibernate Annotations`: http://www.hibernate.org/hib_docs/annotations/reference/en/html/lucene.html
.. _`Kowari is an Open Source`: http://www.kowari.org/
.. _DBSight: http://www.dbsight.net/
.. _`NetSearch - the Enterprise Search Solution from Ardentia`: http://www.ardentia.com/
.. _Solr: http://lucene.apache.org/solr/
.. _`LIUS is an indexing Java framework based on the Jakarta Lucene project`: http://sourceforge.net/projects/lius/
.. _`Tika, a generic document parsing framework`: http://code.google.com/p/tika/
.. _sample-lucene-did-you-mean: http://toybox/hg/sample/file/tip/java/sample-lucene-did-you-mean
.. _sample-lucene-count-unique-terms: http://toybox/hg/sample/file/tip/java/sample-lucene-count-unique-terms
.. _`Lucene 2.4 in 60 seconds`: http://lingpipe-blog.com/2009/02/18/lucene-24-in-60-seconds/
.. _`Kevin's Word List Page`: http://wordlist.sourceforge.net/