Class SimilarityBase

java.lang.Object
org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.SimilarityBase
Direct Known Subclasses:
Axiomatic, DFISimilarity, DFRSimilarity, IBSimilarity, LMSimilarity

public abstract class SimilarityBase extends Similarity
A subclass of Similarity that provides a simplified API for its descendants. Subclasses are only required to implement the score(BasicStats, double, double) and toString() methods. Implementing explain(List, BasicStats, double, double) is optional, because SimilarityBase already provides a basic explanation of the score and the term frequency. However, implementers of a subclass are encouraged to include as much detail about the scoring method as possible.

Note: multi-word queries such as phrase queries are scored differently from Lucene's default ranking algorithm. The default algorithm "fakes" an IDF value for the phrase as a whole (since the true value is not known), whereas this class scores a phrase as the sum of its individual term scores.
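
The summation described in the note can be illustrated with a small stand-alone sketch. It does not use Lucene classes: PhraseSumSketch, TermScore and phraseScore are hypothetical names, and the sketch assumes each term is scored with the same phrase frequency and document length while only the term-level statistic (here a document frequency) varies.

```java
/** Stand-alone illustration; none of these types are Lucene classes. */
final class PhraseSumSketch {

  /** Hypothetical per-term formula, standing in for a subclass's scoring method. */
  interface TermScore {
    double score(long docFreq, double freq, double docLen);
  }

  /** Scores a phrase as the sum of the scores of its individual terms. */
  static double phraseScore(TermScore formula, long[] docFreqs, double freq, double docLen) {
    double sum = 0.0;
    for (long docFreq : docFreqs) {
      // Same freq and docLen for every term; only the term statistic changes.
      sum += formula.score(docFreq, freq, docLen);
    }
    return sum;
  }

  public static void main(String[] args) {
    // Toy formula chosen only to keep the arithmetic easy to follow.
    TermScore toy = (docFreq, freq, docLen) -> freq / docFreq;
    System.out.println(phraseScore(toy, new long[] {2, 4}, 2.0, 10.0));
  }
}
```

No phrase-level IDF is estimated in this approach: a rare term in the phrase contributes through its own term score.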

  • Field Details

    • LOG_2

      private static final double LOG_2
      For log2(double). Precomputed for efficiency reasons.
    • LENGTH_TABLE

      private static final float[] LENGTH_TABLE
      Cache of decoded bytes.
  • Constructor Details

    • SimilarityBase

      public SimilarityBase()
      Default constructor: parameter-free
    • SimilarityBase

      public SimilarityBase(boolean discountOverlaps)
      Primary constructor.
  • Method Details

    • scorer

      public final Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
      Description copied from class: Similarity
      Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
      Specified by:
      scorer in class Similarity
      Parameters:
      boost - a multiplicative factor to apply to the produced scores
      collectionStats - collection-level statistics, such as the number of tokens in the collection.
      termStats - term-level statistics, such as the document frequency of a term across the collection.
      Returns:
      SimScorer object with the information this Similarity needs to score a query.
    • newStats

      protected BasicStats newStats(String field, double boost)
      Factory method to return a custom stats object
    • fillBasicStats

      protected void fillBasicStats(BasicStats stats, CollectionStatistics collectionStats, TermStatistics termStats)
      Fills all member fields defined in BasicStats in stats. Subclasses can override this method to fill additional stats.
    • score

      protected abstract double score(BasicStats stats, double freq, double docLen)
      Scores a document from its term frequency and length.

      Subclasses must apply their scoring formula in this method.

      Parameters:
      stats - the corpus level statistics.
      freq - the term frequency.
      docLen - the document length.
      Returns:
      the score.
    • explain

      protected void explain(List<Explanation> subExpls, BasicStats stats, double freq, double docLen)
      Subclasses should implement this method to explain the score. The enclosing explanation already contains the score, the name of the class and the doc id, as well as the term frequency and its explanation; subclasses can add additional clauses to subExpls to explain details of their scoring formulae.

      The default implementation does nothing.

      Parameters:
      subExpls - the list of details of the explanation to extend
      stats - the corpus level statistics.
      freq - the term frequency.
      docLen - the document length.
    • explain

      protected Explanation explain(BasicStats stats, Explanation freq, double docLen)
      Explains the score. The implementation here provides a basic explanation in the format score(name-of-similarity, doc=doc-id, freq=term-frequency), computed from:, and attaches the score (computed via the score(BasicStats, double, double) method) and the explanation for the term frequency. Subclasses content with this format may add additional details in explain(List, BasicStats, double, double).
      Parameters:
      stats - the corpus level statistics.
      freq - the term frequency and its explanation.
      docLen - the document length.
      Returns:
      the explanation.
    • toString

      public abstract String toString()
      Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
      Overrides:
      toString in class Object
    • log2

      public static double log2(double x)
      Returns the base two logarithm of x.
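
As a rough illustration of the score(BasicStats, double, double) contract and the log2 helper documented above, here is a stand-alone sketch. It deliberately avoids Lucene types so that it compiles on its own: ScoreSketch and its nested Stats class are hypothetical stand-ins for a subclass and for the values it would read from BasicStats, and the formula is a toy assumption rather than any model shipped with Lucene.

```java
/** Stand-alone illustration; the types here are stand-ins, not Lucene classes. */
final class ScoreSketch {

  // ln(2), computed once so that log2 costs one Math.log call and one division.
  private static final double LOG_2 = Math.log(2);

  /** Returns the base-two logarithm of x. */
  static double log2(double x) {
    return Math.log(x) / LOG_2;
  }

  /** Hypothetical holder for the corpus-level values a formula might read. */
  static final class Stats {
    final long numberOfDocuments;
    final long docFreq;
    final double avgFieldLength;
    final double boost;

    Stats(long numberOfDocuments, long docFreq, double avgFieldLength, double boost) {
      this.numberOfDocuments = numberOfDocuments;
      this.docFreq = docFreq;
      this.avgFieldLength = avgFieldLength;
      this.boost = boost;
    }
  }

  /** Toy formula (an assumption): length-normalised tf times a log2-based idf. */
  static double score(Stats stats, double freq, double docLen) {
    double tf = freq / (freq + docLen / stats.avgFieldLength);
    double idf = log2(1.0 + (double) stats.numberOfDocuments / stats.docFreq);
    return stats.boost * tf * idf;
  }

  public static void main(String[] args) {
    Stats stats = new Stats(1000, 10, 50.0, 1.0);
    System.out.println(score(stats, 1, 50));   // baseline
    System.out.println(score(stats, 3, 50));   // higher term frequency
    System.out.println(score(stats, 3, 200));  // same frequency, longer document
  }
}
```

In a real subclass the same three inputs arrive as the stats, freq and docLen parameters, and the formula body is the only part the implementer has to supply besides toString().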