Class SloppyPhraseMatcher

java.lang.Object
org.apache.lucene.search.PhraseMatcher
org.apache.lucene.search.SloppyPhraseMatcher

public final class SloppyPhraseMatcher extends PhraseMatcher
Find all slop-valid position-combinations (matches) encountered while traversing/hopping the PhrasePositions.
The sloppy frequency contribution of a match depends on the distance:
- highest freq for distance=0 (exact match).
- freq gets lower as distance gets higher.
Example: for query "a b"~2, a document "x a b a y" can be matched twice: once for "a b" (distance=0), and once for "b a" (distance=2).
Possibly not all valid combinations are encountered, because for efficiency we always propagate the least PhrasePosition. This allows to base on PriorityQueue and move forward faster. As result, for example, document "a b c b a" would score differently for queries "a b c"~4 and "c b a"~4, although they really are equivalent. Similarly, for doc "a b c b a f g", query "c b"~2 would get same score as "g f"~2, although "c b"~2 could be matched twice. We may want to fix this in the future (currently not, for performance reasons).
  • Field Details

    • phrasePositions

      private final PhrasePositions[] phrasePositions
    • slop

      private final int slop
    • numPostings

      private final int numPostings
    • pq

      private final PhraseQueue pq
    • captureLeadMatch

      private final boolean captureLeadMatch
    • approximation

      private final DocIdSetIterator approximation
    • impactsApproximation

      private final ImpactsDISI impactsApproximation
    • end

      private int end
    • leadPosition

      private int leadPosition
    • leadOffset

      private int leadOffset
    • leadEndOffset

      private int leadEndOffset
    • leadOrd

      private int leadOrd
    • hasRpts

      private boolean hasRpts
    • checkedRpts

      private boolean checkedRpts
    • hasMultiTermRpts

      private boolean hasMultiTermRpts
    • rptGroups

      private PhrasePositions[][] rptGroups
    • rptStack

      private PhrasePositions[] rptStack
    • positioned

      private boolean positioned
    • matchLength

      private int matchLength
  • Constructor Details

  • Method Details

    • approximation

      DocIdSetIterator approximation()
      Description copied from class: PhraseMatcher
      Approximation that only matches documents that have all terms.
      Specified by:
      approximation in class PhraseMatcher
    • impactsApproximation

      ImpactsDISI impactsApproximation()
      Description copied from class: PhraseMatcher
      Approximation that is aware of impacts.
      Specified by:
      impactsApproximation in class PhraseMatcher
    • maxFreq

      float maxFreq() throws IOException
      Description copied from class: PhraseMatcher
      An upper bound on the number of possible matches on this document
      Specified by:
      maxFreq in class PhraseMatcher
      Throws:
      IOException
    • reset

      public void reset() throws IOException
      Description copied from class: PhraseMatcher
      Called after PhraseMatcher.approximation() has been advanced
      Specified by:
      reset in class PhraseMatcher
      Throws:
      IOException
    • sloppyWeight

      float sloppyWeight()
      Description copied from class: PhraseMatcher
      The slop-adjusted weight of the current match

      The sum of the slop-adjusted weights is used as the freq for scoring

      Specified by:
      sloppyWeight in class PhraseMatcher
    • nextMatch

      public boolean nextMatch() throws IOException
      Description copied from class: PhraseMatcher
      Find the next match on the current document, returning false if there are none.
      Specified by:
      nextMatch in class PhraseMatcher
      Throws:
      IOException
    • captureLead

      private void captureLead(PhrasePositions pp) throws IOException
      Throws:
      IOException
    • startPosition

      public int startPosition()
      Description copied from class: PhraseMatcher
      The start position of the current match
      Specified by:
      startPosition in class PhraseMatcher
    • endPosition

      public int endPosition()
      Description copied from class: PhraseMatcher
      The end position of the current match
      Specified by:
      endPosition in class PhraseMatcher
    • startOffset

      public int startOffset() throws IOException
      Description copied from class: PhraseMatcher
      The start offset of the current match
      Specified by:
      startOffset in class PhraseMatcher
      Throws:
      IOException
    • endOffset

      public int endOffset() throws IOException
      Description copied from class: PhraseMatcher
      The end offset of the current match
      Specified by:
      endOffset in class PhraseMatcher
      Throws:
      IOException
    • advancePP

      private boolean advancePP(PhrasePositions pp) throws IOException
      advance a PhrasePosition and update 'end', return false if exhausted
      Throws:
      IOException
    • advanceRpts

      private boolean advanceRpts(PhrasePositions pp) throws IOException
      pp was just advanced. If that caused a repeater collision, resolve by advancing the lesser of the two colliding pps. Note that there can only be one collision, as by the initialization there were no collisions before pp was advanced.
      Throws:
      IOException
    • lesser

      private PhrasePositions lesser(PhrasePositions pp, PhrasePositions pp2)
      compare two pps, but only by position and offset
    • collide

      private int collide(PhrasePositions pp)
      index of a pp2 colliding with pp, or -1 if none
    • initPhrasePositions

      private boolean initPhrasePositions() throws IOException
      Initialize PhrasePositions in place. A one time initialization for this scorer (on first doc matching all terms):
      • Check if there are repetitions
      • If there are, find groups of repetitions.
      Examples:
      1. no repetitions: "ho my"~2
      2. repetitions: "ho my my"~2
      3. repetitions: "my ho my"~2
      Returns:
      false if PPs are exhausted (and so current doc will not be a match)
      Throws:
      IOException
    • initSimple

      private void initSimple() throws IOException
      no repeats: simplest case, and most common. It is important to keep this piece of the code simple and efficient
      Throws:
      IOException
    • initComplex

      private boolean initComplex() throws IOException
      with repeats: not so simple.
      Throws:
      IOException
    • placeFirstPositions

      private void placeFirstPositions() throws IOException
      move all PPs to their first position
      Throws:
      IOException
    • fillQueue

      private void fillQueue()
      Fill the queue (all pps are already placed
    • advanceRepeatGroups

      private boolean advanceRepeatGroups() throws IOException
      At initialization (each doc), each repetition group is sorted by (query) offset. This provides the start condition: no collisions.

      Case 1: no multi-term repeats
      It is sufficient to advance each pp in the group by one less than its group index. So lesser pp is not advanced, 2nd one advance once, 3rd one advanced twice, etc.

      Case 2: multi-term repeats

      Returns:
      false if PPs are exhausted.
      Throws:
      IOException
    • initFirstTime

      private boolean initFirstTime() throws IOException
      initialize with checking for repeats. Heavy work, but done only for the first candidate doc.

      If there are repetitions, check if multi-term postings (MTP) are involved.

      Without MTP, once PPs are placed in the first candidate doc, repeats (and groups) are visible.
      With MTP, a more complex check is needed, up-front, as there may be "hidden collisions".
      For example P1 has {A,B}, P1 has {B,C}, and the first doc is: "A C B". At start, P1 would point to "A", p2 to "C", and it will not be identified that P1 and P2 are repetitions of each other.

      The more complex initialization has two parts:
      (1) identification of repetition groups.
      (2) advancing repeat groups at the start of the doc.
      For (1), a possible solution is to just create a single repetition group, made of all repeating pps. But this would slow down the check for collisions, as all pps would need to be checked. Instead, we compute "connected regions" on the bipartite graph of postings and terms.

      Throws:
      IOException
    • sortRptGroups

      private void sortRptGroups(ArrayList<ArrayList<PhrasePositions>> rgs)
      sort each repetition group by (query) offset. Done only once (at first doc) and allows to initialize faster for each doc.
    • gatherRptGroups

      private ArrayList<ArrayList<PhrasePositions>> gatherRptGroups(LinkedHashMap<Term,Integer> rptTerms) throws IOException
      Detect repetition groups. Done once - for first doc
      Throws:
      IOException
    • tpPos

      private int tpPos(PhrasePositions pp)
      Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset
    • repeatingTerms

      private LinkedHashMap<Term,Integer> repeatingTerms()
      find repeating terms and assign them ordinal values
    • repeatingPPs

      private PhrasePositions[] repeatingPPs(HashMap<Term,Integer> rptTerms)
      find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRpts
    • ppTermsBitSets

      private ArrayList<FixedBitSet> ppTermsBitSets(PhrasePositions[] rpp, HashMap<Term,Integer> tord)
      bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is set
    • unionTermGroups

      private void unionTermGroups(ArrayList<FixedBitSet> bb)
      union (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different terms
    • termGroups

      private HashMap<Term,Integer> termGroups(LinkedHashMap<Term,Integer> tord, ArrayList<FixedBitSet> bb) throws IOException
      map each term to the single group that contains it
      Throws:
      IOException