Index sections #165

Merged
merged 25 commits into from
Nov 3, 2023
Changes from 20 commits
Commits
74bdaf9
prototype of index sections
mschoch Aug 31, 2021
9ccb8e5
thoughts on refactoring the inv index to section code
Thejas-bhat Jun 7, 2023
89991d8
refactoring 1/index as per sections interface - wip
Thejas-bhat Jun 27, 2023
f217f81
code cleanup and populating dictLocs and dv offsets
Thejas-bhat Jun 27, 2023
3d37b27
resetting inverted index opaque values
Thejas-bhat Jul 4, 2023
81a356f
bug fixes with respect to unit tests
Thejas-bhat Jul 5, 2023
afc0f2b
code cleanup; removing numeric range section
Thejas-bhat Jul 5, 2023
943961a
notes on upgrade stuff
Thejas-bhat Jul 7, 2023
c88aa37
handling older file formats - wip
Thejas-bhat Jul 12, 2023
abc054d
using varint encoding for storing offsets
Thejas-bhat Jul 19, 2023
dd4d018
handling loadDvReaders for older file formats as well
Thejas-bhat Jul 24, 2023
4184ccc
zap version change
Thejas-bhat Aug 9, 2023
25fd426
init var for dv readers
Thejas-bhat Aug 9, 2023
79e04c4
bug fix: loading the doc values for older index files
Thejas-bhat Sep 11, 2023
037a94f
bug fix determining the correct portion of index file for sb.mem
Thejas-bhat Sep 11, 2023
bd6d08f
refactoring code, better reuse of objects
Thejas-bhat Sep 11, 2023
35abf14
unit test fixes
Thejas-bhat Sep 13, 2023
245215a
code comments and some naming changes
Thejas-bhat Sep 15, 2023
69ff4fe
renaming vars, files and some refactoring of code
Thejas-bhat Sep 18, 2023
e49a45b
added licensing files
Thejas-bhat Sep 18, 2023
b8f406e
trimming mergeToWriter API signature
Thejas-bhat Sep 20, 2023
eb9eb11
using uint32 for docNum
Thejas-bhat Sep 26, 2023
6ed2ef3
Sections code for vector search (#166)
Thejas-bhat Nov 2, 2023
b610c16
Merge branch 'master' into refactor-section
abhinavdangeti Nov 2, 2023
fdff547
docNum to be of type uint32 in section's Process(..)
abhinavdangeti Nov 3, 2023
44 changes: 26 additions & 18 deletions build.go
@@ -24,8 +24,8 @@ import (
 	"github.com/blevesearch/vellum"
 )
 
-const Version uint32 = 15
-
+const Version uint32 = 16
+const IndexSectionsVersion uint32 = 16
 const Type string = "zap"
 
 const fieldNotUninverted = math.MaxUint64
@@ -98,7 +98,7 @@ func persistSegmentBaseToWriter(sb *SegmentBase, w io.Writer) (int, error) {
 		return 0, err
 	}
 
-	err = persistFooter(sb.numDocs, sb.storedIndexOffset, sb.fieldsIndexOffset,
+	err = persistFooter(sb.numDocs, sb.storedIndexOffset, sb.fieldsIndexOffset, sb.sectionsIndexOffset,
 		sb.docValueOffset, sb.chunkMode, sb.memCRC, br)
 	if err != nil {
 		return 0, err
@@ -159,25 +159,33 @@ func persistStoredFieldValues(fieldID int,
 
 func InitSegmentBase(mem []byte, memCRC uint32, chunkMode uint32,
 	fieldsMap map[string]uint16, fieldsInv []string, numDocs uint64,
-	storedIndexOffset uint64, fieldsIndexOffset uint64, docValueOffset uint64,
-	dictLocs []uint64) (*SegmentBase, error) {
+	storedIndexOffset uint64, dictLocs []uint64,
+	sectionsIndexOffset uint64) (*SegmentBase, error) {
 	sb := &SegmentBase{
-		mem:               mem,
-		memCRC:            memCRC,
-		chunkMode:         chunkMode,
-		fieldsMap:         fieldsMap,
-		fieldsInv:         fieldsInv,
-		numDocs:           numDocs,
-		storedIndexOffset: storedIndexOffset,
-		fieldsIndexOffset: fieldsIndexOffset,
-		docValueOffset:    docValueOffset,
-		dictLocs:          dictLocs,
-		fieldDvReaders:    make(map[uint16]*docValueReader),
-		fieldFSTs:         make(map[uint16]*vellum.FST),
+		mem:                 mem,
+		memCRC:              memCRC,
+		chunkMode:           chunkMode,
+		fieldsMap:           fieldsMap,
+		fieldsInv:           fieldsInv,
+		numDocs:             numDocs,
+		storedIndexOffset:   storedIndexOffset,
+		fieldsIndexOffset:   sectionsIndexOffset,
+		sectionsIndexOffset: sectionsIndexOffset,
Member

How come the fieldsIndexOffset and sectionsIndexOffset are the same here?

Member Author

Essentially, the sections index contains the same content as the fields index, plus the sections information. So for fieldsIndexOffset, which gives info about the fields in a segment, we just consult sectionsIndexOffset to get that info. The field is also needed for backward compatibility.

+		fieldDvReaders:      make([]map[uint16]*docValueReader, len(segmentSections)),
+		docValueOffset:      0, // docvalueOffsets identified automatically by the section
+		dictLocs:            dictLocs,
+		fieldFSTs:           make(map[uint16]*vellum.FST),
 	}
 	sb.updateSize()
 
-	err := sb.loadDvReaders()
+	// load the data/section starting offsets for each field,
+	// using the sectionsIndexOffset as the starting point.
+	err := sb.loadFieldsNew()
+	if err != nil {
+		return nil, err
+	}
+
+	err = sb.loadDvReaders()
 	if err != nil {
 		return nil, err
 	}
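
The review thread above says that fieldsIndexOffset is served from sectionsIndexOffset in the new format and is kept for backward compatibility. A minimal sketch of how a reader might pick the starting offset by file version is given below. Only the constant IndexSectionsVersion comes from this diff; the footer struct and the fieldsStart helper are hypothetical names for illustration, and the version check is an assumption about how older files could be handled, not the code merged in this PR.

```go
package main

import "fmt"

// IndexSectionsVersion mirrors the constant introduced in this PR.
const IndexSectionsVersion uint32 = 16

// footer is a hypothetical, simplified stand-in for the offsets
// persisted at the end of a segment file.
type footer struct {
	version             uint32
	fieldsIndexOffset   uint64
	sectionsIndexOffset uint64
}

// fieldsStart (hypothetical helper) returns the offset from which field
// metadata should be read. Files written at or after IndexSectionsVersion
// use the sections index; older files fall back to the fields index.
func fieldsStart(f footer) uint64 {
	if f.version >= IndexSectionsVersion {
		return f.sectionsIndexOffset
	}
	return f.fieldsIndexOffset
}

func main() {
	older := footer{version: 15, fieldsIndexOffset: 1024}
	newer := footer{version: 16, sectionsIndexOffset: 4096}
	fmt.Println(fieldsStart(older), fieldsStart(newer)) // prints "1024 4096"
}
```

Keeping both offsets in the struct lets callers that still read fieldsIndexOffset continue to work, which matches the backward-compatibility point made in the reply.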
3 changes: 1 addition & 2 deletions docvalues.go
@@ -151,7 +151,6 @@ func (s *SegmentBase) loadFieldDocValueReader(field string,
 	s.incrementBytesRead(offset)
 	// set the data offset
 	fdvIter.dvDataLoc = fieldDvLocStart
-
 	return fdvIter, nil
 }
 
@@ -310,7 +309,7 @@ func (s *SegmentBase) VisitDocValues(localDocNum uint64, fields []string,
 			continue
 		}
 		fieldID := fieldIDPlus1 - 1
-		if dvIter, exists := s.fieldDvReaders[fieldID]; exists &&
+		if dvIter, exists := s.fieldDvReaders[sectionInvertedIndex][fieldID]; exists &&
 			dvIter != nil {
 			dvs.dvrs[fieldID] = dvIter.cloneInto(dvs.dvrs[fieldID])
 		}
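
The change above turns fieldDvReaders from a single map into a slice of maps indexed by section. A minimal sketch of that lookup shape follows. The name sectionInvertedIndex appears in the diff, but its value here, the sectionOther constant, the dvReader type, and the lookup helper are all assumptions made for illustration, not zap's actual definitions.

```go
package main

import "fmt"

// Section identifiers. sectionInvertedIndex is named in the diff; the value
// and the second entry are hypothetical.
const (
	sectionInvertedIndex = iota
	sectionOther // hypothetical second section, for illustration only
	numSections
)

// dvReader is a simplified stand-in for zap's docValueReader.
type dvReader struct{ field string }

// lookup (hypothetical helper) returns the reader registered for fieldID in
// the given section. Reading from a nil map is safe in Go, so sections that
// never registered a reader simply report "not found".
func lookup(readers []map[uint16]*dvReader, section int, fieldID uint16) (*dvReader, bool) {
	if section < 0 || section >= len(readers) {
		return nil, false
	}
	r, ok := readers[section][fieldID]
	return r, ok && r != nil
}

func main() {
	// One map per section, as in make([]map[uint16]*docValueReader, len(segmentSections)).
	readers := make([]map[uint16]*dvReader, numSections)
	readers[sectionInvertedIndex] = map[uint16]*dvReader{0: {field: "title"}}

	if r, ok := lookup(readers, sectionInvertedIndex, 0); ok {
		fmt.Println("found:", r.field) // prints "found: title"
	}
	_, ok := lookup(readers, sectionOther, 0)
	fmt.Println("other section has reader:", ok) // prints "other section has reader: false"
}
```

Indexing by section first keeps each section's doc value readers separate, so two sections can hold readers for the same field ID without colliding.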