RFC 0027 - The Future of Search #43

mta-umbraco (Maintainer) · 2025-01-08T09:36:24Z

Request for Comments: The Future of Search

Read the full RFC document here.

This RFC discusses adding a new search abstraction to Umbraco. We would love your feedback on the described feature.

How do I contribute?

Most importantly, we don’t want to miss anything, so anything goes: clarifications, questions, suggestions, etc.

Please do the following things if you want to contribute:

- Read the RFC document here
- Please read and respect the RFC Code of Conduct
- Come back here and add comments down below 👇

Shazwazza · 2025-01-08T16:42:06Z

Thanks @kjac + @bergmania for putting this together. I have some suggestions and questions:

Document structure

As this document is basically a design/spec, ideally it would have a section before the detailed design for Requirements that captures both functional and non-functional requirements that this spec needs to fulfill. My suggestion would be to organize the headings as:

- Summary
- Terminology
- Objectives (instead of Motivation)
  - Requirements
    - Functional - typically in user story format (e.g. The indexing must explicitly exclude sensitive data)
    - Non-Functional - quantifiable requirements (e.g. Index rebuilding performance should improve by XX%)
  - Out of Scope
- Detailed Design

This way the reader knows exactly what the detailed design is expected to deliver. Further, the terminology for requirements is: “Must” means mandatory/required. “May” means permitted. “Should” means recommended.

So far the only listed requirements are:

- The abstraction should be easy to use.
- It should be possible to implement with any search provider.

From what I can gather in the doc, the requirements (not limited to) are:

- The abstraction should be easy to use.
- It should be possible to implement with any search provider.
- Indexing and searching must support filtering, faceting, and full text search.
- Indexing and search should support Autocomplete/Suggestions functionality.
- Indexing must support both protected content and content variance.
- Indexed data must explicitly exclude sensitive data.
- Indexing must only be performed on servers with the Publisher role.
- Additional indexes should be possible to define (for example Umbraco Forms).
- Master data should not be stored within the indexes.
- The process of rebuilding an index must have a clear indexing start and end status that is transactional.
- The indexing API should allow for customizing the index document before it is indexed.
- Field boosting should be automatically applied for the individual document fields based on a set of rules.

There might be others that I've missed but IMO it would be good to call all of this out at the start of the document to make it clear what outcomes are expected.

Yet to be defined

There are several items in the document explaining what hasn't been defined yet. I think these should also be called out within a single section (maybe at the top of Detailed Design?)

- Index Extensibility - "also index these items" (etc).
- Search Extensibility - "We also envision an extension model for searching".
- Index rebuilding start/end transactional status.
- Field naming conventions (i.e. prefix).
- Configuration of the search and indexing abstraction.

Concerns

"Replacing the current Examine-based implementation"

In the Summary this sentence is telling folks that you are removing Examine which is not actually the case. I would suggest removing this part of the sentence since it is not what the intention of this document is.

Separating media and documents into separate indexes

Generally, fewer indexes are better, both for performance and for manageability and cost when using search-as-a-service providers, since many providers charge for additional indexes or limit the number of indexes. Further, many search providers do not support cross-index searching in a single search operation, which means that a single query could not find both media and content. Scoring between content and media would then not be possible either, because you'd have to execute two disparate queries.

I'm unsure what the real reason would be to separate these indexes? If there is a specific requirement that this is fulfilling it should be called out in the requirements section.

Field prefixes

Some search providers have strict limitations on how fields are named and this should strive to ensure that field names align with all search provider limitations. This should be trivial by simply not using special chars in the naming. For example, here's the limitation for Azure Search: First character must be a letter or number, No consecutive dashes or underscores.

I would suggest that one of the requirements should be: Index field names must be named with a convention that supports all search providers.
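
As a rough illustration of the suggested requirement, the naming rule quoted above could be checked with a small helper. This is only a sketch of the rule as quoted in this comment (first character a letter or number, no consecutive dashes or underscores). The function name and the restriction to ASCII letters, digits, dashes and underscores are assumptions, and this is not any provider's complete naming specification.

```python
import re

# Hypothetical helper: encodes only the rule as quoted in this thread,
# not the full naming specification of any search provider.
_ALLOWED = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")  # assumed character set
_CONSECUTIVE = re.compile(r"[-_]{2,}")               # "--", "__", "-_" or "_-"


def is_portable_field_name(name: str) -> bool:
    """Return True if `name` starts with a letter or digit, contains only
    letters, digits, dashes and underscores, and has no run of two or
    more dashes/underscores."""
    if _ALLOWED.fullmatch(name) is None:
        return False
    return _CONSECUTIVE.search(name) is None
```

A name such as `umb_name` would pass this check, while `_systemId` (leading underscore) or `umb__name` (consecutive underscores) would not.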

Questions

- My assumption is that this abstraction is purely meant to support Umbraco-specific operations such as all back office search requirements and basic/standard search functionality for the front-end. Is that right? For example, I assume that the faceting requirement for this abstraction exists because the back office will at some point use this functionality? Else, I think the abstraction should reduce its surface area as much as possible to only support what is absolutely mandatory for the back office.
- The section regarding "The index data format will include the following", which lists Texts, TextsH1/H2/etc., could be made clearer as to what is meant by "TextsH*" fields. Are "TextsH*" fields specifically talking about 'headers', or is this more abstract and talking about automatic boosting?

liamlaverty · 2025-01-09T08:01:44Z

Thanks for releasing this RFC.

Support for date/times

In the Out of scope section, dates and times are listed as excluded (though DateTimeOffset is included elsewhere in the RFC). Please consider support for dates and times as early as possible. The ability to search within date/time ranges is a consistent requirement for medium-to-large Umbraco sites.

The following scenario may be covered by the included DateTimeOffset: I've worked on many sites which have features like an Events or Our History listing page, where we need to know "what events will happen in Feb 2025" or "what did happen in July 1990". I'd assume most agencies encounter requests for this genre of feature on most builds. Without it as a core feature, the community will end up creating multiple different implementations, or our clients will end up using third-party websites for those features. Hopefully this feature can be achieved with DateTimeOffset, but if not, it'd be worth giving higher priority to those out-of-scope date and time features.

Support for providers

What's the extent of "support" in the statement:

"We will not support other search providers in the initial release."

I've read it like "[Umbraco] will not implement other search providers for the initial release", but site maintainers will be able to implement their own alternative providers with the initial release. Is that correct, or would this implementation work exclusively with Examine at the initial release?

bielu · 2025-01-09T09:41:15Z

I was one of the people who made a few attempts at introducing abstractions, so let me start my comments with the most important points:

- Persisting index data in the database: this whole section doesn't make sense for any provider other than Examine, so this should be a feature of the provider, not of the Umbraco abstraction.
- Faceting is a feature which is really hard to abstract, especially if we look at facet support in Examine, which has a specific way of implementing facets by modifying indexes. Elasticsearch, for example, allows faceting on any field without modification.
- Variance: with the new abstractions, we should support indexing into one or multiple indexes. There are multiple reasons why different cultures should be indexed separately (ranging from analyzers to geo load balancing).
- I think Umbraco should drop the Examine provider as the default provider in favor of something not file-based (such as Lifti or a similar provider).
- "Umbraco will be handing off either individual content items or collections of content for indexing, but for large sites it will eventually not be possible to hand off all content in a single operation. However, we still want to support zero downtime re-indexing if at all possible." Zero downtime indexing should only be offered for providers which support it, such as Elasticsearch. Trying to implement it on top of Examine means a lot of storage and memory usage.

bergmania (Collaborator) · 2025-01-09T11:01:16Z

Thank you all for the valuable feedback so far! #H5RY

@Shazwazza

Suggestions for document structure
Thank you for your recommendations. My concern is that listing elements in the document may lead to discrepancies as the document evolves through amendments.

Your concerns

“Replacing the current Examine-based implementation”

You’re absolutely right—this statement needs clarification. While it’s correct that the current Examine-based implementation will be replaced, it doesn’t imply that Examine will be entirely removed. At present, removing Examine is not part of the plan.

“Separating media and documents into separate indexes”

I understand your concern, particularly with Azure Search. From an abstraction perspective, however, we believe separating these types is appropriate. That said, custom implementations of the abstraction can consolidate this data into a single index, using a type field for filtering during searches.
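
To make the consolidation idea concrete, here is a minimal sketch of a single index whose documents carry a type field that searches can filter on. Everything in it (the `IndexDocument` class, the `type` values, the naive substring matching) is hypothetical and only illustrates the shape of such an implementation, not the RFC's actual API.

```python
from dataclasses import dataclass


@dataclass
class IndexDocument:
    key: str
    type: str  # hypothetical discriminator, e.g. "document" or "media"
    text: str


def search(index, term, types=None):
    """Naive substring search over one consolidated index.

    `types` optionally restricts the results to the given type values;
    when it is None, documents of every type are searched together.
    """
    term = term.lower()
    return [
        doc
        for doc in index
        if term in doc.text.lower() and (types is None or doc.type in types)
    ]


index = [
    IndexDocument("1", "document", "About our logo guidelines"),
    IndexDocument("2", "media", "logo.png company logo"),
    IndexDocument("3", "media", "team-photo.jpg"),
]

all_hits = search(index, "logo")               # content and media in one query
media_hits = search(index, "logo", {"media"})  # filtered by the type field
```

With one physical index, a single query can return both content and media, and the type filter recovers the separate "views" when they are needed.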

Field prefixes

Thanks for highlighting Azure Search’s limitations in this area. Using universally safe prefixes is indeed wise. However, implementations can ultimately adapt field names to accommodate specific providers where necessary.

Regarding your questions:

“My assumption for this abstraction is purely to support Umbraco-specific operations…”

The objective is to deliver a robust abstraction that serves both the back office and websites. Features like faceting and basic filtering reflect this dual focus, as outlined in the RFC:

rfcs/cms/0027-the-future-of-search.md, line 22 in 8e06e7f:

"This RFC proposes a new search and indexing abstraction for Umbraco, replacing the current Examine-based implementation."

“The section regarding ‘The index data format’…”

Your interpretation is correct. We will clarify this section to specify that the “TextsH*” fields represent a hierarchy akin to HTML headings versus regular text. The abstraction provides seven “buckets” for implementation-specific boosting with a defined order of importance.
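
As a sketch of how an implementation might use such ordered buckets, the snippet below assigns descending boosts and scores a document by weighted term hits. The bucket names beyond those mentioned in this thread (TextsH1 to TextsH6 plus Texts as the seven buckets) and the numeric weights are assumptions; the RFC only defines an order of importance, and each implementation would choose its own boosting.

```python
# Assumed bucket names: the thread mentions Texts and TextsH1/H2/etc.;
# TextsH1..TextsH6 plus Texts is a guess at the seven buckets.
BUCKETS = ["TextsH1", "TextsH2", "TextsH3", "TextsH4", "TextsH5", "TextsH6", "Texts"]

# Hypothetical weights: 7.0 for TextsH1 down to 1.0 for Texts.
# Only the ordering is meaningful here.
BOOSTS = {name: float(len(BUCKETS) - i) for i, name in enumerate(BUCKETS)}


def score(document: dict, term: str) -> float:
    """Sum the whole-word hits of `term` per bucket, weighted by the bucket's boost."""
    term = term.lower()
    total = 0.0
    for bucket, values in document.items():
        hits = sum(value.lower().split().count(term) for value in values)
        total += BOOSTS.get(bucket, 0.0) * hits
    return total


heading_match = {"TextsH1": ["Search basics"], "Texts": ["An introduction"]}
body_match = {"Texts": ["All about search and search again"]}
```

Under these assumed weights, a single hit in the most important bucket outranks two hits in plain text.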

@liamlaverty

Thank you for your input—it’s greatly appreciated.

Support for date/times

Initially, we considered including date/time support but scoped it out for the first version to simplify implementation. Most scenarios can still be addressed using DateTimeOffset, including your example. However, we recognize there may be cases requiring expanded type support and are open to revisiting this.
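
For illustration, the "events in Feb 2025" example can be expressed as a half-open range filter over offset-aware timestamps. This sketch uses Python's timezone-aware `datetime` as a stand-in for DateTimeOffset; the event data is made up, and treating "Feb 2025" as a UTC month is an assumption.

```python
from datetime import datetime, timedelta, timezone


def in_range(value, start, end):
    """Half-open range check [start, end) on offset-aware datetimes."""
    return start <= value < end


events = {
    "New Year talk": datetime(2025, 1, 10, 18, 0, tzinfo=timezone.utc),
    "Meetup": datetime(2025, 2, 14, 18, 0, tzinfo=timezone.utc),
    # 23:30 on 31 Jan at UTC-02:00 is 01:30 on 1 Feb in UTC.
    "Launch party": datetime(
        2025, 1, 31, 23, 30, tzinfo=timezone(timedelta(hours=-2))
    ),
}

feb_start = datetime(2025, 2, 1, tzinfo=timezone.utc)
feb_end = datetime(2025, 3, 1, tzinfo=timezone.utc)

events_in_feb = [
    name for name, when in events.items() if in_range(when, feb_start, feb_end)
]
```

Because the values carry their offset, comparisons happen on the same instant regardless of the local time each event was stored with.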

Support for providers

You understood correctly, though we will refine the RFC to make this clearer. Initially, HQ will provide and support a single implementation. However, we aim to foster community-driven contributions and hope packages for additional search providers will emerge. HQ plans to prototype a single alternative implementation to validate the abstraction’s viability.

@bielu

Thank you for your input. Taking your points in order:

- We disagree: the search provider type does not change the fact that data preparation for search operations, particularly with nested structures like block grids/lists, is computationally intensive.
- While faceting is inherently challenging to abstract, we believe including it is important to address the majority of common use cases. Some implementations may opt not to provide it, but the abstraction supports it.
- Ultimately this will be up to the implementations, as long as the abstraction has the information available to the implementation at both index and search time.
- Your point is valid. Currently, Examine remains the best default for local development due to its feature set and compatibility.
- The primary focus is ensuring the abstraction supports functionality that implementations like Elasticsearch may require.

8 replies

mzajkowski Jan 9, 2025

The index data stored in the database is the "raw" input values for the providers. They're not meant to differ between providers - the providers might choose to interpret and index them differently, but that's on "the other side of the fence", so to speak.

You can consider the stored data as a cached version of raw index data, just like we store a cached version of published content data.

The question it raised for me now is: do we need to store both then? Maybe there would be an option to "deserialize" cached content for the purpose of searching it?

That probably opens a can of worms underneath it too, e.g. what about tokenization of content for local usage? Is the plan to ship with Examine as a package/initial provider, or e.g. to build an Umbraco non-Examine provider that will handle the basic usages on top of the abstracted services?

Good discussions and points there. Looking forward to more perspectives!

kjac Jan 9, 2025
Collaborator

@mzajkowski we want to store the raw index values (as perceived by Umbraco) in the database for the sole reason of sending them to whatever search abstraction implementation might be enabled. There are a few potential benefits in terms of queueing/async indexing operations with this approach, but the predominant reason is for re-indexing. If we pre-calculate (cache) all raw index values at save/publish time, we can significantly speed up the whole re-indexing process.

The indexed data is logically stored too - by the search abstraction implementation, that is... in an Examine index, or Elastic, or whatever search technology. I'm not sure I see a reason to store the actually indexed values back into the database, if that's what you're suggesting?
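
The pre-calculation described above could be sketched roughly as follows. All names here are hypothetical: a dictionary stands in for the database table, `build_raw_values` stands in for the expensive extraction step, and another dictionary stands in for whatever index the enabled implementation maintains.

```python
# Stand-in for the database table holding the raw index values.
raw_value_store = {}


def build_raw_values(content):
    """Placeholder for the expensive step (e.g. walking nested block structures)."""
    return {"TextsH1": [content["title"]], "Texts": [content["body"]]}


def on_publish(content):
    """Pre-calculate and cache the raw index values at save/publish time."""
    raw_value_store[content["key"]] = build_raw_values(content)


def rebuild_index(provider_index):
    """Re-index from the cached raw values without re-reading the content."""
    provider_index.clear()
    for key, raw_values in raw_value_store.items():
        provider_index[key] = raw_values
```

The point of the sketch is that `rebuild_index` never calls `build_raw_values`, so the cost of a rebuild is bounded by reading the cache and handing the values to the provider.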

kjac Jan 9, 2025
Collaborator

@bielu good point. I have amended the RFC with a little clarification - added this line

JasonElkin Jan 9, 2025

This is helpful.

Though, I think it would be helpful to clarify why this is necessary, as I think it's a counter-intuitive approach for anyone not familiar with the underlying problem - namely that IContent (well, the persistence layer behind it) is not well optimised for indexing.

Shazwazza Jan 9, 2025

@bergmania

Suggestions for document structure
Thank you for your recommendations. My concern is that listing elements in the document may lead to discrepancies as the document evolves through amendments.

IMO, not listing up-front the requirements that the document is fulfilling means that there are no concrete requirements in the first place. I don't think all my suggestions need to be taken into account 100%; it's totally up to you folks. But listing functional requirements up-front is, I feel, the most important part of any design document. It provides clarity on exactly what needs to be achieved, and since this is a living document, of course these requirements can and will change until it is accepted. Up to you though; otherwise, I'll try to keep the requirements list in my discussion section up to date.

From an abstraction perspective, however, we believe separating these types is appropriate.

It may be worth calling this out in the doc that these 4 'indexes' may not actually be real indexes, more that these are 4 'interfaces' to this index/search data. I'm not sure what the terminology could be.

But then this leads to another question (see below).

Some implementations may opt not to provide it, but the abstraction supports it.

This is then a broken abstraction. There are some reasons why the Examine search abstraction can't support all search providers. It can partially support pretty much all of them, but for features that can't be supported you have to start throwing runtime NotSupportedExceptions, which is not exactly friendly. I would urge you to consider only supporting the requirements that are necessary to run Umbraco itself and to reduce this abstraction's surface area as much as possible, so that we don't end up with an abstraction that cannot support all/most search providers and resorts to runtime NotSupportedExceptions.
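
A small sketch of the concern, using Python's `NotImplementedError` as a stand-in for .NET's `NotSupportedException`. The class and method names are hypothetical and not taken from the RFC; the snippet only shows how an optional feature in an abstraction pushes defensive handling onto every caller.

```python
class SearchProvider:
    """Hypothetical abstraction in which faceting is an optional feature."""

    def search(self, term):
        raise NotImplementedError

    def facets(self, field):
        raise NotImplementedError


class MinimalProvider(SearchProvider):
    """A provider that implements searching but not faceting."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, term):
        return [doc for doc in self.docs if term in doc["text"]]

    # facets() is deliberately not overridden, so calling it fails at runtime.


def facets_or_none(provider, field):
    """Callers cannot know up front whether facets exist; they must catch."""
    try:
        return provider.facets(field)
    except NotImplementedError:
        return None
```

Nothing in the type of `MinimalProvider` tells a caller that `facets` will fail, which is why every call site ends up wrapped as in `facets_or_none`.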

But then this leads to more questions (see below).

Questions

Regarding the 4 'indexes'/'interfaces': Will your abstraction then prevent or support cross-'interface' searching? And would the assumption be that scoring is taken into account across those searches?
Regarding the abstraction: Faceting, Autocomplete/suggestions - are these requirements for the CMS? If not, please don't include these in the Umbraco abstraction as it will lead to the above concern, and then you will also be forced to try/catch for NotSupportedException within the backoffice.
Do you plan on taking into account media files, their content, etc...? How will you help implementations fulfill the requirements for indexing the content within files? It would be nice to try to make this easier for implementations and also ensure it supports putting the extracted data into the same media document/index. There are various ways that this can be done by implementations:
- The brute force way - for example, like the Umbraco Examine PDF plugin where this ties into extension points, figures out how to read the file, passes this content through some external packages, extract the text from file content and then store this into an index. Currently, this PDF package indexes this content into yet-another-index which is not ideal.
- Using Blob Storage - for example, in ExamineX with Azure Search, the implementation ties into media saving events, adds the required metadata to the media blobs and then behind the scenes Azure Search runs an indexer process to read the file contents and then updates the media index for that document with the extracted content. Elastic also supports this type of scenario but only with higher level licenses.
- Using provider specific extensions - for example, Elastic supports sending up a base64 version of the file to an endpoint to be analyzed/extracted and then populate a field for that document/index.
Not related to this document, but more the RFC process - where should we be posting these questions/comments? Does it make more sense to keep conversations within the threads or keep making new ones?
