Indexing

Romain Ruaud edited this page Nov 13, 2017 · 5 revisions

Indexing content

In this chapter you will learn how Elasticsuite indexes content into Elasticsearch.

This guide does not cover Elasticsearch basics, such as "what is an index?" or "what is a field?". You should already know the main concepts of Elasticsearch before exploring this guide.

Indices

Index declaration

Elasticsuite will create an index in Elasticsearch for each entity type and store view.

For now, indexed entity types are Products, Categories, and Synonyms.

The indices' names are based on:

  • the alias
  • the store code
  • the entity type identifier
  • a timestamp (date and time)

Let's say we have a Magento store with 2 store views (with 'en' and 'fr' as store codes) and the alias set to magento2. The following indices will be created:

  • magento2_en_catalog_category_20171110_113448
  • magento2_en_catalog_product_20171110_113610
  • magento2_en_thesaurus_20171110_113449
  • magento2_fr_catalog_category_20171110_113448
  • magento2_fr_catalog_product_20171110_113610
  • magento2_fr_thesaurus_20171110_113449
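
The naming scheme visible in the list above can be sketched as follows. This is only an illustration, not Elasticsuite's actual code: the buildIndexName function is hypothetical, and the Ymd_His timestamp format is inferred from the sample names.

```php
<?php
// Hypothetical helper: not part of Elasticsuite, shown only to illustrate the naming scheme.
function buildIndexName(string $alias, string $storeCode, string $identifier, DateTimeInterface $createdAt): string
{
    // Timestamp format assumed from the sample names (e.g. 20171110_113448).
    return sprintf('%s_%s_%s_%s', $alias, $storeCode, $identifier, $createdAt->format('Ymd_His'));
}

$createdAt = new DateTimeImmutable('2017-11-10 11:34:48');
foreach (['en', 'fr'] as $storeCode) {
    echo buildIndexName('magento2', $storeCode, 'catalog_category', $createdAt), PHP_EOL;
}
// prints:
// magento2_en_catalog_category_20171110_113448
// magento2_fr_catalog_category_20171110_113448
```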

The configuration of these indices is driven by the elasticsuite_indices.xml file. You can declare a new elasticsuite_indices.xml file in your module if you plan to index other entities.

Let's see how it is declared for the products index :

<indices xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="urn:magento:module:Smile_ElasticsuiteCore:etc/elasticsuite_indices.xsd">

    <index identifier="catalog_product" defaultSearchType="product">
        <type name="product" idFieldName="entity_id">

    ...

Indexer Model

The indexer is declared in Magento's indexer.xml. The product indexer is not shown here, since it is already declared by Magento and only modified by Elasticsuite. Here is how it is done for category indexing:

    <indexer id="elasticsuite_categories_fulltext" view_id="elasticsuite_categories_fulltext" class="Smile\ElasticsuiteCatalog\Model\Category\Indexer\Fulltext">
        <title translate="true">ElasticSuite Category Indexing</title>
        <description translate="true">Reindex ElasticSuite catalog categories.</description>
    </indexer>

Finally, your indexing model must use the proper Indexer Handler (which must extend \Smile\ElasticsuiteCore\Indexer\GenericIndexerHandler) and have the proper index name and type defined. This can be done via DI.

E.g. for the categories:

    <virtualType name="catalogCategorySearchIndexHandler" type="\Smile\ElasticsuiteCore\Indexer\GenericIndexerHandler">
        <arguments>
            <argument name="indexName" xsi:type="string">catalog_category</argument>
            <argument name="typeName" xsi:type="string">category</argument>
        </arguments>
    </virtualType>

    <type name="Smile\ElasticsuiteCatalog\Model\Category\Indexer\Fulltext">
        <arguments>
            <argument name="indexerHandler" xsi:type="object">catalogCategorySearchIndexHandler</argument>
        </arguments>
    </type>

Now it's time to write your Indexer Model.

Take a look at the Elasticsuite Categories Indexer, which is basically an implementation of \Magento\Framework\Indexer\ActionInterface and \Magento\Framework\Mview\ActionInterface:

class Fulltext implements \Magento\Framework\Indexer\ActionInterface, \Magento\Framework\Mview\ActionInterface
{
    /**
     * @var string
     */
    const INDEXER_ID = 'elasticsuite_categories_fulltext';

    /**
     * @var IndexerInterface
     */
    private $indexerHandler;

    /**
     * @var StoreManagerInterface
     */
    private $storeManager;

    /**
     * @var DimensionFactory
     */
    private $dimensionFactory;

    /**
     * @var Full
     */
    private $fullAction;

    /**
     * @param Full                  $fullAction       The full index action
     * @param IndexerInterface      $indexerHandler   The index handler
     * @param StoreManagerInterface $storeManager     The Store Manager
     * @param DimensionFactory      $dimensionFactory The dimension factory
     */
    public function __construct(
        Full $fullAction,
        IndexerInterface $indexerHandler,
        StoreManagerInterface $storeManager,
        DimensionFactory $dimensionFactory
    ) {
        $this->fullAction = $fullAction;
        $this->indexerHandler = $indexerHandler;
        $this->storeManager = $storeManager;
        $this->dimensionFactory = $dimensionFactory;
    }

    /**
     * Execute materialization on ids entities
     *
     * @param int[] $ids The ids
     *
     * @return void
     */
    public function execute($ids)
    {
        $storeIds = array_keys($this->storeManager->getStores());

        foreach ($storeIds as $storeId) {
            $dimension = $this->dimensionFactory->create(['name' => 'scope', 'value' => $storeId]);
            $this->indexerHandler->deleteIndex([$dimension], new \ArrayObject($ids));
            $this->indexerHandler->saveIndex([$dimension], $this->fullAction->rebuildStoreIndex($storeId, $ids));
        }
    }

    /**
     * Execute full indexation
     *
     * @return void
     */
    public function executeFull()
    {
        $storeIds = array_keys($this->storeManager->getStores());

        foreach ($storeIds as $storeId) {
            $dimension = $this->dimensionFactory->create(['name' => 'scope', 'value' => $storeId]);
            $this->indexerHandler->cleanIndex([$dimension]);
            $this->indexerHandler->saveIndex([$dimension], $this->fullAction->rebuildStoreIndex($storeId));
        }
    }

    /**
     * {@inheritDoc}
     */
    public function executeList(array $categoryIds)
    {
        $this->execute($categoryIds);
    }

    /**
     * {@inheritDoc}
     */
    public function executeRow($categoryId)
    {
        $this->execute([$categoryId]);
    }
}

As you can see, the main work is done by the call to $this->fullAction->rebuildStoreIndex($storeId, $ids).

This model only retrieves the entities to index. It can contain some custom logic: for products, it only takes the products which are visible.
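
To make the idea of "retrieving entities to index" concrete, here is a minimal, self-contained sketch of such an action. It is not the Elasticsuite implementation: the $fetchBatch callable, the batch size, the is_active filter and the in-memory table are all assumptions made for illustration. Only the method name rebuildStoreIndex and its $storeId and $ids arguments come from the indexer shown above.

```php
<?php
/**
 * Hypothetical sketch of a "full action": yields the rows to index, batch by batch.
 * $fetchBatch stands in for a resource-model query and is an assumption of this example.
 */
function rebuildStoreIndex(callable $fetchBatch, int $storeId, ?array $ids = null, int $batchSize = 2): Generator
{
    $lastId = 0;
    do {
        // Ask for the next batch of entities whose id is greater than the last one seen.
        $rows = $fetchBatch($storeId, $ids, $lastId, $batchSize);
        foreach ($rows as $row) {
            $lastId = (int) $row['entity_id'];
            yield $lastId => $row;
        }
    } while (count($rows) > 0);
}

// In-memory stand-in for the main entity table, used only to make the sketch runnable.
$table = [
    ['entity_id' => 1, 'is_active' => true],
    ['entity_id' => 2, 'is_active' => false],
    ['entity_id' => 3, 'is_active' => true],
];
$fetchBatch = function (int $storeId, ?array $ids, int $fromId, int $limit) use ($table): array {
    $rows = array_filter($table, function (array $row) use ($ids, $fromId): bool {
        return $row['entity_id'] > $fromId
            && $row['is_active'] // custom logic: skip entities that should not be indexed
            && ($ids === null || in_array($row['entity_id'], $ids, true));
    });

    return array_slice(array_values($rows), 0, $limit);
};

$indexed = iterator_to_array(rebuildStoreIndex($fetchBatch, 1));
print_r(array_keys($indexed));
```

Yielding rows lazily keeps memory usage flat on large catalogs, since only one batch is held at a time.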

Once you have this, you are done with your index definition and your indexer model.

But for now, you are only iterating over the main table of your entities and are missing most of your data.

Let's see now how you will add content into your Elasticsearch index.

Mappings

The Mapping is the part that will define which fields are stored into Elasticsearch, how they are stored (it is different if a field is used for filtering or sorting), and what type they have.

You can read more about mappings in the Elasticsearch documentation

Data sources

Each index can have several data sources. These objects are meant to retrieve data (from MySQL, or even elsewhere if needed) and aggregate them into documents that will be sent to Elasticsearch.

Let's get back to our elasticsuite_indices.xml and see what we have for the product index:

    <index identifier="catalog_product" defaultSearchType="product">
        <type name="product" idFieldName="entity_id">
            <datasources>
                <datasource name="prices">Smile\ElasticsuiteCatalog\Model\Product\Indexer\Fulltext\Datasource\PriceData</datasource>
                <datasource name="categories">Smile\ElasticsuiteCatalog\Model\Product\Indexer\Fulltext\Datasource\CategoryData</datasource>
                <datasource name="attributes">Smile\ElasticsuiteCatalog\Model\Product\Indexer\Fulltext\Datasource\AttributeData</datasource>
                <datasource name="stock">Smile\ElasticsuiteCatalog\Model\Product\Indexer\Fulltext\Datasource\InventoryData</datasource>
            </datasources>

We have 4 datasources, each retrieving a different kind of data. Being able to have several allows us to write tiny data sources that are easy to maintain and each do one precise job.

It is also easy for anybody to add a custom data source by declaring it in a new elasticsuite_indices.xml in their module.

Each DataSource is basically a simple Model implementing Smile\ElasticsuiteCore\Api\Index\DatasourceInterface which has only one method : addData($storeId, array $indexData)

  • $storeId is the Store Id being reindexed.
  • $indexData is the "current" data being indexed. Since it may already have gone through other datasources, it can contain varying amounts of data. What matters is that the array is keyed by the value of the idFieldName defined in elasticsuite_indices.xml (entity_id for products). E.g. in the product datasources, we often do $productIds = array_keys($indexData); and then retrieve product data and add it to $indexData.

Let's see an example with the Stock Datasource, which uses a resource model to load stock data and then pushes it into $indexData:

    /**
     * Add inventory data to the index data.
     * {@inheritdoc}
     */
    public function addData($storeId, array $indexData)
    {
        $inventoryData = $this->resourceModel->loadInventoryData($storeId, array_keys($indexData));

        foreach ($inventoryData as $inventoryDataRow) {
            $productId = (int) $inventoryDataRow['product_id'];
            $indexData[$productId]['stock'] = [
                'is_in_stock' => (bool) $inventoryDataRow['stock_status'],
                'qty'         => (int) $inventoryDataRow['qty'],
            ];
        }

        return $indexData;
    }

By now, you can probably think of additional data sources that would be easy to implement:

  • fetch product ratings from the database.
  • add data coming from external services via an API if needed.
  • and so on...
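
As an illustration of the first idea, here is a minimal sketch of a ratings data source. Everything specific to it is hypothetical: the RatingData class, the $loadRatings callable and the avg_rating column are made up for the example. The local DatasourceInterface is only a stand-in that mirrors the addData() signature described above, so that the sketch runs on its own; a real module would implement Smile\ElasticsuiteCore\Api\Index\DatasourceInterface instead.

```php
<?php
// Local stand-in mirroring the addData() signature described above (assumption of this sketch).
interface DatasourceInterface
{
    public function addData($storeId, array $indexData);
}

// Hypothetical data source adding an average rating to each product document.
class RatingData implements DatasourceInterface
{
    /** @var callable Hypothetical loader: receives ($storeId, $productIds), returns rating rows */
    private $loadRatings;

    public function __construct(callable $loadRatings)
    {
        $this->loadRatings = $loadRatings;
    }

    public function addData($storeId, array $indexData)
    {
        // $indexData is keyed by product id, so its keys are the ids to load ratings for.
        $rows = ($this->loadRatings)($storeId, array_keys($indexData));

        foreach ($rows as $row) {
            $productId = (int) $row['product_id'];
            if (isset($indexData[$productId])) {
                $indexData[$productId]['rating'] = (float) $row['avg_rating'];
            }
        }

        return $indexData;
    }
}

$datasource = new RatingData(function ($storeId, array $productIds): array {
    // Stand-in for a database query or an API call.
    return [['product_id' => 10, 'avg_rating' => '4.5']];
});

$indexData = $datasource->addData(1, [10 => ['sku' => 'A'], 11 => ['sku' => 'B']]);
print_r($indexData);
```

Products without a rating row are left untouched, so the data source can be chained with the others in any order.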

Fields

Once the data sources are done, you are now able to define how the data you have just added should be indexed into Elasticsearch.

This part is basically about how data coming from Magento will be converted into Elasticsearch fields.

You can learn more about field types in the Elasticsearch documentation.

Fields defined directly in configuration file

The easiest way is to define the fields directly in the elasticsuite_indices.xml file, like this:

    <mapping>
        <!-- Static fields handled by the base indexer (not datasource) -->
        <field name="entity_id" type="integer" />
        <field name="attribute_set_id" type="integer" />
        <field name="has_options" type="boolean" />
        <field name="required_options" type="boolean" />
        <field name="created_at" type="date" />
        <field name="updated_at" type="date" />
        <field name="type_id" type="string" />
        <field name="visibility" type="integer" />
    ...

Defining field properties via configuration

In this file, you can also define custom properties of fields directly. Let's see how the SKU field is declared:

    <field name="sku" type="string">
        <isSearchable>1</isSearchable>
        <isUsedInSpellcheck>1</isUsedInSpellcheck>
        <defaultSearchAnalyzer>whitespace</defaultSearchAnalyzer>
    </field>

Here you can define the following non-required properties :

  • isSearchable (defaults to false): whether queries on this index search in this field
  • isFilterable (defaults to true): whether the field can be used in filtering queries (it is then indexed differently)
  • isUsedInSpellcheck (defaults to false): whether the engine checks for exact matches in this field
  • isUsedForSortBy (defaults to false): whether the field can be used for sorting (it is then indexed differently)
  • searchWeight (defaults to 1): the weight given to this field when searching
  • defaultSearchAnalyzer (defaults to standard): covered later, in the "Custom analysis / filtering of fields" part.
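
The way declared properties combine with these defaults can be sketched as a simple merge. This is an illustration only: the FIELD_DEFAULTS constant and the resolveFieldConfig function are hypothetical names, not part of Elasticsuite; the property names and default values are the ones listed above.

```php
<?php
// Default values as listed above; the constant and function names are hypothetical.
const FIELD_DEFAULTS = [
    'isSearchable'          => false,
    'isFilterable'          => true,
    'isUsedInSpellcheck'    => false,
    'isUsedForSortBy'       => false,
    'searchWeight'          => 1,
    'defaultSearchAnalyzer' => 'standard',
];

function resolveFieldConfig(array $declared): array
{
    // Declared properties win; anything missing falls back to its default.
    return array_merge(FIELD_DEFAULTS, $declared);
}

// Properties declared for the sku field in the XML example.
$sku = resolveFieldConfig([
    'isSearchable'          => true,
    'isUsedInSpellcheck'    => true,
    'defaultSearchAnalyzer' => 'whitespace',
]);
var_export($sku);
```

With this reading, the sku field stays filterable and keeps a search weight of 1, because neither property is overridden in its declaration.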

Complex field types

You can also store some fields as objects or nested objects. We will not cover the difference between the two here; please refer to the Elasticsearch documentation to learn more about these concepts.

E.g.: stock is stored as an object field.

    <field name="stock.is_in_stock" type="boolean" />
    <field name="stock.qty" type="integer" />

E.g.: price is stored as a nested field.

    <field name="price.price" type="double" nestedPath="price" />
    <field name="price.original_price" type="double" nestedPath="price" />
    <field name="price.is_discount" type="boolean" nestedPath="price" />
    <field name="price.customer_group_id" type="integer" nestedPath="price" />

E.g.: category is a nested field with custom properties.

    <field name="category.category_id" type="integer" nestedPath="category" />
    <field name="category.position" type="integer" nestedPath="category" />
    <field name="category.is_parent" type="boolean" nestedPath="category" />
    <field name="category.name" type="string" nestedPath="category">
        <isSearchable>1</isSearchable>
        <isUsedInSpellcheck>1</isUsedInSpellcheck>
        <isFilterable>0</isFilterable>
    </field>
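
To visualise what these declarations mean for the indexed data, here is a sketch of the shape a product document could take. The values are made up, and having one price entry per customer group is an assumption suggested by the customer_group_id field; only the field names come from the mapping examples above.

```php
<?php
// Illustrative document only: values are made up, field names come from the mapping examples.
$document = [
    'entity_id' => 42,
    // Object field: a single sub-document, addressed as stock.is_in_stock and stock.qty.
    'stock' => ['is_in_stock' => true, 'qty' => 12],
    // Nested field: a list of sub-documents (assumed here: one per customer group).
    'price' => [
        ['price' => 19.9, 'original_price' => 24.9, 'is_discount' => true, 'customer_group_id' => 0],
        ['price' => 17.9, 'original_price' => 24.9, 'is_discount' => true, 'customer_group_id' => 1],
    ],
    // Nested field: one sub-document per category the product belongs to.
    'category' => [
        ['category_id' => 3, 'position' => 1, 'is_parent' => false, 'name' => 'Shoes'],
    ],
];

echo json_encode($document, JSON_PRETTY_PRINT), PHP_EOL;
```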

Dynamic fields provider

Defining the mapping directly in XML works well, but it is not really compatible with evolving data such as product attributes, which can easily be added or removed in the Back-Office. Their type can even be changed by users!

And, in fact, as you may have seen, the product attributes are not declared into our elasticsuite_indices.xml file. Guess why ?

You remember the previous part about the DataSource, right ?

If your datasource implements Smile\ElasticsuiteCore\Api\Index\Mapping\DynamicFieldProviderInterface, the engine will automatically detect it and call the getFields() method of your DataSource.

If you take a look at the getFields() and initField() methods in Smile\ElasticsuiteCatalog\Model\Eav\Indexer\Fulltext\Datasource\AbstractAttributeData, you will see that they automatically convert each attribute configuration (defined in the Magento Back-Office) into an array of \Smile\ElasticsuiteCore\Api\Index\Mapping\FieldInterface, according to each attribute's settings (is_filterable, is_searchable, search_weight and so on).

You may implement the same logic if you plan to index custom EAV content or extensible data that does not come with a strongly-typed, fixed structure.
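
The conversion described above can be sketched in a simplified, self-contained form. This is not the code of AbstractAttributeData: the attributesToFields function, the type map and the input array shape are assumptions, and plain arrays are used as a stand-in for FieldInterface objects. Only the setting names is_filterable, is_searchable and search_weight come from the text above.

```php
<?php
// Hypothetical mapping from attribute backend types to field types; adjust to your own data.
const BACKEND_TYPE_TO_FIELD_TYPE = [
    'int'      => 'integer',
    'decimal'  => 'double',
    'datetime' => 'date',
    'varchar'  => 'string',
    'text'     => 'string',
];

/**
 * Converts attribute settings into field definitions.
 * Plain arrays are used here as a stand-in for FieldInterface objects.
 */
function attributesToFields(array $attributes): array
{
    $fields = [];
    foreach ($attributes as $code => $settings) {
        $fields[$code] = [
            'name'         => $code,
            'type'         => BACKEND_TYPE_TO_FIELD_TYPE[$settings['backend_type']] ?? 'string',
            'isSearchable' => (bool) ($settings['is_searchable'] ?? false),
            'isFilterable' => (bool) ($settings['is_filterable'] ?? false),
            'searchWeight' => (int) ($settings['search_weight'] ?? 1),
        ];
    }

    return $fields;
}

$fields = attributesToFields([
    'color' => ['backend_type' => 'int', 'is_filterable' => 1],
    'brand' => ['backend_type' => 'varchar', 'is_searchable' => 1, 'search_weight' => 5],
]);
print_r($fields);
```

Because the field list is computed from the attribute settings at indexing time, adding or changing an attribute in the Back-Office needs no XML change.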

Custom analysis / filtering of fields

Analysis is the logic applied to field values when they are sent to Elasticsearch. It makes it possible to handle special characters, stem words to their root, and more.

Customising this requires solid knowledge of Elasticsearch (or Solr and Lucene) analyzers and filters.

You can read more about this topic in the Elasticsearch documentation.

The list of available analyzers delivered by Elasticsuite is in the elasticsuite_analysis.xml file of the ElasticsuiteCore module.

Since it's an xml file, it can be extended in your own modules to fit your needs.

The default list of analyzers and filters is quite enough to have the engine working properly on many languages and field types.

An analyzer is basically a combination of a tokenizer, char_filters and filters. Let's see the standard analyzer:

    <analyzer name="standard" tokenizer="whitespace" language="default">
        <filters>
            <filter ref="lowercase" />
            <filter ref="ascii_folding" />
            <filter ref="trim" />
            <filter ref="elision" />
            <filter ref="word_delimiter" />
            <filter ref="standard" />
        </filters>
        <char_filters>
            <char_filter ref="html_strip" />
        </char_filters>
    </analyzer>

In elasticsuite_indices.xml you can define the defaultSearchAnalyzer of a field. Remember the SKU example:

    <field name="sku" type="string">
        <isSearchable>1</isSearchable>
        <isUsedInSpellcheck>1</isUsedInSpellcheck>
        <defaultSearchAnalyzer>whitespace</defaultSearchAnalyzer>
    </field>

The default search analyzer for the SKU is whitespace: it allows exact matching on the SKU. With the standard analyzer, values mixing letters and numbers, or containing dashes (which is often the case with SKUs), were automatically split by Elasticsearch.
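
The difference can be simulated in plain PHP. The two functions below are rough approximations written for this example, not Elasticsearch's implementation: whitespaceTokens only splits on whitespace, while delimiterTokens mimics the lowercase and word_delimiter filters of the standard analyzer by lowercasing and breaking on non-alphanumeric characters and on letter/digit boundaries.

```php
<?php
// Rough approximation of whitespace tokenization: split on whitespace only.
function whitespaceTokens(string $value): array
{
    return preg_split('/\s+/', trim($value), -1, PREG_SPLIT_NO_EMPTY);
}

// Rough approximation of lowercase + word_delimiter-style splitting.
// This is NOT Elasticsearch's implementation, only an illustration of the effect.
function delimiterTokens(string $value): array
{
    $tokens = [];
    foreach (whitespaceTokens(strtolower($value)) as $word) {
        // Break on non-alphanumeric characters and on letter/digit boundaries.
        $parts = preg_split(
            '/[^a-z0-9]+|(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])/',
            $word,
            -1,
            PREG_SPLIT_NO_EMPTY
        );
        $tokens = array_merge($tokens, $parts);
    }

    return $tokens;
}

$sku = 'ABC-123X';
print_r(whitespaceTokens($sku)); // one token: the whole SKU
print_r(delimiterTokens($sku));  // three tokens: abc, 123, x
```

With the whitespace approach the SKU stays a single token, so a search only matches when the whole value is given. With the splitting approach, a search for "123" alone would also match this SKU.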

Going Further / Practicals

We already have a module for indexing CMS Pages, which is a good tutorial for learning how to index and query custom content in an external module.

This module is available here
