Search architecture

Purpose: To outline thoughts about search and its architecture

How search works today

Currently, search is built on Elasticsearch (ES) and creates a query_string query to interact with ES. The returned data is an array of the ayahs that contain a match, along with each ayah's Quran content and, for each match, its content with highlighting and scoring. See Example

Where are the search files?

Search is contained in the search model directory. The Search::Query::Client is the starting point for search.

Creating a search

SearchController is currently the only consumer of search, but search is not limited to it. You can also create searches from your Rails console.

What happens when we construct a search? The new search was built to be very pluggable and dynamic. Constructing a search initializes the search object along with the queries and options it needs, and returns an instance of Search::Query::Client with a request method that makes the request. An initial request is made to ES to fetch the matches and aggregate them by ayah_key, which is a property on every index. When this request returns, the response goes to the Search::Results class, which digests it and hands it back to the client for pagination and for a second request that fetches the full data set. On the second request, the client searches for the same thing as in the initial request but limits it to the ayah_keys on the current page (so if the first request returned 50 ayah_keys, the second request fetches a set of 10, depending on the page number and page size, and returns the full data set for those ayah_keys). We do this for the following reasons:

If we asked for the full data set in a single request, the response would be very large and would take ES a very long time. Say a user searched for 'he': that would return thousands of hits and could take a few seconds. With no data requested, just the ayah_keys, ES can answer in milliseconds.
What if we don't aggregate and just return the full results? ES can also do the pagination for us, which is great. We considered this option extensively. It would work well if we wanted the best hit overall, but most users want the best-matching ayah, not the best-matching translation. Aggregating in the backend is also more costly than aggregating in ES.
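The two-request flow described above can be sketched as plain Ruby hashes for the request bodies. This is a minimal sketch, not the project's actual code: the method names, the `ayah_key` field name, the bucket limit, and the `bool`/`filter` form (which assumes a reasonably recent ES version) are all assumptions made for illustration. The ordering of the aggregated keys is left out.

```ruby
# Hypothetical helpers; names, field name and sizes are assumptions.

# Request 1: ask ES only for the matching ayah keys, no documents.
def ayah_keys_request(query, max_keys: 1000)
  {
    size: 0, # return no hits, only the aggregation buckets
    query: { query_string: { query: query } },
    aggs: {
      by_ayah_key: { terms: { field: "ayah_key", size: max_keys } }
    }
  }
end

# Pick the keys that belong to the requested page (pages are 1-based).
def page_of_keys(ayah_keys, page:, per_page: 10)
  return [] if page < 1
  ayah_keys[(page - 1) * per_page, per_page] || []
end

# Request 2: the same query, restricted to the keys on the current page.
def full_data_request(query, page_keys)
  {
    query: {
      bool: {
        must: { query_string: { query: query } },
        filter: { terms: { ayah_key: page_keys } }
      }
    }
  }
end
```

With 50 keys from the first request and a page size of 10, `page_of_keys(keys, page: 2)` selects keys 11 through 20, and only those are sent in the second request.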

After the full data set has been returned, we need to clean it up and fetch the corresponding records to return to the frontend. We fetch the corresponding records and their Quran content from the DB, then merge them with the matched data for each ayah.
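As a rough illustration of this merge step, the sketch below joins ES hits with database records by ayah key. The `AyahRecord` struct stands in for the real ActiveRecord model, and the hash keys (`:ayah_key`, `:score`, `:highlight`) and the choice of the best match score as the ayah's score are assumptions, not the project's actual shapes.

```ruby
# Stand-in for the database model (assumption for illustration only).
AyahRecord = Struct.new(:ayah_key, :text)

# es_hits: array of hashes such as { ayah_key:, score:, highlight: }
# records: AyahRecord objects fetched from the DB for the same keys
def merge_results(es_hits, records)
  records_by_key = records.map { |r| [r.ayah_key, r] }.to_h

  merged = es_hits.group_by { |hit| hit[:ayah_key] }.map do |key, hits|
    record = records_by_key[key]
    next if record.nil? # skip hits that have no matching DB record

    {
      ayah_key: key,
      quran_text: record.text,
      score: hits.map { |h| h[:score] }.max, # best match decides the ayah's score
      matches: hits.sort_by { |h| -h[:score] }
    }
  end

  merged.compact.sort_by { |ayah| -ayah[:score] }
end
```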

What's nice about the current search?

The code is clean and well organized. We previously had a giant controller
Models have specific tasks
Virtus gem helps with managing non-activerecord models
It's really fast
It's superior to the prior search
Easy to experiment with different kinds of search, hassle-free
query_string provides flexibility

Learnings thus far

From users:

People generally want to find an ayah they are looking for (eg. 'inna alazeena', where is it!)
- They generally search with the English translation ('oh you who believe'), the transliteration ('inna alazeena') or the Arabic ('إن الذين')
- Need good fuzzy search, English detection for transliteration -> transliteration + Arabic transliterated, Arabic search
Search tweaks can get out of control, so tests could help, but it's hard to benchmark a good set of results (I generally test against 'inna alatheena' because I want them both beside each other)
Test search to try and break it! (query_string caused ES to go down many times. We need to test different variations of things)
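One common way user input breaks a query_string query is through unescaped operator characters. The sketch below escapes the reserved characters listed in the Elasticsearch query_string documentation with a backslash and drops `<` and `>`, which cannot be escaped. The method and constant names are hypothetical, and the sketch does not handle the uppercase operators AND, OR and NOT.

```ruby
# Reserved query_string characters that can be escaped with a backslash.
RESERVED_CHARS = /[+\-=&|!(){}\[\]^"~*?:\\\/]/

# Hypothetical helper: make raw user input safe to embed in a query_string query.
def escape_query_string(input)
  input.to_s
       .delete("<>")                          # < and > cannot be escaped, so remove them
       .gsub(RESERVED_CHARS) { |c| "\\#{c}" } # prefix each reserved character with a backslash
end
```

For example, `escape_query_string('2:255')` returns `2\:255`, while plain input such as 'inna alazeena' or 'إن الذين' passes through unchanged. Inputs like an unbalanced parenthesis or a lone quote are good candidates for the "try to break it" tests.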

Where search should go

When I search 'inna alazeena' it should do an OR search for 'inna alazeena' and be smart enough to know it's not English (and not another language), then transliterate it to 'إن الذين'
When I search 'inna alazeena' it should do a fuzzy match against 'inna alatheena' (which is the actual spelling) or a phonetic search (implemented!)
When I search 'oh you who believe' it should do a (oh|you|who|believe) and (oh you who believe) search
Its own service - dissociate it from the backend
Autocomplete suggestions (I started experimenting with it. We need to get on that)
Factor in frequency, density, proximity to each other, and proximity to the beginning of the ayah (it seems these are not factored in today)
- frequency, i.e. if 'allah light' matches 'allah' once, and 'light' twice in the same result, then that result needs a higher score than matching only 'allah' once and 'light' once
- density, i.e. if 'allah light' matches an ayah which is only 5 tokens long, e.g. 'allah word_a light word_b word_c', then this has a higher density than a match against a result which is 300 words long, and should accordingly have a higher score
- proximity to each other, i.e. 'allah light' matching 'allah word light word word word' gets a better score than a match against 'allah word word word word word word light'
- proximity to the beginning of the ayah, i.e. if 'allah light' matches a translation which is 'allah is the light of word word word word word word' then this should have a higher score than 'word word word word word word word allah word word word word light'
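To make the four signals above concrete, here is a small sketch that computes them for a query and a piece of text. It only extracts the raw features and does not combine them into a score; the method names, the whitespace tokenization and the definition of proximity as the smallest window containing all matched terms are assumptions for illustration, not how ES scores results.

```ruby
# Width (in tokens) of the smallest window containing every query term
# that occurs in the text at least once.
def smallest_span(tokens, terms)
  present = terms.select { |t| tokens.include?(t) }
  return nil if present.empty?

  best = nil
  tokens.each_index do |start|
    next unless present.include?(tokens[start])
    seen = {}
    (start...tokens.size).each do |i|
      seen[tokens[i]] = true if present.include?(tokens[i])
      next unless seen.size == present.size
      width = i - start + 1
      best = width if best.nil? || width < best
      break
    end
  end
  best
end

# Raw relevance signals for one query/text pair (nil when nothing matches).
def match_features(query, text)
  terms  = query.downcase.split.uniq
  tokens = text.downcase.split
  hits   = tokens.each_index.select { |i| terms.include?(tokens[i]) }
  return nil if hits.empty?

  {
    frequency: hits.size,                   # total number of matched tokens
    density: hits.size.to_f / tokens.size,  # matched tokens relative to text length
    span: smallest_span(tokens, terms),     # smaller means terms are closer together
    first_position: hits.first              # smaller means closer to the beginning
  }
end
```

Using the examples from the list, 'allah word light word word word' yields a span of 3 and 'allah word word word word word word light' a span of 8, so the first would rank higher on proximity.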

Optimization TODO NOTES

normalize western languages (stemming, etc.)
normalize arabic using techniques to-be-determined involving root, stem, lemma
improving relevance:
- this document: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/relevance-intro.html
- in combination with a rails console inspection of:
matched_children = ( OpenStruct.new Quran::Ayah.matched_children( query, config[:types], array_of_ayah_keys ) ).responses
