Search highlighting with ArangoSearch
You can retrieve the positions of matches within strings when querying ArangoSearch Views, to highlight what was found in search results
Search highlighting is only available in the Enterprise Edition.
ArangoSearch lets you search for terms and phrases in full-text, and more. It only returns matching documents, however. With search highlighting, you can get the exact locations of the matches.
A common use case is to emphasize the matching parts in client applications, for example, with a background color or an underline, so that users can easily see and understand the matches.
How to use search highlighting
To use search highlighting, you need to index the respective attributes with
Analyzers that have the offset feature enabled. The built-in text Analyzers
don’t have this feature enabled, you need to create custom Analyzers.
You can get the substring offsets of matches by calling the
OFFSET_INFO() function in
search queries. It takes the document emitted by the View (FOR doc IN viewName)
and a list of paths like "field.nested" or "array[0].field", defining for
what attributes or array elements you want to retrieve the offsets for. For
every path, it returns a list comprised of a name and offsets.
The name is the path of the value, but in a different form than you passed to
the function, like ["field", "nested"] or ["array", 0, "field"]. You can
look up the value with the VALUE() function
using this path description.
The offsets are a list of offset pairs, one for every match. Each pair is an
array with two numbers, with the start offset and length of the match. There can be
multiple matches per path. You can optionally cap how many matches are collected
per path by setting limits when calling the OFFSET_INFO() function.
The start offsets and lengths describe the positions in bytes, not characters. You may need to account for characters encoded using multiple bytes.
Term and phrase search with highlighting
Dataset:
A collection called food with the following documents:
{ "name": "avocado", "description": { "en": "The avocado is a medium-sized, evergreen tree, native to the Americas." } }
{ "name": "carrot", "description": { "en": "The carrot is a root vegetable, typically orange in color, native to Europe and Southwestern Asia." } }
{ "name": "chili pepper", "description": { "en": "Chili peppers are varieties of the berry-fruit of plants from the genus Capsicum, cultivated for their pungency." } }
{ "name": "tomato", "description": { "en": "The tomato is the edible berry of the tomato plant." } }
Custom Analyzer:
Create a text Analyzer in arangosh to tokenize text, like the built-in
text_en Analyzer, but additionally set the offset feature, enabling
search highlighting:
View definition:
{
"links": {
"food": {
"fields": {
"description": {
"fields": {
"en": {
"analyzers": [
"text_en_offset"
]
}
}
}
}
}
}
}
AQL queries
Search the View for descriptions that contain the tokens avocado or tomato,
the phrase cultivated ... pungency with two arbitrary tokens between the two
words, and for words that start with cap. Get the matching positions, and use
this information to extract the substrings with the
SUBSTRING_BYTES() function.
The OFFSET_INFO() function
returns a name that describes the path of the attribute or array element with
the match. You can use the VALUE() function
to dynamically get the respective value: