Inverted indexes
You can use inverted indexes to speed up a broad range of AQL queries, from simple to complex, including full-text search.
Documents hold attributes, mapping attribute keys to values. Inverted indexes store mappings from values (of document attributes) to their locations in collections. You can create these indexes to accelerate queries like value lookups, range queries, accent- and case-insensitive search, wildcard and fuzzy search, nested search, search highlighting, as well as for sophisticated full-text search with the ability to search for words, phrases, and more.
You can use inverted indexes stand-alone in FILTER operations of AQL queries,
or add them to search-alias Views
to search multiple collections at once and to rank search results by relevance.
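For instance, a named inverted index can be added to a search-alias View and queried with SEARCH, roughly like this (a minimal sketch; the collection, index, and View names are placeholders):
db.<collection>.ensureIndex({ name: "inv_idx", type: "inverted", fields: ["value1"] });
// Register the inverted index with a search-alias View, then search it and rank by relevance
db._createView("my_view", "search-alias", { indexes: [ { collection: "<collection>", index: "inv_idx" } ] });
db._query(`FOR doc IN my_view SEARCH doc.value1 == "foo" SORT BM25(doc) DESC RETURN doc`);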
Defining inverted indexes
Inverted indexes are defined per collection. You can add an arbitrary number of document attributes to each index. Every attribute can optionally be processed with an Analyzer, for instance, to tokenize text into words.
Basic definition
For example, you can create an inverted index for the attributes value1 and
value2 with the following command in arangosh:
db.<collection>.ensureIndex({
type: "inverted",
fields: ["value1", "value2"]
});
You should give inverted indexes easy-to-remember names because you need to
refer to them by name with indexHint in queries to
utilize them:
db.<collection>.ensureIndex({
name: "inverted_index_name",
type: "inverted",
fields: ["value1", "value2"]
});
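If you don't remember an index name, you can list a collection's index definitions, which include the names, for example:
db.<collection>.getIndexes();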
Processing fields with Analyzers
The fields are processed by the identity Analyzer by default, which indexes
the attribute values as-is. To define a different Analyzer, you can pass an
object with the field settings instead. For example, you can use the default
Analyzer for value1 and the built-in text_en Analyzer for value2.
You can also overwrite the features of an Analyzer:
db.<collection>.ensureIndex({
type: "inverted",
fields: [
"value1",
{ name: "value2", analyzer: "text_en", features: [ "frequency", "norm", "position", "offset" ] }
]
});
You can define a default analyzer and default features at the top level of the
index definition. In the following example, both fields are indexed with the
text_en Analyzer:
db.<collection>.ensureIndex({
type: "inverted",
fields: ["value1", "value2" ],
analyzer: "text_en"
});
If no features are defined, neither in fields nor at the top level, then the
features defined by the Analyzer itself are used.
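For example, a sketch that sets both a default Analyzer and default features at the top level (field-level settings, if given, take precedence):
db.<collection>.ensureIndex({
type: "inverted",
fields: ["value1", "value2"],
analyzer: "text_en",
features: ["frequency", "norm", "position"]
});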
Indexing sub-attributes
To index a sub-attribute, like { "attr": { "sub": "value" } }, use the
. character to describe the attribute path:
db.<collection>.ensureIndex({ type: "inverted", fields: ["attr.sub"] });
For SEARCH queries using a search-alias View, you can also index all
sub-attributes of an attribute with the includeAllFields option. It has no
effect on FILTER queries that use an inverted index directly, however. You can
enable the option for specific attributes in the fields definition, or for the
entire document by setting the option at the top-level of the index properties:
db.<collection>.ensureIndex({ type: "inverted", fields: [ { name: "attr", includeAllFields: true } ] });
db.<collection>.ensureIndex({ type: "inverted", includeAllFields: true });
With the includeAllFields option enabled at the top-level, the otherwise
mandatory fields property becomes optional.
The includeAllFields option only adds the attributes that are not already
specified separately in the fields definition, as well as their sub-attributes.
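As a sketch, the following definition indexes value1 explicitly with the text_en Analyzer, while includeAllFields picks up all remaining attributes and their sub-attributes with the default settings:
db.<collection>.ensureIndex({
type: "inverted",
fields: [ { name: "value1", analyzer: "text_en" } ],
includeAllFields: true
});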
Indexing array values
You can expand an attribute that has an array as its value so that the array
elements get indexed individually, by using [*] in the attribute path. For example, to
index the string values of a document like
{ "arr": [ { "name": "foo" }, { "name": "bar" } ] }, use arr[*].name:
db.<collection>.ensureIndex({ type: "inverted", fields: ["arr[*].name"] });
You can only expand one level of arrays.
If you want to use the inverted index in a search-alias View and index primitive
and array values like arangosearch Views do by default, then you can enable the
searchField option for specific attributes in the fields definition, or by
default using the top-level option with the same name. You may want to combine
it with the includeAllFields option to index sub-objects without explicit
definition:
db.<collection>.ensureIndex({
type: "inverted",
fields: [
{ name: "arr", searchField: true, includeAllFields: true }
]
});
db.<collection>.ensureIndex({
type: "inverted",
fields: [ "arr", "arr.name" ],
searchField: true
});
To index array values while preserving the array element positions for a
search-alias View, which you then also need to specify in queries (see the
query sketch after the following examples), enable the trackListPositions
option:
db.<collection>.ensureIndex({
type: "inverted",
fields: [
{ name: "arr", searchField: true, trackListPositions: true, includeAllFields: true }
]
});
db.<collection>.ensureIndex({
type: "inverted",
fields: [ "arr", "arr.name" ],
searchField: true,
trackListPositions: true
});
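Assuming one of these indexes is added to a search-alias View, the array position becomes part of the attribute path in SEARCH queries, roughly as sketched below (the View name and the index name are placeholders):
db._createView("arr_view", "search-alias", { indexes: [ { collection: "<collection>", index: "<index-name>" } ] });
// With trackListPositions enabled, the element position is part of the attribute path
db._query(`FOR doc IN arr_view SEARCH doc.arr[1].name == "bar" RETURN doc`);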
Storing additional values in indexes
Inverted indexes allow you to store additional attributes in the index that can be used to satisfy projections of the document. They cannot be used for index lookups, but for projections and sorting only. They allow inverted indexes to fully cover more queries and avoid extra document lookups. This can have a great positive effect on index scan performance if the number of scanned index entries is large.
You can set the storedValues option and specify the additional attributes as
an array of attribute paths when creating a new inverted index:
db.<collection>.ensureIndex({
type: "inverted",
fields: ["value1"],
storedValues: ["value2"]
});
This indexes the value1 attribute in the traditional sense, so that the index
can be used for looking up by value1 or for sorting by value1. The index also
supports projections on value1 as usual.
In addition, due to storedValues being used here, the index can now also
supply the values for the value2 attribute for projections without having to
look up the full document. Non-existing attributes are stored as null values.
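For example, assuming the index from above is given a name such as inv-stored (a hypothetical name), a query like the following can be answered from the index alone, because value2 is available via storedValues:
db._query(`FOR doc IN <collection> OPTIONS { indexHint: "inv-stored", forceIndexHint: true }
FILTER doc.value1 == "foo"
RETURN { value1: doc.value1, value2: doc.value2 }`);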
You cannot specify the same attribute path in both the fields and the
storedValues options. If there is an overlap, the index creation aborts
with an error message.
Additional configuration options
See the full list of options in the HTTP API documentation.
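For example, a sketch with a few of the options related to how the index is built and maintained (treat the linked documentation as authoritative for the available options and their defaults):
db.<collection>.ensureIndex({
name: "inv_tuned",
type: "inverted",
fields: ["value1"],
inBackground: true, // build the index without holding an exclusive write lock
commitIntervalMsec: 1000, // how often index changes are committed and become visible
consolidationIntervalMsec: 1000, // how often index segments are consolidated
cleanupIntervalStep: 2 // number of commits between clean-up of unused index files
});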
Restrictions
- You cannot index the same field twice in a single inverted index. This includes indexing the same field with and without array expansion (e.g. fields: ["arr", "arr[*]"]).
- Every field can only be processed by a single Analyzer per index. The benefit is that you don't need to specify the Analyzers in queries with the ANALYZER() function; they can be inferred from the index definition.
If you plan on using inverted indexes in search-alias Views, also consider the
following restrictions (a compatible setup is sketched after this list):
- All inverted indexes you add to a single search-alias View need to have matching primarySort and storedValues settings.
- If different inverted indexes index fields with the same name (attribute path), these fields need to have matching analyzer and searchField settings.
- Fields that are implicitly indexed by includeAllFields must also have the same analyzer and searchField settings as fields with the same attribute path indexed by other inverted indexes. You cannot combine inverted indexes in a search-alias View if these settings would be ambiguous for any field.
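For example, a sketch of two inverted indexes over different collections that can be combined in a single search-alias View, because the shared field uses the same Analyzer and neither index defines conflicting primarySort, storedValues, or searchField settings (all names are placeholders):
db.coll1.ensureIndex({ name: "inv1", type: "inverted", fields: [ { name: "text", analyzer: "text_en" } ] });
db.coll2.ensureIndex({ name: "inv2", type: "inverted", fields: [ { name: "text", analyzer: "text_en" } ] });
db._createView("combined_view", "search-alias", {
indexes: [
{ collection: "coll1", index: "inv1" },
{ collection: "coll2", index: "inv2" }
]
});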
Utilizing inverted indexes in queries
Unlike other index types in ArangoDB, inverted indexes are not utilized
automatically, even if they are eligible for a query. The reason is that AQL
queries may produce different results with and without an inverted index,
because of differences in how FILTER operations work for inverted indexes,
the eventually consistent nature of this index type, and for backward compatibility.
To use an inverted index, add an OPTIONS clause to the FOR operation that
you want to speed up. It needs to be a loop over a collection, not an array or
a View. Specify the desired index as an indexHint, using its name.
You should enable the forceIndexHint option to prevent unexpected query
results: index hints are only recommendations to the optimizer unless you force
them, and because FILTER operations behave differently if an inverted index is
used, the results would otherwise depend on whether or not the index hint is
applied. Forcing the hint ensures that an error is raised if the index isn't
suitable.
FOR doc IN coll OPTIONS { indexHint: "inverted_index_name", forceIndexHint: true }
FILTER doc.value == 42 AND PHRASE(doc.text, ["meaning", 1, "life"])
RETURN doc
Inverted indexes are eventually consistent. Document changes are not reflected instantly, but only in near real-time. There is an internal option to wait for the inverted index to update. This option is intended to be used in unit tests only:
FOR doc IN coll OPTIONS { indexHint: "inv-idx", forceIndexHint: true, waitForSync: true }
FILTER ...
Also see Dealing with eventual consistency.
Examples
The following examples demonstrate how you can set up and use inverted indexes
with the JavaScript API of arangosh. See the
ensureIndex() method
description for details.
For more in-depth explanations and example data, see the ArangoSearch example datasets and the corresponding ArangoSearch examples.
Exact value matching
Match exact movie title (case-sensitive, full string):
db.imdb_vertices.ensureIndex({ name: "inv-exact", type: "inverted", fields: [ "title" ] });
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-exact", forceIndexHint: true }
FILTER doc.title == "The Matrix"
RETURN doc.title`);
Match multiple exact movie titles using IN:
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-exact", forceIndexHint: true }
FILTER doc.title IN ["The Matrix", "The Matrix Reloaded"]
RETURN doc.title`);
Match movies whose title is neither The Matrix nor The Matrix Reloaded,
ignoring documents without a title:
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-exact", forceIndexHint: true }
FILTER EXISTS(doc.title) AND doc.title NOT IN ["The Matrix", "The Matrix Reloaded"]
RETURN doc.title`);
Range queries
Match movies with a runtime over 300 minutes and sort them from longest to
shortest runtime. You can give the index a primary sort order so that the SORT
operation of the query can be optimized away:
db.imdb_vertices.ensureIndex({
name: "inv-exact",
type: "inverted",
fields: [ "runtime" ],
primarySort: {
fields: [
{ field: "runtime", direction: "desc" }
]
}
});
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-exact", forceIndexHint: true }
FILTER doc.runtime > 300
SORT doc.runtime DESC
RETURN doc`);
Match movies where the name is >= Wu and < Y:
db.imdb_vertices.ensureIndex({ name: "inv-exact", type: "inverted", fields: [ "name" ] });
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-exact", forceIndexHint: true }
FILTER IN_RANGE(doc.name, "Wu", "Y", true, false)
RETURN doc`);
The alphabetical order of characters is not taken into account, i.e. range queries backed by inverted indexes do not follow the language rules as per the defined Analyzer locale (except for the collation Analyzer) nor the server language (startup option --default-language)! Also see Known Issues.
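If you need language-aware ordering for range queries, you can use a collation Analyzer instead, roughly like this (a sketch; the Analyzer and index names are examples):
var analyzers = require("@arangodb/analyzers");
analyzers.save("collation_en", "collation", { locale: "en" }, []);
db.imdb_vertices.ensureIndex({
name: "inv-collation",
type: "inverted",
fields: [ { name: "name", analyzer: "collation_en" } ]
});
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-collation", forceIndexHint: true }
FILTER IN_RANGE(doc.name, "Wu", "Y", true, false)
RETURN doc`);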
Case-insensitive search
Match movie titles, ignoring capitalization and using the base characters instead of accented characters (full string):
var analyzers = require("@arangodb/analyzers");
analyzers.save("norm_en", "norm", { locale: "en", accent: false, case: "lower" });
db.imdb_vertices.ensureIndex({
name: "inv-ci",
type: "inverted",
fields: [
{ name: "title", analyzer: "norm_en" }
]
});
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-ci", forceIndexHint: true }
FILTER doc.title == TOKENS("thé mäTRïX", "norm_en")[0]
RETURN doc.title`);
Prefix matching
Match movie titles that start with either "the matr" or "harry pot"
(case-insensitive and accent-insensitive, using a custom norm Analyzer),
utilizing the feature of the STARTS_WITH() function that allows you to pass
multiple possible prefixes as an array of strings, of which one must match:
var analyzers = require("@arangodb/analyzers");
analyzers.save("norm_en", "norm", { locale: "en", case: "lower", accent: false });
db.imdb_vertices.ensureIndex({
name: "inv-ci",
type: "inverted",
fields: [
{ name: "name", analyzer: "norm_en" }
]
});
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-ci", forceIndexHint: true }
FILTER STARTS_WITH(doc.title, ["the matr", "harry pot"])
RETURN doc.title`);
Wildcard search
Match all titles that start with the matr (case-insensitive) using LIKE(),
where _ stands for a single wildcard character and % for an arbitrary number of characters:
var analyzers = require("@arangodb/analyzers");
analyzers.save("norm_en", "norm", { locale: "en", accent: false, case: "lower" });
db.imdb_vertices.ensureIndex({
name: "inv-ci",
type: "inverted",
fields: [
{ name: "title", analyzer: "norm_en" }
]
});
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-ci", forceIndexHint: true }
FILTER ANALYZER(LIKE(doc.title, "the matr%"), "identity")
RETURN doc.title`);
Full-text token search
Search for movies with both dinosaur and park in their description:
db.imdb_vertices.ensureIndex({
name: "inv-text",
type: "inverted",
fields: [
{ name: "description", analyzer: "text_en" }
]
});
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-text", forceIndexHint: true }
FILTER TOKENS("dinosaur park", "text_en") ALL == doc.description
RETURN {
title: doc.title,
description: doc.description
}`);
Phrase and proximity search
Search for movies that have the (normalized and stemmed) tokens biggest and
blockbust in their description, in this order:
db.imdb_vertices.ensureIndex({
name: "inv-text",
type: "inverted",
fields: [
{ name: "description", analyzer: "text_en" }
]
});
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-text", forceIndexHint: true }
FILTER PHRASE(doc.description, "BIGGEST Blockbuster")
RETURN {
title: doc.title,
description: doc.description
}`);
Match movies that contain the phrase epic <something> film in their
description, where <something> can be exactly one arbitrary token:
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-text", forceIndexHint: true }
FILTER PHRASE(doc.description, "epic", 1, "film", "text_en")
RETURN {
title: doc.title,
description: doc.description
}`);
Fuzzy search
Search for the token galxy in the movie descriptions with some fuzziness.
The maximum allowed Levenshtein distance is set to 1. Everything with a
Levenshtein distance equal to or lower than this value will be a match and the
respective documents will be included in the search result. The query will find
the token galaxy as the edit distance to galxy is 1.
A custom text Analyzer with stemming disabled is used to improve the accuracy
of the Levenshtein distance calculation:
var analyzers = require("@arangodb/analyzers");
analyzers.save("text_en_no_stem", "text", { locale: "en", accent: false, case: "lower", stemming: false, stopwords: [] });
db.imdb_vertices.ensureIndex({
name: "inv-text",
type: "inverted",
fields: [
{ name: "description", analyzer: "text_en_no_stem" }
]
});
db._query(`FOR doc IN imdb_vertices OPTIONS { indexHint: "inv-text", forceIndexHint: true }
FILTER LEVENSHTEIN_MATCH(
doc.description,
TOKENS("galxy", "text_en_no_stem")[0],
1, // max distance
false // without transpositions
)
RETURN {
title: doc.title,
description: doc.description
}`);
Geo-spatial search
Using the Museum of Modern Art as reference location, find restaurants within a 100-meter radius. Return the matches sorted by distance and include how far away they are from the reference point in the result:
var analyzers = require("@arangodb/analyzers");
analyzers.save("geojson", "geojson", {}, ["frequency", "norm", "position"]);
db.restaurants.ensureIndex({
name: "inv-rest",
type: "inverted",
fields: [
{ name: "location", analyzer: "geojson" }
]
});
db._query(`LET moma = GEO_POINT(-73.983, 40.764)
FOR doc IN restaurants OPTIONS { indexHint: "inv-rest", forceIndexHint: true }
FILTER GEO_DISTANCE(doc.location, moma) < 100
LET distance = GEO_DISTANCE(doc.location, moma)
SORT distance
RETURN {
geometry: doc.location,
distance
}`);
Nested search (Enterprise Edition)
Match documents with a dimensions array that contains one or two sub-objects
with a type of "height" and a value greater than 40:
db.coll.ensureIndex({
name: "inv-nest",
type: "inverted",
fields: [
{
name: "dimensions",
nested: [
{ name: "type" }
{ name: "value" }
]
}
]
});
db._query(`FOR doc IN coll OPTIONS { indexHint: "inv-nest", forceIndexHint: true }
FILTER doc.dimensions[? 1..2 FILTER CURRENT.type == "height" AND CURRENT.value > 40]
RETURN doc
`);
Nested search is only available in the Enterprise Edition.
Search highlighting (Enterprise Edition)
Search for descriptions that contain the tokens avocado or tomato,
the phrase cultivated ... pungency with two arbitrary tokens between the two
words, and for words that start with cap. Get the matching positions, and use
this information to extract the substrings with the
SUBSTRING_BYTES() function:
db.food.ensureIndex({
name: "inv-text",
type: "inverted",
fields: [
{ name: "description", analyzer: "text_en", features: ["frequency", "norm", "position", "offset"] }
]
});
db._query(`FOR doc IN food OPTIONS { indexHint: "inv-text", forceIndexHint: true }
FILTER
TOKENS("avocado tomato", "text_en") ANY == doc.description.en OR
PHRASE(doc.description.en, "cultivated", 2, "pungency") OR
STARTS_WITH(doc.description.en, "cap")
RETURN {
title: doc.title,
matches: (
FOR offsetInfo IN OFFSET_INFO(doc, ["description.en"])
RETURN offsetInfo.offsets[* RETURN {
offset: CURRENT,
match: SUBSTRING_BYTES(VALUE(doc, offsetInfo.name), CURRENT[0], CURRENT[1])
}]
)
}
`);
Search highlighting is only available in the Enterprise Edition.