Conditionally Inserting and Modifying Documents
A common requirement when ingesting data is to ensure that certain documents exist in a collection. Oftentimes when running a command it is unclear whether the target documents are already present in the collection or need to be inserted first.
Unconditional INSERT operations will not work here, because they may run
into errors if the target documents already exist. This will trigger a
“unique constraint violation” error. Unconditional UPDATE or REPLACE
operations will also fail, because they require that the target documents are
already present. If this is not the case, the operations would run into
“document not found” errors.
So what needs to be run instead are conditional inserts/updates/replaces, also called upserts or repserts. The behavior of such operations is:
- Check if a document exists, based on some criteria
- If it does not exist, create the document
- If it exists, update or replace it with a new version
ArangoDB provides the following options in AQL to achieve this:
UPSERTAQL operationINSERTAQL operation withoverwriteMode- Insert operation not using AQL, but the Document REST API
These alternatives have different capabilities and performance characteristics.
UPSERT AQL Operation
Let us start with the UPSERT AQL operation,
which is very generic and flexible.
The purpose of the UPSERT AQL operation is to ensure that a specific document
exists after the operation has finished.
UPSERT will look for a specific document, based on user-configurable
attributes/values, and create the document if it does not yet exist.
If UPSERT finds such document, it can partially adjust it (UPDATE) or fully
replace it (REPLACE).
To recap, the syntaxes of AQL UPSERT are, depending on whether you want to
update replace a document:
UPSERT <search-expression>
INSERT <insert-expression>
UPDATE <update-expression>
IN <collection> OPTIONS <options>
or
UPSERT <search-expression>
INSERT <insert-expression>
REPLACE <replace-expression>
IN <collection> OPTIONS <options>
The OPTIONS part is optional.
An example UPSERT operation looks like this:
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted" }
UPDATE { status: "updated" }
IN pages
This will look for a document in the pages collection with the page
attribute having a value of index.html. If such document cannot be found, the
INSERT part will be executed, which will create a document with the page and
status attributes. If the operation finds an existing document with page
being index.html, it will execute the UPDATE part, which will set the
document’s status attribute to updated.
Tracking Modification Dates
The UPSERT AQL operation is sometimes used in combination with
date/time-keeping. For example, the following query keeps track of when a
document was first created, and when it was last updated:
UPSERT { page: "index.html" }
INSERT { page: "index.html", created: DATE_NOW() }
UPDATE { updated: DATE_NOW() }
IN pages
OLD variable
The UPSERT AQL operation also provides a pseudo-variable named OLD to refer
to the existing document and its values in the UPDATE/REPLACE part.
Following is an example that increments a counter on a document whenever the
UPSERT operation is executed:
UPSERT { page: "index.html" }
INSERT { page: "index.html", hits: 1 }
UPDATE { hits: OLD.value + 1 }
IN pages
UPSERT Caveats
UPSERT is a very flexible operation, so some things should be kept in mind to
use it effectively and efficiently.
Repeat the Search Attributes
First of all, the INSERT part of an UPSERT operation should contain all
attributes that are used in the search expression. Consider the following
counter-example:
UPSERT { page: "index.html" }
INSERT { status: "inserted" } /* page attribute missing here! */
UPDATE { status: "updated" }
IN pages
Forgetting to specify the search attributes in the INSERT part introduces a
problem: The first time the UPSERT is executed and does not find a document
with page being index.html, it will branch into the INSERT part as
expected. However, the INSERT part will create a document with only the
status attribute set. The page attribute is missing here, so when the
INSERT completes, there is still no document with page being index.html.
That means whenever this UPSERT statement executes, it will branch into the
INSERT part, and the UPDATE part will never be reached. This is likely
unintentional.
The problem can easily be avoided by adding the search attributes to the
INSERT part:
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted" }
UPDATE { status: "updated" }
IN pages
Note that it is not necessary to repeat the search attributes in the UPDATE
part, because UPDATE is a partial update. It will only set the attributes that
are specified in the UPDATE part, and leave all other existing attributes
alone. However, it is necessary to repeat the search attributes in the REPLACE
part, because REPLACE will completely overwrite the existing document with
what is specified in the REPLACE part.
That means when using the REPLACE operation, the query should look like:
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted" }
REPLACE { page: "index.html", status: "updated" }
IN pages
Use Indexes for Search Attributes
A downside of UPSERT’s flexibility is that it can be used on arbitrary
collection attributes, even if those are not indexed.
When the UPSERT looks for an existing document, it will use an index if an
index exists, but will also continue if no index exists. In the latter case,
the UPSERT will execute a full collection scan, which can be expensive for
large collections. So it is advised to create an index on the search
attribute(s) used in an UPSERT.
UPSERT is Non-Atomic
The overall UPSERT operation does not execute atomically for a single document.
It is basically a document lookup followed by either a document insert, update
or replace operation.
That means if multiple UPSERT operations run concurrently with the same search
values, they may all determine that the target document does not exist - and
then all decide to create such document. That will mean one will end up with
multiple instances of the target document afterwards.
To avoid such concurrency issues, a unique index can be created on the search
attribute(s). Such index will prevent concurrent UPSERT operations from
creating identical documents. Instead, only one of the concurrent UPSERTs will
succeed, whereas the others will fail with a “unique constraint violated” error.
In that case the client application can either retry the operation (which then
should go into the UPDATE/REPLACE branch), or ignore the error if the goal
was only to ensure the target document exists.
Using a unique index on the search attribute(s) will thus improve lookup performance and avoid duplicates.
Using Shard Key(s) for Lookups
In a cluster setup, the search expression should contain the shard key(s), as this allows the lookup to be sent to a single shard only. This will be more efficient than having to execute the lookup on all the shards of the collection.
Another benefit of using the shard key(s) in the search expression is that unique indexes are only supported if they contain the shard key(s).
INSERT AQL Operation with overwriteMode
While the UPSERT AQL operation is very powerful and flexible, it is often not
the ideal choice for high-volume ingestion.
A much more efficient alternative to the UPSERT AQL operation is the
INSERT AQL operation with the overwriteMode
attribute set. This operation is not a drop-in replacement for UPSERT, but
rather a fast alternative in case the document key (_key attribute) is known
when the operation is executed, and none of the old values need to be referenced.
The general syntax of the INSERT AQL operation is:
INSERT <insert-expression>
IN <collection> OPTIONS <options>
As we will deal with the overwriteMode option here, we are focussing on
INSERT operations with this option set, for example:
INSERT { _key: "index.html", status: "created" }
IN pages OPTIONS { overwriteMode: "ignore" }
Regardless of the selected overwriteMode, the INSERT operation will insert
the document if no document exists in the collection with the specified _key.
In this aspect it behaves as a regular INSERT operation.
However, if a document with the specified _key already exists in the
collection, the INSERT behavior will be as follows, depending on the selected
overwriteMode:
conflict(default): if a document with the specified_keyexists, return a “unique constraint violation”ignore: if a document with the specified_keyexists, do nothing. Especially do not report a “unique constraint violation” error.update: if a document with the specified_keyexists, (partially) update the document with the attributes specified.replace: if a document with the specified_keyexists, fully replace the document with the attributes specified.
If no overwriteMode is specified, the behavior of an INSERT operation is as
if the overwriteMode was set to conflict.
The benefit of using INSERT with overwriteMode set to ignore, update or
replace is that the INSERT operation is going to be very fast, especially in
comparison with the UPSERT operation. In addition, INSERT will do a lookup
using the _key attribute, which is always indexed. So it will always use the
primary index and never do full collection scans. It also does not require
setting up additional indexes, because the primary index is automatically
present for all collections.
There are also a few caveats when working with INSERT AQL operations:
-
They can only be used when the value of the
_keyattribute is known at the time of insert. That means the client application must be able to provide the document keys in a deterministic way. -
The values that can be used for the
_keyattribute have some character and length restrictions, but alphanumeric keys work well. -
In a cluster setup, the underlying collection must be sharded by
_key. This is the default shard key, however. -
There is no access to the data of an existing document for arbitrary calculations when going into the
updateorreplacemode.
Please note that even though the INSERT AQL operation cannot refer to existing
documents to calculate values for updating/replacing, it can still return the
previous version of the document in case the document is already present.
This can be achieved by appending a RETURN OLD to the INSERT operation,
e.g.
INSERT { _key: "index.html", status: "created" }
IN pages OPTIONS { overwriteMode: "replace" }
RETURN OLD
It is also possible to return the new version of the document (the inserted
document if no previous document existed, or the updated/replaced version in
case a document already existed) by using RETURN NEW:
INSERT { _key: "index.html", status: "created" }
IN pages OPTIONS { overwriteMode: "replace" }
RETURN NEW
Insert Operation not Using AQL
There is the option to execute an insert operation with overwriteMode outside
of AQL. The POST /_api/document/{collection}
endpoint is a dedicated REST API for insert operations, which can handle one
document, or multiple documents at once.
Conceptually this API behaves like the INSERT AQL operation, but it can be
called with a batch of documents at once. This is the most efficient solution,
and should be preferred if possible.
Most ArangoDB drivers also provide a means to insert multiple documents at once, which will internally call this same REST API.
The REST API provides the returnOld and returnNew options to make it return
the previous versions of documents or the insert/updated/replaced documents, in
the same way as the INSERT AQL operation can do.
Summary
The UPSERT AQL operation is the most flexible way to conditionally insert or
update/replace documents in ArangoDB, but it is also the least efficient variant.
The INSERT AQL operation with the overwriteMode set will outperform
UPSERT, but it can only be used for some use cases.
Using the dedicated REST API for document inserts will be even more efficient, and is thus the preferred option for bulk document inserts.