Vespa visitor indexing documents

I want to assign an ID to every document in a Vespa cluster.

But I don't completely understand how visitors work in Vespa.

Can I get a shared field (meaning shared by all instances of my visitor) that I can atomically increment (using some lock) every time I visit a document?

What I tried obviously doesn't work, but you'll see the general idea:

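import java.util.Iterator;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;

public class MyVisitor extends DocumentProcessor {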

    // where should i put this ? 
    private int document_id;

    private final Lock lock = new ReentrantLock();

    @Override
    public Progress process(Processing processing) {
        Iterator<DocumentOperation> it = processing.getDocumentOperations().iterator();
        while (it.hasNext()) {

            DocumentOperation op = it.next();
            if (op instanceof DocumentPut) {

                Document doc = ((DocumentPut) op).getDocument();
                /*
                 * Remove the PUT operation from the iterator so that it is not indexed back in
                 * the document cluster
                 */
                it.remove();

                // Acquire the lock before the try block so that unlock() runs
                // only if lock() actually succeeded.
                lock.lock();
                try {
                    document_id += 1;
                } finally {
                    lock.unlock();
                }
            }
        }
        return Progress.DONE;
    }
}

Another idea is to get the number of buckets and the ID of the bucket I'm currently dealing with, and to increment using this pattern:

document_id = bucket_id
document_id += bucket_count

which would work (if I can ensure my visitor operates on a single bucket at a time), but I don't know how to get this information from my visitor.
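To make the arithmetic concrete, here is a minimal sketch of the striding pattern in plain Java. It does not touch Vespa at all: the class name `StridedIdAllocator` and the way `bucketIndex` and `bucketCount` would be obtained are hypothetical, and it assumes bucket indexes run from 0 to bucketCount - 1.

```java
// Hypothetical helper: hands out bucketIndex, bucketIndex + bucketCount,
// bucketIndex + 2 * bucketCount, ... so that two allocators with different
// bucket indexes (and the same bucket count) never produce the same id.
class StridedIdAllocator {
    private final long stride;
    private long next;

    StridedIdAllocator(long bucketIndex, long bucketCount) {
        if (bucketCount <= 0 || bucketIndex < 0 || bucketIndex >= bucketCount) {
            throw new IllegalArgumentException("need 0 <= bucketIndex < bucketCount");
        }
        this.stride = bucketCount;
        this.next = bucketIndex;
    }

    // Not thread-safe: the idea is one allocator per bucket, used by one thread.
    long nextId() {
        long id = next;
        next += stride;
        return id;
    }
}
```

With three buckets, the allocator for bucket 1 yields 1, 4, 7 and the one for bucket 0 yields 0, 3, 6, so the sequences never collide.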



Solution 1:[1]

Document processors operate on incoming document writes, so they cannot be applied to the result of visiting (not without a bit more setup anyway).

What you can do instead is fetch all the documents by visiting them over HTTP/2 with the /document/v1 API: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#visit

Then use the same API to issue an update operation for each document to set the field: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#put

Since this is done by a single process, you can then have a document_id counter which assigns unique values.
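A rough sketch of that single-process loop, using only the JDK's `java.net.http` client, could look like the following. Several things here are assumptions rather than facts about your setup: the endpoint, the namespace `mynamespace`, the document type `mydoctype`, the field name `document_id`, document IDs without `n=` or `g=` modifiers, and the use of the `[id]` field set to return only IDs. The regex-based JSON extraction is a shortcut to keep the example dependency-free; a real client would use a proper JSON parser and check the HTTP status codes.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SequentialIdAssigner {
    // Hypothetical placeholders: adjust to your own application.
    static final String BASE = "http://localhost:8080/document/v1/mynamespace/mydoctype/docid";
    static final String FIELD = "document_id";

    // Naive patterns for the "id" and "continuation" members of a visit response.
    private static final Pattern ID = Pattern.compile("\"id\"\\s*:\\s*\"([^\"]+)\"");
    private static final Pattern CONTINUATION =
            Pattern.compile("\"continuation\"\\s*:\\s*\"([^\"]+)\"");

    // Only this one process touches the counter, so no lock is needed.
    private long nextId = 0;

    static String enc(String s) {
        // URLEncoder targets form encoding; turn '+' back into %20 for URLs.
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("+", "%20");
    }

    static String visitUrl(String continuation) {
        String url = BASE + "?wantedDocumentCount=100&fieldSet=" + enc("[id]");
        return continuation == null ? url : url + "&continuation=" + enc(continuation);
    }

    static String updateUrl(String fullId) {
        // Assumes ids of the form id:<namespace>:<doctype>::<user-specified part>.
        String userPart = fullId.substring(fullId.indexOf("::") + 2);
        return BASE + "/" + enc(userPart);
    }

    // Partial update body that assigns the next counter value to the field.
    String nextUpdateBody() {
        return "{\"fields\":{\"" + FIELD + "\":{\"assign\":" + (nextId++) + "}}}";
    }

    static List<String> extractIds(String json) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(json);
        while (m.find()) ids.add(m.group(1));
        return ids;
    }

    // Returns null when the response has no continuation token (visit finished).
    static String extractContinuation(String json) {
        Matcher m = CONTINUATION.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    void run() throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        String continuation = null;
        do {
            HttpRequest visit =
                    HttpRequest.newBuilder(URI.create(visitUrl(continuation))).GET().build();
            String page = client.send(visit, HttpResponse.BodyHandlers.ofString()).body();
            for (String id : extractIds(page)) {
                HttpRequest update = HttpRequest.newBuilder(URI.create(updateUrl(id)))
                        .header("Content-Type", "application/json")
                        .PUT(HttpRequest.BodyPublishers.ofString(nextUpdateBody()))
                        .build();
                client.send(update, HttpResponse.BodyHandlers.ofString());
            }
            continuation = extractContinuation(page);
        } while (continuation != null);
    }
}
```

Calling `new SequentialIdAssigner().run()` from a `main` method would walk through the whole corpus once. Documents fed while the loop is running may or may not be seen, so this is best run while feeding is paused.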

As an aside, a common trick to avoid that requirement is to generate a UUID for each document.
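For illustration, generating such an ID in Java needs nothing beyond `java.util.UUID`. The class and method names below are hypothetical. The name-based variant is an extra option rather than something required here: it derives the UUID from the Vespa document ID, so re-feeding the same document gives the same value.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class DocumentUuids {
    // Random (version 4) UUID: unique without any shared counter or lock.
    static String randomId() {
        return UUID.randomUUID().toString();
    }

    // Name-based (version 3) UUID: deterministic for a given document id string.
    static String stableId(String vespaDocumentId) {
        byte[] name = vespaDocumentId.getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(name).toString();
    }
}
```

Either value can be computed by the feeder before the document is written, or inside a document processor on the write path, since neither needs coordination between instances.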

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jon