Vespa visitor indexing documents
I want to assign an ID to every document in a Vespa cluster.
But I don't completely understand how visitors work in Vespa.
Can I have a shared field (meaning shared by all instances of my visitor) that I can atomically increment (using some lock) every time I visit a document?
What I tried obviously doesn't work, but you'll see the general idea:
public class MyVisitor extends DocumentProcessor {
    // where should I put this?
    private int document_id;
    private final Lock lock = new ReentrantLock();

    @Override
    public Progress process(Processing processing) {
        Iterator<DocumentOperation> it = processing.getDocumentOperations().iterator();
        while (it.hasNext()) {
            DocumentOperation op = it.next();
            if (op instanceof DocumentPut) {
                Document doc = ((DocumentPut) op).getDocument();
                /*
                 * Remove the PUT operation from the iterator so that it is not indexed back in
                 * the document cluster
                 */
                it.remove();
                try {
                    try {
                        lock.lock();
                        document_id += 1;
                    } finally {
                        lock.unlock();
                    }
                } catch (StatusRuntimeException | IllegalArgumentException e) {
                }
            }
        }
        return Progress.DONE;
    }
}
Another idea is to get the number of buckets and the ID of the bucket I'm currently dealing with, and to increment using this pattern:
document_id = bucket_id
document_id += bucket_count
which would work (if I can ensure my visitor operates on a single bucket at a time), but I don't know how to get this information from my visitor.
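To make the arithmetic of that pattern concrete, here is a minimal sketch in plain Java. It is not tied to any Vespa API: partitionIndex and partitionCount are hypothetical stand-ins for bucket_id and bucket_count, and the sketch assumes the partition count stays fixed for the whole run, which may not hold for real buckets.

```java
import java.util.ArrayList;
import java.util.List;

final class StridedIds {
    /**
     * IDs handed out by one partition: start at the partition index and
     * step by the total number of partitions, so no two partitions collide.
     * Both parameters are hypothetical stand-ins for bucket_id and bucket_count.
     */
    static List<Long> idsFor(long partitionIndex, long partitionCount, int documents) {
        List<Long> ids = new ArrayList<>();
        long id = partitionIndex;          // document_id = bucket_id
        for (int i = 0; i < documents; i++) {
            ids.add(id);
            id += partitionCount;          // document_id += bucket_count
        }
        return ids;
    }

    public static void main(String[] args) {
        // Three partitions, four documents each: the sequences interleave without overlap.
        for (long p = 0; p < 3; p++) {
            System.out.println("partition " + p + " -> " + idsFor(p, 3, 4));
        }
    }
}
```

The IDs are unique because each partition only ever produces values congruent to its own index modulo the partition count. They are not contiguous unless every partition holds the same number of documents.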
Solution 1:[1]
Document processors operate on incoming document writes, so they cannot be applied to the result of visiting (not without a bit more setup anyway).
What you can do instead is fetch all the documents over HTTP/2 with the visit operation of the /document/v1 API: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#visit
Then use the same API to issue an update operation for each document to set the field: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#put
Since this is done by a single process, you can then have a document_id counter which assigns unique values.
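As a rough sketch of the update half of that loop, the class below keeps one counter in a single process and builds a /document/v1 update request for each visited document ID. The endpoint and the field name document_id are assumptions, as is the class itself. It only handles IDs of the form id:<namespace>:<document-type>::<user-specified>. The visit loop (following the continuation token) and JSON parsing are left out, and nothing is sent over the network here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

final class SequentialIdAssigner {
    private final AtomicLong counter = new AtomicLong(); // single process, so one counter suffices
    private final String endpoint;   // assumption, e.g. "http://localhost:8080"
    private final String fieldName;  // assumption, e.g. "document_id"

    SequentialIdAssigner(String endpoint, String fieldName) {
        this.endpoint = endpoint;
        this.fieldName = fieldName;
    }

    /** Maps "id:<namespace>:<type>::<user-specified>" to its /document/v1 path. */
    URI updateUri(String documentId) {
        String[] parts = documentId.split(":", 5);
        if (parts.length != 5 || !parts[0].equals("id") || !parts[3].isEmpty()) {
            // IDs with key/value pairs (n=..., g=...) use other paths; not covered in this sketch.
            throw new IllegalArgumentException("Unsupported document id: " + documentId);
        }
        String user = URLEncoder.encode(parts[4], StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(endpoint + "/document/v1/" + parts[1] + "/" + parts[2] + "/docid/" + user);
    }

    /** Partial-update body assigning the next counter value to the field. */
    String nextUpdateBody() {
        return "{\"fields\":{\"" + fieldName + "\":{\"assign\":" + counter.incrementAndGet() + "}}}";
    }

    /** HTTP PUT is the update operation in /document/v1. */
    HttpRequest updateRequest(String documentId) {
        return HttpRequest.newBuilder(updateUri(documentId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(nextUpdateBody()))
                .build();
    }

    public static void main(String[] args) {
        SequentialIdAssigner assigner = new SequentialIdAssigner("http://localhost:8080", "document_id");
        // In a real run these IDs would come from the visit responses.
        for (String id : new String[] {"id:mynamespace:mytype::a", "id:mynamespace:mytype::b"}) {
            HttpRequest request = assigner.updateRequest(id);
            System.out.println(request.method() + " " + request.uri());
        }
    }
}
```

Each request could then be sent with java.net.http.HttpClient. Because only this one process hands out values, the counter needs no distributed coordination.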
As an aside, a common trick to avoid that requirement is to generate a UUID for each document.
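A minimal sketch of the UUID approach, using only the JDK. The class and method names are hypothetical. Each writer can generate its own value without coordinating with any other process.

```java
import java.util.UUID;

final class UuidIds {
    /** Returns a random (version 4) UUID as a 36-character string. */
    static String newDocumentId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Every call yields a fresh value; no shared counter or lock is needed.
        System.out.println(newDocumentId());
        System.out.println(newDocumentId());
    }
}
```

The trade-off is that the values are not sequential or compact, so this only fits if the ID just needs to be unique.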
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jon |
