How can I retrieve term vectors in Elasticsearch programmatically?

I haven't found any example of how to set up an ES index with term vectors and to retrieve them later programmatically in Java by document ID.

The JSON variant described here works: https://www.elastic.co/guide/en/elasticsearch/reference/2.2/docs-termvectors.html

Can anyone give a Java "translation" for this?

Currently, I create the index like so:

CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate(indexName);
createIndexRequestBuilder.execute().actionGet(); 

And add a document like this:

XContentBuilder sourceBuilder;
sourceBuilder = XContentFactory.jsonBuilder().startObject()
                .field("text", text)
                .field("type", "testType")
                .endObject(); // close the object opened by startObject()
IndexRequest request = new IndexRequest(indexName, esContentType).source(sourceBuilder);
client.index(request);

This is how I can fetch a document again:

GetResponse response = client.prepareGet(indexName, esContentType, id).execute().actionGet();


Solution 1:[1]

OK, I finally figured out what I was looking for (this link was also quite helpful). Since it may help others, I am sharing it here:

Create your index like so:

CreateIndexRequestBuilder createIndexRequestBuilder = client.admin().indices().prepareCreate("indexName");
createIndexRequestBuilder.execute().actionGet(); 

try {
    client.admin().indices().preparePutMapping("indexName").setType("docType")
        .setSource(XContentFactory.jsonBuilder().prettyPrint()
        .startObject()
            .startObject("docType")
            .startObject("properties")
                .startObject("text").field("type", "string").field("index", "not_analyzed").field("term_vector", "yes").endObject()
            .endObject()
            .endObject()
        .endObject())
    .execute().actionGet();
} catch (IOException e) ...

And here is how you can get back the term vectors from ES:

TermVectorsResponse resp = client.prepareTermVectors().setIndex("indexName")
                          .setType("docType").setId("docId").execute().actionGet();

XContentBuilder builder;
try {
    builder = XContentFactory.jsonBuilder().startObject();
    resp.toXContent(builder, ToXContent.EMPTY_PARAMS);
    builder.endObject();
    System.out.println(builder.string());
} catch (IOException e) ...

This works for me so far, but if anyone has another or a better solution, please feel free to share.
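
For comparison with the REST variant linked in the question, here is a small plain-Java sketch (no Elasticsearch dependency) that builds the request path the Java call above corresponds to, following the /{index}/{type}/{id}/_termvectors form from the 2.x docs. The class and method names are hypothetical, and it assumes every value is already URL-safe, so treat it as an illustration rather than a client.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class TermVectorsPathSketch {

    // Builds "/{index}/{type}/{id}/_termvectors" plus an optional query string.
    // Assumption: all values are already URL-safe; no escaping is done here.
    static String termVectorsPath(String index, String type, String id,
                                  Map<String, String> params) {
        String path = "/" + index + "/" + type + "/" + id + "/_termvectors";
        if (params.isEmpty()) {
            return path;
        }
        StringJoiner query = new StringJoiner("&", "?", "");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return path + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("fields", "text");
        params.put("term_statistics", "true");
        System.out.println(termVectorsPath("indexName", "docType", "docId", params));
    }
}
```

The index, type and id here are the same placeholder values used in the Java snippets above.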

Solution 2:[2]

To get the terms, we parse the TermVectorsResponse as follows:

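import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.action.termvectors.TermVectorsResponse;

...

// Collects every term of every field in the response. getFields() and the
// Lucene iterators can throw IOException, so it is declared here.
public List<String> getTerms(TermVectorsResponse resp) throws IOException {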

    List<String> termStrings = new ArrayList<>();
    Fields fields = resp.getFields();
    Iterator<String> iterator = fields.iterator();
    while (iterator.hasNext()) {
        String field = iterator.next();
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator();
        while(termsEnum.next() != null){
            BytesRef term = termsEnum.term();
            if (term != null) {
                termStrings.add(term.utf8ToString());
            }
        }
    }
    return termStrings;
}

The TermsEnum object provides further methods that return aggregated values for the current term. If you need per-document values (such as the frequency of a term in each document), you can probably use termsEnum.postings(...) to retrieve them.

We use Elasticsearch 2.3 with Lucene 5.5.0.
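
To make concrete what the values behind a term vector represent, here is a minimal plain-Java sketch with no Elasticsearch or Lucene dependency. It is not how Lucene computes term vectors: it assumes naive lowercasing and whitespace tokenization, and the names TermVectorSketch, termPositions and termFreq are hypothetical. It only illustrates the per-document data involved: each term, its frequency, and its positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class TermVectorSketch {

    // Maps each term to the token positions at which it occurs in the text.
    // Assumption: lowercase + whitespace split stands in for a real analyzer.
    static Map<String, List<Integer>> termPositions(String text) {
        Map<String, List<Integer>> vector = new TreeMap<>(); // terms kept sorted
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return vector;
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            vector.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return vector;
    }

    // Term frequency within this one document: the number of recorded positions.
    static int termFreq(Map<String, List<Integer>> vector, String term) {
        List<Integer> positions = vector.get(term);
        return positions == null ? 0 : positions.size();
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> vector = termPositions("to be or not to be");
        for (Map.Entry<String, List<Integer>> e : vector.entrySet()) {
            System.out.println(e.getKey() + " freq=" + e.getValue().size()
                    + " positions=" + e.getValue());
        }
    }
}
```

With a field mapped as not_analyzed, as in Solution 1, the whole field value is indexed as a single term, so a term vector for it would contain just that one entry rather than one per word.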

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Community
Solution 2