How to store Bag of Words or Embeddings in a Database

I would like to store vector features, such as Bag-of-Words or word-embedding vectors, for a large number of texts in a SQL database. What data structures and best practices are there for saving and retrieving these features?



Solution 1:[1]

Word vectors should generally be stored as BLOBs if possible. If not, they can be stored as JSON arrays. Since the only reasonable operation for word vectors is to look them up by the word key, the other details don't particularly matter.
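As a rough illustration of the BLOB approach, here is a minimal sketch using Python's built-in sqlite3 and array modules. The table name word_vectors, the column names, and the helpers put_vector and get_vector are made up for this example, and the 32-bit float packing is just one possible choice.

```python
import sqlite3
from array import array

conn = sqlite3.connect(":memory:")
conn.execute("create table word_vectors (word text primary key, vec blob)")

def put_vector(word, values):
    # Pack the floats as 32-bit values in the machine's native byte order.
    blob = array("f", values).tobytes()
    conn.execute(
        "insert or replace into word_vectors (word, vec) values (?, ?)",
        (word, blob),
    )

def get_vector(word):
    # Look the vector up by its word key and unpack the BLOB again.
    row = conn.execute(
        "select vec from word_vectors where word = ?", (word,)
    ).fetchone()
    if row is None:
        return None
    vec = array("f")
    vec.frombytes(row[0])
    return list(vec)

put_vector("cat", [0.5, -1.25, 2.0])
print(get_vector("cat"))
```

Because the bytes are written in native byte order, a file produced this way is only safely readable on machines with the same endianness, so a real setup would probably pin the byte order explicitly.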

For bag of words you would typically need three columns. This is what it would look like in SQLite:

create table bow (
  doc_id int,
  word text,
  count int)

Your document IDs would come from somewhere else. If you need to, you can make (doc_id, word) the primary key.
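To show how the three-column table might be filled and read back, here is a small sketch in Python with the built-in sqlite3 module. It assumes naive whitespace tokenization, and the helpers store_document and load_document are hypothetical names for this example.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table bow ("
    "  doc_id int,"
    "  word text,"
    "  count int,"
    "  primary key (doc_id, word))"
)

def store_document(doc_id, text):
    # Naive tokenization (lowercase, split on whitespace), for illustration only.
    counts = Counter(text.lower().split())
    conn.executemany(
        "insert into bow (doc_id, word, count) values (?, ?, ?)",
        [(doc_id, word, n) for word, n in counts.items()],
    )

def load_document(doc_id):
    # Fetch every (word, count) row of one document and rebuild the bag as a dict.
    rows = conn.execute(
        "select word, count from bow where doc_id = ?", (doc_id,)
    )
    return dict(rows)

store_document(1, "the cat sat on the mat")
store_document(2, "the dog barked")
print(load_document(1))
```

Note that load_document always pulls the whole bag for a document, which is the usual access pattern for these features.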

However, storing features like this in a SQL DB is generally not helpful. When you access word counts or word vectors you typically don't need a subset of them; you need them all at once, so the relational features of SQL add little.

Solution 2:[2]

There are databases that specialize in vector data for machine learning. Here is a list:

  1. Milvus https://milvus.io/
  2. Weaviate https://weaviate.io/
  3. AquilaDB https://docs.aquila.network
  4. Pinecone https://www.pinecone.io/

Solution 3:[3]

This depends on a number of factors, such as the precise SQL DB you intend to use and how you store the embedding. For instance, PostgreSQL lets you store, query and retrieve JSON values ( https://www.postgresqltutorial.com/postgresql-json/ ). Other options, such as SQLite, let you store string representations of JSON or pickled objects. That is fine for storage, but it makes querying the elements inside the vector awkward at best, and a pickled object is entirely opaque to SQL.
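For the "string representation of JSON" route, a minimal sketch with Python's built-in sqlite3 and json modules could look like this. The embeddings table and the helpers save_embedding and load_embedding are invented for the example; the vector is only decoded on the Python side, so the database itself never looks inside it.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table embeddings (doc_id integer primary key, vec text)")

def save_embedding(doc_id, vector):
    # Serialize the whole vector to a JSON string and store it as text.
    conn.execute(
        "insert or replace into embeddings (doc_id, vec) values (?, ?)",
        (doc_id, json.dumps(vector)),
    )

def load_embedding(doc_id):
    # Read the string back and decode it in Python.
    row = conn.execute(
        "select vec from embeddings where doc_id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

save_embedding(7, [0.1, 0.2, 0.3])
print(load_embedding(7))
```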

Solution 4:[4]

Milvus is an open-source vector database built to power embedding similarity search and AI applications.

https://github.com/milvus-io/milvus
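For readers unfamiliar with the term, "embedding similarity search" means finding the stored vectors closest to a query vector. The sketch below is a brute-force version in plain Python using cosine similarity. It is not Milvus code and does not use its API; the names cosine_similarity and nearest and the toy vectors are made up for illustration. A vector database exists largely to avoid this kind of linear scan on large collections.

```python
import math

def cosine_similarity(a, b):
    # Assumes both vectors are non-zero and have the same length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, k=1):
    # Score every stored vector against the query and keep the k best keys.
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]

vectors = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
print(nearest([0.9, 0.1], vectors, k=2))
```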

I am currently testing it.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 polm23
Solution 2
Solution 3 LukasP
Solution 4 coolflower