Current state (open source)
Built in Rust
- qdrant
- cozo
-
Built in Python
-
What is indexing in a vector database?
An index is a method for optimizing database retrieval. It is a data structure that stores the values of one or more columns from a database table to allow fast access to specific information, similar to a table of contents in a book. Common index types include B-tree, Hash, Spatial, and so on.
Building an index in a vector database is a complex process. It involves data preprocessing, dimensionality reduction, choosing an appropriate index structure and algorithm, and ongoing maintenance and optimization. The right indexing strategy depends on the specific application scenario, data characteristics, and performance requirements.
Algorithms
Distance between vectors
- Euclidean Distance — the straight-line distance between two points
- Cosine Similarity — measures the similarity in “direction” between two vectors
- Dot Product — the sum of the products of corresponding elements of two vectors. The result depends on the magnitudes of the vectors and the angle between them
- Manhattan Distance — the sum of absolute coordinate-wise distances between two points
- Jaccard Similarity — typically used to compare similarity and diversity between sample sets
- Hamming Distance — counts the number of positions at which two equal-length vectors have different values
- Pearson Correlation Coefficient — measures the degree of linear correlation between two vectors
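The metrics above can be sketched in plain Python with no external libraries; the function names here are illustrative, not taken from any particular database's API:

```python
import math

def euclidean(a, b):
    # straight-line distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute coordinate-wise differences
    return sum(abs(x - y) for x, y in zip(a, b))

def dot(a, b):
    # sum of products of corresponding elements
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # similarity in direction: dot product normalized by both magnitudes
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def hamming(a, b):
    # number of positions where the two vectors differ
    return sum(1 for x, y in zip(a, b) if x != y)

def jaccard(s, t):
    # set similarity: |intersection| / |union|
    return len(s & t) / len(s | t)

def pearson(a, b):
    # degree of linear correlation between two vectors
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)
```

Note that Euclidean, Manhattan, and Hamming are distances (smaller means closer), while cosine, Jaccard, and Pearson are similarities (larger means closer); a database must sort in the right direction for each.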
Search
- k-Nearest Neighbors (KNN)
  - KNN is the simplest and most intuitive algorithm, and can be combined with others.
- Locality-Sensitive Hashing (LSH)
  - LSH is a hashing technique used for approximate nearest neighbor search on large-scale datasets.
  - It maps vectors into a hash table so that similar vectors have a higher probability of landing in the same “bucket”.
- Tree-based search algorithms
  - Such as KD-trees (k-dimensional trees) and Ball Trees, which organize data into a tree structure to speed up search.
  - Performance may degrade in high-dimensional spaces.
- Approximate Nearest Neighbor (ANN) search
  - Such as Annoy (Approximate Nearest Neighbors Oh Yeah) and FAISS (Facebook AI Similarity Search), efficient algorithms and libraries designed specifically for approximate nearest neighbor search.
  - They sacrifice a bit of precision in exchange for high search speed.
- Vector Quantization
  - A method that partitions the vector space into a finite number of “clusters” and encodes those clusters.
  - It can be used for compression and fast search, especially when combined with libraries like FAISS.
- Graph-based search algorithms
  - Such as HNSW (Hierarchical Navigable Small World), graph-based algorithms that organize data efficiently to enable fast search.
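Exact KNN, the baseline all the approximate methods above are measured against, is just a scan plus a sort. A minimal sketch (the function name is illustrative):

```python
import math

def knn_search(query, vectors, k):
    # exact k-nearest-neighbor search: measure the distance from the query
    # to every stored vector, then keep the k closest
    return sorted(vectors, key=lambda v: math.dist(query, v))[:k]
```

This costs O(n) distance computations per query, which is precisely the cost that ANN methods such as LSH and HNSW trade a little accuracy to avoid.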
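One common LSH family for cosine similarity uses random hyperplanes: each plane contributes one bit of the hash, and vectors pointing in similar directions tend to land in the same bucket. A minimal sketch under that assumption (the function names are mine, not from any LSH library):

```python
import random

def make_lsh_hasher(dim, n_planes, seed=0):
    # draw n_planes random hyperplanes through the origin
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

    def hash_vector(v):
        # one bit per plane: which side of the hyperplane does v fall on?
        return tuple(
            int(sum(p * x for p, x in zip(plane, v)) >= 0.0)
            for plane in planes
        )

    return hash_vector
```

Vectors that hash to the same tuple go into the same bucket; at query time only the query's bucket (or a few nearby ones) is scanned instead of the whole dataset.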
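Vector quantization as described above can be illustrated by encoding each vector as the index of its nearest centroid; in practice the centroids come from a clustering step such as k-means, but here they are hard-coded for illustration:

```python
import math

def vq_encode(v, centroids):
    # replace the vector with the index of its closest centroid (its "code")
    return min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))

def vq_decode(code, centroids):
    # reconstruction is lossy: we get back the centroid, not the original vector
    return centroids[code]
```

Storing a small integer code instead of a full float vector is what gives the compression; search can then compare codes (or centroid distances) instead of raw vectors.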
Ref
Repo
- https://www.benchcouncil.org/evaluation/opencs/cn/annual.html
- https://stablecog.com/blog/the-best-vector-database-for-stablecogs-semantic-search
Algorithms
ChangeLog
- 20231123-init
- 20260501-translate by claude code