Current state (open source)
Built in Rust
- qdrant
- cozo
-
Built in Python
-
What is indexing in a vector database?
An index is a method for optimizing database retrieval. It is a data structure that stores the values of one or more columns from a database table to allow fast access to specific information, similar to a table of contents in a book. Common index types include B-tree, Hash, Spatial, and so on.
Building an index in a vector database is a complex process. It involves data preprocessing, dimensionality reduction, choosing an appropriate index structure and algorithm, and ongoing maintenance and optimization. The right indexing strategy depends on the specific application scenario, data characteristics, and performance requirements.
Algorithms
Distance between vectors
- Euclidean Distance — the straight-line distance between two points
- Cosine Similarity — measures the similarity in “direction” between two vectors
- Dot Product — the sum of the products of corresponding elements of two vectors. The result depends on the magnitudes of the vectors and the angle between them
- Manhattan Distance — the sum of absolute coordinate-wise distances between two points
- Jaccard Similarity — typically used to compare similarity and diversity between sample sets
- Hamming Distance — counts the number of positions at which two equal-length vectors have different values
- Pearson Correlation Coefficient — measures the degree of linear correlation between two vectors
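The metrics above can be sketched in plain Python with no external libraries; the function names here are illustrative, not taken from any particular database's API:

```python
import math

def euclidean(a, b):
    # straight-line distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute coordinate-wise differences
    return sum(abs(x - y) for x, y in zip(a, b))

def dot(a, b):
    # sum of products of corresponding elements
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # similarity in direction: dot product normalized by both magnitudes
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def hamming(a, b):
    # number of positions where the two vectors differ
    return sum(1 for x, y in zip(a, b) if x != y)

def jaccard(s, t):
    # set similarity: |intersection| / |union|
    return len(s & t) / len(s | t)

def pearson(a, b):
    # degree of linear correlation between two vectors
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)
```

Note that Euclidean, Manhattan, and Hamming are distances (smaller means closer), while cosine, Jaccard, and Pearson are similarities (larger means closer); a database must sort in the right direction for each.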
Search
- k-Nearest Neighbors (KNN)
  - KNN is the simplest and most intuitive algorithm, and can be combined with others.
- Locality-Sensitive Hashing (LSH)
  - LSH is a hashing technique used for approximate nearest neighbor search on large-scale datasets.
  - It maps vectors into a hash table so that similar vectors have a higher probability of landing in the same “bucket”.
- Tree-based search algorithms
  - Such as KD-trees (k-dimensional trees) and Ball Trees, which organize data into a tree structure to speed up search.
  - Performance may degrade in high-dimensional spaces.
- Approximate Nearest Neighbor (ANN) search
  - Such as Annoy (Approximate Nearest Neighbors Oh Yeah) and FAISS (Facebook AI Similarity Search), efficient algorithms and libraries designed specifically for approximate nearest neighbor search.
  - They sacrifice a bit of precision in exchange for high search speed.
- Vector Quantization
  - A method that partitions the vector space into a finite number of “clusters” and encodes those clusters.
  - It can be used for compression and fast search, especially when combined with libraries like FAISS.
- Graph-based search algorithms
  - Such as HNSW (Hierarchical Navigable Small World), graph-based algorithms that organize data efficiently to enable fast search.
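Exact KNN, the baseline all the approximate methods above are measured against, is just a scan plus a sort. A minimal sketch (the function name is illustrative):

```python
import math

def knn_search(query, vectors, k):
    # exact k-nearest-neighbor search: measure the distance from the query
    # to every stored vector, then keep the k closest
    return sorted(vectors, key=lambda v: math.dist(query, v))[:k]
```

This costs O(n) distance computations per query, which is precisely the cost that ANN methods such as LSH and HNSW trade a little accuracy to avoid.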
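One common LSH family for cosine similarity uses random hyperplanes: each plane contributes one bit of the hash, and vectors pointing in similar directions tend to land in the same bucket. A minimal sketch under that assumption (the function names are mine, not from any LSH library):

```python
import random

def make_lsh_hasher(dim, n_planes, seed=0):
    # draw n_planes random hyperplanes through the origin
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

    def hash_vector(v):
        # one bit per plane: which side of the hyperplane does v fall on?
        return tuple(
            int(sum(p * x for p, x in zip(plane, v)) >= 0.0)
            for plane in planes
        )

    return hash_vector
```

Vectors that hash to the same tuple go into the same bucket; at query time only the query's bucket (or a few nearby ones) is scanned instead of the whole dataset.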
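Vector quantization as described above can be illustrated by encoding each vector as the index of its nearest centroid; in practice the centroids come from a clustering step such as k-means, but here they are hard-coded for illustration:

```python
import math

def vq_encode(v, centroids):
    # replace the vector with the index of its closest centroid (its "code")
    return min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))

def vq_decode(code, centroids):
    # reconstruction is lossy: we get back the centroid, not the original vector
    return centroids[code]
```

Storing a small integer code instead of a full float vector is what gives the compression; search can then compare codes (or centroid distances) instead of raw vectors.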
Ref
Repo
- https://www.benchcouncil.org/evaluation/opencs/cn/annual.html
- https://stablecog.com/blog/the-best-vector-database-for-stablecogs-semantic-search
Algorithms
ChangeLog
- 20231123-init
- 20260501-translate by claude code