37 Things I Learned About Information Retrieval in Two Years at a Vector Database Company
Reflections on what I’ve learned about information retrieval in the last two years working at Weaviate

Published
July 3, 2025

Today I’m celebrating my two-year work anniversary at Weaviate, a vector database company. To celebrate, I want to reflect on what I’ve learned about vector databases and search during this time. Here are some of the things I’ve learned and some common misconceptions I see:

BM25 is a strong baseline for search. Ha! You thought I would start with something about vector search, and here I am talking about keyword search. And that is exactly the first lesson: Start with something simple like BM25 before you move on to more complex things like vector search.

Vector search in vector databases is approximate and not exact. In theory, you could run a brute-force search to compute distances between a query vector and every vector in the database using exact k-nearest neighbors (KNN). But this doesn’t scale well. That’s why vector databases use Approximate Nearest Neighbor (ANN) algorithms, like HNSW, IVF, or ScaNN, to speed up search while trading off a small amount of accuracy. Vector indexing is what makes vector databases so fast at scale.
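
To make the trade-off concrete, here is a minimal brute-force KNN sketch in NumPy. This exact O(n) scan over every stored vector is precisely what ANN indexes like HNSW avoid (the sizes and data are made up):

```python
# Exact (brute-force) k-nearest-neighbor search: score the query against
# every vector in the database, then keep the k most similar ones.
import numpy as np

def exact_knn(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    # Cosine similarity; assumes no zero vectors.
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    top_k = np.argsort(-sims)[:k]  # indices of the k most similar vectors
    return top_k, sims[top_k]

vectors = np.random.rand(10_000, 768)  # toy "database" of 10k embeddings
query = np.random.rand(768)
indices, scores = exact_knn(query, vectors, k=3)
```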

Vector databases don’t only store embeddings. They also store the original object (e.g., the text from which you generated the vector embeddings) and metadata. This allows them to support other features beyond vector search, like metadata filtering and keyword and hybrid search.

Vector databases’ main application is not in generative AI. It’s in search. But finding relevant context for LLMs is ‘search’. That’s why vector databases and LLMs go together like cookies and cream.

You have to specify how many results you want to retrieve. When I think back, I almost have to laugh because this was such a big “aha” moment when I realized that you need to define the maximum number of results you want to retrieve. It’s a little oversimplified, but without a limit or top_k parameter, vector search would return all the objects stored in the database, sorted by their distance to your query vector.

There are many different types of embeddings. When you think of a vector embedding, you probably visualize something like [-0.9837, 0.1044, 0.0090, …, -0.2049]. That’s called a dense vector, and it is the most commonly used type of vector embedding. But there are also many other types of vectors, such as sparse ([0, 2, 0, …, 1]), binary ([0, 1, 1, …, 0]), and multi-vector embeddings ([[-0.9837, …, -0.2049], [0.1044, …, 0.0090], …, [-0.0937, …, 0.5044]]), which can be used for different purposes.

Fantastic embedding models and where to find them. The first place to go is the Massive Text Embedding Benchmark (MTEB) leaderboard (https://huggingface.co/spaces/mteb/leaderboard). It covers a wide range of different tasks for embedding models, including classification, clustering, and retrieval. If you’re focused on information retrieval, you might want to check out BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.

The majority of embedding models on MTEB are English. If you’re working with multilingual or non-English languages, it might be worth checking out MMTEB (Massive Multilingual Text Embedding Benchmark).

A little history on vector embeddings: Before there were today’s contextual embeddings (e.g., BERT), there were static embeddings (e.g., Word2Vec, GloVe). They are static because each word has a fixed representation, while contextual embeddings generate different representations for the same word based on the surrounding context. Although today’s contextual embeddings are much more expressive, static embeddings can be helpful in computationally constrained environments because they can be looked up from pre-computed tables.

Don’t confuse sparse vectors and sparse embeddings. It took me a while to understand that sparse vectors can be generated in different ways: either by applying statistical scoring functions like TF-IDF or BM25 to term frequencies (often retrieved via inverted indexes), or with neural sparse embedding models like SPLADE. That means a sparse embedding is a sparse vector, but not all sparse vectors are necessarily sparse embeddings.
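
As a toy illustration (with a made-up five-term vocabulary), here is a statistical sparse vector built from raw term frequencies. Real systems weight the terms with TF-IDF or BM25 over an inverted index, and models like SPLADE learn the weights instead:

```python
# A "statistical" sparse vector: one slot per vocabulary term, most of them zero.
from collections import Counter

vocabulary = ["faucet", "fix", "kitchen", "buy", "skirt"]  # assumed toy vocab
doc = "how to fix a kitchen faucet fix"

tf = Counter(doc.split())  # raw term frequencies
sparse_vector = [tf.get(term, 0) for term in vocabulary]
print(sparse_vector)  # [1, 2, 1, 0, 0], mostly zeros, hence "sparse"
```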

Embed all the things. Embeddings aren’t just for text. You can embed images, PDFs as images (see ColPali), graphs, etc. And that means you can do vector search over multimodal data. It’s pretty incredible. You should try it sometime.

The economics of vector embeddings. This shouldn’t be a surprise, but the vector dimensions will impact the required storage cost. So, consider whether it is worth it before you choose an embedding model with 1536 dimensions over one with 768 dimensions and risk doubling your storage requirements. Yes, more dimensions capture more semantic nuances. But you probably don’t need 1536 dimensions to “chat with your docs”. Some models actually use Matryoshka Representation Learning to allow you to shorten vector embeddings for environments with less computational resources, with minimal performance losses.
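
For models trained with Matryoshka Representation Learning, shortening is as simple as truncating and renormalizing. A minimal sketch (the 256-dimension cut-off is just an example; for models not trained this way, truncation can hurt badly):

```python
# Matryoshka-style shortening: keep the first d dimensions, then renormalize
# so cosine similarity / dot product still behave as expected.
import numpy as np

def shorten(embedding: np.ndarray, d: int = 256) -> np.ndarray:
    truncated = embedding[:d]
    return truncated / np.linalg.norm(truncated)

full = np.random.rand(1536)   # stand-in for a 1536-dim MRL embedding
short = shorten(full, d=256)  # 6x less storage per vector
```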

Speaking of: “Chat with your docs” tutorials are the “Hello world” programs of Generative AI.

You need to call the embedding model A LOT. Just because you embedded your documents during the ingestion stage doesn’t mean you’re done calling the embedding model. Every time you run a search query, the query must also be embedded (unless you’re using a cache). If you add objects later on, those must also be embedded (and indexed). If you change the embedding model, you must re-embed (and re-index) everything.

Similar does not necessarily mean relevant. Vector search returns objects by their similarity to a query vector. The similarity is measured by their proximity in vector space. Just because two sentences are similar in vector space (e.g., “How to fix a faucet” and “Where to buy a kitchen faucet”) does not mean they are relevant to each other.

Cosine similarity and cosine distance are not the same thing. But they are related to each other (cosine distance = 1 - cosine similarity). If you will, distance and similarity are inverses: If two vectors are exactly the same, the similarity is 1 and the distance between them is 0.

If you’re working with normalized vectors, it doesn’t matter whether you use cosine similarity or dot product as the similarity measure, because mathematically they are the same. Computationally, the dot product is more efficient.
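
A quick NumPy sanity check of this equivalence:

```python
# For unit-length vectors, cosine similarity and dot product give the same value.
import numpy as np

a, b = np.random.rand(768), np.random.rand(768)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # normalize to unit length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b  # same value, but without the two norm computations
assert np.isclose(cosine, dot)
```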

Common misconception: The R in RAG stands for ‘vector search’. It doesn’t. It stands for ‘retrieval’. And retrieval can be done in many different ways (see following bullets).

Vector search is just one tool in the retrieval toolbox. There’s also keyword-based search, filtering, and reranking. It’s not one over the other. To build something great, you will need to combine different tools.

When to use keyword-based search vs. vector-based search: Does your use case require mainly matching semantics and synonyms (e.g., “pastel colors” vs. “light pink”) or exact keywords (e.g., “A-line skirt”, “peplum dress”)? If it requires both (e.g., “pastel colored A-line skirt”), you might benefit from combining both and using hybrid search. In some implementations (e.g., Weaviate), you can just use the hybrid search function and then use the alpha parameter to shift the weighting from pure keyword-based search, through a mix of both, to pure vector search.
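
Here is a deliberately simplified sketch of the alpha idea. Real engines, Weaviate included, first normalize the two score distributions or fuse ranks (e.g., reciprocal rank fusion) before mixing, but the weighting intuition is the same: 0 means pure keyword search, 1 means pure vector search, anything in between is a blend:

```python
# Alpha-weighted hybrid fusion over pre-normalized scores in [0, 1].
def hybrid_score(keyword_score: float, vector_score: float, alpha: float = 0.5) -> float:
    return (1 - alpha) * keyword_score + alpha * vector_score

# A keyword-leaning query (alpha close to 0 favors exact term matches).
print(hybrid_score(keyword_score=0.8, vector_score=0.3, alpha=0.25))
```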

Hybrid search can be a hybrid of different search techniques. Most often, when you hear people talk about hybrid search, they mean the combination of keyword-based search and vector-based search. But the term ‘hybrid’ doesn’t specify which techniques to combine. So, sometimes you might hear people talk about hybrid search, meaning the combination of vector-based search and search over structured data (often referred to as metadata filtering).

Misconception: Filtering makes vector search faster. Intuitively, you’d think using a filter should speed up search latency because you’re reducing the number of candidates to search through. But in practice, pre-filtering candidates can, for example, break the graph connectivity in HNSW, and post-filtering can leave you with no results at all. Vector databases have different, sophisticated techniques to handle this challenge.

Two-stage retrieval pipelines aren’t only for recommendation systems. Recommendation systems often have a first retrieval stage that uses a simpler retrieval process (e.g., vector search) to reduce the number of potential candidates, followed by a second, more compute-intensive but more accurate reranking stage. You can apply this to your RAG pipeline as well.
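
A minimal sketch of the second stage, assuming the sentence-transformers package (the cross-encoder model name is just a common example, and vector_search is a hypothetical first-stage function):

```python
# Second stage of a two-stage pipeline: rerank a small candidate set with a
# cross-encoder, which scores each (query, document) pair jointly.
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# candidates = vector_search(query, limit=100)  # hypothetical first stage
# results = rerank(query, candidates, top_k=5)
```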

How vector search differs from reranking. Vector search returns a small portion of results from the entire database. Reranking takes in a list of items and returns the re-ordered list.

Finding the right chunk size to embed is not trivial. Too small, and you’ll lose important context. Too big, and you’ll lose semantic meaning. Many embedding models use mean pooling to average all token embeddings into a single vector representation of a chunk. So, if you have an embedding model with a large context window, you can technically embed an entire document. I forgot who said this, but I like this analogy: You can think of it like creating a movie poster for a movie by overlaying every single frame in the movie. All the information is there, but you won’t understand what the movie is about.
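
A minimal sketch of mean pooling (ignoring the attention-mask weighting that real models apply):

```python
# Mean pooling: average all token embeddings into one chunk vector. The longer
# the chunk, the more this average smears distinct ideas together; that is the
# "movie poster made from every frame" effect described above.
import numpy as np

token_embeddings = np.random.rand(512, 768)      # 512 tokens x 768 dims
chunk_embedding = token_embeddings.mean(axis=0)  # one 768-dim vector
```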

Vector indexing libraries are different from vector databases. Both are incredibly fast for vector search. Both work really well to showcase vector search in “chat with your docs”-style RAG tutorials. However, only one of them adds data management features, like built-in persistence, CRUD support, metadata filtering, and hybrid search.

RAG has been dying since the release of the first long-context LLM. Every time an LLM with a longer context window is released, someone will claim that RAG is dead. It never is…

You can throw out 97% of the information and still retrieve (somewhat) accurately. It’s called vector quantization. For example, with binary quantization you can change something like [-0.9837, 0.1044, 0.0090, …, -0.2049] into [0, 1, 1, …, 0] (a 32x storage reduction, from 32-bit floats to 1 bit) and you’ll be surprised how well retrieval continues to work (in some use cases).
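
A minimal sketch of binary quantization, keeping only the sign of each dimension and comparing vectors with Hamming distance instead of float math:

```python
# Binary quantization: each 32-bit float dimension becomes a single bit.
import numpy as np

def binarize(v: np.ndarray) -> np.ndarray:
    return (v > 0).astype(np.uint8)  # 1 where positive, else 0

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))  # number of differing bits

v1, v2 = np.random.randn(768), np.random.randn(768)
print(hamming_distance(binarize(v1), binarize(v2)))
```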

Vector search is not robust to typos. For a while, I thought that vector search was robust to typos because the large text corpora used for training surely contain plenty of typos and therefore help the embedding model learn them as well. But if you think about it, there’s no way that all the possible typos of a word are reflected in sufficient amounts in the training data. So, while vector search can handle some typos, you can’t really say it is robust to them.

Knowing when to use which metric to evaluate search results. There are many different metrics to evaluate search results. Looking at academic benchmarks, like BEIR, you’ll notice that NDCG@k is prominent. But simpler metrics like precision and recall are a great fit for many use cases.

The precision-recall trade-off is often depicted with a fisherman’s analogy of casting a net, but this e-commerce analogy made it click better for me: Imagine you have a webshop with 100 books, out of which 10 are ML-related.

Now, if a user searches for ML-related books, you could just return one ML book. Amazing! You have perfect precision (out of the k=1 results returned, how many were relevant). But that’s bad recall (out of the relevant results that exist, how many did I return? In this case, 1 out of 10 relevant books). And also, that’s not so good for your business. Maybe the user didn’t like that one ML-related book you returned.

On the other side of that extreme is if you return your entire selection of books. All 100 of them. Unsorted… That’s perfect recall because you returned all relevant results. It’s just that you also returned a bunch of irrelevant results, which shows up as poor precision.
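
Both metrics fit in a few lines of Python, using the bookshop numbers from above (the book IDs are made up):

```python
# Precision@k and recall@k for the bookshop example: 100 books, 10 relevant.
def precision_at_k(retrieved: list[str], relevant: set[str]) -> float:
    return sum(doc in relevant for doc in retrieved) / len(retrieved)

def recall_at_k(retrieved: list[str], relevant: set[str]) -> float:
    return sum(doc in relevant for doc in retrieved) / len(relevant)

relevant = {f"ml_book_{i}" for i in range(10)}  # the 10 ML books

one_hit = ["ml_book_0"]                          # return a single ML book
print(precision_at_k(one_hit, relevant))         # 1.0: perfect precision
print(recall_at_k(one_hit, relevant))            # 0.1: poor recall

everything = [f"ml_book_{i}" for i in range(10)] + [f"other_{i}" for i in range(90)]
print(precision_at_k(everything, relevant))      # 0.1: poor precision
print(recall_at_k(everything, relevant))         # 1.0: perfect recall
```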

There are metrics that include the order. When I think of search results, I visualize something like a Google search. So, naturally, I thought that the rank of the search results is important. But metrics like precision and recall don’t consider the order of search results. If the order of your search results is important for your use case, you need to choose rank-aware metrics like MRR@k, MAP@k, or NDCG@k.
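
As a taste, here is a minimal MRR@k sketch, simplified to a single shared relevance set per query batch:

```python
# MRR@k rewards putting the first relevant result as high in the list as possible.
def mrr_at_k(ranked_lists: list[list[str]], relevant: set[str], k: int = 10) -> float:
    reciprocal_ranks = []
    for results in ranked_lists:
        rr = 0.0
        for rank, doc in enumerate(results[:k], start=1):
            if doc in relevant:
                rr = 1.0 / rank  # only the FIRST relevant hit counts
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# One query where the first relevant doc sits at rank 2 -> MRR = 0.5
print(mrr_at_k([["other_1", "ml_book_0", "other_2"]], {"ml_book_0"}))
```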

Tokenizers matter. If you’ve been in the Transformer bubble too long, you’ve probably forgotten that other tokenizers exist besides Byte-Pair Encoding (BPE). Tokenizers are also important for keyword search and its search performance. And if the tokenizer impacts keyword-based search performance, it also impacts hybrid search performance.
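
A toy example of the effect on keyword search (both tokenizers here are deliberately naive):

```python
# A bare whitespace split treats "Faucet." and "faucet" as different terms, so
# a keyword scorer like BM25 never matches them; lowercasing and stripping
# punctuation fixes that.
import re

def naive_tokenize(text: str) -> list[str]:
    return text.split()

def keyword_tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

print(naive_tokenize("Fix the Faucet."))    # ['Fix', 'the', 'Faucet.']
print(keyword_tokenize("Fix the Faucet."))  # ['fix', 'the', 'faucet']
```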

Out-of-domain is not the same as out-of-vocabulary. Earlier embedding models used to fail on out-of-vocabulary terms. If your embedding model had never seen or heard of “Labubu”, it would have simply run into an error. With smart tokenization, unseen out-of-vocabulary terms can be handled gracefully, but the issue is that they are still out-of-domain terms: their vector embeddings look like proper embeddings, but they are meaningless.

Query optimizations: You know how you’ve learned to type “longest river africa” into Google’s search bar, instead of “What is the name of the longest river in Africa?”. You’ve learned to optimize your search query for keyword search (yes, we know the Google search algorithm is more sophisticated. Can we just go with it for a second?). Similarly, we now need to learn how to optimize our search queries for vector search.

What comes after vector search? First, there was keyword-based search. Then, Machine Learning models enabled vector search. Now, LLMs with reasoning enable reasoning-based retrieval.

Information retrieval is so hot right now. I feel fortunate to get to work in this exciting space. Although working on and with LLMs seems to be the cool thing now, figuring out how to provide the best information for them is equally exciting. And that’s the field of retrieval.

I’m repeating my last point, but looking back at the past two years, I feel grateful to work in this field. I have only scratched the surface so far, and there’s still so much to learn. When I joined Weaviate, vector databases were the hot new thing. Then came RAG. Now, we’re talking about “context engineering”. But what hasn’t changed is the importance of finding the best information to give the LLM so it can provide the best possible answer.
