5 minute read

Implementing and Migrating Vector Databases with PostgreSQL and LangChain: A Technical Guide to PGVector

Vector databases have become a crucial component in modern AI applications, particularly for semantic search and retrieval systems. PostgreSQL, with its pgvector extension, offers a powerful and accessible solution for vector storage and similarity search. In this guide, we’ll explore how to implement vector databases using LangChain’s PostgreSQL integrations, with a special focus on migrating from the older PGVector implementation to the newer langchain_postgres package.

Understanding the Migration Context

LangChain’s original PGVector class (from langchain_community.vectorstores.pgvector) is being deprecated in favor of the newer implementation in the langchain_postgres package. This migration brings several improvements and changes that developers should be aware of:

  1. The new implementation uses psycopg3 instead of psycopg2
  2. Filtering syntax has been updated to use $ prefixed operators for JSONB metadata fields
  3. Schema changes have been implemented to address issues in the existing implementation

Let’s explore how to work with both implementations, and then cover the migration process.

Setting Up PGVector with LangChain

Before diving into the migration, let’s understand how to implement a vector database using the original PGVector class:

from langchain_community.vectorstores.pgvector import PGVector
from langchain_openai import OpenAIEmbeddings

# Connection details
connection_string = "postgresql://username:password@localhost:5432/vectordb"

# Initialize embedding function
embedding_function = OpenAIEmbeddings()

# Create vector store
vector_store = PGVector(
    connection_string=connection_string,
    embedding_function=embedding_function,
    collection_name="my_documents",
    embedding_length=1536,  # Length depends on your embedding model
    distance_strategy="COSINE"  # Options include COSINE, EUCLIDEAN, or INNER_PRODUCT
)

Adding Documents to the Vector Store

Once your vector store is initialized, you can add documents to it:

from langchain_core.documents import Document

# Create some sample documents
documents = [
    Document(page_content="LangChain provides tools for working with language models", metadata={"source": "docs", "topic": "langchain"}),
    Document(page_content="Vector databases store embeddings for semantic search", metadata={"source": "article", "topic": "vector-db"}),
    Document(page_content="PostgreSQL is a powerful open-source database", metadata={"source": "docs", "topic": "postgres"})
]

# Add documents to the vector store
vector_store.add_documents(documents)

With documents added, you can now perform similarity searches:

# Basic similarity search
results = vector_store.similarity_search(
    query="How do language models work with databases?",
    k=2  # Return top 2 results
)

# Search with metadata filtering
filtered_results = vector_store.similarity_search(
    query="PostgreSQL capabilities",
    k=2,
    filter={"topic": "postgres"}
)

# Search with score
results_with_scores = vector_store.similarity_search_with_score(
    query="Vector database implementations",
    k=2
)

Migrating to langchain_postgres

Now, let’s look at how to migrate to the newer langchain_postgres implementation:

1. Install the New Package

pip install langchain_postgres

2. Update Import Statements

Replace:

from langchain_community.vectorstores.pgvector import PGVector

With:

from langchain_postgres import PGVector

3. Adjust Connection Configuration

The new implementation works with psycopg3, not psycopg2. This might require changes to your connection string or parameters:

from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings

# Initialize with the new implementation
new_vector_store = PGVector(
    connection_string="postgresql://username:password@localhost:5432/vectordb",
    embedding_function=OpenAIEmbeddings(),
    collection_name="my_documents",
    embedding_length=1536,
    distance_strategy="cosine"
)

4. Update Filtering Syntax

The new implementation uses $ prefixed operators for JSONB metadata fields. Here’s how filtering changes:

# Old filtering approach
results = old_vector_store.similarity_search(
    query="PostgreSQL capabilities",
    filter={"topic": "postgres"}
)

# New filtering approach with $ prefix operators
results = new_vector_store.similarity_search(
    query="PostgreSQL capabilities",
    filter={"metadata": {"topic": {"$eq": "postgres"}}}
)

The new filtering syntax is more powerful, allowing for more complex queries:

# More advanced filtering with the new syntax
results = new_vector_store.similarity_search(
    query="Database technologies",
    filter={
        "metadata": {
            "topic": {"$in": ["postgres", "vector-db"]},
            "source": {"$eq": "docs"}
        }
    }
)

Schema Migration Considerations

The new implementation has made schema changes to address issues in the existing implementation. This means you’ll need to either:

  1. Re-create your tables and re-index your data, or
  2. Perform a manual migration

Option 1: Re-creating Tables and Re-indexing

This is the simplest approach:

from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings

# Initialize the new vector store with pre_delete_collection=True
new_vector_store = PGVector(
    connection_string="postgresql://username:password@localhost:5432/vectordb",
    embedding_function=OpenAIEmbeddings(),
    collection_name="my_documents",
    pre_delete_collection=True  # This will delete the existing collection
)

# Re-add your documents
documents = [...]  # Your document collection
new_vector_store.add_documents(documents)

Option 2: Manual Migration

If you have a large dataset and want to avoid re-embedding everything, you might consider a manual migration. This is more complex and depends on your specific schema, but here’s a general approach:

import psycopg
from langchain_community.vectorstores.pgvector import PGVector as OldPGVector
from langchain_postgres import PGVector as NewPGVector
from langchain_openai import OpenAIEmbeddings

# 1. Retrieve all documents and embeddings from the old store
old_store = OldPGVector(
    connection_string="postgresql://username:password@localhost:5432/vectordb",
    embedding_function=OpenAIEmbeddings(),
    collection_name="my_documents"
)

# Connect to the database directly for custom queries
conn = psycopg.connect("postgresql://username:password@localhost:5432/vectordb")
cursor = conn.cursor()

# 2. Query the old table structure to get documents and embeddings
cursor.execute("""
    SELECT id, embedding, metadata, document
    FROM langchain_my_documents
""")
records = cursor.fetchall()

# 3. Initialize the new store
new_store = NewPGVector(
    connection_string="postgresql://username:password@localhost:5432/vectordb",
    embedding_function=OpenAIEmbeddings(),
    collection_name="my_documents_new"
)

# 4. Add the data to the new store using add_embeddings
for record in records:
    doc_id, embedding, metadata, content = record
    new_store.add_embeddings(
        texts=[content],
        embeddings=[embedding],
        metadatas=[metadata],
        ids=[doc_id]
    )

cursor.close()
conn.close()

Advanced Features in the New Implementation

The new implementation offers several improvements worth exploring:

MMR search helps balance relevance and diversity in search results:

results = new_vector_store.max_marginal_relevance_search(
    query="Database technologies",
    k=5,  # Number of documents to return
    fetch_k=20,  # Number of documents to consider before reranking
    lambda_mult=0.5  # Diversity vs. relevance trade-off (0=max diversity, 1=max relevance)
)

Asynchronous Operations

The new implementation supports async operations:

import asyncio

async def perform_async_search():
    results = await new_vector_store.asimilarity_search(
        query="Vector databases",
        k=3
    )
    return results

results = asyncio.run(perform_async_search())

Using as a Retriever

Both implementations can be used as retrievers in LangChain chains:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Create a retriever
retriever = new_vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 10}
)

# Use in a chain
llm = ChatOpenAI()
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the following context:
{context}

Question: {question}
""")

chain = (
    {"context": retriever, "question": lambda x: x}
    | prompt
    | llm
)

result = chain.invoke("How do vector databases work with language models?")

Performance Considerations

When working with large vector databases in PostgreSQL, consider these performance tips:

  1. Indexing: Ensure you have properly configured pgvector indexes
    CREATE INDEX ON your_collection_name USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
    
  2. Partitioning: Consider partitioning large tables by metadata categories

  3. Connection Pooling: Use connection pooling to manage database connections efficiently

  4. Batch Operations: Use batch operations for adding multiple documents:
    vector_store.add_documents(documents_batch)
    

Conclusion

Migrating from LangChain’s original PGVector implementation to the newer langchain_postgres package brings several benefits, including improved filtering capabilities, better async support, and schema improvements. While the migration requires some code changes and potentially data migration, the benefits make it worthwhile for most applications.

The PostgreSQL vector database approach offers a powerful and accessible solution for implementing semantic search in your applications, combining the reliability of PostgreSQL with the performance of vector similarity search.

By following the guidance in this article, you should be able to implement a vector database using PostgreSQL and successfully migrate from the older LangChain implementation to the newer, more robust solution.

This post was originally written in my native language and then translated using an LLM. I apologize if there are any grammatical inconsistencies.

Categories:

Updated: