Pinecone in Production: Pod Sizing, Upserts, and the Cost Math That Surprises Teams
April 6, 2023 · 7 min read · by Muhammad Amal ai

TL;DR:

  • Pinecone’s pod-based pricing means you pay for provisioned capacity, not usage. Get the sizing right or you’ll overpay by 3-5x.
  • Batch your upserts in groups of 100 vectors and use async clients above a few thousand writes per second.
  • Metadata filtering is powerful but adds memory pressure, so be ruthless about what you store.

When I first deployed Pinecone for a real workload back in February, I made every mistake the docs warn about and a few they don’t. I provisioned the wrong pod type, I upserted one vector at a time, and I stuffed every piece of metadata I had into the index because storage is cheap, right? It isn’t, in Pinecone, because metadata lives in RAM alongside the vectors.

This post is the operational playbook I should have had when I started. I’ll cover pod selection, index creation, batched ingestion, filtering, and the cost math that catches most teams off guard. If you’re still deciding whether Pinecone is the right fit at all, the vector database landscape post from earlier this week has the broader comparison.

Everything below assumes pinecone-client version 2.2.1, which is the current Python SDK as of early April. The API has been stable for the last few releases but the SDK has been moving, so double-check the import paths against your installed version.

Setting up

The setup story is unremarkable, which is the point of Pinecone. You sign up, you get an API key, you pick an environment, and you’re done.

pip install pinecone-client==2.2.1 openai==0.27.4
export PINECONE_API_KEY="..."
export PINECONE_ENVIRONMENT="us-west1-gcp"
export OPENAI_API_KEY="..."

The environment string matters. Each environment is a region plus a cloud provider plus a project, and indexes can’t be moved between them. If you start in us-west1-gcp and later need EU residency, you’re doing a full reindex into a new environment.

import os
import pinecone

pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ["PINECONE_ENVIRONMENT"],
)

print(pinecone.list_indexes())

If that prints a list (empty on a brand-new project) instead of raising an authentication error, you’re connected.

Pod types and the sizing math

This is the part most blog posts skip. Pinecone’s pricing in April 2023 is entirely pod-based: you provision pods of a specific type, and you pay an hourly rate per pod regardless of how busy they are.

The pod types you actually need to know about:

  • s1 (storage-optimized). 5M vectors at 768 dimensions per pod. Higher query latency, cheapest per vector. Good for archival or low-QPS workloads.
  • p1 (performance). 1M vectors at 768 dimensions per pod. Mid-tier latency. The default choice for most semantic search workloads.
  • p2 (high-performance). 1M vectors at 768 dimensions per pod, with lower latency and higher QPS than p1. The right choice when query latency under 50ms matters.

A pod also has a size suffix - x1, x2, x4, x8 - which is a multiplier on both capacity and price. An s1.x2 holds 10M vectors and costs twice an s1.x1.

Dimension matters. The capacity numbers above are for 768-dim vectors. OpenAI’s text-embedding-ada-002 returns 1536-dim vectors, which means you get roughly half the capacity per pod. The math:

effective_capacity = base_capacity * (768 / your_dimension)

So a p1.x1 with ada-002 embeddings holds about 500K vectors, not 1M. I have watched a team blow through their pod limit at 480K vectors and spend a frantic afternoon figuring out why.

Replicas multiply your QPS but don’t add capacity. A p1.x1 with 2 replicas costs 2x, handles 2x the queries, but still holds the same number of vectors.
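
The capacity arithmetic above is easy to get wrong under deadline pressure, so here is a minimal sketch of it as code. The function names and the capacity table are my own scaffolding built from the numbers quoted in this section, not anything the SDK provides, and real capacity also depends on how much metadata you store.

```python
import math

# Per-pod capacities at 768 dimensions for the x1 size, as quoted above.
BASE_CAPACITY_768 = {"s1": 5_000_000, "p1": 1_000_000, "p2": 1_000_000}

def effective_capacity(pod_family: str, dimension: int, size: int = 1) -> int:
    """Approximate vectors per pod; size is the x1/x2/x4/x8 multiplier."""
    return BASE_CAPACITY_768[pod_family] * size * 768 // dimension

def pods_needed(num_vectors: int, pod_family: str, dimension: int, size: int = 1) -> int:
    """Pods required to hold num_vectors. Replicas do not change this number."""
    return math.ceil(num_vectors / effective_capacity(pod_family, dimension, size))

print(effective_capacity("p1", 1536))      # 500000
print(pods_needed(2_000_000, "p1", 1536))  # 4
```

Treat the result as a floor, not a target: leave headroom so you aren’t resizing the week after launch.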

Creating an index

pinecone.create_index(
    name="prod-docs",
    dimension=1536,
    metric="cosine",
    pods=2,
    pod_type="p1.x1",
    replicas=1,
    metadata_config={
        "indexed": ["source", "year", "team"],
    },
)

The metadata_config.indexed field is the single most important parameter on this call. By default, Pinecone indexes every metadata field you upload, which means every field lives in RAM. For high-cardinality fields like document IDs or timestamps, this can balloon your memory footprint and force you onto a larger pod tier. Explicitly listing only the fields you’ll filter on cuts memory significantly.
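
One way to make that discipline stick is to strip metadata down to an allow-list before it ever reaches upsert. This is a minimal sketch: KEEP_FIELDS and prune_metadata are hypothetical names of mine, and the extra "title" field stands in for whatever display-only data you actually need back from a query.

```python
# Hypothetical allow-list: the filterable fields from metadata_config above,
# plus one display-only field. Adjust to your own schema.
KEEP_FIELDS = {"source", "year", "team", "title"}

def prune_metadata(metadata: dict, keep=KEEP_FIELDS) -> dict:
    """Return a copy of metadata containing only allow-listed keys."""
    return {k: v for k, v in metadata.items() if k in keep}

raw = {
    "source": "wiki",
    "year": 2023,
    "team": "platform",
    "title": "Runbook",
    "raw_html": "<html>...</html>",   # bulky, never filtered on
    "ingested_at": 1680739200,        # high-cardinality, never filtered on
}
print(prune_metadata(raw))
```

Anything dropped here can live in your primary datastore and be joined back by vector ID after the query.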

The metric is locked at creation time. cosine, dotproduct, and euclidean are the options. For ada-002, which returns normalized vectors, cosine and dotproduct are mathematically equivalent and dotproduct is marginally faster. I default to cosine for clarity unless I’m in a latency-critical path.
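
The equivalence is quick to verify. Cosine similarity is the dot product divided by the product of the two norms, and for unit-length vectors that product is 1. A small check in plain Python, using made-up 3-dimensional vectors rather than real embeddings:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# Toy vectors, normalized to unit length the way ada-002 output already is.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# On unit vectors the two metrics produce the same score.
print(math.isclose(cosine(a, b), dot(a, b)))
```

If you ever switch to an embedding model that doesn’t normalize its output, this stops holding and the metric choice starts to matter.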

Batched upserts

The naive ingestion loop is the most common performance footgun:

# Don't do this
for doc in documents:
    embedding = embed(doc.text)
    index.upsert(vectors=[(doc.id, embedding, doc.metadata)])

That’s one HTTP round trip per vector. On a real corpus you’ll wait hours for what should take minutes.

The right pattern is batching to 100 vectors per upsert, which is the Pinecone-recommended batch size:

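from itertools import islice

import pinecone

# Handle to the index created above; assumes pinecone.init() has already run.
index = pinecone.Index("prod-docs")

def chunks(iterable, n):
    """Yield successive lists of up to n items from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, n)):  # walrus operator needs Python 3.8+
        yield batch

def upsert_batch(index, docs):
    # Each doc is a dict with a precomputed embedding; one request per batch.
    vectors = [
        (doc["id"], doc["embedding"], doc["metadata"])
        for doc in docs
    ]
    index.upsert(vectors=vectors)

for batch in chunks(documents, 100):
    upsert_batch(index, batch)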

For higher throughput, the SDK supports async upserts via index.upsert(vectors=..., async_req=True), which returns a future. I typically fan out 20-30 in-flight requests, which saturates a single p1.x1 pod’s write capacity at around 1500-2000 vectors per second. Going above that requires more pods, not more concurrency.

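def parallel_upsert(index, documents, batch_size=100, concurrency=20):
    # Assumes the index handle was created with a matching thread pool, e.g.
    # pinecone.Index("prod-docs", pool_threads=concurrency). With the default
    # pool size the async requests queue up instead of running in parallel;
    # check this against your installed SDK version.
    futures = []
    for batch in chunks(documents, batch_size):
        vectors = [(d["id"], d["embedding"], d["metadata"]) for d in batch]
        future = index.upsert(vectors=vectors, async_req=True)
        futures.append(future)
        # Cap in-flight requests: drain the current window before sending more.
        if len(futures) >= concurrency:
            for f in futures:
                f.get()  # blocks until done and re-raises any request error
            futures = []
    # Drain whatever is left from the final partial window.
    for f in futures:
        f.get()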

Querying with filters

The query API is small. You give it a vector, a top_k, and optionally a filter:

results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "year": {"$eq": 2023},
        "team": {"$in": ["platform", "infra"]},
    },
    include_metadata=True,
)

for match in results["matches"]:
    print(match["id"], match["score"], match["metadata"])

The filter language is MongoDB-flavored: $eq, $ne, $in, $nin, $gt, $gte, $lt, $lte. There’s no $exists, no regex matching, and no full-text search. If you need any of that, you’re either combining Pinecone with another system or you’re on the wrong tool.

One thing the Pinecone docs don’t emphasize enough: filters apply during the search, not after. That means a highly selective filter doesn’t force the index to scan more vectors, but it can return fewer than top_k results if not enough matches exist within the filtered subset. Always check len(results["matches"]) before assuming you have top_k items.
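
A minimal sketch of that check, so a short result set shows up in your logs instead of as a mystery downstream. checked_matches is a hypothetical helper of mine; it only assumes the response can be indexed with ["matches"], as in the query example above.

```python
import logging

logger = logging.getLogger("search")

def checked_matches(results, top_k):
    """Return the matches list, logging when fewer than top_k came back."""
    matches = results["matches"]
    if len(matches) < top_k:
        logger.warning(
            "filter returned %d of %d requested matches", len(matches), top_k
        )
    return matches
```

Usage is a one-line swap: iterate over checked_matches(results, top_k=10) instead of results["matches"].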

Namespaces

Namespaces are Pinecone’s tenancy primitive. Every operation can take a namespace parameter, and namespaces are essentially free up to the pod’s vector limit. I use them aggressively to partition data by tenant, environment, or document type:

index.upsert(vectors=vectors, namespace="tenant-acme")
index.query(vector=q, top_k=5, namespace="tenant-acme")

The catch: each query hits exactly one namespace. There’s no way to query across namespaces in a single call, which makes namespaces a coarse-grained partitioning tool, not a filter substitute.
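
If you do need results from several namespaces, the workaround is to fan out one query per namespace and merge client-side. The sketch below is an illustration under stated assumptions, not an SDK feature: query_namespaces is a hypothetical helper, and the merge assumes a metric where a higher score means a closer match (cosine or dotproduct).

```python
import heapq

def query_namespaces(index, vector, namespaces, top_k=5, **query_kwargs):
    """Query each namespace separately and return the overall top_k.

    Returns (score, namespace, id) tuples, best first. Assumes higher
    scores are better, which holds for cosine and dotproduct.
    """
    merged = []
    for ns in namespaces:
        res = index.query(vector=vector, top_k=top_k, namespace=ns, **query_kwargs)
        for m in res["matches"]:
            merged.append((m["score"], ns, m["id"]))
    return heapq.nlargest(top_k, merged, key=lambda t: t[0])
```

This costs one round trip per namespace, so it only makes sense for a handful of namespaces. If you find yourself fanning out across dozens, a metadata filter inside a single namespace is probably the better model.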

Common pitfalls

  • Forgetting include_metadata=True. By default, query results only return IDs and scores. If you need the metadata for downstream processing, set the flag explicitly. I’ve debugged “missing fields” bugs that turned out to be this.
  • Assuming upsert merges metadata. It doesn’t: an upsert replaces the whole record for that ID, vector and metadata together. If you upsert an existing ID without a metadata field, the stored metadata is overwritten along with the vector rather than preserved. To change some metadata fields and keep the rest, use update with set_metadata instead of upsert, and verify the behavior against your SDK version before relying on it.
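  • Using delete(delete_all=True) in tests. In the Python SDK the keyword is delete_all (deleteAll is the REST field name), and it wipes every vector in the targeted namespace, which is the default namespace if you don’t pass one. There’s no soft delete and no undo. Always scope to a dedicated test namespace in your test harness.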
  • Underestimating the cold-start cost. Pinecone indexes take 1-3 minutes to provision and become queryable. CI pipelines that create-then-query indexes need to poll describe_index until status.ready is True.
  • Provisioning replicas before you need them. Replicas double your bill. I’ve seen teams add replicas to “improve performance” when the actual bottleneck was their embedding API call, not the vector search.
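
For the cold-start pitfall, a polling loop with a timeout is enough. This is a minimal sketch: wait_until_ready is a hypothetical helper of mine that takes any zero-argument readiness check, so the exact shape of the describe_index response stays in one place that you can adjust for your SDK version.

```python
import time

def wait_until_ready(is_ready, timeout_s=300.0, poll_s=5.0, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index not ready after {timeout_s} seconds")
        sleep(poll_s)
```

With pinecone-client 2.2.1 I would call it as wait_until_ready(lambda: pinecone.describe_index("prod-docs").status["ready"]), assuming the status field is a dict with a ready key; confirm that against what describe_index prints in your environment.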

Cost math that catches teams

A worked example. You have 2 million documents, each producing one 1536-dim ada-002 vector. You want sub-100ms p95 query latency.

  • Effective capacity per p1.x1 pod = 500K vectors.
  • Pods needed = ceil(2_000_000 / 500_000) = 4.
  • At roughly $0.096/hour list price per p1.x1 pod, that’s 4 × $0.096 = $0.384/hour, or about $280/month (730 hours) for the index.
  • Adding one replica for query throughput headroom doubles that to ~$560/month.

That’s before the embedding cost itself. Embedding 2M documents at an average of 500 tokens each with ada-002 at $0.0001/1K tokens is 2M * 0.5 * $0.0001 = $100 one-time, plus query embeddings ongoing. The Pinecone bill dwarfs the OpenAI bill at this scale, which is the opposite of what most teams expect on day one.
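
The same arithmetic as a small calculator, so you can plug in your own corpus before provisioning anything. This is a sketch with names of my own choosing; the 730 hours per month is an averaging assumption, the replicas argument counts total copies of the index (1 means no extra replica), and the rates are inputs you should replace with current list prices.

```python
import math

HOURS_PER_MONTH = 730  # assumption: average hours in a month

def monthly_index_cost(num_vectors, vectors_per_pod, hourly_rate, replicas=1):
    """Monthly pod cost; replicas is the total number of index copies."""
    pods = math.ceil(num_vectors / vectors_per_pod)
    return pods * replicas * hourly_rate * HOURS_PER_MONTH

def one_time_embedding_cost(num_docs, avg_tokens, price_per_1k_tokens):
    """Cost of embedding the whole corpus once."""
    return num_docs * avg_tokens / 1000 * price_per_1k_tokens

# Inputs from the worked example: 2M vectors, 500K per p1.x1 pod, $0.096/hour.
base = monthly_index_cost(2_000_000, 500_000, 0.096)
with_replica = monthly_index_cost(2_000_000, 500_000, 0.096, replicas=2)
embedding = one_time_embedding_cost(2_000_000, 500, 0.0001)
print(f"index: ${base:.2f}/month, with one extra replica: ${with_replica:.2f}/month")
print(f"one-time embedding: ${embedding:.2f}")
```

The useful part is seeing which input moves the total: pod count and replicas scale the bill every month, while the embedding cost is paid once per full reindex.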

Wrapping up

Pinecone trades flexibility for simplicity and trades cost for managed convenience. If your workload sits at the 1-10M vector range and you want to stop thinking about infrastructure, it’s hard to beat. Past that range, or if cost predictability matters more than ops convenience, the self-hosted options are worth a serious look.

The next post in this series will get into embedding models themselves, which is where the actual quality of your semantic search lives or dies regardless of which database you put underneath it.