Skip to main content

Command Palette

Search for a command to run...

Database Sharding vs Partitioning

Updated
15 min read
Database Sharding vs Partitioning
Y

Yash is passionate about Distributed Systems and Observability, and creates content covering topics such as DevOps, and Cloud Native technologies.

Let me start with a little confession, when I first started my engineering journey, I genuinely used to think database partitioning and sharding were the exact same thing. Like, literally the same concept just with different buzzwords. I still remember sitting in an architecture review session during an internship early in my career. Our primary database was choking on some massive reporting queries. Someone said "we need to partition the transactions table," and out of excitement I said, "Yeah, we should just shard it across a few instances, that'll fix it!"

The room went dead quiet, and suddenly, all eyes were on me. I couldn't read the room perfectly, but some of them looked at me with this gentle 'he’ll learn eventually.' While others thought, 'Oh, you poor intern. Why are you even talking? Just nod, take notes and listen to the seniors.' However it took a few more embarrassing yet eye opening whiteboard sessions for me to realize just how spectacularly wrong I was that day. So, let me tell you, confusing partitioning with sharding is basically like confusing buying a drawer organizer with buying a second house across town.

Over the time, after scaling a few systems and feeling the operational pain of bad DB choices firsthand, I finally understood the profound difference. A lot of engineers still blur these lines until they are forced to deal with them in production at 3am or maybe somehow if they are vibe coding. So, let’s break down exactly what these two concepts are, how they actually work under the hood in systems like Postgres and MySQL, and most importantly, when you actually need to reach for them.

TL;DR: If you only take one mental model away from this massive post, let it be this:

  • Partitioning operates at the data level. You are taking one gigantic table and slicing it into smaller, more manageable physical segments. But this all lives inside the same database instance. The database engine handles the routing for you transparently.

  • Sharding operates at the hardware/infrastructure level. You are taking your data and splitting it across multiple separate database servers. Your application code (or a smart proxy layer) has to figure out which physical server to talk to.

And hey please, do not shard until it is an unavoidable necessity. I know there are engineers out there who live for the thrill of sharding and are currently rolling their eyes at this, if they are reading. But hold that thought! I'll explain more on this this towards the end of this blog that why I personally think so.

And don't get me wrong, it's not that I hate sharding... nope. I actually like it! But, you know, how not all likes are created equal. So, grab a coffee, cause this is gonna be a kinda deep dive and let's continue. Below is a diagram which gives you a rough idea of what we will be talking about...

Part 1: Database Partitioning

Partitioning is an absolute lifesaver. It solves a ton of query performance and maintenance headaches before you ever need to ask finance for a bigger AWS budget. Also when you partition a table, you create multiple physical storage segments behind a single logical table name. Your backend code doesn't change at all. You still fire off a SELECT * FROM payment_transactions, but Postgres or MySQL is smart enough to only look at the specific physical segment that holds your data. This magic trick is called partition pruning.

But before we look at the strategies, we need to clear up another common confusion i.e; what is this Horizontal vs Vertical Partitioning.

What we usually mean when we say "partitioning" is Horizontal Partitioning. This means splitting the rows across partitions. Every partition has the exact same columns, just different rows of data (e.g., January's rows vs February's rows).

Vertical Partitioning is different. It means splitting the columns. Let's say you have a user_profiles table, and it has a massive biography_text column and a profile_img_blob column that rarely get accessed, but they take up tons of memory and ruin your cache hit ratio. Vertical partitioning means moving those heavy, rarely-accessed columns into a totally separate table (user_profile_blobs) and using a foreign key to join them when strictly necessary.

This vertical keeps your "hot" table narrow and cache-friendly. It’s more of a schema design choice than a database clustering strategy. And for the rest of this, we are talking about Horizontal Partitioning.

Partitioning Strategies

1. Range Partitioning

This is the MVP of partitioning. You split data by contiguous ranges, almost always time or dates. It is absolutely perfect for time-series data, audit logs, or financial ledgers.

Let's look at how Range Partitioning looks in PostgreSQL. Imagine a massive payment_transactions table. Querying it is getting painfully slow. So, we partition it by processed_date:

-- Create the parent logical table
CREATE TABLE payment_transactions (
    txn_id BIGINT,
    processed_date TIMESTAMP,
    account_id INT,
    amount DECIMAL(12,2)
) PARTITION BY RANGE (processed_date);

-- Create physical partitions for each month
CREATE TABLE txns_2026_01 PARTITION OF payment_transactions
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

CREATE TABLE txns_2026_02 PARTITION OF payment_transactions
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

Why is this awesome? First, query speed. If a user loads their dashboard for February 2026, Postgres completely ignores all other months. It only scans txns_2026_02.

Second, data retention! When data gets old and you want to delete data from 5 years ago, you don't run massive DELETE FROM ... statements. Deletes lock up rows, bloat your WAL (Write-Ahead Log), and ruin performance. With range partitioning, you just do DROP TABLE txns_2021_01;. It drops the file from the disk instantly. Boom.

2. List Partitioning

You use this to split data by discrete, categorical values. This is great for multi-tenant architectures or geographic isolation within a single database.

CREATE TABLE audit_logs (
    log_id BIGINT,
    compliance_zone TEXT,
    payload JSONB
) PARTITION BY LIST (compliance_zone);

-- Route data based on exact string matches
CREATE TABLE logs_na PARTITION OF audit_logs 
    FOR VALUES IN ('na-east', 'na-central', 'na-west');

CREATE TABLE logs_emea PARTITION OF audit_logs 
    FOR VALUES IN ('emea-north', 'emea-south');

3. Hash Partitioning

Sometimes you have a massive table, but there is no natural date range or logical list to split on. You just want the data spread out evenly across smaller physical tables to make things like VACUUM operations faster. You hash the partition key and use modulo math.

CREATE TABLE shopping_carts (
    cart_id BIGINT,
    buyer_id INT,
    cart_data JSONB
) PARTITION BY HASH (buyer_id);

-- Split into 5 even partitions
CREATE TABLE carts_p0 PARTITION OF shopping_carts FOR VALUES WITH (MODULUS 5, REMAINDER 0);
CREATE TABLE carts_p1 PARTITION OF shopping_carts FOR VALUES WITH (MODULUS 5, REMAINDER 1);
-- ... and so on up to REMAINDER 4

Note: Hash partitioning distributes data beautifully, but you lose the ability to do partition pruning on range queries.

The PostgreSQL vs MySQL Quirks

We cant even talk about partitioning without acknowledging how different engines handle it.

  • PostgreSQL has had awesome "declarative partitioning" since version 10. You just declare the rules, and it routes everything. However, you have to remember that indexes must be created on each partition individually (though modern Postgres versions let you CREATE INDEX ON the parent and it cascades). Also, VACUUM and ANALYZE run on a per-partition basis, which is great for maintenance.

  • MySQL supports native partitioning too, but it has a very annoying limitation: The partition key MUST be part of every unique index on the table, including your Primary Key. If you partition by created_at, your primary key has to be (id, created_at). No foreign keys are allowed on partitioned tables either. Also, MySQL partition pruning doesn't really work well with Hash partitions.

The Effective Rule of Partitioning

Your partition key must be in your WHERE clause for most of your queries.

If 90% of your queries filter by date, partition by date! But if you partition by date, and then someone writes an API that queries by account_id without specifying a date... the database has no idea which partition holds that account.

So what does it do? It scans every single partition. This is called a full cross-partition scan, and it will absolutely murder your database CPU. A bad partition key is literally worse than having no partitions at all.


Part 2: Database Sharding

Okay, let's fast forward. Your company has grown. You scaled your database vertically, you are running the biggest, most expensive instance money can buy. You added read replicas to offload the read-heavy traffic. You partitioned your tables flawlessly. But then... your product goes wildly viral.

You are suddenly getting tens of thousands of writes per second. CPU is pegged. Memory is exhausted. You log into your cloud console and realize there literally isn't a bigger server to buy. You've hit the physical limits of hardware. This is the moment. You can no longer scale up. You have to scale out. Here comes Sharding.

At its core, sharding is the process of scaling a database by spreading out the data across multiple servers, or shards. It is the go-to scaling solution for organizations managing petabyte-scale data.

But how does it actually work? Lets say in a simple monolithic setup, your application server talks directly to one database server. But when you shard, your data is split across 10, 50, or hundreds of database servers.

If your application code had to keep track of all those shards, know exactly which rows lived where, and maintain open connections to every single server... well, storing this logic in the application code can quickly become messy and difficult to maintain.

To fix this, we usually introduce a Proxy Layer, so when your app needs data, it sends the query to a proxy server. The proxy is then responsible for figuring out the routing and sending the query to the correct shard.

However, if you are thinking that bringing a proxy is an OG move... wait!!! Because proxies have limits too. If you only have one proxy server and a massive flood of traffic hits, the proxy server hits its capacity for simultaneous queries and has to start queueing inserts. That added latency is unacceptable in production, so you usually have to scale out your proxy servers alongside your shards.

But How Do We Split the Data Strategically to these Shards?

Sharding Strategies

The most critical decision you will make is your sharding strategy, the set of rules used to determine which rows of data go to which shards. You need to select a shard key (the column used to determine where a row lives). Here are the main ways to do it:

1. Range Sharding

The proxy decides where a row goes based on pre-defined ranges. For example, user_id 1 to 25 goes to Shard 1, 26 to 50 goes to Shard 2. The Trap: Using naive range-based sharding with IDs is generally a bad idea if your IDs are monotonically increasing. Why? Because all your new (and most active) users end up on the exact same shard. You get one "hot shard" melting down under traffic, while your other "cool shards" sit there doing absolutely nothing.

2. Hash Sharding

This is usually the safest bet. You choose a column to be the shard key, and the system generates a cryptographic hash of this value for each row. Because of how hashing works, similar inputs produce very different outputs (e.g., "josh" might hash to 45, while "joshua" hashes to 28). This beautifully spreads your data out evenly across all your servers.

3. Geographic Sharding,

Here the routing data is based on physical region. For instance the EU users go to a database hosted in Frankfurt, US users go to a database hosted in Virginia and similar others. This is often driven by legal compliance (like GDPR data residency) rather than just scale.

4. Lookup & Custom Sharding

We usually create lookup tables that map incoming data to a specific shard, or even write custom sharding functions that dictate exactly where data should live based on your own business logic.

Picking a Shard Key

It is is the most terrifying decision in architecture. If you pick poorly, you create unfixable hotspots, and fixing it requires a complete data migration.

A good shard key has three traits:

  1. High Cardinality: Lots of unique values. (Sharding by is_active_user boolean gives you exactly 2 shards. That is terrible).

  2. Even Distribution: Values appear evenly. If you shard by country, but 85% of your traffic is from India, your India shard will crash while the others do nothing.

  3. Query Alignment: The key must be present in almost all your queries.

For a B2B SaaS application, tenant_id or workspace_id is usually the holy grail. All queries are usually scoped to a single tenant anyway.

Database-Specific Sharding Options

You don't usually build this from scratch anymore.

  • PostgreSQL: Postgres has no native sharding out of the box. Most people use an extension like Citus (which adds distributed tables and handles the routing under the hood).

  • MySQL: Vitess is the absolute standard here. It was built by YouTube to handle their insane MySQL scaling needs. It sits as a proxy layer in front of your databases.

  • MongoDB: Mongo actually has built-in sharding natively. You define a shard key, and it handles chunk balancing automatically. If you are already in the NoSQL world, Mongo makes this relatively painless.

  • NewSQL (CockroachDB / TiDB): These are distributed SQL databases built from the ground up to be sharded. Your app just talks to it like it's a single Postgres database, but behind the scenes, it's a massive sharded cluster.


Part 3: The Operational Reality of Sharding

Okay, so by now, the concept of sharding seems simple and too good to be true, right?

But the operational reality of running it in production is an absolutely something that led me to leave software engineering, like literally in my initial days. IYKYK

If this is your first time with sharding, pay attention. However, if you are someone who already tasted this, you probably gonna have PTSD from this list.

1. The Cross-Shard Queries

If your query includes the shard key (WHERE workspace_id = 42), the routing layer sends the query to exactly one database. Fast and cheap. But what if your product manager asks for a global admin dashboard? SELECT COUNT(*) FROM users WHERE status = 'pending'.

That query has no shard key. Your proxy layer has to send that query to all 20 of your shards simultaneously. It has to wait for 20 databases to reply, aggregate the math in memory, and return it.

These are called scatter-gather queries. They eat up database connections like candy and will bring your system to its knees under load. Cross-shard joins? Forget about it. They are basically a distributed computing science project.

2. Rebalancing Live Data

What happens when Shard 3 gets full? You need to add a new shard (Shard 21) and move some data from Shard 3 to Shard 21. Oh, and you have to do this while the application is live, taking writes, with zero downtime. Moving gigabytes of live data across the network while keeping transaction consistency is incredibly stressful.

3. Distributed Schema Migrations

Remember when running an ALTER TABLE users ADD COLUMN phone VARCHAR took 5 seconds? Welcome to sharding. You now have to run that ALTER TABLE across 20 different databases. What if it succeeds on shards 1 through 18, but times out and fails on shard 19 and 20? Your application now has a split-brain schema. You need hardcore CI/CD tooling to manage distributed DDL operations safely.

4. The Multiplier Effect on Monitoring

You no longer have "a database" to monitor. You have a fleet. A random latency spike on Shard 8 will only affect 5% of your users, making it a ghost to track down. You now need to track replication lag, CPU, lock waits, and connection pools per shard.

This creates an explosion of high-cardinality metrics. If you use a tool like Prometheus or Datadog, adding shard_id as a label multiplies every single database metric by your shard count. Your observability bill is going to go up, and your dashboards are going to look like a cockpit.

The Conclusion:

So, what is the grand takeaway from all this?

Start with partitioning. Do not shard until you understand your requirement and the use cases. I cannot stress this enough. A significant number of the engineering teams out there that think they need to shard, but they do not need to shard at the first place. Instead take a moment and first think of every possibilities you can do.

Before you even utter the word "sharding" in a meeting, ask yourself:

  • Did we vertically scale to the largest possible hardware?

  • Did we add read-replicas and route all our analytical/read traffic to them?

  • Did we actually run EXPLAIN ANALYZE and fix our missing indexes?

  • Did we fix that terrible N+1 query loop in our ORM?

  • Did we put PgBouncer or ProxySQL in front of the DB to fix connection pool exhaustion?

  • Did we partition our largest tables?

Partitioning elegantly solves performance and maintenance bloat within a single instance. The database engine does the heavy lifting for you.

Sharding is a last resort. It is a necessary when your write throughput physically exceeds what silicon and metal can handle in a single box. It unlocks practically infinite scale, but it demands an incredibly high operational tax in return.

And you know what, tbh software engineering is amazing space, and we should put our self in the situation to learn more. So in a way it's a love and hate relationship, but we need for it for our survival.