When CosmosDB Performance Hit a Wall: My Migration to SQL Server

Last year, I found myself looking at some concerning performance metrics. Our CosmosDB-powered application was showing inconsistent query times and occasional data integrity issues that were affecting user experience. What started as a good fit for our initial requirements had become a reliability concern as our application's data access patterns evolved.

This is the story of how we migrated from CosmosDB to SQL Server when performance inconsistencies and data integrity requirements forced our hand.

The Breaking Point

Our application started as an evolving content management system where we weren't sure what the final data structure would look like. We were iterating quickly on features and the schema was changing frequently. CosmosDB seemed perfect—schemaless flexibility let us add new fields and restructure documents without migrations. For the first year, while we were still figuring out our data model, it was fantastic.

Then reality hit.

The Performance Wall

As our application's query patterns became more complex, queries that used to take 50ms were taking 1-2 seconds inconsistently. Cross-partition queries were unpredictable in performance, sometimes fast, sometimes timing out entirely.

The main issue was implementing audit logs with complex filtering and aggregation. Queries like "show me all changes made by users in department X to documents of type Y in the last 30 days" were unpredictable in CosmosDB, sometimes completing quickly, other times timing out or consuming excessive resources.

Data Integrity Challenges

The critical issue was referential integrity. Our documents frequently referenced other documents, but CosmosDB provided no way to ensure the referenced documents actually existed. We ended up with orphaned references - documents pointing to other documents that had been deleted or never existed.

Unlike relational databases with foreign key constraints, CosmosDB couldn't prevent these broken references. We had to build custom validation logic and cleanup processes, but the problem kept recurring. For our business application, these broken relationships were unacceptable.

The Cost Uncertainty

While performance and data integrity were our primary concerns, unpredictable costs added to our frustration. The same queries would consume different amounts of RUs depending on data distribution and system load, making it difficult to budget and plan features.

The Migration Decision

After several team discussions and architecture reviews, we decided to migrate to SQL Server. By this point, our data schema had stabilized - we knew what our entities looked like and how they related to each other. The decision was driven primarily by our need for referential integrity and predictable performance, with cost predictability as a welcome bonus. CosmosDB had served its purpose during the rapid iteration phase, but now we needed the structure and consistency of a relational database.

What We Actually Had

Our CosmosDB contained a mix of structured and semi-structured data with complex relationships between documents. User documents referenced department documents, which referenced location documents, creating a web of relationships that CosmosDB couldn't enforce.

The problem wasn't the flexibility—it was the lack of relationship integrity and the unpredictable performance when querying across these document relationships.

The Queries That Broke Us

The main culprit was our audit reporting system. We needed queries that joined related documents - users to departments, documents to categories, changes to original documents. These relationship-based queries were inefficient in CosmosDB's document model.

Without proper joins, we had to make multiple round trips to fetch related data, leading to the N+1 query problem and unpredictable performance. A simple monthly report involving related data became a complex orchestration of multiple queries.

The Numbers That Convinced Management

The performance analysis showed:

Average query time: 1.8 seconds (target: <200ms)
Query consistency: Highly variable performance for the same queries
Data integrity issues: Growing number of orphaned references
Development time: Significant effort building workarounds for missing referential integrity

Designing for Performance, Not Purity

The beauty of hindsight is that I now realize we should have started with SQL Server. But since we were here, I had to make some pragmatic decisions about the new schema.

The Normalization Dilemma

Every database textbook will tell you to normalize your schema. In practice, with the query patterns we had, pure normalization would have killed performance. I made a conscious decision to denormalize strategically:

We designed a SQL schema that prioritized our most problematic queries. Instead of perfect normalization, we denormalized department and document type information directly into the main Documents table. Yes, this meant some data duplication, but it made our audit queries lightning fast.

We also created targeted indexes for our most common query patterns - filtering by date, department, and document type. The query that was taking minutes in CosmosDB now completed in milliseconds.

The Reality of Migration Trade-offs

I could have spent weeks creating the "perfect" normalized schema. Instead, I optimized for:

Query performance - The reports that were killing us needed to be fast
Migration speed - Every day on CosmosDB was costing us money
Application changes - Minimize the code changes needed

Perfect is the enemy of done, and we needed done.

The Actual Migration Process

Here's the part that kept me awake for a week: how do you migrate your entire database while maintaining referential integrity and without taking your application offline?

The Extraction Reality Check

My first attempt at extraction was naive. I tried to pull everything at once:

My first attempt at extraction was naive - trying to pull everything at once would have been expensive and risky. Instead, I implemented a careful extraction process that respected RU limits and handled the complex document relationships properly.

The Data Transformation Challenge

Transforming nested JSON documents into relational tables required careful mapping logic. We built transformation classes that could break down complex CosmosDB documents into their constituent parts - extracting user information, preferences, and address arrays into separate SQL Server tables.

The trickiest part was handling the nested structures and arrays. What was a single document in CosmosDB became multiple related records in SQL Server. We had to maintain referential integrity while preserving all the original data.

Loading with Resilience

The loading phase used transactions to ensure data consistency. If any part of a user's data failed to insert, we'd roll back the entire operation for that user and retry later. This approach meant some users took multiple attempts to migrate, but we never ended up with partial data.

We also implemented extensive error logging to track which specific documents were causing issues. This proved invaluable for debugging edge cases in our data that we hadn't anticipated.

Application Layer Migration

Handling the Transition Period

The application layer needed to work with both databases during migration. We implemented a repository pattern with a unified interface that could route queries to either CosmosDB or SQL Server based on feature flags.

The SQL Server repository required more complex mapping logic than CosmosDB. Where CosmosDB returned complete documents with nested properties, SQL Server required joins across multiple tables and careful object assembly. The query for a user with preferences and addresses involved three table joins and custom mapping logic to reconstruct the object graph.

Performance was noticeably different between the two approaches. CosmosDB queries were simpler but expensive in RUs, while SQL Server queries required more complex SQL but executed quickly with proper indexes.

Feature Flag Strategy

We used feature flags to gradually migrate users from CosmosDB to SQL Server. This allowed us to test the new system with a small percentage of users before rolling it out completely.

The service layer abstracted the choice between databases, checking feature flags to determine which repository to use. This approach let us roll back quickly if we discovered issues, and gave us confidence in the new system before committing fully.

Performance Optimization

SQL Server performance was dramatically better than CosmosDB for our use cases, but it required thoughtful index design.

Indexing Strategy

We created indexes specifically for our most expensive CosmosDB queries. Email lookups, date range filters, and city-based searches all got dedicated indexes. The composite indexes were particularly effective - including commonly selected columns in the index meant SQL Server could satisfy entire queries without touching the base tables.

Query Transformation

Moving from document-based queries to relational queries required rethinking our approach. Where CosmosDB scanned entire collections, SQL Server could use indexes and joins efficiently. A query that filtered users by city went from an expensive cross-partition scan to a fast index seek with a join.

Connection Management

We configured connection pooling and retry policies to handle SQL Server's different connection model compared to CosmosDB's HTTP-based approach. Adding Redis caching for frequently accessed data reduced database load and improved response times further.

Testing and Validation

Testing a database migration requires validating both data integrity and performance.

Data Integrity Validation

We built automated tests that compared records between CosmosDB and SQL Server, checking that all fields matched after transformation. This was crucial for catching edge cases in our mapping logic - nested objects, null values, and array handling all needed specific attention.

The tests ran continuously during migration, alerting us immediately if any data inconsistencies appeared. This gave us confidence that our transformation logic was working correctly across all data variations.

Performance Comparison

Performance testing confirmed our expectations. Queries that took over a second in CosmosDB typically completed in under 100ms in SQL Server. We documented these improvements to justify the migration effort and establish baselines for future performance monitoring.

Production Cutover Strategy

The final cutover required careful coordination and health checking.

Gradual Rollout with Safety Checks

We implemented an automated cutover system that could enable SQL Server for all users while continuously monitoring system health. If any issues were detected, the system would automatically roll back to CosmosDB.

This approach gave us confidence to make the switch during business hours, knowing we could revert quickly if needed. The health checks validated both database connectivity and query performance before fully committing to the new system.

Monitoring and Alerting

We set up comprehensive monitoring to track database performance and catch issues early. This included custom metrics for query duration, error rates, and system health indicators to ensure our migration was successful and sustainable.

Common Pitfalls and Solutions

1. N+1 Query Problem

Problem: Loading related data inefficiently by making multiple database calls in loops.

Solution: Use proper joins or batch loading to retrieve related data in a single query, dramatically reducing database round trips and improving performance.

2. Transaction Scope Issues

Problem: Long-running transactions causing deadlocks and timeouts, especially when including external API calls within transaction scope.

Solution: Keep transactions short and focused by processing data in smaller batches. This approach reduces lock contention and makes the system more resilient to individual operation failures.

Would I Do It Again?

Absolutely. But I would have made the decision 18 months earlier.

The migration taught me that technology choices should evolve with your application. CosmosDB was perfect when we needed schema flexibility during rapid development. But once our data model stabilized and we understood our relationship patterns, SQL Server became the better choice for data integrity, performance predictability, and team productivity. There's no shame in choosing the right tool for each phase of your application's lifecycle.

If you're facing similar performance consistency or data integrity issues with CosmosDB, don't suffer in silence. Sometimes the best architecture decision is recognizing when your data patterns don't match your database choice and making the change.

After a tense weekend migration, the results were better than I'd hoped:

Query times: From 1.8 seconds to 80ms average, with consistent performance
Data integrity: Zero orphaned references with proper foreign key constraints
Failed queries: From 12% to less than 0.1%
Monthly costs: Reduced from unpredictable to a fixed, lower amount
Developer happiness: Immeasurable improvement

The biggest lesson? It's okay to change databases as your application matures. CosmosDB was the right choice for rapid schema iteration. SQL Server became the right choice once we needed referential integrity and predictable performance. Choose the database that matches your current phase, not your future dreams or past comfort zones.

Have you been through a similar migration nightmare? I'd love to hear your war stories. The pain is real, but so is the relief when you get to the other side. Let's talk if you're facing similar challenges.