Stop Blaming Storage: The Invisible Cost of Excessive Log Switches In Oracle Databases
Real-World Telemetry Analysis: Test 1 vs. Test 2

To understand how heavy write volumes affect database latency, consider two test profiles running the same heavy transactional workload. Both show how much log writer activity a busy enterprise application can generate:

Database Profile (Test 1): a sustained redo generation rate of 35,550,156.8 bytes per second (~33.90 MB/sec).

Database Profile (Test 2): an even higher sustained redo generation rate of 40,691,343.8 bytes per second (~38.81 MB/sec).

A sustained redo rate of 34 MB/s to 39 MB/s is a highly active, heavy write workload. If the online redo logs are left at default or undersized values, this transactional density forces a collision between foreground transaction processing and the checkpoint writes that every log switch triggers.

Reverse-Engineering Your Log Sizes from Switch Activity

Because the size of each online redo log is a property of the log files themselves rather than an initialization parameter, it does not appear in the Modified Parameters section of standard diagnostic summaries such as AWR reports. Instead, combine the sustained redo rate with the observed switch frequency to work out the current log size:

$$S_{\text{log}} = \frac{R_{\text{sec}} \times 3600}{N_{\text{switch}}}$$

where $S_{\text{log}}$ is the calculated current log size, $R_{\text{sec}}$ is the redo generation rate per second, and $N_{\text{switch}}$ is the number of log switches per hour.

Modeled Redo Layout Dimensions Based on Active Workloads

| Log switches observed / hour | Test 1 profile (33.90 MB/sec) | Test 2 profile (38.81 MB/sec) | Engine state and latency impact |
|---|---|---|---|
| 30 (every 2 minutes) | ~4,068 MB (4 GB) | ~4,657 MB (4.5 GB) | Continuous, aggressive checkpointing. Disk queues stay saturated writing dirty blocks to datafiles. |
| 60 (every minute) | ~2,034 MB (2 GB) | ~2,328 MB (2.3 GB) | Severe throttling. High risk of transaction processing freezes while the engine waits for a reusable log group. |
| 120 (every 30 seconds) | ~1,017 MB (1 GB) | ~1,164 MB (1.1 GB) | Critical failure point. Frequent log file switch completion waits. |

The Mechanics of a Log Switch Bottleneck

Why does a high log switch count hurt performance so badly? Consider what the Oracle database engine has to do every time a log group fills up:

Forced checkpointing: When a log switch occurs, the database must advance its checkpoint. This forces the Database Writer processes (DBWn) to flush dirty blocks from the buffer cache to the datafiles, so that the filled log group is no longer needed for crash recovery and can eventually be reused.

Control file serialization: The database must update its control files to record the new log sequence number. This takes internal metadata locks (enqueues) that can cause user sessions to stall.

Archiver contention: The archiver processes (ARCn) must immediately start reading the newly filled redo log and copying it to the archive destination. If the logs are small and switch every few seconds, the archivers cannot keep pace, and the log writer (LGWR) is locked out of the next group in the rotation until that group has been archived.

Together, these three operations show up as elevated log file sync times, log file switch waits, and higher foreground wait latency. To an outside observer, it looks as though the storage array cannot write fast enough; in reality, the database engine is choking on its own redo log layout.
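The reverse-engineering model above, and the same arithmetic run in the other direction for the 20-minute target used in the next section, are simple enough to check in a few lines of code. The Python sketch below is illustrative, not a tool from the original analysis, and the function names are hypothetical. It assumes that "MB" means 2^20 bytes (which matches the conversion of 35,550,156.8 bytes/sec to ~33.90 MB/sec) and that the redo rate stays constant across the hour.

```python
# Sketch of the redo log sizing arithmetic described in this article.
# Assumptions (not from the original analysis): "MB" means 2**20 bytes,
# and the redo rate is treated as constant for the whole hour.

BYTES_PER_MB = 1024 * 1024
SECONDS_PER_HOUR = 3600


def redo_rate_mb_per_sec(bytes_per_sec: float) -> float:
    """Convert a redo rate reported in bytes/sec to MB/sec."""
    return bytes_per_sec / BYTES_PER_MB


def current_log_size_mb(rate_mb_per_sec: float, switches_per_hour: float) -> float:
    """Reverse-engineer the current log size: S_log = (R_sec * 3600) / N_switch."""
    return rate_mb_per_sec * SECONDS_PER_HOUR / switches_per_hour


def target_log_size_mb(rate_mb_per_sec: float, window_minutes: float) -> float:
    """Size one log group so that it takes about window_minutes to fill."""
    return rate_mb_per_sec * 60 * window_minutes


if __name__ == "__main__":
    profiles = {"Test 1": 35_550_156.8, "Test 2": 40_691_343.8}
    for name, bytes_per_sec in profiles.items():
        # Round to two decimals, as the article does (33.90 and 38.81 MB/sec).
        rate = round(redo_rate_mb_per_sec(bytes_per_sec), 2)
        print(f"{name}: {rate:.2f} MB/sec")
        for switches in (30, 60, 120):
            size = current_log_size_mb(rate, switches)
            print(f"  {switches:>3} switches/hour -> ~{size:,.1f} MB per log")
        print(f"  20-minute target -> ~{target_log_size_mb(rate, 20):,.1f} MB per group")
```

Running it should reproduce the table figures (for example, about 4,068 MB per log at 30 switches per hour for Test 1) as well as the 40,680 MB and 46,572 MB targets. Real redo generation is bursty, so treat the output as a starting point and compare it with peak-hour switch counts rather than an hourly average.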
Sizing for the 20-Minute Target Window

To remove this bottleneck, size the logs so that, under the observed workloads, each group takes about 20 minutes to fill, which gives a conservative and stable switch interval:

Test 1 sizing: 33.90 MB/sec × 60 seconds = 2,034 MB/minute. For a 20-minute window: 2,034 MB/minute × 20 minutes = 40,680 MB (~40 GB per log group).

Test 2 sizing: 38.81 MB/sec × 60 seconds = 2,328.6 MB/minute. For a 20-minute window: 2,328.6 MB/minute × 20 minutes = 46,572 MB (~46 GB per log group).

Sizing standard: To leave a safety margin for unexpected transaction spikes, configure 40 GB to 48 GB per log group across a minimum of 4 to 5 log groups. This should smooth out the checkpoint waves and restore a steady, predictable processing flow.

DBA Command and Verification Track

To audit a live database, run the following query to check the current log configuration and status:

SELECT GROUP#, THREAD#, SEQUENCE#, BYTES/1024/1024/1024 AS SIZE_GB, STATUS FROM V$LOG;

If the output shows sizes at outdated, legacy defaults (such as 1 GB or 2 GB) on a database running a modern, high-velocity workload, you have likely found your hidden bottleneck. Correcting the redo log sizing relieves the artificial pressure on your data layer.

Quantifiable Database Performance Savings

The most significant benefit of right-sizing the redo logs is the database processing capacity it reclaims.

Reclamation of core processing time: Production environments may see a 15% to 20% reduction in overall database processing time, particularly on nodes running under synchronous replication.

Elimination of forced wait states: In the diagnostic telemetry, the database spent up to 20.65% of its total DB time waiting on log file sync events.
While a portion of this is network transit overhead, a significant contributor is the engine repeatedly stalling to handle back-to-back log switches occurring up to twice a minute.

CPU cycle optimization: Moving to a stable footprint of 2 to 3 log switches per hour removes these self-inflicted stalls, lowers the share of time spent in wait states, and returns CPU cycles to user transactions and application processing.

Targeted Systems and Subsystem Benefits

Correcting the redo log sizing triggers a positive cascade of efficiency across several layers of the database infrastructure:

A. Storage I/O Optimization (Flattening the Checkpoint Waves)

Every time a redo log file fills and triggers a switch, Oracle initiates a checkpoint. The Database Writer processes (DBWn) must then prioritize flushing "dirty" blocks from the buffer cache to the datafiles, on top of their normal background writes.

The Strategic Benefit: Instead of a cyclic pattern in which disk I/O spikes and drops every 30 to 60 seconds, the storage sees a flatter, smoother, and more predictable write curve. Physical disk queue depths drop significantly, removing artificial array-level performance chokes.

B. Elimination of Control File Enqueue Serialization

To finalize a log switch, the database engine must acquire the control file enqueue, an exclusive metadata lock, to record the new log sequence information in the control files. When a misconfigured environment forces this action up to 120 times an hour, user sessions become trapped in an internal serialization traffic jam.

The Strategic Benefit: With larger logs, the control file is updated for log switches only a few times per hour.
This largely eliminates switch-related enqueue contention and keeps micro-stalls from propagating to foreground user sessions.

C. Mitigation of Archiver Process (ARCn) Contention

Under high-velocity write workloads (~34 MB/s to 39 MB/s), undersized logs can fill faster than the archiver processes (ARCn) can read them and copy them to the archive log destinations. In ARCHIVELOG mode, if the archivers fall behind, the Log Writer (LGWR) stalls all database processing, because it cannot overwrite a log group that has not yet been archived.

The Strategic Benefit: Deploying 40 GB to 48 GB log groups builds a wide, stable, 20-minute processing window. This gives the ARCn processes ample time to copy each log in the background, with far less risk of blocking active application transactions.

D. Stabilization of Application Response Uniformity

From an end-user and application integration perspective, transaction latency becomes uniform and predictable.

The Strategic Benefit: Today, a user session may see an instantaneous transaction response, followed a moment later by a multi-second delay simply because its COMMIT executed at the same moment as a log switch checkpoint. Eliminating constant switches helps deliver uniform, predictable, sub-second commit processing across the entire user base.

Conclusion and Core Directive

Undersized redo logs force high-performance solid-state storage arrays to absorb a large amount of unnecessary work by demanding that log files be opened, written, closed, checkpointed, and archived as often as 120 times per hour. Increasing the log file size to align with a 20-minute target window does not merely change a capacity metric; it fundamentally improves the execution efficiency of the Oracle database engine.
It systematically clears the log file sync bottleneck, cools down spiking CPU usage, and allows your enterprise data infrastructure to operate at its true peak potential.

The Lost Art of Sizing
Introduction: Why This Series Exists

Technology has gone through one of the most extraordinary economic transformations in modern history. For over four decades, the industry benefited from continuously cheaper computing resources, exponentially faster processors, collapsing storage costs, and an almost limitless ability to scale systems through virtualization and cloud computing.

During that time, many of the operational disciplines that once defined great engineering slowly faded into the background. Precise sizing, deep performance analysis, workload modeling, and resource optimization became less visible as organizations increasingly relied on abundant infrastructure to compensate for inefficiencies.

But the economics are changing. Today we are entering an era defined by:

- exploding GPU costs
- massive AI infrastructure investments
- rising power consumption
- thermal and density limitations
- increasingly expensive semiconductor fabrication
- cloud bills that are exposing years of architectural inefficiency

As these pressures grow, the industry is rediscovering something earlier generations of technologists already understood:

Efficiency matters.

And ultimately:

Sizing matters.

This blog series is intended to explore both the history and the future of performance engineering, capacity planning, and system sizing. This first entry focuses on how the industry arrived where it is today:

- the Scarcity Era of computing
- the transition into abundance
- the rise of cloud abstraction
- the re-emergence of constraints in the modern AI era

Future blogs will move from theory and history into practical engineering.
They will examine modern system architectures and explore the many bottlenecks that organizations often overlook, including:

- CPU saturation
- memory pressure
- NUMA effects
- storage latency
- queue depth issues
- network bottlenecks
- virtualization overhead
- cloud inefficiencies
- database scaling challenges
- workload contention patterns

The series will also discuss methods for properly monitoring, modeling, tuning, and sizing these environments. Because the scope of the subject is so large, future entries will likely be broken into multiple specialized blogs by technology area. Some topics may themselves require multi-part deep dives.

About the Author

I started my career in technology in 1978, working on a Basic Four computer system during the early years of enterprise computing. Over the decades, I have worked across operations, engineering, architecture, product management, database performance tuning, and large-scale infrastructure analysis. I have architected sizing and performance analysis tools for technology vendors, worked internationally on database and infrastructure performance engagements, and spent much of my career focused on understanding how systems behave under real-world workloads. My background includes extensive work with Oracle technologies, enterprise performance tuning, workload analysis, and capacity planning across multiple industries and platforms. Today, I am employed at Everpure as a Field Solution Architect specializing in Oracle technologies and performance engineering.

Having worked through the mainframe era, the distributed systems revolution, virtualization, cloud computing, and now the rise of AI infrastructure, I believe the industry is once again approaching a point where operational discipline, efficiency, and proper sizing will become critical engineering skills. This series is both a technical discussion and a historical perspective from someone who has watched these cycles evolve over nearly five decades.
The Lost Art of Sizing

Part I: The Scarcity Era

In the late 1970s, I started my career in technology. My first roles were in operations, running jobs on mainframes overnight and performing backups. Over time, I moved throughout the IT organization before eventually transitioning into engineering and product management in the late 1980s.

I often refer to the 1970s and early 1980s as The Scarcity Era of computing. During that time, computing resources were extraordinarily expensive:

- Storage could cost the equivalent of hundreds of thousands of dollars per gigabyte
- Memory was frequently measured in tens or hundreds of thousands of dollars per megabyte
- CPU performance was discussed in terms of MIPS (Millions of Instructions Per Second), with systems delivering only a handful of MIPS costing millions of dollars

Every component in the system represented a major financial investment. Because resources were scarce and expensive, sizing was treated almost as a science. Capacity planning was not optional; it was foundational to the survival of the business. Over-sizing a system could waste enormous capital. Under-sizing it could bring critical business operations to a halt.

Every byte mattered. Every CPU cycle mattered. Every disk spindle mattered.

This environment created a culture of discipline:

- Applications were optimized aggressively
- Developers understood resource constraints
- Operations teams monitored utilization closely
- Architects carefully modeled workloads
- Performance engineering was considered a core technical skill

In many organizations, some of the best engineers were the people who could make systems smaller, faster, and more efficient. Software engineering was deeply connected to hardware realities. You could not simply “add more servers.” There often were no additional servers to add. This scarcity shaped an entire generation of technologists.

Part II: The Abundance Era

Then something extraordinary happened.
Beginning in the late 1980s and accelerating through the 1990s and 2000s, the economics of computing changed completely. Moore’s Law, semiconductor scaling, manufacturing efficiencies, and global supply chains created an era of unprecedented abundance. For nearly forty years:

- CPUs became exponentially faster
- Memory became dramatically cheaper
- Storage costs collapsed
- Networks became faster
- Virtualization increased utilization
- Cloud computing made infrastructure appear almost limitless

For the first time in computing history, performance improvements arrived faster than software inefficiencies could consume them. This fundamentally changed engineering culture. Disciplines that had once been mandatory slowly became optional. Applications no longer had to be highly optimized because hardware improvements continuously masked inefficiencies. Instead of tuning software, organizations increasingly solved problems by purchasing more infrastructure.

A new mindset emerged:

Hardware is cheaper than engineering time.

And for many years, that was largely true. The rise of virtualization and cloud computing accelerated this transition even further. Infrastructure became abstracted from the engineers writing the software. Developers no longer saw physical systems, disk arrays, or memory limitations. Resources became API calls and provisioning scripts.

Eventually, many organizations evolved toward a model where applications were simply “thrown over the wall” into the cloud. If performance was poor:

- allocate more CPUs
- add more memory
- scale horizontally
- increase cloud spending

The business unit would absorb the cost. The direct connection between engineering decisions and infrastructure economics became increasingly invisible.
In many environments:

- poor code was tolerated
- inefficient queries were normalized
- oversized containers became standard
- massive memory consumption was accepted
- idle cloud resources accumulated unchecked

Traditional sizing disciplines faded because the financial pain was no longer immediate or visible to the engineering teams creating the workloads.

The cloud did not eliminate capacity planning; it merely changed who paid for bad sizing decisions. In the mainframe era, poor sizing decisions were catastrophic because hardware was scarce. In the cloud era, poor sizing decisions became operational expenditures hidden inside monthly invoices. The result was a generation of systems that often consumed vastly more resources than their actual business function required.

Ironically, many of the operational disciplines developed during the Scarcity Era were not technically obsolete; they had simply become economically unnecessary for a time. But that may now be changing again.

Part III: The Return of Constraints

For nearly four decades, the technology industry operated under a powerful assumption:

Tomorrow’s hardware would solve today’s software problems.

For a long time, that assumption held true.

- If an application consumed too much CPU: processors became faster
- If memory usage grew: RAM became cheaper
- If storage exploded: disk costs continued collapsing
- If workloads increased: cloud platforms scaled almost infinitely

The economics of computing continuously compensated for inefficient engineering. But today, something significant is changing. The industry is beginning to encounter limits again. Not theoretical limits, but real economic, physical, and operational limits. Modern computing infrastructure is no longer getting dramatically cheaper at the rate it once did.
Instead, we are seeing:

- exploding GPU costs
- rising power consumption
- thermal limitations
- expensive high-bandwidth memory
- enormous cloud infrastructure bills
- increasingly expensive semiconductor fabrication
- AI workloads consuming unprecedented resources

For the first time in decades, inefficient software design is becoming economically visible again. And this has exposed a reality that many organizations had quietly ignored for years:

- poor code
- oversized architectures
- inefficient databases
- excessive abstraction layers
- uncontrolled cloud sprawl
- wasteful microservice designs
- badly tuned queries
- overallocated Kubernetes clusters
- massive idle infrastructure footprints

For years, these inefficiencies were masked by cheap hardware and elastic cloud scaling. Now they are appearing directly on financial statements.

The cloud did not eliminate waste. It made waste easier to hide. Until the bills became too large to ignore.

At the same time, another challenge has emerged. Many of the people who developed the operational disciplines of the Scarcity Era are no longer in the industry. They have:

- retired
- moved into leadership
- transitioned into consulting
- or left technology entirely

The generation that deeply understood:

- workload modeling
- performance engineering
- memory optimization
- queue management
- efficient batch processing
- storage layout
- capacity forecasting
- low-level tuning

is steadily disappearing. Much of that knowledge was never fully documented because it was simply considered part of being an experienced engineer. As a result, many younger organizations grew up in an environment where:

- infrastructure felt unlimited
- optimization seemed unnecessary
- cloud scaling replaced careful design
- operational cost was someone else’s problem

Now the industry faces a difficult transition. The old constraints are returning, but many of the disciplines required to manage those constraints have faded.
In many ways, the industry is rediscovering something that earlier generations of technologists already understood:

Resources are never truly infinite.

Eventually:

- power matters
- memory matters
- storage matters
- latency matters
- thermal density matters
- architecture matters

And ultimately: sizing matters.

The art of sizing has returned. Not because technology stopped advancing, but because economics, physics, and scale have once again forced the industry to confront efficiency. What was once viewed as an outdated operational skill may soon become one of the most important engineering disciplines again.

Part IV: History Does Not Repeat, But It Rhymes

What we are seeing today in technology is historically unusual, but it is not entirely unprecedented. Other industries have gone through similar transitions where periods of explosive advancement, falling costs, and seemingly limitless growth eventually collided with economic and physical realities.

The railroad industry is one example. In the early days of rail expansion during the Industrial Age, railroads transformed economies. Expansion happened rapidly. Costs initially fell as infrastructure scaled, routes expanded, and technology improved. For a time, railroads represented nearly unlimited economic optimism.

But eventually the easy growth ended. The cost of expanding and maintaining rail infrastructure began rising dramatically. Marginal improvements became more expensive. Complexity increased. Maintenance became a larger percentage of operating cost. Competition intensified. Returns diminished.

The industry did not disappear. In fact, railroads remained enormously valuable to the economy. But the economics changed.

The same pattern appeared in other industrial and technological revolutions:

- aviation after the jet age
- nuclear power generation
- telecommunications infrastructure
- automobile manufacturing
- even electrical grid expansion

Early stages were driven by rapid gains and falling relative costs.
Later stages became dominated by:

- scale
- complexity
- infrastructure costs
- power requirements
- operational efficiency
- regulation
- diminishing economic returns on incremental improvements

Technology did not stop advancing. It simply became harder, more expensive, and more complex to continue advancing at the same pace. That is increasingly where modern computing appears to be heading.

We are now entering the Age of AI. AI will absolutely create enormous value. In many ways, it already has. But there is growing evidence that the economics of this era are going to be very different from the cloud and consumer internet revolutions that preceded it. AI infrastructure is extraordinarily expensive:

- massive GPU clusters
- enormous power consumption
- advanced cooling systems
- high-bandwidth memory
- increasingly expensive semiconductor fabrication
- global supply chain dependencies

For years, the technology industry operated almost like a perpetual motion machine where computing became continuously cheaper while performance improved exponentially. Today, the relationship between cost and performance is changing.

That does not mean AI is a failure. Far from it. But technological revolutions are not light switches. They are transitions. And transitions are messy. Industries often overspend before they stabilize. Architectures evolve through trial and error. Infrastructure expands ahead of efficient utilization. Economic models mature slowly.

The railroad era experienced this. The electrical age experienced this. The internet boom experienced this. And now AI appears to be entering a similar phase.

The challenge for the next generation of technologists will not simply be building larger systems. It will be learning how to build efficient, economically sustainable systems again. Which may ultimately bring the industry back to a lesson many believed had become obsolete:

The art of sizing never really disappeared.
It was merely waiting for constraints to return.

Planning SQL Server Storage Layout for Snapshot Recovery
Two of the most important things to consider when thinking about snapshots are your snapshot recovery goals and your database instance deployment model. To take full advantage of volume snapshots, your SQL Server environment should be planned with snapshot usage in mind. Your instance deployment model (physical vs. virtual), storage presentation (vVols, VMFS, iSCSI, etc.), and snapshot recovery scope all have a direct impact on storage and database layout. Making these decisions up front helps ensure that snapshot operations align with recovery objectives, avoid unintended side effects, and remain manageable over time.

In this post we'll walk through the different recovery goals folks might have, how those goals are affected by technology choices, and what changes you might need to make to reach your goals. If you are introducing snapshots into an existing environment, things can be a little less flexible, but this post can help you better understand the challenges you might run into.

Snapshot recovery scope

Below is a summary of some possible recovery goals, along with the impact your technology choices can have and the impact on how you plan database storage layout. Instance-level recovery is the easiest to implement but the most coarse-grained; single-database recovery is the most flexible but requires the most careful volume design and operational discipline.

Note: tempdb should NOT be included in volume snapshots. It is recreated automatically on startup and is not meant for recovery.

The following figure summarizes the possible recovery scopes and how database volumes can be organized for each.

Instance-level recovery

With instance-level recovery you are looking to recover an entire SQL Server instance to a point in time (all user databases that live on the snapped volumes).
This could be part of a data protection plan for an instance that hosts a single application, or part of a workflow that snapshots a production system for use in dev/test. It could even be a temporary workflow for migrating an existing server or adding an HA/DR replica.

Special considerations

System databases (master, msdb, model) can either be included with the snapped volumes for true instance-level rollback, or kept separate and protected via regular backups, depending on your recovery strategy. If you plan to use instance-level snapshots to create an HA/DR replica, it is best to leave system databases out of your snapshots.

Potential layouts

Physical / In-guest / vVols: Shared volumes for all user database data and logs. Best practices and performance will likely mean separate log and data volumes, and often multiple data volumes. These all need to be part of the same Everpure volume protection group.

Shared-datastore virtualized (VMFS, CSV, AHV): One datastore per SQL Server instance, with many VMDKs/VHDXs.

Impact on volume layout/recovery

Very simple to implement and operate. All databases on those volumes/datastores share the same snapshot schedule and rollback behavior. A single database cannot be safely recovered without affecting the others on the same volumes.

Application / DB-group recovery

In some scenarios you might have groups of related databases that need to be recovered together, but the whole instance should not be recovered as a unit. Maybe the instance hosts different applications with different SLAs, or you just need the flexibility to recover applications separately. Whatever the reason, in this situation you need to keep groups of databases in sync and recoverable to the same point in time.

Potential layouts

Physical / In-guest / vVols: Application data and log volumes will need to be part of the same application-specific protection group.
For a given instance you will end up with multiple protection groups, one per unit of recovery.

Shared-datastore virtualized (VMFS, CSV, AHV): VMDKs/VHDXs for each application are grouped together on a datastore; other workloads use different datastores. Everpure volumes are created at the datastore level, so applications must also be separated at the datastore level.

Impact on volume layout/recovery

Databases related to a specific application need to be kept together on the same set of volumes; unrelated databases need to be kept separate.

Protection groups should be defined so that all volumes that contain files for that application’s databases are snapped together.

Single-database recovery

When you manage single instances with many databases, you often need to recover at the database level. Common examples are multi-tenant systems where each customer or user has a dedicated database, and highly consolidated environments where large numbers of unrelated databases are housed on the same instance. Both cases come up in production, and they are also very common in dev/test environments.

Special considerations

For single-database recovery it is possible to run into limits along your storage stack, depending on how many databases you have and how they are distributed around your environment. When going down this route, it's important to understand the limits on:

- volume count (and drive letters) supported by Windows
- volume and protection group counts supported by Purity
- volumes per host supported by Purity

Potential layouts

Physical / In-guest / vVols: Per-database volumes for data and log (or at least for each high-value database), grouped into per-database protection groups.

Shared-datastore virtualized (VMFS, CSV, AHV): Shared datastores are not ideal for this recovery goal, because the whole datastore has to be recovered at once. Per-database datastores could work, but would add management complexity.
Grouping databases by aggregate size or throughput characteristics could reduce this complexity, but will still present challenges for recovery.

Impact on volume layout/recovery

All files for a given database must live on volumes that are included in the same protection group. This gives the most recovery flexibility, but increases the number of volumes, mount points, and protection groups (and potentially datastores) to manage. In shared-datastore models, single-database recovery typically requires cloning the datastore volume and extracting only the virtual disks for that database using vendor-specific tooling; this is significantly more complex than with in-guest or vVol layouts.

Overall

Each recovery goal has its own challenges and trade-offs, but they all share a few core requirements.

Keep the recovery unit together: All data and log volumes for a given recovery scope (instance, application, or single database) should reside in the same protection group. This ensures that snapshots capture a consistent point in time and that you can safely roll forward or roll back without orphaning files.

Be intentional about what you exclude: Because tempdb holds transient data, is recreated on startup, and cannot be used in application-consistent snapshots, it is typically placed on its own volume(s) outside of snapshot protection groups. Its high change rate also means that a snapshot including tempdb can quickly consume capacity. System databases (master, msdb, model) are usually protected with traditional SQL Server backups and kept separate from user database protection groups, unless you have very specific reasons to include them.

Plan for growth and change: Whatever layout you choose has to survive new databases, additional volumes, and changing workloads. Making sure new volumes are consistently added to the correct protection group (manually, via automation, or through Everpure Fusion presets) is key to continuing to meet your recovery goals over time.
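The core requirements above (keep each recovery unit in one protection group, avoid sharing volumes between units that must be recovered independently, and keep tempdb out of snapshot protection groups) lend themselves to a simple automated check. The following is a minimal sketch in Python, not an Everpure or SQL Server tool: the `check_layout` function, the dictionaries, and all volume and protection group names are hypothetical. In a real environment you would populate the inventory from SQL Server file metadata (for example `sys.master_files`) and from the array's volume and protection group listing.

```python
# Hypothetical layout check for snapshot recovery units.
# All names (databases, volumes, protection groups) and the dictionary-based
# inventory are illustrative assumptions, not a Purity or SQL Server API.


def check_layout(
    recovery_units: dict[str, list[str]],  # recovery unit -> databases it contains
    db_volumes: dict[str, set[str]],       # database -> volumes holding its data/log files
    volume_pgroup: dict[str, str],         # volume -> protection group (missing = unprotected)
    tempdb_volumes: set[str],              # volumes that hold tempdb files
) -> list[str]:
    """Return a list of layout problems; an empty list means the layout passes."""
    problems: list[str] = []
    volume_owner: dict[str, str] = {}  # which recovery unit first claimed each volume

    for unit, databases in recovery_units.items():
        volumes: set[str] = set()
        for db in databases:
            volumes |= db_volumes[db]
        groups = {volume_pgroup.get(v) for v in volumes}

        # Rule 1: every volume in the unit must be snapshotted...
        if None in groups:
            problems.append(f"{unit}: some volumes are not in any protection group")
        # ...and all of them must be in the same protection group.
        protected = sorted(g for g in groups if g is not None)
        if len(protected) > 1:
            problems.append(f"{unit}: volumes span protection groups {protected}")

        # Rule 2: a volume shared by two units cannot be rolled back for one
        # unit without also rolling back the other.
        for v in sorted(volumes):
            owner = volume_owner.setdefault(v, unit)
            if owner != unit:
                problems.append(f"{v}: shared by recovery units {owner} and {unit}")

    # Rule 3: tempdb stays outside snapshot protection groups.
    for v in sorted(tempdb_volumes):
        if v in volume_pgroup:
            problems.append(f"{v}: tempdb volume is in protection group {volume_pgroup[v]}")

    return problems


if __name__ == "__main__":
    units = {"sales_app": ["SalesDB", "SalesStage"], "hr_app": ["HRDB"]}
    db_vols = {
        "SalesDB": {"sql01-sales-data", "sql01-sales-log"},
        "SalesStage": {"sql01-sales-data", "sql01-sales-log"},
        "HRDB": {"sql01-hr-data", "sql01-sales-log"},  # HR log file placed on the sales log volume
    }
    pgroups = {
        "sql01-sales-data": "pg-sales",
        "sql01-sales-log": "pg-sales",
        "sql01-hr-data": "pg-hr",
    }
    for issue in check_layout(units, db_vols, pgroups, tempdb_volumes={"sql01-tempdb"}):
        print(issue)
```

With the sample inventory, the check flags hr_app twice: its volumes span two protection groups, and its log volume is shared with sales_app. Run as part of provisioning automation, a check like this can catch a new volume that was never added to the right protection group before a recovery depends on it.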
Read more about Fusion presets on the Everpure support portal. With proper planning, volume snapshots can be a powerful new tool in your toolbox. They can simplify day-to-day operations, make complex recovery scenarios more predictable, and unlock new possibilities for dev/test, reporting, and migration workflows without consuming a lot of time or additional storage.
Ask Us Everything: Everpure & Databases - From Firefighting to Forward Thinking
Databases aren't going anywhere; in fact, they're becoming more important than ever. In this Ask Us Everything session, Don Poorman sat down with Everpure database experts Anthony Nocentino and Ryan Arsenault to talk about all things structured data. And while AI continues to dominate headlines, one theme came through clearly: AI doesn't replace databases, it depends on them. If you're running Oracle, SQL Server, SAP, or anything mission-critical, here's what stood out.
Boosting SQL Server Backup/Restore Performance: Threads and Parallelism
In this post, we'll discuss day-1 tuning you can do on your database hosts to take full advantage of your new high-performance backup storage. We'll go over a few tricks around database layout and backup configuration for maximum throughput, look at some quirks with SMB, and finally discuss using S3 effectively.
Tips for High Availability SQL Server Environments with ActiveCluster
Tip 1: Use Synchronous Replication for Zero RPO/RTO
Why it matters: ActiveCluster mirrors every write across two FlashArrays before acknowledging the operation to the host. This ensures a zero Recovery Point Objective (RPO) and zero Recovery Time Objective (RTO), which are critical for maintaining business continuity in SQL Server environments.
Best practice: Keep inter-site latency below 5 ms for optimal performance. While the system tolerates up to 11 ms, staying under 5 ms minimizes write latency and transactional slowdowns.
Tip 2: Group Related Volumes with Stretched Pods
Why it matters: Stretched pods ensure that all volumes within them are synchronously replicated as a unit, maintaining data consistency and simplifying management. This is crucial for SQL Server deployments where data, log, and tempdb volumes need to fail over together.
Best practice: Place all volumes related to a single SQL Server instance into the same pod. Use separate pods only for unrelated SQL Server instances or non-database workloads that have different replication, performance, or management requirements.
Tip 3: Use Uniform Host Access with SCSI ALUA Optimization
Why it matters: Uniform host access allows each SQL Server node to see both arrays. Combined with SCSI ALUA (Asymmetric Logical Unit Access), this setup lets the host prefer the local array, improving latency while maintaining redundancy.
Best practice: Use the Preferred Array setting in FlashArray for each host to route I/O to the closest array. This avoids unnecessary round-trips across WAN links, especially in multi-site or metro-cluster topologies. Install the correct MPIO drivers, validate paths, and use load-balancing policies such as Round Robin or Least Queue Depth.
Tip 4: Test Failover on a Regular Cadence
Why it matters: ActiveCluster is designed for transparent failover, but you shouldn't assume it just works.
Testing failover on a regular schedule validates the full stack, from storage to SQL Server clustering, and exposes misconfigurations before they cause downtime.
Best practice: Simulate array failure by disconnecting one side and verifying that SQL Server remains online via the surviving array. Monitor replication and quorum health using Pure1, and ensure Windows Server Failover Clustering (WSFC) responds correctly.
Tip 5: Use ActiveCluster for Seamless Storage Migration
Why it matters: Storage migrations are inevitable for lifecycle refreshes, performance upgrades, or datacenter moves. ActiveCluster lets you replicate and migrate SQL Server databases with zero downtime.
Best practice: Follow a six-step phased migration:
1. Assess and plan
2. Set up the environment
3. Configure ActiveCluster
4. Test replication and failover
5. Migrate by removing paths from the source array
6. Validate with DBCC CHECKDB and application testing
This ensures a smooth handover with no data loss or service interruption.
Tip 6: Align with VMware for Virtualized SQL Server Deployments
Why it matters: Many SQL Server instances run on VMware. Using ActiveCluster with vSphere VMFS or vVols brings granular control, high availability, and site-aware storage policies.
Best practice: Deploy SQL Server on vVols for tighter storage integration, or use VMFS when simplicity is preferred. Stretch datastores across sites with ActiveCluster for seamless VM failover and workload mobility.
Tip 7: Avoid Unsupported Topologies
Why it matters: ActiveCluster is designed for two-site, synchronous setups. Using it in unsupported configurations, such as synchronous replication between cloud and on-premises arrays or non-uniform host access with a SQL Server Failover Cluster Instance (FCI), can break failover logic and put data at risk.
Best practice: Do not use ActiveCluster between cloud and on-premises FlashArrays. Avoid non-uniform host access in SQL Server Failover Cluster Instances, because failover will not be coordinated.
Instead, use ActiveDR™ or asynchronous replication for cloud or multi-site DR scenarios.
Next Steps
Pure Storage ActiveCluster simplifies high availability for SQL Server without extra licensing or complex configuration. If you want to go deeper, check out the whitepaper on FlashArray ActiveCluster for more details.
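The latency guidance in Tip 1 (below 5 ms for optimal performance, with up to 11 ms tolerated) can be applied to measured samples with a small helper. This Python sketch rests on assumptions: it takes round-trip latency samples in milliseconds that you collect yourself (it does not query Pure1 or the arrays), and it judges the link by the 99th percentile using the nearest-rank method, a choice made here for illustration rather than one the tips prescribe. The function names are hypothetical.

```python
import math
from typing import Sequence

TARGET_MS = 5.0      # Tip 1: keep inter-site latency below 5 ms
TOLERATED_MS = 11.0  # Tip 1: the system tolerates up to 11 ms


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the sample at rank ceil(pct/100 * N) in sorted order."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def classify_link(rtt_ms: Sequence[float], pct: float = 99.0) -> str:
    """Classify an inter-site link against the Tip 1 thresholds (hypothetical helper)."""
    observed = percentile(rtt_ms, pct)
    if observed < TARGET_MS:
        return "within target"
    if observed <= TOLERATED_MS:
        return "within tolerance"
    return "out of tolerance"


if __name__ == "__main__":
    print(classify_link([1.2, 2.0, 3.1]))  # within target
    print(classify_link([4.0, 6.5, 7.0]))  # within tolerance
    print(classify_link([4.0, 12.0]))      # out of tolerance
```

Judging by a high percentile rather than the mean keeps occasional latency spikes visible; using the maximum sample instead would give the strictest reading.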