Stop Blaming Storage: The Invisible Cost of Excessive Log Switches In Oracle Databases
Real-World Telemetry Analysis: Test 1 vs. Test 2

To understand how severe write volumes impact database latency, let us evaluate two distinct test profiles running the exact same heavy transactional workload. These profiles highlight the staggering volume of log writer activity occurring under typical enterprise applications:

- Database Profile (Test 1): Sustaining an intensive write rate of 35,550,156.8 bytes per second (~33.90 MB/sec) of redo generation.
- Database Profile (Test 2): Sustaining an even higher write rate of 40,691,343.8 bytes per second (~38.81 MB/sec) of redo generation.

A consistent generation rate of 34 MB/s to 39 MB/s is classified as a highly active, heavy write workload. If the underlying layout of the database's log files is structured using default or undersized parameters, this heavy transactional density forces a systemic collision point between logical software processing and physical disk checkpointing.

Reverse-Engineering Your Log Sizes from Switch Activity

Because physical redo log dimensions are structural layouts rather than configuration variables, they are not listed inside the Modified Parameters section of standard database diagnostic summaries (such as AWR reports). Instead, engineers must combine the sustained redo byte velocity with recorded switch intervals to uncover the current physical geometry using this model:

$$S_{\text{log}} = \frac{R_{\text{sec}} \times 3600}{N_{\text{switch}}}$$

Where $S_{\text{log}}$ represents the calculated current log size, $R_{\text{sec}}$ represents the redo byte velocity per second, and $N_{\text{switch}}$ represents the total number of log switches executed per hour.

Modeled Redo Layout Dimensions Based on Active Workloads

| Log Switches Observed / Hour | Test 1 Profile (33.90 MB/sec) | Test 2 Profile (38.81 MB/sec) | Engine State & Systemic Latency Impacts |
|---|---|---|---|
| 30 Switches / Hour (Every 2 minutes) | ~4,068 MB (4 GB) | ~4,657 MB (4.5 GB) | Continuous, aggressive database checkpointing. Disk queues are consistently saturated writing dirty blocks to datafiles. |
| 60 Switches / Hour (Every 1 minute) | ~2,034 MB (2 GB) | ~2,328 MB (2.3 GB) | Severe operational throttling. High threat of transaction processing freezes while the engine waits for space. |
| 120 Switches / Hour (Every 30 seconds) | ~1,017 MB (1 GB) | ~1,164 MB (1.1 GB) | Critical architectural failure point. Heavy occurrence of log file switch completion wait states. |

The Mechanics of a Log Switch Bottleneck

Why does a high log switch count destroy performance? It is crucial to understand what the Oracle database engine is forced to do behind the scenes every single time a log group fills up:

- Forced Incremental Checkpointing: When a log switches, the database must advance its checkpoint. This forces the Database Writer processes (DBWn) to aggressively flush dirty data blocks from memory (the Buffer Cache) out to the permanent datafiles on disk to ensure crash-recovery safety.
- Control File Serialization: The database must update its control files to record the new log sequence architecture. This introduces internal metadata synchronization locks (enqueues) that can cause user sessions to stall.
- Archiver Contention: The Archiver background processes (ARCn) must instantly awake and begin reading the newly filled redo log to copy it to the archive destination. If the logs are small and switching every few seconds, the archivers cannot keep pace, completely locking the log writer (LGWR) out of the next group in the rotation.

The accumulation of these three internal operations manifests directly as elevated log file sync and foreground wait latencies. To an outside observer, it looks like the storage array is failing to write fast enough, but in reality, the database engine is choking on its own structural layout.
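The log-size model described earlier (redo bytes per second, times 3600, divided by switches per hour) is easy to check with a few lines of code. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, and it assumes the article's "MB" figures are binary megabytes (1,048,576 bytes), which is what the quoted byte rates imply.

```python
MIB = 1024 * 1024  # assumption: the article's "MB" figures are binary megabytes


def log_size_bytes(redo_bytes_per_sec: float, switches_per_hour: float) -> float:
    """Implied size of one redo log: S_log = (R_sec * 3600) / N_switch."""
    if switches_per_hour <= 0:
        raise ValueError("switches_per_hour must be positive")
    return redo_bytes_per_sec * 3600 / switches_per_hour


def implied_switches_per_hour(redo_bytes_per_sec: float, log_bytes: float) -> float:
    """The same model solved for N_switch, given a known log size."""
    if log_bytes <= 0:
        raise ValueError("log_bytes must be positive")
    return redo_bytes_per_sec * 3600 / log_bytes


# Redo generation rates quoted in the article, in bytes per second.
REDO_RATES = {"Test 1": 35_550_156.8, "Test 2": 40_691_343.8}

if __name__ == "__main__":
    # One row per switch frequency, mirroring the modeled layout table.
    for switches in (30, 60, 120):
        sizes = ", ".join(
            f"{name}: {log_size_bytes(rate, switches) / MIB:,.0f} MB"
            for name, rate in REDO_RATES.items()
        )
        print(f"{switches:>3} switches/hour -> {sizes}")
```

Run in the other direction, the same model says that a 1 GB log under the Test 1 rate would switch about 119 times per hour, which matches the worst row of the table.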
Sizing for the 20-Minute Target Window

To neutralize this threat, we apply standard best-practice mathematics to size the log allocations for a conservative, stable 20-minute operational window under the observed workloads.

Mathematical Formulation:

Test 1 Architecture Sizing: 33.90 MB/sec × 60 seconds = 2,034 MB/minute. For a 20-minute window: 2,034 MB/minute × 20 minutes = 40,680 MB (~40 GB per log group).

Test 2 Architecture Sizing: 38.81 MB/sec × 60 seconds = 2,328.6 MB/minute. For a 20-minute window: 2,328.6 MB/minute × 20 minutes = 46,572 MB (~46 GB per log group).

Sizing Standard: To provide a safe, cushioned operational margin during unexpected transaction spikes, configuring an allocation of 40 GB to 48 GB per log group across a minimum of 4 to 5 log groups should smooth out the checkpointing waves and restore a predictable processing flow.

DBA Command and Verification Track

To audit your live database environment immediately, run the following administrative query to verify your current log configuration and status:

    SELECT GROUP#, THREAD#, SEQUENCE#, BYTES/1024/1024/1024 AS SIZE_GB, STATUS FROM V$LOG;

If this output returns sizes sitting at outdated, legacy defaults (such as 1 GB or 2 GB) while under modern, high-velocity workloads, you have found your hidden bottleneck. Correcting the redo allocation path will relieve the artificial pressure on your data layer.

Quantifiable Database Performance Savings

The most profound impact of implementing best-practice redo log sizing is the immediate reclamation of database processing capacity.

Reclamation of Core Processing Time: Production environments can anticipate an immediate 15% to 20% savings in overall database processing time, particularly on nodes operating under synchronous replication frameworks.

Elimination of Forced Wait States: Diagnostic telemetry shows the database spends up to 20.65% of its total operational life completely frozen within log file sync events.
While a portion of this is network transit overhead, a significant contributor is the engine constantly stalling to handle back-to-back log switches occurring multiple times per minute.

CPU Cycle Optimization: Transitioning to a stabilized footprint of 2 to 3 log switches per hour removes self-inflicted logical barriers, dropping the active wait-state percentages down and immediately returning vital CPU cycles back to active user transactions and application processing.

Targeted Systems and Subsystem Benefits

Correcting the redo allocation geometry triggers a positive cascade of efficiency across multiple independent layers of the database infrastructure ecosystem:

A. Storage I/O Optimization (Flattening the Checkpoint Waves)

Every time an individual redo log file reaches capacity and triggers a switch, Oracle mandates an aggressive incremental checkpoint. The Database Writer background processes (DBWn) are forced to violently halt standard operation to clear, prioritize, and flush "dirty" data blocks from the volatile Buffer Cache down to the permanent physical storage datafiles.

The Strategic Benefit: Instead of a chaotic, cyclic pattern where disk I/O heavily spikes and crashes every 30 to 60 seconds, the underlying storage fabric encounters a flattened, smooth, and highly predictable write curve. Physical disk queue depths drop significantly, completely removing artificial array-level performance chokes.

B. Elimination of Control File Enqueue Serialization

To cleanly finalize a log switch, the database engine must gain exclusive metadata locks to write updated sequence architectures directly into the database control files. When a misconfigured environment forces this action hundreds of times an hour, user sessions become trapped in an internal serialization traffic jam.

The Strategic Benefit: Scaling the logs ensures that control file metadata modification occurs only a few times per hour.
This completely erases internal enqueue contention and prevents micro-stalls from propagating to foreground user processes.

C. Mitigation of Archiver Process (ARCn) Contention

Under high-velocity write workloads (~34 MB/s to 39 MB/s), undersized logs fill up substantially faster than the Archiver background processes (ARCn) can read and copy them to designated archive log destinations. If the archivers fall behind the pace of the log writer, the Log Writer (LGWR) will freeze all database processing because it is structurally prohibited from overwriting an unarchived log group.

The Strategic Benefit: Deploying 40 GB to 48 GB log groups builds a wide, stable, 20-minute processing window. This provides the ARCn processes ample buffer space to quietly copy data streams in the background without ever creating a risk of blocking active application transactions.

D. Stabilization of Application Response Uniformity

From an end-user and application integration perspective, transaction latency becomes completely uniform and highly predictable.

The Strategic Benefit: Currently, a user session may encounter an instantaneous transaction response, followed a moment later by a multi-second delay simply because their specific COMMIT command executed simultaneously with a log switch checkpoint. Eliminating constant switches ensures uniform, predictable, and sub-second transaction commit processing across the entire user base.

Conclusion and Core Directive

Undersized redo logs force high-performance solid-state storage arrays to absorb massive amounts of unnecessary operational punishment by demanding that files be opened, written, closed, checkpointed, and archived hundreds of times per hour. Increasing the log file size to align with a 20-minute target window does not merely alter a structural capacity metric; it fundamentally upgrades the internal execution efficiency of the core Oracle database engine.
It systematically clears the log file sync bottleneck, cools down spiking CPU usage, and allows your enterprise data infrastructure to operate at its true peak potential.

Guarantee Data Availability: How to Create a Snapshot Bunker
June 23 | Register Now!

Without data integrity and availability, recovery is difficult and may take days or even weeks. Everpure prioritizes remediation and recovery as the critical path to cyber resilience. Data availability is the foundation for reliable remediation and rapid recovery, enabling organizations to restore data regardless of the severity of a disaster or cyberattack.

- Why a snapshot bunker is the cornerstone of effective layered resilience
- Key considerations for architecting the bunker to ensure survivability
- How to set up and operate a snapshot bunker

The Lost Art of Sizing
Introduction - Why This Series Exists

Technology has gone through one of the most extraordinary economic transformations in modern history. For over four decades, the industry benefited from continuously cheaper computing resources, exponentially faster processors, collapsing storage costs, and an almost limitless ability to scale systems through virtualization and cloud computing. During that time, many of the operational disciplines that once defined great engineering slowly faded into the background. Precise sizing, deep performance analysis, workload modeling, and resource optimization became less visible as organizations increasingly relied on abundant infrastructure to compensate for inefficiencies.

But the economics are changing. Today we are entering an era defined by:

- exploding GPU costs
- massive AI infrastructure investments
- rising power consumption
- thermal and density limitations
- increasingly expensive semiconductor fabrication
- and cloud bills that are exposing years of architectural inefficiency

As these pressures grow, the industry is rediscovering something earlier generations of technologists already understood: Efficiency matters. And ultimately: Sizing matters.

This blog series is intended to explore both the history and the future of performance engineering, capacity planning, and system sizing. The first blog (this one) focuses on how the industry arrived where it is today:

- the Scarcity Era of computing
- the transition into abundance
- the rise of cloud abstraction
- and the re-emergence of constraints in the modern AI era

Future blogs will move from theory and history into practical engineering.
They will examine modern system architectures and explore the many bottlenecks that organizations often overlook, including:

- CPU saturation
- memory pressure
- NUMA effects
- storage latency
- queue depth issues
- network bottlenecks
- virtualization overhead
- cloud inefficiencies
- database scaling challenges
- and workload contention patterns

The series will also discuss methods for properly monitoring, modeling, tuning, and sizing these environments. Because the scope of the subject is so large, future entries will likely be broken into multiple specialized blogs by technology area. Some topics may themselves require multi-part deep dives.

About the Author

I started my career in technology in 1978 working on a Basic Four computer system during the early years of enterprise computing. Over the decades, I have worked across operations, engineering, architecture, product management, database performance tuning, and large-scale infrastructure analysis. I have architected sizing and performance analysis tools for technology vendors, worked internationally on database and infrastructure performance engagements, and spent much of my career focused on understanding how systems behave under real-world workloads. My background includes extensive work with Oracle technologies, enterprise performance tuning, workload analysis, and capacity planning across multiple industries and platforms.

Today, I am employed at Everpure as a Field Solution Architect specializing in Oracle technologies and performance engineering. Having worked through the mainframe era, distributed systems revolution, virtualization, cloud computing, and now the rise of AI infrastructure, I believe the industry is once again approaching a point where operational discipline, efficiency, and proper sizing will become critical engineering skills. This series is both a technical discussion and a historical perspective from someone who has watched these cycles evolve over nearly five decades.
The Lost Art of Sizing

Part I - The Scarcity Era

In the late 1970s, I started my career in technology. My first roles were in operations, running jobs on mainframes overnight and performing backups. Over time, I moved throughout the IT organization before eventually transitioning into engineering and product management in the late 1980s.

I often refer to the 1970s and early 1980s as The Scarcity Era of computing. During that time, computing resources were extraordinarily expensive:

- Storage could cost the equivalent of hundreds of thousands of dollars per gigabyte
- Memory was frequently measured in tens or hundreds of thousands of dollars per megabyte
- CPU performance was discussed in terms of MIPS (Millions of Instructions Per Second), with systems delivering only a handful of MIPS costing millions of dollars

Every component in the system represented a major financial investment. Because resources were scarce and expensive, sizing was treated almost as a science. Capacity planning was not optional; it was foundational to the survival of the business. Over-sizing a system could waste enormous capital. Under-sizing it could bring critical business operations to a halt.

Every byte mattered. Every CPU cycle mattered. Every disk spindle mattered.

This environment created a culture of discipline:

- Applications were optimized aggressively
- Developers understood resource constraints
- Operations teams monitored utilization closely
- Architects carefully modeled workloads
- Performance engineering was considered a core technical skill

In many organizations, some of the best engineers were the people who could make systems smaller, faster, and more efficient. Software engineering was deeply connected to hardware realities. You could not simply “add more servers.” There often were no additional servers to add. This scarcity shaped an entire generation of technologists.

Part II - The Abundance Era

Then something extraordinary happened.
Beginning in the late 1980s and accelerating through the 1990s and 2000s, the economics of computing changed completely. Moore’s Law, semiconductor scaling, manufacturing efficiencies, and global supply chains created an era of unprecedented abundance.

For nearly forty years:

- CPUs became exponentially faster
- Memory became dramatically cheaper
- Storage costs collapsed
- Networks became faster
- Virtualization increased utilization
- Cloud computing made infrastructure appear almost limitless

For the first time in computing history, performance improvements arrived faster than software inefficiencies could consume them. This fundamentally changed engineering culture. Disciplines that had once been mandatory slowly became optional. Applications no longer had to be highly optimized because hardware improvements continuously masked inefficiencies. Instead of tuning software, organizations increasingly solved problems by purchasing more infrastructure.

A new mindset emerged: Hardware is cheaper than engineering time. And for many years, that was largely true.

The rise of virtualization and cloud computing accelerated this transition even further. Infrastructure became abstracted from the engineers writing the software. Developers no longer saw physical systems, disk arrays, or memory limitations. Resources became API calls and provisioning scripts. Eventually, many organizations evolved toward a model where applications were simply “thrown over the wall” into the cloud.

If performance was poor:

- allocate more CPUs
- add more memory
- scale horizontally
- increase cloud spending

The business unit would absorb the cost. The direct connection between engineering decisions and infrastructure economics became increasingly invisible.
In many environments:

- poor code was tolerated
- inefficient queries were normalized
- oversized containers became standard
- massive memory consumption was accepted
- idle cloud resources accumulated unchecked

Traditional sizing disciplines faded because the financial pain was no longer immediate or visible to the engineering teams creating the workloads. The cloud did not eliminate capacity planning; it merely changed who paid for bad sizing decisions. In the mainframe era, poor sizing decisions were catastrophic because hardware was scarce. In the cloud era, poor sizing decisions became operational expenditures hidden inside monthly invoices. The result was a generation of systems that often consumed vastly more resources than their actual business function required.

Ironically, many of the operational disciplines developed during the Scarcity Era were not technically obsolete; they had simply become economically unnecessary for a time. But that may now be changing again.

Part III - The Return of Constraints

For nearly four decades, the technology industry operated under a powerful assumption: Tomorrow’s hardware would solve today’s software problems. For a long time, that assumption held true.

- If an application consumed too much CPU: processors became faster
- If memory usage grew: RAM became cheaper
- If storage exploded: disk costs continued collapsing
- If workloads increased: cloud platforms scaled almost infinitely

The economics of computing continuously compensated for inefficient engineering. But today, something significant is changing. The industry is beginning to encounter limits again. Not theoretical limits, but real economic, physical, and operational limits. Modern computing infrastructure is no longer getting dramatically cheaper at the rate it once did.
Instead, we are seeing:

- exploding GPU costs
- rising power consumption
- thermal limitations
- expensive high-bandwidth memory
- enormous cloud infrastructure bills
- increasingly expensive semiconductor fabrication
- AI workloads consuming unprecedented resources

For the first time in decades, inefficient software design is becoming economically visible again. And this has exposed a reality that many organizations had quietly ignored for years:

- poor code
- oversized architectures
- inefficient databases
- excessive abstraction layers
- uncontrolled cloud sprawl
- wasteful microservice designs
- badly tuned queries
- overallocated Kubernetes clusters
- massive idle infrastructure footprints

For years, these inefficiencies were masked by cheap hardware and elastic cloud scaling. Now they are appearing directly on financial statements. The cloud did not eliminate waste. It made waste easier to hide. Until the bills became too large to ignore.

At the same time, another challenge has emerged. Many of the people who developed the operational disciplines of the Scarcity Era are no longer in the industry. They have:

- retired
- moved into leadership
- transitioned into consulting
- or left technology entirely

The generation that deeply understood:

- workload modeling
- performance engineering
- memory optimization
- queue management
- efficient batch processing
- storage layout
- capacity forecasting
- low-level tuning

is steadily disappearing. Much of that knowledge was never fully documented because it was simply considered part of being an experienced engineer.

As a result, many younger organizations grew up in an environment where:

- infrastructure felt unlimited
- optimization seemed unnecessary
- cloud scaling replaced careful design
- operational cost was someone else’s problem

Now the industry faces a difficult transition. The old constraints are returning, but many of the disciplines required to manage those constraints have faded.
In many ways, the industry is rediscovering something that earlier generations of technologists already understood: Resources are never truly infinite.

Eventually:

- power matters
- memory matters
- storage matters
- latency matters
- thermal density matters
- architecture matters

And ultimately: sizing matters.

The art of sizing has returned. Not because technology stopped advancing, but because economics, physics, and scale have once again forced the industry to confront efficiency. What was once viewed as an outdated operational skill may soon become one of the most important engineering disciplines again.

Part IV - History Does Not Repeat, But It Rhymes

What we are seeing today in technology is historically unusual, but it is not entirely unprecedented. Other industries have gone through similar transitions where periods of explosive advancement, falling costs, and seemingly limitless growth eventually collided with economic and physical realities.

The railroad industry is one example. In the early days of rail expansion during the Industrial Age, railroads transformed economies. Expansion happened rapidly. Costs initially fell as infrastructure scaled, routes expanded, and technology improved. For a time, railroads represented nearly unlimited economic optimism. But eventually the easy growth ended. The cost of expanding and maintaining rail infrastructure began rising dramatically. Marginal improvements became more expensive. Complexity increased. Maintenance became a larger percentage of operating cost. Competition intensified. Returns diminished.

The industry did not disappear. In fact, railroads remained enormously valuable to the economy. But the economics changed.

The same pattern appeared in other industrial and technological revolutions:

- aviation after the jet age
- nuclear power generation
- telecommunications infrastructure
- automobile manufacturing
- even electrical grid expansion

Early stages were driven by rapid gains and falling relative costs.
Later stages became dominated by:

- scale
- complexity
- infrastructure costs
- power requirements
- operational efficiency
- regulation
- and diminishing economic returns on incremental improvements

Technology did not stop advancing. It simply became harder, more expensive, and more complex to continue advancing at the same pace. That is increasingly where modern computing appears to be heading.

We are now entering the Age of AI. AI will absolutely create enormous value. In many ways, it already has. But there is growing evidence that the economics of this era are going to be very different from the cloud and consumer internet revolutions that preceded it.

AI infrastructure is extraordinarily expensive:

- massive GPU clusters
- enormous power consumption
- advanced cooling systems
- high-bandwidth memory
- increasingly expensive semiconductor fabrication
- global supply chain dependencies

For years, the technology industry operated almost like a perpetual motion machine where computing became continuously cheaper while performance improved exponentially. Today, the relationship between cost and performance is changing.

That does not mean AI is a failure. Far from it. But technological revolutions are not light switches. They are transitions. And transitions are messy. Industries often overspend before they stabilize. Architectures evolve through trial and error. Infrastructure expands ahead of efficient utilization. Economic models mature slowly.

The railroad era experienced this. The electrical age experienced this. The internet boom experienced this. And now AI appears to be entering a similar phase.

The challenge for the next generation of technologists will not simply be building larger systems. It will be learning how to build efficient, economically sustainable systems again. Which may ultimately bring the industry back to a lesson many believed had become obsolete: The art of sizing never really disappeared.
It was merely waiting for constraints to return.

See It Before It Happens: Predict and Fix Infra Issues Early
June 25 | Register Now!

Database and infrastructure teams often struggle to predict how changes will affect real-world application performance, capacity, and service levels. Without clear visibility into where applications are running and how they consume resources, even small adjustments can introduce unexpected latency, IOPS bottlenecks, or capacity shortfalls, making troubleshooting slower and increasing the risk of missing SLAs.

Join us and learn how to:

- Use scaling simulations to predict impacts on latency, IOPS, and capacity before changes are made.
- Apply data intelligence to identify where applications are running and how they consume resources.
- Improve performance planning and reduce troubleshooting time while maintaining stronger adherence to SLAs.

Security Is Not a Feature - It's the Foundation
Let's get something out of the way upfront: this is not a ransomware horror story. This is not a "cyber resilience framework" deep-dive full of three-letter acronyms that could potentially make your eyes glaze over if it's not your cup of tea. And this is definitely not a pitch deck disguised as a blog post. This is the real story of how Everpure thinks about security — at the architecture level — and why that distinction matters more than most people realize when they're evaluating storage platforms. Because here's the thing: security isn't a bolt-on. It's not a checkbox. And it's certainly not a conversation you should have to schedule separately from the one about performance or reliability. At Everpure, security is baked in from the ground up — and once you understand how, you'll never look at a storage spec sheet the same way again. Start With the Five S's At Everpure, we talk a lot about what we call the Five S's of data: Simplicity, Speed, Scale, Sustainability, and Security. They're not independent pillars — they're interlocking principles that define every design decision we make. Simplicity because complexity is the enemy of agility. If you can't iterate quickly, you can't grow. Speed because we've been all-flash since day one — full stop. Every generation of our platform has been optimized around flash, not retrofitted for it. Scale because data doesn't stop growing, and your storage shouldn't hit a wall when your business doesn't. Sustainability because power, cooling, and physical footprint are real constraints — especially now, as those pressures trickle down from hyperscalers to everyone else. Security because none of the other four matter if your data isn't protected. Security is the one that tends to get either oversimplified ("we encrypt everything") or overcomplicated ("here's our 47-page compliance matrix"). Neither is helpful. 
What's helpful is understanding how it works, why it's different, and what it means in a real conversation with a real customer. The Compliance Landscape: What Customers Are Actually Asking About Before we get into the architecture, let's talk about the validations — because customers are increasingly asking about them, and the answers matter. FIPS 140-3 is the latest standard from the Cryptographic Module Validation Program (CMVP), managed by NIST. It validates that a cryptographic module — the thing actually doing the encryption — meets a defined security standard. Everpure's FlashArray is FIPS 140-3 validated. That's the current gold standard, and it matters especially as post-quantum cryptography conversations start entering the room. (More on that in a moment.) Common Criteria is an international standard for evaluating the security of IT products — not just storage, but networking, applications, hardware modules, and more. Everpure's FlashArray is certified under the Network Device collaborative Protection Profile (NDcPP) via NIAP, while FlashBlade holds an EAL2 certification. Independent testing and verification confirm that each platform meets its defined security target. You can actually enable Common Criteria mode directly on a FlashArray — it's a CLI command, not a professional services engagement. PCI DSS compatibility is table stakes in financial services, but it increasingly shows up in other industries too. It means end-to-end data masking, encryption in-flight and at rest, and a well-documented audit trail. Everpure's platforms are designed to support PCI DSS requirements natively — though it's worth noting that PCI DSS certification belongs to the merchant environment as a whole, not to any individual storage component. TLS 1.2 and 1.3 are the current standards for securing data in-flight at the management layer. 
Everpure standardizes these across all management communications, and yes, you can turn off older cipher suites if your security posture requires it. TAA Compliance means that Everpure's hardware is manufactured in the United States. For customers in regulated industries or government, this isn't a nice-to-have; it's a requirement. And for anyone who cares about supply chain transparency, Everpure can show its work. None of this is marketing fluff. These are independently validated, publicly verifiable certifications. You can find all of them (current CVE database, FIPS status, NIST 800-53 alignment, media sanitization documentation) at our Customer Trust portal. Bookmark it: it's fully public-facing and constantly updated. The Hardware Story: Why No Keys on the Drive Is the Point Here's where things get interesting. Take a Direct Flash Module (Everpure's approach to flash) and look at what's not on it. No CPU. No memory. No encryption keys. It is not a self-contained storage array. It is purpose-built flash media, and everything else (the intelligence, the encryption, the key management) lives in software. Why does that matter? Because self-encrypting drives (SEDs) are a pain. Anyone who's managed them in a regulated environment knows this intimately. When the encryption is in the hardware, you inherit all the complexity that comes with it: drive-level key management, FTL overhead, KMIP integration headaches, and the ever-present risk that a single drive failure or misconfiguration creates a data accessibility nightmare. Everpure's approach flips this entirely. Because the Direct Flash Module has no CPU, no memory, and no keys, all encryption is handled at the software layer, in Purity, running across the entire system. This means no hardware dependency, no FTL management overhead, and no encryption key tied to a specific piece of media. The portability this creates is remarkable. And as you'll see in a moment, it's the foundation of everything else.
How Everpure's Encryption Actually Works Let's peel back the layers here, because this is genuinely cool — and it's the kind of thing that separates a confident storage conversation from a "let me get back to you" one. Everpure's encryption architecture is built around three components: The Data Encryption Key (DEK) is the actual key used to encrypt customer data. There's one per array, and it doesn't change. You might think: why would you never rotate the key that's protecting your data? The answer is that the DEK never needs to rotate because of what wraps it. The Key Encrypting Key (KEK) is a key that encrypts other keys — specifically, it wraps the DEK. This is standard cryptographic practice, and it's the mechanism that makes key rotation safe, fast, and completely transparent to the workload. The Armored DEK is the DEK after it's been wrapped by the KEK. This is the piece that gets distributed. At no point is the raw Data Encryption Key exposed in clear text. It's always wrapped, always protected. Here's where the architecture gets elegant: when a FlashArray or FlashBlade initializes, it generates a KEK. That KEK wraps the DEK to create the Armored DEK. The Armored DEK is stored as a complete copy in every Direct Flash Module header — but it cannot be decrypted without the KEK. The KEK itself is derived from a scrambled key, which is split into individual shares and distributed one per DFM header using a sharding algorithm that requires a quorum to reconstruct. What does quorum mean in practice? The system can tolerate drive losses and still unlock all data, as long as enough DFMs remain present and healthy to reconstruct the scrambled key. No single drive is a single point of failure for your encryption keys. 
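To make the quorum idea concrete, here is a minimal sketch. The post does not name the sharding algorithm Purity uses, so this swaps in Shamir's secret sharing, a textbook threshold scheme with the same property: any `k` of `n` shares rebuild the secret, and fewer do not. The share count (10 DFMs) and quorum (6) are made-up numbers for illustration, and the function names are hypothetical.

```python
import secrets

# Assumption: Shamir's secret sharing stands in for the unnamed sharding algorithm.
P = 2**521 - 1  # a Mersenne prime, comfortably larger than a 256-bit secret


def split_secret(secret: int, n: int, k: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares; any k of them can reconstruct it."""
    if not (1 <= k <= n) or not (0 <= secret < P):
        raise ValueError("need 1 <= k <= n and 0 <= secret < P")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod P
            y = (y * x + c) % P
        shares.append((x, y))
    return shares


def reconstruct(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret


# Hypothetical layout: one share per DFM header, 10 DFMs, quorum of 6.
scrambled_key = secrets.randbits(256)
dfm_shares = split_secret(scrambled_key, n=10, k=6)
```

Losing up to four of the ten hypothetical drives still leaves a quorum, so `reconstruct(dfm_shares[4:])` returns the original key, while any five shares alone yield an unrelated value.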
When a read request comes in, here's what happens: the system reconstructs the scrambled key from a quorum of DFM shares, derives the KEK, and uses it to unwrap the Armored DEK — exposing the DEK temporarily in memory, never persisted in clear text — and uses it to decrypt the data. The process is reversed for writes. At no point is customer data stored or persisted in clear text. Everything written to NVRAM is encrypted before it ever reaches upper-level system processes. This isn't "we encrypt everything." This is a specifically designed cryptographic architecture that is portable, resilient, and opaque to any unauthorized party — including someone who physically removes a drive. Key Rotation: The Part Most Vendors Skip By default, Everpure rotates the Key Encrypting Key every 24 hours. Automatically. No KMIP server required. No scheduled maintenance window. It just happens. When a KEK rotates, the system generates a new one, re-encrypts the Armored DEK, and redistributes the updated scrambled key shares across all DFM headers. The DEK itself doesn't change — the workload never sees it — but the wrapping layer that protects it is refreshed daily. When drives are added or removed, the system treats this as a high availability event: it generates a new KEK immediately, re-encrypts everything, and rebalances the shards across the new drive configuration. The key material always matches the current system state. And when a DFM is removed from the system? The scrambled key shares on that drive correspond to a KEK that no longer exists — or will be rotated away within 24 hours. A removed drive becomes cryptographically useless. This is how Everpure delivers what some would call "instant media sanitization" — not by wiping the drive, but by invalidating the key that makes its contents meaningful. 
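The rotation behavior described above can be sketched as an envelope-encryption toy. This is an illustration only: the wrap below is a hash-derived XOR pad with an HMAC tag, chosen so the example runs on the Python standard library. It is not Purity's construction and is not suitable for production use, where a standard key-wrap algorithm would be used instead. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def _pad(kek: bytes, nonce: bytes) -> bytes:
    # Toy keystream: 32 bytes derived from the KEK and a per-wrap nonce.
    return hashlib.sha256(b"pad" + kek + nonce).digest()


def wrap(kek: bytes, dek: bytes) -> bytes:
    """Return an 'armored DEK': nonce || masked DEK || integrity tag."""
    nonce = secrets.token_bytes(16)
    body = bytes(a ^ b for a, b in zip(dek, _pad(kek, nonce)))
    tag = hmac.new(kek, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def unwrap(kek: bytes, armored: bytes) -> bytes:
    """Recover the DEK, or raise if the KEK does not match."""
    nonce, body, tag = armored[:16], armored[16:48], armored[48:]
    expected = hmac.new(kek, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong KEK or corrupted armored DEK")
    return bytes(a ^ b for a, b in zip(body, _pad(kek, nonce)))


def rotate(old_kek: bytes, armored: bytes) -> tuple[bytes, bytes]:
    """Generate a fresh KEK and re-wrap the same DEK under it."""
    dek = unwrap(old_kek, armored)  # DEK exists only in memory here
    new_kek = secrets.token_bytes(32)
    return new_kek, wrap(new_kek, dek)


dek = secrets.token_bytes(32)   # never changes
kek = secrets.token_bytes(32)   # rotated on a schedule
armored = wrap(kek, dek)
kek2, armored2 = rotate(kek, armored)
```

After `rotate`, the data key is unchanged, so no stored data needs re-encrypting, but the old KEK can no longer open the current armored DEK. That is the property the post relies on when it calls a removed drive cryptographically useless.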
Rapid Data Locking: When You Need the Nuclear Option For environments where security isn't just a compliance requirement but a physical reality — air-gapped facilities, defense deployments, high-security data centers — Everpure has a capability called Rapid Data Locking (RDL). The concept: the Key Encrypting Key can be placed on a pair of hardware security tokens (one YubiKey per controller, two total) and inserted into the array. As long as the tokens are present, the array operates normally. If they are removed and the array is subsequently rebooted or power-cycled, the array cannot complete startup without the tokens present — the data remains physically intact, but it is cryptographically inaccessible. The array becomes, in the most literal sense, an expensive brick. Reinsert the tokens and power the array back on, and it boots up normally. This is the kind of capability that used to require expensive, bespoke security architecture. For Everpure customers, it's a feature of the platform. Dark Sites Are Getting Less Dark One more topic worth addressing: dark site deployments. Air-gapped environments have always involved painful tradeoffs — disconnected from cloud management, manual support processes, limited visibility into system health. That's changing. Dark site customers can now see their assets within Pure1 — subscriptions, health status, the ability to open and manage support cases — without compromising their air-gap requirements. Log obfuscation tooling is available today and will be integrated directly into the platform going forward, giving customers granular control over what telemetry leaves their environment and when. For partners and customers managing dark site deployments, this is a meaningful quality-of-life improvement. And it's consistent with how Everpure builds everything: the security architecture makes the operational flexibility possible, not the other way around. 
The Takeaway

Security conversations in the storage industry tend to go one of two ways: a recitation of certifications that nobody fully understands, or a vague reassurance that "everything is encrypted." Neither builds confidence. Neither answers the real question, which is: how does this actually work, and why should I trust it? Everpure's answer starts with architecture. Software-managed encryption, no hardware key dependency, automatic key rotation, cryptographic portability, quorum-based scrambled key distribution, and capabilities like Rapid Data Locking that scale to the most demanding security requirements in the world. The certifications (FIPS 140-3, Common Criteria, TLS 1.3, TAA) aren't the story. They're the evidence. The story is that security was designed in from the beginning, not layered on afterward. That's a meaningful difference. And now you know why.

Part 2: MCP Is Interesting. Everpure Fusion Makes It Useful.
In Part 1, I tried to give MCP a proper “…splanation,” mostly because the first several times I heard people talking about Model Context Protocol, I had the same look Joey had in Friends when the salesman asked him if his friends ever had a conversation and he just nodded along without really knowing what they were talking about. That was me. MCP this. MCP server that. Agentic AI. Tool calling. Context windows. Protocols. Hosts. Clients. Servers. At some point, I realized I was nodding with the confidence of a man who had understood approximately 41% of the conversation and was hoping nobody asked a follow-up question. The simple version is this: MCP is a standard way for AI applications to connect to tools and data. It is not the AI model itself. It is not the magic brain. It is the plumbing that lets the AI reach into approved systems, ask better questions, retrieve useful context, and potentially take action through well-defined tools. That is important in the abstract. But for Everpure customers and prospects, it becomes much more interesting when we stop talking about MCP as a general AI concept and start talking about what it could mean for storage operations, data infrastructure, and Everpure Fusion. Because this is where the conversation moves from “AI is coming someday” to “your infrastructure may already need to be ready for how AI will interact with it.” Everpure recently published a blog with a sneak peek of the Everpure Fusion MCP Server, describing it as an open-source service that connects AI assistants to Everpure Fusion storage fleets through the Model Context Protocol. The important part is not simply that an AI assistant can talk to storage. That would be interesting, but it would also be easy to misunderstand. The important part is that the assistant can interact with the storage environment through the Fusion control plane, which already understands fleet-wide context across FlashArray and FlashBlade. That distinction matters. 
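For readers who want to see the plumbing, here is a rough sketch of what a tool invocation looks like on the wire. MCP messages are JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. Everything else below is an assumption for illustration: the tool names, the arguments, and the read-only allowlist are hypothetical and are not taken from the Everpure Fusion MCP Server.

```python
import json

# Hypothetical tool names; a real server advertises its own via tools/list.
READ_ONLY_TOOLS = {"list_volumes", "get_volume_performance"}


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def requires_approval(tool_name: str) -> bool:
    """Toy policy: anything not on the read-only allowlist needs a human."""
    return tool_name not in READ_ONLY_TOOLS


request = build_tool_call(
    request_id=1,
    tool_name="get_volume_performance",          # hypothetical tool
    arguments={"workload": "prod-sql", "window_hours": 4},  # hypothetical args
)
print(json.dumps(request, indent=2))
```

The allowlist check is one simple way to keep an assistant in a read-and-correlate role by default, with anything that changes state routed through a person first.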
Without Fusion, many environments are still managed in a way that looks very familiar to anyone who has spent time supporting infrastructure. One array over here. Another array over there. Scripts in one folder. Notes in another. Naming standards that started strong and then apparently met reality. Screenshots in tickets. Tribal knowledge in the heads of a few people who somehow remember which workload lives where, which array is doing what, and why nobody should touch that one volume because “there was a reason,” even if nobody is entirely sure what the reason was anymore. That model may work, but it does not scale gracefully. More importantly, it is not especially friendly to automation, and it is definitely not ideal for AI-assisted operations. Most troubleshooting in mature environments is not hard because people lack tools. It is hard because the context is not immediately obvious. The storage admin has one view. The DBA has another view. The virtualization team has another view. The application owner has a completely different view, usually delivered through a ticket that says something deeply scientific like “the app feels slow.” Everyone may be looking at a valid piece of the puzzle, but the real work is in the correlation. Which volume maps to which workload? Which array is hosting it? What did latency look like during the reported window? Were IOPS elevated? Was bandwidth constrained? Did anything change recently? Are we looking at a storage issue, a database issue, an application issue, a noisy neighbor, a misconfigured VM, a bad query, or just another case of “the network is innocent until proven guilty, but still somehow looks suspicious standing there”? That is where Fusion and MCP together become compelling. The Everpure Fusion MCP example makes the idea real. 
Instead of forcing an administrator to manually build low-level REST API calls or jump between tools, the MCP-aware AI assistant can query Fusion through higher-level tools exposed by the MCP server. In the example the Everpure blog described, a storage admin can ask about workloads and volumes supporting a production SQL environment, including arrays, IOPS, latency, and bandwidth over a recent time window. The assistant can then correlate that storage perspective with information from another MCP server, such as SQL Server context around database files, wait types, and query behavior. That does not mean the AI replaces the storage admin. It does not mean the AI replaces the DBA. It does not mean everyone goes to lunch while the robot fixes production. And this is where I need to bring in The Big Bang Theory again, because apparently this is who I am now. There is a scene in the show where Raj is very open to the idea of aliens and extraterrestrial life. At the planetarium, Raj can look at flashes of light in the sky and talk about how scientists cannot fully rule out the possibility of alien civilizations. It is funny because Raj is a scientist, but he is also Raj, so the line between rigorous possibility and “maybe the aliens are waving at us” gets wonderfully blurry. That is how some people talk about AI operations right now. A light flashes in the sky, and suddenly someone is ready to announce that the robots are here to run the data center. Let’s not do that. The point is not that the AI is an alien civilization arriving to take over infrastructure operations. The point is that the interface is changing. The way humans interact with infrastructure is starting to move from manual lookup, command execution, and tribal knowledge toward assisted reasoning, guided action, and cross-system correlation. That is much more practical than aliens. It is also much more useful. Fusion already gives customers a fleet-wide control plane. 
It gives you the ability to think above individual arrays, above one-off configuration, and above the old habit of managing infrastructure like every system is its own little island with its own weather pattern. MCP gives that control plane another interface, one designed for the way AI agents work. This is why Fusion adoption matters. If your environment is still managed mostly array by array, script by script, ticket by ticket, and screenshot by screenshot, then AI can only help so much. It may summarize the pain beautifully, but it is still summarizing pain. When you use Fusion to create a more consistent, policy-driven, fleet-aware operating model, you are not just modernizing storage management. You are making the environment more understandable to automation, to operations teams, and now to AI agents that need structured context in order to be useful. That is a very different conversation from “look, the AI can query storage.” The better conversation is this: if AI is going to become part of operational workflows, then your infrastructure needs to be ready to participate in those workflows. Fusion is one of the ways you prepare for that. Not someday. Now. And Fusion is not the only example of this direction. Another Everpure technical article shows how an MCP server can be built to integrate with FlashBlade, allowing an AI assistant to query system data and even take direct actions through a natural-language interface. That example is useful because it shows the bridge between the old world and the new one. In the old world, storage management often meant CLI commands, scripts, API calls, screenshots, and specialized knowledge living in the heads of a few very tired people. In the new world, those capabilities can be surfaced through an AI-assisted experience that understands the available tools and can help operators ask better questions in plain English. Again, that does not mean the AI should blindly run your infrastructure while everyone disappears. 
Please do not read this article and tell your change advisory board that “the blog guy said the robot can handle it.” That is not the point, and I would like to remain welcome in polite infrastructure society. The point is that the operational model is changing. For years, we have talked about automation in infrastructure, but a lot of what we called automation still required a human to know exactly what to automate, where to look, which command to run, which script was safe, which API endpoint mattered, and which piece of documentation had not quietly aged into fiction. AI-assisted operations changes the interaction pattern. Instead of always beginning with the operator knowing the exact command or API call, the operator can begin with the question. Why did this workload slow down? Which volumes support this application? What changed in the last four hours? Which arrays are carrying the highest latency? Which workloads are consuming the most bandwidth? Which policies are inconsistent across the fleet? Where do we have capacity pressure? Which storage objects are tied to this SQL environment? Those are the kinds of questions humans actually ask when something is happening. MCP gives AI assistants a standard way to ask approved systems for the data behind those questions. Fusion gives the storage estate a more consistent, policy-aware, fleet-level way to answer. That combination is where the opportunity lives. Now, because this is enterprise technology and not a children’s book, we also need to talk about the dangerous part. One of the readers posted this comment on LinkedIn yesterday: The moment an AI system can access tools and data, the conversation changes. A chatbot that gives a bad answer is annoying. An agent that takes the wrong action in a business system can become a real incident. If a model can read sensitive files, query databases, send messages, modify records, trigger workflows, or touch infrastructure, then security is not a feature. 
Security is the premise. This is where some of the MCP enthusiasm needs adult supervision. We have spent years telling users not to click strange links, not to approve unknown applications, not to reuse passwords, and not to download random files. Now we are building systems where an AI assistant might read strange content, call external tools, and act on behalf of the user. That can be incredibly powerful, but only if we are honest about the risk. In some ways, MCP may expose organizational problems faster. If your data is scattered, stale, contradictory, or politically curated, an AI agent connected to it will not magically produce truth. It may simply produce a more polished version of the confusion. If your workflows are unclear, connecting AI to them may help automate the ambiguity, which is not quite the same thing as progress. The model can gather information, call tools, and complete steps, but people still need to define what should happen, what should not happen, what requires approval, and what good looks like. For Everpure customers and prospects, the more important question is not whether MCP is interesting. It is whether your environment is ready for this kind of interaction. That is where I would encourage customers to take a serious look at Fusion. Not because Fusion is another checkbox on a feature list, and not because every new technology conversation needs to end with someone saying “platform” three times into a mirror. Fusion matters because it changes the operational model. It gives you a way to manage data infrastructure as a fleet, with policy, consistency, automation, and context. Those are exactly the things AI agents need if they are going to do more than produce nicely formatted guesses. If you already meet all the prerequisites (Purity 6.8.+, LDAP enabled), use it. Explore it. Get comfortable with it. 
Stop thinking about Fusion as something reserved for a future automation project after everyone finally gets through the current list of fires, renewals, upgrades, and meetings that should have been emails. MCP may be the plumbing that helps AI connect to the enterprise. Fusion helps make the storage environment worth connecting to. And that is the real call to action. Fusion is how Everpure customers make sure their data infrastructure is ready for it. Appreciate you reading. Dmitry Gorbatov © 2025 Dmitry Gorbatov | #dmitrywashere

Smarter VCF Deployments: Simplifying Storage with Everpure FlashArray
May 26 | Register Now!

Join us for this Expert-led Demos webinar and learn how Everpure FlashArray™ simplifies and accelerates VMware Cloud Foundation (VCF) deployments by serving as a high-performance NFS-backed datastore. Whether you’re planning a new VCF deployment or want to optimize an existing one, this session will give you practical guidance you can apply immediately to simplify deployment and lifecycle management.

This integration can:
Streamline Day 0/Day 1 setup
Decouple compute and storage in VCF for better scalability, right-sizing, and improved cost efficiency
Deliver predictable performance and data services for your cloud foundation
Enhance protection and resiliency for VCF workloads with built-in FlashArray data services

Hands-on with Everpure's FlashBlade//EXA
This is a syndicated repost from the WWT Company Blog. The original post can be found here. The Everpure FlashBlade and why the need for a new design The original FlashBlade was released in 2016 and was the first of its kind, delivering an all-flash solution for unstructured data, which had long been served by the spinning-disk market. With the exponential growth of unstructured data, Everpure (formerly Pure Storage) updated the FlashBlade design with a modular approach in 2022 called the FlashBlade//S that allowed compute blades to scale independently from the storage by using their DirectFlash Modules (DFMs) instead of the NAND chips being soldered onto each blade as was done in the first generation of the FlashBlade design. Despite the hardware changes, the heart of the solution (Purity//FB software) still attains phenomenal performance by using a Key-Value database as the metadata engine. In fact, the latest testing shows that a single FlashBlade//S chassis can support 3.5 trillion objects in about 100 MB of metadata space. The FlashBlade//S solution scales to 10 chassis (100 blades) and is well-suited for many AI storage use cases, such as data ingest and model training. As AI Dataset sizes increase into the petabytes, and the number of GPUs used for training and inferencing grows into the tens of thousands, the FlashBlade//S architecture doesn't scale as efficiently and economically to meet the needs; thus, the FlashBlade//EXA was born in 2025, which expanded the FlashBlade//S architecture by separating the data storage from the metadata operations. //EXA Architecture In traditional High Performance Computing and AI environments, storage systems that incorporate parallel filesystems have been dominant due to their performance, but they are also very difficult to install and complex to manage. 
With the maturity of parallel NFS (pNFS), we are seeing more vendors offering pNFS solutions because of the similar performance it delivers without all the extra complexity. FlashBlade//EXA utilizes pNFS in its new disaggregated storage architecture, pairing one or more FlashBlade//S500 chassis as Metadata Nodes (MN) with commodity rack servers filled with SSDs as Data Nodes (DN). This allows you to scale and size the solution based on your performance and capacity needs. How do data flow and client connections work in this new design? I'm glad you asked. When a client initiates a read or write operation, it establishes a parallel NFS (pNFS) connection to the MN. The MN acts as an "air traffic controller", redirecting the client to the appropriate DNs serving the File System for a direct access connection via the blazing-fast NFSv3 over RDMA protocol. Meanwhile, the MN(s) and DN(s) are in constant communication behind the scenes, handling file system creation and updating the metadata key-value store to keep track of where the data resides across the DNs. This architecture is purpose-built for high throughput and parallel access, ensuring that neither the metadata operations nor data access becomes a bottleneck. The result of this architecture change is a high-performance, scale-out storage solution built for modern data needs. The updated design provides significant parallelism, high throughput, and the flexibility to handle both AI and HPC workloads. As metadata requirements change, customers can simply scale the FlashBlade//S cluster from 1 to 10 chassis, with each chassis supporting up to 10 blades, while still utilizing a single virtual interface port (VIP) connection that spreads the load across the cluster to utilize all the blades efficiently. As capacity needs change, simply add more DNs (up to 1000) with the SSD capacities and quantities required to meet your needs. 
The MNs, DNs and clients are all connected via 400 Gb network switches for low-latency, high-throughput connectivity while limiting the number of cables used to simplify the installation process.

Installation

Historically, Everpure's hardware appliances (FlashArray and FlashBlade) have always been just that: appliances. Simply rack the gear, connect the cables, copy the desired software version from a USB drive, and run through the setup wizard. Within a few hours, the array would be ready to provision storage and allow client connections. In the ATC, we've installed numerous FlashArrays and FlashBlades for customer evaluations and can testify that the installation process is straightforward and quick. The FlashBlade//S (a.k.a. MN) installation was what we were used to. The recommended software version was installed on the External Fabric Modules (XFMs); we then connected the FlashBlade chassis cables to the XFMs, the software was pushed to each of the blades, and we ran through the setup wizard to complete the base install steps and make the system reachable across the network. It's worth noting that any time you open up your ecosystem to use commodity servers in the design, there are going to be new challenges and growing pains around installation, configuration, and management. And the responsibility for preventing unauthorized access and securing out-of-band management falls to the customer, as it's no longer a hardened appliance. This was a new experience for us with Everpure: we went into this with an appliance mentality, forgetting that this design brings software-defined storage (SDS) characteristics to installation and ongoing maintenance. Note - while storage appliances typically incorporate all the firmware, drivers, and software updates as part of the upgrade process, those ongoing maintenance steps are separate tasks in the SDS approach and need to be managed by the team(s) responsible for the hardware. 
As it relates to management, every OEM's out-of-band management interface is different, some better than others, and each requires trial and error to get right, both in the cables/adapters used and in the settings required to make a successful connection to remotely manage the device. With all that said, the rack server (a.k.a. DN) installation was not simple or quick, but that's the beauty of the AIPG: allow WWT to iron out the kinks and prove out the steps required to make things work together, all while reducing time and risk for the entire process. The deployment in our lab sandbox consisted of a Linux management VM that runs the FlashBlade//EXA Services Container. This Services Container provides TFTP & DHCP services, a repository for installation files and scripts, and a Prometheus and Grafana instance for ongoing monitoring of the Data Nodes' performance. This is also where maintenance tasks on the DNs, such as disk replacements, are initiated. While this was only a small 8 DN configuration, we wanted to treat it as if it were a 100-, 500- or even 1,000-node install to get an idea of what a customer would expect during the installation process. While we could have simply copied the installation files and software to a USB drive to plug in locally to each server, we used the provided automation scripts and steps for the installation process by having the DNs boot over the network to load the software and configuration files from the management VM. This meant we needed to configure out-of-band networking on the DNs and change the BIOS to allow network booting. Next, we captured the MAC addresses of each server's onboard NICs to set up DHCP reservations and node names that would be used in the FB//EXA deployment. Finally, we configured the DHCP options to direct the DNs to the TFTP server running on the Linux VM. After a few attempts and a couple of tweaks with our management network setup, we were able to start the DN installation. 
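As a rough illustration of the reservation step, the sketch below turns a list of captured MAC addresses into DHCP host entries that pin each node to a name and address and point it at a TFTP server. The post does not say which DHCP server the Services Container runs, so the output format is an assumption: it uses ISC dhcpd `host` stanza syntax. Every node name, MAC, IP address, and boot filename here is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    name: str  # node name to reserve (hypothetical)
    mac: str   # MAC of the onboard NIC used for network boot
    ip: str    # address to hand out to this node


def dhcpd_host_stanza(node: DataNode, tftp_server: str, boot_file: str) -> str:
    """Render one ISC dhcpd 'host' reservation for a network-booting node."""
    return (
        f"host {node.name} {{\n"
        f"  hardware ethernet {node.mac.lower()};\n"
        f"  fixed-address {node.ip};\n"
        f'  option host-name "{node.name}";\n'
        f"  next-server {tftp_server};\n"   # TFTP server to boot from
        f'  filename "{boot_file}";\n'      # boot file requested over TFTP
        f"}}\n"
    )


# Placeholder inventory: in practice this comes from the captured MAC list.
nodes = [
    DataNode("dn01", "AA:BB:CC:00:00:01", "192.0.2.11"),
    DataNode("dn02", "AA:BB:CC:00:00:02", "192.0.2.12"),
]
config = "\n".join(
    dhcpd_host_stanza(n, tftp_server="192.0.2.5", boot_file="boot.efi")
    for n in nodes
)
print(config)
```

Generating the entries from one inventory list, instead of typing them by hand, is what keeps an 8-node lab install and a 1,000-node install on the same procedure.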
The upside of troubleshooting new installations is that you really get to learn the product and how things work under the covers, and you get to collaborate with the OEMs so they can update their install docs and environment prerequisites to help customers avoid the same challenges in the future. In our experience, no two environments are the same; they are all configured a little differently and use different switch models and OEMs. With the base setup and deployment complete, it was time to configure the solution. At the time of our testing, the Viking VSS2320 was the only supported server model, as it provides hardware-based redundancy for high availability (HA) by allowing each server controller in the 2RU chassis to connect to all installed SSDs. In the event of a server failure, the remaining server can take over access to the drives and the data they contain. In a future software release, resiliency will be handled by software-based erasure coding, which will remove the hardware requirement for HA and allow additional server OEMs and models to be supported.

Configuration

FB//EXA

With the Purity//DN image installed on the DNs, a few tasks remained before we could join them to the MN. For each DN, we needed to run a command to format the DN's internal storage (local NVMe drives), then another command to run a health check. Once all the DNs were in a healthy state, the last couple of steps were done via an SSH session to the MN to create the first Node Group and add the DNs to it. Note - In a large-scale FB//EXA deployment, there may be a need for multiple Node Groups (e.g., different departments or multi-tenancy), and a DN can belong to multiple Node Groups. We started with only 6 DNs in the group and later added 2 more, as shown in the image below. In the release we tested, there is no DN rebalancing of the data, which is why DNs 9/10 show less consumed data. 
And in case you are wondering, DNs 1/2 needed a firmware update at the time of the Node Group creation and will be used for future customer POCs. At this point, the system was ready to have a File System created. This step consisted of associating the File System with a single Node Group, specifying the size of the File System, and providing a name, all of which was done through a single command. The only thing left to configure was the protocols enabled for the File System and the rules & policies for who can access the network share.

Clients

On the client side, we used two high-performance servers with GPUs and 2 x 400 Gb network cards running an Ubuntu OS. There are only a few requirements related to BGP and RoCEv2 networking that need to be configured, so we installed the standard FRRouting package on the clients, enabling bgpd and configuring the service. Note - FlashBlade//EXA utilizes a common layer 3 Border Gateway Protocol (BGP) network designed for performance and efficiency, along with Remote Direct Memory Access (RDMA) that is optimized for high speed and low latency. The dual 400 Gb ConnectX network ports were then configured with the correct Priority Flow Control and DSCP mapping settings to support RoCEv2. Finally, to complete the configuration phase of the install, we installed the Everpure-provided "nfs-client-pure-dkms" Linux package, which optimizes the Linux kernel NFS client.

sudo apt install ./nfs-client-pure-dkms_1.0_amd64.deb

Testing

With the File System created on the FB//EXA and the clients configured, we were ready to start the testing. All that was left to do was mount the File System on the clients using the mount command below, which specifies a single MN VIP and the File System. A single VIP is enough because the FlashBlade//S automatically load balances connections across all the available blades internally. 
sudo mount -t nfs -o vers=4.1,proto=tcp,nconnect=16 <data_vip>:<filesystem> /mnt/nfs

Note - the mount command specifies the file system type of NFS, with options for NFS version 4.1 and nconnect=16 to establish multiple TCP connections to the VIP.

Here's where things got fun. During baseline synthetic testing, FlashBlade//EXA achieved near line-rate performance on a single client with dual 400 Gb ConnectX adapters. In a 100% read workload, aggregate throughput of the two 400 Gb NICs reached 781 Gb/s (97.65 GB/s), effectively saturating the available 800 Gb/s of network bandwidth on a single client. In a 100% write workload test using a 512k block size, a single client with two 400 Gb NICs averaged a sequential write throughput of 83 GB/s (77.3 GiB/s). As we added a second client with the same hardware specs to the mix, latency remained consistently low, and throughput scaled linearly across our tests.

[Figure: 100% write across 2 clients, each with 2 x 400 Gb/s NICs]

In the end, we found that client-side networking was the bottleneck in our lab setup. The FB//EXA did a great job of balancing metadata operations across the blades and spreading read/write operations across the DNs that serviced the file system presented to clients. Our best guess is that it would take 8-10 clients, each with 2 x 400 Gb NICs, to saturate the network connections to the 8 DNs in our setup. Power requirements are another important factor to consider. While idle, the solution consumed about 5-6 kW of power. With two clients driving load, the FB//EXA solution consumed approximately 8.5 kW during sustained write tests and about 7.2 kW during sustained read tests.

Summary

In closing, FlashBlade//EXA is fast and made a strong impression on our AI Proving Ground team. 
From the disaggregated design to the simple client setup, it's a solid choice for anyone needing serious storage horsepower, especially if you want to spend more time running workloads and less time tinkering. And because FlashBlade//EXA runs the same Purity//FB operating system, the learning curve will be short for those already familiar with FlashBlade's UI. We're excited to collaborate with our customers as they explore use cases that require FB//EXA-level performance and future enhancements as the product evolves. Our initial impression is that this platform truly delivers on its promises for today's data-driven environments.

Are you ready to evaluate FB//EXA for your demanding AI and HPC workloads? Let our AIPG teams help de-risk and accelerate decision-making for your next-generation, high-performance storage needs.

AI Proving Ground in the ATC

WWT's Advanced Technology Center (ATC) is a state-of-the-art facility that allows customers, partners, and employees to explore, test, and validate technology solutions in a collaborative environment. The AI Proving Ground (AIPG) is an initiative to develop, test, and implement artificial intelligence solutions within the ATC. The AIPG enables AI technologies to be explored, validated, and demonstrated in real-world scenarios, allowing organizations to assess the capabilities and potential of AI solutions before deploying them at scale.

NFS over TLS on FlashArray (Purity//FA 6.10.6)
Purity//FA 6.10.6 introduces NFS over TLS for FlashArray File Services: an in-transit encryption layer that wraps NFSv3 and NFSv4.1 RPC traffic in a TLS 1.3 session as defined by RFC 9289 - Towards Remote Procedure Call Encryption By Default. Server authentication is mandatory, and mutual TLS (mTLS) is available as an optional second factor.

This post is a technical feature description plus a minimum viable configuration walkthrough. It assumes you are already comfortable with FlashArray File Services (file servers, exports, policies) and Linux NFS clients.

What the feature actually is

- Transport encryption for NFS: NFSv3 and NFSv4.1 RPC traffic is carried inside a TLS 1.3 record layer over TCP/2049. No NFS-level changes; applications and mount paths stay the same.
- Server authentication: the FlashArray presents an X.509 certificate; the client validates it against its own trust store. Server certificates must include the file-server VIF in the SAN.
- Optional mTLS: the array can require and verify a client certificate against a configured trusted CA (single certificate or a certificate group).
- Per-server policy: TLS configuration is a first-class tls policy attached to a specific file server, not a global toggle.

End-to-end data path

[Figure: NFS over TLS data path. tlshd on the client performs the TLS handshake against the FlashArray; the resulting session encrypts all subsequent NFS traffic on the established connection.]

Building blocks on the FlashArray

The feature is exposed as a new tls policy type that ties together three existing concepts: certificates (imported or self-signed), the tls-policy object, and a file server. The policy holds the appliance certificate, the TLS version/cipher constraints, the protocols TLS is enforced for, and (optionally) the trusted CA used to authenticate clients.

TLS versions and cipher suites

NFS over TLS on FlashArray negotiates TLS 1.3 for the NFS data path.
The tls-policy object accepts --minimum-tls-version values of 1.2 or 1.3, but that minimum is a floor, not a contract: for NFS the negotiated version will always be 1.3. The default TLS 1.3 cipher set is:

- TLS_AES_256_GCM_SHA384
- TLS_CHACHA20_POLY1305_SHA256
- TLS_AES_128_GCM_SHA256 (mandatory per RFC 8446)

On clients with AES-NI, TLS_AES_256_GCM_SHA384 is the natural choice. TLS_CHACHA20_POLY1305_SHA256 is the cipher to prefer on clients without AES hardware acceleration.

NFS protocol versions and mount options

Both NFSv3 and NFSv4.1 are supported. The Linux client opts into TLS at mount time via the xprtsec option, mediated by tlshd:

| Option | Meaning |
| --- | --- |
| xprtsec=tls | One-way TLS, server authentication only |
| xprtsec=mtls | Mutual TLS - client also presents a certificate |
| vers=4.1 / vers=3 | NFS protocol version |

Prerequisites

- FlashArray: Purity//FA 6.10.6 or later, with at least one configured file server.
- Client OS: a recent Linux distribution with NFS-over-TLS support (e.g. Rocky Linux 10), including nfs-utils, tlshd and openssl (for certificate handling).
- Certificates: a server certificate signed by a CA the client trusts; if there is no proper DNS record set up, the certificate must include the file-server IP Address in its subjectAltName. For mTLS, a client certificate signed by a CA the array trusts.

Configuration walkthrough

This is the minimum sequence to land an encrypted NFS mount. Replace IPs, names and certificate paths to taste. If you don't yet have a CA to issue the server (and, for mTLS, client) certificate from, see the test-CA appendix at the end of this post.

1. FlashArray - import the appliance certificate

    # on the FlashArray CLI - interactive paste of key, then certificate
    purecert imported create nfs-server-cert --key

    # for mTLS only: import the CA used to sign client certificates
    purecert imported create nfs-client-ca

2. FlashArray - create a TLS policy

Server-auth-only policy:

    purepolicy tls create nfs-tls-policy \
      --appliance-certificate nfs-server-cert \
      --tls-enforced-for nfs

mTLS variant - require the client to present a certificate and verify it against a trusted CA (the trusted CA argument accepts either a single certificate or a certificate_group):

    purepolicy tls create nfs-mtls-policy \
      --appliance-certificate nfs-server-cert \
      --tls-enforced-for nfs \
      --client-certificates-required \
      --client-certificate-trust-verify-enabled \
      --trusted-client-ca-certificate nfs-client-ca

Optional version / cipher tuning:

    purepolicy tls setattr nfs-tls-policy --minimum-tls-version 1.3
    purepolicy tls setattr nfs-tls-policy \
      --enabled-tls-ciphers TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256
    purepolicy tls list --effective

3. FlashArray - attach the policy to a file server

    pureserver list
    purepolicy tls add nfs-tls-policy --server your-file-server
    purepolicy tls list --member

Once the policy is attached, the file server starts requiring TLS for any new NFS connection on that VIF. Existing unencrypted sessions are not renegotiated or dropped on policy change - clients must remount or restart their NFS service to pick up the new requirements. The same caveat applies when removing or rotating the trusted client CA.

4. FlashArray - create the export (unchanged from regular NFS)

    purefs create your-filesystem
    puredir create your-filesystem:your-managed-dir
    purepolicy nfs create your-nfs-policy
    purepolicy nfs rule add your-nfs-policy \
      --client "*" --no-root-squash --rw --version nfsv3,nfsv4
    puredir export create your-export \
      --dir your-filesystem:your-managed-dir \
      --policy your-nfs-policy \
      --server your-file-server

The export must live on the same file server that the TLS policy is attached to (note the --server argument).

5. Linux client - install and configure tlshd

    dnf install -y nfs-utils ktls-utils
    systemctl enable --now tlshd
    mkdir -p /etc/pki/nfs
    cp ca.crt /etc/pki/nfs/ca.crt
    chmod 644 /etc/pki/nfs/ca.crt

Minimal /etc/tlshd.conf for server-only TLS:

    [debug]
    loglevel=1
    tls=1
    nl=0

    [authenticate]

    [authenticate.client]
    x509.truststore=/etc/pki/nfs/ca.crt

    [authenticate.server]

For mTLS, add the client identity:

    [authenticate.client]
    x509.certificate=/etc/pki/nfs/client.crt
    x509.private_key=/etc/pki/nfs/client.key
    x509.truststore=/etc/pki/nfs/ca.crt

Restart tlshd after any change: systemctl restart tlshd.

6. Mount

    # server authentication only
    mount -t nfs -o vers=4.1,xprtsec=tls,rw \
      10.0.0.100:/your-export /mnt/nfs-tls

    # mutual TLS
    mount -t nfs -o vers=4.1,xprtsec=mtls,rw \
      10.0.0.100:/your-export /mnt/nfs-mtls

    # verify
    mount | grep xprtsec

What the wire actually looks like

Connection bring-up: AUTH_TLS probe per RFC 9289 → TLS 1.3 handshake brokered by tlshd → encrypted NFS traffic on the same TCP connection.

Operational notes

- Policy changes are not retroactive. Tightening a policy (turning TLS on, switching to mTLS, removing a cipher in use) does not drop or renegotiate existing connections. Affected clients need to remount or restart NFS. The same applies to CA removal/expiry.
- Server certificate must carry the VIF in SAN. Without a matching subjectAltName entry the client refuses the certificate; a common symptom is a mount failure with Protocol not supported and tlshd logging a verification error.
- NFSv4.1 connection reuse amortises the handshake cost across many operations; NFSv3 mounts re-do the handshake more often, so the relative cost is higher on connection churn.
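Because policy changes are not retroactive, it can help to check which NFS mounts on a client actually carry an xprtsec option. The following is a minimal sketch, not part of the product tooling: it parses text in the /proc/mounts layout (device, mount point, type, options) and reports the xprtsec value per NFS mount. The function names and the sample lines are hypothetical, and it assumes the kernel lists xprtsec among the mount options, which is what the `mount | grep xprtsec` check above relies on.

```python
from typing import Dict, Optional

# File system types treated as NFS mounts in this sketch.
NFS_TYPES = {"nfs", "nfs4"}


def xprtsec_mode(options: str) -> Optional[str]:
    """Return the xprtsec value from a comma-separated option string, or None."""
    for opt in options.split(","):
        key, _, value = opt.partition("=")
        if key == "xprtsec":
            return value
    return None


def audit_mounts(mounts_text: str) -> Dict[str, Optional[str]]:
    """Map each NFS mount point to its xprtsec mode (None means no xprtsec option)."""
    result: Dict[str, Optional[str]] = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected layout: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in NFS_TYPES:
            continue
        result[fields[1]] = xprtsec_mode(fields[3])
    return result


if __name__ == "__main__":
    # Hypothetical sample lines; on a real client, pass the content of /proc/mounts.
    sample = (
        "10.0.0.100:/your-export /mnt/nfs-tls nfs4 rw,vers=4.1,xprtsec=tls 0 0\n"
        "10.0.0.100:/legacy /mnt/legacy nfs rw,vers=3 0 0\n"
    )
    for mount_point, mode in audit_mounts(sample).items():
        print(f"{mount_point}: {mode or 'no xprtsec option'}")
```

A mount reported without an xprtsec option is a candidate for the remount described in the operational notes; the sketch only reads option strings and does not inspect the live connection.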
Troubleshooting cheat sheet

Symptom: mount.nfs: Connection refused
  Likely cause: Policy enforces TLS, client mounts plain NFS, or tlshd not running
  First thing to check: systemctl status tlshd ; add xprtsec=tls

Symptom: access denied by server while mounting
  Likely cause: mTLS client cert missing/untrusted, or export rule mismatch
  First thing to check: journalctl -u tlshd -n 100 ; puredir export list

Symptom: Protocol not supported
  Likely cause: Server certificate SAN does not include the mounted IP, or CA not trusted
  First thing to check: openssl x509 -in server.crt -text -noout | grep -A1 "Subject Alternative Name"

Useful diagnostics on the client:

    journalctl -u tlshd -f
    sysctl -w sunrpc.rpc_debug=0x7fff sunrpc.nfs_debug=0x7fff
    tcpdump -i any -nn -v 'host <file-server-ip> and port 2049' -w /tmp/nfs-tls.pcap
    # remember to restore:
    sysctl -w sunrpc.rpc_debug=0 sunrpc.nfs_debug=0

Appendix: a throwaway CA for testing

For lab and PoC work it is much more useful to stand up a tiny local CA than to hand out self-signed certs. The workflow mirrors what you would do with a real PKI - the array trusts a CA, that CA signs the appliance certificate, and (for mTLS) the same or a different CA signs each client certificate. Everything below is for non-production use; do not reuse these keys anywhere you care about.

Set a couple of variables to keep the commands short:

    mkdir -p ~/nfs-tls-ca && cd ~/nfs-tls-ca
    VIP=10.0.0.100                       # file-server VIP the client will mount
    FQDN=nfs.lab.example.com             # optional DNS name for the same VIP
    CLIENT_CN=client01.lab.example.com   # only needed for mTLS

1. Root CA

    # 4096-bit RSA root, valid 10 years
    openssl genrsa -out ca.key 4096
    openssl req -x509 -new -nodes -key ca.key -sha256 -days 3650 \
      -subj "/CN=NFS-TLS Lab Root CA/O=Lab" \
      -out ca.crt

    # inspect
    openssl x509 -in ca.crt -noout -subject -issuer -dates

2. Appliance (server) certificate

The server certificate must include the file-server VIP in subjectAltName; without it the client refuses the certificate during handshake. Add the FQDN as well if you have DNS for it.
    openssl genrsa -out server.key 2048
    openssl req -new -key server.key \
      -subj "/CN=${FQDN}/O=Lab" \
      -addext "subjectAltName=DNS:${FQDN},IP:${VIP}" \
      -out server.csr

    cat > server.ext <<EOF
    basicConstraints = CA:FALSE
    keyUsage = digitalSignature, keyEncipherment
    extendedKeyUsage = serverAuth
    subjectAltName = DNS:${FQDN},IP:${VIP}
    EOF

    openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
      -out server.crt -days 825 -sha256 -extfile server.ext

    # verify the chain and the SAN
    openssl verify -CAfile ca.crt server.crt
    openssl x509 -in server.crt -noout -ext subjectAltName

Import this pair into the FlashArray as nfs-server-cert and reference it from the TLS policy as --appliance-certificate:

    # key first, then certificate, when prompted
    purecert imported create nfs-server-cert --key

3. Client certificate (mTLS only)

    openssl genrsa -out client.key 2048
    openssl req -new -key client.key \
      -subj "/CN=${CLIENT_CN}/O=Lab" \
      -out client.csr

    cat > client.ext <<EOF
    basicConstraints = CA:FALSE
    keyUsage = digitalSignature
    extendedKeyUsage = clientAuth
    EOF

    openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
      -out client.crt -days 825 -sha256 -extfile client.ext
    openssl verify -CAfile ca.crt client.crt

4. What goes where

| File | Goes to | Used as |
| --- | --- | --- |
| server.key + server.crt | FlashArray ( purecert imported create nfs-server-cert ) | TLS policy --appliance-certificate |
| ca.crt (for mTLS) | FlashArray ( purecert imported create nfs-client-ca ) | TLS policy --trusted-client-ca-certificate |
| ca.crt | NFS client ( /etc/pki/nfs/ca.crt ) | tlshd truststore ( x509.truststore ) |
| client.key + client.crt (for mTLS) | NFS client ( /etc/pki/nfs/ ) | tlshd client identity ( x509.private_key , x509.certificate ) |

From here, finish with the Configuration walkthrough steps above: create the TLS policy, attach it to the file server, create the export, configure tlshd, mount with xprtsec=tls or xprtsec=mtls.
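Since a missing SAN entry is the most common reason for a refused certificate, a small pre-flight check can save a failed mount. The sketch below is illustrative only: it takes the text printed by `openssl x509 -in server.crt -noout -ext subjectAltName` and tests whether the address or name you intend to mount is listed. The function names are hypothetical, the check assumes OpenSSL's usual "DNS:..., IP Address:..." output format, and it deliberately ignores wildcard DNS entries.

```python
import ipaddress
from typing import List, Tuple


def parse_san(san_text: str) -> List[Tuple[str, str]]:
    """Parse subjectAltName text into (kind, value) pairs, e.g. ("DNS", "nfs.lab.example.com")."""
    entries: List[Tuple[str, str]] = []
    for line in san_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "X509v3 Subject Alternative Name:" header.
        if not line or line.startswith("X509v3"):
            continue
        for item in line.split(","):
            kind, sep, value = item.strip().partition(":")
            if sep:
                entries.append((kind.strip(), value.strip()))
    return entries


def san_covers(san_text: str, target: str) -> bool:
    """Return True if target (IP address or DNS name) appears in the SAN entries."""
    try:
        target_ip = ipaddress.ip_address(target)
    except ValueError:
        target_ip = None  # not an IP literal, so treat it as a DNS name

    for kind, value in parse_san(san_text):
        if target_ip is not None and kind == "IP Address":
            try:
                if ipaddress.ip_address(value) == target_ip:
                    return True
            except ValueError:
                continue
        elif target_ip is None and kind == "DNS":
            # Exact, case-insensitive match only; wildcards are not handled here.
            if value.lower() == target.lower():
                return True
    return False


if __name__ == "__main__":
    # Sample text in the shape OpenSSL prints for the lab certificate above.
    san = (
        "X509v3 Subject Alternative Name: \n"
        "    DNS:nfs.lab.example.com, IP Address:10.0.0.100\n"
    )
    for candidate in ("10.0.0.100", "nfs.lab.example.com", "10.0.0.101"):
        print(candidate, "covered" if san_covers(san, candidate) else "NOT covered")
```

The point of comparing parsed IP objects rather than strings is that the same address can be written in more than one textual form, especially for IPv6.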
References

- RFC 9289 - Towards Remote Procedure Call Encryption By Default
- RFC 8446 - TLS 1.3
- tlshd(8) and tlshd.conf(5) manual pages
- Everpure FlashArray File Services administration guide (Purity//FA 6.10.6)

FlashArray File Multi-Server
File support on FlashArray gets another highly demanded feature. With version 6.8.7, Purity introduces the concept of a Server, which connects exports, directory services, and the other objects this setup requires, namely DNS configuration and networking. From this version onwards, every directory export is associated with exactly one server.

To recap, a server has associations to the following objects:

- DNS
- Active Directory / Directory Service (LDAP)
- Directory Export
- Local Directory Service

Local Directory Service is another new entity introduced in version 6.8.7; it represents a container for Local Users and Groups. Each server has its own Local Directory Service (LDS) assigned to it, and each LDS also has a domain name, which means "domain" is no longer the hardcoded name of a local domain but a user-configurable option.

All of these statements imply many changes in user experience. Fortunately, most of them amount to adding a reference or the ability to link a server, and the GUI now contains a Server management page, including a Server details page, which puts everything together and makes a Server configuration easy to understand, validate and modify.

One question you might be asking right now is: can I use File services without Servers? The answer is no, not really. But don't be alarmed. Significant effort has been made to keep all commands and flows backwards compatible, so unless a script parses exact output and needs to be adjusted because a new "Server" column has been added, there should not be any need to change them. How did we manage to do that? A special Server called _array_server has been created, and if your configuration has anything file related, it will be migrated to it during the upgrade.
Let me also offer a taste of what the configuration could look like once the array is updated to the latest version.

List of Servers

    # pureserver list
    Name           Dns         Directory Services  Local Directory Service  Created
    _array_server  management  -                   domain                   2025-06-09 01:00:26 MDT
    prod           prod        -                   prod                     2025-06-09 01:38:14 MDT
    staging        management  stage               staging                  2025-06-09 01:38:12 MDT
    testing        management  testing             testing                  2025-06-09 01:38:11 MDT

List of Active Directory accounts

Since we can join multiple AD servers, we can now have multiple AD accounts, up to one per server.

    # puread account list
    Name           Domain            Computer Name  TLS       Source
    ad-array       <redacted>.local  ad-array       required  -
    prod::ad-prod  <redacted>.local  ad-prod        required  -

ad-array is the configuration for the _array_server; for backwards compatibility reasons, the server-name prefix hasn't been added to it. The prefix is there for the account connected to server prod (and for any other server).

List of Directory Services (LDAP)

Directory services were also slightly reworked, since before 6.8.7 there were only two configurations, management and data. Obviously, that's not enough for more than one server (management is reserved for array management access and can't be used for File services). After the 6.8.7 release, it's possible to fully manage Directory Service configurations and link them to individual servers.

    # pureserver list
    Name           Dns         Directory Services  Local Directory Service  Created
    _array_server  management  -                   domain                   2025-06-09 01:00:26 MDT
    prod           prod        -                   prod                     2025-06-09 01:38:14 MDT
    staging        management  stage               staging                  2025-06-09 01:38:12 MDT
    testing        management  testing             testing                  2025-06-09 01:38:11 MDT

Please note that these objects are intentionally not enabled / not configured.
List of Directory exports

    # puredir export list
    Name                       Export Name  Server   Directory                     Path  Policy                  Type  Enabled
    prod::smb::accounting      accounting   prod     prodpod::accounting:root      /     prodpod::smb-simple     smb   True
    prod::smb::engineering     engineering  prod     prodpod::engineering:root     /     prodpod::smb-simple     smb   True
    prod::smb::sales           sales        prod     prodpod::sales:root           /     prodpod::smb-simple     smb   True
    prod::smb::shipping        shipping     prod     prodpod::shipping:root        /     prodpod::smb-simple     smb   True
    staging::smb::accounting   accounting   staging  stagingpod::accounting:root   /     stagingpod::smb-simple  smb   True
    staging::smb::engineering  engineering  staging  stagingpod::engineering:root  /     stagingpod::smb-simple  smb   True
    staging::smb::sales        sales        staging  stagingpod::sales:root        /     stagingpod::smb-simple  smb   True
    staging::smb::shipping     shipping     staging  stagingpod::shipping:root     /     stagingpod::smb-simple  smb   True
    testing::smb::accounting   accounting   testing  testpod::accounting:root      /     testpod::smb-simple     smb   True
    testing::smb::engineering  engineering  testing  testpod::engineering:root     /     testpod::smb-simple     smb   True
    testing::smb::sales        sales        testing  testpod::sales:root           /     testpod::smb-simple     smb   True
    testing::smb::shipping     shipping     testing  testpod::shipping:root        /     testpod::smb-simple     smb   True

The notable change here is that the Export Name and the Name now have slightly different meanings. Pre-6.8.7 versions used the Export Name as a unique identifier, since there was a single (implicit, now explicit) server, which naturally created a scope. Now, the Export Name can repeat as long as it's unique within the scope of a single server, as seen in this example. The Name is different and provides an array-unique export identifier. It is a combination of the server name, the protocol name and the export name.
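To make the naming scheme concrete, here is a small sketch of how the array-unique Name relates to the Export Name, based only on the pattern visible in the listing above (server, protocol and export name joined by "::"). The class and function names are hypothetical and are not a Purity API; the sketch also does not try to model exports on _array_server, whose naming is not shown here.

```python
from typing import NamedTuple

SEPARATOR = "::"  # separator observed in the listing above


class Export(NamedTuple):
    """Illustrative record for one directory export (not a Purity object)."""
    server: str
    protocol: str
    export_name: str

    @property
    def name(self) -> str:
        """Array-unique identifier: server, protocol and export name joined by '::'."""
        return SEPARATOR.join((self.server, self.protocol, self.export_name))


def parse_name(name: str) -> Export:
    """Split a Name such as 'prod::smb::accounting' back into its three parts."""
    server, protocol, export_name = name.split(SEPARATOR, 2)
    return Export(server, protocol, export_name)


if __name__ == "__main__":
    # The same Export Name on three servers yields three distinct Names.
    exports = [Export(server, "smb", "accounting") for server in ("prod", "staging", "testing")]
    for export in exports:
        print(f"{export.name:28} export name: {export.export_name}")
    print(parse_name("staging::smb::sales"))
```

Scripts that previously keyed on the Export Name alone would, under this reading, need to key on the Name (or on the server plus Export Name) once more than one server exists.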
List of Network file interfaces

    # purenetwork eth list --service file
    Name     Enabled  Type  Subnet  Address  Mask  Gateway  MTU   MAC                Speed     Services  Subinterfaces  Servers
    array    False    vif   -       -        -     -        1500  56:e0:c2:c6:f2:1a  0.00 b/s  file      -              _array_server
    prod     False    vif   -       -        -     -        1500  de:af:0e:80:bc:76  0.00 b/s  file      -              prod
    staging  False    vif   -       -        -     -        1500  f2:95:53:3d:0a:0a  0.00 b/s  file      -              staging
    testing  False    vif   -       -        -     -        1500  7e:c3:89:94:8d:5d  0.00 b/s  file      -              testing

As seen above, File network VIFs now reference a specific server. (This list is rather artificial, since none of the interfaces is properly configured or enabled; the main message is that a File VIF now "points" to a specific server.)

Local Directory Services

Local Directory Service (LDS) is a newly introduced container for Local Users and Groups.

    # pureds local ds list
    Name     Domain
    domain   domain
    testing  testing
    staging  staging.mycorp
    prod     prod.mycorp

As already mentioned, all local users and groups now have to belong to an LDS, which means their management output also contains that information.

    # pureds local user list
    Name           Local Directory Service  Built In  Enabled  Primary Group   Uid
    Administrator  domain                   True      True     Administrators  0
    Guest          domain                   True      False    Guests          65534
    Administrator  prod                     True      True     Administrators  0
    Guest          prod                     True      False    Guests          65534
    Administrator  staging                  True      True     Administrators  0
    Guest          staging                  True      False    Guests          65534
    Administrator  testing                  True      True     Administrators  0
    Guest          testing                  True      False    Guests          65534

    # pureds local group list
    Name              Local Directory Service  Built In  Gid
    Audit Operators   domain                   True      65536
    Administrators    domain                   True      0
    Guests            domain                   True      65534
    Backup Operators  domain                   True      65535
    Audit Operators   prod                     True      65536
    Administrators    prod                     True      0
    Guests            prod                     True      65534
    Backup Operators  prod                     True      65535
    Audit Operators   staging                  True      65536
    Administrators    staging                  True      0
    Guests            staging                  True      65534
    Backup Operators  staging                  True      65535
    Audit Operators   testing                  True      65536
    Administrators    testing                  True      0
    Guests            testing                  True      65534
    Backup Operators  testing                  True      65535

Conclusion

I showed what the FA configuration might look like without going into much detail about how to actually configure or test these settings. Still, this article should provide a good overview of what to expect from version 6.8.7. There is plenty of information about this particular aspect of the release in the updated product documentation. Please let me know if there is any demand for a deep dive into any aspect of this feature.