Stop Blaming Storage: The Invisible Cost of Excessive Log Switches In Oracle Databases
Real-World Telemetry Analysis: Test 1 vs. Test 2

To understand how severe write volumes impact database latency, let us evaluate two distinct test profiles running the exact same heavy transactional workload. These profiles highlight the staggering volume of log writer activity occurring under typical enterprise applications:

Database Profile (Test 1): Sustaining an intensive write rate of 35,550,156.8 bytes per second (~33.90 MB/sec) of redo generation.
Database Profile (Test 2): Sustaining an even higher write rate of 40,691,343.8 bytes per second (~38.81 MB/sec) of redo generation.

A consistent generation rate of 34 MB/s to 39 MB/s is classified as a highly active, heavy write workload. If the underlying layout of the database's log files is structured using default or undersized parameters, this heavy transactional density forces a systemic collision point between logical software processing and physical disk checkpointing.

Reverse-Engineering Your Log Sizes from Switch Activity

Because physical redo log dimensions are structural layouts rather than configuration variables, they are not listed inside the Modified Parameters section of standard database diagnostic summaries (such as AWR reports). Instead, engineers must combine the sustained redo byte velocity with recorded switch intervals to uncover the current physical geometry using this model:

$$S_{\text{log}} = \frac{R_{\text{sec}} \times 3600}{N_{\text{switch}}}$$

Where $S_{\text{log}}$ represents the calculated current log size, $R_{\text{sec}}$ represents the redo byte velocity per second, and $N_{\text{switch}}$ represents the total number of log switches executed per hour.

Modeled Redo Layout Dimensions Based on Active Workloads

| Log Switches Observed / Hour | Test 1 Profile (33.90 MB/sec) | Test 2 Profile (38.81 MB/sec) | Engine State & Systemic Latency Impacts |
| 30 Switches / Hour (Every 2 minutes) | ~4,068 MB (4 GB) | ~4,657 MB (4.5 GB) | Continuous, aggressive database checkpointing. Disk queues are consistently saturated writing dirty blocks to datafiles. |
| 60 Switches / Hour (Every 1 minute) | ~2,034 MB (2 GB) | ~2,328 MB (2.3 GB) | Severe operational throttling. High threat of transaction processing freezes while the engine waits for space. |
| 120 Switches / Hour (Every 30 seconds) | ~1,017 MB (1 GB) | ~1,164 MB (1.1 GB) | Critical architectural failure point. Heavy occurrence of log file switch completion wait states. |

The Mechanics of a Log Switch Bottleneck

Why does a high log switch count destroy performance? It is crucial to understand what the Oracle database engine is forced to do behind the scenes every single time a log group fills up:

Forced Incremental Checkpointing: When a log switches, the database must advance its checkpoint. This forces the Database Writer processes (DBWn) to aggressively flush dirty data blocks from memory (the Buffer Cache) out to the permanent datafiles on disk to ensure crash-recovery safety.
Control File Serialization: The database must update its control files to record the new log sequence architecture. This introduces internal metadata synchronization locks (enqueues) that can cause user sessions to stall.
Archiver Contention: The Archiver background processes (ARCn) must instantly awake and begin reading the newly filled redo log to copy it to the archive destination. If the logs are small and switching every few seconds, the archivers cannot keep pace, completely locking the log writer (LGWR) out of the next group in the rotation.

The accumulation of these three internal operations manifests directly as elevated log file sync and foreground wait latencies. To an outside observer, it looks like the storage array is failing to write fast enough, but in reality, the database engine is choking on its own structural layout.
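The switch-rate model from the reverse-engineering section can be checked with a few lines of Python. This is a minimal sketch, not a tool from the article: the function names are made up for illustration, and it assumes "MB" means 1,048,576 bytes, which is what the article's own conversions imply. The same arithmetic, run in the other direction, gives the log size needed for a chosen interval between switches.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: MB here means 2**20 bytes


def implied_log_size_mb(redo_bytes_per_sec: float, switches_per_hour: float) -> float:
    """Redo written in one hour divided by the switches seen in that hour."""
    if switches_per_hour <= 0:
        raise ValueError("switches_per_hour must be positive")
    return redo_bytes_per_sec * 3600 / switches_per_hour / BYTES_PER_MB


def size_for_interval_mb(redo_bytes_per_sec: float, minutes_between_switches: float) -> float:
    """Log size that would fill in the given number of minutes at this redo rate."""
    if minutes_between_switches <= 0:
        raise ValueError("minutes_between_switches must be positive")
    return redo_bytes_per_sec * 60 * minutes_between_switches / BYTES_PER_MB


# Redo rates taken from the two test profiles (bytes per second).
profiles = {"Test 1": 35_550_156.8, "Test 2": 40_691_343.8}

for name, rate in profiles.items():
    for switches in (30, 60, 120):
        size = implied_log_size_mb(rate, switches)
        print(f"{name}: {switches:>3} switches/hour -> ~{size:,.0f} MB per log")
```

Feeding in a switch count taken from your own diagnostic report, instead of the three modeled values, gives an estimate of the current log size without querying the database.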
Sizing for the 20-Minute Target Window

To neutralize this threat, we apply standard best-practice mathematics to size the log allocations cleanly for a conservative, stable 20-minute operational window under the observed workloads:

Mathematical Formulation:

Test 1 Architecture Sizing: 33.90 MB/sec × 60 seconds = 2,034 MB/minute. For a 20-minute window: 2,034 MB × 20 minutes = 40,680 MB (~40 GB per log group).
Test 2 Architecture Sizing: 38.81 MB/sec × 60 seconds = 2,328.6 MB/minute. For a 20-minute window: 2,328.6 MB × 20 minutes = 46,572 MB (~46 GB per log group).

Sizing Standard: To provide a safe, cushioned operational margin during unpredicted transaction spikes, configuring an allocation of 40 GB to 48 GB per log group across a minimum of 4 to 5 log groups will completely iron out the checkpointing waves and restore a smooth, predictable processing flow.

DBA Command and Verification Track

To audit your live database environment immediately, run the following administrative query to verify your current log configuration and status:

SELECT GROUP#, THREAD#, SEQUENCE#, BYTES/1024/1024/1024 AS SIZE_GB, STATUS FROM V$LOG;

If this output returns sizes sitting at outdated, legacy defaults (such as 1 GB or 2 GB) while under modern, high-velocity workloads, you have found your hidden bottleneck. Correcting the redo allocation path will immediately relieve the artificial pressure on your data layer.

Quantifiable Database Performance Savings

The most profound impact of implementing best-practice redo log sizing is the immediate reclamation of database processing capacity.

Reclamation of Core Processing Time: Production environments can anticipate an immediate 15% to 20% savings in overall database processing time, particularly on nodes operating under synchronous replication frameworks.
Elimination of Forced Wait States: Diagnostic telemetry shows the database spends up to 20.65% of its total operational life completely frozen within log file sync events. While a portion of this is network transit overhead, a significant contributor is the engine constantly stalling to handle back-to-back log switches occurring multiple times per minute.
CPU Cycle Optimization: Transitioning to a stabilized footprint of 2 to 3 log switches per hour removes self-inflicted logical barriers, dropping the active wait-state percentages down and immediately returning vital CPU cycles back to active user transactions and application processing.

Targeted Systems and Subsystem Benefits

Correcting the redo allocation geometry triggers a positive cascade of efficiency across multiple independent layers of the database infrastructure ecosystem:

A. Storage I/O Optimization (Flattening the Checkpoint Waves)

Every time an individual redo log file reaches capacity and triggers a switch, Oracle mandates an aggressive incremental checkpoint. The Database Writer background processes (DBWn) are forced to violently halt standard operation to clear, prioritize, and flush "dirty" data blocks from the volatile Buffer Cache down to the permanent physical storage datafiles.

The Strategic Benefit: Instead of a chaotic, cyclic pattern where disk I/O heavily spikes and crashes every 30 to 60 seconds, the underlying storage fabric encounters a flattened, smooth, and highly predictable write curve. Physical disk queue depths drop significantly, completely removing artificial array-level performance chokes.

B. Elimination of Control File Enqueue Serialization

To cleanly finalize a log switch, the database engine must gain exclusive metadata locks to write updated sequence architectures directly into the database control files. When a misconfigured environment forces this action hundreds of times an hour, user sessions become trapped in an internal serialization traffic jam.

The Strategic Benefit: Scaling the logs ensures that control file metadata modification occurs only a few times per hour. This completely erases internal enqueue contention and prevents micro-stalls from propagating to foreground user processes.

C. Mitigation of Archiver Process (ARCn) Contention

Under high-velocity write workloads (~34 MB/s to 39 MB/s), undersized logs fill up substantially faster than the Archiver background processes (ARCn) can read and copy them to designated archive log destinations. If the archivers fall behind the pace of the log writer, the Log Writer (LGWR) will freeze all database processing because it is structurally prohibited from overwriting an unarchived log group.

The Strategic Benefit: Deploying 40 GB to 48 GB log groups builds a wide, stable, 20-minute processing window. This provides the ARCn processes ample buffer space to quietly copy data streams in the background without ever creating a risk of blocking active application transactions.

D. Stabilization of Application Response Uniformity

From an end-user and application integration perspective, transaction latency becomes completely uniform and highly predictable.

The Strategic Benefit: Currently, a user session may encounter an instantaneous transaction response, followed a moment later by a multi-second delay simply because their specific COMMIT command executed simultaneously with a log switch checkpoint. Eliminating constant switches ensures uniform, predictable, and sub-second transaction commit processing across the entire user base.

Conclusion and Core Directive

Undersized redo logs force high-performance solid-state storage arrays to absorb massive amounts of unnecessary operational punishment by demanding that files be opened, written, closed, checkpointed, and archived hundreds of times per hour. Increasing the log file size to align with a 20-minute target window does not merely alter a structural capacity metric; it fundamentally upgrades the internal execution efficiency of the core Oracle database engine.
It systematically clears the log file sync bottleneck, cools down spiking CPU usage, and allows your enterprise data infrastructure to operate at its true peak potential.

New IDC Research: Is Your AI Stack Ready? IDC Settles the Debate
July 9 | Register Now

More than half of enterprise AI projects never make it to production, and new IDC research reveals the surprising reasons why. Join us to see the exclusive results from our joint global survey of 1,300 organizations. This session will uncover what separates AI leaders from those stuck in pilot purgatory, including critical findings about data security, GPU underutilization, and the hidden time tax that slows data scientists. See how your AI maturity stacks up against IT peers and learn the concrete steps that leading organizations are taking to move from experimentation to mastery.

Key takeaways include:
New research that uncovers what's stalling enterprise AI initiatives
Which infrastructure and data decisions actually predict AI project success
How your organization benchmarks against the IDC AI Readiness Index
What leading enterprises are doing differently to achieve ROI

Register Now!

Flashblade & Ansible collection
Hello community,

We are facing some issues with our FlashBlade when provisioning policies with Ansible. We have updated all the components (py-pure-client and the collection), and we are failing on a simple policy deployment.

- name: Test
  purestorage.flashblade.purefb_policy:
    name: "filesystem_export_policy"
    enabled: True
    policy_type: "nfs"
    at: True
    client: "mynfsexport.flashblade.local"
    secure: True
    security: "sys"
    permission: "rw"
    access: "no-squash"
    fb_url: "{{ fb_host }}"
    api_token: "{{ fb_api_token }}"

The complete stack trace is:

The full traceback is:
Traceback (most recent call last):
  File "/home/user/.ansible/tmp/ansible-tmp-1778656887.1852276-3067455-48057529014464/AnsiballZ_purefb_policy.py", line 107, in <module>
    _ansiballz_main()
  File "/home/user/.ansible/tmp/ansible-tmp-1778656887.1852276-3067455-48057529014464/AnsiballZ_purefb_policy.py", line 99, in _ansiballz_main
    invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
  File "/home/user/.ansible/tmp/ansible-tmp-1778656887.1852276-3067455-48057529014464/AnsiballZ_purefb_policy.py", line 47, in invoke_module
    runpy.run_module(mod_name='ansible_collections.purestorage.flashblade.plugins.modules.purefb_policy', init_globals=dict(_module_fqn='ansible_collections.purestorage.flashblade.plugins.modules.purefb_policy', _modlib_path=modlib_path),
  File "<frozen runpy>", line 226, in run_module
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/tmp/tmp/ansible_purestorage.flashblade.purefb_policy_payload_xaietsz7/ansible_purestorage.flashblade.purefb_policy_payload.zip/ansible_collections/purestorage/flashblade/plugins/modules/purefb_policy.py", line 4233, in <module>
  File "/opt/tmp/tmp/ansible_purestorage.flashblade.purefb_policy_payload_xaietsz7/ansible_purestorage.flashblade.purefb_policy_payload.zip/ansible_collections/purestorage/flashblade/plugins/modules/purefb_policy.py", line 4008, in main
  File "/opt/tmp/tmp/ansible_purestorage.flashblade.purefb_policy_payload_xaietsz7/ansible_purestorage.flashblade.purefb_policy_payload.zip/ansible_collections/purestorage/flashblade/plugins/modules/purefb_policy.py", line 2392, in update_nfs_policy
AttributeError: 'ErrorResponse' object has no attribute 'items'
fatal: [localhost]: FAILED! => {
    "changed": false,
    "module_stderr": "Traceback (most recent call last):\n File \"/home/user/.ansible/tmp/ansible-tmp-1778656887.1852276-3067455-48057529014464/AnsiballZ_purefb_policy.py\", line 107, in <module>\n _ansiballz_main()\n File \"/home/user/.ansible/tmp/ansible-tmp-1778656887.1852276-3067455-48057529014464/AnsiballZ_purefb_policy.py\", line 99, in _ansiballz_main\n invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\n File \"/home/user/.ansible/tmp/ansible-tmp-1778656887.1852276-3067455-48057529014464/AnsiballZ_purefb_policy.py\", line 47, in invoke_module\n runpy.run_module(mod_name='ansible_collections.purestorage.flashblade.plugins.modules.purefb_policy', init_globals=dict(_module_fqn='ansible_collections.purestorage.flashblade.plugins.modules.purefb_policy', _modlib_path=modlib_path),\n File \"<frozen runpy>\", line 226, in run_module\n File \"<frozen runpy>\", line 98, in _run_module_code\n File \"<frozen runpy>\", line 88, in _run_code\n File \"/opt/tmp/tmp/ansible_purestorage.flashblade.purefb_policy_payload_xaietsz7/ansible_purestorage.flashblade.purefb_policy_payload.zip/ansible_collections/purestorage/flashblade/plugins/modules/purefb_policy.py\", line 4233, in <module>\n File \"/opt/tmp/tmp/ansible_purestorage.flashblade.purefb_policy_payload_xaietsz7/ansible_purestorage.flashblade.purefb_policy_payload.zip/ansible_collections/purestorage/flashblade/plugins/modules/purefb_policy.py\", line 4008, in main\n File \"/opt/tmp/tmp/ansible_purestorage.flashblade.purefb_policy_payload_xaietsz7/ansible_purestorage.flashblade.purefb_policy_payload.zip/ansible_collections/purestorage/flashblade/plugins/modules/purefb_policy.py\", line 2392, in update_nfs_policy\nAttributeError: 'ErrorResponse' object has no attribute 'items'\n",
    "module_stdout": "",
    "msg": "MODULE FAILURE: No start of json char found\nSee stdout/stderr for the exact error",
    "rc": 1
}

If the policy is already set, the playbook runs perfectly fine. The problem occurs when a new policy is injected.

However, we have to install the collection from GitHub. It seems that the artifact built and installed by Ansible Galaxy is not correct. Some commits are not propagated. For example, with the FlashBlade collection 1.24 (installed from Galaxy), we have:

>>> from pypureclient.flashblade import (
...     PolicyRuleObjectAccessCondition,
...     PolicyRuleObjectAccessPost,
...     PolicyRuleObjectAccess,
...     NfsExportPolicy,
...     NfsExportPolicyRule,
...     Policy,
...     PolicyPatch,
...     PolicyRule,
...     SmbSharePolicyRule,
...     SmbSharePolicy,
...     SmbClientPolicyRule,
...     SmbClientPolicy,
...     ObjectStoreAccessPolicyPost,
...     NetworkAccessPolicy,
...     NetworkAccessPolicyRule,
...     WormDataPolicy,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/conda/envs/ansible/lib/python3.12/site-packages/pypureclient/flashblade/__init__.py", line 40, in __getattr__
    raise ImportError(f'module {__package__} has no attribute {name}')
ImportError: module pypureclient.flashblade has no attribute PolicyPatch

We have compared the archive from GitHub and there are many differences with the upstream collection code.
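One way to narrow down this kind of mismatch is to list every requested name the installed client cannot resolve, rather than stopping at the first failure as the `from ... import (...)` form does. This is a minimal sketch under stated assumptions: `missing_names` is a hypothetical helper, not part of py-pure-client or the collection, and the `wanted` list is just a few of the names from the import shown earlier. It catches `ImportError` as well as `AttributeError`, because the traceback shows the package's module-level `__getattr__` raising `ImportError` for unknown names.

```python
import importlib


def missing_names(module_name: str, names: list[str]) -> list[str]:
    """Return the names that cannot be resolved on the given module."""
    module = importlib.import_module(module_name)
    missing = []
    for name in names:
        try:
            getattr(module, name)
        except (AttributeError, ImportError):
            # Lazily-loading packages may raise ImportError from a
            # module-level __getattr__ instead of AttributeError.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # A few of the model names the Galaxy-installed module tried to import.
    wanted = ["NfsExportPolicy", "NfsExportPolicyRule", "Policy", "PolicyPatch"]
    try:
        print(missing_names("pypureclient.flashblade", wanted))
    except ImportError as exc:
        print(f"could not import pypureclient.flashblade: {exc}")
```

Running it once in the environment with the Galaxy build and once with the GitHub build would show which names each collection version expects that the installed client does not provide.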
To help, here is my environment's package list:

Package           Version
----------------- -----------
annotated-types   0.7.0
ansible-core      2.18.5
bcrypt            5.0.0
certifi           2026.4.22
cffi              2.0.0
cryptography      44.0.0
dnspython         2.8.0
invoke            3.0.3
Jinja2            3.1.4
MarkupSafe        2.1.1
packaging         24.1
paramiko          5.0.0
pip               24.2
py-pure-client    1.88.0
pycparser         2.22
pydantic          2.13.4
pydantic_core     2.46.4
PyJWT             2.12.1
PyNaCl            1.6.2
python-dateutil   2.9.0.post0
PyYAML            6.0.2
resolvelib        1.0.1
setuptools        75.8.2
six               1.17.0
typing_extensions 4.15.0
typing-inspection 0.4.2
urllib3           2.7.0
wheel             0.45.0

Our FlashBlades are running Purity//FB 4.6.9.

If anyone has a lead to help us get our Ansible back up and running, I thank them in advance. We appreciate your help.

Matth

Guarantee Data Availability: How to Create a Snapshot Bunker
June 23 | Register Now!

Without data integrity and availability, recovery is difficult and may take days or even weeks. Everpure prioritizes remediation and recovery as the critical path to cyber resilience. Data availability is the foundation for reliable remediation and rapid recovery, enabling organizations to restore data regardless of the severity of a disaster or cyberattack.

Why a snapshot bunker is the cornerstone of effective layered resilience
Key considerations for architecting the bunker to ensure survivability
How to set up and operate a snapshot bunker

Register Now!

See It Before It Happens: Predict and Fix Infra Issues Early
June 25 | Register Now!

Database and infrastructure teams often struggle to predict how changes will affect real-world application performance, capacity, and service levels. Without clear visibility into where applications are running and how they consume resources, even small adjustments can introduce unexpected latency, IOPS bottlenecks, or capacity shortfalls, making troubleshooting slower and increasing the risk of missing SLAs.

Join us and learn how to:
Use scaling simulations to predict impacts on latency, IOPS, and capacity before changes are made.
Apply data intelligence to identify where applications are running and how they consume resources.
Improve performance planning and reduce troubleshooting time while maintaining stronger adherence to SLAs.

Register Now!

Announcing the General Availability of Purity//FB 4.7.10 LLR
We are happy to announce the general availability of 4.7.10, the first Long-Life Release (LLR) of the 4.6/4.7 line, providing long-term maintenance of the 4.6 feature set with consistency in capabilities, user experience, and interoperability. For more detailed information about bug fixes and security updates included in each release, see the release notes.

Highlights

The 4.7 LLR adds many new features and improvements for customers upgrading from the 4.5 LLR:

AI and high-performance workload acceleration
Major gains for AI, HPC, and other demanding unstructured workloads through higher-performance platform options, increased bandwidth, better read/write concurrency, and S3 over RDMA give customers faster time to results, better GPU utilization, and more value from large-scale file and object environments.

Lower cost at scale with smarter data placement and efficiency
Zero Move Tiering for file and object, larger supported system scale, tenant-level performance controls, and DeepReduce for FlashBlade//E help customers scale capacity more efficiently, improve workload density, and reduce total cost of ownership without adding operational complexity.

Stronger security and compliance
Important security capabilities, including NFS over TLS, mutual TLS, FIDO2 and authentication policy support for SSH, object auditing, and proactive Purity patching, help customers protect data in flight, strengthen access controls, simplify compliance, and respond faster to risk.

Multi-tenancy and service delivery
Realms, management access policies, secure multi-tenancy for object, local users and groups, and Fusion workload and preset enhancements give customers stronger isolation, simpler delegated administration, and more consistent service delivery across shared environments and dark sites.
Improved observability, resilience, and operations
On-box observability, richer protocol and connection statistics, Rapid Replica enhancements, larger-scale file data protection, and hardware-based availability improvements make it easier for customers to troubleshoot faster, protect more data at scale, and run mission-critical environments with greater confidence.

This release is supported on the following FlashBlade Platforms: FB//S100, FB//S200 (R1, R2), FB//S500 (R1, R2), FB//ZMT, FB//E, FB//EXA

Release recommendations

Customers looking for long-term maintenance of the complete 4.6 feature set are encouraged to upgrade to the 4.7 LLR. Customers looking for the latest features as soon as they are available should upgrade to the 4.8 FR. Check out our AI Copilot intelligent assistant for deeper insights into release content and recommendations.

EOL schedule

The 4.6/4.7 release line is planned to reach End-of-Life (EOL) in April 2029.

Links and References

Purity//FB 4.7 Release Notes
Everpure Product Lifecycle and Support Policies Overview
Purity//FB Release and End-of-Life Schedule
Purity//FB Release Guidelines
FlashBlade Hardware and End-of-Support
FlashBlade Capacity and Feature Limits
Pure1 Manage AI Copilot

Hands-on with Everpure's FlashBlade//EXA
This is a syndicated repost from the WWT Company Blog. The original post can be found here.

The Everpure FlashBlade and why the need for a new design

The original FlashBlade was released in 2016 and was the first of its kind, delivering an all-flash solution for unstructured data, which had long been served by the spinning-disk market. With the exponential growth of unstructured data, Everpure (formerly Pure Storage) updated the FlashBlade design in 2022 with a modular approach called the FlashBlade//S. It allowed compute blades to scale independently from the storage by using DirectFlash Modules (DFMs) instead of NAND chips soldered onto each blade, as was done in the first generation of the FlashBlade design. Despite the hardware changes, the heart of the solution (Purity//FB software) still attains phenomenal performance by using a Key-Value database as the metadata engine. In fact, the latest testing shows that a single FlashBlade//S chassis can support 3.5 trillion objects in about 100 MB of metadata space.

The FlashBlade//S solution scales to 10 chassis (100 blades) and is well-suited for many AI storage use cases, such as data ingest and model training. As AI dataset sizes increase into the petabytes, and the number of GPUs used for training and inferencing grows into the tens of thousands, the FlashBlade//S architecture doesn't scale as efficiently and economically to meet the needs; thus, the FlashBlade//EXA was born in 2025, which expanded the FlashBlade//S architecture by separating the data storage from the metadata operations.

//EXA Architecture

In traditional High Performance Computing and AI environments, storage systems that incorporate parallel filesystems have been dominant due to their performance, but they are also very difficult to install and complex to manage. With the maturity of parallel NFS (pNFS), we are seeing more vendors offering pNFS solutions because of the similar performance it delivers without all the extra complexity. FlashBlade//EXA utilizes pNFS in its new disaggregated storage architecture, pairing one or more FlashBlade//S500 chassis as Metadata Nodes (MN) with commodity rack servers filled with SSDs as Data Nodes (DN). This allows you to scale and size the solution based on your performance and capacity needs.

How do data flow and client connections work in this new design? I'm glad you asked. When a client initiates a read or write operation, it establishes a parallel NFS (pNFS) connection to the MN. The MN acts as an "air traffic controller", redirecting the client to the appropriate DNs serving the File System for a direct access connection via the blazing-fast NFSv3 over RDMA protocol. Meanwhile, the MN(s) and DN(s) are in constant communication behind the scenes, handling file system creation and updating the metadata key-value store to keep track of where the data resides across the DNs. This architecture is purpose-built for high throughput and parallel access, ensuring that neither the metadata operations nor data access becomes a bottleneck.

The results of this architecture change for FlashBlade//EXA are a high-performance, scale-out storage solution built for modern data needs. The updated design provides significant parallelism, high throughput, and the flexibility to handle both AI and HPC workloads. As metadata requirements change, customers can simply scale the FlashBlade//S cluster from 1 to 10 chassis, with each chassis supporting up to 10 blades, while still utilizing a single virtual interface port (VIP) connection that spreads the load across the cluster to utilize all the blades efficiently. As capacity needs change, simply add more DNs (up to 1000) with the SSD capacities and quantities required to meet your needs. The MNs, DNs and clients are all connected via 400 Gb network switches for low-latency, high-throughput connectivity while limiting the number of cables used to simplify the installation process.

Installation

Historically, Everpure's hardware appliances (FlashArray and FlashBlade) have always been just that, an appliance. Simply rack the gear, connect the cables, copy the desired software version from a USB drive, and run through the setup wizard. Within a few hours, the array would be ready to provision storage and allow client connections. In the ATC, we've installed numerous FlashArrays and FlashBlades for customer evaluations and can testify that the installation process is straightforward and quick.

The FlashBlade//S (a.k.a. MN) installation was what we were used to. The recommended software version was installed on the External Fabric Modules (XFMs); we then connected the FlashBlade chassis cables to the XFMs, the software was pushed to each of the blades, and we ran through the setup wizard to complete the base install steps and access it across the network.

It's worth noting that any time you open up your ecosystem to use commodity servers in the design, there are going to be new challenges and growing pains around the installation, configuration, and management. And the responsibility for securing against unauthorized access and for out-of-band management falls to the customer, as it's no longer a hardened appliance. This was a new experience for us with Everpure, as we went into this with the appliance mentality, forgetting that this design incorporated software-defined storage (SDS) characteristics for the installation and ongoing maintenance.

Note - while storage appliances typically incorporate all the firmware, drivers, and software updates as part of the upgrade process, those ongoing maintenance steps are separate tasks for the SDS approach and need to be managed by the team(s) responsible for the hardware.
As it relates to management, every OEM's out-of-band management interface is different, some better than others, and requires trial and error to get it right, both on the cables/adapters used and the settings required to make a successful connection to remotely manage the device. With all that said, the rack server (a.k.a. DN) installation was not simple or quick…but that's the beauty of the AIPG - allow WWT to iron out the kinks and prove out the steps required to make things work together, all while reducing time and risk for the entire process.

The deployment in our lab sandbox consisted of a Linux management VM that runs the FlashBlade//EXA Services Container. This Services Container provides TFTP & DHCP services, a repository for installation files and scripts, and a Prometheus and Grafana instance for ongoing monitoring of the Data Nodes' performance. This is also where maintenance tasks on the DNs, such as disk replacements, are initiated.

While this was only a small 8-DN configuration, we wanted to treat it as if it were a 100-, 500- or even 1,000-node install to get an idea of what a customer would expect during the installation process. While we could have simply copied the installation files and software to a USB drive to plug in locally to each server, we used the provided automation scripts and steps for the installation process by having the DNs boot over the network to load the software and configuration files from the management VM. This meant we needed to configure out-of-band networking on the DNs and change the BIOS to allow network booting. Next, we captured the MAC addresses of the servers' onboard NICs to set up DHCP reservations and node names that would be used in the FB//EXA deployment. Finally, we configured the DHCP options to direct the DNs to the TFTP server running on the Linux VM. After a few attempts and a couple of tweaks with our management network setup, we were able to start the DN installation.
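To give a feel for the reservation step at larger node counts, here is a minimal sketch that turns a list of captured MAC addresses into DHCP host reservations pointing at a TFTP server. Everything specific in it is an assumption for illustration: the stanza layout follows ISC dhcpd's `host` declaration syntax, which is not necessarily what the FlashBlade//EXA Services Container uses, and the node names, addresses and boot file name are placeholders drawn from documentation-reserved ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    name: str  # hypothetical node name
    mac: str   # MAC address as captured from the onboard NIC
    ip: str    # address to reserve for the node


def normalize_mac(mac: str) -> str:
    """Return a MAC address as lowercase, colon-separated hex pairs."""
    digits = "".join(ch for ch in mac if ch not in ":-.").lower()
    if len(digits) != 12 or any(ch not in "0123456789abcdef" for ch in digits):
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def reservation(node: DataNode, tftp_server: str, boot_file: str) -> str:
    """Render one ISC-dhcpd-style host stanza for a data node."""
    return (
        f"host {node.name} {{\n"
        f"  hardware ethernet {normalize_mac(node.mac)};\n"
        f"  fixed-address {node.ip};\n"
        f"  next-server {tftp_server};\n"
        f'  filename "{boot_file}";\n'
        f"}}"
    )


# Placeholder inventory; real values come from the servers themselves.
nodes = [
    DataNode("dn01", "00-00-5E-00-53-01", "192.0.2.11"),
    DataNode("dn02", "00:00:5e:00:53:02", "192.0.2.12"),
]

for node in nodes:
    print(reservation(node, tftp_server="192.0.2.5", boot_file="boot/placeholder.efi"))
```

Generating the stanzas from a single inventory list keeps node names, MAC addresses and reserved IPs in one place, which matters more at hundreds of nodes than at eight.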
The upside of troubleshooting new installations is that you really get to learn the product, how things work under the covers, and to collaborate with the OEMs so they can update their install docs and environment prerequisites to help customers avoid the same challenges in the future. In our experience, no two environments are the same; they are all configured a little differently and use different switch models and OEMs. With the base setup and deployment complete, it was time to configure the solution. At the time of our testing, the Viking VSS2320 servers are the only currently supported server model, as they provide hardware-based redundancy for high availability (HA) by allowing each server controller in the 2RU chassis to connect to all installed SSDs. In the event of a server failure, the remaining server can take over access to the drives and the data they contain. In a future software release, the resiliency will be done via software-based erasure coding, which will remove the hardware requirement for HA and allow additional server OEMs and models to be supported. Configuration FB//EXA With the Purity//DN image installed on the DNs, a few tasks remained before we could join them to the MN. For each DN, we needed to run a command to format the DN's internal storage (local NVMe drives), then another command to run a health check. Once all the DNs were in a healthy state, the last couple of steps were done via an SSH session to the MN to create the first Node Group and add the DNs to it. Note - In a large-scale FB//EXA deployment, there may be a need for multiple Node Groups (e.g., different departments or multi-tenancy), and a DN can belong to multiple Node Groups. We started with only 6 DNs in the group and later added 2 more, as shown in the image below. In the current release tested, there is no DN rebalancing of the data as reflected with DNs 9/10 having less consumed data on them. 
And in case you are wondering DNs 1/2 needed a firmware update at the time of the Node Group creation and will be used for future customer POCs. At this point, the system was ready to have a File System created. This step consisted of associating the File System to a single Node Group, specifying the size of the File System, and providing a name - which was all done through a single command. The only thing left to configure was the protocols enabled for the File System and the rules & policies for who can access the network share. Clients On the client side, we used two high-performant servers with GPUs and 2 x 400 Gb network cards running an Ubuntu OS. There are only a few requirements related to BGP and RoCEv2 networking that need to be configured so we installed the standard FRRouting package on the clients, enabling bgpd and configuring the service. Note - FlashBlade//EXA utilizes a common layer 3 Border Gateway Protocol (BGP) network designed for performance and efficiency, along with Remote Direct Access Memory (RDMA) that is optimized for high speed and low latency. The dual 400 Gb Connect-X network ports were then configured with the correct Priority Flow Control and DSCP mapping settings to support RoCEv2. Finally, to complete the configuration phase of the install, we installed the Everpure-provided "nfs-client-pure-dkms" Linux package, which optimizes the Linux kernel NFS. sudo apt install ./nfs-client-pure-dkms_1.0_amd64.deb Testing With the File System created on the FB//EXA and the clients configured, we were ready to start the testing. All that was left to do was mount the File System on the Clients using the below mount command that specifies the single MN VIP and File System. This is because the FlashBlade//S internally load balances the connections automatically across all the available blades. 
sudo mount -t nfs -o vers=4.1,proto=tcp,nconnect=16 <data_vip>:<filesystem> /mnt/nfs

Note - the mount command specifies the file system type of NFS, with options for NFS version 4.1 and nconnect=16 to establish multiple TCP connections to the VIP.

Here's where things got fun. During baseline synthetic testing, FlashBlade//EXA achieved near line-rate performance on a single client with dual 400 Gb ConnectX adapters. In a 100% read workload, aggregate throughput of the two 400 Gb NICs reached 781 Gb/s (97.65 GB/s), effectively saturating the available 800 Gb/s of network bandwidth on a single client. In a 100% write workload test using a 512k block size, a single client with two 400 Gb NICs averaged a sequential write throughput of 83 GB/s (77.3 GiB/s). When we added a second client with the same hardware specs to the mix, latency remained consistently low, and throughput scaled linearly across our tests.

[Figure: 100% write across 2 x clients, each with 2 x 400 Gb/s NICs]

In the end, we found that client-side networking was the bottleneck in our lab setup. The FB//EXA did a great job of balancing metadata operations across the blades and spreading read/write operations across the DNs that serviced the file system presented to clients. Our best guess is that it would take 8-10 clients, each with 2 x 400 Gb NICs, to saturate the network connections to the 8 DNs in our setup.

Power requirements are another important factor to consider. While idle, the solution consumed about 5-6 kW of power. In the two-client tests, the FB//EXA solution consumed approximately 8.5 kW during sustained writes and about 7.2 kW during sustained reads.

Summary

In closing, FlashBlade//EXA is fast and made a strong impression on our AI Proving Ground team.
From the disaggregated design to the simple client setup, it's a solid choice for anyone needing serious storage horsepower, especially if you want to spend more time running workloads and less time tinkering. And with FlashBlade//EXA running the same Purity//FB operating system, the learning curve will be short for those already familiar with FlashBlade's UI. We're excited to collaborate with our customers as they explore use cases that require FB//EXA-level performance and future enhancements as the product evolves. Our initial impression is that this platform truly delivers on its promises for today's data-driven environments.

Are you ready to evaluate FB//EXA for your demanding AI and HPC workloads? Let our AIPG teams help de-risk and accelerate decision-making for your next-generation, high-performance storage needs.

AI Proving Ground in the ATC

WWT's Advanced Technology Center (ATC) is a state-of-the-art facility that allows customers, partners, and employees to explore, test, and validate technology solutions in a collaborative environment. The AI Proving Ground (AIPG) is an initiative to develop, test, and implement artificial intelligence solutions within the ATC. The AIPG enables AI technologies to be explored, validated, and demonstrated in real-world scenarios, allowing organizations to assess the capabilities and potential of AI solutions before deploying them at scale.

Enabling Agentic AI via Pure1 Manage MCP Server
Everpure now offers a Pure1® Manage MCP Server so you can query information about your fleet using natural language questions. In this post, I’ll explain how the Pure1 Manage MCP Server works. The first section will explain MCP in general, and the second section will explain how to use our specific server. Feel free to skip to the Quick Start section if you’re already familiar with MCP and just need the parameters to plug into your host. What is MCP? MCP stands for "Model Context Protocol," and it's a way for users to connect their AI applications to external systems using tool calls. MCP tools are fundamentally rooted in application programming interfaces (APIs). An API is a set of rules and protocols that allows different software applications to communicate with each other. It acts as an intermediary, enabling one piece of software (the client) to request information or functionality from another piece of software (the server) without needing to know the server's internal workings. For instance, when you check the weather on your phone, the weather app uses an API to send a request to a weather service, which then returns the current weather data. AI applications have trouble making API calls directly because APIs are designed for completeness and correctness, not for an LLM to use easily. When an AI application wants to use an external system to handle a user’s request, it uses the MCP protocol to make a tool call. The AI (client) requests a function (the tool) from an external system (the server), and the system executes the function and returns a result. This makes MCP a system that standardizes and mediates API-like interactions, allowing AI models to leverage external, real-world capabilities. For more information, see this article on the MCP website: “What is the Model Context Protocol (MCP)?” How can customers benefit from the Pure1 Manage MCP Server? 
The Pure1 Manage MCP Server enables customers to securely integrate AI assistants, copilots, and agentic systems with live Pure1 telemetry and operational data, without building custom API integrations. It transforms Pure1 from a dashboard-centric experience into an AI-accessible platform, enabling natural language interaction, contextual automation, and real-time operational intelligence. Customers benefit from faster AI integration, reduced engineering effort, preserved security controls, and improved decision velocity across hybrid environments.

What types of customer workflows are best suited for MCP?

The Pure1 Manage MCP Server is particularly well-suited for agentic and AI-driven workflows, including:

- Fleet telemetry integration with customer copilots: Expose Pure1 telemetry (arrays, volumes, workloads, metrics, and alerts) to internal copilots, chatbots, or AI platforms via MCP endpoints. Value: Unified operational visibility across hybrid and multi-platform environments.
- Automation with context awareness: Use MCP to validate storage state, health, performance, or capacity before executing provisioning, backup, or disaster recovery workflows. Value: Safer automation with contextual validation, reduced execution errors, and greater rollback confidence.
- Hybrid cloud observability: Correlate Everpure array performance and capacity metrics with application, VM, container, or cloud telemetry across environments. Value: Faster troubleshooting and improved end-to-end performance insights.
- Conversational operational analytics: Enable operators to ask real-time natural language questions, such as “Which arrays are nearing capacity risk?”, “Show me the top latency spikes in the last 24 hours.”, and “Summarize all critical alerts across regions.” Value: Rapid insight without navigating dashboards or exporting reports.

What the Pure1 Manage MCP Server can do

The Pure1 Manage MCP Server exposes the Pure1 Manage REST APIs to AI applications.
This means you can ask ad hoc questions about your fleet using natural language, and your AI application will use the server’s tools to query the answer. No programming required.

Example conversation

Here’s an example conversation demonstrating the use of the Pure1 Manage MCP Server:

User: "What is the name of my oldest FlashArray and what is its serial number?"
Chatbot: Processing Request: The AI recognizes this as a request for fleet information requiring a tool call.
Tool Invocation: Get_hardware
Response: The oldest FlashArray in your fleet is named FA-Prod-01 with serial number FA10123456.

User: "Can you show me the average I/O latency for that array over the last 7 days?"
Chatbot: Let me first check the available latency metrics, then query the historical data:
Tool Invocation: Get_metrics
Tool Invocation: Get_metrics_history
Response: The average I/O latency for FA-Prod-01 (serial FA10123456) over the last 7 days was 0.385 milliseconds.

User: "How many arrays do I have in total that are still under warranty?"
Chatbot: Processing Request: The AI recognizes this as a request for a count of arrays filtered by warranty status.
Tool Invocation: Get_array_support_contracts
Response: You currently have 12 arrays in your fleet that are still under active warranty.

Quick start

Step 1: Register an API key in Pure1 Manage

The Pure1 Manage MCP Server leverages the Pure1 Manage REST APIs. To access those APIs, you need to register an API key in Pure1 Manage. To do that, follow the directions in The Pure1® REST API introductory blog post. After going through the instructions, you will have an application id and a private key file, which will be used to generate an access token to access the MCP server in step 2.

Step 2: Set up the pure1_token_factory.py script

Prerequisites: you need Python 3.12 or greater to run the script.

- Download pure1_token_factory.zip.
- Unzip the archive.
- Go to the unzipped folder in your command-line terminal.
- Optional but recommended: create and activate a Python virtual environment:
  python3 -m venv .venv
  source .venv/bin/activate
- Install the requirements: pip3 install -r requirements.txt
- Run python3 pure1_token_factory.py <application_id> <private_key_file>
- Copy the generated access token from the script output for the next step.

Step 3: Add remote MCP server to your AI application

Follow the directions for your AI application to add a remote MCP server (see the Pure1 Manage MCP Server User Guide for instructions for specific chatbots). In general, they need the following information:

Remote MCP Server address: https://api.pure1.purestorage.com/mcp
Authorization type: header
Header name: Authorization
Header value: Bearer <access-token>

Important: <access-token> is just a placeholder for the access token you generated in step 2. The actual header value should look something like “Bearer eyJ0eXAiO…”

Important: the access token must be regenerated every 10 hours. Run pure1_token_factory.py again to generate a new access token, and manually copy it into your AI application’s config.

Claude Desktop instructions

Claude Desktop is a special case because it doesn’t let you set the Authorization header directly. You have to run the mcp-remote local MCP server and configure that to use the Pure1 Manage remote MCP server.

Prerequisites

You need to have Node.js version 18 or newer installed on your system.

Configuration

- In Claude Desktop, go to Settings > Developer, and click Edit Config.
- Open the claude_desktop_config.json file in a plain-text editor like VS Code.
- Configure the mcp-remote server, which is necessary to pass the Authorization header to the Pure1 Manage MCP Server.
- Paste the token into the configuration file, then restart Claude Desktop.
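The manual config edit described in the steps above can also be scripted. The following is a minimal sketch, not an official tool: it merges a "Pure1 API" entry into an already-loaded config dictionary while leaving every other key untouched. The function name add_pure1_server is hypothetical; the server URL, the mcp-remote arguments, and the leading space before "Bearer" are taken from this post.

```python
import json

# Remote MCP server address as given in this post.
PURE1_MCP_URL = "https://api.pure1.purestorage.com/mcp"


def add_pure1_server(config: dict, access_token: str) -> dict:
    """Return a copy of config with a "Pure1 API" entry under mcpServers.

    Hypothetical helper: other top-level keys and other MCP servers
    already present in the config are preserved as-is.
    """
    updated = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    servers = updated.setdefault("mcpServers", {})
    servers["Pure1 API"] = {
        "command": "npx",
        "args": [
            "-y",
            "mcp-remote",
            PURE1_MCP_URL,
            "--header",
            "Authorization:${AUTHORIZATION_HEADER}",
        ],
        # The leading space before "Bearer" mirrors the config in this post.
        "env": {"AUTHORIZATION_HEADER": f" Bearer {access_token}"},
    }
    return updated


if __name__ == "__main__":
    # Illustrative input only; a real file may hold other servers and settings.
    existing = {"mcpServers": {"other-server": {"command": "example"}}}
    merged = add_pure1_server(existing, "<paste access token here>")
    print(json.dumps(merged, indent=2))
```

Working on a copy keeps the original dictionary intact, so the result can be inspected before anything is written back to claude_desktop_config.json.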
{
  "mcpServers": {
    "Pure1 API": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote",
        "https://api.pure1.purestorage.com/mcp",
        "--header",
        "Authorization:${AUTHORIZATION_HEADER}"
      ],
      "env": {
        "AUTHORIZATION_HEADER": " Bearer <paste access token here>"
      }
    }
  }
}

Note: there might be other configuration options in this file. Be sure to leave them unchanged, and only insert the Pure1 API config in the mcpServers section. The leading space in the AUTHORIZATION_HEADER environment variable value is important. It's there to work around a bug in Windows argument parsing.

Please note that the first time Claude Desktop uses a tool, it will ask you for permission. You can grant permission to all tools at once by going to Customize > Connectors > Pure1 API, and selecting Always Allow under Other tools.

For more detailed instructions from Anthropic, please refer to: Connect to local MCP servers - Model Context Protocol.

Why Object Storage Still Matters
In Part 2, I wrote a line that, at the time, felt almost like a side comment — something I typed without fully appreciating how much it would change the direction of the story: “BREAKING NEWS: The FlashArray now supports Object??? What in the world? I may need to write an article about that!!” That reaction wasn’t planned, and it definitely wasn’t me being clever. It was me looking at the GUI and thinking, “that can’t be right… can it?” It didn’t line up with how I’ve been modeling storage architectures in my head for years, which usually means one of two things: either something fundamentally changed… or I’ve been confidently wrong about part of this for a while. And if I’m being completely honest, there was also a second reaction happening in parallel — one that I didn’t write down at the time because it sounded slightly ridiculous even in my own head: “Wait… do I actually understand why object storage exists in the first place? And more importantly… what exactly was wrong with files?” That’s the part nobody likes to admit out loud. We’ve all spent years confidently explaining block, file, and object as if we were born with that knowledge, when in reality most of us learned it incrementally, retroactively, and with just enough conviction to sound credible in front of a customer. Object storage, in particular, has always carried this aura of inevitability — like of course it’s better, of course it scales, of course it’s what modern applications need — without always forcing us to question why the previous model stopped being enough. Because for as long as most of us have been designing infrastructure, object storage has not simply been another protocol layered onto an existing system. It has represented a fundamentally different way of organizing and accessing data, one that required its own architectural approach, its own scaling model, and, more often than not, its own dedicated platform. 
The separation between block, file, and object was not arbitrary; it was a reflection of how deeply different those paradigms were in terms of metadata handling, access patterns, and performance expectations. This is precisely why platforms such as Everpure FlashBlade exist in the first place. They were not created as extensions of traditional storage systems but as purpose-built architectures designed to treat unstructured data — and particularly object data — as a first-class citizen. The use of distributed metadata services, sharded across independent nodes, combined with a key-value store storage model, allows such systems to achieve levels of parallelism and throughput that simply cannot be replicated within a controller-based design. In that context, object storage is not something that is “added” to the system; it is the system. Which is why seeing S3 support appear on FlashArray required a pause. Not excitement. Not skepticism alone. Something closer to intellectual friction. Reconciling Two Architectural Worlds The most important step in understanding what FlashArray has introduced is to resist the temptation to treat it as a direct comparison to FlashBlade. These aren’t two different ways of solving the same problem. They’re two different answers to two different problems—and pretending otherwise is where people get themselves into trouble. FlashBlade is built for object, not adapted to it. S3 talks directly to a distributed engine that thinks in objects, not files pretending to be objects. Metadata is spread across blades instead of becoming a centralized choke point, and the whole system scales the way modern workloads actually need it to. There’s no file system layer to fight with, no directory structure to navigate, no POSIX semantics getting in the way. It just does what you’d expect when you remove all of that: it goes fast, it scales cleanly, and it keeps up with workloads like HPC, AI and analytics without breaking a sweat. 
FlashArray takes a very different path, and in reality, it’s not what most people expect. It doesn’t try to reinvent itself as an object platform, and it doesn’t throw an S3 gateway in front of the array and call it a day. With Purity 6.10.5+, S3 just shows up as another protocol the system understands, right next to block and file. That distinction matters more than it seems. This isn’t something duct-taped on the side; it’s part of the same control plane, the same data path, the same system you’ve already been running.

But let’s not pretend it turned into FlashBlade overnight. This is still a controller-driven architecture. The primary controller does the heavy lifting (handling requests, authenticating them, coordinating operations) before anything actually hits the storage engine. Which means it behaves differently, especially as workloads scale.

So it ends up in this interesting middle ground. Not a native object system in the pure sense, but not a hack either. Just a different way of exposing what’s already there.

The Translation Layer and Its Consequences

It would be irresponsible to discuss FlashArray S3 without explicitly addressing the implications of this design. Even with its native integration into Purity, S3 operations are still subject to the realities of a controller-bound architecture. Every request must be processed, authenticated, and coordinated before it is executed, introducing a measurable difference in behavior compared to both native block operations and distributed object systems.

The most immediate effect is latency. While FlashArray continues to deliver sub-150 microsecond performance for block workloads, S3 operations typically run at higher latencies (in the 1 millisecond range) due to the additional processing steps involved. This is not a flaw; it is the natural outcome of introducing a protocol that was designed for scale and flexibility into a system optimized for low-latency transactional workloads.
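To make the latency gap concrete, here is a rough back-of-the-envelope sketch using Little's law (throughput = concurrency / latency). It assumes an idealized closed loop in which every outstanding request completes in a fixed time, and it treats the figures above (150 microseconds for block, about 1 millisecond for S3) as round illustrative numbers, not measurements. The function names are hypothetical.

```python
import math

MICROSECONDS_PER_SECOND = 1_000_000


def ops_per_second(latency_us: int, concurrency: int = 1) -> float:
    """Little's law: throughput = concurrency / latency (idealized)."""
    if latency_us <= 0 or concurrency < 1:
        raise ValueError("latency_us must be > 0 and concurrency >= 1")
    return concurrency * MICROSECONDS_PER_SECOND / latency_us


def required_concurrency(target_ops: int, latency_us: int) -> int:
    """Outstanding requests needed to sustain target_ops at a given latency."""
    if target_ops <= 0 or latency_us <= 0:
        raise ValueError("target_ops and latency_us must be > 0")
    return math.ceil(target_ops * latency_us / MICROSECONDS_PER_SECOND)


if __name__ == "__main__":
    # Illustrative latencies only: 150 us for block, 1,000 us for S3.
    for label, latency_us in (("block", 150), ("S3", 1_000)):
        print(
            label,
            round(ops_per_second(latency_us)),
            "ops/s per outstanding request;",
            required_concurrency(10_000, latency_us),
            "requests in flight for 10,000 ops/s",
        )
```

Under these assumptions, a single outstanding request completes roughly 6,667 block operations per second versus 1,000 S3 operations, so an S3 client has to keep more requests in flight to reach the same rate. Real systems will deviate from this model, particularly once the controller becomes the limiting factor.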
Metadata handling further reinforces this distinction. FlashBlade distributes metadata across its architecture, enabling massive parallelism and consistent performance at scale. FlashArray processes metadata through its controller framework, which introduces natural serialization points under high concurrency. As workloads become increasingly metadata-heavy, particularly with small objects, this difference becomes more pronounced.

The system also enforces clearly defined operational limits to maintain predictable performance. As of Purity 6.10.5+, FlashArray supports up to 250 S3 buckets per array and a maximum of 1,000,000 objects per bucket.

[Figure: FlashArray Object Store Limits]

Object storage operates at the array scope and does not integrate with multi-tenancy or “realms”, which has implications for service provider models and strict tenant isolation requirements. These constraints are not arbitrary limitations; they are guardrails that ensure the system behaves consistently within its architectural boundaries.

Where the Architecture Becomes Secondary

Having established those boundaries, the conversation naturally shifts from “how it works” to “why it matters”. In many enterprise environments, particularly within SLED organizations, the challenge is not achieving exabyte-scale throughput or supporting billions of objects. The challenge is delivering capabilities in a way that is operationally sustainable, economically efficient, and aligned with existing infrastructure.

This is where FlashArray’s approach becomes compelling. By exposing object storage within the same platform that already supports block and file workloads, it eliminates the need to introduce a separate system, a separate operational model, and a separate set of dependencies. The same management interface, the same automation framework, and the same data services extend across all protocols.

More importantly, object data inherits the full set of Purity capabilities.
Global inline deduplication and compression apply to S3 workloads, significantly improving storage efficiency compared to many object-native platforms. SafeMode snapshots extend immutability to object storage, providing a critical layer of protection against ransomware. ActiveCluster, combined with ActiveDR, enables a three-site resilience model that ensures data availability across multiple locations with zero RPO between primary sites.

These are not incremental improvements. They represent a shift in how object storage can be consumed within an enterprise.

Practical Use Cases in a Unified Model

When viewed through this lens, the use cases for FlashArray S3 become both clear and grounded in reality.

Development and Staging Environments

For applications that rely on S3 APIs but do not require massive scale, FlashArray provides a consistent and integrated object interface without introducing additional infrastructure. Developers can build and test against a familiar model while remaining within the same operational environment.

Backup and Recovery Workflows

FlashArray S3 enables modern data protection strategies that leverage object storage while benefiting from flash performance, deduplication, and indelible snapshots. This combination improves both recovery times and storage efficiency.

Tier-two repositories and application-integrated storage represent another natural fit. Workloads such as document management systems, logs, and archival data often require object semantics but do not justify the higher cost of a dedicated object platform. Consolidating these workloads onto FlashArray simplifies operations while maintaining reliability and performance.

Where the Boundaries Still Matter

None of this diminishes the importance of selecting the appropriate platform for workloads that demand a different architecture. High-performance AI pipelines, large-scale analytics environments, and use cases requiring massive parallelism remain firmly within the domain of FlashBlade.
The ability to scale performance linearly, distribute metadata across many nodes, and support billions of objects is not optional in these scenarios; it is essential.

What has changed is not the relevance of those systems, but the necessity of deploying them for every object storage use case.

A Subtle but Significant Shift

The introduction of S3 on FlashArray does not represent a replacement of one architecture with another. It represents a convergence of capabilities within a unified operational framework.

Object storage, in this model, is no longer a destination that requires its own platform. It becomes a capability: one of several ways to access and manage data within the same system.

That shift is easy to overlook, but its implications are significant. It allows organizations to design around outcomes rather than protocols, to reduce complexity without sacrificing capability, and to align infrastructure more closely with the needs of modern applications.

Closing Reflection

Looking back at that line in Part 2, it is clear that the reaction was not just about a new feature appearing in the interface. It was about the recognition, however incomplete at the time, that something foundational was beginning to change.

Object storage did not suddenly become simpler, nor did it lose the architectural complexity that defines it. What changed is where it lives. And once that becomes clear, you start asking a slightly uncomfortable but very honest question: If this works… and it works well enough for most of what I actually need… why was I so convinced it had to live somewhere else in the first place?

That is usually where the interesting work begins.

Appreciate you reading.

Dmitry Gorbatov
© 2025 Dmitry Gorbatov | #dmitrywashere

Architectural Deep Dive: Building Data Pipelines for AI Agents
May 7 | Register Now!

The leap from a "hello world" AI agent to a production-ready system is a massive data challenge. Autonomous agents are coming your way, and it's up to you to figure out how to get your data stack ready for production.

In this live session, we'll build a high-velocity data pipeline for AI agents from scratch. Starting with the fundamentals of a strong data storage foundation, we'll walk through every layer end-to-end. We'll cover real-time data ingestion, vector storage, retrieval, orchestration, and inference.

In this session you’ll learn:

- How to build a production-ready data storage pipeline for AI agents
- The foundational decisions IT, Data, and AI teams need to make to handle "context lag" and memory before the first agent goes live
- A practical framework for assessing whether your current infrastructure is ready to support AI agents at scale

Register Now!