OT: The Architecture of Interoperability
In a previous post, we explored the fundamental divide between Information Technology (IT) and Operational Technology (OT). We established that while IT manages data and applications, OT controls the physical heartbeat of our world, from factory floors to water treatment plants. In this post, we dive deeper into the bridge that connects them: interoperability. As Industry 4.0 and the Internet of Things (IoT) accelerate, the "air gap" that once separated these domains is evolving. For modern enterprises, the goal isn't just to have IT and OT coexist, but to have them communicate seamlessly. Whether the use case is security, real-time quality control, or predictive maintenance, to name a few, interoperability becomes the critical engine for operational excellence.

The Interoperability Architecture

Interoperability is more than just connecting cables; it's about creating a unified architecture where data flows securely between the shop floor and the "top floor." In legacy environments, OT systems (like SCADA and PLCs) often run on isolated, proprietary networks that don't speak the same language as IT's cloud-based analytics platforms. To bridge this, a robust interoperability architecture is required. This architecture must support:

Industrial Data Lake: A single storage platform that can handle block, file, and object data is essential for bridging the gap between IT and OT. This unified approach prevents data silos by allowing proprietary OT sensor data to coexist on the same high-performance storage as IT applications (such as ERP and CRM). The benefit is the creation of a high-performance Industrial Data Lake, where OT and IT data from various sources can be streamed directly, minimizing the need for data movement, a critical efficiency gain.

Real-Time Analytics: OT sensors continuously monitor machine conditions, including vibration, temperature, and other critical parameters, generating real-time telemetry data. An interoperable architecture built on high-performance flash storage enables instant processing of this data stream. By integrating IT analytics platforms with predictive algorithms, the system identifies anomalies before they escalate, accelerating maintenance response, optimizing operations, and streamlining exception handling. This approach reduces downtime, lowers maintenance costs, and extends overall asset life.

Standards-Based Design: As outlined in recent cybersecurity research, modern OT environments require datasets that correlate physical process data with network traffic logs to detect anomalies effectively. An interoperable architecture facilitates this by centralizing data for analysis without compromising the security posture. IT/OT convergence also requires a platform capable of securely managing OT data, often through IT standards. An API-first design allows the entire platform to be built on robust APIs, enabling IT to easily integrate storage provisioning, monitoring, and data protection into standard, policy-driven IT automation tools (e.g., Kubernetes, orchestration software).

Pure Storage addresses these interoperability requirements with the Purity operating environment, which abstracts the complexity of underlying hardware and provides a seamless, multiprotocol experience (NFS, SMB, S3, FC, iSCSI). This ensures that whether data originates from a robotic arm or a CRM application, it is stored, protected, and accessible through a single, unified data plane.
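To make the Industrial Data Lake idea concrete, here is a minimal, hypothetical sketch of an OT-side collector streaming a sensor reading into a data lake bucket over S3. The endpoint, bucket name, and credentials are placeholders, and the reading is simulated; a real deployment would point at the S3 endpoint and bucket exposed by your own platform.

import json
import time

import boto3

# Placeholder endpoint, bucket, and credentials for illustration only.
s3 = boto3.client(
    "s3",
    endpoint_url="https://datalake.example.internal",
    aws_access_key_id="REPLACE_ACCESS_KEY",
    aws_secret_access_key="REPLACE_SECRET_KEY",
)

def publish_reading(sensor_id, vibration_mm_s, temperature_c):
    # Write one telemetry sample as a JSON object keyed by sensor and timestamp.
    reading = {
        "sensor_id": sensor_id,
        "vibration_mm_s": vibration_mm_s,
        "temperature_c": temperature_c,
        "timestamp": time.time(),
    }
    key = "telemetry/%s/%d.json" % (sensor_id, int(reading["timestamp"]))
    s3.put_object(Bucket="industrial-data-lake", Key=key, Body=json.dumps(reading))

# Simulated sample; in practice this would come from an OT historian or gateway.
publish_reading("press-07", vibration_mm_s=1.8, temperature_c=64.2)

IT-side analytics tools can then read the same bucket over S3 without any data movement.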
Real-World Application: A Large Regional Water District

Consider a large regional water district, a major provider serving millions of residents. In an environment like this, maintaining water quality and service reliability is a 24/7 mission-critical OT function. Its infrastructure relies on complex SCADA systems to monitor variables like flow rates, tank levels, and chemical compositions across hundreds of miles of pipelines and treatment facilities. By adopting an interoperable architecture, an organization like this can break down the silos between its operational data and its IT capabilities. Instead of SCADA data remaining locked in a control room, it can be securely replicated to IT environments for long-term trending and capacity planning. For instance, historical flow data combined with predictive analytics can help forecast demand spikes or identify aging infrastructure before a leak occurs. This convergence transforms raw operational data into actionable business intelligence, ensuring reliability for the communities it serves.

Why We Champion Compliance and Governance

Opening up OT systems to IT networks can introduce new risks. In the world of OT, "move fast and break things" is not an option; reliability and safety are paramount. This is why Pure Storage wraps interoperability in a framework of compliance and governance, including but not limited to:

FIPS 140-2 Certification & Common Criteria: We utilize FIPS 140-2 certified encryption modules and have achieved Common Criteria certification.

Data Sovereignty: Our architecture includes built-in governance features like Always-On Encryption and rapid data locking to ensure compliance with domestic and international regulations, protecting sensitive data regardless of where it resides.

Compliance: Pure Fusion delivers policy-defined storage provisioning, automating deployments with specified requirements for tags, protection, and replication.

By embedding these standards directly into the storage array, Pure Storage allows organizations to innovate with interoperability while maintaining the security posture that critical OT infrastructure demands.

Next in the series: We will explore IT/OT interoperability further, including the processing of data at the edge. Stay tuned!

How to Improve Python S3 Client Performance with Rust
This article originally appeared on PureStorage.com. It has been republished with permission from the author.

Python is the de facto language for data science because of its ease of use and performance. But that performance comes only because libraries like NumPy offload computation-heavy functions, like matrix multiplication, to optimized C code. Data science tooling and workflows continue to improve, data sets get larger, and GPUs get faster. So as object storage systems, like S3, become the standard for large data sets, the retrieval of data from object stores has become a bottleneck. Slow S3 access results in idle compute, wasting expensive CPU and GPU resources. Almost all Python-based use of data in S3 leverages the Boto3 library, an SDK that enables flexibility but comes with the performance limitations of Python. Native Python execution is relatively slow and especially poor at leveraging multiple cores due to the Global Interpreter Lock (GIL). There are other projects, such as a plugin for PyTorch or leveraging Apache Arrow via PyArrow bindings, that aim to improve S3 performance for a specific Python application. I have also previously written about issues with S3 performance in Python: CLI tool speeds, object listing, Pandas data loading, and metadata requests. This blog post points in a promising direction for solving the Python S3 performance problem: replacing Boto3 with equivalent functionality written in a modern, compiled language. My simple Rust reimplementation, FastS3, results in 2x-3x performance gains versus Boto3 for both large object retrieval and object listings. Surprisingly, this result is consistent for both fast, all-flash object stores like FlashBlade®, as well as traditional object stores like AWS's S3.

Experimental Results

Python applications access object storage data primarily through either 1) object store specific SDKs like Boto3 or 2) filesystem-compatible wrappers like s3fs and fsspec. Both Boto3 and s3fs will be compared against my minimal Rust-based FastS3 code to both 1) retrieve objects and 2) list keys. S3fs is a commonly used Python wrapper around the Boto3 library that provides a more filesystem-like interface for accessing objects on S3. Developers benefit because file-based Python code can be adapted for objects with minimal or no rewrites. Fsspec provides an even more general interface that provides a similar filesystem-like API for many different types of backend storage. My FastS3 library should be viewed as a first step toward an fsspec-compliant replacement for the Python-based s3fs. In Boto3, there are two ways to retrieve an object: get_object and download_fileobj. Get_object is easier to work with but slower for large objects, and download_fileobj is a managed transfer service that uses parallel range GETs if an object is larger than a configured threshold. My FastS3 library mirrors this logic, reimplemented in Rust. S3fs enables reading from objects using a pattern similar to standard Python file opens and reads. The tests focus on two common performance pain points: retrieving large objects and listing keys. There are other workloads that are not yet implemented or optimized, e.g., small objects and uploads. All tests are run on a virtual machine with 16 cores and 64GB DRAM and run against either a small FlashBlade system or AWS S3.

Result 1: GET Large Objects

The first experiment measures retrieval (GET) time for large objects using FastS3, s3fs, and both Boto3 codepaths.
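For reference, here is a minimal sketch of the two Boto3 codepaths being compared. The endpoint, bucket, and key are placeholders, and the threshold and concurrency values are illustrative rather than the exact settings used in the benchmark.

import io

import boto3
from boto3.s3.transfer import TransferConfig

# Placeholder endpoint and object; point these at your own S3 target.
s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")
BUCKET, KEY = "benchmark-bucket", "large-object.bin"

# Codepath 1: get_object streams the whole body back from a single request.
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()

# Codepath 2: download_fileobj is a managed transfer that switches to
# parallel range GETs once the object exceeds multipart_threshold.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024, max_concurrency=16)
buffer = io.BytesIO()
s3.download_fileobj(Bucket=BUCKET, Key=KEY, Fileobj=buffer, Config=config)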
The goal is to retrieve an object from FlashBlade S3 into Python memory as fast as possible. All four functions scale linearly as the object size increases, with the Rust-based FastS3 being 3x and 2x faster than s3fs-read/boto3-get and boto3-download respectively. The relative speedup of FastS3 is consistent from object sizes of 128MB up to 4GB.

Result 2: GETs on FlashBlade vs. AWS

The previous results focused on retrieval performance against a high-performance, all-flash FlashBlade system. I also repeated the experiments using a traditional object store with AWS's S3 and found similar performance gains. The graph below shows relative performance of FastS3 and Boto3 download(), with values less than 1.0 indicating Boto3 is faster than FastS3. For objects larger than 1GB-2GB, the Rust-based FastS3 backend is consistently 2x faster at retrieving data than Boto3's download_fileobj function, against both FlashBlade and AWS. Recall that download_fileobj is significantly faster with large objects than the basic Boto3 get_object function. As a result, FastS3 is at least 3x faster than Boto3's get_object. The graph compares FastS3 against download_fileobj because it is Boto3's fastest option, though it is also the least convenient to use. For objects smaller than 128MB-256MB, the FastS3 calls are slower than Boto3, indicating that there are still missing optimizations in my FastS3 code. FastS3 currently uses 128MB as the download chunk size to control parallelism, which works best for large objects but clearly is not ideal for smaller objects.

Result 3: Listing Objects

Performance on metadata listings is commonly a slow S3 operation. The next test compares the Rust-based implementation of ls(), i.e., listing keys based on a prefix and delimiter, with Boto3's list_objects_v2() and s3fs's ls() operation. The objective is to enumerate 400k objects with a given prefix. Surprisingly, FastS3 is significantly faster than Boto3 at listing objects, despite FastS3 not being able to leverage concurrency. The FastS3 listing is 4.5x faster than Boto3 against FlashBlade and 2.7x faster against AWS S3. The s3fs implementation of ls() also introduces a slight overhead of 4%-8% when compared to directly using boto3 list_objects_v2.

Code Walkthrough

All the code for FastS3 can be found on GitHub, including the Rust implementation and a Python benchmark program. I leverage the Pyo3 library to create the bindings between my Rust functions and Python. I also use the official AWS SDK for Rust, which at the time of this writing is still in tech preview at version 0.9.0. The Rust code issues concurrent requests to S3 using the Tokio runtime. Build the Rust-FastS3 library using maturin, which packages the Rust code and pyo3 bindings into a Python wheel.

maturin build --release

The resulting wheel can be installed as with any Python wheel.

python3 -m pip install fasts3/target/wheels/*.whl

Initialization logic for Boto3 and FastS3 is similarly straightforward, using only an endpoint_url to specify the FlashBlade data VIP or an empty string for AWS. The access key credentials are found automatically by the SDK, e.g., as environment variables or a credentials file.

import boto3
import fasts3

s3r = boto3.resource('s3', endpoint_url=ENDPOINT_URL)  # boto3
s = fasts3.FastS3FileSystem(endpoint=ENDPOINT_URL)  # fasts3 (rust)

And then FastS3 is even simpler to use in some cases.
# boto3 download_fileobj()
bytes_buffer = io.BytesIO()
s3r.meta.client.download_fileobj(Bucket=BUCKET, Key=SMALL_OBJECT, Fileobj=bytes_buffer)

# fasts3 get_objects
contents = s.get_objects([BUCKETPATH])

FastS3 requires the object path to be specified as "bucketname/key," which maps to the s3fs and fsspec API and treats the object store as a more generic file-like backend. The Rust code for the library can be found in a single file. I am new to Rust, so this code is not "well-written" or idiomatic Rust, just demonstrative. To understand the flow of the Rust code, there are three functions that serve as interconnects between Python and Rust: new(), ls(), and get_objects().

pub fn new(endpoint: String) -> FastS3FileSystem

This function is a simple factory function for creating a FastS3 object with the endpoint argument that should point to the object store endpoint.

pub fn ls(&self, path: &str) -> PyResult<Vec<String>>

The ls() function returns a Python list[] of keys found in the given path. The implementation is a straightforward use of a paginated list_objects_v2. There is no concurrency in this implementation; each page of 1,000 keys is returned serially. Therefore, any performance advantage of this implementation is strictly due to Rust performance gains over Python.

pub fn get_objects(&self, py: Python, paths: Vec<String>) -> PyResult<PyObject>

The get_objects function takes a list of paths and concurrently downloads all objects, returning a list of Bytes objects in Python. Internally, the function first issues a HEAD request to all objects in order to get their sizes and then allocates the Python memory for each object. Finally, the function concurrently starts retrieving all objects, splitting large objects into chunks of 128MB. A key implementation detail is to first allocate the memory for the objects in Python space using a PyByteArray and then copy downloaded data into that memory using Rust, which avoids needing a memory copy to move the object data between Rust and Python-managed memory. As a side note, dividing a memory buffer into chunks so that data can be written in parallel really forced me to better understand Rust's borrow checker!

What About Small Objects?

Notably lacking in the results presented are small object retrieval times. The FastS3 library as I have written it is not faster (and sometimes slower) than Boto3 for small objects. But I am happy to speculate that this has nothing to do with the language choice and is largely because my code is so far only optimized for large objects. Specifically, my code does a HEAD request to retrieve the object size before starting the downloads in parallel, whereas with a small object, it is more efficient to just GET the whole data in a single remote call. Clearly, there is opportunity for optimization here.

Summary

Python's prominence in data science and machine learning continues to grow. And the mismatch in performance between accessing object storage data and compute hardware (GPUs) continues to widen. Faster object storage client libraries are required to keep modern processors fed with data. This blog post has shown that one way to significantly improve performance is to replace native Python Boto3 code with compiled Rust code. Just as NumPy makes computation in Python efficient, a new library needs to make S3 access more efficient.
While my code example shows significant improvement over Boto3 in loading large objects and metadata listings, there is still room for improvement in small object GET operations, and more of the API remains to be reimplemented. The goal of my Rust-based FastS3 library is to demonstrate the 2x-3x scale of improvements possible and to encourage more development on this problem.

How to Use the FlashBlade Network Plumbing Validation Tool
This article originally appeared on Medium.com. It has been republished with permission from the author.

Did you spend a few hours trying to debug why Apache Spark on FlashBlade® is slower than expected, only to realize you have an underlying networking issue? Flashblade-plumbing is a tool to validate NFS and S3 read/write performance from a single client to a FlashBlade array with minimal dependencies and input required. The only inputs required are the FlashBlade's management IP and login token and, after a few minutes, it will output the read and write throughputs for both NFS and S3. The alternative is to manually configure filesystems and S3 accounts, generate some test data, and then configure and use command line tools like "dd" and "s5cmd," or even worse, slower alternatives like "cp" and "s3cmd." See the accompanying github repository for source code and instructions.

How the FlashBlade Plumbing Tool Works

This tool leverages three different APIs:

A management REST interface on the FlashBlade
User-space NFS
AWS S3 SDK

First, the tool uses the FlashBlade REST API to discover data ports and to create test file systems, object store accounts, keys, and buckets. Second, user-space NFS and S3 libraries enable the generation of write and read workloads. Finally, the REST API is used to remove everything previously created and return the system to the original state. The data written to the FlashBlade is random and incompressible. Each test phase runs for 60 seconds. In many FlashBlade environments there are multiple subnets and data VIPs configured, allowing access to clients in different parts of the network. In case of multiple data VIPs defined on the FlashBlade, the program will test against one data VIP per configured subnet; if a data VIP is not accessible after a period of time, the plumbing tool proceeds to the next subnet.

How To Use FlashBlade Plumbing

Only two inputs are required: 1) the FlashBlade management VIP and 2) the login token. Together, these allow the plumbing program to access the FlashBlade management API to collect and create the necessary information to run the plumbing tests. Specify these input parameters using the environment variables FB_MGMT_VIP and FB_TOKEN. There are multiple different ways to run these tests, depending on the environment: Kubernetes, Docker, or a simple Linux server. First, the login token can be created or retrieved via the FlashBlade CLI:

> pureadmin [create|list] --api-token --expose

An example output looks like below, where the client can only reach the FlashBlade on one of the configured data VIPs:

dataVip,protocol,result,write_tput,read_tput
192.168.170.11,nfs,SUCCESS,3.1 GB/s,4.0 GB/s
192.168.40.11,nfs,MOUNT FAILED,-,-
192.168.40.11,s3,FAILED TO CONNECT,-,-
192.168.170.11,s3,SUCCESS,1.7 GB/s,4.3 GB/s

Three Different Ways to Run

Depending on your environment, choose the approach easiest for you: Kubernetes, Docker, or Linux executable.

Kubernetes

The tool can be run within Kubernetes via a simple batch Job. See the example below and insert your MGMT_VIP and TOKEN. The nodeSelector field is optional and can be used to constrain which Kubernetes worker node runs the plumbing test pod.
apiVersion: batch/v1
kind: Job
metadata:
  name: go-plumbing
spec:
  template:
    spec:
      containers:
      - name: plumbing
        image: joshuarobinson/go-plumbing:0.3
        env:
        - name: FB_MGMT_VIP
          value: "10.6.6.20.REPLACEME"
        - name: FB_TOKEN
          value: "REPLACEME"
      nodeSelector:
        nodeID: worker01
      restartPolicy: Never
  backoffLimit: 2

Docker

The following docker run command invokes the plumbing tool. Use your values for the MGMT_VIP and TOKEN environment variables.

docker run -it --rm -e FB_MGMT_VIP=$FB_MGMT_VIP -e FB_TOKEN=$FB_MGMT_TOKEN joshuarobinson/go-plumbing:0.3

Binary Standalone

For systems without Docker installed or access to Docker Hub, download and run the 14MB Linux binary directly from the release page:

wget https://github.com/joshuarobinson/flashblade-plumbing/releases/download/v0.3/fb-plumbing-v0.3
chmod a+x fb-plumbing-v0.3
FB_MGMT_VIP=10.1.1.1 FB_TOKEN=REPLACEME ./fb-plumbing-v0.3

Running on Multiple Servers

Ansible makes it easy to run the plumbing test on a group of servers, either one at a time or all together. Note that if running multiple instances of the tool in parallel, the test phases will not be fully synchronized. The following Ansible ad hoc commands first copy the downloaded binary to all nodes and then run the tool one host at a time using the "--forks" option to disable parallelism.

ansible myhosts -o -m copy -a "src=fb-plumbing-v0.3 dest=fb-plumbing mode=+x"
ansible myhosts --forks 1 -m shell -a "FB_TOKEN=REPLACEME FB_MGMT_VIP=10.2.6.20 ./fb-plumbing"

Code Highlights

The source code for this plumbing utility is open and available on github and interacts with the FlashBlade using three different APIs: management via REST API and data via user-space NFS and AWS S3.

FlashBlade REST API

The FlashBlade REST API has a Python SDK, which simplifies interacting with the management API. In order to have one binary for both management operations and data plane testing, I implemented a subset of the REST API calls in Golang. The primary elements to a working Golang REST client are 1) negotiating authentication and 2) making specific API calls. First, the authentication section requires choosing a supported API version and then POSTing the login token to the API and receiving a session authentication token back. This session token is added to the header of all subsequent API calls for authentication. The code for this login process follows this pattern:

authURL, _ := url.Parse("https://" + c.Target + "/api/login")
req, _ := http.NewRequest("POST", authURL.String(), nil)
req.Header.Add("api-token", c.APIToken)
resp, _ := c.client.Do(req)
if resp.StatusCode >= 200 && resp.StatusCode <= 299 {
    c.xauthToken = resp.Header["X-Auth-Token"][0]
}

Then every subsequent call adds the following header:

req.Header.Add("x-auth-token", c.xauthToken)

Second, the REST calls are made using a helper function to create the request with the provided parameters and request body. Example calls look like this:

data, err := json.Marshal(filesystem)
_, err = c.SendRequest("POST", "file-systems", nil, data)
…
var params = map[string]string{"names": accountuser}
_, err := c.SendRequest("DELETE", "object-store-users", params, nil)

For the FlashBlade REST API, the request body data is encoded as JSON and request parameters are key/value pairs. Note that creating the necessary parameters or request bodies required inspection of the REST API specification for the FlashBlade and a little reverse engineering of the Python SDK.

Userspace NFS

Traditionally, NFS leverages the NFS client in the Linux kernel.
But this introduces extra dependencies in a plumbing test, i.e., the need to mount a filesystem using root privileges. By using a userspace NFS library, the plumbing application does not require mounting from the host operating system. Instead the mount operation happens from within the Go code:

mount, err := nfs.DialMount("10.62.64.200", false)
…
auth := rpc.NewAuthUnix("anon", 1001, 1001)
target, err := mount.Mount("filesystem-name", auth.Auth(), false)

A key outcome of accessing NFS via userspace code is that the application operates the same inside and outside of container environments. This helps achieve the overall goal of eliminating dependencies for running the plumbing tool. For example, there is no need to configure a CSI driver inside of Kubernetes, or to have root privileges to mount on a bare-metal host. A second advantage is that multiple TCP connections are leveraged, resulting in higher performance similar to the nconnect kernel feature. Reading and writing NFS files then follows the same Go patterns as writing to local files:

f, err := target.OpenFile(filename, os.FileMode(int(0744)))
n, _ := f.Write(srcBuf)
…
f, err := target.Open(filename)
n, err := f.Read(p)

AWS S3 SDK

The S3 protocol always leverages userspace code, meaning that I can simply use the AWS S3 SDK for Golang within the plumbing application. To use this library with FlashBlade, the S3 config object needs to include the endpoint parameter that corresponds to a data VIP on the FlashBlade.

s3Config := &aws.Config{
    Endpoint:         aws.String("10.62.64.200"),
    Credentials:      credentials.NewStaticCredentials(accessKey, secretKey, ""),
    Region:           aws.String("us-east-1"),
    DisableSSL:       aws.Bool(true),
    S3ForcePathStyle: aws.Bool(true),
}

The operations to upload and download objects are the same as for any other S3 backend.

Example Results

Running the plumbing tool on a high-end client machine with 96 cores and 100Gbps networking results in client read throughputs averaging 6.2 GB/s for NFS and 7.7 GB/s for S3. The corresponding GUI shows performance (throughput, IOPS, and latency) during the tests. The FlashBlade itself can deliver more performance with more clients, and perhaps the client as well given that it has been tested and tuned on smaller client hardware profiles.

Conclusion

Most applications using high-performance file or object storage have bottlenecks either in the application or on the storage tier. But the first step in setting up an application is ensuring the underlying infrastructure is configured correctly and not introducing extra bottlenecks. I built the flashblade-plumbing tool to simplify the process of validating the networking layer between each client and FlashBlade with minimal dependencies or pre-configuration required. The result is a single program that requires two inputs, management VIP and login token, and automatically tests NFS and S3 throughput at multi-GB/s speeds.

How to Deploy A Monitoring Stack in Kubernetes with Prometheus and Grafana
This article originally appeared on Medium.com. It has been republished with permission from the author.

Monitoring infrastructure is essential for keeping production workloads healthy and debugging issues when things go wrong. Observability is essential for troubleshooting. The goal of this post is to learn how to quickly and easily deploy a minimal-configuration, open-source Prometheus and Grafana monitoring infrastructure in Kubernetes. The full yaml for the examples discussed can be found on the github repo here. The Prometheus ecosystem continues to improve; the Prometheus operator and associated bundled project, while promising, are still in beta and improving their usability. Docker containers make these applications particularly easy to run and configure, and Kubernetes adds additional resilience. The target audience for this post has a basic understanding of Kubernetes and is new to Prometheus/Grafana. I focus here on a simplistic deployment in order to illustrate how these applications work together and give examples of how Kubernetes concepts create useful building blocks.

There are three necessary services in our monitoring setup:

Prometheus endpoint(s). This is the application with metrics that we want to track and can either be done natively in the application or through an exporter.
Prometheus, a monitoring system and time-series database.
Grafana, a visualization tool that can use Prometheus to create dashboards and graphs.

The software stack I use includes Kubernetes v1.18.2, Prometheus v2.18, and Grafana v7.

Overview: A Standalone Monitoring Pod

Getting started with tools like Prometheus can be daunting, therefore my goal here is to walk through a simple, standalone monitoring deployment to illustrate the necessary components in one single yaml file. This can then be used as a foundation for a more sophisticated Prometheus setup. This walkthrough assumes a basic understanding of Kubernetes components: Services, Deployments, Pods, ConfigMaps, and PersistentVolumes. After reading, you should have a better understanding of both Prometheus and when to use each Kubernetes component.

A quick overview of the components of this monitoring stack:

A Service to expose the Prometheus and Grafana dashboards.
A Deployment with a pod that has multiple containers: exporter, Prometheus, and Grafana.
A ConfigMap that stores configuration information: prometheus.yml and datasource.yml (for Grafana).
PersistentVolumeClaims to make Prometheus metrics and Grafana dashboards persistent.

Service

We start with a Service to expose the UI ports for both Grafana (3000) and Prometheus (9090). Use this service in conjunction with port-forwarding or a load balancer to make it easy to log in to either service.

apiVersion: v1
kind: Service
metadata:
  name: monitor
  labels:
    app: monitor
spec:
  clusterIP: None
  ports:
  - name: graf-port
    port: 3000
  - name: prom-port
    port: 9090
  selector:
    app: monitor

Deployment

We then create a Deployment with a single pod. That single pod contains multiple containers, one each to run Prometheus and Grafana, so this architecture highlights the difference between a pod and a container. Effectively, the Deployment/pod combination is the logical unit by which Kubernetes manages the application: containers within a pod are scheduled together and restarted together. Splitting each into a separate pod creates more robustness, but I focus on a single pod to keep the interconnections between applications simpler.

Container 1, Prometheus.
This first container defines how to run Prometheus, using the public Docker image and linking to a config file that will be defined later.

spec:
  containers:
  - name: prometheus
    image: prom/prometheus
    args: ["--config.file=/etc/prometheus/prometheus.yml"]
    ports:
    - containerPort: 9090
      name: prom-port
    volumeMounts:
    - name: config-vol
      mountPath: /etc/prometheus/prometheus.yml
      subPath: prometheus.yml
    - name: prom-data
      mountPath: /prometheus
    imagePullPolicy: Always

This container spec mirrors the Docker instructions for starting Prometheus, with straightforward translation of Docker arguments to yaml config. Compare the above yaml with the suggested Docker invocation:

docker run -p 9090:9090 \
  -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

The only difference here is an additional volume for the persistent data so that the time-series data is retained upon restarts.

Container 2, Grafana.

The Grafana container specifies elements analogous to Prometheus: a port mapping, a configuration file, and a volume for persistent data. Again, there is a direct correspondence between this yaml and the basic docker run invocation.

  - name: grafana
    image: grafana/grafana
    ports:
    - containerPort: 3000
      name: graf-port
    volumeMounts:
    - name: config-vol
      mountPath: /etc/grafana/provisioning/datasources/datasource.yml
      subPath: datasource.yml
    - name: graf-data
      mountPath: /var/lib/grafana

Volumes

Kubernetes volumes provide data to containers and have different possible sources. In other words, containers need to use many different types of data, so volumes provide the abstraction to connect data to containers in various ways. For example, ConfigMaps are great for small, read-only configuration data, whereas PersistentVolumes are more flexible for larger, dynamic datasets. The Pod spec defines three volumes: one for the configuration files for both services, and one each for the persistent storage for Prometheus and Grafana. These volume definitions instruct Kubernetes how to connect the underlying data sources to the volumeMounts in each container. ConfigMaps and PersistentVolumeClaims are mounted in the containers above the same way. The three volumes are:

volumes:
- name: config-vol
  configMap:
    name: monitor-config
- name: prom-data
  persistentVolumeClaim:
    claimName: prom-claim
- name: graf-data
  persistentVolumeClaim:
    claimName: graf-claim

The two types of sources of these three volumes, ConfigMaps and PersistentVolumeClaims, will be described next.

Volume type 1: ConfigMap

A ConfigMap stores text data that can be used as configuration files inside a container. The data section of the ConfigMap contains two different entries, prometheus.yml and datasource.yml. The previous Volumes map these to configuration files for Prometheus and Grafana respectively.

kind: ConfigMap
apiVersion: v1
metadata:
  name: monitor-config
data:
  prometheus.yml: |-
    global:
      scrape_interval: 30s
    scrape_configs:
    - job_name: 'replaceme'
  datasource.yml: |-
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://127.0.0.1:9090

Note that in the datasource.yml configuration, Grafana connects to Prometheus via localhost (127.0.0.1), a simplification made possible by running both containers in the same Pod.

Volume Type 2: PersistentVolumeClaims

PersistentVolumeClaims enable persistent storage for both Prometheus and Grafana. The result is that both metric data and dashboards persist even with restarts.
I leverage an already-installed Pure Service Orchestrator (PSO) to persist these volumes on a FlashBlade via the "pure-file" StorageClass.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: prom-claim
  labels:
    app: prometheus
spec:
  storageClassName: pure-file
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: graf-claim
  labels:
    app: grafana
spec:
  storageClassName: pure-file
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

These PersistentVolumeClaims are required to make this monitoring setup persistent and therefore able to handle pod restarts and failures without losing data.

Usage

To use this setup, log in to either Grafana to create dashboards or Prometheus to view scraping status. I use port-forwarding to make both services accessible outside my Kubernetes cluster. Access the Grafana dashboard by setting up port-forwarding and then pointing a browser to "hostname:3000":

kubectl port-forward --address 0.0.0.0 service/monitor 3000

Once Grafana loads, you can skip "Adding your first data source" because the datasource.yml entry in our ConfigMap pre-configures the Prometheus data source for Grafana. Instead, go directly to creating a dashboard. To start creating a custom dashboard, click "Add New Panel." Start working with Prometheus queries by selecting "Prometheus" from the drop-down list of data sources. Then begin creating Prometheus queries (using PromQL). You can also directly access Prometheus via port 9090 in order to monitor the scrape status for each target. We have now set up everything except the actual endpoints to monitor, but fortunately this is relatively easy.

Example 1: Monitor Kubernetes

The first example will deploy Prometheus and Grafana to monitor Kubernetes itself. Kube-state-metrics is a service that listens to the Kubernetes API server and exposes cluster state as a Prometheus endpoint. First, install kube-state-metrics to deploy the monitoring service for Kubernetes.

> git clone https://github.com/kubernetes/kube-state-metrics.git
> kubectl apply -f kube-state-metrics/examples/standard/

Next, to connect our monitoring stack to this service, add the following to the prometheus.yml entry in the above ConfigMap:

scrape_configs:
- job_name: 'kube-state-metrics'
  static_configs:
  - targets: ['kube-state-metrics.kube-system:8080']

Once configured, I can then start using PromQL to query metrics. For example, I can check per-node CPU resource limits:

sum by (node) (kube_pod_container_resource_limits_cpu_cores)

Example 2: Starburst Presto

The next example uses the same stack to monitor an application that exposes a Prometheus endpoint. I will use the Starburst Presto operator as an example. The only addition necessary to the previous example is a job config for Prometheus that connects to the target Starburst service's built-in Prometheus endpoint.

scrape_configs:
- job_name: 'starburst-coordinator'
  static_configs:
  - targets: ['prometheus-coordinator-example-presto:8081']

I can then plot interesting metrics, such as "running_queries" and "queued_queries", easily in Grafana.

Example 3: Pure Exporter

The third example builds upon the Pure Exporter, which is an external exporter for Pure FlashBlades and FlashArrays. This exporter is a Prometheus endpoint that runs as a container and collects results from Pure's REST APIs. In other words, the exporter is a gateway that scrapes the Pure API and enables easier management of Pure FlashArrays and FlashBlades, including per-client statistics.
I incorporate this stateless external exporter into the same pod as Prometheus and Grafana. The result is three containers working together to collect, store, and visualize metrics. The pure-exporter can also run in a separate pod, but I chose this option because it simplifies my Prometheus configuration by being able to always access the exporter through a localhost address. The full yaml definition can be found here. The only change required to our initial framework is to add an additional "pure_flashblade" job to the prometheus.yml definition. The places to add specific information about each FlashBlade endpoint are marked with "REPLACE" in the snippet below:

scrape_configs:
- job_name: 'pure_flashblade'
  scrape_timeout: 30s
  metrics_path: /metrics/flashblade
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_endpoint
  - source_labels: [__pure_apitoken]
    target_label: __param_apitoken
  - source_labels: [__address__]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9491
  static_configs:
  - targets: [ MGMT_VIP ] # REPLACE with FB Management IP address
    labels:
      __pure_apitoken: T-token-secret # REPLACE
  - targets: [ OTHER_MGMT_VIP ] # REPLACE OR REMOVE
    labels:
      __pure_apitoken: T-different # REPLACE

To obtain the management token for a FlashBlade, log in and use the following CLI command:

pureadmin [create|list] --api-token --expose

Once configured to collect metrics, you can quickly build dashboards by starting with pre-configured dashboards. You can copy-paste the json definition from Github to Grafana, but I prefer to download the raw json file for a dashboard to my local machine. To import a dashboard in Grafana, click on the "+" and then "Import" in Grafana. After loading the json dashboard definition, the import dialog box requires you to select the data source "Prometheus," which connects back to our previously configured Prometheus server. You now have a working dashboard for monitoring your FlashBlade or FlashArray and can further add on graphs as desired. You can also combine all three examples in a single prometheus.yml config to monitor Kubernetes, Starburst, and the FlashBlade with the same stack!

Summary

The full yaml for these examples can be found on the github repo here. The objective here was to describe the minimal setup necessary to build a persistent monitoring stack in Kubernetes with Prometheus and Grafana and understand how they interact. This scaffolding demonstrates how to configure these services as well as providing a useful starting point to quickly create a monitoring system. After learning the basics of Prometheus monitoring, you can start using the Prometheus operator, build more dashboards, and incorporate alertmanager.
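As an optional final check that all of the configured jobs are actually being scraped, the following small sketch queries Prometheus's HTTP API for target health. It assumes Prometheus is reachable on localhost:9090 (for example via a port-forward like the one used for Grafana above) and that the requests library is installed.

import requests

# Assumes Prometheus is reachable locally, e.g., via kubectl port-forward to port 9090.
PROMETHEUS_URL = "http://localhost:9090"

resp = requests.get(PROMETHEUS_URL + "/api/v1/targets", timeout=10)
resp.raise_for_status()

# Print the scrape job, scrape URL, and health ("up"/"down") for every active target.
for target in resp.json()["data"]["activeTargets"]:
    print(target["labels"]["job"], target["scrapeUrl"], target["health"])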
How to Deploy A Log Analytics Pipeline as-a-Service

This article originally appeared on Medium.com. It has been republished with permission from the author.

Collecting and indexing logs from servers, applications, and devices enables crucial visibility into running systems. A log analytics pipeline allows teams to debug and troubleshoot issues, track historical trends, or investigate security incidents. The most commonly deployed pipeline combines Kafka and Elasticsearch to create a reliable, scalable, and performant system to ingest and query data. The time it takes to deploy a new log pipeline is a key factor in whether a new data project will be successful. But both applications traditionally use converged infrastructure (similar to HDFS), which results in silos and management complexity due to stranded resources, expensive rebalance operations, and slow software upgrades. Kubernetes and disaggregated storage simplify Kafka and Elasticsearch clusters and are essential for scaling and operating log pipelines in production. Kubernetes makes deploying log pipelines as-a-service easy, with CSI dynamic volume provisioning allowing for easy scaling and adjusting of resources. More importantly, recently released features for both applications, Confluent Tiered Storage and Elastic Searchable Snapshots, use object store to fully disaggregate compute and storage in log pipelines. A cloud-native disaggregated pipeline architecture with fast object storage means:

More efficient resource usage by avoiding deploying extra nodes just to increase storage and no longer needing full replicas for data protection.
Faster failure handling by making pods (brokers or data nodes) near-stateless. With small, bounded amounts of storage attached to a pod, rebalance operations are orders of magnitude faster.
Support for fast historical searches with the predictable all-flash performance of FlashBlade®.

This blog post describes a helm chart that automates the deployment and configuration of a disaggregated log analytics pipeline based on Kafka and Elasticsearch. The diagram below illustrates the deployed pipeline architecture: Confluent Kafka and Elasticsearch PersistentVolumes orchestrated by Portworx while also using S3 buckets for long-term shared storage. The Portworx storage can be backed by local drives, FlashArray volumes, or FlashBlade NFS.

Why would you want to do this?

Log analytics as a service, so each team and project can create and operate independently with just the resources they need. The alternative is custom infrastructure silos for each team, all configured and managed slightly differently.
Easily scale up or down cluster resources (compute or storage) as needed and in a self-service manner.
Modify resource requirements without changing hardware, e.g., more compute for one cluster and less storage for another.
Run multiple heterogeneous clusters on a shared hardware pool.

The alternative to the cloud-native disaggregated architecture is a group of infrastructure silos, one for each application component. These silos present challenges as each needs a customized hardware profile (cores, storage), which drifts and changes over time. And if you use a separate software-defined object store, then that creates yet another hardware silo that needs to be managed. With Kubernetes and FlashBlade, we instead optimize for the time it takes to deploy your team's next production data pipeline.

How Shared Storage Simplifies aaS Log Pipelines

Shared storage powers as-a-service log pipelines in two key forms: Object Storage and remote PersistentVolumes.
Object storage requires application awareness to fully take advantage of a scalable, reliable, and performant object store like FlashBlade. In contrast, PersistentVolumes provide many of the benefits without requiring changes to the application; a remote PersistentVolume transparently replaces a local drive. You can also find a video demo illustrating how object storage simplifies operations of this log analytics pipeline as well as previous blogs on Simplifying Kafka with Confluent Tiered Storage and Elasticsearch Snapshots.

Object Storage

The ease of use, scalability, and prevalence of S3 object storage has resulted in a generation of applications re-architecting themselves from a converged model with direct-attached storage to a disaggregated model with shared storage. Object stores like AWS or FlashBlade scale performance and capacity linearly, moving storage management tasks out of the application so that additional nodes are not needed just to add and manage storage. With disaggregated object storage, adding or removing a node to either the Kafka or Elasticsearch cluster does not require rebalancing of the data on the remote object store. Instead, only logical pointers are updated. Further, software upgrades are simpler because if an application upgrade goes awry, the data is still safely stored on the object store. A key outcome of disaggregating the storage for both applications with objects is that you can now bound the amount of data local to a node, thereby bounding the amount of data to be rebalanced on a node failure. For example, if all nodes have at most 500GB of data on their PersistentVolume, then the rebalance time is the same whether your total dataset is 1TB or 100TB. As clusters grow, keeping rebalance times manageable is crucial to operational simplicity and reliable service.

This log analytics pipeline uses object storage for three different purposes:

Confluent Tiered Storage
Elastic Frozen Tier backed by Searchable Snapshots
Elasticsearch Snapshot Repository for data protection

One of the customized elements of the helm chart is a script that automates bucket creation and authentication on the object store. These are tasks that should be greatly simplified in the future as the Container Object Storage Interface comes to maturity. Finally, object store is also used for backing up Elasticsearch indices in case of accidental corruptions.

PersistentVolume Dynamic Provisioning

A second way that shared storage simplifies running log analytics pipelines is through dynamic provisioning of Persistent Volumes using a Container Storage Interface (CSI) plugin. In this pipeline, both Kafka and Elasticsearch use statefulsets that automate the creation and attachment to volumes using Portworx. The advantages of a remote PersistentVolume when compared to local storage are:

Provisioning of storage is decoupled from CPU and RAM, meaning that Kubernetes can schedule pods only considering CPU and RAM without introducing an additional constraint.
Pod and node failure recovery is orders of magnitude faster because Kubernetes will restart a failed pod on a different node while reattaching to the same remote volume, thus avoiding expensive rebalances.
Volumes can be dynamically grown as needed without the restrictions of physical drives and drive bays.

The rest of this post describes a helm chart to automatically install and configure a disaggregated log analytics pipeline in Kubernetes.
This helm chart is not intended for production use as-is but rather as a building block to help understand the advantages of disaggregated log pipelines and to jump start the deployment of new production pipelines.

Log Pipeline Components

This section describes the end-to-end components of our log analytics pipeline as installed by the helm chart. Most of the chart deploys templated yaml and is a straightforward exercise in Kubernetes deployments, but there are a few additional setup steps for configuring the FlashBlade and the Elasticsearch policies.

Prerequisites

The following assumptions are made by the Helm chart:

CSI driver or Portworx installed on Kubernetes
Elastic Cloud for Kubernetes (ECK) v1.5+ installed
Configured Elastic license (trial or enterprise license)
Helm v3 present

FlashBlade Configuration

The log pipeline requires several buckets for object storage, so this helm chart first creates the necessary S3 accounts, users, keys, and buckets on the target FlashBlade using a separate python script named s3manage. As a pre-install hook, this script enables creation of the necessary account and bucket before the rest of the software starts up. Access and secret keys for bucket access are stored as a Kubernetes secret that is later used to populate environment variables. This configuration via custom scripting is exactly the problem that the upcoming Container Object Storage Interface (COSI) standard addresses: a portable way of creating buckets and provisioning access to those buckets. My script automates provisioning on the FlashBlade, but we need to wait for COSI to create a portable approach that uses native Kubernetes concepts and that would also work with other object store backends.

Flog: Synthetic Log Generator

I include a synthetic load generator to demonstrate how data flows through the log pipeline. Flog is a fake log generator with Apache weblog-like output which can generate an infinite stream of pseudo-realistic data. To see an example of the output generated by flog, use the following docker run command:

> docker run -it --rm mingrammer/flog
…
137.97.114.3 - - [27/Aug/2020:19:50:11 +0000] "HEAD /brand HTTP/1.1" 416 16820
252.219.8.157 - - [27/Aug/2020:19:50:11 +0000] "PUT /maximize/synergize HTTP/1.0" 501 4208
…

Confluent Kafka

The helm chart configures a Kafka statefulset with an S3 Tiered Storage backend. Kafka is a reliable message queue that holds incoming log data before being processed and ingested by downstream systems. In most log pipelines, a message queue like Kafka buffers incoming data before ingestion by downstream systems like Elasticsearch. The result is that downtime or performance regressions in Elasticsearch do not result in dropped data. It also enables separate real-time applications to watch the same data stream. Confluent provides support and premium features on top of Kafka, including Tiered Storage, which utilizes an object store backend to more efficiently store topic data and keep the brokers lightweight. By making the Kafka brokers near-stateless, operations like scaling clusters up or down and handling node failures no longer need expensive rebalance operations. As an example, recovering from a broker failure with Tiered Storage takes seconds in comparison to hours or days without. Tiered Storage is a natural fit for Kubernetes because it limits the amount of state managed by the pods, making it easier to provision pods, migrate them, and scale the pod count up or down.
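To illustrate how an application other than flog could feed the same pipeline, here is a minimal, hypothetical sketch that publishes a weblog-style line to a Kafka topic using the confluent-kafka Python client. The bootstrap address and topic name are placeholders; substitute the Kafka service name and topic configured in your deployment of the chart.

from confluent_kafka import Producer

# Placeholder bootstrap address; replace with your Kafka service and port.
producer = Producer({"bootstrap.servers": "kafka.example.svc.cluster.local:9092"})

def delivery_report(err, msg):
    # Surface delivery failures so dropped messages are visible to the sender.
    if err is not None:
        print("Delivery failed: {}".format(err))

log_line = '203.0.113.7 - - [27/Aug/2020:19:50:11 +0000] "GET /health HTTP/1.1" 200 512'
# "weblogs" is a placeholder topic name; use the topic your downstream consumer reads from.
producer.produce("weblogs", value=log_line.encode("utf-8"), callback=delivery_report)
producer.flush()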
FileBeats

The next step in a log analytics pipeline is to ingest data from Kafka into Elasticsearch for indexing and ad hoc querying. There are many ways to glue these two services together, including Apache Spark or the Kafka Connect Elasticsearch Sink. For basic ease of configuration, I chose to use Filebeats to pull data from Kafka to Elasticsearch.

Elasticsearch

Elasticsearch is a flexible and powerful unstructured database for high-performance ad hoc queries on log data. Just like with Confluent Tiered Storage, Elastic has recently released a Frozen Tier backed by Searchable Snapshots as a way to offload the bulk of indexed data to an object store. Disaggregating the hot tier indexing from the bulk storage enables an Elasticsearch cluster to scale resource usage independently as well as simplifying failure scenarios by making data nodes lightweight and faster to rebalance. This helm chart configures Elasticsearch to leverage a Frozen Tier for the bulk of its storage needs as well as a separate snapshot repository to protect index data from accidental corruptions. These Index Lifecycle Management (ILM) and Snapshot Lifecycle Management (SLM) policies are configured in the helm chart via a post-install hook.

Example values.yaml File

The following is an example values.yaml file that deploys a log pipeline using Portworx for NVMe Kafka PersistentVolumes and FlashBlade NFS for the Elasticsearch PersistentVolumes. The FlashBlade S3 credentials and buckets are automatically created by using the FlashBlade API token.

flashblade:
  datavip: "10.62.64.200"
  mgmtvip: "10.62.64.20"
  token: "T-XXXXXX-YYYYY-ZZZZ-QQQQQ-813e5a9c1222"
zookeeper:
  storageclass: "px-nvme"
kafka:
  cpVersion: 6.1.1
  storageclass: "px-nvme"
  nodecount: 4
elasticsearch:
  nodecount: 6
  version: 7.12.1
  storageclass: "pure-file"
beats:
  nodecount: 12
flog:
  nodecount: 1

How to Adapt a Log Pipeline For Your Use Case

The helm chart configures a log pipeline with synthetic log data. To adapt for real data sources, you need to make a few key changes and then optionally tweak some parameters:

Disable the flog generators and replace them with real data sources sent to a topic in Kafka
Edit the filebeats configmap and change the "topics" setting to reflect your real topic(s)
Edit the node counts in values.yaml to achieve the needed indexing performance
Modify the snapshot policy (SLM) in post-install-es-snaps.yaml to meet your protection/recovery requirements

I would recommend forking my helm chart or rendering it locally and then making the necessary changes to build towards a production use-case.

Storage Usage Visualized

After running the log pipeline for over a day, the Kibana monitoring dashboard shows the impact of disaggregation in the pipeline. In the cluster summary below, there are ~6 billion documents indexed and 700GB of total data. But this count of data only includes the local storage across data nodes, not the data on the Frozen Tier. Looking at the FlashBlade bucket configured for the Frozen Tier, we see that there is an additional 1.95 TB of data stored here. There is a further 10% in space savings due to the FlashBlade's inline compression. An examination of the indices shows that filebeat rolls after reaching 50GB in size. Due to the Frozen Tier, there are two other things to notice. First, indexes are renamed to add the "partial-" prefix once they are moved to the Frozen Tier, and second, replica shards are not stored on the Frozen Tier.
This enables more efficient space usage; instead of relying on storing multiple full copies, the FlashBlade internally uses parity coding to protect against data loss with less overhead. Looking more closely at an index on the Frozen Tier shows zero space usage. This means that the index takes up no space on a data node's PersistentVolume and is instead entirely resident on the S3 snapshot repository. Shifting to the FlashBlade performance graphs, you can see the NFS traffic (top) which corresponds to the indexing activity on PersistentVolumes. The second graph shows the associated S3 write traffic as indices are finished and migrated to the Frozen Tier.

[Figure: Write (orange) and Read (blue) performance for Elasticsearch ingest]
[Figure: Write spikes to S3 as indices are moved to the Frozen Tier]

Queries against the Frozen Tier take advantage of the FlashBlade's all-flash performance. Querying historical data now benefits from linearly-scaling performance along with the simplicity and efficiency of FlashBlade. In the screenshot below, the FlashBlade shows up to 6.5 GB/s reads from the S3 bucket during a simple match query. The result is that 4.8TB of index data can be searched in 6 seconds! And finally, by looking at the storage usage in the Confluent Control Center, you can see a similar breakdown of data local to the brokers and data stored on the object store. So while 1.3TB of data is currently in my Kafka instance, I would only need to rebalance up to 230MB of data to handle node failures or cluster scaling.

Conclusion

Log analytics pipelines with Kafka and Elasticsearch ensure the ingestion and searchability of a wide variety of log data and enable use-cases like fraud detection, performance troubleshooting, and threat hunting. Creating these pipelines on-demand for different teams and projects requires an as-a-service platform like Kubernetes and disaggregated storage. Object storage and dynamic PersistentVolumes simplify the provisioning and operation of these pipelines. Portworx and FlashBlade make it easy to provide Kubernetes-native storage for both Kafka and Elasticsearch, allowing you to quickly scale clusters up or down as well as growing volumes as needed. FlashBlade provides an object storage backend for Confluent Tiered Storage and Elastic's Frozen Tier with Searchable Snapshots. Using FlashBlade object storage limits the overhead and complexity of rebalancing across nodes when clusters scale or experience node failures.

Understanding Deduplication Ratios
It's super important to understand where deduplication ratios come from in relation to backup applications and data storage. Deduplication prevents the same data from being stored again, lowering the data storage footprint. In environments hosting virtual machines, as on FlashArray//X™ and FlashArray//C™, you can see tremendous amounts of native deduplication due to the repetitive nature of these environments. Backup applications and targets have a different makeup. Even still, deduplication ratios have long been a talking point in the data storage industry and continue to be a decision point and factor in buying cycles. Data Domain pioneered this tactic to overstate its effectiveness, leaving customers thinking the vendor's appliance must have a magic wand to reduce data by 40:1. I wanted to take the time to explain how deduplication ratios are derived in this industry and the variables to look for in figuring out exactly what to expect in terms of deduplication and data footprint. Let's look at a simple example of a data protection scenario.

Example: A company has 100TB of assorted data it wants to protect with its backup application. The necessary and configured agents go about doing the intelligent data collection and send the data to the target. Initially, and typically, the application will leverage both software compression and deduplication. Compression by itself will almost always yield a decent amount of data reduction. In this example, we'll assume 2:1, which would mean the first data set goes from 100TB to 50TB. Deduplication doesn't usually do much data reduction on the first baseline backup. Sometimes there are some efficiencies, like the repetitive data in virtual machines, but for the sake of this generic example scenario, we'll leave it at 50TB total.

So, full backup 1 (baseline): 50TB

Now, there are scheduled incremental backups that occur daily from Monday to Friday. Let's say these daily changes are 1% of the aforementioned data set. Each day, then, there would be 1TB of additional data stored. 5 days at 1TB = 5TB. Let's add the compression in to reduce that 2:1, and you have an additional 2.5TB added. 50TB baseline plus 2.5TB of unique blocks means a total of 52.5TB of data stored. Let's check the deduplication ratio now. The logical data protected is the 100TB baseline plus 5TB of daily changes, or 105TB, so:

105TB/52.5TB = 2x

You may ask: "Wait, that 2:1 is really just the compression? Where is the deduplication?" Great question, and the reason why I'm writing this blog. Deduplication prevents the same data from being stored again. With a single full backup and incremental backups, you wouldn't see much more than just the compression. Where deduplication measures impact is in the assumption that you would be sending duplicate data to your target. This is usually discussed as data under management. Data under management is the logical data footprint of your backup data, as if you were regularly backing up the entire data set, not just changes, without deduplication or compression. For example, let's say we didn't schedule incremental backups but scheduled full backups every day instead. Without compression/deduplication, the data load would be 100TB for the initial baseline and then the same 100TB plus the daily growth.

Day 0 (baseline): 100TB
Day 1 (baseline+changes): 101TB
Day 2 (baseline+changes): 102TB
Day 3 (baseline+changes): 103TB
Day 4 (baseline+changes): 104TB
Day 5 (baseline+changes): 105TB
Total, if no compression/deduplication: 615TB

This 615TB total is data under management.
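To make this arithmetic easy to replay with your own numbers, here is a small illustrative Python sketch of the one-week scenario above, using the same values as the example (100TB of source data, 2:1 compression, 1% daily change, five incrementals).

# One-week example: 100TB source, 2:1 compression, 1% daily change, 5 incrementals.
source_tb = 100
compression = 2.0
daily_change = 0.01
incremental_days = 5

# Physically stored data: compressed baseline plus compressed daily changes.
stored_tb = source_tb / compression + (source_tb * daily_change * incremental_days) / compression

# Data under management: as if a full, uncompressed, undeduplicated copy were kept every day.
data_under_mgmt_tb = sum(source_tb + source_tb * daily_change * day for day in range(incremental_days + 1))

print("stored:", stored_tb, "TB")                           # 52.5 TB
print("data under management:", data_under_mgmt_tb, "TB")   # 615.0 TB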
Now, if we look at our actual, post-compression/post-dedupe number from before (52.5TB), we can figure out the deduplication impact: 615/52.5 = 11.714x

Looking at this over a 30-day period, you can see how the dedupe ratios can get really aggressive. For example:

100TB x 30 days = 3,000TB, plus (1TB x 30 days) = 3,030TB
3,030TB / 65TB (actual data stored) = 46.62x dedupe ratio

In summary, for 100TB with a 1% change rate over 1 week:

Full backup + daily incremental backups = 52.5TB stored, and a 2x DRR
Full daily backups = 52.5TB stored, and an 11.7x DRR

That is how deduplication ratios really work—it’s a fictional function of “what if dedupe didn’t exist, but you stored everything on disk anyway” scenarios. They’re a math exercise, not a reality exercise. Front-end data size, daily change rate, and retention are the biggest variables to look at when sizing or understanding the expected data footprint and the related data reduction/deduplication impact.

In our scenario, we’re looking at one particular data set. Most companies will have multiple data types, and there can be even greater redundancy when accounting for full backups across those as well. So while it matters, consider that a bonus.
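To extend the same math to other retention windows, the small function below generalizes it. It mirrors the post's simplified accounting (a retained daily full plus daily changes, with growth not compounded), so it reproduces the 65TB and roughly 46.6x figures for the 30-day case.

```python
# Generic version of the post's dedupe-ratio math. Illustrative only; real
# reduction depends heavily on data type, change rate, and retention policy.

def dedupe_ratio(front_end_tb, change_rate, days, compression):
    # Physical footprint: one compressed baseline plus compressed unique changes.
    stored = front_end_tb * (1 + change_rate * days) / compression
    # Logical "data under management": a full copy counted for every retained day,
    # plus the daily changes (the post's simplified, non-compounding accounting).
    logical = front_end_tb * days + front_end_tb * change_rate * days
    return stored, logical / stored

stored, ratio = dedupe_ratio(front_end_tb=100, change_rate=0.01, days=30, compression=2.0)
print(f"{stored:.1f} TB stored, {ratio:.1f}x reported dedupe ratio")  # 65.0 TB, 46.6x
```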
Ask Us Everything Recap: Making Purity Upgrades Simple

At our recent Ask Us Everything session, we put a spotlight on something every storage admin has an opinion about: software upgrades. Traditionally, storage upgrades have been dreaded — late nights, service windows, and the fear of downtime. But as attendees quickly learned, Pure Storage Purity upgrades are designed to be a very different experience. Our panel of Pure Storage experts included our host Don Poorman, Technical Evangelist, and special guests Sean Kennedy and Rob Quast, Principal Technologists. Here are the questions that sparked the most conversation, and the insights our panel shared.

“Are Purity upgrades really non-disruptive?”

This one came up right away, and for good reason. Many admins have scars from upgrade events at other vendors. Pure experts emphasized that non-disruptive upgrades (NDUs) are the default. With thousands performed in the field — even for mission-critical applications — upgrades run safely in the background. Customers don’t need to schedule middle-of-the-night windows just to stay current.

“Do I need to wait for a major release?”

Attendees wanted to know how often they should upgrade, and whether “dot-zero” releases are safe. The advice: don’t wait too long. With Pure’s long-life releases (like Purity 6.9), you can stay current without chasing every new feature release. And because Purity upgrades are included in your Evergreen subscription, you’re not paying extra to get value — you just need to install the latest version. Session attendees found this slide helpful, illustrating the different kinds of Purity releases.

“How do self-service upgrades work?”

Admins were curious about how much they can do themselves versus involving Pure Storage support. The good news: self-service upgrades are straightforward through Pure1, but you’re never on your own. Pure Technical Services knows that you're running an upgrade, and if an issue arises, you’re automatically moved to the front of the queue. If you want a co-pilot, then of course Pure Storage support can walk you through it live. Either way, the process is fast, repeatable, and built for confidence. Upgrading your Purity version has never been easier, now that Self Service Upgrades lets you modernize on your schedule.

“Why should I upgrade regularly?”

This is where the conversation shifted from fear to excitement. Staying current doesn’t just keep systems secure — it unlocks new capabilities like:

Pure Fusion™: a unified, fleet-wide control plane for storage.
FlashArray™ Files: modern file services, delivered from the same trusted platform.
Ongoing performance, security, and automation enhancements that come with every release.

One attendee summed it up perfectly: “Upgrading isn’t about fixing problems — it’s about getting new toys.”

The Takeaway

The biggest lesson from this session? Purity upgrades aren’t something to fear — they’re something to look forward to. They’re included with your Evergreen subscription, they don’t disrupt your environment, and they unlock powerful features that make storage easier to manage. So if you’ve been putting off your next upgrade, take a fresh look. Chances are, Fusion, Files, or another feature you’ve been waiting for is already there — you just need to turn it on.

👉 Want to keep the conversation going? Join the discussion in the Pure Community and share your own upgrade tips and stories. Be sure to join our next Ask Us Everything session, and catch up with past sessions here!
Pure Storage Delivers Critical Cyber Outcomes, Part Two: Fast Analytics

“We don’t have storage problems. We have outcome problems.” - Pure customer in a recent cyber briefing

No matter what we are buying, what we are really buying is a desired outcome. If you buy a car, you are buying some sort of outcome or multiple outcomes: Point A to Point B, comfort, dependability, seat heaters, or if you are like me, a real, live Florida Man, seat coolers! The same is true when solving for cyber outcomes, and a storage foundation that drives cyber resilience is often overlooked. A strong storage foundation improves data security, resilience, and recovery. With these characteristics, organizations can recover in hours instead of days. Here are some of the top cyber resilience outcomes Pure Storage is delivering:

Native, Layered Resilience
Fast Analytics
Rapid Restore
Enhanced Visibility

We tackled Layered Resilience in the first post of this series, but what about Fast Analytics? Fast Analytics refers to storing logs natively so they can be reviewed to identify possible anomalies and other potential threats to an environment. This is a category of outcomes that has largely been moved to the cloud, by the vendors themselves and, therefore, also by their customers, but it is now seeing a repatriation trend back to on-premises.

Why is repatriation occurring in this space?

This is a trend we are seeing in larger enterprises due to rising ingest rates and runaway log growth. It is more important than ever to discover attacks as soon as possible. The rising costs of downtime and of the work required to recover make every attack more costly than the last. To discover anomalies quickly, logs must be interrogated as fast as possible. To keep up, vendor solutions have beefed up the compute behind their cloud offerings. Next-gen SIEM is moving from the classic, static rules model to an AI-driven, adaptive set of rules, geared toward evolving on the fly in order to detect issues as quickly as possible. To deliver that outcome, you need a storage platform that delivers the fastest possible reads. As stated, vendors attempt to do this in their cloud offerings by raising compute performance. But what we see enterprises dealing with is the rising cost of these solutions in the cloud.

How is this affecting these customers?

As organizations ingest more log and telemetry data (driven by cloud adoption, endpoint proliferation, and compliance), costs soar due to vendors’ reliance on ingest-based and workload-based pricing. More data means larger daily ingestion, rapidly pushing customers into higher pricing tiers and resulting in substantial cost increases if volumes are not carefully managed. Increasing needs for real-time anomaly detection translate to greater compute demands and more frequent queries, which, in workload-based models, triggers faster consumption of compute credits and higher overall bills. To control costs, many organizations limit which data sources they ingest or perform data tiering, risking reduced visibility and slower detection for some threats.

How does an on-premises solution relieve some of these issues?

An on-premises solution such as Pure Storage FlashBlade offers the power of all-flash and fast reads to detect anomalies sooner and support the dynamic aspects of next-gen SIEM tools, while also offering more control over storage growth and the associated costs, without sacrificing needed outcomes.
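As a deliberately simplified illustration of what it means to interrogate logs for anomalies, the sketch below flags minutes whose event count spikes far above a rolling baseline. This is not any vendor's SIEM logic; the telemetry, window, and threshold are hypothetical, and real adaptive detection is far richer, which is precisely why read speed from the log store matters so much.

```python
# Toy anomaly check: flag minutes whose event count is far above a rolling mean.
from statistics import mean, stdev

def flag_anomalies(counts_per_minute, window=30, threshold=4.0):
    alerts = []
    for i in range(window, len(counts_per_minute)):
        baseline = counts_per_minute[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline) or 1.0   # avoid dividing by zero on a flat baseline
        z = (counts_per_minute[i] - mu) / sigma
        if z > threshold:
            alerts.append((i, counts_per_minute[i], round(z, 1)))
    return alerts

# Hypothetical telemetry: steady failed-login counts with one sudden burst.
series = [20] * 60 + [400] + [25] * 10
print(flag_anomalies(series))  # the burst at minute 60 is flagged
```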
Our partnership with Splunk, for example, allows customers to retain more logs for richer analysis, run more concurrent queries in less time, and test new analyses and innovate faster.

Visual 1: Snazzy, high-level look at Fast Analytics with our technology alliance partners

Customers at our annual user extravaganza, Accelerate, told us about their process of bringing their logs back on-prem in order to address some of these issues. One customer in particular, Fiserv, told its story in our Cyber Resilience breakout session, where we spoke about what to do before, during, and after an attack, specifically in the area of visibility, where the race is on to identify threats faster. They described their desire to rein in the cost of growth and regain control of their environment.

There is nothing wrong with cloud solutions, but the economics of scaling those solutions have had real-world consequences, and bringing those workloads back on-prem, to a proven, predictable platform for performance, is beginning to look like the better long-term strategy in the ongoing fight for cybersecurity and resilience. On-premises storage is a valuable tool for managing the financial impact of growing data ingestion and analytics needs: it supports precision data management, retention policy enforcement, and right-sized infrastructure, while reducing expensive cloud subscription fees for long-term, large-scale operations.

Exit question: Are you seeing these issues developing in your log strategies? Are you considering on-premises for your log workloads today?

Jason Walker is a technical strategy director for cyber-related areas at Pure Storage and a real, live Florida Man. No animals or humans, nor the author himself, were injured in the creation of this post.
Configuring Apache Spark on FlashBlade, Part 3: Tuning for True Parallelism

This post will explore how to diagnose and resolve performance bottlenecks that are not related to storage I/O, ensuring you can take full advantage of the high-performance, disaggregated architecture of FlashBlade. We'll use a real-world scenario to illustrate how specific tuning can unlock massive parallelism.
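The full tuning discussion belongs to that post, but as a hedged preview of the kinds of knobs such tuning typically involves, here is a minimal PySpark sketch that points the S3A connector at a FlashBlade bucket and spreads work across executors, cores, and shuffle partitions. The endpoint, bucket, and sizing values are placeholders rather than recommendations, and the hadoop-aws/S3A dependencies are assumed to be available on the classpath.

```python
# Minimal PySpark sketch (placeholder endpoint, bucket, and sizing values).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("parallelism-tuning-sketch")
    # Point the S3A connector at a FlashBlade data VIP (assumed endpoint).
    .config("spark.hadoop.fs.s3a.endpoint", "https://flashblade.example.com")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    # Spread work across executors and cores instead of a few oversized tasks.
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "4")
    # Default shuffle partition counts are a common non-storage bottleneck.
    .config("spark.sql.shuffle.partitions", "256")
    .getOrCreate()
)

# Hypothetical dataset path; the point is that read parallelism is now bounded
# by Spark's task layout, not by the storage backend.
df = spark.read.parquet("s3a://demo-bucket/events/")
print(df.rdd.getNumPartitions())
```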