The bane of every IT person’s existence is scheduled downtime. Downtime has created a situation in most organizations where, from the CxO on down, the process to institute a change is beyond bureaucratic. This process makes most days in Congress seem like a cakewalk.
Much of the issue with downtime doesn’t stem from the change or update itself; it’s about limiting the impact on businesses that run 24×7. Healthcare is a perfect example. We’re not avoiding human error; we’re avoiding inconvenience.
The cost of maintenance for systems is huge. We’re stuck in a difficult situation where downtime is needed to maintain system operation and security, but downtime can’t be obtained due to business requirements. To approach “zero downtime”, organizations implement hot standby systems and duplicate hardware and supporting infrastructure just to avoid having a system offline. Organizations that can’t afford to do that simply absorb the downtime as lost productivity.
This is why cloud makes sense for so many organizations. Properly built, cloud is generally always-on and always reliable, offering a measure of assurance most organizations can’t afford to achieve on their own. The cost savings alone make cloud a viable alternative for many solutions.
At Extreme, we realize that when you are managing 1M devices in a large multi-tenant environment, downtime is something that will never be tolerated. That’s why over the past several months, along with developing and deploying our 4th generation cloud technologies, we’ve also made amazing strides to eliminate maintenance downtime.
Beginning with the Q1r3 release of ExtremeCloud IQ, we implemented a new method of updating production without ever taking the application offline! To achieve this, we’ve had to do several things within the architecture and within our development discipline. Let me explain.
ExtremeCloud IQ is a 100% agile development project. We code in agile sprints, making many progressive updates to the product over time. Sometimes those updates require database schema changes, and it was those schema updates that were the cause of our previous scheduled maintenance. When you’re making changes at cloud speed, those schema updates add up to a lot of scheduled system interruption.
ExtremeCloud IQ uses two major types of database systems. For all of the structured data that we consume, we use the tried-and-true SQL-based solutions that many of you are familiar with. This structured data consists of elements like authentication information, configuration objects, or the inventory list of managed devices. Most of the other data that we process falls into the category of unstructured data, such as monitoring statistics, events, alarms, and AI/ML data. All of this unstructured data is stored using Elasticsearch.
Elasticsearch is brilliant in that it stores data inside of indexes and those indexes can be expanded with new fields anytime. The application won’t care about what we do to the Elasticsearch indexes, just so long as we don’t delete anything.
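To picture why added fields are harmless, here’s a minimal sketch. It does not use the Elasticsearch client at all: plain Python dictionaries stand in for indexed documents, and the field names (`device`, `severity`, `ml_score`) and the `summarize_event` function are made up for illustration.

```python
def summarize_event(doc):
    # Reads only the fields this version of the code knows about;
    # any extra fields in the document are simply ignored.
    return f"{doc['device']}: {doc['severity']}"


# A document written before the index gained a new field.
old_doc = {"device": "AP-17", "severity": "minor"}

# A document written after a hypothetical "ml_score" field was added.
new_doc = {"device": "AP-17", "severity": "minor", "ml_score": 0.93}

# A document missing a field the reader depends on (the "deleted" case).
broken_doc = {"device": "AP-17"}

old_summary = summarize_event(old_doc)
new_summary = summarize_event(new_doc)

try:
    summarize_event(broken_doc)
    deletion_is_safe = True
except KeyError:
    deletion_is_safe = False
```

The reader produces the same summary for both the old and the expanded document, and only fails when a field it relies on has been removed.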
SQL database schemas are much more persnickety. A database schema change is not a graceful event unless the application and the queries it makes can handle differences in the database schema correctly. Imagine making a query expecting 7 fields back, but you get 10, or expecting a field to be “X” length, but now it’s “Y” length. These changes can create quite a kerfuffle with a standard application, which is why we previously shut down the application before applying database updates. Despite us having this process down to a science and the process itself heavily automated, it still took 30 minutes to accomplish, which was a 30-minute inconvenience for our customers.
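The “expected 7 fields, got 10” problem is easy to reproduce on a small scale. The sketch below uses Python’s built-in `sqlite3` module with a made-up `devices` table; it is only an illustration of the failure mode, not ExtremeCloud IQ code or its actual database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE devices (id INTEGER PRIMARY KEY, serial TEXT, model TEXT)"
)
conn.execute("INSERT INTO devices (serial, model) VALUES ('SN-001', 'AP-A')")


def fragile_read(conn):
    # SELECT * plus positional unpacking assumes exactly three columns.
    device_id, serial, model = conn.execute("SELECT * FROM devices").fetchone()
    return serial


def tolerant_read(conn):
    # Naming the columns pins the result shape, whatever gets added later.
    device_id, serial, model = conn.execute(
        "SELECT id, serial, model FROM devices"
    ).fetchone()
    return serial


before = fragile_read(conn)  # works against the original schema

# Schema change: a new column appears while the application is running.
conn.execute("ALTER TABLE devices ADD COLUMN site TEXT DEFAULT 'unassigned'")

try:
    fragile_read(conn)
    fragile_broke = False
except ValueError:
    # Four values came back where three were expected.
    fragile_broke = True

after = tolerant_read(conn)  # still works against the new schema
```

The only difference between the two readers is whether the query states which columns it wants, and that difference decides whether a live schema change is survivable.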
Eliminating that maintenance window took two things. First, with our ISO 27001 certification, we’ve amassed hundreds of pages of operational documentation and processes that require developers to document their proposed schema changes. Any proposed schema change for a particular release sprint is vetted by several people, including the Cloud Operations Distinguished Engineer. No database change is ever undertaken without explicit vetting and approval.
Secondly, we’ve engineered ExtremeCloud IQ to be forward and backward compatible. This is actually quite a feat. Every line of code in the application that interacts with a database connection must anticipate that it may interface with a new schema and it must gracefully process it. In addition, the databases involved have to be backward compatible such that during the upgrade process the legacy application can still function. This is a ballet of code, conditional processing, and operating processes that come together to create a situation of zero downtime updates.
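As a rough illustration of what “anticipating a new schema” can look like in code, here’s a minimal sketch using Python’s built-in `sqlite3`. Everything in it is an assumption for the example: the `devices` table, the `site` column, and the `save_device` and `load_devices` helpers are hypothetical, and the real application may handle compatibility quite differently.

```python
import sqlite3


def columns(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def save_device(conn, serial, site=None):
    """Hypothetical writer that works against schema N and schema N+1."""
    if site is not None and "site" in columns(conn, "devices"):
        conn.execute(
            "INSERT INTO devices (serial, site) VALUES (?, ?)", (serial, site)
        )
    else:
        # Older schema (or no site given): write only what the table has.
        conn.execute("INSERT INTO devices (serial) VALUES (?)", (serial,))


def load_devices(conn):
    """Hypothetical reader that returns the same shape on either schema."""
    if "site" in columns(conn, "devices"):
        query = "SELECT serial, site FROM devices ORDER BY id"
    else:
        query = "SELECT serial, NULL FROM devices ORDER BY id"
    return [
        {"serial": serial, "site": site or "unassigned"}
        for serial, site in conn.execute(query)
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, serial TEXT NOT NULL)")

save_device(conn, "SN-001", site="HQ")  # schema N: site cannot be stored yet

# Schema N+1: the new column is nullable, so legacy-style writes still succeed.
conn.execute("ALTER TABLE devices ADD COLUMN site TEXT")
save_device(conn, "SN-002", site="Lab")
conn.execute("INSERT INTO devices (serial) VALUES ('SN-003')")  # legacy write

rows = load_devices(conn)
```

Two choices carry the compatibility here: the application checks what the schema offers before using it, and the schema change is additive and nullable so that writers unaware of it are not rejected.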
In the diagram below, we start out on the left with the ExtremeCloud IQ application and the supporting databases operating normally. In the first step, our cloud operations team will apply fully vetted and authorized DB schema updates to the active databases, creating a DB version of “N+1”. Meanwhile, utilizing the forward compatibility we’ve engineered into the application, the older version of ExtremeCloud IQ continues to run using the new database schema. The backward compatibility of the new “N+1” DB schema allows it to accept and process new transactions from the legacy application format.
Next, we apply updates to the ExtremeCloud IQ application itself. In 4th generation cloud, this update involves booting and swapping in new containers that are orchestrated via the Kubernetes infrastructure. We do this in a graceful fashion, replacing a portion of the operating Kubernetes Pods, and then failing active connections over to those upgraded instances. We then upgrade the remaining Pods, creating an application version “N+1” that is connected to the new supporting DB schema.
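The batch-by-batch idea can be modeled in a few lines of plain Python. This is a toy simulation only: it does not call the Kubernetes API, and the `Pod` class, the `rolling_update` function, and the connection counts are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    version: str
    connections: int = 0


def rolling_update(pods, new_version, batch_size=1):
    """Upgrade pods batch by batch, moving connections to pods still serving.

    Returns the lowest number of serving pods seen at any point.
    """
    min_serving = len(pods)
    for start in range(0, len(pods), batch_size):
        batch = pods[start:start + batch_size]
        serving = pods[:start] + pods[start + batch_size:]
        if not serving:
            raise RuntimeError("batch_size would take every pod offline")
        min_serving = min(min_serving, len(serving))
        # Drain: hand the batch's connections to pods that stay up.
        for i, pod in enumerate(batch):
            serving[i % len(serving)].connections += pod.connections
            pod.connections = 0
        # Swap: the drained pods come back on the new version.
        for pod in batch:
            pod.version = new_version
    return min_serving


pods = [Pod(f"pod-{i}", "N", connections=10) for i in range(4)]
min_serving = rolling_update(pods, "N+1", batch_size=2)
versions = {pod.version for pod in pods}
total_connections = sum(pod.connections for pod in pods)
```

With four pods and a batch size of two, the second batch’s connections land on the pods that were upgraded first, no connection is dropped, and at least two pods are serving at every step.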
Finally, we apply DB cleanup and housekeeping routines to set the new standard for both the application and the database.
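Put together, the sequence is: update the schema, upgrade the application, then clean up. Here’s a small end-to-end sketch of that ordering using Python’s built-in `sqlite3`. The table, the `site` column, and the two writer functions are hypothetical, and the backfill in the last step is just one assumed example of a housekeeping routine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, serial TEXT NOT NULL)")
conn.execute("INSERT INTO devices (serial) VALUES ('SN-001')")


def app_v_n_write(conn, serial):
    # Legacy application (version N): knows nothing about the new column.
    conn.execute("INSERT INTO devices (serial) VALUES (?)", (serial,))


def app_v_n1_write(conn, serial, site):
    # Upgraded application (version N+1): populates the new column.
    conn.execute(
        "INSERT INTO devices (serial, site) VALUES (?, ?)", (serial, site)
    )


# Step 1: additive schema update gives DB version N+1.
# The legacy application keeps writing without any change.
conn.execute("ALTER TABLE devices ADD COLUMN site TEXT")
app_v_n_write(conn, "SN-002")

# Step 2: the application is upgraded to N+1 and starts using the new column.
app_v_n1_write(conn, "SN-003", "Lab")

# Step 3: cleanup once no version-N instance remains.
# Assumed example: backfill rows written before the upgrade.
conn.execute("UPDATE devices SET site = 'unassigned' WHERE site IS NULL")

final_rows = conn.execute(
    "SELECT serial, site FROM devices ORDER BY id"
).fetchall()
```

The ordering matters: the cleanup only runs after the last legacy writer is gone, so nothing the old application depends on changes while it is still running.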
With an application as large and complex as ExtremeCloud IQ, there are lots of moving pieces, and the process I just explained has been heavily simplified. However, when those pieces move in a well-orchestrated fashion, we can accomplish a zero-downtime upgrade. With the recently deployed Q1r3 release (released the week of April 13, 2020), we achieved our first production deployment using this methodology, and we will continue to use it!
Here’s to the future, and one without downtime, brought to you by your friendly folks in cloud operations. Imagine what 5th generation cloud will bring! #cloudspeed #extremecloudiq