When IT Systems Fail: How to Ensure a Fault Tolerant Medical Practice
When it comes to downtime, many independent medical practices have an extremely unforgiving customer base compared with other small and medium-sized businesses. Patients expect their providers to be available and working efficiently in their time of need. When IT systems fail, leaving a practice struggling to serve patients effectively, a natural customer response is to question the competency of the practice. After all, “If they can’t keep their business running, how can I expect them to keep me running?”
When IT systems fail, customer satisfaction suffers, employee productivity (and thus profitability) declines, and the risk of errors, data loss, or information breach increases. Anticipating a system failure and architecting an environment that minimizes its impact on the medical practice is both strategic and good business practice.
Practices with several locations in the same geographic area are uniquely able to mitigate IT system and facilities downtime. By designing their business processes and IT systems to be agile, these practices can react to failures and reallocate resources quickly, continuing to care for patients even if an entire site is taken offline. This article will explore the best ways to continue serving patients even when faced with system failure.
Which Systems Matter?
First, we need to understand which systems are essential to practice operations. Because fault tolerance (the ability of a system to continue functioning after a component failure) is costly, it’s prudent to identify which IT systems and business processes are truly mission critical. Some choices are obvious: EMR, telephones, and patient check-in systems are all essential. Others fall into a grey area. What would happen if ePrescribe or faxing were unavailable for a day? What about printing? Develop a list of all IT and facility systems that are used by the practice, then rank each system on a priority scale of 1 to 3: P1 systems are mission critical, P2 systems would cause a major process interruption, and P3 systems are for convenience. Here’s an example list:
For each P1 system, fault tolerance is required. For P2 systems, a cost/benefit analysis should be performed to determine whether the cost of service interruption is outweighed by the cost of protection. In performing the analysis, remember to consider customer loss and employee productivity. Generally, fault tolerance is not needed for P3 systems.
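The decision rule described above can be sketched in a few lines of Python. This is only an illustration, not a prescribed method: the priorities assigned to ePrescribe, faxing, and printing, the dollar figures, and the names System and needs_fault_tolerance are all hypothetical placeholders that a practice would replace with its own inventory and estimates.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    priority: int  # 1 = mission critical, 2 = major interruption, 3 = convenience
    # Hypothetical annual estimates, only meaningful for P2 systems.
    # The outage cost should include customer loss and lost productivity.
    outage_cost_per_year: float = 0.0
    protection_cost_per_year: float = 0.0


def needs_fault_tolerance(system: System) -> bool:
    """Apply the P1 / P2 / P3 rule from the text."""
    if system.priority == 1:
        return True  # P1: always protect
    if system.priority == 2:
        # P2: protect only when an outage costs more than the protection
        return system.outage_cost_per_year > system.protection_cost_per_year
    return False  # P3: generally not protected


# Illustrative inventory; priorities and costs are assumptions.
systems = [
    System("EMR", 1),
    System("Telephones", 1),
    System("Patient check-in", 1),
    System("ePrescribe", 2, outage_cost_per_year=12000, protection_cost_per_year=4000),
    System("Faxing", 2, outage_cost_per_year=1500, protection_cost_per_year=3000),
    System("Printing", 3),
]

protected = [s.name for s in systems if needs_fault_tolerance(s)]
unprotected = [s.name for s in systems if not needs_fault_tolerance(s)]
print("Protect with fault tolerance:", protected)
print("Cover with a continuity plan instead:", unprotected)
```

Systems that land in the second group are the ones that need a documented business continuity process rather than system-level protection.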
Protecting Each System
With a list of critical systems in hand, the next step is to develop plans to protect each one. This is a task that requires close collaboration between the practice management team, the IT team, and the facilities team. For each IT system analyzed as part of this process, a plan must be developed to ensure that the system is always (1) available, (2) functional, and (3) accessible.
Available
For an IT system to be available, its underlying hardware and software must be fault tolerant. This means removing single points of failure in each system layer, from power and cooling to processing and storage. Several approaches can achieve this result.
- The most straightforward method involves outsourcing the entire application to a cloud service provider. In this case, the vendor is responsible for delivering a fault-tolerant environment, and will provide a service level agreement describing its availability guarantee.
- Another approach involves outsourcing a portion of the environment, usually the expensive-to-build “hosted computing” layer, and maintaining private control and management of the application itself. This approach can yield cost savings and flexibility benefits compared to the cloud service model, but requires a more skilled IT team to execute successfully.
- The third approach involves building the necessary fault-tolerant systems in-house and maintaining them privately. This approach is generally not cost effective for any but the largest private practices.
Each approach will, if implemented properly, provide the same result: failures within customer facilities, telecommunication carrier networks, electric utilities, or IT hardware systems will not affect availability of the core application.
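To see why removing single points of failure matters, it can help to run the standard availability arithmetic: layers that are all required multiply their availabilities, while redundant components fail only when every copy fails. The sketch below assumes independent failures, and the per-layer availability figures are hypothetical, not measurements from any vendor or from this article.

```python
from math import prod

HOURS_PER_YEAR = 8760


def serial_availability(layers):
    """Every layer is required, so availabilities multiply."""
    return prod(layers)


def redundant_availability(components):
    """Any one component is enough; the group is down only when all are down.

    Assumes failures are independent, which shared power or shared
    network paths can violate.
    """
    return 1 - prod(1 - a for a in components)


def downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical layers: power, cooling, a single server, storage.
single = serial_availability([0.999, 0.999, 0.995, 0.999])

# Same stack, but with the server duplicated.
redundant_server = redundant_availability([0.995, 0.995])
improved = serial_availability([0.999, 0.999, redundant_server, 0.999])

print(f"single server:    {single:.4%}, {downtime_hours(single):.1f} h/year down")
print(f"redundant server: {improved:.4%}, {downtime_hours(improved):.1f} h/year down")
```

The same calculation applies whether the redundancy is built in-house or purchased from a provider; in the outsourced cases the provider's service level agreement supplies the availability figure.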
Functional
IT system functionality describes the ability of the system to fulfill its users’ needs. Although a system may be available in a technical sense, if it is not behaving as intended, it will not satisfy business needs. Vendor support agreements with guaranteed response times are an effective way to ensure system functionality. In addition, the IT team must be trained and ready to quickly engage multiple vendor resources when complex problems arise. A seamless support framework must also exist within the practice to connect users with the correct support teams. A centralized help desk that coordinates multiple support resources is essential.
Accessible
Accessibility defines whether authorized users can access a particular resource. Even if an IT system has 100% availability from an application perspective, if its users cannot reach the system due to a telecommunications failure at an individual practice location, the business need will not be served. Preventing accessibility issues is usually handled in two ways.
First, connectivity technologies should be chosen with reliability in mind. Fiber-based data circuits, for example, are generally more reliable than coaxial cable-based data circuits; enterprise-grade connections with a service level agreement provide higher uptime than consumer-grade connections.
Second, multiple circuit technologies can be combined to provide fault tolerance using a feature set known as SD-WAN. This provides a seamless failover experience if one circuit suffers an outage. It’s important to note, however, that although two different service providers may be available at a particular location, they may both share physical network assets, and may therefore be vulnerable to common failure modes.
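The failover idea and the shared-asset caveat can be illustrated with a short Python sketch. It does not represent how any particular SD-WAN product is configured or behaves; the Circuit class, the preference ordering, the physical_path labels, and the cellular backup circuit are hypothetical, chosen only to show the logic.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Circuit:
    name: str
    medium: str
    physical_path: str  # hypothetical label for the conduit or pole route
    preference: int     # lower number = preferred circuit
    healthy: bool = True


def select_active(circuits):
    """Return the most preferred healthy circuit, or None if all are down."""
    healthy = [c for c in circuits if c.healthy]
    return min(healthy, key=lambda c: c.preference) if healthy else None


def shared_path_pairs(circuits):
    """List circuit pairs that ride the same physical path (common failure mode)."""
    pairs = []
    for i, a in enumerate(circuits):
        for b in circuits[i + 1:]:
            if a.physical_path == b.physical_path:
                pairs.append((a.name, b.name))
    return pairs


# Illustrative site: two wired providers in the same conduit, plus cellular.
fiber = Circuit("Provider A fiber", "fiber", "conduit-1", preference=1)
coax = Circuit("Provider B coax", "coax", "conduit-1", preference=2)
cellular = Circuit("Cellular backup", "cellular", "wireless", preference=3)
site = [fiber, coax, cellular]

print("Active:", select_active(site).name)

# Simulate a fiber outage: traffic moves to the next healthy circuit.
after_outage = [replace(fiber, healthy=False), coax, cellular]
print("After fiber outage:", select_active(after_outage).name)

# Two providers, one conduit: a single cut could take out both.
print("Shared physical paths:", shared_path_pairs(site))
```

In this example the two wired circuits come from different providers but share a conduit, so only the cellular circuit offers truly independent protection.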
Business Process Support
Identifying and protecting mission-critical systems is an essential step toward developing fault tolerance in a private medical practice, but the business continuity journey doesn’t end there. Business processes must also be developed and maintained in order to adapt to other failures.
Although careful planning and system building can ensure that Priority 1 systems remain available, functional, and accessible, special business processes must be developed, documented, and tested on a regular basis to adapt to Priority 2 and Priority 3 failures. Using the list of system priorities developed earlier, identify systems that are not protected from failure through fault tolerance at the system level, and develop a continuity plan that copes with unavailability of the individual system. For example, power failure at an individual practice location requires a plan detailing how to temporarily relocate employees to another location, communicate the change to patients with scheduled appointments, report the power outage to building management, and so forth.
While developing these business continuity processes, be sure to utilize multiple physical locations to your advantage. For example, a group of spare workstations at a well-placed practice location could serve as a disaster recovery site for the patient scheduling center or billing group.
Finally, as business processes are developed, they must be effectively documented and communicated to relevant employees on a continuing basis. Continuity plans are only effective if they’re regularly updated to reflect changes in business practices and personnel.
Testing
Regardless of the rigor applied during the planning process, new considerations will always arise when a failure occurs in practice. Testing and simulation exercises allow us to account for this reality. Failure testing, of course, can be disruptive, so careful planning is again required to expose process flaws while minimizing operational disruption.
Failure testing of fault-tolerant P1 systems can be performed by the IT operations team during maintenance windows. Drills for failures that involve employee response and contingency plan activation can be scheduled during known low-patient-volume periods. Simulated failure procedures should be designed with a rapid rollback plan in place, so that normal operations can be resumed if a flaw in the documented recovery plan is exposed. Finally, testing activities should be repeated on a regular basis and the results documented for future reference.
Final Thoughts
The cost and effort to implement fault-tolerant systems can be significant. In many cases, these projects are prioritized only after a major failure causes a business interruption. Unfortunately, such interruptions carry indirect costs as well as the obvious direct ones: missed patient appointments, billing delays, information loss, and the like. The indirect costs, such as reputation damage and negative online reviews, can result in lost revenue for months or years to come. Evaluating the risks associated with critical systems leads to informed, strategic decision making. The approach shifts from reactive, fire-fighting, and costly to proactive, controlled, and calculated. The age-old adage rings true: “an ounce of prevention is worth a pound of cure.”
About the Author
Patrick Shoemaker is a Partner at PEAKE Technology Partners, a healthcare focused IT Managed Service Provider (MSP) based in the Baltimore/Washington region. Patrick holds an Electrical Engineering degree from Bucknell University and is the chief architect of the PEAKE healthcare private cloud environment.