Records365 is a cloud-based records management service built for Office 365 and OneDrive for Business. It provides features that seamlessly control and automate records management for these content sources. Records365 gives you full control of your information lifecycle in the cloud, ensuring compliance and transparency with no impact on your end users.
Proper handling of updates and change is always a challenge and even more so when running a SaaS offering such as Records365.
Careful thought must be given to each component of the solution, from the infrastructure to the logical architecture, and to each component’s individual strategy regarding updates and changes, as well as to an overall holistic approach to change in the system.
Resilience and support for change are accomplished by providing architectural options at each layer of the solution and by avoiding single points of failure or, more specifically, single points of update. The architecture of Records365 is inherently designed to handle updates and changes at multiple levels of the solution.
From the underlying physical architecture, to the logical mid-tier architecture, to our integration with dependent components such as Office 365, Records365 has been designed to be resilient and highly available under constant change.
This article discusses some architectural and operational aspects that support this resilience and availability.
Image 1: Records365 High-Level Architecture
The smallest fully redundant virtual farm in Records365 incorporates multiple servers, with at least two for each tier or role in the solution. User requests are automatically load-balanced so that the web servers are utilized more or less equally. Because each role is redundant, we can independently update each role by removing one component from rotation while it is being updated, and then performing the same operation for the other components in turn.
We also use technology improvements that are available in the SharePoint 2016 platform, such as zero-downtime patching, to minimize any impact when updates are geography-wide, such as when upgrading content databases.
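The rotation pattern described above can be illustrated with a minimal sketch. The `Server` class, the `rolling_update` function, and the server names are hypothetical and are not part of Records365; the sketch only shows that a role with at least two servers always keeps one server in rotation while the others are updated in turn.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # Hypothetical model of one server in a redundant role.
    name: str
    version: str
    in_rotation: bool = True


def rolling_update(servers, new_version):
    """Update one server at a time, keeping the rest in rotation.

    Returns the lowest number of servers that were in rotation
    at any point during the update.
    """
    if len(servers) < 2:
        raise ValueError("a redundant role needs at least two servers")
    min_in_rotation = len(servers)
    for server in servers:
        server.in_rotation = False      # remove from rotation
        serving = sum(s.in_rotation for s in servers)
        min_in_rotation = min(min_in_rotation, serving)
        server.version = new_version    # apply the update
        server.in_rotation = True       # return to rotation
    return min_in_rotation


web_tier = [Server("web-1", "1.0"), Server("web-2", "1.0")]
lowest = rolling_update(web_tier, "1.1")
print(lowest, [s.version for s in web_tier])
```

With two servers, the lowest in-rotation count is one, which is why the smallest farm needs at least two servers per role.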
Records365 Data Tier
The data tier is also made redundant and highly available in the solution using database mirroring. We enable mirroring across all existing content databases, the configuration database, the Central Administration content database, and the service application databases.
Mirroring is configured in high-safety (synchronous-commit) mode with an independent witness. This configuration allows individual server components to be updated independently as updates become available, by means of a rolling upgrade of those components.
The rolling upgrade is a multi-stage process that involves upgrading the individual Database Mirror System instance that is currently acting as the mirror server in a mirroring session, then manually failing over the mirrored database, upgrading the former principal instance, and resuming mirroring.
Image 2: Database Mirroring in Records365
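The order of the rolling upgrade steps described in this section can be sketched as follows. This is an in-memory illustration only: `Instance`, `MirroringSession`, and the instance names are hypothetical, and no real database calls are made.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical stand-in for one database server instance.
    name: str
    version: str


class MirroringSession:
    def __init__(self, principal, mirror):
        self.principal = principal
        self.mirror = mirror
        self.log = []

    def upgrade_mirror(self, version):
        # Only the instance currently acting as the mirror is upgraded.
        self.mirror.version = version
        self.log.append(f"upgraded {self.mirror.name} to {version}")

    def fail_over(self):
        # Manual failover: the principal and mirror swap roles.
        self.principal, self.mirror = self.mirror, self.principal
        self.log.append(f"failover -> principal is {self.principal.name}")


def rolling_upgrade(session, version):
    session.upgrade_mirror(version)           # 1. upgrade the current mirror
    session.fail_over()                       # 2. manually fail over
    session.upgrade_mirror(version)           # 3. upgrade the former principal
    session.log.append("mirroring resumed")   # 4. resume mirroring


session = MirroringSession(Instance("sql-a", "v1"), Instance("sql-b", "v1"))
rolling_upgrade(session, "v2")
for entry in session.log:
    print(entry)
```

At every step, the instance serving as principal is not the one being upgraded, which is the property the rolling upgrade relies on.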
Records365 Mid Tier
The Records365 mid-tier employs a queuing solution based on Azure SQL Database that also aids the availability of the solution under planned or unplanned maintenance scenarios. If the Records365 backend components become unavailable, the system can still accept and queue requests from Office 365. Once the backend components are back online, they process the backlog of requests, ensuring that no updates are lost.
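The accept-then-process behaviour described above can be sketched as follows. This is a simplified illustration: an in-memory `deque` stands in for the Azure SQL Database queue, and the `MidTierQueue` class and its method names are hypothetical.

```python
from collections import deque


class MidTierQueue:
    """Hypothetical sketch: requests are always accepted, and are
    processed only while the backend is online."""

    def __init__(self):
        self._pending = deque()
        self.processed = []
        self.backend_online = True

    def accept(self, request):
        # Requests are queued first, so nothing is lost if the
        # backend is down for maintenance.
        self._pending.append(request)
        if self.backend_online:
            self.drain()

    def drain(self):
        # Process the backlog in arrival order.
        while self._pending:
            self.processed.append(self._pending.popleft())

    def backlog(self):
        return len(self._pending)


queue = MidTierQueue()
queue.backend_online = False            # backend taken down for an update
for request in ["create", "update", "delete"]:
    queue.accept(request)
print(queue.backlog())                  # requests waiting in the queue

queue.backend_online = True             # backend returns
queue.drain()
print(queue.processed)
```

Because acceptance and processing are decoupled, the sender never needs to know whether the backend is currently available.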
We can also utilize load balancing at the mid-tier layer to support the maintenance of the web components while they are out of rotation.
Because Records365 is built to manage records in Office 365 and beyond, workloads such as SharePoint Online and OneDrive are potential areas where changes or updates can impact our solution. As these systems are outside our direct operational control, we take a deep partnership and information-sharing approach with Microsoft to minimize change risk in this area.
As part of this approach, RecordPoint engineering works extensively with First Release and earlier ‘canary’ tenancies for Office 365 so we can catch any issues early and work with Microsoft to resolve them. This practice, in tandem with our engineering partnership with Microsoft and the Office teams, ensures that we are aware of Office 365 changes and updates during our development cycle, well before those changes reach our customers' production workloads, which generally reside on the outer release rings.
This ensures that once an Office 365 change reaches the standard release rings our system is well aligned with that change. Recent examples of this were our collaboration on and early testing of the Modern Sites functionality in SharePoint Online, as well as our early support for Microsoft Teams and Office 365 Groups, prior to wide release of that functionality.
We also meet regularly with the Microsoft Graph and Office teams in a roundtable format to stay informed about forthcoming features, and we often assist by suggesting and testing new features.
Azure Update Domains, Fault Domains, and Availability Sets
For each workload that requires high availability, such as web, application and database workloads, the solution employs more than one virtual machine for each role and then includes them in an availability set. This ensures that workloads are available during local network reachability failures, local disk hardware failures, and any planned downtime that the platform may require.
Image 3: Availability Sets in Records365
The availability set also impacts the update domain for the virtual machine, with Azure ensuring that operating system updates will not be performed on machines offering the same workload at the same time.
For Azure-initiated updates, we use Update Domains extensively in our architecture. Update Domains provide high availability and update tolerance by ensuring that only the role instances in a single Update Domain are down for an update at any one time.
To support updates of individual components, to distribute load between the front-end web servers, and to provide high availability at the web tier, the Microsoft Azure Load Balancer for virtual machines is used in both external-facing and internal-facing configurations.
To handle web server failures, the load balancing service has been configured to probe the health of the server instances and to take unhealthy instances out of rotation.
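The effect of Update Domains can be illustrated with a short sketch. The round-robin placement, the function names, and the virtual machine names are assumptions made for illustration; the actual placement is decided by the Azure platform.

```python
from collections import defaultdict


def assign_update_domains(vm_names, domain_count):
    """Spread virtual machines across update domains.

    Round-robin placement is an assumption for this sketch.
    """
    domains = defaultdict(list)
    for index, name in enumerate(vm_names):
        domains[index % domain_count].append(name)
    return dict(domains)


def update_schedule(domains):
    """Yield (domain, vms_down, vms_up), one update domain at a time."""
    all_vms = [vm for vms in domains.values() for vm in vms]
    for domain in sorted(domains):
        down = domains[domain]
        up = [vm for vm in all_vms if vm not in down]
        yield domain, down, up


domains = assign_update_domains(["web-1", "web-2", "web-3"], domain_count=2)
for domain, down, up in update_schedule(domains):
    print(f"update domain {domain}: down={down} still serving={up}")
```

As long as a role has instances in at least two update domains, some instance of that role remains available at every step of the schedule.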
Image 4: Load balancing in Records365
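The health probing behaviour described above can be sketched as follows. The `BackendPool` class, the server names, and the threshold of two consecutive failed probes are hypothetical values chosen for illustration, not the configured Records365 settings.

```python
class BackendPool:
    """Hypothetical sketch of probe-based rotation management."""

    def __init__(self, servers, unhealthy_threshold=2):
        # Track consecutive failed probes per server.
        self.failures = {name: 0 for name in servers}
        self.unhealthy_threshold = unhealthy_threshold

    def record_probe(self, name, healthy):
        # A successful probe resets the counter; a failed probe increments it.
        self.failures[name] = 0 if healthy else self.failures[name] + 1

    def in_rotation(self):
        # Servers at or above the failure threshold receive no traffic.
        return [
            name
            for name, count in self.failures.items()
            if count < self.unhealthy_threshold
        ]


pool = BackendPool(["web-1", "web-2"])
pool.record_probe("web-1", healthy=False)
pool.record_probe("web-1", healthy=False)
print(pool.in_rotation())   # web-1 has failed two probes in a row

pool.record_probe("web-1", healthy=True)
print(pool.in_rotation())   # web-1 is healthy again
```

Requiring more than one consecutive failure before removing a server avoids taking an instance out of rotation because of a single transient error.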
Microsoft Azure Load Balancer is a Layer 4 load balancer. It distributes load among a set of available servers (virtual machines) by computing a hash function on the traffic received on a given input endpoint. The hash function is computed such that all the packets from the same connection (TCP or UDP) end up on the same server.
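Hash-based distribution of this kind can be sketched as follows. The choice of connection fields (source and destination address, source and destination port, and protocol), the use of SHA-256, and the `pick_server` function are assumptions for illustration; the hash function that Azure uses internally is not described here.

```python
import hashlib


def pick_server(servers, src_ip, src_port, dst_ip, dst_port, protocol):
    """Map a connection to one server using a hash of its fields.

    Every packet of the same connection carries the same fields,
    so it hashes to the same server.
    """
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}:{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["web-1", "web-2", "web-3"]
first = pick_server(servers, "203.0.113.7", 50123, "198.51.100.10", 443, "TCP")
second = pick_server(servers, "203.0.113.7", 50123, "198.51.100.10", 443, "TCP")
print(first == second)   # packets of one connection reach the same server
```

A new connection from the same client typically uses a different source port, so it may be mapped to a different server, which spreads load across the pool.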
The load balancer provides the fundamental mechanism by which a component of a role can be drained from active rotation and updated without impacting user or system requests.
System resilience and support for change are about providing options and flexibility at each layer of the solution, having close communication and partnership with those providing solution components or operating integrated systems, having a good operational understanding of the system, and avoiding single points of failure or, more specifically, single points of update.
No solution can be made impervious to change or updates, but with careful planning, the risks can be curtailed and managed and the benefits maximized.