Cloud Security: Sample DR for Hybrid Cloud Infrastructure

Introduction

Your company has an on-premise data center, as well as a few applications that run entirely in AWS. Recently, the CEO of your company heard of a company that was forced to go bankrupt due to the failure of a data center. She wants to ensure that your company is protected in the event of a loss of data center services, either the on-prem or cloud-based ones. Migration of the on-prem services to the cloud is not possible right now, so a multi-part solution will be needed. You have been tasked to generate a high-level design document on how to achieve this at the lowest possible cost.

Situation

Your company currently runs their website in AWS, as described in previous lab reports. In addition to the servers you have interacted with, assume there are several databases and other large storage servers in AWS to facilitate the voting application. In addition to these AWS services, your company has a number of VMs (running on VMware ESX) in a co-location data center near the company headquarters. These VMs run equipment, such as file servers for word/Excel documents, the email system, internal accounting systems, and other business-specific applications. While there are backups to a local tape drive, nothing else is done to protect the data.

Technical Write-Up

Prepare a report describing how you will protect both the on-prem and AWS data in the event of a disaster. You should attempt to minimize costs, as your company is not willing to spend a great deal of money on this initiative, as they feel the odds of needing it are small. Outline how you will protect the data, how a disaster would be handled, and how services will be restored in the event a disaster occurs. The audience of this paper will be the executive committee of the company, so they have some, but not a great deal, of IT knowledge.

---

Overview

The company currently hosts a number of services at a data center near its headquarters and hosts the voting application in AWS. The company is in the process of moving more services to the cloud, but needs a DR solution for both the current and future architectures.

Disaster Recovery is defined here as the ability to recover critical IT data and services after any event that can impact those services. This includes the loss of an entire data center and city-wide disasters.

No Recovery Time Objective (RTO, the maximum acceptable downtime) or Recovery Point Objective (RPO, the maximum acceptable loss of recent data) is defined at this point. However, because the voting application is publicly accessible, it is assumed to need a quick recovery, while business data and services are assumed to have more flexibility, especially on the RTO.

Purpose

This report describes how both on-prem and AWS data and services will be protected in the event of a disaster while minimizing costs, and how services will be restored after the disaster is over.

Main areas of concern

On AWS

A set of Linux and Windows servers, databases and application servers is hosted on AWS to provide an internationally accessible voting application, which may hold sensitive personal data.

“On Prem” data center near headquarters

The on-prem data center hosts accounting applications, file servers, an email system and other internal business applications. It is located near the corporate headquarters, which means that in some disaster scenarios both the headquarters and the data center might be affected at the same time. Recovery of the headquarters is covered by the Business Continuity Plan.

Migrated Applications onto AWS

In the future, the on-prem services may be moved onto AWS as well, at which point a cost-effective cloud strategy will be needed.

DR for Voting Application

It is recommended that at least two availability zones be used and that the EC2 instances be duplicated in the second zone. At the data layer, the databases should replicate between the availability zones. Later, if possible, Amazon RDS could replace the Oracle instances and be used for the data-level DR plan. The existing AWS availability zone should be designated the primary location, and the new availability zone should be designated for DR. In the DR zone, all servers except the database server should be shut down to reduce cost, and the database server should be set up as a read replica. This would also be the initial configuration after moving to Amazon RDS. In the event of a disaster, the DR servers will be turned on and the DR database server will be promoted to master. Route 53 can be used to ensure that, in a disaster, traffic is routed to the DR availability zone.
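The DNS part of this design can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the company's actual configuration: the host name, IP addresses and health check ID are placeholders, the helper names are hypothetical, and the code only builds the data structure without calling AWS. The dictionary is intended to follow the shape of the `ChangeBatch` argument of the Route 53 `ChangeResourceRecordSets` operation.

```python
from typing import Any, Dict, Optional


def failover_record(name: str, ip: str, role: str,
                    health_check_id: Optional[str] = None,
                    ttl: int = 60) -> Dict[str, Any]:
    """Build one failover routing record (hypothetical helper)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"voting-{role.lower()}",
        "Failover": role,
        "TTL": ttl,  # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        # The health check is what lets DNS notice that this site is down.
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(name: str, primary_ip: str, dr_ip: str,
                          primary_health_check_id: str) -> Dict[str, Any]:
    """Pair a health-checked primary record with a DR (secondary) record."""
    return {
        "Comment": "Failover routing for the voting application",
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(
                 name, primary_ip, "PRIMARY", primary_health_check_id)},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(name, dr_ip, "SECONDARY")},
        ],
    }


# Placeholder values only: documentation IP ranges and a made-up health check ID.
batch = failover_change_batch(
    "vote.example.com.", "203.0.113.10", "198.51.100.10", "hc-primary-placeholder")
```

With records of this kind, traffic would normally go to the primary zone and shift to the DR zone only when the primary health check fails, so no manual DNS change is needed at the moment of the disaster.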

On recovery, the process can be reversed. The original availability zone can be rebuilt with the same machine instances and database, but with this database configured as a read replica of the DR database. Once the data is fully synchronized, it can be promoted to master and Route 53 can be configured to route traffic back to this availability zone. The servers in the DR zone can then be turned off and the DR database server(s) reconfigured to act as read replicas.
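The key safeguard in this failback is that roles are swapped only after the rebuilt database has caught up. The sketch below models that rule in plain Python. The `Zone` class, the zone names and the lag threshold are hypothetical, and no AWS API is called; in practice the lag figure would come from the database's own replication metrics.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """One availability zone's part of the application (hypothetical model)."""
    name: str
    db_role: str          # "master" or "read-replica"
    servers_running: bool


def fail_back(primary: Zone, dr: Zone, replica_lag_seconds: float,
              max_lag_seconds: float = 0.0) -> None:
    """Return service to the primary zone once its replica is in sync."""
    if primary.db_role != "read-replica":
        raise ValueError("rebuild the primary database as a read replica first")
    if replica_lag_seconds > max_lag_seconds:
        # Promoting now would lose the writes that have not been copied yet.
        raise RuntimeError("primary replica is still behind; do not promote yet")
    # Swap roles: the primary takes over, the DR zone returns to standby.
    primary.db_role = "master"
    primary.servers_running = True
    dr.db_role = "read-replica"
    dr.servers_running = False


primary = Zone("zone-a", "read-replica", False)
dr = Zone("zone-b", "master", True)
fail_back(primary, dr, replica_lag_seconds=0.0)
```

Calling `fail_back` while the replica is still behind raises an error and leaves both zones unchanged, which mirrors the requirement to wait until the data is synchronized.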

If the RPO is large enough, a backup-and-restore strategy can be used instead to reduce cost further. A scheduled backup of the database servers can be taken daily and stored outside of the primary availability zone. The DR zone, with its servers preconfigured but turned off, would still be needed for this solution, but the data would be restored from this backup rather than already being present in a read-replica database.
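The trade-off between the two data strategies comes down to simple arithmetic: with scheduled backups, a failure just before the next backup loses up to one full backup interval of data. The following Python sketch makes that comparison explicit. The function names and strategy labels are hypothetical, and the time needed to take or transfer a backup is ignored for simplicity.

```python
def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    """A failure just before the next backup loses one full interval."""
    return backup_interval_hours


def choose_data_strategy(rpo_hours: float,
                         backup_interval_hours: float = 24.0) -> str:
    """Pick the cheaper backup option only when it still meets the RPO."""
    if rpo_hours >= worst_case_data_loss_hours(backup_interval_hours):
        return "scheduled-backup"
    return "read-replica"


# With daily backups, an RPO of two days is met; an RPO of one hour is not.
relaxed = choose_data_strategy(rpo_hours=48)
strict = choose_data_strategy(rpo_hours=1)
```

Once the business agrees an RPO for each system, the same check can be applied per system, so that only the systems with a tight RPO pay for continuous replication.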

https://aws.amazon.com/blogs/database/implementing-a-disaster-recovery-strategy-with-amazon-rds/

DR strategy for on-prem

The current architecture has the following data and services hosted within the on-prem data center. It is expected that proper backup and offsite storage plans already exist within the data center for these services.

  • File Server

  • Email System

  • Accounting Applications

  • Internal Business Applications

However, in a data center disaster scenario, recovery of the data, even from offsite storage, would still be problematic, since there may be no hardware available to run the recovered systems. Cloud virtual tape libraries might therefore provide some assistance, but without the ability to run the original VMs they are not a complete solution.

For this reason, the Amazon EC2 VM Import Connector should be configured to import the on-prem servers into EC2. The imported instances should then be kept turned off until a disaster occurs and they are needed. This will save costs over continually replicating the VMs into AWS. Cloud virtual tape libraries can then be used to store the backups in AWS and to retrieve them to bring the VMs up to date in the case of a disaster.
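The cost argument for keeping the imported servers turned off can be illustrated with a small estimate. This sketch assumes that a stopped instance accrues no compute charge while its attached storage is still billed. All figures are placeholders chosen for easy arithmetic and are not real AWS prices, and the function name is hypothetical.

```python
def monthly_cost(instance_count: int,
                 compute_rate_per_hour: float,
                 storage_gb_per_instance: float,
                 storage_rate_per_gb_month: float,
                 hours_running: float) -> float:
    """Estimate one month's cost for a set of standby servers."""
    # Compute is billed only for the hours the instances are running.
    compute = instance_count * compute_rate_per_hour * hours_running
    # Disk storage is billed whether the instances run or not.
    storage = instance_count * storage_gb_per_instance * storage_rate_per_gb_month
    return compute + storage


# Placeholder figures for illustration only; these are NOT real AWS prices.
HOURS_PER_MONTH = 730
always_on = monthly_cost(10, 0.10, 100, 0.10, HOURS_PER_MONTH)
stopped = monthly_cost(10, 0.10, 100, 0.10, 0)
print(f"always on: {always_on:.2f}  stopped: {stopped:.2f}")
```

Substituting the company's real server count, disk sizes and current prices would give the executive committee a concrete figure for the standby site.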

The rollback from DR in this scenario would involve ensuring that the on-prem hardware is reconfigured, retrieving the backup data to rebuild the on-prem servers, and then shutting down the DR site. However, it may be more efficient overall to treat this as a migration to the cloud: the DR instances become the new production instances, and a new DR site is set up in a different availability zone. See “Future…” below.

Future (On AWS) - Migrated Applications

Once all of the EC2 instances are on AWS and working, this essentially represents a migration of the services to the cloud. At this point a decision can be made to use these instances rather than the on-prem instances as the main services and turn down the on-prem data center.

At this point, strategies similar to those for the voting application can be applied. A new set of machine instances should be set up in a different availability zone and kept shut down, with any database servers set up as read replicas, until a disaster occurs. Similarly, recovery from DR can be handled in reverse by standing up the production servers in the production availability zone, initially shut down and with the database servers in read-replica mode. Once the data is replicated, these databases can be promoted to masters and the DR databases reconfigured as read replicas; the production servers can then be turned on and the DR servers shut down.

Route 53 can be configured to route traffic to the production or DR site as appropriate, based on which should currently be the master.

Regarding the file server: as it is moved to the cloud, it should be replaced with an S3-based solution rather than a hosted file server, to leverage S3's replication capabilities, and a storage gateway can be placed on-prem to access the file storage. This would relieve some of the file-level replication requirements for DR scenarios. Users would be able to access a file share just as they do currently, but files would be replicated to other locations transparently, and less infrastructure would be needed, even in the form of virtual machines.
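If the replication in question is S3 replication to a second bucket (an assumption; the report does not say which form is intended), the rule could look roughly like the sketch below. The role ARN, bucket name and rule ID are placeholders, the helper name is hypothetical, and the code only builds a dictionary intended to match the replication configuration accepted by the S3 `PutBucketReplication` operation. Versioning must be enabled on both buckets for replication to work.

```python
from typing import Any, Dict


def replication_configuration(role_arn: str,
                              destination_bucket: str) -> Dict[str, Any]:
    """Build a one-rule configuration that replicates every object."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy the objects
        "Rules": [
            {
                "ID": "replicate-file-share",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": f"arn:aws:s3:::{destination_bucket}",
                },
            }
        ],
    }


# Placeholder account number, role and bucket names.
config = replication_configuration(
    "arn:aws:iam::123456789012:role/replication-role-placeholder",
    "company-files-dr-placeholder",
)
```

Leaving delete-marker replication disabled in this sketch means a file deleted from the main bucket would still exist in the copy, which may be useful for recovery from accidental deletion.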

References

(n.d.). Retrieved from https://docs.aws.amazon.com/rds/?id=docs_gateway

Deekonda, A. (2019, April 17). Implementing a disaster recovery strategy with Amazon RDS: Amazon Web Services. Retrieved from https://aws.amazon.com/blogs/database/implementing-a-disaster-recovery-strategy-with-amazon-rds/

Creating an SMB File Share. (n.d.). Retrieved from https://docs.aws.amazon.com/storagegateway/latest/userguide/CreatingAnSMBFileShare.html

Best Practices for Amazon EC2. (n.d.). Retrieved from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-best-practices.html

(n.d.). Retrieved from http://docs.media.bitpipe.com/io_13x/io_136616/item_1510269/CloudBerry_Lab_Whitepaper_A_Complete_Guide_for_Backup_and_DR_on_AWS.pdf
