Migration From On-Premise Kubernetes to AWS EKS
This article discusses the migration framework WWT uses to migrate existing on-premise containerized services into the Amazon Web Services (AWS) cloud, specifically covering Amazon Elastic Kubernetes Service (EKS) as a migration target.
Migrating from on-premise Kubernetes to Amazon Elastic Kubernetes Service (EKS): A process and a method
In my time as a Technical Solutions Architect, I've completed many migrations, each with its own unique challenges. Here are a few key aspects of the migration methodology that we use today.
Differences between on-premise Kubernetes and EKS
On-premise Kubernetes provides much greater control and management, but at a significant cost. It also integrates into your existing processes faster and with fewer challenges. From my experience, when a process is completely on-premise, it is rarely re-evaluated until it breaks or a new license is required to maintain it.
However, most data centers do not meet the availability standards and geographic distribution that AWS has achieved. On-premise upgrade processes can also be arduous, given the volume of releases and the speed at which enterprises are expected to perform upgrades. Teams quickly fall behind on feature parity with newer releases. It is common for enterprises to dedicate whole teams to on-premise upgrade processes; this gets very expensive, very fast.
The Kubernetes master controllers in EKS today are managed by AWS in a VPC, and the logs for those masters are not readily accessible; to retrieve them, you must open a ticket with AWS. Usually this is not much of an issue, but it can be. Normally, your focus is on the workers and how they do their work.
Just as on-premise clusters are not re-evaluated until they break or new licenses are required, having AWS host the cluster means more attention can be spent ensuring every aspect remains documented and evaluated. This leaves fewer cracks in the design and thus a stronger holistic solution.
The AWS shared responsibility security model and the cluster availability SLA are awesome aspects of AWS-hosted EKS. EKS also has Windows node support, an extremely desirable feature.
EKS is a Certified Kubernetes conformant offering, which helps with compliance needs, and many other certifications are pre-built into the service. In addition, EKS makes it easy to test different cluster versions: one cluster can run version ‘A’ and another cluster version ‘B’, easily facilitating workload testing. This is very easy to do, at the click of a button if need be.
I hope this information makes deciding to migrate to EKS easier. There are many good reasons to migrate in addition to the ones listed above, such as overhead reduction and simplifying version moves.
Overview of process
You will need to get the business story of who, why and what for the migration.
- Who: Who is the business owner of the workload?
- Why: This comes from the business success criteria, and it can be as simple as "to reduce data center costs, staff costs, etc."
- What: Focus on one workload to migrate first; this helps shake out the details of what is needed to be successful so you know what to expect with other workloads.
Once you understand these, the next step is to dive into the on-premise environment’s details. This includes network hardware, software, ports and container-specific details. It’s extremely important to have a performance baseline. It helps to validate the success of the migration, gives areas of focus if there are any unexpected results and allows for a complete understanding of the environment before you migrate.
Scope out the lifecycle of the workload to migrate and make sure you understand the CI/CD deployment process and schedule for updates and release candidates. Some workloads are very agile, so it’s important to be prepared and continue rolling out updates and executing testing as fast as possible. Once all the migration foundation has been built, the final task is to migrate.
Below is an outline to follow when contemplating the migration process.
- Who, Why, What
- Success Criteria
- Details On-Premise
- Ports – Firewalls – Routing
- Kubernetes Specific
- ACLs – Secrets
- Certs / Keys
- Now we can migrate!
To begin with, I’ll make many assumptions. After reading this article, please feel free to ask for more information on any step I outline here.
The assumptions are as follows:
- You have a high familiarity with Kubernetes.
- There is a defined namespace for the project.
- Your workload has three parts: the workload (LOGIC), the database (DB) and the presentation (UI). Each part is a separate service that depends on the others to work.
- LOGIC — Exposed Port 8081, Persistent Volumes (PV), Persistent Volume Claims (PVC) Service
- DB — Exposed Port 3306, PV, PVC, Service, mysql 5.6, Secrets
- UI — Exposed Port 443, PV, PVC, Service
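As a concrete reference, the DB part above might look roughly like the following on-premise manifests. This is only a sketch: the namespace, names, storage size and secret key are illustrative assumptions, not taken from a real environment.

```yaml
# Hypothetical manifests for the DB part (service, claim, deployment).
apiVersion: v1
kind: Service
metadata:
  name: db
  namespace: myapp            # assumed project namespace
spec:
  selector:
    app: db
  ports:
    - port: 3306              # exposed DB port from the assumptions
      targetPort: 3306
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
  namespace: myapp
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi           # placeholder size
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db
  namespace: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: mysql
          image: mysql:5.6
          ports:
            - containerPort: 3306
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secrets    # assumed secret name
                  key: root-password  # assumed key
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: db-data
```

The LOGIC and UI parts would follow the same shape with their own ports (8081 and 443) and claims.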
Who, why, what
The business owner of the workload will have an intimate understanding of the workload in question. They hold all the keys; a relationship with them is critical to finishing this portion of the process. Most of the time they will not know the business success criteria, and they will have a separate set of criteria of their own that they want accomplished beyond the migration.
For example, a business owner says, “we do not have Splunk working with this workload. We need this to be working in the cloud.” This is a specific success criterion, where the broader success criteria might be: “migrate these workloads to the cloud.”
Next we need the business to pick their simplest workload, or “the lowest-hanging fruit.” This simple workload will be moved as the phase-one approach to the migration and will validate the new environment’s design and operational readiness.
- Secondary success criteria
- Business success criteria
- List of workloads and a direction to the focus for phase one
- Workload owner and details they can provide
Now that we have chosen the focus workload for phase one, we can build a plan from here. We need to look at both the physical and virtual environments. Step one is to gather existing container details: HD, MEM, CPU, PORTS and STORAGE. Next are the ACLs, firewalls, routing, DNS and all things networking.
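The container details gathered in step one map directly onto Kubernetes resource requests and limits on each container. A minimal sketch, where every value is an illustrative placeholder derived from observed usage, not a recommendation:

```yaml
# Fragment of a container spec: map measured usage to requests/limits.
resources:
  requests:
    cpu: "500m"       # observed average CPU for the container
    memory: "512Mi"   # observed average memory
  limits:
    cpu: "1"          # observed peak CPU
    memory: "1Gi"     # observed peak memory
```

Recording these numbers per container during discovery makes sizing the EKS worker nodes later much easier.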
After capturing the workload’s details, we move into analyzing the current Kubernetes cluster in use. A good way to begin is by looking at what version of Kubernetes is running on-premise and comparing it to the versions EKS can run in the cloud today. I say today because the supported versions change quite quickly.
Connectivity & storage
After the EKS version is decided, the next step is to evaluate the type of networking you are using in Kubernetes, making note of the networking modules you are running. It’s important to reflect this in EKS, if possible. During this discovery phase, note any tools used to build your workload layer, such as Helm or Helm-like solutions that can deploy to Kubernetes. Having these tools in mind when you migrate is key. There are different approaches to deploying workloads to the cloud; sometimes the same method used on-premise works, but most times it does not.
On-premise storage is far different from cloud storage, and if you want to keep using on-premise storage, this presents a different problem. My advice is to move to cloud storage and test storage methods in the cloud. Only if needed, try hybrid solutions that include on-premise storage.
In my experience, most hybrid solutions cost too much to be considered, and the speed is not comparable. Remember that data in transit needs to be encrypted, especially going from cloud to on-premise storage.
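In EKS, moving to cloud storage typically means EBS-backed volumes provisioned through a StorageClass. A sketch, assuming the in-tree EBS provisioner; the class name and parameters are illustrative choices:

```yaml
# Hypothetical StorageClass for EBS-backed PersistentVolumes in EKS.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp2
provisioner: kubernetes.io/aws-ebs   # in-tree AWS EBS provisioner
parameters:
  type: gp2                          # general-purpose SSD volume type
  encrypted: "true"                  # encrypt data at rest on the volume
reclaimPolicy: Delete
```

Existing PVCs can then reference `storageClassName: ebs-gp2` instead of whatever on-premise class they used, which is often the main manifest change the storage step requires.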
Now we need to know what the workload is talking to. In most cases, there is a large database that the workload relies on. Sometimes, the database is so large that moving it to the cloud appears daunting. It’s important to know that database solutions in the cloud offer capabilities, and a speed of adopting them, that on-premise solutions often cannot match.
Most companies must have visibility into their workloads and leverage application logging and monitoring tools. Sometimes this is an afterthought, which makes troubleshooting application performance difficult. When migrating from on-premise to cloud, be open to trying new solutions. There are many cloud native monitoring options; choosing one is a balance of comparing requirements vs. comfort with change.
Another important thing to consider when migrating to EKS on AWS is the ACLs and member roles that exist in the Kubernetes environment, as they must be reflected in EKS as well. It is important to define the Default Admin and Cluster Owner. Depending on how the cluster is managed, whether to set up the worker nodes for SSH access (or not) is a decision that is often not captured in the current project’s requirements.
It is best practice to limit SSH access to the nodes. For example, you might allow node group ‘A’ to talk only internally to the cluster while node group ‘B’ has assumed access to talk to other objects in AWS. That is advanced Kubernetes and cannot be described completely here. Keep that in mind, as we are discussing migration in this article and not re-developing the app in EKS with security and current best practices applied.
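In EKS, mapping IAM principals to Kubernetes users and groups is done through the aws-auth ConfigMap in the kube-system namespace, which is where the Default Admin and Cluster Owner decisions land. A sketch with placeholder account IDs and ARNs:

```yaml
# Hypothetical aws-auth ConfigMap granting node and admin access.
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/eks-node-role   # placeholder node instance role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
  mapUsers: |
    - userarn: arn:aws:iam::111122223333:user/cluster-owner   # placeholder cluster owner
      username: cluster-owner
      groups:
        - system:masters
```

On-premise RBAC roles and bindings can usually be carried over as-is; it is this IAM-to-Kubernetes mapping layer that is new in EKS.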
Ask how the current Kubernetes cluster stores secrets. If secrets are stored separately, by sending them to a key store or any other method, it is critical to gather all the information on how that is designed today and carry it over to EKS. Remember, the process can always be changed in the future. Successful secrets management is a linchpin of most workloads.
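At minimum, the workload’s secrets need an equivalent in the target cluster. A plain Kubernetes Secret sketch, with an assumed name and namespace and a base64-encoded placeholder value:

```yaml
# Hypothetical Secret backing the DB part's root password.
apiVersion: v1
kind: Secret
metadata:
  name: db-secrets
  namespace: myapp                   # assumed project namespace
type: Opaque
data:
  root-password: cGxhY2Vob2xkZXI=    # base64 of "placeholder"
```

If the on-premise design uses an external key store instead, the equivalent step is wiring that store to EKS rather than creating plain Secrets like this one.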
Config & DNS
When designing the AWS EKS environment, it’s important to understand the current config and see what needs to be adapted to the EKS environment. For example, a design might place etcd on separate VMs for speed. EKS does not allow for this, and it does not allow you to control the number and spread of the masters, but you can set labels to help with location awareness and function.
If there is a team that manages the company’s DNS, they may have a role in the on-premise Kubernetes DNS functionality. I highly recommend you get a strong understanding of their control and see if they can allow EKS to manage the Kubernetes DNS and apply DNS through the ingress controller, for more control over the DNS needs of the workload.
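One common pattern for driving DNS from the cluster is to let an ingress resource carry the hostname and have a controller such as the external-dns project create the records. A sketch for the UI part, where the hostname, namespace and service names are placeholder assumptions:

```yaml
# Hypothetical ingress for the UI part, with DNS driven by external-dns.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ui
  namespace: myapp
  annotations:
    # assumed: an external-dns deployment watches this and creates the record
    external-dns.alpha.kubernetes.io/hostname: ui.example.com
spec:
  rules:
    - host: ui.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ui
                port:
                  number: 443
```

This keeps the workload’s external DNS under cluster control while the corporate DNS team only needs to delegate the relevant zone.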
There is a need to manage access to the environment, and this is done with RSA keys. When the system is built, it generates all the needed keys to talk to each service by default. Generally I make backups of these keys and store them in a separate location. I have never had occasion to access them after the first build, except when a key is manually added to a workload and is not managed by AWS.
Finally, consider where the customers who consume the workload reside. Understanding the geographical distribution of your customer base is important when deciding where to deploy EKS. If you wish to deploy in Canada, for example, (at the time of this article’s creation) that region does not allow an EKS cluster to be built. This may change in the future; with this knowledge, you can better serve the business by sharing the operational restrictions AWS EKS clusters have.
When you have determined the location of your customers, locate an AWS region close to that area, or consider setting up methods to bring customers onto the AWS backbone via points of presence, such as CloudFront.
- Worker CPU, MEM, QTY
- Ports – Firewalls – Routing
- Kubernetes Specific
- YAML: Workload specific (or Helm Charts)
- Storage: Speed and types used
- Connectivity: External and internal
- Logging/Monitoring: Mechanism used
- ACLs – Secrets: Who has access and how to manage secrets
- CONFIG: Overall design considerations
- DNS: Internal and external addressing
- Certs / Keys: AWS provided or created
- Geolocation: Region
We must know how the developers cycle through the workloads. Some companies are so large that each workload could be its own business, and thus each operates in its own manner. Ask about the SDLC for each workload. At a minimum, it’s important to understand the moratorium and code freeze dates, release dates, process, sprint durations and development methodologies (scrum, agile, waterfall, etc.). In addition to the development cycle workflow, understanding maintenance windows is a must as well.
In the lifecycle stage, we learn how the workload is developed: specifically, the code language, the environments used and how code testing is executed. Learn what the criteria are for a successful code promotion to production. Research the change request/emergency change request process and how long requests normally take to process. If scheduling around the change request process is overlooked, the project may fail.
Understand how the developers track defects and how source code is stored. This information may be needed if a developer is not familiar with the entire process. It’s useful to ask what third-party systems are used to facilitate the development processes. It is common for a ticketing system or board to be used for tracking features/defects, with the development team working from that. It is not critical that you know how to code Python or Java, but knowing where the code lives is key to a successful migration.
- Development process
- SDLC stage of workload
- Environments used
- Criteria for production deployments
- CR/ECR process and timing
- Source code location and language
The previous step is similar to this one; developers are normally the gatekeepers of the CI/CD process. The CI/CD pipeline may be as simple as Jenkins and GitHub, or it may be a newer third-party tool. Understanding the process, and all the systems that are leveraged and reported to, will help determine how best to adapt the pipeline and continue the migration.
When choosing a workload for the first run, it is best to pick one that has a piece of everything I've outlined here. This is the one step you may have little or no control over. What normally happens is that you migrate to EKS and the CI/CD process becomes an afterthought. Let’s not do that. Seek to understand this process and get that team involved in the migration.
What I mean is that workload selection is normally political at this juncture. You can find a workload that fits all the criteria I have outlined, thus allowing you to build a process that works for your company. But from experience, if an executive wants their workload migrated first, for no other reason than "I'm excited and want to see it done for my application first," then that is most likely what will occur. If you can steer the process to not let the tail wag the dog, bonus!
- Defined CI/CD process
- Contacts to assist in targeting EKS
Now we can migrate
Let’s get started with the migration. First, we build out a cluster to land the workload on. We use all previous information to build this out. Below are the generic items for the EKS cluster build.
To begin, build EKS for the one workload:
- Member Roles
- Default Admin
- Cluster Owner
- Labels and Annotations
- Account Access
- Access Key
- Secret Key
- Session Token
- Cluster Options
- Kubernetes Version
- Service role
- Fargate (NEW)
- Labels to Match
- VPC and Subnet
- Private/Public IP for Worker Nodes
- Generic VPC and Subnet / Custom
- Node Options
- Instance type
- Custom AMIs
- Desired ASG Size
- Node Volume Size
- SSH Key
- Pass user-data to the nodes to perform automated configuration tasks after worker launch
- Managed Node Groups
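Most of the checklist above can be captured declaratively. For example, the eksctl tool accepts a ClusterConfig file covering these options; a sketch in which every value is a placeholder to be replaced with the results of the earlier discovery steps:

```yaml
# Hypothetical eksctl ClusterConfig reflecting the build checklist.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: migration-cluster     # placeholder cluster name
  region: us-east-1           # region chosen in the geolocation step
  version: "1.14"             # Kubernetes version matched to on-premise
vpc:
  nat:
    gateway: Single           # or omit the vpc block for a generic VPC/subnets
managedNodeGroups:
  - name: workers
    instanceType: m5.large    # sized from on-premise worker CPU/MEM
    desiredCapacity: 3        # desired ASG size
    volumeSize: 50            # node volume size in GiB
    labels:
      workload: logic         # labels to match scheduling needs
    ssh:
      allow: false            # best practice: keep SSH access off
```

Keeping this file in source control also gives you a repeatable way to stand up the ‘version A’/‘version B’ test clusters mentioned earlier.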
*Note: Nothing is final until you look at the EKS build logs and confirm everything is working as expected.
Configuring Kubernetes for the migration
OK, we have made it this far... can we please start the migration, Dan? The answer is not quite yet, because we still need to configure Kubernetes for the workload. The following is a checklist of items to think about when configuring Kubernetes. They need to be built and decisions made before workloads can be deployed:
- Namespace: Normally the workload official name for the business;
- Node labels: Help with assigning work to a specific worker node;
- Ingress: Port mapping of the internal workload, internal to Kubernetes being shared with the outside of the cluster;
- Health checks: Allow for the workload to let the cluster know if the workload microservice is healthy, like an HTTP 200 response;
- Ready checks: Response to the cluster that the microservice is ready for use, like an OK reply to curl command;
- Storage: Determined based on previously accumulated information, speed, backup, size, etc.;
- Dashboard: Visibility into cluster;
- Logging: Determined based on previous knowledge of how to create, collect, store and distribute logs; and
- Secrets: Determine how the use of secrets, keys, encryption, etc., will be handled.
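Several of these checklist items show up directly in the workload manifest. A sketch for the LOGIC part showing the namespace, node labels, and health and ready checks; the image, endpoint paths and label values are assumptions for illustration:

```yaml
# Hypothetical deployment for the LOGIC part with checklist items applied.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logic
  namespace: myapp              # the workload's official namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: logic
  template:
    metadata:
      labels:
        app: logic
    spec:
      nodeSelector:
        workload: logic         # assumed node label assigning work to specific nodes
      containers:
        - name: logic
          image: registry.example.com/logic:1.0.0   # placeholder image
          ports:
            - containerPort: 8081
          livenessProbe:        # "health check": an HTTP 200 means healthy
            httpGet:
              path: /healthz    # assumed health endpoint
              port: 8081
          readinessProbe:       # "ready check": an OK reply means ready for traffic
            httpGet:
              path: /ready      # assumed readiness endpoint
              port: 8081
```

Ingress, storage, logging and secrets attach to this same manifest through the resources sketched in the earlier sections.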
Now deploy to EKS
There are a lot of things to consider before deployment starts. It’s common to deploy, test, change and deploy, test, change until finding a working balance. The workload will be viable, but the connection of all other items related to the workload can cause sprawl. Stay focused and make sure to stick with the business success criteria as a guide.
After the first workload is running in EKS successfully, rinse and repeat for the next workloads. At this point, all the required information one would need to start projecting timelines should be readily available for more migrations. It’s possible to use the data that was collected up to this point to define a path for all the future workloads.
Create a pattern from all the parts we have covered, considering all decisions made about size, storage and cloud access. This process can then be used to define plans for improving workload development. Migration projects expand understanding and lead to optimizing the environment(s).
How does one manage this? How can you use industry-standard CI/CD without spending millions on a platform? Who do I reach out to for more help? The answer is simple: please contact me or the larger WWT team if you have interest in this topic or any other IT-related questions.