
Custom Software Serverless Deployment and DevOps

Timothy Wong | September 30, 2020


DevOps, Serverless



Company Name: CallEvo
Case Study Title: Custom Software Serverless Deployment and DevOps
Vertical:
Company Info:
Problem / Statement Definition

CallEvo was looking to deploy its custom software and set up DevOps around it. The application was initially developed to run within a monolithic framework. Triumph Tech was brought in to help “decouple” the architecture and speed up application delivery.

Proposed Solution and Architecture

Triumph chose AWS to provide cost-efficient resources to run CallEvo’s custom software and to speed the delivery of application updates.

  • We used a Lambda function to run the backend “API layer,” processing requests in response to events from the front end.
  • We used S3 and CloudFront to serve the front end’s static assets.
  • We used Aurora PostgreSQL Serverless as our data layer.
  • We used API Gateway to send and receive dynamic content from the Lambda backend.
  • We used ElastiCache for Redis to provide a caching layer for requests made to the backend.
  • Auth0 was used to register and authenticate users of the application.
  • CodePipeline, CodeBuild, and CloudFormation were used to rapidly deploy updates to the application.
  • Systems Manager Parameter Store was used to securely store parameters required by the application.
  • We used SQS to increase application performance and efficiently process transactions sent and received by the software.
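As a minimal sketch of how the Lambda “API layer” sits behind API Gateway, a handler receives a proxy-integration event and returns a response object. The `/health` route and payload below are hypothetical placeholders, not part of CallEvo’s actual API:

```python
import json

def handler(event, context):
    """Handle an API Gateway Lambda proxy integration event.

    The /health route and response body are illustrative placeholders,
    not CallEvo's real API surface.
    """
    path = event.get("path", "/")
    method = event.get("httpMethod", "GET")

    if method == "GET" and path == "/health":
        status, body = 200, {"status": "ok"}
    else:
        status, body = 404, {"error": f"no route for {method} {path}"}

    # API Gateway expects statusCode / headers / body, with body as a string
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```

API Gateway forwards each request as an event in this shape and renders the returned dict as the HTTP response, which is what lets the backend run with no servers to manage.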
Outcomes of Project & Success Metrics

During the initial discovery phase, we found that the development team had no procedure for rapidly deploying changes to their application, nor an efficient way of running it.

Project metrics were determined post-buildout by executing load tests and measuring how quickly application changes could be deployed. We needed to know whether the environment could handle at least 100 requests per second at under 2,000 ms per request with a low error rate. Additionally, we needed to measure the time required to deploy code changes to the environment.
Locust was used to run the load test, simulating common user behaviors that trigger read/write operations on the data layer. We simulated 1,000 users at a request rate of 100 RPS. The test was a success: the application responded in under 500 ms with a 0% error rate.
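The pass/fail check applied to those load-test results can be sketched as a small helper. The 1% error budget below is an assumed reading of “low error rate”; the latency ceiling mirrors the 2,000 ms requirement:

```python
def meets_targets(latencies_ms, errors, total_requests,
                  max_latency_ms=2000, max_error_rate=0.01):
    """Evaluate load-test results against the success criteria.

    latencies_ms: per-request response times in milliseconds.
    The 1% error budget is an assumed threshold, not a number
    taken from the engagement.
    """
    if not latencies_ms or total_requests <= 0:
        return False
    error_rate = errors / total_requests
    return max(latencies_ms) < max_latency_ms and error_rate <= max_error_rate

# Observed outcome: all responses under 500 ms with 0% errors -> targets met
assert meets_targets([420, 310, 495], errors=0, total_requests=3)
```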

To test our ability to push code changes quickly, we measured the time required for CodePipeline to execute all tasks to completion. The total time was under 10 minutes and was deemed a success. We compared this with the time required to manually push changes to the environment, which was roughly 1 hour and 20 minutes before changes were live.

Describe TCO Analysis Performed

TCO was calculated based on the amount of time to manually push code to the environment and the resources required to do so.

Lessons Learned

Serverless functions are a scalable and cost-effective way of running production workloads.
Automated deployment solutions greatly reduce the time required to deploy code to production.

Summary Of Customer Environment

The environment is cloud-native. The entire stack runs on Amazon Web Services in the us-west-2 region.

  • Root User is secured and MFA is required. IAM password policy is enforced.
  • Operations, Billing, and Security contact email addresses are set and all account contact information, including the root user email address, is set to a corporate email address or phone number.
  • AWS CloudTrail is enabled in all regions and logs are stored in S3.
Operational Excellence
Metric Definitions

CodePipeline Health Metrics
If any step within the pipeline fails, a notification is sent to the DevOps Slack channel. This is achieved via an SNS topic and the AWS Chatbot integration with Slack.

Lambda Health Metrics

Lambda health is determined by the success or failure of the Lambda function. The most important metrics are error count and success rate (%).

Lambda health is further assessed using X-Ray application tracing to effectively trace and debug requests made to the Lambda function. Application exceptions (errors) are viewed within X-Ray and inform performance improvements to the application.

Metric Collection and Analytics

We consult clients on best practices for log and metric collection. For application-related logs we prefer an ELK stack, which takes advantage of Amazon Elasticsearch Service, Logstash running on EC2, and Kibana. This allows complete security and granular control over log collection and visualization.

To automate alerting on unhealthy targets of an Application, Network, or Classic ELB, we consult our clients on the use of CloudWatch alarms, SNS alarm notifications, and AWS Lambda. The Lambda function makes a DescribeLoadBalancers or DescribeTargetHealth API call to get the identity of the failed target and the cause of the failure, then triggers an email notification via SNS with the unhealthy host details.
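The core of such an alerting function can be sketched as below. The ELBv2 client is injected so the logic can be exercised with a stub; `describe_target_health` is the boto3 call that returns each target’s state and failure reason (the SNS publish step is omitted here):

```python
def find_unhealthy_targets(elbv2_client, target_group_arn):
    """Return (target_id, reason) for each non-healthy target.

    elbv2_client is a boto3 ELBv2 client, injected so the logic can be
    tested without AWS access; describe_target_health reports each
    target's state and, for failed targets, a reason code.
    """
    resp = elbv2_client.describe_target_health(TargetGroupArn=target_group_arn)
    return [
        (d["Target"]["Id"], d["TargetHealth"].get("Reason", "unknown"))
        for d in resp["TargetHealthDescriptions"]
        if d["TargetHealth"]["State"] != "healthy"
    ]
```

In production, the returned pairs would be formatted into an SNS message so the email notification names the failed host and the cause.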

We recommend Grafana running on EC2 with Prometheus for monitoring individual workloads within a stack. EC2, RDS, container, EKS, and ECS metrics are collected by Prometheus and visualized via dashboards in Grafana.

Operational Enablement

Enabling the client to manage and maintain the DevOps pipeline after handover is of the utmost importance. We aim to minimize the required maintenance through automation: every member of the development team should be able to simply push code, follow the development process, and know that their applications are being tested and rapidly deployed.

Training and handover are always included in scope. This process includes the development of documentation specific to the customer workload. It outlines the development lifecycle from source control and branching all the way through deployment.

We document how to version the IaC modules and templates that were developed and how to push updates to the infrastructure.

We provide architecture diagrams that outline the branching strategy and Git workflow.

Lastly, we schedule a video conference and run a hands-on session with the client, going over how to push application updates through the development, staging, and production environments, along with the development workflow and branching strategy.

  • We show the client how to troubleshoot a failed pipeline build within CodeBuild, and where to find all relevant logs for the build and test stages within CodePipeline. Once a CI/CD automation pipeline is properly installed, the majority of DevOps-related troubleshooting happens in the CodeBuild logs and is fixed at the application layer. We also teach the client how to leverage X-Ray to better troubleshoot and enhance application performance.
  • During the video conference we outline common troubleshooting scenarios that the client will run into and show them how to effectively troubleshoot the workload.
  • We go over each component of the infrastructure CI/CD pipeline that was developed with the client and allow them time to ask any questions.
Deployment Testing and Validation

Deployments are tested and validated through a promotion strategy. The only branch that deploys automatically without approval is the development branch, which is deployed to the isolated development environment. The team QAs and validates application functionality there and approves a promotion to staging: a pull request is submitted to source control and merged into the staging branch, and the workload is deployed to the staging environment. After testing and validation in staging, a pull request is submitted from staging into master and merged. Merging to master triggers a build and deployment to production via CodeBuild and CodePipeline.
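The branch-to-environment mapping above can be sketched as a small helper; the branch names are the ones used in this workflow:

```python
def deployment_target(branch):
    """Map a merged branch to its environment under the promotion strategy.

    Mirrors the flow described above: development deploys automatically,
    while staging and master (production) are reached only through
    approved pull requests.
    """
    plan = {
        "development": ("development", False),  # auto-deploy, no approval
        "staging": ("staging", True),           # promoted via pull request
        "master": ("production", True),         # promoted via pull request
    }
    if branch not in plan:
        raise ValueError(f"branch {branch!r} does not trigger a deployment")
    environment, requires_approval = plan[branch]
    return {"environment": environment, "requires_approval": requires_approval}
```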

Version Control

All code assets are version controlled within GitHub.

Application Workload and Telemetry
CloudWatch application logging is integrated by default into all of our container and serverless workloads. We include this as an “in-scope” item for all DevOps projects. This provides a centralized system where error logs can be captured and aid in operational troubleshooting.
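Lambda forwards anything written to stdout/stderr to CloudWatch Logs, so emitting one JSON object per log line makes errors easy to query. A minimal sketch of the kind of logging setup used in a serverless workload (the field names are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Format records as single-line JSON so they can be queried in
    CloudWatch Logs Insights."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)

def get_logger(name="app"):
    """Return a logger writing JSON lines, which Lambda forwards to
    CloudWatch Logs automatically."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on warm invocations
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

The handler guard matters in Lambda: on warm starts the module is not re-imported, so re-adding handlers on every invocation would duplicate log lines.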

X-Ray is implemented for application request tracing, helping the client debug and improve the performance of their workload.

Security: Identity and Access Management

Access Requirements Defined
To discover access requirements, we look at the organizational units within the client’s business that need to access the DevOps infrastructure, such as developers, systems engineers, security engineers, and stakeholders. We follow previously defined best practices for each of these groups.

IAM groups are created for each of these organizational units and least-privilege access is applied to each: every group is granted access only to what it actually requires.

Developer Policy

Our developer policy looks like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:RevokeSecurityGroupEgress"
            ],
            "Resource": "arn:aws:ec2:*:*:*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "ec2:Describe*",
                "iam:ListInstanceProfiles",
                "mgh:CreateProgressUpdateStream",
                "mgh:ImportMigrationTask",
                "mgh:NotifyMigrationTaskState",
                "mgh:PutResourceAttributes",
                "mgh:AssociateDiscoveredResource",
                "mgh:ListDiscoveredResources",
                "mgh:AssociateCreatedArtifact",
                "discovery:ListConfigurations"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "ec2:CreateSecurityGroup",
                "ec2:ModifyInstanceAttribute",
                "ec2:CreateTags",
                "ec2:CreateVolume",
                "ec2:AttachVolume",
                "ec2:DetachVolume",
                "ec2:DeleteVolume",
                "ec2:CreateImage"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "ForAllValues:StringLike": {
                    "ec2:ResourceTag/appenv": [
                        "rmmigrate-dta"
                    ]
                }
            },
            "Action": [
                "ec2:TerminateInstances",
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:RunInstances"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": "iam:PassRole",
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

No processes deployed to AWS infrastructure make use of static AWS credentials. All instances that call other AWS services use roles. The only case where static AWS credentials are used is when a third-party integration cannot make use of assumed roles.
Logins to AWS for each APN partner and platform user make use of unique IAM users or federated login. No root access is permitted. We have a CloudWatch alarm set up that triggers an SNS email notification any time the root user logs in.

Network Security

All security groups within the environment meet the following requirements:

  • Restrict traffic between the internet and the VPC
  • Restrict traffic within the VPC
  • Allow access only from designated security groups
Security IT / Operations

Components which require encryption:

  • Lambda Variables: These are encrypted at rest using KMS.
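Inside the function, a KMS-encrypted variable is base64-decoded and then decrypted at startup. A minimal sketch, with the KMS client injected so the flow can be shown without AWS credentials:

```python
import base64

def decrypt_parameter(kms_client, ciphertext_b64):
    """Decrypt a KMS-encrypted Lambda environment variable value.

    Lambda's encryption helpers store the ciphertext base64-encoded in
    the variable; kms_client is a boto3 KMS client, injected here so
    the flow can be exercised with a stub.
    """
    ciphertext = base64.b64decode(ciphertext_b64)
    response = kms_client.decrypt(CiphertextBlob=ciphertext)
    return response["Plaintext"].decode("utf-8")
```

Decrypting once at cold start (rather than per request) keeps the KMS call off the hot path.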
AWS API Integration

AWS CLI is used for all programmatic access.

Reliability
Deployment Automation

The deployment process is fully automated. When we merge a change from development into the master branch in GitHub, CodePipeline is triggered. CodePipeline first runs CodeBuild, which compiles application dependencies via pip and requirements.txt, then creates an artifact and a CloudFormation template that triggers deployment of the serverless function via CloudFormation. We use change sets and execute them automatically via CodePipeline.

Recovery Objectives
  • RTO: Application can be down for a maximum of 3 hours without any significant harm to the business.
  • RPO: 24 Hours

Data is backed up every 24 hours, so the worst-case scenario is losing a day of data.

Adapts to Changes in Demand

This application uses Lambda, API Gateway, and Aurora PostgreSQL Serverless. We use provisioned concurrency for Lambda and autoscaling on Aurora, so the stack responds rapidly to changes in demand.

Cost Optimization
Cost Modelling

We deployed the workload into a development environment and load tested the application using the methods previously described. We then gathered metrics such as execution time and memory allocation to estimate costs. Our RDS PostgreSQL costs were determined based on the capacity units required by the application over a 24-hour period. Elasticsearch and ElastiCache costs were fixed and included in the TCO analysis for the AWS environment. Our initial estimate proved 95% accurate: it was only $100 away from the average cost since the workload was deployed to production.
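The Lambda portion of such an estimate can be derived directly from execution time and memory allocation. A sketch, with default prices taken from published us-west-2 on-demand rates at the time of writing (verify against current AWS pricing; the free tier is ignored):

```python
def estimate_monthly_lambda_cost(invocations, avg_duration_ms, memory_mb,
                                 gb_second_price=0.0000166667,
                                 request_price=0.20 / 1_000_000):
    """Estimate monthly Lambda cost from execution time and memory.

    Default prices are assumed us-west-2 on-demand rates; check
    current AWS pricing before relying on the output.
    """
    # Compute is billed in GB-seconds: memory (GB) x duration (s) per call
    gb_seconds = invocations * (memory_mb / 1024) * (avg_duration_ms / 1000)
    return gb_seconds * gb_second_price + invocations * request_price
```

Feeding in the invocation counts and durations observed during the load test gives the compute side of the estimate; fixed resources (ElastiCache, Elasticsearch) are added on top.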

Looking to implement an Automated Serverless Architecture? Contact one of our Serverless Specialists today.
