In this tutorial we’ll walk through how Pocket deploys the Prefect data pipeline framework to ECS.

Why Pocket chose to use Prefect

Pocket is a part of Mozilla where we empower people to discover, organize, consume, and share content that matters to them. Our data is relatively small, but as a small team we support a lot of products with various data needs. For years, Pocket used Astronomer Airflow and in-house solutions side by side to run data processes. While Airflow is a very powerful tool, our engineers experienced a steep learning curve with it, and our in-house solutions suffered from a lack of observability.

After reviewing our options, we concluded that Prefect was the best long-term solution to support Pocket’s needs. What we like best about Prefect is how easy it is to pass data between tasks: it is as simple as passing the return value of one task into a parameter of another, just as you would for any other Python program.
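
For example, here's a minimal sketch of that pattern in Prefect 1.x (the version this tutorial targets); the tasks are illustrative, not from our codebase:

    from prefect import Flow, task

    @task
    def fetch_numbers():
        # Illustrative stand-in for a real extraction task.
        return [1, 2, 3]

    @task
    def sum_numbers(numbers):
        return sum(numbers)

    with Flow("data-passing-example") as flow:
        # The return value of one task becomes the argument of the next;
        # Prefect infers the dependency between the two tasks.
        numbers = fetch_numbers()
        total = sum_numbers(numbers)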

Solution Architecture

The table below outlines the decisions we made in deploying Prefect.

|                                | Decision                       | Reason |
| ------------------------------ | ------------------------------ | ------ |
| Execution environment          | AWS ECS Fargate                | ECS allows long-running jobs based on Docker images. |
| Application code storage       | Custom Docker image            | Storing code in a Docker image allows our local development environment to match production as closely as possible. |
| Infrastructure as code         | CDK for Terraform (TypeScript) | Writing infrastructure in a language that developers are already familiar with gives us control over our data infrastructure, without having to depend on dev-ops. |
| Continuous deployment pipeline | AWS CodePipeline/CodeBuild     | We consider this the most secure way to automatically build our infrastructure. |

Security considerations

Our use of Prefect (see architecture diagram below) has several security benefits:

  1. Our Prefect Agent and Flows are deployed in a private subnet, such that they’re not publicly accessible from the internet.
  2. We follow the principle of least privilege, by only giving roles the access that they need. For example, the Prefect Agent only has permission to start tasks in its own ECS cluster.
  3. Prefect never receives personal user data. It only knows about the structure of our data flows, and when to schedule them.

Architecture Diagram

Prerequisites

Accounts that you’ll need to register for:

  • AWS
  • GitHub
  • Prefect Cloud
  • Terraform Cloud
  • Docker Hub

Tools that you’ll need to have installed:

  • git
  • Node.js v16 and npm (nvm is optional but convenient)
  • the Terraform CLI
  • Docker
  • the AWS CLI

Optionally, if you’re new to CDK for Terraform, it might be helpful to watch our YouTube workshop where we show how to deploy a simple service to ECS using CDK for Terraform. We even built helpful abstracted constructs for you to use!

Tutorial

The steps below deploy Prefect to your AWS account. Our repo supports deploying to a ‘development’ and ‘production’ AWS account. In this tutorial we’ll only do the former. If you get stuck, feel free to ask me (@Mathijs Miermans) a question in the public Prefect Slack.

Step 1: Fork Pocket/data-flows

  1. Go to https://github.com/Pocket/data-flows/ and click the ‘Fork’ button.
  2. In your terminal, git clone your forked repo.
  3. Run git checkout -b tutorial-deployment to create a new branch.

Step 2: Laying the foundation for continuous deployment

At Pocket, we have fully automated the creation of new services. For this tutorial, we’ll need to manually create a few resources in AWS to kickstart continuous deployment.

2.1. Create a CodeStar connection to GitHub

You can skip this step if you already have a CodeStar connection with access to your forked data-flows repo.

2.2. Create a CodeBuild project called DataFlows-Dev (see screenshot below)

CodeBuild screenshot

2.3. Attach the ‘AdministratorAccess’ policy to the CodeBuild IAM role

CodeBuild created a role called codebuild-DataFlows-Dev-service-role. This role requires broad permissions because Terraform needs to create various resources, including IAM roles.

IAM CodeBuild role screenshot
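
If you’d rather script this step than click through the console, a boto3 sketch (assuming your default AWS credentials can manage IAM) looks like this:

    import boto3

    iam = boto3.client("iam")

    # Attach the AWS-managed AdministratorAccess policy to the CodeBuild role.
    iam.attach_role_policy(
        RoleName="codebuild-DataFlows-Dev-service-role",
        PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
    )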

2.4. Create an SNS topic named Deployment-Dev-ChatBot

This SNS topic must exist for the deployment to succeed, but we won’t use it in this tutorial. You can subscribe to the topic if you’d like to be notified when deployments happen.

SNS topic screenshot
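
Again, the console works fine here, but for reference the boto3 equivalent is a one-liner (assuming default credentials in your target region):

    import boto3

    sns = boto3.client("sns")

    # create_topic is idempotent: it returns the existing topic if the name is taken.
    response = sns.create_topic(Name="Deployment-Dev-ChatBot")
    print(response["TopicArn"])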

Step 3: Modify source code

Open .aws/src/config/index.ts in a text editor to make a few changes:

  1. Change s3BucketPrefix from pocket to your organization name.
  2. Change domain to domain names that you own. You only need to change the case where isDev is true for now because this tutorial deploys to the ‘development’ environment.
  3. Change both ARNs in githubConnectionArn to the ARN of the CodeStar connection that you created in the previous step.
  4. Change the dataLearningBucketName to an S3 bucket that you’d like your Prefect flows to have read and write access to. At the end of this tutorial, we’ll run an example flow that downloads something from this bucket as an end-to-end test.
  5. Remove all values in the parameterStoreNames list. After completing this tutorial, this is where you could put any credentials you want to securely inject as environment variables into your Prefect flow.
  6. Change the awsRegion from us-east-1 to the region that you’re deploying to.
  7. Change repository from pocket/data-flows to the name of your forked repository.
  8. Change codeDeploySnsTopicName to Deployment-${environment}-ChatBot, to match the name of the SNS topic that you created in the previous step.

Source code for Flows is located in the src/flows/ directory of your forked repo.

  1. Delete all files in src/flows/ except src/flows/example/s3_download_flow.py.
  2. Open src/flows/example/s3_download_flow.py, and modify s3_bucket_source and s3_bucket_key to point to a key in the S3 bucket that you set dataLearningBucketName to in .aws/src/config/index.ts (see the sketch after this list).
  3. Optionally, you can put your flow files in src/flows at this point.
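
For reference, a stripped-down version of such a download flow might look like the sketch below; the bucket and key values are placeholders, and the actual file in the repo may differ in its details:

    from prefect import Flow
    from prefect.tasks.aws.s3 import S3Download

    # Placeholders: point these at the bucket you configured as
    # dataLearningBucketName, and at an object that exists in it.
    s3_bucket_source = "my-data-learning-bucket"
    s3_bucket_key = "example/test-object.txt"

    download = S3Download(bucket=s3_bucket_source)

    with Flow("s3_download_flow") as flow:
        data = download(key=s3_bucket_key)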

Your changes by now should look something like the screenshot below.

Commit your changes to your branch, for example: git commit -a -m "Configure custom deployment" && git push origin tutorial-deployment. Then force-push your changes to your dev branch: git push -f origin tutorial-deployment:dev.

Code changes

Step 4: Create a Prefect ‘dev’ project and API key

In Prefect Cloud, create a project called ‘dev’.

  1. Go to the Service Accounts page in Prefect Cloud and click on ‘Add Service Account’.
  2. Create an API Key for this service account.
  3. Go to the AWS Systems Manager Parameter Store in the AWS console, and create a parameter named /DataFlows/Dev/PREFECT_API_KEY of type SecureString, with its value set to the Prefect API key.

Step 5: Create Terraform Workspaces

Create two workspaces called DataFlows-Dev and DataFlows-Prod. Both are required to exist for the deployment to succeed, even though we’ll only write Terraform state to DataFlows-Dev for this tutorial.

Terraform Workspaces

For both workspaces, change ‘Execution Mode’ in general settings to ‘Local’:

Terraform Workspaces

Step 6: Storing secrets and parameters in AWS

6.1. AWS Systems Manager Parameter Store

Create parameters in the AWS Systems Manager Parameter Store according to the following table. /DataFlows/Dev/PREFECT_API_KEY was created previously during step 4 and is listed only for completeness. If you don’t yet have a VPC with private and public subnets, you can create one using our VPC guide.

| Parameter name                 | Type         | Value |
| ------------------------------ | ------------ | ----- |
| /DataFlows/Dev/PREFECT_API_KEY | SecureString | Prefect API key |
| /Shared/Vpc                    | String       | VPC id |
| /Shared/PrivateSubnets         | StringList   | Comma-separated private subnet ids |
| /Shared/PublicSubnets          | StringList   | Comma-separated public subnet ids |

Parameter Store screenshot
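
If you prefer to script these parameters, a boto3 sketch (the VPC and subnet ids below are placeholders) would be:

    import boto3

    ssm = boto3.client("ssm")

    # /DataFlows/Dev/PREFECT_API_KEY was already created in step 4.
    # The ids below are placeholders; substitute your own VPC and subnet ids.
    parameters = [
        ("/Shared/Vpc", "String", "vpc-0123456789abcdef0"),
        ("/Shared/PrivateSubnets", "StringList", "subnet-aaa,subnet-bbb"),
        ("/Shared/PublicSubnets", "StringList", "subnet-ccc,subnet-ddd"),
    ]

    for name, param_type, value in parameters:
        ssm.put_parameter(Name=name, Type=param_type, Value=value, Overwrite=True)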

6.2. AWS Secrets Manager

Store a secret in AWS Secrets Manager called ‘CodeBuild/Default’ with keys:

  1. terraform_token = Terraform API token (Create one in User settings > Tokens)
  2. pagerduty_token = <empty string>
  3. github_access_token = <empty string>

Create a second secret in AWS Secrets Manager called ‘Shared/DockerHub’ with keys:

  1. username = Dockerhub username
  2. password = Dockerhub token (Create one in Account Settings > Security)
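
A boto3 sketch covering both secrets (the token values are placeholders; create the real tokens first) might look like:

    import json

    import boto3

    secretsmanager = boto3.client("secretsmanager")

    # Placeholder token values; substitute your real Terraform API token.
    secretsmanager.create_secret(
        Name="CodeBuild/Default",
        SecretString=json.dumps({
            "terraform_token": "<your-terraform-api-token>",
            "pagerduty_token": "",
            "github_access_token": "",
        }),
    )

    # Placeholder credentials; substitute your Docker Hub username and token.
    secretsmanager.create_secret(
        Name="Shared/DockerHub",
        SecretString=json.dumps({
            "username": "<your-dockerhub-username>",
            "password": "<your-dockerhub-token>",
        }),
    )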

Step 7: Doing an initial, manual Terraform deployment

On subsequent deployments, our infrastructure changes will be deployed automatically by the AWS CodePipeline, but this CodePipeline does not exist yet. It’s a common chicken-and-egg problem in continuous deployment. At Pocket, we’ve solved it by automatically running CodeBuild (the project we just created) if the CodePipeline doesn’t exist.

For this tutorial, we’ll run CDK for Terraform in a terminal:

  1. cd .aws to go to the .aws/ directory in the project root of your forked repo.
  2. (Optional) If you have nvm installed, run nvm use to switch to Node v16.
  3. npm ci to install dependencies.
  4. npm run build:dev to transpile the TypeScript code to JavaScript and synthesize it to Terraform-friendly JSON.
  5. cd cdktf.out/stacks/data-flows to go to the directory with the Terraform code.
  6. terraform login to log in to your Terraform account, if you’re not logged in already.
  7. terraform init to initialize Terraform. Choose the dev workspace when prompted.
  8. terraform apply to deploy your stack to AWS. Confirm with ‘yes’ when prompted. The deployment should take about 10 minutes.

Step 8: Building and pushing the Docker image

We’ve configured Prefect to store our code in a Docker image. We consider this more reliable than storing code on S3 because code is accessed in the same way during local development and production. The downside is that we need to build the Docker image every time we deploy to AWS, but continuous deployment takes care of this automatically at Pocket. For this tutorial, we’ll build and push the Docker image manually:

  1. Go to the ECR repository called dataflows-dev-app and click on ‘View push commands’.
  2. Go to the project root directory in a terminal.
  3. Copy and run the first push command to authenticate ECR with Docker.
  4. Instead of the second push command (starting with docker build), run this:
     docker build -t dataflows-dev-app --file .docker/app/Dockerfile .
    
  5. Run the third and fourth push commands as AWS suggests to tag and push the image.

Go back to the AWS console, click the refresh button in the dataflows-dev-app repository, and confirm that an image tagged ‘latest’ was pushed.

Step 9: Triggering the CodePipeline

At this point, everything is set up to deploy the Prefect Agent to ECS. Go to DataFlows-Dev-CodePipeline and click on ‘Release change’. The pipeline consists of three stages:

  1. Source downloads the source code from GitHub and stores it as an artifact on S3.
  2. Deploy consists of two actions:
    1. Deploy_CDK creates infrastructure in AWS by running terraform apply.
    2. Deploy_ECS does a blue/green deployment of the Prefect Agent on ECS.
  3. Register_Prefect_Flows registers the flows defined in src/flows with Prefect Cloud.
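
If you want to trigger the release from code instead of the console, boto3’s start_pipeline_execution is the equivalent of clicking ‘Release change’:

    import boto3

    codepipeline = boto3.client("codepipeline")

    # Equivalent of clicking 'Release change' in the console.
    response = codepipeline.start_pipeline_execution(name="DataFlows-Dev-CodePipeline")
    print(response["pipelineExecutionId"])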

It’ll take about 10 minutes for the CodePipeline to finish running, after which it should look like the screenshot below.

Successful CodePipeline execution

Step 10: Test the setup by running a flow from Prefect

Go to the ‘dev’ project in Prefect Cloud. The flows in Prefect should match the ones in the src/flows directory. Open the examples/s3_download_flow flow and click ‘Quick run’ to try it out. It should succeed if the S3 bucket and key that you provided are present.
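
If you’d rather kick off the run from code than from the UI, here’s a Prefect 1.x sketch; the flow id is a placeholder that you’d copy from the flow page in Prefect Cloud:

    from prefect import Client

    # Assumes you've authenticated locally, e.g. with: prefect auth login --key <API key>
    client = Client()

    # Placeholder flow id; copy the real one from the flow page in Prefect Cloud.
    flow_run_id = client.create_flow_run(flow_id="<flow-id-from-prefect-cloud>")
    print(f"Started flow run {flow_run_id}")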

Successful Prefect Flow run

Next steps

Congrats if you followed all steps! At this point, we have a secure, scalable Prefect deployment, with the beginnings of continuous deployment.

Clean up AWS resources

If you’re done with this tutorial and don’t want to incur any further costs from AWS, here’s how to delete all AWS resources created as part of this tutorial:

  1. In the AWS console, search for an S3 bucket name containing dataflows-results and delete it.
  2. cd cdktf.out/stacks/data-flows to go to the directory with Terraform code.
  3. Run terraform destroy to destroy all resources.
  4. Terraform should respond with “Destroy complete!” after a few minutes. Try running terraform destroy again if you get an error such as error deleting Target Group: ResourceInUse.
  5. In the AWS console, delete resources that were created manually:
    • The SNS topic Deployment-Dev-ChatBot
    • The CodeBuild project DataFlows-Dev
    • The IAM role codebuild-DataFlows-Dev-service-role
    • Two IAM policies containing CodeBuildBasePolicy-DataFlows-Dev
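
For reference, here’s a boto3 sketch of that last console cleanup (the SNS topic ARN is a placeholder since it embeds your region and account id; double-check the console afterwards for any policies that weren’t attached to the role):

    import boto3

    sns = boto3.client("sns")
    codebuild = boto3.client("codebuild")
    iam = boto3.client("iam")

    # Placeholder ARN; topic ARNs embed your region and account id.
    sns.delete_topic(TopicArn="arn:aws:sns:<region>:<account-id>:Deployment-Dev-ChatBot")

    codebuild.delete_project(name="DataFlows-Dev")

    role_name = "codebuild-DataFlows-Dev-service-role"
    # Detach every policy from the role, deleting the customer-managed
    # CodeBuildBasePolicy copies along the way, then delete the role itself.
    attached = iam.list_attached_role_policies(RoleName=role_name)["AttachedRolePolicies"]
    for policy in attached:
        iam.detach_role_policy(RoleName=role_name, PolicyArn=policy["PolicyArn"])
        if "CodeBuildBasePolicy-DataFlows-Dev" in policy["PolicyName"]:
            iam.delete_policy(PolicyArn=policy["PolicyArn"])
    iam.delete_role(RoleName=role_name)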

Complete continuous deployment

For this tutorial, we manually pushed the Docker image and triggered the CodePipeline through the AWS console. At Pocket, the deployment happens automatically in CircleCI when commits are merged into our dev or main branch. While we like that CircleCI ‘orbs’ make it easy to share code across services, having to coordinate between CircleCI and CodePipeline is a challenge. Instead, you could add a stage to the CodePipeline that pushes the Docker image. To get started, look at how our repo adds a CodePipeline stage to deploy flows to Prefect Cloud.

Create separate staging and production environments

In this tutorial we created a single environment, linked to the dev branch of your forked repo. To create a production environment, follow this tutorial again in a different AWS account, and point it to the main branch of your forked repo. By default, production flows will be deployed to a Prefect project called ‘main’.


Tagged with: #Prefect, #CDK, #Terraform, #AWS