Deploying Airflow with Terraform: Streamline Data Workflows for Modern Data Platforms

Introduction:

Apache Airflow and Terraform complement each other when you manage and orchestrate data workflows on a modern data platform. Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows, while Terraform provides infrastructure-as-code, so that infrastructure resources are defined in version-controlled configuration files and created reproducibly. In this blog post, we walk through deploying and configuring Airflow on AWS using Terraform, covering the cluster, networking, security group, and EC2 resources involved.

Code Sample 1: Airflow Cluster

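resource "aws_emr_cluster" "airflow_cluster" {
  name          = "airflow-cluster"
  release_label = "emr-6.4.0"

  # aws_emr_cluster takes instance settings per instance group, not as
  # top-level arguments: one master node (the default count) plus two
  # core nodes, three instances in total.
  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 2
  }

  # Bootstrap Actions
  bootstrap_action {
    name = "airflow-bootstrap"
    path = "s3://your-bucket/bootstrap.sh"
  }

  # Software Configuration
  # NOTE: Airflow is not one of EMR's bundled applications, so the
  # "airflow-site" classification and the property names below are
  # placeholders. Verify them against the EMR release guide, or apply
  # these settings from the bootstrap script instead.
  configurations_json = <<-EOF
    [
      {
        "Classification": "airflow-site",
        "Properties": {
          "airflow.connections": "your-connections",
          "airflow.dags.path": "s3://your-bucket/dags",
          "airflow.plugins.path": "s3://your-bucket/plugins"
        }
      }
    ]
  EOF

  # Security Configuration
  service_role = aws_iam_role.airflow_service_role.arn

  ec2_attributes {
    subnet_id                         = aws_subnet.airflow_subnet.id
    emr_managed_master_security_group = aws_security_group.airflow_master_sg.id
    emr_managed_slave_security_group  = aws_security_group.airflow_slave_sg.id
    # Required by the provider; this instance profile name is an assumption
    # and must be defined elsewhere in your configuration.
    instance_profile = aws_iam_instance_profile.airflow_instance_profile.arn
  }
}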

This code block uses the “aws_emr_cluster” Terraform resource to define the cluster that hosts Airflow, together with the settings that customize the deployment:

  1. Bootstrap Actions: You can specify a script stored in an S3 bucket that is executed on each node during cluster launch to perform additional setup or installation steps, such as installing Airflow itself.
  2. Software Configuration: The JSON configuration list sets properties under an "airflow-site" classification, such as connections, the DAGs path, and the plugins path. Airflow normally reads its settings from airflow.cfg or from AIRFLOW__SECTION__KEY environment variables rather than from an XML site file, so treat these property names as placeholders and confirm how your bootstrap script applies them.
  3. Security Configuration: The service_role specifies the IAM role that the EMR service assumes to manage the cluster. The ec2_attributes block configures the subnet and security groups used by the cluster nodes.

Note that the IAM role and the master and slave security groups referenced here are not defined in this post's samples; they must exist elsewhere in your Terraform configuration before the cluster can be created.
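Code Sample 1 refers to aws_iam_role.airflow_service_role without defining it. The following is a minimal sketch of what that role could look like, assuming the AWS-managed EMR policies are acceptable in your account. The role names, the instance profile (airflow_instance_profile), and the choice of managed policies are assumptions made for illustration; check the policy ARNs and their prerequisites in the EMR documentation before relying on them.

```hcl
# Trust policy that lets the EMR service assume the service role.
data "aws_iam_policy_document" "emr_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["elasticmapreduce.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "airflow_service_role" {
  name               = "airflow-emr-service-role" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.emr_assume_role.json
}

# Assumption: the AWS-managed v2 service policy fits your setup.
resource "aws_iam_role_policy_attachment" "airflow_service_role_policy" {
  role       = aws_iam_role.airflow_service_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2"
}

# Trust policy that lets the cluster's EC2 instances assume their own role.
data "aws_iam_policy_document" "ec2_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "airflow_ec2_role" {
  name               = "airflow-emr-ec2-role" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}

resource "aws_iam_role_policy_attachment" "airflow_ec2_role_policy" {
  role       = aws_iam_role.airflow_ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role"
}

# EMR attaches an instance profile to the cluster nodes via ec2_attributes.
resource "aws_iam_instance_profile" "airflow_instance_profile" {
  name = "airflow-emr-instance-profile" # hypothetical name
  role = aws_iam_role.airflow_ec2_role.name
}
```

The service role is what the cluster's service_role argument points to, while the instance profile is the value the provider expects in the instance_profile argument of ec2_attributes.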

Code Sample 2: VPC and Subnets

resource "aws_vpc" "airflow_vpc" {
  cidr_block = "10.0.0.0/16"
  
  # Define additional VPC configurations
  # For example, enable DNS support and hostname assignment
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "airflow_subnet" {
  vpc_id     = aws_vpc.airflow_vpc.id
  cidr_block = "10.0.1.0/24"
  
  # Define additional subnet configurations
  # For example, assign tags (route tables are separate resources)
  tags = {
    Name = "Airflow Subnet"
  }
}

Within this code block, we define the VPC and subnet using the “aws_vpc” and “aws_subnet” Terraform resources, respectively. The VPC creates an isolated network environment for your Airflow deployment, while the subnet designates a specific IP address range (10.0.1.0/24) within the VPC's 10.0.0.0/16 block. On the VPC, enable_dns_support turns on the Amazon-provided DNS resolver and enable_dns_hostnames lets instances receive DNS hostnames, so that resources in the VPC can address each other by name. On the subnet, we assign a Name tag for easy identification and future management.
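A subnet on its own has no route to the internet, which matters if the Airflow host needs to download packages or if you want to reach the UI from outside the VPC. The sketch below shows one way to add that routing, assuming a public subnet is what you want; the resource names airflow_igw, airflow_public_rt, and airflow_subnet_assoc are hypothetical and not part of the original samples.

```hcl
# Internet gateway attached to the Airflow VPC.
resource "aws_internet_gateway" "airflow_igw" {
  vpc_id = aws_vpc.airflow_vpc.id
}

# Route table that sends all non-local traffic through the gateway.
resource "aws_route_table" "airflow_public_rt" {
  vpc_id = aws_vpc.airflow_vpc.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.airflow_igw.id
  }
}

# Associate the route table with the Airflow subnet.
resource "aws_route_table_association" "airflow_subnet_assoc" {
  subnet_id      = aws_subnet.airflow_subnet.id
  route_table_id = aws_route_table.airflow_public_rt.id
}
```

If you would rather keep the subnet private, a NAT gateway or VPC endpoints would be the usual alternative; which one fits depends on how your Airflow workers reach external systems.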

Code Sample 3: Security Group

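resource "aws_security_group" "airflow_sg" {
  name   = "airflow-sg"
  vpc_id = aws_vpc.airflow_vpc.id

  # Define security group rules for Airflow communication
  # For example, allow inbound traffic on port 8080 for Airflow UI
  # NOTE: 0.0.0.0/0 admits any address; narrow this for real deployments
  ingress {
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Terraform removes AWS's default allow-all egress rule, so declare
  # outbound access explicitly if the instance needs it
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}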

The “aws_security_group” resource defines the rules that control inbound and outbound traffic for your Airflow deployment. Within this code block, we configure the security group to allow inbound TCP traffic on port 8080, the default port of the Airflow web UI. The 0.0.0.0/0 CIDR block used in this example admits traffic from any address, which is convenient for a demonstration but exposes the UI to the internet; in practice, restrict the ingress rule to the IP ranges that actually need access.
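One way to avoid hard-coding an open CIDR range is to pass the allowed ranges in as a variable. The following is a sketch under that assumption; the variable airflow_ui_allowed_cidrs and the security group airflow_ui_restricted_sg are hypothetical names, and the default range is a documentation-only address block that you would replace with your own.

```hcl
variable "airflow_ui_allowed_cidrs" {
  description = "CIDR ranges allowed to reach the Airflow UI on port 8080"
  type        = list(string)
  default     = ["203.0.113.0/24"] # documentation-only range; replace it
}

resource "aws_security_group" "airflow_ui_restricted_sg" {
  name   = "airflow-ui-restricted"
  vpc_id = aws_vpc.airflow_vpc.id

  # Only the listed ranges can reach the Airflow UI.
  ingress {
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = var.airflow_ui_allowed_cidrs
  }

  # Allow all outbound traffic, for example to download packages.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Keeping the ranges in a variable means each environment can supply its own list, for example through a .tfvars file, without editing the resource.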

Code Sample 4: EC2 Instance

resource "aws_instance" "airflow_instance" {
  instance_type          = "t3.medium"
  ami                    = "ami-0123456789" # placeholder; use a valid AMI ID for your region
  subnet_id              = aws_subnet.airflow_subnet.id
  vpc_security_group_ids = [aws_security_group.airflow_sg.id]

  # Define additional configurations for the Airflow instance
  # For example, specify user data for additional setup
  user_data = <<-EOF
    #!/bin/bash
    echo "Custom setup steps"
    # ...
  EOF
}

The “aws_instance” resource defines the virtual machine on which Airflow runs. We specify the instance type, AMI, subnet ID, and security group so that the instance launches inside the VPC and subnet defined earlier, behind the Airflow security group. The user_data script runs when the instance first boots and is the place for additional setup steps, such as installing Airflow and its dependencies.

Deploying Airflow with Terraform involves defining the necessary infrastructure resources, including the Airflow cluster resource. The provided code samples demonstrate the creation of a VPC, subnet, security group, EC2 instance, and an Airflow cluster using Terraform. By customizing the bootstrap actions, software configurations, and security settings, you can tailor the Airflow deployment to your specific needs.
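After the resources above are applied, it helps to have Terraform report where Airflow can be reached. The outputs below are a small sketch of that idea; the output names are hypothetical. The URL is built from the private IP because the EC2 sample does not request a public address, so it is only reachable from inside the VPC or over a VPN.

```hcl
# Identifier of the EC2 instance from Code Sample 4.
output "airflow_instance_id" {
  value = aws_instance.airflow_instance.id
}

# Address of the Airflow UI, assuming the webserver listens on port 8080.
output "airflow_ui_url" {
  description = "Airflow UI address, reachable from within the VPC"
  value       = "http://${aws_instance.airflow_instance.private_ip}:8080"
}

# DNS name of the EMR master node from Code Sample 1.
output "airflow_cluster_master_dns" {
  value = aws_emr_cluster.airflow_cluster.master_public_dns
}
```

Terraform prints these values at the end of an apply, and they can be read again later with the terraform output command.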

Summary:

By leveraging these Terraform resources and configurations, you can easily provision the required networking infrastructure, security groups, and EC2 instances for your Airflow deployment. These code samples demonstrate the flexibility and simplicity of managing the essential infrastructure components necessary to establish and run Airflow.

At Anant, we specialize in helping companies modernize and maintain their data platforms. Our expertise in data engineering, including our focus on Apache Cassandra, equips us to empower clients with cutting-edge solutions. Leveraging Terraform and Airflow, we can assist in streamlining data workflows and achieving efficient data orchestration. Contact us to discover how we can support your data engineering objectives and drive success in your organization’s data initiatives.
