Terraform for Data Engineers

Getting started with Terraform.

Published in

AWS in Plain English

9 min read · May 7, 2024

Over the next couple of posts, we will look at using Terraform for cloud resource deployments. We will concentrate on AWS and on using Terraform as a data engineer. This post gets straight to the point. To start working with Terraform and AWS, we need to set up our development environment: we need an AWS account, and we need to create a user for Terraform to act as. In practice, it is important to make this a restricted user on AWS and to configure our CLI to use that user’s credentials.

This post does not cover how to set up an AWS account or how to create a dedicated user. It assumes both already exist and focuses on configuring our development environment so that we can control and provision resources on our AWS account from our personal computers.

To start, we will install the AWS CLI and Terraform and configure our development environment so that Terraform configuration can manage our AWS account. With that in place, we will be able to provision Lambda functions, create S3 buckets, provision EC2 instances, and so on from declarations in configuration files.

To install the AWS CLI, download the packaged installer to your preferred location, making sure you can access the directory where the zip file is saved. The command below fetches the Linux x86_64 build.

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"

Now unzip the archive so that the installer is ready to run:

unzip awscliv2.zip

After unzipping, run the bundled installer. It copies the files to /usr/local/aws-cli and creates a symlink in /usr/local/bin so that AWS CLI commands are available from anywhere in the terminal. Writing to these directories typically requires sudo.

sudo ./aws/install -i /usr/local/aws-cli -b /usr/local/bin

Confirm that the AWS CLI is available in your terminal.

aws --version

Now configure the AWS CLI with your credentials. For development purposes, create a dedicated user on your AWS account, ideally with restricted privileges rather than full administrative access, and generate an access key and secret key for it so that you can interact with AWS from the command line.

aws configure

Enter the access key ID and the secret access key, then the default region and the output format (in my case ‘yaml’), pressing Enter after each. Next, we can install Terraform on Debian or Ubuntu using the bash script below. We only need to run it with bash install_terraform.sh, and with that we are ready to start using Terraform.

#!/bin/bash

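# --------------------install_terraform.sh--------------
# Install the tools needed to verify and add the HashiCorp repository.
sudo apt-get update && sudo apt-get install -y gnupg software-properties-common

# Download HashiCorp's GPG key and store it in the system keyring.
wget -O- https://apt.releases.hashicorp.com/gpg | \
gpg --dearmor | \
sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null

# Print the key's fingerprint so it can be checked by eye.
gpg --no-default-keyring \
--keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg \
--fingerprint

# Register the HashiCorp apt repository for this distribution release.
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \
https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \
sudo tee /etc/apt/sources.list.d/hashicorp.list

# Refresh the package index and install Terraform.
sudo apt-get update
sudo apt-get install -y terraform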

Now that Terraform is installed, let’s quickly look at how to use it. Running terraform init initializes the working directory containing our configuration files and installs the plugins for the required providers, after which we can start provisioning resources on AWS.

Running terraform plan then shows which resources are going to be created. First, we will create a file called providers.tf and include the code below.

### -----------------providers.tf------------------

provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = "Dev"
      Name        = "Provider Tag"
    }
  }
}
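
Since terraform init installs provider plugins, it can help to state explicitly which provider and version range the configuration expects. The block below is a minimal sketch of that idea; the file name versions.tf and both version constraints are assumptions for illustration, not values taken from this post.

```hcl
### -----------------versions.tf (hypothetical file name)------------------

terraform {
  # Assumed minimum Terraform CLI version; adjust to what you have installed.
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      # Official AWS provider published by HashiCorp.
      source = "hashicorp/aws"
      # Assumed version range; pin to whatever range you have tested.
      version = "~> 5.0"
    }
  }
}
```

With a block like this in place, terraform init downloads a provider release that satisfies the constraint and records the exact version in the dependency lock file.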

Next, we create another file named main.tf and include the code below.

### -----------------main.tf------------------

resource "aws_s3_bucket" "bronze_data" {
  bucket = "bronze_data_bucket"

  tags = {
    Name        = "Data Lake Bronze Bucket"
    Environment = "Dev"
  }
}

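The Terraform code above defines the name and tags of an S3 bucket to be created on AWS. Keep in mind that S3 bucket names must be globally unique and may contain only lowercase letters, numbers, dots, and hyphens, so a name such as bronze_data_bucket, used in this post for illustration, has to be changed to something like bronze-data-bucket plus a unique suffix before running terraform apply. From here, we can add more Terraform code to main.tf to set up Lambda functions, create more S3 buckets, provision EC2 instances, and so on.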

Variables and data types in Terraform

Let’s quickly look at some important concepts in Terraform, starting with variables and, along the way, data types. Terraform supports several data types, and they come in handy when we start to create resources on AWS. We can declare strings, numbers, booleans, lists, objects, and combinations of these in Terraform configuration files. For example, we can declare a string variable called bucket_name as shown below:

###-----------------variables.tf---------------------------

variable "bucket_name" {
  description = "The name of your s3 bucket name"
  type        = string
}

The variable declaration above contains the name of the variable, a description, and the variable type. The variable is named bucket_name, its description is “The name of your s3 bucket name”, and its type is string. Because no default is set, we can supply the name of the bucket we would like to create at runtime.
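
Because the value arrives at runtime, it can be worth checking it before Terraform talks to AWS. The sketch below is a variant of the same declaration with a validation block added; it would replace the declaration above rather than sit next to it. The regular expression is an assumption that approximates the S3 naming rules (3 to 63 characters, lowercase letters, numbers, dots, and hyphens) and does not cover every edge case.

```hcl
###-----------------variables.tf (variant with validation)-----------------

variable "bucket_name" {
  description = "The name of your s3 bucket name"
  type        = string

  validation {
    # can() turns a regex mismatch into false instead of an error.
    condition = can(regex("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$", var.bucket_name))

    error_message = "The bucket name must be 3 to 63 characters long and use only lowercase letters, numbers, dots, and hyphens."
  }
}
```

With this in place, terraform plan reports the error message for a value that does not match the pattern, such as one containing underscores, instead of producing a plan that would only fail later at apply time.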

We can now reference the bucket name in our Terraform configuration file (main.tf) as shown below, using var.bucket_name.

###-----------------main.tf------------------

resource "aws_s3_bucket" "bronze_data" {
  bucket = var.bucket_name

  tags = {
    Name        = "Data Lake Bronze Bucket"
    Environment = "Dev"
  }
}

We can now pass in the bucket name at runtime. If we run terraform plan, Terraform prompts for the value, and we can enter the bucket name as shown below.

“terraform init” and “terraform plan” execution responses in terminal.

Terraform’s response looks like the following, and we can then enter the value of the bucket name:

The name of your s3 bucket name

  Enter a value: bronze_data_bucket


Terraform used the selected providers to generate the following execution plan. Resource
actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_s3_bucket.bronze_data will be created
  + resource "aws_s3_bucket" "bronze_data" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = "bronze_data_bucket"
      + bucket_domain_name          = (known after apply)
      + bucket_prefix               = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags                        = {
          + "Environment" = "Dev"
          + "Name"        = "Data Lake Bronze Bucket"
        }
      + tags_all                    = {
          + "Environment" = "Dev"
          + "Name"        = "Data Lake Bronze Bucket"
        }
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

───────────────────────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't guarantee to take
exactly these actions if you run "terraform apply" now.
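
The interactive prompt is only one way to supply a value. Terraform also loads a file named terraform.tfvars from the working directory automatically, and accepts values on the command line through the -var option. The snippet below is a small sketch of the file-based approach; the bucket name in it is a hypothetical placeholder, not a name used elsewhere in this post.

```hcl
### -------------------- terraform.tfvars -------------------------------

# Hypothetical value; S3 bucket names must be globally unique.
bucket_name = "bronze-data-bucket-example"
```

With this file present, terraform plan no longer prompts for bucket_name, which makes runs repeatable and easier to script.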

Instead of passing values in at runtime, we can improve variables.tf by providing a default value so that Terraform does not need to ask for the bucket name. The modified variables.tf looks like the following.

###-----------------variables.tf---------------------------

variable "bucket_name" {
  description = "The name of your s3 bucket name"
  type        = string
  default     = "bronze_data_bucket"
}

Now, if we run terraform plan in the terminal, Terraform no longer asks for the bucket name; the default value specified in the variable declaration is used.

Terraform used the selected providers to generate the following execution plan. Resource
actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_s3_bucket.bronze_data will be created
  + resource "aws_s3_bucket" "bronze_data" {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = "bronze_data_bucket"
      + bucket_domain_name          = (known after apply)
      + bucket_prefix               = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags                        = {
          + "Environment" = "Dev"
          + "Name"        = "Data Lake Bronze Bucket"
        }
      + tags_all                    = {
          + "Environment" = "Dev"
          + "Name"        = "Data Lake Bronze Bucket"
        }
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)
    }

Plan: 1 to add, 0 to change, 0 to destroy.

───────────────────────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't guarantee to take
exactly these actions if you run "terraform apply" now.

Number data type in Terraform

The number data type in Terraform is a numeric value that can represent both whole numbers like 300 and fractional values like 2.283185. It is useful in many places, for example when specifying the allocated storage in gigabytes for a Relational Database Service (RDS) instance. We can define the allocated storage as shown below.

### -----------------variables.tf---------------------------

variable "allocated_storage" {
  description = "The allocated storage for our RDS instance in gigabytes."
  type        = number
  default     = 10
}

We can now use the variable value in our aws_db_instance resource block as shown below, instead of declaring the value in the resource block.

### ---------------------------- main.tf ---------------------

resource "aws_db_instance" "default" {
  allocated_storage           = var.allocated_storage
  db_name                     = "mydb"
  engine                      = "mysql"
  engine_version              = "5.7"
  instance_class              = "db.t3.micro"
  manage_master_user_password = true
  username                    = "foo"
  parameter_group_name        = "default.mysql5.7"
}

Using number variable in the Terraform resource block

The advantage of this is that we have more control over our Terraform development, especially when we are building modules for our applications: we can specify different values for the development and production environments. We can also reuse the allocated_storage value in different parts of our Terraform code for different MySQL databases in the company.
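
One way to give each environment its own value is to key the numbers by environment name and look up the right one. The following is a minimal sketch of that idea, not the approach this post prescribes; the variable names environment and allocated_storage_by_env and the storage sizes are all assumptions for illustration.

```hcl
### -----------------variables.tf (hypothetical additions)-----------------

variable "environment" {
  description = "Deployment environment, for example dev or prod."
  type        = string
  default     = "dev"
}

variable "allocated_storage_by_env" {
  description = "Allocated RDS storage in gigabytes, per environment."
  type        = map(number)

  # Assumed sizes; choose values that fit your workloads.
  default = {
    dev  = 10
    prod = 100
  }
}

locals {
  # Look up the storage size for the selected environment.
  allocated_storage = var.allocated_storage_by_env[var.environment]
}
```

A resource block could then set allocated_storage = local.allocated_storage, and switching environment from dev to prod would change the storage size without touching the resource itself.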

List data type in Terraform

A list is another powerful type in Terraform. It lets us hold several values of the same kind, which is useful when we want to create more than one resource or iterate over values. For example, we can create one EC2 instance for each availability zone we have specified.

### -----------------variables.tf---------------------------

variable "region" {
  description = "Availability zones in which to create EC2 instances."
  type        = list(string)
  default     = ["us-east-1a", "us-east-1b"]
}

List data type declaration in Terraform

With the default value set to [“us-east-1a”, “us-east-1b”], we can create two EC2 instances for our applications, one in each of the availability zones us-east-1a and us-east-1b. The Terraform code to do so looks like the one below.

### ----------------- main.tf ---------------------------

resource "aws_instance" "list_instance" {
  # Create one instance per entry in var.region.
  count = length(var.region)

  ami           = "ami-0889a44b331db0194"
  instance_type = "t2.micro"

  # Place each instance in the availability zone at the matching list index.
  availability_zone = var.region[count.index]
}

Using list variable in Terraform resource block
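
An alternative to count for iterating over a list is for_each, which keys each instance by its value rather than by its position in the list. The sketch below relies on the var.region variable declared in variables.tf above; the resource name zone_instance is hypothetical, and the AMI and instance type are simply reused from the example above.

```hcl
### ----------------- main.tf (for_each variant) ---------------------------

resource "aws_instance" "zone_instance" {
  # for_each needs a set or a map, so the list is converted with toset().
  for_each = toset(var.region)

  ami           = "ami-0889a44b331db0194"
  instance_type = "t2.micro"

  # each.value is the availability zone name for this instance.
  availability_zone = each.value
}
```

With for_each, removing one zone from the list only affects the instance for that zone, whereas with count the instances after the removed index are shifted and may be recreated.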

We can also declare variables that combine list, object, and string types, as shown below. This is very helpful when we need a JSON-like structure to be used in other parts of our Terraform code.

### -----------------variables.tf---------------------------

variable "jobs_notification" {
  type = list(object({
    id      = string
    events  = list(string)
    sns_arn = string
    filter  = object({
      prefix = string
    })
  }))
}

Declaring a list of objects variable in Terraform

With the list of objects declared as above, we can assign values to jobs_notification as shown below. We can then use the variable wherever we need it, such as in a Terraform module or a data source.

### -------------------- terraform.tfvars -------------------------------

jobs_notification = [
  {
    id      = "job1"
    events  = ["create", "update"]
    sns_arn = "arn:aws:sns:us-east-1:xxxxxxxxxxxxxx:sample_sns_topic"
    filter  = {
      prefix = "path/to/job1/"
    }
  },
  {
    id      = "job2"
    events  = ["delete"]
    sns_arn = "arn:aws:sns:us-east-1:xxxxxxxxxxxxxx:sample_sns_topic"
    filter  = {
      prefix = "path/to/job2/"
    }
  }
]

We can pass jobs_notification to a module as shown below:

### ------------------- main.tf -------------------------

module "notification" {
  source = "./modules"

  notifications = var.jobs_notification
}
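
For the module call to work, the module itself has to declare a matching notifications input. The post does not show the module's contents, so the following is only a sketch of what they might look like: the file layout, the local value, and the output name are assumptions. It declares the input with the same type as jobs_notification and uses a for expression to build a map from each job id to its filter prefix.

```hcl
### ------------------- modules/variables.tf (hypothetical) ----------------

variable "notifications" {
  description = "Notification settings passed in from the root module."
  type = list(object({
    id      = string
    events  = list(string)
    sns_arn = string
    filter = object({
      prefix = string
    })
  }))
}

### ------------------- modules/main.tf (hypothetical) ---------------------

locals {
  # Map each job id to the prefix it filters on.
  notification_prefixes = {
    for n in var.notifications : n.id => n.filter.prefix
  }
}

output "notification_prefixes" {
  description = "Filter prefix per job id."
  value       = local.notification_prefixes
}
```

With the terraform.tfvars values shown earlier, this local value maps job1 to path/to/job1/ and job2 to path/to/job2/. The same iteration pattern could feed a dynamic block in a resource that consumes these settings.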

Conclusion

Knowing how Terraform data types work is important for building solid, reliable infrastructure with Terraform. Having the right skills under our belt as data engineers pays off: a data pipeline built on fully reproducible infrastructure is gold, and Terraform helps us get there, so that even if things go wrong in production we can recover pretty quickly. Variables and data types are the fundamental building blocks of a Terraform configuration.

…Thank you for reading.

Written by Isaac Omolayo