Streamline your data analytics with AWS Athena queries and Terraform
David💻
Posted on March 29, 2023
Part I of this article details how to streamline your data analytics with AWS Athena, using Terraform Infrastructure as Code.
Requirements
- AWS account
- Terraform
Walkthrough
Configure your AWS credentials by running aws configure.
At the root of our project, create a file named provider.tf, in which we specify the AWS provider. Later we will also declare a variable so that the table columns can be read dynamically.
#provider.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.59.0"
    }
  }
}
Next, let's define an S3 bucket in a file named s3.tf, which we will use as our data source.
#s3.tf
resource "aws_s3_bucket" "log_bucket" {
  bucket = "my-tf-log-bucket-source-athena"
}
Now we will define a file named athena.tf; in this file we will create an Athena workgroup, database, table, and named query.
For our aws_glue_catalog_table we expect the table to deserialize JSON objects.
Note: it is important that each JSON record sits on a single line (newline-delimited JSON); pretty-printed files with line breaks inside a record will not parse correctly. An alternative is LazySimpleSerDe, which deserializes records from a raw byte stream.
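Since the JSON SerDe expects one object per line, a quick way to minify a pretty-printed file before uploading is a short Python script (a minimal sketch; the file paths are just examples):

```python
import json


def minify_json_file(src_path, dst_path):
    """Rewrite a pretty-printed JSON file as a single compact line."""
    with open(src_path) as src:
        record = json.load(src)
    with open(dst_path, "w") as dst:
        # separators=(",", ":") strips all whitespace between tokens
        dst.write(json.dumps(record, separators=(",", ":")) + "\n")
```

Running this over data.json before the aws s3 cp step below ensures Athena can parse every record.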
#athena.tf
resource "aws_glue_catalog_database" "myservice_athena_db" {
  name = "myservice"
}

resource "aws_glue_catalog_table" "athena_table" {
  name          = "mytable"
  database_name = aws_glue_catalog_database.myservice_athena_db.name
  description   = "Table containing the results stored in S3 as source"
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.log_bucket.bucket}/mydata"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"

    ser_de_info {
      name                  = "s3-stream"
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"

      parameters = {
        "ignore.malformed.json" = "TRUE"
        "dots.in.keys"          = "FALSE"
        "case.insensitive"      = "TRUE"
        "mapping"               = "TRUE"
      }
    }

    dynamic "columns" {
      for_each = var.columns
      iterator = column

      content {
        name = column.value.name
        type = column.value.type
      }
    }
  }
}

resource "aws_athena_workgroup" "test" {
  name = "example123123s"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.log_bucket.bucket}/output/"
    }
  }
}

resource "aws_athena_named_query" "athena_query" {
  name      = "test_query"
  workgroup = aws_athena_workgroup.test.id
  database  = aws_glue_catalog_database.myservice_athena_db.name
  query     = "SELECT json_extract_scalar(person, '$.name') as name, product as source FROM myservice.mytable;"
}
Our JSON example should look like this (minified to a single line before uploading):
{
  "product": "Live JSON generator",
  "version": 3.1,
  "releaseDate": "2014-06-25T00:00:00.000Z",
  "demo": true,
  "person": {
    "id": 12345,
    "name": "John Doe",
    "phones": {
      "home": "800-123-4567",
      "mobile": "877-123-1234"
    },
    "email": [
      "jd@example.com",
      "jd@example.org"
    ],
    "dateOfBirth": "1980-01-02T00:00:00.000Z",
    "registered": true,
    "emergencyContacts": [
      {
        "name": "Jane Doe",
        "phone": "888-555-1212",
        "relationship": "spouse"
      },
      {
        "name": "Justin Doe",
        "phone": "877-123-1212",
        "relationship": "parent"
      }
    ]
  }
}
In variables.tf we declare the variable that holds our column definitions:
#variables.tf
variable "columns" {
  type = list(object({
    name = string,
    type = string,
  }))
  default     = []
  description = "The columns in the table; each entry holds the column name and its type"
}
#terraform.tfvars
columns = [
  {
    name = "product"
    type = "string"
  },
  {
    name = "version"
    type = "double"
  },
  {
    name = "releaseDate"
    type = "string"
  },
  {
    name = "demo"
    type = "boolean"
  },
  {
    name = "person"
    type = "string"
  }
]
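Rather than writing the columns list by hand, it could be generated from one sample record. Here is a minimal Python sketch; the type mapping is an assumption that covers only the scalar types in this example, and nested objects are mapped to string so they remain queryable with json_extract_scalar:

```python
import json

# Assumed mapping from Python JSON types to Hive/Athena column types.
HIVE_TYPES = {str: "string", bool: "boolean", int: "int", float: "double"}


def columns_from_record(record):
    """Infer a [{name, type}] column list from one parsed JSON record."""
    columns = []
    for key, value in record.items():
        # Nested objects and arrays fall back to "string", so the raw
        # JSON can still be queried with json_extract_scalar.
        hive_type = HIVE_TYPES.get(type(value), "string")
        columns.append({"name": key, "type": hive_type})
    return columns


sample = json.loads('{"product": "Live JSON generator", "version": 3.1, '
                    '"demo": true, "person": {"name": "John Doe"}}')
print(columns_from_record(sample))
```

The output can then be serialized into terraform.tfvars, which keeps the table schema in sync with the data.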
After finishing our files, let's use Terraform to create our resources:
terraform init
terraform plan
terraform apply
Upload our data.json file through the S3 CLI, or directly in the console:
aws s3 cp data.json s3://<<your_bucket_name>>/mydata/
Now, in the Athena service of our AWS account, we can see our database, table, and saved query; running the saved query should return the name and product values extracted from our sample record.
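For the sample record above, the extraction the named query performs (pulling name out of the person JSON string and aliasing product as source) can be simulated in Python. This is only an illustration of what json_extract_scalar does, not how Athena executes the query:

```python
import json


def run_query_locally(row):
    """Mimic: SELECT json_extract_scalar(person, '$.name') AS name,
              product AS source FROM mytable."""
    # person arrives as a raw JSON string because its column type is string
    person = json.loads(row["person"])
    return {"name": person.get("name"), "source": row["product"]}


row = {
    "product": "Live JSON generator",
    "person": '{"id": 12345, "name": "John Doe"}',
}
print(run_query_locally(row))  # {'name': 'John Doe', 'source': 'Live JSON generator'}
```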