AWS Cost Monitoring: Technologies, Challenges and Implementation

Discover how Groundfog optimizes AWS costs without compromising on quality! Dive into our journey with AWS Cost Explorer and the custom Cost Tracker tool, empowering development teams with monthly AWS service cost insights directly in Slack.

Ciprian Bodnărescu - March xx, 2024

Have you ever asked yourself how much your code costs your company? And what about its cost to customers or partners?

Development teams consistently strive to add value by refining requirements, implementing solutions, and ensuring quality in each step toward production. But what comes next?

At Groundfog, we recognize the need not only to develop cutting-edge technologies but also to optimize them in a way that aligns with business goals. We #re-think our solutions and optimize them further, minimizing costs without compromising quality.

There's a valuable tool called AWS Cost Explorer that reveals the cost evolution over months for each project, allowing you to group costs by services and use different tags for deeper insights. This way, you can check the costs only for the services used in local development environments, or only for those used in production.

In the upcoming sections, we will share our insights with you. We dive deeper into the details of automatically tracking AWS service costs monthly, what technologies we employed, and what challenges we faced.

Additionally, emojis will play an important part in making the whole solution even better.

AWS Cost Optimization with Cost Tracker

It is important to think ahead about how our solutions will perform in the long term. The production environment is not the only one we need to take into consideration: a developer will implement and test the solution locally using developer resources. Afterwards, depending on the company and development team, the code will reach the testing environments, like DEV, TEST, and UAT. Only then, after the quality gates are passed, will the new code be hosted by the production environment.

Here comes the Cost Tracker, a custom AWS tool whose purpose is to inform the development team about the cost overview directly in Slack. Once the team is aware of the costs, it can take one step closer to improving the provided solutions or cleaning up unneeded resources or environments.

One benefit of having an automated Cost Tracker is that it makes us aware of how much the solution costs, not only in production but in all other environments as well.

There is always a slim chance that an AWS Scheduler triggering an hourly Lambda function is forgotten in the local developer environment. Is it actually needed? Will we be aware of it after the quality gates are passed and the feature is released?

The Cost Tracker comes into play to solve that and helps us raise awareness about the actual costs and make more informed decisions as we go. At the beginning of each month, each team gets a Slack notification in the project-related channel about what the cost evolution looked like over the previous two months. The costs are grouped by each AWS service used, which makes it easier to understand what actually changed if there is an unexpected cost increase or decrease.

Solution overview

Prerequisites for AWS Cost Monitoring

Architecture Configuration

The architecture consists of multiple AWS services, as follows:

  1. An EventBridge Scheduler to invoke the Controller Lambda Function every month
  2. An S3 Bucket to keep the JSON file containing the project configurations
  3. SSM Parameter Store to keep the Slack Access Token as a SecureString
  4. The Costs Controller Lambda Function used to:
    • retrieve all the cost-project configurations from S3
    • invoke the Costs Retriever Lambda Function
    • grab the Slack Access Token from SSM Parameter Store in order to post the cost metrics to Slack channels
  5. Slack, where the cost messages are posted to the dedicated project channels
  6. The Costs Retriever Lambda Function used to call the Cost Explorer SDK
  7. Cost Explorer to get the costs for the previous two months for a given project configuration

Project Details Source

We need to take into consideration that there might be several teams in a company, and we don’t want to spam one Slack channel for all the projects, which is why we went for a configurable Slack channel for each project and its environments.

Let’s say we have Project A and B, where Team A and B would like to get these cost notifications for their environments on #projectA and #projectB Slack channels, respectively. For Project A, the team wants to have the costs aggregated for all the stages, whereas the team for Project B wants to have separate cost analyses aggregated for DEV and UAT on the one hand and for production on the other.

In order to achieve that, we need to configure all these mappings somewhere so that each Slack post reaches the correct channel. We went for #simplicity and created one S3 bucket with a single JSON S3 object, from which the Controller Lambda reads the configuration for each project. Anytime a project needs to be added, updated, or removed, the JSON file can easily be downloaded, modified, and uploaded back to S3.

After the bucket is created, you can easily upload the JSON file, which would look like the following:

{
  "project-a": {
    "slackChannelId": "<project-a-slack-channel-id>",
    "repository": "<project-a-repository-url>",
    "title": "Project A",
    "stages": ["dev", "uat", "prod"]
  },
  "project-b-testing": {
    "slackChannelId": "<project-b-slack-channel-id>",
    "repository": "<project-b-repository-url>",
    "title": "Project B",
    "stages": ["dev", "uat"]
  },
  "project-b-production": {
    "slackChannelId": "<project-b-slack-channel-id>",
    "repository": "<project-b-repository-url>",
    "title": "Project B",
    "stages": ["prod"]
  }
}

As seen in the previous structure, the same project can be configured multiple times in order to cover all the stages or environments your team wants to track, in case the cost overviews must end up in different channels or separate posts. The two most important fields here are the title and the stages. In order to let the Cost Tracker retrieve the costs for the projects, we applied two different tags to each AWS resource through Terraform.

As an example, for an AWS service that was created for Project A in the DEV environment, we will have the following resource tags:

Project: "Project A"
Stage: "dev"

Once the configuration file is there, the Controller Lambda can read it and iterate over each project:

import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

const s3Client = new S3Client()

export async function handler(event) {
    const projectDetailsObj = await downloadObject(s3Client, event.bucket, event.key)
    if (projectDetailsObj.Error) {
        throw new Error(`Could not download the project details object from Bucket: ${event.bucket}, Key: ${event.key}. ${projectDetailsObj.Error}`);
    }

    const projectsObjectsString = await projectDetailsObj.Response.Body.transformToString();
    const detailsByProjects = JSON.parse(projectsObjectsString)
    if (!detailsByProjects) {
        console.warn(`No projects configured in Bucket: ${event.bucket}, Key: ${event.key}`)
        return
    }

    const projectKeys = Object.keys(detailsByProjects)

    // e.g if analysis is needed for January and February 2024, then:
    // startDate Date object should point to 2024-01-01, and endDate to 2024-03-01
    const startDate = <set-the-needed-start-date>
    const endDate = <set-the-needed-end-date> // end date is not included in the report

    for (let projectKey of projectKeys) {
        try {
            const slackMessage = await getProjectMetricsMessage(detailsByProjects[projectKey], startDate, endDate)

            await postMessage(slackMessage)

            // wait for 1 second due to the existing rate limit of one message per second
            await sleep(1000)
        } catch (error) {
            console.error(`Could not process project '${projectKey}': ${error}`)
        }
    }
}

function sleep(ms) {
    // resolve after the given number of milliseconds (Slack allows one message per second)
    return new Promise((resolve) => setTimeout(resolve, ms))
}

async function downloadObject(client, bucket, key) {
    try {
        const params = {Bucket: bucket, Key: key};
        const response = await client.send(new GetObjectCommand(params));
        return {Response: response, Error: null};
    } catch (error) {
        return {Response: null, Error: error};
    }
}

Cost Retrieving with AWS Cost Explorer and Lambda Functions

Now the Controller Lambda has read the S3 object, and we know which projects our teams want to have tracked.

Next, we need to grab the costs for each project for the previous 2 months, so the Controller Lambda will invoke the Costs Retriever Lambda Function with a few parameters like the project title, the wanted stages, and the date range.

import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda'

const lambdaClient = new LambdaClient()

async function getProjectMetricsMessage(projectDetails, startDate, endDate) {
    const slackChannelId = projectDetails.slackChannelId
    const repository = projectDetails.repository
    const title = projectDetails.title
    const stages = projectDetails.stages

    if (!slackChannelId || !repository || !title || !stages || stages.length == 0) {
        throw new Error(`Missing at least one mandatory field: slackChannelId/repository/title/stages from ${JSON.stringify(projectDetails)}`);
    }

    const costsRetrieverFunctionArn = `arn:aws:lambda:<aws-region>:<aws-account-id>:function:<costs-retriever-function-name>`

    const lambdaPayload = {
        title: title,
        stages: stages,
        startDate: startDate,
        endDate: endDate
    }

    const costsRetrieverLambdaParams = {
        FunctionName: costsRetrieverFunctionArn,
        Payload: JSON.stringify(lambdaPayload)
    }

    const lambdaResponse = await lambdaClient.send(new InvokeCommand(costsRetrieverLambdaParams))
    if (!lambdaResponse.Payload) {
        throw new Error(`No Payload found in the Costs Retriever Lambda Response`);
    }

    const payloadString = Buffer.from(lambdaResponse.Payload).toString()
    const data = JSON.parse(payloadString);

    // return the message to be posted to Slack
}

Now that the Costs Retriever Lambda has been invoked, let’s see how we can grab the costs for the wanted project details. What comes in handy here is the AWS Cost Explorer SDK, which takes as input a few fields like granularity, metrics, filter, time period, and what we want to group the costs by.

import { CostExplorerClient, GetCostAndUsageCommand } from "@aws-sdk/client-cost-explorer"

let costExplorer = new CostExplorerClient()

export async function handler(event) {
    const input = {
        Granularity: "MONTHLY",
        Metrics: ["UnblendedCost"],
        GroupBy: [
            {
                Key: "SERVICE",
                Type: "DIMENSION"
            }
        ],
        TimePeriod: {
            Start: new Date(event.startDate).toISOString().split("T")[0],
            End: new Date(event.endDate).toISOString().split("T")[0] // end date is not included in the report
        },
        Filter: {
            And: [
                {"Tags": { "Key": "Project", "Values": [ event.title ] }},
                {"Tags": { "Key": "Stage", "Values": event.stages }}
            ]
        }
    }

    const data = await costExplorer.send(new GetCostAndUsageCommand(input))
    return data.ResultsByTime
}

In our case, we wanted to retrieve all the unblended costs grouped by each AWS service with a monthly granularity, with the Project tag set to the given project name and the Stage tag set to the configured stages of the project. The time period starts on the first day of the month from two months ago and ends on the first day of the current month.

Please note that the end date is not taken into consideration in the report. For this reason, to get the costs for January and February, the start date and end date would be set to "2024-01-01" and "2024-03-01", respectively.
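The start and end dates can then be derived from the current date. A minimal sketch, assuming the helper name getAnalysisRange (which is ours, not part of the tool):

```javascript
// Compute the [Start, End) range covering the previous two full months,
// formatted as YYYY-MM-DD for Cost Explorer. The end date is exclusive.
function getAnalysisRange(today) {
    const start = new Date(Date.UTC(today.getUTCFullYear(), today.getUTCMonth() - 2, 1))
    const end = new Date(Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), 1))
    const toDateString = (date) => date.toISOString().split("T")[0]
    return { Start: toDateString(start), End: toDateString(end) }
}
```

Running it on any day in March 2024 yields the Start "2024-01-01" and End "2024-03-01" from the example above; negative month values are handled by the Date constructor, so a January run correctly reaches back into the previous year.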

More details about the AWS Cost Explorer Cost and Usage command can be found in the AWS documentation here.

After we get the response, we are interested in the ResultsByTime array of objects, which contains all the services used in the selected date range and their corresponding costs. An example of such an array would be something like:

[
  {
    "Estimated": false,
    "Groups": [
      {
        "Keys": [
          "AWS Lambda"
        ],
        "Metrics": {
          "UnblendedCost": {
            "Amount": "0.5298580119",
            "Unit": "USD"
          }
        }
      },
      {
        "Keys": [
          "Amazon DynamoDB"
        ],
        "Metrics": {
          "UnblendedCost": {
            "Amount": "2.052373",
            "Unit": "USD"
          }
        }
      },
      ...
    ],
    "TimePeriod": {
      "End": "2024-01-01",
      "Start": "2023-12-01"
    },
    "Total": {}
  },
  {
    "Estimated": false,
    "Groups": [...],
    "TimePeriod": {
      "End": "2024-02-01",
      "Start": "2024-01-01"
    },
    "Total": {}
  }
]

Keep in mind that your team might have started to use an AWS service only in the second month of the analysis, which means it will not be found in the first time period. The same happens the other way around if the team stops using a specific service. Either way, a service can be present in one time period but missing from the other.
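To deal with that, each ResultsByTime entry can be flattened into a Map of service name to cost, and the service list can be taken as the union over both months. A minimal sketch, assuming the helper names toCostsByService and getAllServiceNames (which are ours):

```javascript
// Turn one ResultsByTime entry into a Map of service name -> cost.
function toCostsByService(resultByTime) {
    const costs = new Map()
    for (const group of resultByTime.Groups) {
        costs.set(group.Keys[0], Number(group.Metrics.UnblendedCost.Amount))
    }
    return costs
}

// Collect the union of service names seen in either month, sorted for a
// stable Slack message layout.
function getAllServiceNames(month1Costs, month2Costs) {
    return [...new Set([...month1Costs.keys(), ...month2Costs.keys()])].sort()
}
```

Such maps are what lastMonthCostsByServices and twoMonthsAgoCostsByServices refer to in the message-building code; a lookup for a service missing in one month returns undefined, so it should be defaulted to 0 before doing any math.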

Creating the Slack Messages

We have reached the state where the Controller Lambda got back the used services and their corresponding costs over two months from the Costs Retriever. Now it needs to calculate the evolution percentage between those months, and what better way to show the intensity than through emojis 😉.

In order to accomplish that, the code was split into smaller functions as follows:

function getSlackThreadMessageBlocks(projectTitle, lastMonthCostsByServices, twoMonthsAgoCostsByServices, serviceNames, safeCostDiffLimit) {
    const blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": `${projectTitle}`,
                "emoji": false
            }
        },
        ... // other sections
        {
            "type": "section",
            "fields": [
                {
                    "type": "plain_text",
                    "text": ":cloud: Service",
                    "emoji": true
                },
                {
                    "type": "plain_text",
                    "text": "Evolution :bar_chart:",
                    "emoji": true
                }
            ]
        }
    ]

    if (serviceNames.length == 0) {
        blocks.push({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": "No services found. :eyes:"
            }
        })
        blocks.push({"type": "divider"})

        return blocks
    }

    for (let serviceName of serviceNames) {
        // default to 0 in case the service was only used in one of the two months
        const month1Cost = twoMonthsAgoCostsByServices.get(serviceName) || 0
        const month2Cost = lastMonthCostsByServices.get(serviceName) || 0
        blocks.push(getEvolutionSection(serviceName, month1Cost, month2Cost, safeCostDiffLimit))
    }

    blocks.push(getTotalSection(serviceNames, lastMonthCostsByServices, twoMonthsAgoCostsByServices, safeCostDiffLimit))
    blocks.push({"type": "divider"})

    return blocks
}

function getTotalSection(serviceNames, lastMonthCostsByServices, twoMonthsAgoCostsByServices, safeCostDiffLimit) {
    let month1TotalCost = 0
    let month2TotalCost = 0

    for (let serviceName of serviceNames) {
        // default to 0 in case the service was only used in one of the two months
        month1TotalCost += twoMonthsAgoCostsByServices.get(serviceName) || 0
        month2TotalCost += lastMonthCostsByServices.get(serviceName) || 0
    }

    return getEvolutionSection('Total', month1TotalCost, month2TotalCost, safeCostDiffLimit)
}

Here, getSlackThreadMessageBlocks() is the main function that constructs the message. A few sections are appended in which the needed information, such as the project name, the stages, and the repository, is highlighted.

Then, before moving on to the service cost evolution, we need to make sure there is actually a service captured by our Tracker for the given project name. If no service is available, a simple section is added and the function returns right before iterating over the services list.

If services are found, the cost for each service is grabbed for both months, and the evolution section is built out of these values. The same is done for the total section, where all the costs are summed up and appended to the Slack message together with the corresponding emoji.

function getEvolutionSection(firstText, month1Cost, month2Cost, safeCostDiffLimit) {
    return {
        "type": "section",
        "fields": [
            {
                "type": "plain_text",
                "text": firstText,
                "emoji": false
            },
            {
                "type": "plain_text",
                "text": getEvolutionText(month1Cost, month2Cost, safeCostDiffLimit),
                "emoji": true
            }
        ]
    }
}

function getEvolutionText(month1Cost, month2Cost, safeCostDiffLimit) {
    let statusEmoji = ":arrow_upper_right:" // taken as default if month2Cost >= month1Cost
    if (month2Cost < month1Cost) {
        statusEmoji = ":arrow_lower_right:"
    }

    const diff = Math.abs(month1Cost - month2Cost)

    const evolutionAndEmoji = getEvolutionAndEmoji(month1Cost, month2Cost, diff, safeCostDiffLimit)
    return `$${month1Cost.toFixed(2)} -> $${month2Cost.toFixed(2)} (${statusEmoji} by ${evolutionAndEmoji})`
}

The getEvolutionText() function takes as arguments the cost for the first month, the cost for the second month, and a safe cost difference limit, which will be used later on.

Then, we choose the status emoji between ↗️ and ↘️ depending on the evolution and continue with retrieving the percentage value and its emoji using the getEvolutionAndEmoji() function.

function getEvolutionAndEmoji(month1Cost, month2Cost, diff, safeCostDiffLimit) {
    if (month1Cost == 0) {
        return `$${month2Cost.toFixed(2)} :rocket:`
    }

    const percentage = diff * 100 / month1Cost

    let emoji
    const sign = Math.sign(month2Cost - month1Cost)
    switch (sign) {
        case 1:
            emoji = getIncreasedCostEmoji(percentage, diff, safeCostDiffLimit)
            break;
        case -1:
            emoji = getDecreasedCostEmoji(percentage, diff, safeCostDiffLimit)
            break;
        default:
            emoji = ":zzz:" // sleeping emoji because there was no cost change
            break;
    }

    return `${percentage.toFixed(2)}% ${emoji}`
}

where the emojis are retrieved like the following:

function getIncreasedCostEmoji(percentage, diff, safeCostDiffLimit) {
    switch (true) {
        case (diff < safeCostDiffLimit || percentage < 10):
            return ":ok_hand:" // increased a little bit, but it's ok
        case (percentage < 25):
            return ":see_no_evil:"
        case (percentage < 50):
            return ":hot_face:"
        case (percentage < 100):
            return ":fire:"
        default:
            return ":skull_and_crossbones:"
    }
}

function getDecreasedCostEmoji(percentage, diff, safeCostDiffLimit) {
    switch (true) {
        case (diff < safeCostDiffLimit || percentage < 10):
            return ":thumbsup:"
        case (percentage < 25):
            return ":mechanical_arm:"
        case (percentage < 50):
            return ":woohoo:"
        case (percentage < 100):
            return ":tada:"
        default:
            return ":rock2:"
    }
}

We see here the usage of the safe cost limit, which was mentioned before. It is needed for those cases where we have costs below $1. Imagine that the cost of a certain service was $0.50 in the first month but increased to $20 in the second month. If we didn't use a limit, the increase percentage would be 3900%, which sounds like a lot, but the difference is only $19.50, so it might not be a reason to alert the team with a noisier emoji.
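To make the effect concrete, here is the example above run through the increase branch. The condensed if-chain mirrors the getIncreasedCostEmoji() shown earlier; the $25 limit is an arbitrary value of ours:

```javascript
// Condensed form of getIncreasedCostEmoji from above, applied to the
// $0.50 -> $20.00 example.
function getIncreasedCostEmoji(percentage, diff, safeCostDiffLimit) {
    if (diff < safeCostDiffLimit || percentage < 10) return ":ok_hand:"
    if (percentage < 25) return ":see_no_evil:"
    if (percentage < 50) return ":hot_face:"
    if (percentage < 100) return ":fire:"
    return ":skull_and_crossbones:"
}

const month1Cost = 0.5
const month2Cost = 20
const diff = month2Cost - month1Cost        // $19.50
const percentage = diff * 100 / month1Cost  // 3900%
```

With safeCostDiffLimit set to $25, the 3900% jump still maps to the calm :ok_hand:, whereas a limit of $0 would fall through to the :skull_and_crossbones: branch.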

Posting the Slack messages

The final step of the Controller Lambda is to post the Slack message to the configured Slack channel. To do that, we first need a Slack application: go to https://api.slack.com/apps and create a new app from scratch. The app needs to be installed in your Slack workspace; once that is done, you will get a Slack OAuth Token under your App -> Features -> OAuth & Permissions, right below OAuth Tokens for Your Workspace.

OAuth Token Location in Slack App

This token will be used inside the code to create the Slack client, which will have access to post the messages to the needed channels. Because it is sensitive information, it can be stored in AWS SSM Parameter Store as a SecureString.

The Controller Lambda grabs the token from the configured SSM Parameter and creates the Slack client that will be able to post the messages.

import { WebClient } from "@slack/web-api"
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm'

const ssmClient = new SSMClient()

async function postMessage(message) {
    const slackClientTokenSsmPath = process.env.SLACK_CLIENT_TOKEN_SSM_PATH || null
    if (!slackClientTokenSsmPath) {
        throw new Error(`Slack Client Token SSM Path is missing.`)
    }

    const slackTokenSsmParameter = await getSSMParameter(ssmClient, slackClientTokenSsmPath, true)
    const slackToken = slackTokenSsmParameter?.Parameter?.Value
    if (!slackToken) {
        throw new Error(`Slack Client Token is missing.`)
    }

    const slackClient = new WebClient(slackToken)
    await slackClient.chat.postMessage(message)
}

async function getSSMParameter(ssmClient, ssmParamPath, withDecryption) {
    const params = {
        Name: ssmParamPath,
        WithDecryption: withDecryption
    }
    try {
        return await ssmClient.send(new GetParameterCommand(params))
    } catch (error) {
        throw new Error(`Could not get SSM Parameter for '${ssmParamPath}': ${error.message}`)
    }
}

Last but not least, the team will be able to see the message inside Slack.

The message consists of the project name as a headline, followed by the repository link, the environments taken into consideration, the time period, and the cost evolution together with the corresponding emojis. The template can be changed as desired by using the Block Kit Builder tool from Slack.

Triggering the cost tracker

There is one more tiny resource to create so that the Cost Tracker is invoked automatically on a monthly basis.

For that, we created an EventBridge Scheduler that will invoke our Controller Lambda Function on the first working day of each month, and every team will be notified about how its projects went from a cost perspective.

In order to avoid weekends, though, we went for the following schedule expression so that the job is triggered only on the first weekday of the month at 11 a.m. CET.

cron(0 11 1W * ? *)

Fun fact: the Cost Tracker can also track itself. Of course, we need to know in the first place how much it actually costs to track the services used in other projects.

By checking the AWS Cost Explorer Dashboard directly in the AWS Console, filtered by the Cost Tracker Project tag, we can see that the cost inside the Controller account is less than $0.005 for a month in which no testing is involved. In January, for example, the cost increased a little because we added more projects from multiple AWS accounts and tested the new functionality.

Cost Tracker Costs over 6 months

As the average monthly cost shows $0.00, we can assume that it costs nothing to track all your projects' costs 🎉.

Overcoming challenges

Slack App Quota

The Slack documentation specifies that posting messages falls under a special Web API rate limit tier, which means the quota is one message per second. This slightly increases the time needed to process all of the projects, because a sleep() is required between message posts.

As long as you have fewer than roughly 900 projects to track, the Lambda Function will manage to send all of the messages before its maximum timeout of 15 minutes. Otherwise, a good approach would be a Step Function that synchronously invokes the Controller Lambda Function in batches.
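The batching itself is straightforward; a minimal sketch, assuming the helper name toBatches and an arbitrary batch size of ours:

```javascript
// Split the configured project keys into fixed-size batches, e.g. to feed a
// Step Functions Map state that invokes the Controller Lambda once per batch.
function toBatches(projectKeys, batchSize) {
    const batches = []
    for (let i = 0; i < projectKeys.length; i += batchSize) {
        batches.push(projectKeys.slice(i, i + batchSize))
    }
    return batches
}
```

A batch size of 60, for instance, would keep each invocation well under a minute of Slack posting time even with the one-second sleep between messages.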

Too much spam

If multiple projects are configured to have their cost evolution posted in the same channel, each project means one Slack post, and this can get ugly to the point that your team members have to scroll for a good while to reach the bottom.

To solve that, one approach would be to first create a parent post and retrieve its thread timestamp (thread_ts), which is then used for every subsequent message pointing to that channel.
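A sketch of that approach, with the posting call passed in as a function so it can be exercised without a real workspace (the helper name is ours; in the real code you would pass something like `(msg) => slackClient.chat.postMessage(msg)`, since Slack returns the parent message's timestamp as ts and threaded replies reference it via thread_ts):

```javascript
// Post one parent message per channel, then send each project's cost overview
// as a reply inside the parent's thread instead of as a separate channel post.
async function postProjectMessagesInThread(postMessage, channel, parentText, projectMessages) {
    const parent = await postMessage({ channel, text: parentText })
    for (const message of projectMessages) {
        await postMessage({ ...message, channel, thread_ts: parent.ts })
    }
    return parent.ts
}
```

This keeps the channel itself to one post per run, no matter how many projects report into it.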

Handling multiple AWS accounts

As you might already know, not all projects live in the same AWS account, which means we need to grab the service costs from each one of the accounts.

It would have been an option to deploy all the resources in each account, but this would have meant creating unnecessary duplicate resources and facing challenges like the Slack App quota when posting messages to channels from multiple places.

The Slack Message Output using the Thread

One approach would be to use the Terraform count meta-argument in order to decide whether or not to deploy to a certain AWS account and keep the controller resources only in the main account.

Conclusion

At Groundfog, we draw on a wealth of experience in delivering successful projects, making us a trusted partner for our customers. Our commitment goes beyond simply adopting new innovations. As highlighted in this article, our focus is on crafting tailor-made solutions that not merely optimize costs, but also empower businesses to achieve sustainable growth by developing the best solutions to their unique business needs.


Let's explore the full potential of the cloud together

Contact us today to discover how our tailored solutions can drive growth, scalability, and financial efficiency for your business.

REACH OUT TODAY