AWS Real-Time Budget Monitoring & IAM Access Control Automation

Omar Alsherbini

Cloud Infrastructure Architect
DevOps Engineer
Software Architect
Amazon EC2
AWS Lambda
Python
I designed and built a near-real-time AWS EC2 budget monitoring system, combined with automated IAM access control, to streamline resource allocation, improve security, and prevent budget overruns. The system targeted high-cost resources such as GPU-based EC2 instances, which were used mainly for data science tasks involving AI model training. One of my freelancing clients needed a solution that could monitor AWS resource consumption in near real time, send budget alerts, and automatically manage access to costly resources. This project cut the monitoring response time from 1-2 days (with AWS Budgets) to 2-5 minutes, a reduction of over 98%.

Project Background

The client was operating under a limited budget, and the development teams, especially the data science team, frequently required access to expensive EC2 GPU instances for AI model training. The challenge was to ensure that these resources were used only when necessary and that budget overruns were prevented. The initial use of AWS Budgets proved insufficient: its alerts could take up to two days to trigger, which left room for overspending before anyone was notified.

Solution Design

The solution involved:
AWS CloudWatch to monitor real-time metrics and trigger alarms.
AWS Lambda to handle custom budget monitoring and automatic control of EC2 resources.
AWS IAM to manage user permissions dynamically based on resource consumption.
AWS SNS for real-time alerts and notifications.
AWS SQS as a messaging queue system.
AWS EventBridge for scheduling periodic checks of resource usage.
Fig. 1 System Architecture Diagram

Key Components

1. Custom Budget Monitoring:

AWS CloudWatch was used to monitor resource usage in near real-time. By creating custom metrics, we could track the runtime of EC2 instances and calculate monthly usage, triggering appropriate actions when consumption reached specified thresholds.
CloudWatch Metrics: I defined a custom metric called runtime/month (in minutes) to track how long each EC2 instance had been running. In the code examples it is published as Runtime in the CustomEC2Metrics namespace.
Python Lambda Function: The Lambda function executed every 5 minutes (scheduled via AWS EventBridge) and incremented the runtime of each instance by 5 minutes if it was running.
Example: CloudWatch Metric Collection in Lambda (Python):
import boto3

cloudwatch = boto3.client('cloudwatch')
ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    # Paginate so that every instance is covered, not just the first page
    paginator = ec2.get_paginator('describe_instances')

    for page in paginator.paginate():
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']
                state = instance['State']['Name']

                # Check if instance is running
                if state == 'running':
                    # Record one 5-minute tick for the runtime/month metric.
                    # Each call adds a new data point; CloudWatch does not
                    # keep a running counter, so totals come from summing.
                    cloudwatch.put_metric_data(
                        Namespace='CustomEC2Metrics',
                        MetricData=[{
                            'MetricName': 'Runtime',
                            'Dimensions': [{'Name': 'InstanceId', 'Value': instance_id}],
                            'Unit': 'Count',  # CloudWatch has no 'Minutes' unit; the value counts minutes
                            'Value': 5  # 5 minutes of runtime for this interval
                        }]
                    )
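
Because each invocation publishes a separate 5-minute data point, the monthly figure has to be derived by summing those points. The following is a minimal sketch of that step, assuming data points shaped like the Datapoints list that CloudWatch's get_metric_statistics returns when Statistics=['Sum'] is requested. The function names and the sample data are illustrative assumptions, not part of the original system.

```python
from datetime import datetime, timezone


def month_start(now):
    """Return midnight on the first day of the month containing `now`."""
    return now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def month_to_date_minutes(datapoints, now):
    """Sum the 'Sum' field of every data point stamped in the current month.

    `datapoints` is assumed to look like the 'Datapoints' list returned by
    CloudWatch get_metric_statistics with Statistics=['Sum'].
    """
    start = month_start(now)
    return sum(
        point['Sum']
        for point in datapoints
        if start <= point['Timestamp'] <= now
    )


# Hypothetical sample: two ticks in March, one left over from February.
sample = [
    {'Timestamp': datetime(2024, 2, 29, 23, 55, tzinfo=timezone.utc), 'Sum': 5.0},
    {'Timestamp': datetime(2024, 3, 1, 0, 5, tzinfo=timezone.utc), 'Sum': 5.0},
    {'Timestamp': datetime(2024, 3, 1, 0, 10, tzinfo=timezone.utc), 'Sum': 5.0},
]
now = datetime(2024, 3, 1, 0, 12, tzinfo=timezone.utc)
print(month_to_date_minutes(sample, now))  # 10.0
```

A total computed this way restarts on its own at the month boundary, since older data points simply fall outside the window.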

2. Real-Time Alerts with AWS SNS and Alarms:

I set up three budget consumption thresholds at 50%, 80%, and 100% of the monthly runtime limit. When these thresholds were crossed, SNS was triggered to notify the data science team of their current EC2 usage via email.
Fig. 2 CloudWatch Alarms Flow Diagram
Example: CloudWatch Alarm Setup:
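def create_50_percent_alarm(instance_id):
    cloudwatch.put_metric_alarm(
        AlarmName=f'EC2Instance50Percent-{instance_id}',  # alarm names must be unique, so include the instance
        MetricName='Runtime',
        Namespace='CustomEC2Metrics',
        # A metric alarm needs a statistic. The alarm only aggregates data
        # within one Period, so a monthly threshold assumes the published
        # value is the month-to-date total, not a per-interval increment.
        Statistic='Maximum',
        Threshold=500,  # Example threshold: 500 minutes (50% of the budget)
        ComparisonOperator='GreaterThanOrEqualToThreshold',
        EvaluationPeriods=1,
        Period=300,
        ActionsEnabled=True,
        AlarmActions=[
            'arn:aws:sns:region:account-id:DataScienceTeamAlerts'  # SNS topic
        ],
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}]
    )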
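
The three alert levels can all be derived from a single monthly limit. The sketch below shows that arithmetic, assuming a limit of 1,000 minutes, which is the value implied by the example alarm (500 minutes = 50%). The function names are hypothetical.

```python
THRESHOLD_PERCENTS = (50, 80, 100)  # alert levels named in the text


def threshold_minutes(monthly_limit_minutes):
    """Map each percentage level to its runtime threshold in minutes."""
    return {pct: monthly_limit_minutes * pct / 100 for pct in THRESHOLD_PERCENTS}


def crossed_levels(runtime_minutes, monthly_limit_minutes):
    """Return the percentage levels that the current runtime has reached."""
    limits = threshold_minutes(monthly_limit_minutes)
    return [pct for pct in THRESHOLD_PERCENTS if runtime_minutes >= limits[pct]]


# 1,000 minutes is an assumed limit, implied by the example alarm above.
print(threshold_minutes(1000))    # {50: 500.0, 80: 800.0, 100: 1000.0}
print(crossed_levels(820, 1000))  # [50, 80]
```

Each value returned by threshold_minutes could serve as the Threshold of one alarm per level.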

3. Automated Access Control Using IAM:

For critical cost management, the system automatically revoked access to GPU-based EC2 instances when usage hit 100% of the allocated budget for the month. I implemented this by using Lambda functions that dynamically adjusted IAM policies for the data science team.
A Lambda function was triggered when the 100% alarm fired, stopping the EC2 instance and revoking access via IAM policy updates.
Another Lambda function reset the runtime metric and restored access at the start of the next month.
Example: IAM Policy Update to Revoke Access:
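import boto3

iam = boto3.client('iam')
ec2 = boto3.client('ec2')

def revoke_access(user_name):
    # Detach the managed policy that grants GPU instance access
    iam.detach_user_policy(
        UserName=user_name,
        PolicyArn='arn:aws:iam::account-id:policy/AllowGPUAccess'
    )

def stop_instance(instance_id):
    ec2.stop_instances(InstanceIds=[instance_id])

def lambda_handler(event, context):
    # Placeholder path: where the instance ID sits in the event depends on
    # how the alarm is delivered to this function (e.g. via SNS or EventBridge)
    instance_id = event['detail']['InstanceId']
    stop_instance(instance_id)
    revoke_access('DataScienceUser')  # Example user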
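
How the handler finds the instance ID depends on how the alarm reaches it. As an illustration, assuming the 100% alarm publishes to an SNS topic that invokes the Lambda function, the alarm details arrive as a JSON string under Records[0].Sns.Message, with the metric dimensions under Trigger.Dimensions. The helper name and the sample event below are hypothetical, and the key casing inside Dimensions is handled defensively because it is an assumption here.

```python
import json


def instance_id_from_sns_event(event):
    """Extract the InstanceId dimension from an SNS-delivered alarm event.

    Returns None if the message carries no InstanceId dimension.
    """
    message = json.loads(event['Records'][0]['Sns']['Message'])
    for dim in message.get('Trigger', {}).get('Dimensions', []):
        # Accept either key casing rather than rely on one.
        name = dim.get('name', dim.get('Name'))
        value = dim.get('value', dim.get('Value'))
        if name == 'InstanceId':
            return value
    return None


# Hypothetical, trimmed-down event for illustration only.
sample_event = {
    'Records': [{
        'Sns': {
            'Message': json.dumps({
                'AlarmName': 'EC2Instance100Percent',
                'NewStateValue': 'ALARM',
                'Trigger': {
                    'MetricName': 'Runtime',
                    'Namespace': 'CustomEC2Metrics',
                    'Dimensions': [{'name': 'InstanceId', 'value': 'i-0123456789abcdef0'}],
                },
            })
        }
    }]
}
print(instance_id_from_sns_event(sample_event))  # i-0123456789abcdef0
```

Returning None for a missing dimension lets the handler skip the stop and revoke steps instead of failing on a malformed message.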

4. Scheduled Monthly Reset with AWS EventBridge:

I used AWS EventBridge to schedule a Lambda function at the start of every month. This function reset the runtime/month custom metric in CloudWatch and restored IAM access to the data science team.
Fig. 3 IAM Policy and Lambda Interaction Diagram
Example: Reset Function (Python):
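def reset_runtime_metric(instance_id):
    # CloudWatch data points cannot be edited or deleted; this publishes a
    # new zero-valued point to mark the start of the month
    cloudwatch.put_metric_data(
        Namespace='CustomEC2Metrics',
        MetricData=[{
            'MetricName': 'Runtime',
            'Dimensions': [{'Name': 'InstanceId', 'Value': instance_id}],
            'Unit': 'Count',  # CloudWatch has no 'Minutes' unit; the value counts minutes
            'Value': 0  # Reset to zero
        }]
    )

def restore_access(user_name):
    # Re-attach the managed policy that grants GPU instance access
    iam.attach_user_policy(
        UserName=user_name,
        PolicyArn='arn:aws:iam::account-id:policy/AllowGPUAccess'
    )

def lambda_handler(event, context):
    reset_runtime_metric('instance-id')  # Placeholder instance ID
    restore_access('DataScienceUser')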
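
The system relies on two schedules: one every 5 minutes for runtime tracking and one at the start of each month for the reset. The sketch below builds the keyword arguments for EventBridge's put_rule call for both. The rule names and the helper function are hypothetical; the expressions follow EventBridge's rate and cron schedule syntax, which is evaluated in UTC.

```python
def schedule_rule_kwargs(name, expression, description):
    """Build keyword arguments for an EventBridge scheduled rule."""
    return {
        'Name': name,
        'ScheduleExpression': expression,
        'State': 'ENABLED',
        'Description': description,
    }


# Rule names are hypothetical; the expressions use EventBridge schedule syntax.
RUNTIME_TICK_RULE = schedule_rule_kwargs(
    'ec2-runtime-tick',
    'rate(5 minutes)',
    'Invoke the runtime-tracking Lambda every 5 minutes',
)
MONTHLY_RESET_RULE = schedule_rule_kwargs(
    'ec2-runtime-monthly-reset',
    'cron(0 0 1 * ? *)',  # 00:00 UTC on the first day of every month
    'Invoke the reset Lambda at the start of each month',
)

# With boto3 available, each dict could be passed as
#   boto3.client('events').put_rule(**RUNTIME_TICK_RULE)
# followed by put_targets to attach the Lambda function.
print(MONTHLY_RESET_RULE['ScheduleExpression'])  # cron(0 0 1 * ? *)
```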

Challenges and Solutions

1. Delayed AWS Budget Alerts:
AWS Budgets was initially used to monitor spending, but its 1-2 day alerting delay made it unsuitable for real-time cost management. By switching to CloudWatch custom metrics, alarms, and Lambda functions, I cut the response time to 2-5 minutes, a reduction of over 98%.
2. Dynamic Access Control:
Implementing dynamic access control based on resource consumption required careful management of IAM policies. By automating the modification of IAM policies using Lambda functions, we ensured that access to high-cost resources was efficiently managed without manual intervention.
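
One point of care with this approach: detaching an allow policy only removes access if no other policy grants the same actions. A possible alternative, which is not what the system described here used, is to attach an explicit deny while the budget is exhausted, since an explicit deny overrides any allow. The sketch below builds such a policy document; the GPU instance families, the statement ID, and the policy name in the comment are assumptions for illustration.

```python
import json

# Assumed GPU instance families; adjust to the types actually in use.
GPU_INSTANCE_PATTERNS = ['p3.*', 'p4d.*', 'g4dn.*', 'g5.*']


def gpu_deny_policy(patterns=GPU_INSTANCE_PATTERNS):
    """Build a policy document that explicitly denies launching or starting GPU instances."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Sid': 'DenyGpuInstancesOverBudget',
            'Effect': 'Deny',
            'Action': ['ec2:RunInstances', 'ec2:StartInstances'],
            'Resource': 'arn:aws:ec2:*:*:instance/*',
            'Condition': {'StringLike': {'ec2:InstanceType': list(patterns)}},
        }],
    }


policy_json = json.dumps(gpu_deny_policy(), indent=2)
print(policy_json)
# With boto3, the document could be applied as an inline policy, e.g.
#   iam.put_user_policy(UserName='DataScienceUser',
#                       PolicyName='DenyGpuOverBudget',  # hypothetical name
#                       PolicyDocument=policy_json)
# and removed again at the monthly reset with iam.delete_user_policy.
```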

Key Outcomes

Reduced budget monitoring response time by over 98% (from 1-2 days to 2-5 minutes).
Successfully implemented real-time budget monitoring and automatic access control for high-consumption EC2 instances.
Enabled 50%, 80%, and 100% consumption alerts through SNS notifications.
Automated IAM policy adjustments to manage access based on resource consumption.
Improved cost control for the data science team, preventing overspending on GPU-based EC2 instances.
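
The headline reduction can be sanity-checked from the two latency ranges quoted above (1-2 days before, 2-5 minutes after). The short calculation below pairs the ranges in the least and most favourable ways; it uses only those quoted figures, and the function name is illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def percent_reduction(before_minutes, after_minutes):
    """Relative reduction in latency, as a percentage of the original latency."""
    return (before_minutes - after_minutes) / before_minutes * 100


# Least favourable pairing: fastest old alert (1 day) vs slowest new alert (5 min).
worst_case = percent_reduction(1 * MINUTES_PER_DAY, 5)
# Most favourable pairing: slowest old alert (2 days) vs fastest new alert (2 min).
best_case = percent_reduction(2 * MINUTES_PER_DAY, 2)

print(f'{worst_case:.2f}% to {best_case:.2f}%')  # 99.65% to 99.93%
```

Even the least favourable pairing exceeds the 98% figure, so the stated reduction is a conservative one.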