We track our AWS spend closely, but we want to spend time on it only when a human needs to be involved. If everything is normal, or, more importantly, if the explanation for an abnormality is itself normal, we don't want to take a look. If the explanation is abnormal, then we want to take a look.
We built a spend agent to:
a. collect the daily spend from various angles (facets)
b. look at the spend and see if there are abnormalities
c. call tools to investigate the abnormalities
d. package everything up in one place and indicate whether we need to take a look because there is an unexplained abnormality
Step 1. Collect the daily spend from various angles (facets)
Since raw AWS spend data is too large for an LLM context, the agent uses the 'spend aggregation tool' to collect the spend data in various facets. The aggregation tool collects the spend data by account, region, operation, usage type, resource ID, service, and so on, and makes it available for the agent to use.
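As a rough illustration of what aggregating by facet means, here is a minimal sketch. It is not the actual aggregation tool: the field names (`date`, `cost`, `account`, and so on), the `aggregate_by_facets` helper, and the sample rows are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical facet names; the real tool's facets and field names may differ.
FACETS = ("account", "region", "service", "operation", "usage_type", "resource_id")


def aggregate_by_facets(line_items, facets=FACETS):
    """Sum cost per (facet value, day) so each facet becomes a small table."""
    totals = {facet: defaultdict(float) for facet in facets}
    for item in line_items:
        for facet in facets:
            # Rows that lack a facet are grouped under "unknown".
            key = (item.get(facet, "unknown"), item["date"])
            totals[facet][key] += item["cost"]
    return {facet: dict(values) for facet, values in totals.items()}


# Made-up rows for illustration only.
line_items = [
    {"date": "2024-05-01", "account": "prod", "region": "us-east-1", "service": "EC2", "cost": 120.0},
    {"date": "2024-05-01", "account": "prod", "region": "us-west-2", "service": "EC2", "cost": 80.0},
    {"date": "2024-05-01", "account": "dev", "region": "us-east-1", "service": "ECS", "cost": 40.0},
]
by_facet = aggregate_by_facets(line_items, facets=("account", "region", "service"))
print(by_facet["service"])
```

Each facet ends up as a compact table of daily totals, which is small enough to hand to an LLM as a CSV or to plot as a stacked graph.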
Step 2. Look at the spend and see if there are abnormalities
The spend agent then uses both vision models and text models to check whether there is an abnormality and which facet might be responsible for it. The agent supplies different stacked graphs and CSVs to the LLM, with specific prompts, to identify the cause at a high level.
The spend agent supplies graphs like the one below, along with the corresponding CSV data, to the LLM:
The LLM then determines whether there is something to investigate and what the investigation criteria are. Example output:
Observations:
- Key observations include:
- Amazon Elastic Compute Cloud (EC2) has the highest total spend of $1,669.088, with an average daily spend of $238.441.
- Amazon Elastic Container Service (ECS) follows closely with a total spend of $1,341.058 and an average daily spend of $335.264.
- Amazon ElastiCache also shows significant spending at $621.858.
Anomalies:
- Amazon Elastic Compute Cloud (EC2) Daily Spend
- Last Day Spend: $193.712
- Average Daily Spend: $238.441
- Anomaly: The last day spend is significantly lower than the average daily spend, indicating a potential drop in usage or an unexpected reduction in resource allocation.
Recommendations:
Recommendation: Investigate the EC2 instances running on the last day. Check CloudTrail on EC2 service for number of terminations and instance runs, or scaling activities that may have led to reduced costs.
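The comparison behind the EC2 anomaly above can be reproduced with simple arithmetic. In the agent the LLM makes this judgment; the sketch below is only a deterministic stand-in, and the 15% threshold is an assumed value chosen for illustration.

```python
def relative_deviation(last_day, average):
    """Return how far the last day is from the average, as a fraction."""
    return (last_day - average) / average


def is_anomalous(last_day, average, threshold=0.15):
    # threshold is an assumed value for illustration, not the agent's setting.
    return abs(relative_deviation(last_day, average)) >= threshold


# Figures taken from the EC2 example above.
deviation = relative_deviation(193.712, 238.441)
print(f"{deviation:.1%}")  # -18.8%
print(is_anomalous(193.712, 238.441))  # True
```

A check like this could serve as a cheap pre-filter or a sanity check on the LLM's answer, but it says nothing about why the spend moved.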
Step 3. Call Tools to Investigate
The spend agent then uses the recommendations to determine which tools it needs for the investigation. For example, for the recommendation above, it calls the CloudTrail tool with the specific eventNames that the recommendation points to.
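To make the recommendation-to-tool step concrete, here is a minimal sketch, not the agent's actual CloudTrail tool. The keyword mapping `EVENTS_BY_TOPIC` and the helpers `build_lookups` and `count_events` are hypothetical. `TerminateInstances` and `RunInstances` are real EC2 event names, and the request shape follows boto3's CloudTrail `lookup_events` call, which accepts a single lookup attribute per request.

```python
from datetime import datetime, timedelta, timezone

# Assumed mapping from a phrase in the recommendation to CloudTrail event names.
EVENTS_BY_TOPIC = {
    "terminations": ["TerminateInstances"],
    "instance runs": ["RunInstances"],
}


def build_lookups(recommendation, end, days=1):
    """Build one lookup_events request per event name the recommendation mentions."""
    start = end - timedelta(days=days)
    text = recommendation.lower()
    requests = []
    for topic, event_names in EVENTS_BY_TOPIC.items():
        if topic in text:
            for name in event_names:
                requests.append({
                    "LookupAttributes": [
                        {"AttributeKey": "EventName", "AttributeValue": name}
                    ],
                    "StartTime": start,
                    "EndTime": end,
                })
    return requests


def count_events(client, request):
    """Count matching events; client is expected to be a boto3 CloudTrail client."""
    paginator = client.get_paginator("lookup_events")
    return sum(len(page["Events"]) for page in paginator.paginate(**request))


recommendation = (
    "Check CloudTrail on EC2 service for number of terminations and instance runs"
)
lookups = build_lookups(recommendation, end=datetime(2024, 5, 2, tzinfo=timezone.utc))
print([r["LookupAttributes"][0]["AttributeValue"] for r in lookups])
```

With real credentials, `count_events(boto3.client("cloudtrail"), lookups[0])` would return the number of matching events in the window. In practice an LLM, rather than keyword matching, would choose the event names.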
The CloudTrail agent is equipped to return the data for this specific example; when it cannot, it writes code, executes it, and returns the result.
Many of the spend agent's calls go to CloudTrail, CloudWatch (for metrics), and flow logs. We cover those tools and their corresponding agents in separate blog posts.
The spend agent then determines whether it can "understand the abnormality", that is, whether it can say that this user, with this action, caused this spend. It collects the raw data and packages everything up (graphs, recommendations, CSVs, and any further recommended actions).
Step 4. Package everything up in one place and indicate whether we need to take a look because there is an unexplained abnormality
The result of the investigation goes into our Slack channel, and we only look when the agent asks us to take a look and take an action.
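The routing idea (always post the result, but only ask for attention when the abnormality is unexplained) could be sketched as follows. This is an assumption-laden illustration, not the agent's code: `build_slack_message` and `post_to_slack` are hypothetical helpers, the message wording is made up, and the webhook URL would be your own Slack incoming webhook.

```python
import json
import urllib.request


def build_slack_message(service, explained, summary):
    """Build an incoming-webhook payload; ask for attention only if unexplained."""
    status = "No action needed" if explained else "ACTION NEEDED: please take a look"
    return {"text": f"*{service} spend*: {status}\n{summary}"}


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON to a Slack incoming webhook (URL is yours to supply)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


message = build_slack_message(
    "EC2",
    explained=False,
    summary="Last day spend $193.712 vs. average $238.441; no matching CloudTrail activity found.",
)
print(message["text"])
```

Keeping the payload builder separate from the HTTP call makes it easy to check the wording without sending anything to Slack.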
It does pretty well at recommending actions for reservations, alternative services, resizing, and so on, but it still has a long way to go, and we plan to improve it.