Introduction
AWS implemented significant quota cuts for its Amazon Bedrock service in late 2024.
While these changes were (silently) announced, they were not communicated effectively and only became apparent when users started experiencing problems.
While my primary experience with this issue relates to Anthropic’s Claude 3.5 Sonnet in the Frankfurt region (eu-central-1), the adjustments appear to extend to other Bedrock models and regions as well.
This post explores what happened, the challenges it has created for my AWS partners and me, and what steps users can take to overcome these limitations.
The Problem
Quota Cuts Disrupt AWS Partner’s Client Demo
In early November, a Hungarian AWS partner reached out to us for advice:
The product is a web-based interface designed for enterprise clients; imagine it as a ChatGPT enhanced with features useful to enterprises and meeting typical large corporate requirements (like user management, data security, etc.). The GenAI service is built on Bedrock, typically using Claude 3.5. The idea is that each future client would run the service on their own account, so it wouldn’t be available in a SaaS model. They built a test system that worked perfectly, but at the first live client demo (a week ago), it started throwing errors, with the system indicating that there were too many requests. They investigated and found the account limits had been reduced to:
- 1 query per minute (down from 20 queries per minute)
- 4,000 tokens per minute (down from 300,000 tokens per minute)

Initially, they thought this was a bug, but based on responses from support, it seems this is by AWS design. They have tried multiple EU regions (Frankfurt, Dublin, etc.). Provisioned throughput is too expensive.
Quota Issues Stall Project for German AWS Partner
About two weeks later, a German AWS Advanced Partner contacted me with a similar problem:
There is an issue on one of our AWS accounts with Bedrock: they suddenly blocked the token usage on the models. Unfortunately, we are stuck in one of our projects. I saw you as a contact person on our end in the opened cases. Do you know the reason and how to solve it? They came up with a couple of questions, but I don’t know how to act/commit if there is a cost impact as well.
They had already opened an AWS support ticket, and the response from the AWS support engineer was:
Hello, We would need to collaborate with our service team for a limit increase of this size and before they would be able to assist, we would need you to answer the below questions:
- Model ID (share model ID from this list)
- Commitment Terms
- MUs to be purchased
- ETA for customer to take ownership of the model units once we provision them to their account
- Please make sure to take ownership of the implemented capacity within 7 days from the date of implementation. Capacity will be returned to the service account and reallocated if the units are not purchased within the aforementioned period.
Verifying the Problem in My Environment
I tested this issue in my AWS sandbox account using my Bedrock Claude Chat. Unfortunately, the chat bot stopped working, throwing these errors:
A few weeks ago, when it was set up, everything was working fine. However, the screenshots show the significant quota reductions:
The quota for On-demand InvokeModel requests per minute was suddenly reduced to 1 (from a default of 20), and the tokens-per-minute quota was reduced to 2,000 (from a default of 200,000). These reductions make it impossible to operate a chatbot effectively, even just for internal use or demos.
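Until quotas are restored, the only client-side mitigation is to retry with exponential backoff whenever a throttling error occurs. Below is a minimal sketch; `ThrottlingException` and `call_with_backoff` are my own stand-ins (the real exception surfaces through botocore as a `ClientError`), and with a quota of 1 request per minute this only softens the symptoms rather than fixing anything:

```python
import random
import time


class ThrottlingException(Exception):
    """Stand-in for the throttling error raised by Bedrock calls."""


def call_with_backoff(call, max_retries=4, base_delay=2.0):
    """Retry a Bedrock invocation with exponential backoff and full jitter.

    `call` is a zero-argument callable wrapping the actual API request.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ThrottlingException:
            if attempt == max_retries:
                raise
            # Exponential backoff with full jitter, capped at 60 seconds
            delay = min(60.0, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))


# Usage sketch (bedrock_runtime.converse(...) would be the real call):
# result = call_with_backoff(
#     lambda: bedrock_runtime.converse(modelId=model_id, messages=messages)
# )
```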
Normally, AWS quotas are adjustable, but as shown in the screenshot and confirmed in this documentation, these specific quotas cannot be changed. I also verified the same limits in my private AWS account, active for 13 years with consistent usage.
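You can also verify your effective quotas programmatically via the Service Quotas API. The sketch below keeps the filtering logic in a pure helper (`find_quota` is my own naming) and shows the assumed boto3 call in a comment; the sample entries mirror the reduced values I saw in my account:

```python
# The actual call would be something like (assumes boto3 and credentials):
#
#   import boto3
#   sq = boto3.client("service-quotas", region_name="eu-central-1")
#   quotas = [q for page in sq.get_paginator("list_service_quotas")
#                            .paginate(ServiceCode="bedrock")
#             for q in page["Quotas"]]


def find_quota(quotas, keyword):
    """Return (name, value, adjustable) for quotas whose name contains keyword."""
    return [
        (q["QuotaName"], q["Value"], q["Adjustable"])
        for q in quotas
        if keyword.lower() in q["QuotaName"].lower()
    ]


# Shape of a Service Quotas entry; values mirror the cuts described above:
sample = [
    {"QuotaName": "On-demand InvokeModel requests per minute for Anthropic Claude 3.5 Sonnet",
     "Value": 1.0, "Adjustable": False},
    {"QuotaName": "On-demand InvokeModel tokens per minute for Anthropic Claude 3.5 Sonnet",
     "Value": 2000.0, "Adjustable": False},
]

for name, value, adjustable in find_quota(sample, "InvokeModel"):
    print(f"{name}: {value} (adjustable: {adjustable})")
```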
What is Going On?
Current Situation
The recent quota reductions by AWS seem to be a direct reaction to the growing demand for Generative AI services, particularly in heavily used regions like Frankfurt (eu-central-1).
As more businesses incorporate AI capabilities into their applications, the popularity of high-end models like Anthropic Claude 3.5 Sonnet has surged, straining AWS’s infrastructure.
This situation presents a tough challenge: while AWS continues to highlight Amazon Bedrock as a cornerstone for AI development, the strict quota limits make it extremely difficult for developers to effectively test or showcase their solutions. These constraints have caused considerable frustration, particularly among AWS partners and AWS distributors, where proof-of-concepts and customer demos are crucial to success.
But what happened?
Email from AWS
I found an old email from July 9, 2024 in my archives:
Communication Issues
AWS announced these changes on July 9, 2024, but the communication left much to be desired:
- Recipients: The announcement only reached the root email address of each AWS account, which is often not accessible to the technical audience. The operations contact (if one is set at all, which in my experience is rarely the case) was only in copy. As a result, the email never reached many AI developers or administrators at AWS partners, since in the typically used (ECAM) reselling model the root email address belongs to the customer.
- Delayed Implementation: While the email stated that the changes would take effect one day after the announcement, the actual implementation in Frankfurt accounts began with a four-month delay. Even users who received the email had often forgotten about it by then.
- Lack of Clarity on Account Scoring: The email mentioned payment history and fraudulent usage, causing confusion for users without issues in those areas.
Lack of Clarity on Account Scoring
One of our partners investigated this deeper with the AWS Support and discovered that the quota changes were influenced by an internal AWS account scoring system (referred to as the C-score). This score, which is not publicly documented, reportedly takes into account:
- Payment history
- Fraudulent activity
- Monthly recurring revenue (MRR)
- The age and activity of the AWS account

New or fresh accounts, such as those created under AWS Organizations, typically start with a low score, which may explain the stricter quotas. The same applies to sandbox and developer accounts, which typically do not drive much consumption. None of this information, however, was included in the communication, leaving customers in the dark about how their quotas were determined or how they might improve them.
Following the suggested AWS Way
Contacting AWS Account Manager
Frankly speaking, I assume 90% of the AWS accounts that received this email do not even have an AWS account manager (AM), nor know who this person might be. Furthermore, AWS AMs are only (end-)customer facing. What about the AWS partners? Why was the Partner Development Manager (PDM) not mentioned?
Opening an AWS Support Case
In the email, AWS recommends opening a support case. From my point of view, this is not logical: quotas are typically handled in the Service Quotas area of the AWS account itself. And if the documentation clearly marks them as “Not adjustable” (see screenshot above), why are they adjustable via AWS Support? This is not straightforward and causes confusion for AWS users.
Reserved Capacity Recommendations
In all scenarios I have described, these are just customer or internal demos / proofs of concept. Even though an AWS support ticket was opened, AWS Support pushed for switching the model inference type from On-Demand to Provisioned Throughput. Unfortunately, Provisioned Throughput is very expensive and not suitable for the described use case.
Support Case with Basic Support
In newly provisioned non-productive and sandbox accounts, AWS Support contracts are often not purchased, resulting in only Basic Support being available. AWS allows support cases for technical topics under Basic Support, but this option is not widely known. Both partners I worked with were unaware of this. (See in section “Opening a Support Ticket” below.)
Suggested Way to Move Forward
The following are my personal recommendations based on my experience and knowledge.
Driving Consumption
Since the AWS C-score takes consumption into account, driving consumption can improve your quota limits. Launching a low-cost EC2 instance (e.g., t3.micro) for a few days can help.
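To put “driving consumption” into perspective, the cost is negligible. A back-of-the-envelope calculation (the hourly price below is an assumption for illustration; check the current EC2 on-demand pricing page):

```python
# Rough cost of running a t3.micro for a few days just to generate
# consumption. The hourly price is an assumption, not an official AWS
# price; look up the current on-demand rate for your region.
HOURLY_PRICE_USD = 0.012  # assumed t3.micro on-demand price per hour
DAYS = 5

total_cost = HOURLY_PRICE_USD * 24 * DAYS
print(f"Approximate cost: ${total_cost:.2f} for {DAYS} days")
```

A few dollars at most, which is cheap compared to a failed customer demo.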
Avoid Using New Accounts
New AWS accounts often have low scores. Whenever possible, use established accounts for demos or internal use to avoid drastic quota reductions. Development is often done by an AWS partner, and deploying these demos to brand-new customer AWS accounts is exactly what causes these problems. Try to develop directly in the customer’s account from the start, or run the entire demo from the AWS partner’s own AWS account, but don’t switch. If a switch is required, drive consumption first.
Re-think the Required AWS Region
Demand is higher in some AWS regions, and Frankfurt in particular seems to still have low hardware capacity, so the cuts are greatest there.
If this is for development, internal use, a demo, or even a proof of concept, consider switching to the North Virginia AWS region (us-east-1). The quota cuts there were smaller; in fact, my AWS accounts there have had no cuts at all.
Opening a Support Ticket for Quota Increase
Even with just Basic Support, you can open a ticket for Bedrock quota adjustments. Under Account and billing -> Account -> Other Account Issues -> General question, I used the following:
Subject: Bedrock quota for AWS member-account 123456789012 … This account is crucial for a customer demo of an AI chatbot utilizing Claude 3.5 Sonnet in Frankfurt. The chatbot is functional but repeatedly encounters application errors such as: “Failed to run stream handler: An error occurred (ThrottlingException) when calling the ConverseStream operation (reached max retries: 4): Too many tokens per minute, please wait before trying again.”
Upon reviewing the quotas, I discovered that my account’s quotas are significantly lower than the AWS default values:
- InvokeModel requests per minute: 1 (my account) vs. 20 (AWS default)
- Tokens per minute: 2,000 (my account) vs. 200,000 (AWS default)

With these constraints (2,000 tokens per minute and only 1 request per minute), I am unable to demonstrate the chatbot.
My AWS organization drives significant AWS consumption, and my sandbox account has a long-standing history with no issues such as unpaid invoices or fraudulent activity. As such, I kindly request that the quotas for my account be adjusted to align with the AWS default values:
- On-demand InvokeModel requests per minute for Anthropic Claude 3.5 Sonnet: 20
- On-demand InvokeModel tokens per minute for Anthropic Claude 3.5 Sonnet: 200,000
- Region: eu-central-1
This adjustment will enable me to proceed with the customer demo effectively. Thank you in advance for your assistance, and please let me know if further information is required. …
Two days later, I had a reply:
…
For a limit increase of this type, I will need to collaborate with our Service Team to get approval. Please note that it can take some time for the Service Team to review your request.
I will hold on to your case while they investigate and will update you as soon as they respond. …

I quote this detail because the Service Team reviews the C-score of your account: if it is still low, they won’t approve the request!
Five days later, I got this:
… Thank you for your patience while we were working with the internal team.
We have now received an update, the team approved your quota increase request as per the below specs:
- 123456789012 with limit-name: max-invoke-tokens-on-demand-claude-3-5-sonnet-20240620-v1 is 200000.0
- 123456789012 with limit-name: max-invoke-rpm-on-demand-claude-3-5-sonnet-20240620-v1 is 20.0
Please feel free to reply back on the case if you need any assistance and we will be more than happy to help. …
After around two weeks in total, my AWS account had its default quotas restored. I never thought I would be happy with the defaults. ;-) With 20 requests per minute and 200,000 tokens per minute, the chatbot is usable again for non-production, internal (team) purposes.
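With the default quotas back, it is worth staying under them proactively instead of relying on retries. Below is a sketch of a client-side sliding-window limiter (my own construction, not an AWS feature): call `try_acquire` with an estimated token count before each Bedrock invocation, and wait briefly if it returns `False`.

```python
import collections
import time


class MinuteRateLimiter:
    """Client-side limiter: stay under N requests and M tokens per minute.

    Tracks a sliding 60-second window of (timestamp, tokens) events.
    The `now` parameter is injectable to make the class testable.
    """

    def __init__(self, max_requests=20, max_tokens=200_000, now=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.now = now
        self.events = collections.deque()  # (timestamp, token_count)

    def _prune(self):
        # Drop events older than 60 seconds
        cutoff = self.now() - 60.0
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def try_acquire(self, tokens):
        """Return True if a request spending `tokens` fits in the window."""
        self._prune()
        used = sum(t for _, t in self.events)
        if len(self.events) >= self.max_requests or used + tokens > self.max_tokens:
            return False
        self.events.append((self.now(), tokens))
        return True
```

In practice you would estimate the token count from the prompt length plus the expected completion size, which keeps the chatbot comfortably under the 20 rpm / 200,000 tpm defaults.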
Conclusion
Amazon Bedrock’s recent quota reductions have created significant challenges, particularly in high-demand regions like Frankfurt. These changes have disrupted demos, proof-of-concepts, and internal projects of AWS partners and distributors, leading to considerable frustration.
Recommendations to AWS ✨
While these adjustments may have been driven by capacity constraints, the way they were introduced and communicated leaves room for improvement. It doesn’t make sense to me why some quotas can’t be adjusted through the standard process and instead require a support ticket. This approach is unclear and confusing. At the same time, AWS must increase transparency and work closely with its partners to maintain trust and ensure a smooth experience for its customers.
- Clear Communication: Future changes should be announced via an official AWS blog post, accompanied by detailed emails. AWS also overlooked the impact on AWS partners who handle demos but rely on customer-owned accounts. Provide a firm timeline for implementation to ensure all users are well informed and prepared.
- Transparency: Provide more transparency on how quota reductions are calculated, including details that allow individual users to estimate what their target quotas will be.
- Harmonized Quota Adjustment Process: Where possible, allow quota adjustments through the standard Service Quotas interface. If a support ticket is required, clearly explain the justification process and the requirements for approval.
- Smaller Reductions: The drastic 95% quota reductions, even for existing accounts actively using Amazon Bedrock, are very disruptive. Consider less drastic changes, or limit the changes to new AWS accounts or accounts that have never accessed the affected foundation model (FM).
- Support for Partners: Acknowledge the important role of AWS Partners and AWS Distributors and consider the impact on them when implementing such changes. Advance notice could also be included in the regular Amazon Partner Network (APN) newsletter or in a PartnerCast.
By addressing these areas, AWS can enhance the experience for its users and partners, rebuild trust, and better support its growing community.