In June 2026, Anthropic abruptly disabled access to its most capable models, Fable 5 and Mythos 5, for every customer, after a US export-control directive barred foreign nationals from using them. Not a throttle. Not a deprecation notice with a migration window. A frontier model that worked on Friday was simply gone, for reasons that had nothing to do with uptime.
That is the uncomfortable part of putting an LLM-powered agent into production: you inherit a dependency that never shows up on the architecture diagram. Someone else’s model has to stay reachable for your system to do anything at all. Usually the ways it fails are mundane. The model gets throttled the week you need it most, or deprecated on the vendor’s timeline rather than yours, or it simply has a bad afternoon while your incident volume spikes. June was the reminder that it can also be a regulator drawing a line. Either way, the day it happens, your “autonomous” system is exactly as autonomous as a 500 error.
We had been building an operations agent that investigates tickets and reasons over live infrastructure, and we had already decided not to hard-wire it to a single model. The June suspension is what turned that from a prudent hedge into an obvious one, and it is why we are writing this up now. The agent is genuinely useful, which is the problem: the more people rely on it, the less acceptable “the model is unavailable right now” becomes as an answer. This post is how we think about that availability problem, why we landed on Amazon Bedrock as the foundation, and (more importantly) the things we got wrong on the way, because the interesting lessons are in the pitfalls, not the happy path.
Two ways to stop betting the business on one model
There are really only two structural answers to “what happens when my model is gone”:
- Configure fallbacks. Have more than one model the agent can run on, and fail over when the primary is unavailable.
- Run your own model. Remove the third-party API from the critical path entirely, so availability is something you control rather than something you subscribe to.
Both are sound. Both are also more subtle than they sound, and the subtlety is the whole point of this article. Neither is “set a second model ID and you’re done.”
The reason we built on Amazon Bedrock is that it makes both strategies reachable from one place. Through a single AWS IAM surface, one set of credentials, one regional endpoint, and one billing path, Bedrock gives you access to a large catalogue of foundation models from multiple providers. Via Amazon Bedrock Custom Model Import you can bring your own open-weight models, and Amazon Bedrock Marketplace adds a catalogue of others to deploy. You can layer Amazon Bedrock Guardrails across all of them as a provider-independent safety control, and you keep your data inside your chosen AWS Region. For a European services partner like us, whose customers are specific about where their data is processed, Region control often settles the whole approach before model quality even enters the conversation.
So far, so much like a Bedrock landing page. Here is where it got real for us.
Pitfall 1: “swap the model ID” is an access abstraction, not an application one
The single most useful mental model we developed is this distinction. Bedrock abstracts access and transport: one credential, one API, one bill, to reach many models. It does not abstract the request and response shape your application actually depends on.
When you start, you almost certainly reach for a provider’s own SDK or the request format of whichever model you adopted first. That code grows roots. Your agent loop parses the specific content-block shape that model returns: the way tool calls are represented, the field names, the IDs. Your retry logic, your streaming handling, your token accounting all quietly assume one family’s conventions. The moment you point that code at a different model family, the transport works perfectly and the parsing falls apart, because you didn’t insert a model-agnostic layer. You inserted one-vendor-on-Bedrock, which borrows Bedrock’s plumbing but speaks only one dialect.
Amazon Bedrock does offer a genuinely cross-model interface, the Converse API, which normalises messages and tool configuration into a common shape across many models. Most current foundation models on Bedrock are reachable through it, so adopting Converse from day one avoids a lot of this pain. The catch is at the edges: you trade away some of the richer, provider-specific surface, and a few models still expose their newest capabilities only through their own API shapes. A fully general agent can therefore end up maintaining more than one transport path anyway, which is exactly the situation the next point is about.
The lesson that generalises: build a thin internal seam early. Define one interface that your agent talks to, pick one canonical shape for messages and tool calls, and put each model family behind an adapter that translates its wire format to and from that shape. Concretely, the boundary is about this small:
# one interface the agent depends on; one adapter per model family behind it
class ModelClient(Protocol):
async def respond(self, messages: list[Block], tools: list[Tool]) -> list[Block]: ...
# AnthropicClient, ConverseClient, ... each satisfy this and own the
# translation between their provider's wire format and the canonical Block.
def make_client(model_id: str) -> ModelClient: ... # fail-closed: unknown id raises
Your dispatch loop only ever sees the canonical block, never a vendor payload. Make the factory fail-closed: an unrecognised model ID should raise, never silently route to a “best guess.” Fallback is only safe if the fallback path is one you’ve explicitly built and tested, not one your code stumbles into.
This is also why a framework like LangChain or LiteLLM is not a shortcut past the problem. It hands you a ready-made version of that seam across providers, which is genuinely useful, but it abstracts the wire format, not the behaviour. The per-model prompt sensitivity, the tool-calling quirks, the caching mechanics: those leak straight through any unified interface, yours or off-the-shelf.
Pitfall 2: models are not drop-in equivalents, even at equal “quality”
The second hard lesson is that two models can both be excellent and still not be interchangeable in your harness.
Concrete example: provider-specific prompt caching. The cost and latency model of an agent that re-sends a large tool-and-context preamble every turn depends heavily on prompt caching, and the way you mark cacheable spans (and even how many cache breakpoints you are allowed) is vendor-specific. Switch families and your carefully tuned caching strategy simply does not apply; your costs and latency move, sometimes sharply.
Another: model-version quirks. We found a specific model version would occasionally malform its tool calls in a way we had to detect and repair in the dispatch loop. That repair is correct for that version and meaningless for every other model. Tool-calling reliability, instruction-following under pressure, willingness to say “I don’t know” instead of fabricating: these vary enormously between models and are exactly the behaviours an agent lives or dies by.
So “best way to use a model” is real, and it is per-model: the prompt that gets the best out of one model is not the prompt that gets the best out of another, and the safety posture that one model respects, another ignores. A fallback model isn’t a spare tyre of the same size; it’s a different vehicle that happens to drive on the same roads. Treat the migration to it as a real piece of engineering, scoped and tested ahead of time, so that on the day you actually need it you are flipping a switch you have already proven.
Pitfall 3: a fallback ladder is also a cost ladder
The third consideration bites only after you have shipped: the model you fail over to has a different price, and the Region you are obliged to run it in has a different price again. On Bedrock these are two separate effects worth keeping straight. Token pricing varies widely by family, often by an order of magnitude between a frontier model and a lighter open-weight one, which is visible on the public Bedrock pricing page. Separately, Cross-Region inference itself does not add a surcharge: a request is billed at the inference profile’s published rate, which for current AWS profiles matches the on-demand rate of that profile’s primary Region. What moves the number is which profile you are obliged to use. The EU inference profile that keeps data in-region can sit above the cheapest on-demand Region for the same model. In our own cost modelling we carry roughly a 10% uplift on the EU inference-profile routes for our primary family against the equivalent US on-demand rate, and we treat that as the standing price of residency.
So a fallback ladder is also a cost ladder, and the two do not move together. Failing over to a cheaper open-weight model can save money while costing you quality; failing over to a residency-compliant route can cost more for the same model. Work both deltas out in advance, so a failover event doesn’t arrive as a billing surprise stacked on top of an incident.
Where Bedrock genuinely shines: cheap, isolated, side-by-side evaluation
Here is the flip side of all that subtlety: because every model lives behind the same Bedrock access surface, comparing them becomes an infrastructure problem you already know how to solve, not a procurement project per vendor.
We stood up a second, isolated runtime (same agent code, separate deployment, separate logs, separate metrics namespace) whose only job is to run candidate models against hard, representative tasks without touching production. A few design choices made this evaluation trustworthy, and they generalise well:
- Isolate it at the infrastructure level, not by convention. A separate runtime, image tag, log group and metrics namespace mean eval traffic can never pollute production dashboards or alerts, and a candidate model can never accidentally take a real action. Make the isolation fail-closed: if the eval deployment is missing its explicit configuration, it should refuse to deploy rather than fall back to production settings.
- Grade blind. If the evaluation environment can read the “right answer” (a human’s resolution notes, a linked root-cause record), a weaker model can look strong by quietly reading the answer key. Strip those inputs so you are measuring reasoning, not retrieval of the solution.
- Run a harness-fit probe before you blame the model. When a candidate underperforms, the natural reaction is “our prompt isn’t tuned for it.” So test that hypothesis directly: harden the prompt specifically for the candidate and re-measure. Our most valuable single finding came from this: the gap between our primary model and the alternatives was mostly model-intrinsic, not a prompt artefact. That told us the seam was worth keeping for break-glass resilience, but that switching the default wasn’t justified yet. You only learn that by measuring.
A safety note that bears repeating, because it surprised us: a “dry-run” flag that suppresses one kind of side effect doesn’t suppress all of them. In our case, suppressing the agent’s writes did not suppress its reads against live infrastructure. If a candidate model can call tools, those calls execute for real during evaluation. The durable backstop is least-privilege, read-only credentials at the boundary, not a flag in your application code. Defence in depth applies to your evaluation environment too.
Running your own model: removing the API from the critical path
Configuring fallbacks hedges against one model being unavailable. Running your own hedges against a different risk: not wanting your core workflow to depend on a third-party inference API at all, whether for sovereignty, predictable capacity, or a model fine-tuned on your own domain. The point of doing it on Bedrock is that the operational surface barely changes when the weights become yours: the same IAM controls and the same API, with Guardrails layered on where the model architecture supports them. Amazon Bedrock Custom Model Import brings supported open-weight architectures behind that surface; Amazon Bedrock Marketplace widens the catalogue; and Provisioned Throughput reserves dedicated capacity for steady, latency-sensitive load. Because the surface stays the same, a single seam can mix managed and self-hosted models on the same ladder.
The honest caveats are real but different from classic self-hosting. With Custom Model Import the serving and autoscaling stay AWS-managed (billed by Custom Model Units, with cold-start latency on an idle model), so what you take on is the cost model, the supported-architecture limits, and a quality bar an open-weight model may not clear for your task, not server ops. You only own capacity planning and scaling if you go all the way to your own Amazon SageMaker or EC2 endpoints, outside Bedrock. For most teams the right posture is a hybrid:
- Primary: a strong managed model. Your default. The one you have evaluated hardest and trust unattended.
- Fallback: a tested alternative, break-glass. Already proven through the seam on a normal day, not discovered during an outage.
- Self-hosted: for workloads where control wins. Reserved for cases where sovereignty or capacity genuinely outweighs the operational cost.
What we’d tell our past selves
- Build the seam before you need it. One internal interface, one adapter per model family, fail-closed routing. Retrofitting this under outage pressure is miserable.
- Treat “switch to the fallback” as engineering, not configuration. Prompts, caching, tool-calling quirks, and safety posture are all per-model. Prove the fallback works on a normal day.
- Default to your best model; keep the alternative warm. The point of the seam often isn’t to leave your strongest model; it’s resilience and the option to re-evaluate as the field moves.
- Make evaluation a first-class, isolated environment. Blind grading and a harness-fit probe will tell you whether your problem is the model or your prompt, saving you from both over-engineering and false economy.
- Put the real safety control at the boundary. Read-only, least-privilege credentials and Guardrails protect you regardless of which model is behind the seam, including during evaluation.
Those five are tactics. The shift underneath them is the real payoff. Amazon Bedrock did not make the model-specific subtlety disappear, and nothing will. What it changed is where the subtlety lives: behind one access surface, one security model, and one bill that we own, instead of scattered across vendor relationships we could only hope held. “Keep the agent running when a model goes dark” stopped being a procurement question and became an architecture decision.
Written at Cloudar, an AWS Premier Tier Services Partner. The lessons here come from production experience building AI-assisted operations tooling on AWS.
