Amazon's AZ3 chips: AI inference moves onto the device

Amazon just confirmed something that changes where AI actually runs: it designs its own silicon, end to end, for the Echo and Fire TV devices already sitting in millions of homes, and the newest chips run language models on the device itself. For anyone building customer-facing tech, the story isn't the smart speaker. It's that inference is moving off the cloud and onto the edge.

What actually happened

On July 2, Amazon's head of devices told CNBC the company is designing custom AI chips for Echo, Fire TV, and future devices including Kindle. The silicon has names now: Amazon's own announcement details the AZ3 and AZ3 Pro, built for its new Echo lineup.

The AZ3 packs an AI accelerator "designed to run AI edge models of the future" and improves wake-word detection by over 50%. The AZ3 Pro — inside the Echo Studio, Echo Show 8, and Echo Show 11 — goes further, adding "support for state-of-the-art language models and vision transformers" running locally. A sensor platform called Omnisense fuses camera, audio, ultrasound, and Wi-Fi radar on-device so Alexa+ can recognize a specific person and act, without every frame going to a datacenter.

Why it matters for your business

Cloud inference has three taxes: latency, per-call cost, and the fact that your data leaves the building. On-device inference erases all three for the workloads that fit. Amazon isn't doing this to be nice — it's doing it because sending every wake-word and camera frame to the cloud is slow and expensive at 40-million-device scale. The same math applies at your scale, just smaller.
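To make "the same math" concrete, here is a minimal break-even sketch covering only the per-call cost tax. Every figure in it (call volume, cloud price per call, hardware cost, daily running cost) is a made-up placeholder, not Amazon's numbers or any vendor's real pricing, and the `Workload` and `break_even_days` names are invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    calls_per_day: int              # inference requests per day (hypothetical)
    cloud_cost_per_call: float      # USD per cloud call (hypothetical)
    edge_hardware_cost: float       # one-time USD for the local device (hypothetical)
    edge_daily_running_cost: float  # USD per day for power and upkeep (hypothetical)


def break_even_days(w: Workload) -> Optional[float]:
    """Days until avoided cloud fees cover the edge hardware outlay.

    Returns None when the cloud is cheaper per day, so the device never pays for itself.
    """
    daily_cloud_cost = w.calls_per_day * w.cloud_cost_per_call
    daily_saving = daily_cloud_cost - w.edge_daily_running_cost
    if daily_saving <= 0:
        return None
    return w.edge_hardware_cost / daily_saving


# Illustrative numbers only: a busy kiosk versus a rarely used one.
busy_kiosk = Workload(20_000, 0.001, 500.0, 0.50)
quiet_kiosk = Workload(100, 0.001, 500.0, 0.50)

print(break_even_days(busy_kiosk))   # a few weeks with these placeholder figures
print(break_even_days(quiet_kiosk))  # None: too little volume to justify the hardware
```

The point of the sketch is the shape of the calculation, not the output: volume is what tips a workload toward the edge, and a low-volume workload may never earn back the device.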

For a retail or service business, the operator read is this: not everything needs a frontier model and a round-trip to an API. A kiosk that recognizes a returning customer, a camera that counts foot traffic, a scanner that reads a label — those can run a small model locally, respond instantly, cost nothing per call, and keep the data on your premises. The frontier API is for the hard 20%; the edge handles the boring, high-volume 80% that would otherwise bleed you on latency and per-token fees.
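One way to picture that split is a small router that decides, per request, whether a local model or a cloud API handles the job. This is a sketch under stated assumptions: the task labels, the `make_router` helper, and the two stand-in handlers are all hypothetical, and a real system would plug in its own local runtime and API client.

```python
from typing import Callable, Tuple

Handler = Callable[[str, bytes], str]

# Hypothetical labels for the high-volume, routine work a small local model can cover.
EDGE_TASKS = {"returning_customer_match", "foot_traffic_count", "label_scan"}


def make_router(run_local: Handler, call_cloud: Handler) -> Callable[[str, bytes], Tuple[str, str]]:
    """Build a router that keeps routine tasks on-device and sends the rest to the cloud."""
    def route(task: str, payload: bytes) -> Tuple[str, str]:
        if task in EDGE_TASKS:
            # Data stays on the premises and no per-call fee is charged.
            return "edge", run_local(task, payload)
        # Anything not on the edge list goes to the hosted model.
        return "cloud", call_cloud(task, payload)
    return route


# Stand-in handlers so the sketch runs without a model or a network connection.
def fake_local_model(task: str, payload: bytes) -> str:
    return f"local:{task}"


def fake_cloud_api(task: str, payload: bytes) -> str:
    return f"cloud:{task}"


route = make_router(fake_local_model, fake_cloud_api)
print(route("label_scan", b"image-bytes"))       # ('edge', 'local:label_scan')
print(route("contract_review", b"document"))     # ('cloud', 'cloud:contract_review')
```

Routing on a fixed task list is the simplest possible policy. It keeps the decision auditable, at the price of sending anything unlisted to the paid API by default.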

The vendors are voting with their silicon. The compute is moving to where the customer is. Build like it.

Key takeaways

Amazon confirmed (July 2) it designs custom AZ3/AZ3 Pro silicon for Echo and Fire TV, expanding to Kindle
The AZ3 Pro runs language models and vision transformers on-device; the AZ3 improves wake-word detection by over 50%
Edge inference erases the three cloud taxes — latency, per-call cost, and data leaving your premises — for workloads that fit
The operator lesson: run the high-volume, boring 80% on small local models; save the frontier API and its per-token bill for the hard 20%

Paying per API call for AI that could run locally? We design systems that put the right workload in the right place — edge for speed and privacy, cloud for the hard problems. See how we build or bring us your use case.

Sources: CNBC, About Amazon.

