Apple slipped a 20‑billion‑parameter foundation model onto the iPhone’s flash storage.
At WWDC the company published a technical note that shows the model loading from the device’s NAND flash rather than residing in RAM. On the iPhone 15 Pro and 15 Pro Max, which ship with 8 GB of RAM and up to 1 TB of flash, the weights are streamed in chunks, keeping RAM use under 2 GB while still running inference on the full model. The note lists a latency of roughly 150 ms for a typical text‑completion query and a power draw of 1.2 W, comparable to short video playback.
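To make the memory argument concrete, here is a minimal Python sketch of the two ideas in play: how large 20 billion parameters are at a given precision, and how reading a weight file in fixed-size chunks bounds the amount held in memory at once. The bit widths, the file layout, and the helper names (`weights_size_bytes`, `stream_chunks`, `run_streamed`) are illustrative assumptions, not details from Apple's note, and the byte sum merely stands in for real per-layer compute.

```python
import os
import tempfile


def weights_size_bytes(num_params: int, bits_per_param: int) -> int:
    """Storage needed for the weights alone at an assumed precision."""
    return num_params * bits_per_param // 8


def stream_chunks(path: str, chunk_bytes: int):
    """Yield a file's contents in pieces of at most chunk_bytes each."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            yield chunk


def run_streamed(path: str, chunk_bytes: int):
    """Process a weight file chunk by chunk; return (result, largest chunk)."""
    total = 0
    peak = 0
    for chunk in stream_chunks(path, chunk_bytes):
        peak = max(peak, len(chunk))
        total += sum(chunk)  # placeholder for real computation on the chunk
    return total, peak


def demo() -> None:
    # Assumed precisions; the article does not state the model's quantization.
    for bits in (16, 4):
        gb = weights_size_bytes(20_000_000_000, bits) / 1e9
        print(f"{bits}-bit weights: {gb:.0f} GB")

    # Hypothetical stand-in for a weight file: 10,240 bytes of dummy data.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 40)
    try:
        total, peak = run_streamed(tmp.name, chunk_bytes=4096)
        print(f"largest chunk held at once: {peak} bytes")
    finally:
        os.unlink(tmp.name)


if __name__ == "__main__":
    demo()
```

Under these assumptions the weights come to 40 GB at 16 bits and 10 GB at 4 bits, so even an aggressively quantized copy would not fit in 8 GB of RAM, which is why some form of streaming from flash is needed. The chunk size is the knob that trades memory footprint against the number of reads.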
Running the model locally means no network round‑trip latency and no user data leaving the handset, a clear advantage for privacy‑focused apps. It also lets developers build sophisticated language features into their apps without the server backend those features previously required.
On‑device models are not new, and Google and Meta have shipped their own, but the size of this one and its reliance on flash streaming make it a notable milestone for mobile AI.
