A powerful AI model contains billions of internal numerical values — called parameters — that define how it reasons and generates text. At full precision, storing and processing all of those values requires far more memory than a smartphone or laptop can provide. Quantization is the process of reducing the precision of those values so that the model takes up significantly less memory, making it practical to run on everyday devices.

How quantization works

A useful way to think about quantization is to compare it to reducing the resolution of a photograph. A 48-megapixel image contains an enormous amount of detail. If you reduce it to 8 megapixels, you lose some of that fine-grained information, but for most practical purposes the viewer can still recognize everything in the image clearly. The subject, the context, and the meaning of the photograph are all preserved. The file is simply much smaller and faster to load.

Quantization applies the same principle to an AI model. Each parameter is stored using fewer "bits", the basic unit of digital information. A full-precision model stores each value using 16 or 32 bits. A quantized model might store the same value using only 4 or 8 bits. The model loses a small amount of fine-grained precision, but for the vast majority of clinical language tasks (transcription, summarization, note structuring) the output quality remains close to that of the full-precision version.
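To make the idea of "fewer bits per value" concrete, here is a minimal Python sketch of one simple scheme: symmetric linear quantization with a single scale for the whole list. It is an illustration only. The function names and the sample weights are made up for this example, and production runtimes generally use more elaborate schemes (for example, separate scales per small group of values).

```python
def quantize(values, bits):
    """Map floats to signed integer codes of the given bit width.

    Illustrative symmetric scheme: one scale shared by all values.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each value to the nearest step and clamp to the allowed range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical parameter values, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -1.27, 0.33]

for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes: {codes}  worst rounding error: {worst:.4f}")
```

Running the loop shows the trade-off described above: the 4-bit codes take half the space of the 8-bit codes, and the restored values drift further from the originals.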

How quantization frees up memory

The benefit of quantization goes beyond simply reducing the file size of the model. It directly addresses the memory bottleneck that limits on-device AI. When a model runs, it does not just sit in storage. It must be loaded into active memory (RAM) so the device can process it. A full-precision model that requires 30 GB of RAM cannot run on a device that has 8 GB available. This is a hard limit.

By reducing the bit size of each parameter, quantization dramatically reduces the amount of RAM the model occupies while it is running. A model that required 30 GB at full precision might require only 5–6 GB after quantization. This freed-up memory allows the device to load and run a model that would otherwise be completely inaccessible, with enough headroom to keep Isa running smoothly.

In practical terms, quantization is what makes it possible to run a capable medical language model on a modern iPhone or MacBook without any additional hardware.
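The arithmetic behind this is simple enough to sketch. The Python snippet below estimates the memory needed for the model's parameters alone as parameter count times bits per parameter, divided by 8 to get bytes. The 7.5-billion parameter count is a hypothetical figure chosen so that 32-bit storage comes to 30 GB. It is not the size of any specific model, and the function names are illustrative. A running model also needs memory beyond the raw parameters (for example, quantization scales and working buffers), so real footprints are somewhat higher than these estimates.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for the parameters alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


def fits_in_ram(required_gb, available_gb):
    """True if the estimated requirement is within the available memory."""
    return required_gb <= available_gb


# Hypothetical model size and device memory, for illustration only.
num_parameters = 7_500_000_000
available_gb = 8

for bits in (32, 16, 8, 4):
    required = weight_memory_gb(num_parameters, bits)
    verdict = "fits" if fits_in_ram(required, available_gb) else "does not fit"
    print(f"{bits:>2}-bit: {required:5.2f} GB of weights, {verdict} in {available_gb} GB")
```

Under these assumptions, only the lower bit widths bring the parameters under the 8 GB ceiling, which is the hard limit described above.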

What this means for your practice

  • On-device AI becomes possible: Without quantization, the models capable of handling complex medical language would require server-grade hardware. Quantization brings them to your pocket.
  • On-device processing becomes viable: A quantized model that fits on your device can run entirely on-device — the model’s own inference step does not require a cloud connection. Note that other components (MCP Servers, cloud-based Scribe extraction) each have their own data paths.
  • Accuracy remains clinically sufficient: For documentation, transcription, and summarization tasks, quantized models perform at a level that meets clinical documentation standards. The trade-off in raw numerical precision does not translate into a meaningful loss of clinical usefulness.

Next

MLX

The framework that runs quantized models on Apple devices.

RAM and device memory

The hardware constraint quantization works around.

Choose a model

Pick a quantized model that fits your device.