How We Calculate Inference Carbon
Last updated: 25 March 2026
InferenceCarbon estimates the inference carbon footprint of AI queries using a combination of
published research, publicly available data, and stated assumptions. This page
explains exactly how we do it, where the numbers come from, and where we think
the limitations are.
Honest caveat: Few AI providers currently publish per-query energy data.
Our estimates incorporate provider-specific grid intensity (CIF) and clean energy
procurement (CEAF) data from the InferenceCarbon reference papers (March 2026), but they
remain estimates — not precise measurements. We think approximate awareness
is far better than no awareness at all.
We invite all LLM providers to publish both verified location-based and verified market-based gCO₂e figures so that estimates are unnecessary. (This may be coming anyway with IFRS S2 and the GHG Protocol’s Scope 2 revision.)
1. The Core Formula
For non-thinking, text-based AI queries, the carbon footprint is calculated as:
Text carbon estimate (CEAF-adjusted)
carbon (gCO₂e) = (total_tokens / 1,000) × carbon_per_1k_ceaf_adj
For multimodal queries (image, audio, video), the formula is:
Multimodal carbon estimate
carbon (gCO₂e) = carbon_per_unit × quantity
Where:
- total_tokens (see Section 2, below) = input tokens + output tokens
- carbon_per_1k_ceaf_adj = grams of CO₂e per 1,000 tokens, with provider-location-specific
Carbon Intensity Factor (CIF) and Clean Energy Adjustment Factor (CEAF) already incorporated (see Section 3)
- carbon_per_unit = grams of CO₂e per image, per minute of audio, or per second
of video — already incorporating model energy consumption, provider data center Power Usage Effectiveness (PUE), CIF and CEAF
The idea is straightforward: more tokens (or more media units) means more computation, which means more
energy, which means more carbon. The amount of carbon per unit of computation varies
by model (larger models use more energy) and by provider (clean energy procurement
and data center location affect emissions).
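As a concrete sketch, the two core formulas reduce to a pair of one-line functions. The function names are ours, and the 0.21 gCO₂e/1k-token factor in the example is GPT-4o's CEAF-adjusted value quoted later in this document:

```python
# Core formulas as one-line functions (names are ours, for illustration).
def text_carbon_g(input_tokens: int, output_tokens: int,
                  carbon_per_1k_ceaf_adj: float) -> float:
    """CEAF-adjusted carbon for a text query, in gCO2e."""
    total_tokens = input_tokens + output_tokens
    return (total_tokens / 1_000) * carbon_per_1k_ceaf_adj

def multimodal_carbon_g(carbon_per_unit: float, quantity: float) -> float:
    """Carbon for media: units are images, audio minutes, or video seconds."""
    return carbon_per_unit * quantity

# A 150-token prompt with a 400-token response at 0.21 gCO2e per 1k tokens:
print(text_carbon_g(150, 400, 0.21))  # ~0.1155 gCO2e
```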
2. Token Counting
AI models process text as tokens — roughly, sub-word units that
average about 0.75 words each (or about 4 characters). The more tokens in your
prompt and the model's response, the more computation is required.
Input tokens
We estimate input tokens from your prompt text. Where possible, we use the model's
actual tokenizer for accuracy; otherwise, we fall back to character-based heuristics.
| Method | Used for | Accuracy |
| --- | --- | --- |
| tiktoken (BPE) | GPT-4o, GPT-4.1, GPT-5, o3 | ±2% |
| 3.8 chars/token | Claude models | ±10% |
| 4.0 chars/token | Gemini models | ±12% |
| 4.0 chars/token | All other models | ±15% |
Output tokens
Since we can't know in advance how long a model's response will be, we use preset
estimates based on the response length you select:
| Response length | Approximate words | Estimated tokens |
| --- | --- | --- |
| Short | ~100 | 133 |
| Medium | ~300 | 400 |
| Long | ~1,000 | 1,333 |
| Very long | ~5,000 | 6,667 |
When you use the "Try It Live" feature, we replace these estimates with the actual
token counts from the real API response.
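A minimal sketch of the input-token estimation logic. The chars-per-token ratios mirror the table above; the use of tiktoken's `o200k_base` encoding for OpenAI models, and the family-to-ratio mapping, are our assumptions about the wiring:

```python
# Character-based heuristics from the table above; tiktoken is used when
# available for OpenAI models (o200k_base is an assumed encoding choice).
CHARS_PER_TOKEN = {"claude": 3.8, "gemini": 4.0, "default": 4.0}

def estimate_input_tokens(prompt: str, family: str = "default") -> int:
    try:
        import tiktoken  # exact BPE counts, if the library is installed
        if family == "openai":
            enc = tiktoken.get_encoding("o200k_base")
            return len(enc.encode(prompt))
    except ImportError:
        pass  # fall back to the character heuristic below
    ratio = CHARS_PER_TOKEN.get(family, CHARS_PER_TOKEN["default"])
    return max(1, round(len(prompt) / ratio))

print(estimate_input_tokens("How do heat pumps work?", "gemini"))  # 23 chars -> 6
```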
3. Model Carbon Intensity (Three-Layer Derivation)
Each model's carbon intensity is derived through a three-layer process that converts
raw energy measurements into provider-specific, CEAF-adjusted carbon values. The values
come from the InferenceCarbon reference papers (March 2026; see Section 8).
Layer 1: Energy (Wh per unit)
We start with an anchor measurement — a directly measured or provider-disclosed
energy-per-token value for a reference model. Other models are scaled relative to this anchor using
the ratio of their throughput (tokens/second from independent benchmarks) and a GPU Energy
Coefficient (GEC) that accounts for architectural differences (e.g. GPUs vs. TPUs, Mixture-of-Experts routing, reasoning chains, etc.).
Layer 2: Location-Based Carbon
The energy value is converted to gross (location-based) carbon using a per-provider
Carbon Intensity Factor (CIF) in kgCO₂e/kWh. The CIF reflects the
carbon intensity of the electrical grid where the provider's data centers are located,
weighted by serving region mix and incorporating a PUE overhead.
| Provider / Infrastructure | CIF (kgCO₂e/kWh) | Basis |
| --- | --- | --- |
| Anthropic / AWS | 0.287 | AWS us-east-1 & us-west-2 weighted average (EPA eGRID 2024) |
| OpenAI / Azure | 0.350 | Azure US region mix (EPA eGRID 2024) |
| Google / Google Cloud Platform | 0.375 | GCP global serving mix (IEA 2023, Google CFE reports) |
| DeepSeek / China | 0.600 | Chinese grid average (Ember 2024, IEA 2025 forecast) |
| DeepSeek / Azure | 0.350 | Azure US region mix (same as OpenAI) |
| Mistral / Azure | 0.350 | Azure US & EU region mix |
| Stability AI / AWS | 0.287 | AWS region mix (same as Anthropic) |
| Kuaishou (Kling) / China | 0.530 | Chinese grid, partial renewables (Ember 2024) |
| Undisclosed | 0.370 | Global average fallback (IEA 2023) |
Layer 3: CEAF-Adjusted Carbon
The location-based carbon is then adjusted for the provider's verified clean energy procurement
using the Clean Energy Adjustment Factor (see Section 4):
Three-layer derivation
carbon_ceaf_adj = energy_wh × CIF × (1 − CEAF)
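The three-layer computation collapses into a single multiplication. In the example below, the 0.34 Wh is the publicly disclosed ChatGPT average cited in Section 9, the 0.350 CIF is Azure's value from the table above, and the 40% CEAF is an illustrative placeholder, not a published provider value:

```python
# Three-layer derivation in one step. kgCO2e/kWh equals gCO2e/Wh,
# so Wh x CIF yields grams directly.
def ceaf_adjusted_carbon_g(energy_wh: float, cif_kg_per_kwh: float,
                           ceaf: float) -> float:
    return energy_wh * cif_kg_per_kwh * (1.0 - ceaf)

# 0.34 Wh on Azure's 0.350 CIF with an assumed 40% CEAF:
print(ceaf_adjusted_carbon_g(0.34, 0.350, 0.40))  # ~0.0714 gCO2e
```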
Text models (sorted by CEAF-adjusted carbon, greenest first)
| Model | Provider | Range gCO₂e / 1k tokens (Estimated) | Gross gCO₂e / 1k tokens (Estimated) | CEAF % | CEAF-Adjusted gCO₂e / 1k tokens (Estimated) | Confidence |
| --- | --- | --- | --- | --- | --- | --- |
| Loading… | | | | | | |
4. Clean Energy Adjustment Factor (CEAF)
The CEAF adjusts gross (location-based) emissions to account for a provider's verified
clean energy procurement. A provider that purchases renewable energy certificates (RECs)
or has long-term power purchase agreements (PPAs) will have lower market-based emissions.
CEAF adjustment
carbon_ceaf_adj = carbon_gross × (1 − CEAF)
| Provider | Grid CIF tier | CEAF % | Basis |
| --- | --- | --- | --- |
| Loading… | | | |
Why is AWS CEAF 0%? We have not been able to identify published, facility-level verified 24/7 Carbon-Free Energy (CFE) data for AWS. While Amazon has corporate-level renewable energy commitments, the
CIF of 0.287 kgCO₂e/kWh already reflects the relatively clean grid locations where
AWS data centers operate (e.g., us-west-2 in Oregon). Applying a CEAF on top would risk
double-counting the grid benefit.
CEAF limitations: The CEAF is based on annual averages and corporate-level
claims. Real-time clean energy matching varies hourly. Google's 24/7 CFE programme is the
most granular; other providers may over-claim on an hourly basis. We apply conservative
CEAF values and plan to update them as more granular data becomes available.
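A sketch of the dual-reporting view this implies, producing the gross location-based and the CEAF market-adjusted figure side by side. The 60% CEAF here is a placeholder, not a published provider value:

```python
# Dual reporting: location-based (gross) vs. market-based (CEAF-adjusted)
# figures for the same query. The 0.60 CEAF is an illustrative assumption.
def dual_report(carbon_gross_g: float, ceaf: float) -> dict:
    return {
        "location_based_g": carbon_gross_g,               # grid-mix emissions
        "market_based_g": carbon_gross_g * (1.0 - ceaf),  # after clean-energy procurement
    }

print(dual_report(0.50, 0.60))  # gross 0.5 g -> ~0.2 g market-based
```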
5. Multimodal Estimation (Image, Audio, Video)
AI isn't just text. Image generation, audio processing, and video creation have
very different energy profiles. We estimate these separately using per-unit factors
that already incorporate model energy, provider PUE, CIF, and CEAF.
5.1 Image generation
Image carbon
carbon = carbon_per_image × resolution_multiplier × steps_multiplier
The baseline is a 1024×1024 image at 25 diffusion steps, derived from direct
energy measurements of Stable Diffusion by Luccioni (2024): approximately 2,282 joules
per image. Resolution scaling is non-linear — doubling the pixel count doesn't
double the energy because of how diffusion models process images.
| Model | CEAF-Adjusted gCO₂e / image (Estimated) | Confidence |
| --- | --- | --- |
| Loading… | | |
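A sketch of how the image multipliers might compose. The 0.8 resolution exponent and the linear steps scaling are both assumptions on our part; the text only states that pixel-count scaling is sub-linear:

```python
# Image carbon with resolution and step multipliers, relative to the
# 1024x1024 / 25-step baseline. The 0.8 exponent is an assumed sub-linear
# scaling; the article only says doubling pixels does not double energy.
def image_carbon_g(base_g_per_image: float, pixel_ratio: float,
                   steps: int, base_steps: int = 25) -> float:
    resolution_multiplier = pixel_ratio ** 0.8  # assumed sub-linear exponent
    steps_multiplier = steps / base_steps       # steps assumed ~linear
    return base_g_per_image * resolution_multiplier * steps_multiplier

# 2048x2048 (4x the pixels of the baseline) at 50 steps,
# from an assumed 1.5 g baseline image:
print(image_carbon_g(1.5, 4.0, 50))
```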
5.2 Audio processing
Audio carbon
carbon = carbon_per_minute × duration_minutes
For speech-to-text (Automatic Speech Recognition, ASR) models like Whisper, we derive per-minute energy from published
benchmarks: Whisper Large v3 processes 22 hours of audio using approximately 0.35 kWh,
giving a baseline of about 0.014 gCO₂e per minute. Text-to-speech (TTS) is
estimated at 2–3× the energy of transcription.
| Model | CEAF-Adjusted gCO₂e / minute (Estimated) | Confidence |
| --- | --- | --- |
| Loading… | | |
5.3 Video generation
Video carbon
carbon = carbon_per_second × duration_seconds × quality_multiplier
Video is by far the most carbon-intensive AI modality. A single second of AI-generated
video at high quality can produce 60–180 gCO₂e. Our estimates are derived from direct measurements of
CogVideoX (2025), with other models estimated relative to this.
| Model | CEAF-Adjusted gCO₂e / second (Estimated) | Confidence |
| --- | --- | --- |
| Loading… | | |
Video uncertainty is high (±50–60%). No standardized
benchmarks exist for AI video generation energy. These figures should be treated as
rough order-of-magnitude estimates.
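The per-unit audio and video formulas reduce to simple multiplications. The 0.014 g/min baseline is the Whisper-derived figure from Section 5.2; the 120 g/s video factor is an illustrative value within the 60–180 range stated above:

```python
# Per-unit multimodal formulas: audio per minute, video per second
# with a quality multiplier.
def audio_carbon_g(g_per_minute: float, duration_min: float) -> float:
    return g_per_minute * duration_min

def video_carbon_g(g_per_second: float, duration_s: float,
                   quality_multiplier: float = 1.0) -> float:
    return g_per_second * duration_s * quality_multiplier

# Ten minutes of transcription at the 0.014 g/min baseline,
# and 5 s of video at an assumed 120 g/s:
print(audio_carbon_g(0.014, 10))   # ~0.14 gCO2e
print(video_carbon_g(120, 5))      # 600 gCO2e
```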
6. Financial Cost
Alongside carbon, we calculate the financial cost of each query using the provider's
published API pricing:
Financial cost
cost = (input_tokens / 1,000,000) × input_price + (output_tokens / 1,000,000) × output_price
We also calculate a carbon offset cost — the theoretical cost of
offsetting the emissions through voluntary carbon credits, at approximately $25 per tonne
of CO₂e (Gold Standard market average, 2025). For most individual queries, this
is a tiny fraction of a cent, which helps put the scale of AI carbon emissions in context.
NB: The data we produce is our best estimate of emissions coming from a particular model derived using external data, and it may vary significantly from emission calculations derived internally. Use these figures with care when assessing offsets.
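Both cost calculations as a sketch. The $2.50 / $10.00 per-million-token prices are illustrative, not quoted provider pricing; the $25/tonne offset rate is from the text above:

```python
# Query cost from per-million-token API prices, plus the offset cost
# at ~$25 per tonne (1 tonne = 1,000,000 g). Prices are illustrative.
def query_cost_usd(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

def offset_cost_usd(carbon_g: float, usd_per_tonne: float = 25.0) -> float:
    return (carbon_g / 1_000_000) * usd_per_tonne

print(query_cost_usd(150, 400, 2.50, 10.00))  # ~0.004375 USD
print(offset_cost_usd(0.12))                  # ~3e-06 USD: a fraction of a cent
```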
6.2 EV Distance Equivalent
We express energy consumption as equivalent driving distance in a standard electric vehicle
(Nissan Leaf, 40 kWh battery, 149 miles / 240 km EPA range):
EV distance
ev_miles = (energy_Wh / 1,000) × 3.725
Where 3.725 miles/kWh is derived from the Nissan Leaf’s EPA-rated efficiency
(149 miles ÷ 40 kWh). This provides an intuitive sense of scale: a single text query
at roughly a third of a watt-hour equates to driving about two metres, while generating
a minute of AI video can equate to a far greater distance.
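The conversion as code, using only the constants stated above:

```python
# EV-distance conversion using the Nissan Leaf efficiency stated above
# (149 miles / 40 kWh = 3.725 miles per kWh).
def ev_miles(energy_wh: float) -> float:
    return energy_wh / 1_000 * 3.725

# A 0.34 Wh query equates to roughly 0.0013 miles, i.e. about two metres.
print(ev_miles(0.34))
```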
7. Carbon Savings Methodology
InferenceCarbon doesn't just estimate your carbon footprint — it tells you how much
carbon you saved (or added) compared to a reference baseline. This section explains
how we define that baseline, what counts as a "saving", and why we think the approach
is defensible.
7.1 The default baseline: GPT-4o at 267 words
Every carbon comparison in InferenceCarbon is measured against a reference baseline.
The default baseline is the same prompt sent to GPT-4o, generating a medium-length
(267-word) response. Users can customize the baseline model and response length
to match their own usage patterns.
In formula terms:
Baseline carbon
baseline (gCO₂e) = ((input_tokens + 356) / 1,000) × 0.21
Where input_tokens is your actual prompt re-tokenised at GPT-4o's rate
(~4 characters per token), 356 is the estimated output-token count for a
medium-length (267-word) response, and 0.21 is GPT-4o's CEAF-adjusted gCO₂e per 1,000 tokens.
Carbon saving
saving (gCO₂e) = baseline − actual
A positive saving means you chose a lower-carbon option. A
negative saving means your choice used more carbon than the baseline.
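The baseline and saving formulas as a sketch, using the constants from Section 7.1 (356 output tokens, 0.21 gCO₂e per 1k tokens):

```python
# Baseline-vs-actual saving, using the stated constants.
GPT4O_G_PER_1K = 0.21
BASELINE_OUTPUT_TOKENS = 356

def baseline_carbon_g(input_tokens: int) -> float:
    return ((input_tokens + BASELINE_OUTPUT_TOKENS) / 1_000) * GPT4O_G_PER_1K

def saving_g(input_tokens: int, actual_carbon_g: float) -> float:
    # Positive: you beat the baseline. Negative: you emitted more than it.
    return baseline_carbon_g(input_tokens) - actual_carbon_g

# A 100-token prompt answered by a model that emitted 0.05 g in total:
print(round(saving_g(100, 0.05), 4))  # ~0.0458 gCO2e saved
```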
7.2 Why GPT-4o?
We chose GPT-4o as the baseline model for three reasons:
- Market share: ChatGPT holds approximately 68% of the AI chatbot market (Similarweb, January 2026), and GPT-4o is its default model for both free and paid users. It is the single most widely used AI model in the world.
- Industry standard: GPT-4o is the de facto reference model used by researchers (Epoch AI, 2025), benchmarking platforms, and competing providers (Google, Anthropic, Meta) for performance and energy comparisons.
- Mid-range carbon intensity: At 0.21 gCO₂e per 1,000 tokens (CEAF-adjusted), GPT-4o sits in the middle of our model range (0.09–7.00). It is neither the cleanest nor the most carbon-intensive option, making it a fair yardstick rather than a cherry-picked extreme.
7.3 Why 267 words?
We use a medium-length (267-word, ~356-token) response as the baseline output length.
This is supported by published research on real-world AI conversations:
- LMSYS-Chat-1M (Zheng, Chiang et al., ICLR 2024) — the largest public dataset of real LLM conversations (1 million exchanges across 25 models) found an average model response of 214.5 tokens (~160 words).
- Epoch AI (February 2025) used 500 tokens (~375 words) as a "typical but somewhat pessimistic" estimate for energy calculations.
Our 267-word (356-token) baseline falls partway between these two reference points.
We believe this is the fairest choice: it avoids inflating savings (which a shorter
baseline would do) and avoids understating savings (which a longer baseline would do).
7.4 What counts as a saving?
Carbon savings come from three user decisions, each of which changes the actual
carbon footprint relative to the baseline:
| Decision | How it creates a saving | Example |
| --- | --- | --- |
| Model selection | Choosing a model with a lower carbon_per_1k_ceaf_adj value | Gemini 2.5 Flash-Lite (0.09) vs GPT-4o (0.21) = 57% saving |
| Response length | Requesting a shorter response reduces output tokens | Short (100 words) vs baseline (267 words) = fewer tokens processed |
| Prompt optimisation | Writing concise prompts reduces input tokens | A 20-word prompt vs a 200-word prompt saves input-side computation |
Important: The baseline uses your actual prompt, so prompt
optimisation lowers the baseline and the actual footprint together. The comparison
therefore isolates the effect of your model choice and response length, the two
decisions most directly within your control.
7.5 Honest limitations
We want to be transparent about the boundaries of this approach:
- The baseline is hypothetical. We estimate what would have happened if you'd used GPT-4o at 267 words. You may never have intended to use GPT-4o, so the "saving" is relative to a counterfactual, not a known fact.
- Output tokens are estimated. In the calculator, the baseline uses a fixed 356-token output estimate. When you use "Try It Live", we have your actual token count but still compare against the 356-token baseline. Real responses vary widely.
- Some model choices increase carbon. If you select a reasoning model like DeepSeek R1 App (7.00 gCO₂e/1k tokens), the "saving" is negative — you used more carbon than the baseline.
- The baseline will evolve. As market share shifts and new models emerge, the reference model may need updating. We'll document any changes here and explain the rationale.
- Savings inherit all uncertainty. Since both the baseline and actual estimates carry ±30% uncertainty, the savings figure has a combined uncertainty of roughly ±40%. Small differences between similar models should be interpreted with caution.
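The ±40% figure follows from adding the two ±30% uncertainties in quadrature, a common simplification that treats the baseline and actual errors as independent:

```python
import math

# Independent relative uncertainties combine in quadrature:
# sqrt(0.30^2 + 0.30^2) ~= 0.42, i.e. roughly +/-40%.
def combined_uncertainty(u_baseline: float, u_actual: float) -> float:
    return math.sqrt(u_baseline ** 2 + u_actual ** 2)

print(round(combined_uncertainty(0.30, 0.30), 2))  # 0.42
```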
7.6 Cumulative savings
On your dashboard, we sum individual query savings over time to show your total
carbon impact. This uses the same per-query formula applied to each tracked query
(from the browser extension, calculator, or Try It Live feature). Cumulative savings
are most meaningful when viewed as a trend rather than an absolute number.
A note on "savings" vs "reductions": InferenceCarbon shows how your
AI usage compares to a reference scenario. We deliberately use the word "saving" (vs
baseline) rather than "reduction" (vs your own past behavior) because we are comparing
against a hypothetical, not tracking a personal journey over time. Both framings have
value; ours is designed to be immediately actionable for every query.
8. Reference Papers
The methodology is based on the following InferenceCarbon reference papers (March 2026),
which provide provider-specific carbon intensity estimates:
- Estimating the Inference Carbon Intensity of Anthropic’s Claude Models — Claude Haiku/Sonnet/Opus families, multi-cloud (AWS+GCP), CIF tier 2.0
- Estimating the Inference Carbon Intensity of DeepSeek’s Models — V3/R1 models, dual-pathway (China app vs Azure), CIF tiers 3.0/1.5
- Estimating the Inference Carbon Intensity of Google Gemini Models — Gemini model family, 24/7 CFE data, CIF tier 1.0
- Estimating the Inference Carbon Intensity of Mistral’s Models — Small/Medium/Large 3 families, hosted on Azure/AWS, CIF tier 1.5
- Estimating the Inference Carbon Intensity of OpenAI ChatGPT Models — GPT-4o/4.1/5/o3 families, Azure infrastructure, CIF tier 1.5
We also publish modality-specific methodology papers covering audio, image, and video generation models.
9. Sources & References
Our methodology draws on the following published research:
- Luccioni, A.S., Viguier, S., & Ligozat, A.-L. (2023). "Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model." Journal of Machine Learning Research, 24(253), 1–15. Direct measurements of model energy per token — our primary baseline source.
- Luccioni, A.S. (2024). "Measuring the Energy Consumption of AI Image Generation." Direct energy measurements of Stable Diffusion at various resolutions and step counts.
- Patterson, D., et al. (2021, 2022). "Carbon Emissions and Large Neural Network Training." arXiv / IEEE Computer. Energy scaling laws, GPU utilisation patterns, and efficiency trends.
- Dodge, J., et al. (2022). "Measuring the Carbon Intensity of AI in Cloud Instances." FAccT 2022. Uncertainty quantification methods for ML carbon estimation.
- Patel, A., et al. (2024). "Splitwise: Efficient GPU Inference with Model Parallelism." ASPLOS 2024. GPU power draw at 60–80% TDP during inference workloads.
- Jegham, I., et al. (2025). "Towards Sustainable AI: A Comprehensive Framework for Green Large Language Models." arXiv:2505.09598v6. Independent energy measurements per token across major LLM providers — the primary anchor data source for all carbon estimates.
- Google (2025). "Measuring Environmental Impact of AI Inference." arXiv:2508.15734. Google's own measurement of 0.24 Wh per median Gemini prompt — cross-validates the Jegham framework.
- Altman, S. (June 2025). "The Gentle Singularity." CEO disclosure of 0.34 Wh average ChatGPT query — the anchor data point for all OpenAI estimates.
- Mistral AI / Carbone 4 (2025). "Mistral Large 2 Life Cycle Assessment." Provider-published, peer-reviewed LCA — the only independently audited per-token carbon figure in the industry (1.14 gCO₂e/400 tokens).
- Samsi, S., et al. (2023). "Scaling Large Language Models on Edge Accelerators." IEEE HPEC 2023. Energy per token estimates for large models.
- International Energy Agency (2023, 2025). "Emission Factors" / "Electricity Mid-Year Update 2025." Regional and national grid carbon intensity data (gCO₂e/kWh), including Chinese grid CIF forecasts.
- EPA eGRID (2024). "Emissions & Generation Resource Integrated Database." US grid carbon intensity data (370 gCO₂e/kWh) used for OpenAI and Anthropic gross estimates.
- Ember (2024). "Global Electricity Review." Chinese grid carbon intensity (581 gCO₂e/kWh) used for DeepSeek domestic estimates.
- GHG Protocol (2025). "Scope 2 Guidance: Location-Based and Market-Based Accounting." Framework for our dual-reporting approach (gross location-based vs. CEAF market-adjusted).
- Artificial Analysis (2026). "LLM Throughput Benchmarks." Third-party throughput measurements (tokens/second) used for scaling estimates across all providers.
- DeepSeek (2024, 2025). "DeepSeek-V3 Technical Report" / "DeepSeek-R1 Paper." Architecture details (MoE, 671B/37B active parameters) and training methodology.
- Zheng, L., Chiang, W., et al. (2024). "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." ICLR 2024. Average response length of 214.5 tokens across 1M real conversations — used to validate our 267-word baseline.
- Epoch AI (2025). "How Much Energy Does ChatGPT Use?" Uses 500 output tokens as a typical estimate for GPT-4o energy calculations — upper bound for our baseline validation.
Feedback
If you spot an error in our methodology, have access to better data, or want to suggest
improvements, we'd genuinely love to hear from you. Getting this right matters, and
we know we don't have all the answers.