How We Calculate Inference Carbon
Last updated: 25 March 2026
InferenceCarbon estimates the inference carbon footprint of AI queries using a combination of
published research, publicly available data, and stated assumptions. This page
explains exactly how we do it, where the numbers come from, and where we think
the limitations are.
Honest caveat: Few AI providers currently publish per-query energy data.
Our estimates incorporate provider-specific grid intensity (CIF) and clean energy
procurement (CEAF) data from the InferenceCarbon reference papers (March 2026), but they
remain estimates — not precise measurements. We think approximate awareness
is far better than no awareness at all.
We invite all LLM providers to publish both verified location-based and verified market-based gCO₂e figures so that estimates are unnecessary. (This may be coming anyway with IFRS S2 and the GHG Protocol’s Scope 2 revision.)
1. The Core Formula
For non-thinking, text-based AI queries, the carbon footprint is calculated as:
Text carbon estimate (CEAF-adjusted)
carbon (gCO₂e) = (total_tokens / 1,000) × carbon_per_1k_ceaf_adj
For multimodal queries (image, audio, video), the formula is:
Multimodal carbon estimate
carbon (gCO₂e) = carbon_per_unit × quantity
Where:
- total_tokens (see Section 2, below) = input tokens + output tokens
- carbon_per_1k_ceaf_adj = grams of CO₂e per 1,000 tokens, with provider-location-specific
Carbon Intensity Factor (CIF) and Clean Energy Adjustment Factor (CEAF) already incorporated (see Section 3)
- carbon_per_unit = grams of CO₂e per image, per minute of audio, or per second
of video — already incorporating model energy consumption, provider data center Power Usage Effectiveness (PUE), CIF and CEAF
The idea is straightforward: more tokens (or more media units) means more computation, which means more
energy, which means more carbon. The amount of carbon per unit of computation varies
by model (larger models use more energy) and by provider (clean energy procurement
and data center location affect emissions).
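As a concrete sketch, the two core formulas reduce to a pair of one-line functions. The function names are ours, and the 0.21 gCO₂e/1k-token factor in the example is GPT-4o's CEAF-adjusted value quoted later in this document:

```python
# Core formulas as one-line functions (names are ours, for illustration).
def text_carbon_g(input_tokens: int, output_tokens: int,
                  carbon_per_1k_ceaf_adj: float) -> float:
    """CEAF-adjusted carbon for a text query, in gCO2e."""
    total_tokens = input_tokens + output_tokens
    return (total_tokens / 1_000) * carbon_per_1k_ceaf_adj

def multimodal_carbon_g(carbon_per_unit: float, quantity: float) -> float:
    """Carbon for media: units are images, audio minutes, or video seconds."""
    return carbon_per_unit * quantity

# A 150-token prompt with a 400-token response at 0.21 gCO2e per 1k tokens:
print(text_carbon_g(150, 400, 0.21))  # ~0.1155 gCO2e
```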
2. Token Counting
AI models process text as tokens — roughly, sub-word units that
average about 0.75 words each (or about 4 characters). The more tokens in your
prompt and the model's response, the more computation is required.
Input tokens
We estimate input tokens from your prompt text. Where possible, we use the model's
actual tokenizer for accuracy; otherwise, we fall back to character-based heuristics.
| Method | Used for | Accuracy |
| --- | --- | --- |
| tiktoken (BPE) | GPT-4o, GPT-4.1, GPT-5, o3 | ±2% |
| 3.8 chars/token | Claude models | ±10% |
| 4.0 chars/token | Gemini models | ±12% |
| 4.0 chars/token | All other models | ±15% |
Output tokens
Since we can't know in advance how long a model's response will be, we use preset
estimates based on the response length you select:
| Response length | Approximate words | Estimated tokens |
| --- | --- | --- |
| Short | ~100 | 133 |
| Medium | ~300 | 400 |
| Long | ~1,000 | 1,333 |
| Very long | ~5,000 | 6,667 |
When you use the "Try It Live" feature, we replace these estimates with the actual
token counts from the real API response.
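A minimal sketch of the input-token estimation logic. The chars-per-token ratios mirror the table above; the use of tiktoken's `o200k_base` encoding for OpenAI models, and the family-to-ratio mapping, are our assumptions about the wiring:

```python
# Character-based heuristics from the table above; tiktoken is used when
# available for OpenAI models (o200k_base is an assumed encoding choice).
CHARS_PER_TOKEN = {"claude": 3.8, "gemini": 4.0, "default": 4.0}

def estimate_input_tokens(prompt: str, family: str = "default") -> int:
    try:
        import tiktoken  # exact BPE counts, if the library is installed
        if family == "openai":
            enc = tiktoken.get_encoding("o200k_base")
            return len(enc.encode(prompt))
    except ImportError:
        pass  # fall back to the character heuristic below
    ratio = CHARS_PER_TOKEN.get(family, CHARS_PER_TOKEN["default"])
    return max(1, round(len(prompt) / ratio))

print(estimate_input_tokens("How do heat pumps work?", "gemini"))  # 23 chars -> 6
```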
3. Model Carbon Intensity (Three-Layer Derivation)
Each model's carbon intensity is derived through a three-layer process that converts
raw energy measurements into provider-specific, CEAF-adjusted carbon values. The values
come from the InferenceCarbon reference papers (March 2026; see Section 8).
Layer 1: Energy (Wh per unit)
We start with an anchor measurement — a directly measured or provider-disclosed
energy-per-token value for a reference model. Other models are scaled relative to this anchor using
the ratio of their throughput (tokens/second from independent benchmarks) and a GPU Energy
Coefficient (GEC) that accounts for architectural differences (e.g. GPUs vs. TPUs, Mixture-of-Experts routing, reasoning chains, etc.).
Layer 2: Location-Based Carbon
The energy value is converted to gross (location-based) carbon using a per-provider
Carbon Intensity Factor (CIF) in kgCO₂e/kWh. The CIF reflects the
carbon intensity of the electrical grid where the provider's data centers are located,
weighted by serving region mix and incorporating a PUE overhead.
| Provider / Infrastructure | CIF (kgCO₂e/kWh) | Basis |
| --- | --- | --- |
| Anthropic / AWS | 0.287 | AWS us-east-1 & us-west-2 weighted average (EPA eGRID 2024) |
| OpenAI / Azure | 0.350 | Azure US region mix (EPA eGRID 2024) |
| Google / Google Cloud Platform | 0.375 | GCP global serving mix (IEA 2023, Google CFE reports) |
| DeepSeek / China | 0.600 | Chinese grid average (Ember 2024, IEA 2025 forecast) |
| DeepSeek / Azure | 0.350 | Azure US region mix (same as OpenAI) |
| Mistral / Azure | 0.350 | Azure US & EU region mix |
| Stability AI / AWS | 0.287 | AWS region mix (same as Anthropic) |
| Kuaishou (Kling) / China | 0.530 | Chinese grid, partial renewables (Ember 2024) |
| Undisclosed | 0.370 | Global average fallback (IEA 2023) |
Layer 3: CEAF-Adjusted Carbon
The location-based carbon is then adjusted for the provider's verified clean energy procurement
using the Clean Energy Adjustment Factor (see Section 4):
Three-layer derivation
carbon_ceaf_adj = energy_wh × CIF × (1 − CEAF)
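The three-layer computation collapses into a single multiplication. In the example below, the 0.34 Wh is the publicly disclosed ChatGPT average cited in Section 9, the 0.350 CIF is Azure's value from the table above, and the 40% CEAF is an illustrative placeholder, not a published provider value:

```python
# Three-layer derivation in one step. kgCO2e/kWh equals gCO2e/Wh,
# so Wh x CIF yields grams directly.
def ceaf_adjusted_carbon_g(energy_wh: float, cif_kg_per_kwh: float,
                           ceaf: float) -> float:
    return energy_wh * cif_kg_per_kwh * (1.0 - ceaf)

# 0.34 Wh on Azure's 0.350 CIF with an assumed 40% CEAF:
print(ceaf_adjusted_carbon_g(0.34, 0.350, 0.40))  # ~0.0714 gCO2e
```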
Text models (sorted by CEAF-adjusted carbon, greenest first)
| Model | Provider | Range gCO₂e / 1k tokens (Estimated) | Gross gCO₂e / 1k tokens (Estimated) | CEAF % | CEAF-Adjusted gCO₂e / 1k tokens (Estimated) | Confidence |
| --- | --- | --- | --- | --- | --- | --- |
| Loading… | | | | | | |
4. Clean Energy Adjustment Factor (CEAF)
The CEAF adjusts gross (location-based) emissions to account for a provider's verified
clean energy procurement. A provider that purchases renewable energy certificates (RECs)
or has long-term power purchase agreements (PPAs) will have lower market-based emissions.
CEAF adjustment
carbon_ceaf_adj = carbon_gross × (1 − CEAF)
| Provider | Grid CIF tier | CEAF % | Basis |
| --- | --- | --- | --- |
| Loading… | | | |
Why is AWS CEAF 0%? We have not been able to identify published, facility-level verified 24/7 Carbon-Free Energy (CFE) data for AWS. While Amazon has corporate-level renewable energy commitments, the
CIF of 0.287 kgCO₂e/kWh already reflects the relatively clean grid locations where
AWS data centers operate (e.g., us-west-2 in Oregon). Applying a CEAF on top would risk
double-counting the grid benefit.
CEAF limitations: The CEAF is based on annual averages and corporate-level
claims. Real-time clean energy matching varies hourly. Google's 24/7 CFE programme is the
most granular; other providers may over-claim on an hourly basis. We apply conservative
CEAF values and plan to update them as more granular data becomes available.
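A sketch of the dual-reporting view this implies, producing the gross location-based and the CEAF market-adjusted figure side by side. The 60% CEAF here is a placeholder, not a published provider value:

```python
# Dual reporting: location-based (gross) vs. market-based (CEAF-adjusted)
# figures for the same query. The 0.60 CEAF is an illustrative assumption.
def dual_report(carbon_gross_g: float, ceaf: float) -> dict:
    return {
        "location_based_g": carbon_gross_g,               # grid-mix emissions
        "market_based_g": carbon_gross_g * (1.0 - ceaf),  # after clean-energy procurement
    }

print(dual_report(0.50, 0.60))  # gross 0.5 g -> ~0.2 g market-based
```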
5. Multimodal Estimation (Image, Audio, Video)
AI isn't just text. Image generation, audio processing, and video creation have
very different energy profiles. We estimate these separately using per-unit factors
that already incorporate model energy, provider PUE, CIF, and CEAF.
5.1 Image generation
Image carbon
carbon = carbon_per_image × resolution_multiplier × steps_multiplier
The baseline is a 1024×1024 image at 25 diffusion steps, derived from direct
energy measurements of Stable Diffusion by Luccioni (2024): approximately 2,282 joules
per image. Resolution scaling is non-linear — doubling the pixel count doesn't
double the energy because of how diffusion models process images.
| Model | CEAF-Adjusted gCO₂e / image (Estimated) | Confidence |
| --- | --- | --- |
| Loading… | | |
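A sketch of how the image multipliers might compose. The 0.8 resolution exponent and the linear steps scaling are both assumptions on our part; the text only states that pixel-count scaling is sub-linear:

```python
# Image carbon with resolution and step multipliers, relative to the
# 1024x1024 / 25-step baseline. The 0.8 exponent is an assumed sub-linear
# scaling; the article only says doubling pixels does not double energy.
def image_carbon_g(base_g_per_image: float, pixel_ratio: float,
                   steps: int, base_steps: int = 25) -> float:
    resolution_multiplier = pixel_ratio ** 0.8  # assumed sub-linear exponent
    steps_multiplier = steps / base_steps       # steps assumed ~linear
    return base_g_per_image * resolution_multiplier * steps_multiplier

# 2048x2048 (4x the pixels of the baseline) at 50 steps,
# from an assumed 1.5 g baseline image:
print(image_carbon_g(1.5, 4.0, 50))
```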
5.2 Audio processing
Audio carbon
carbon = carbon_per_minute × duration_minutes
For speech-to-text (Automatic Speech Recognition, ASR) models like Whisper, we derive per-minute energy from published
benchmarks: Whisper Large v3 processes 22 hours of audio using approximately 0.35 kWh,
giving a baseline of about 0.014 gCO₂e per minute. Text-to-speech (TTS) is
estimated at 2–3× the energy of transcription.
| Model | CEAF-Adjusted gCO₂e / minute (Estimated) | Confidence |
| --- | --- | --- |
| Loading… | | |
5.3 Video generation
Video carbon
carbon = carbon_per_second × duration_seconds × quality_multiplier
Video is by far the most carbon-intensive AI modality. A single second of AI-generated
video at high quality can produce 60–180 gCO₂e. Our estimates are derived from direct measurements of
CogVideoX (2025), with other models estimated relative to this.
| Model | CEAF-Adjusted gCO₂e / second (Estimated) | Confidence |
| --- | --- | --- |
| Loading… | | |
Video uncertainty is high (±50–60%). No standardized
benchmarks exist for AI video generation energy. These figures should be treated as
rough order-of-magnitude estimates.
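The per-unit audio and video formulas reduce to simple multiplications. The 0.014 g/min baseline is the Whisper-derived figure from Section 5.2; the 120 g/s video factor is an illustrative value within the 60–180 range stated above:

```python
# Per-unit multimodal formulas: audio per minute, video per second
# with a quality multiplier.
def audio_carbon_g(g_per_minute: float, duration_min: float) -> float:
    return g_per_minute * duration_min

def video_carbon_g(g_per_second: float, duration_s: float,
                   quality_multiplier: float = 1.0) -> float:
    return g_per_second * duration_s * quality_multiplier

# Ten minutes of transcription at the 0.014 g/min baseline,
# and 5 s of video at an assumed 120 g/s:
print(audio_carbon_g(0.014, 10))   # ~0.14 gCO2e
print(video_carbon_g(120, 5))      # 600 gCO2e
```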
6. Financial Cost
Alongside carbon, we calculate the financial cost of each query using the provider's
published API pricing:
Financial cost
cost = (input_tokens / 1,000,000) × input_price + (output_tokens / 1,000,000) × output_price
We also calculate a carbon offset cost — the theoretical cost of
offsetting the emissions through voluntary carbon credits, at approximately $25 per tonne
of CO₂e (Gold Standard market average, 2025). For most individual queries, this
is a tiny fraction of a cent, which helps put the scale of AI carbon emissions in context.
NB: The data we produce is our best estimate of emissions coming from a particular model derived using external data, and it may vary significantly from emission calculations derived internally. Use these figures with care when assessing offsets.
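Both cost calculations as a sketch. The $2.50 / $10.00 per-million-token prices are illustrative, not quoted provider pricing; the $25/tonne offset rate is from the text above:

```python
# Query cost from per-million-token API prices, plus the offset cost
# at ~$25 per tonne (1 tonne = 1,000,000 g). Prices are illustrative.
def query_cost_usd(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

def offset_cost_usd(carbon_g: float, usd_per_tonne: float = 25.0) -> float:
    return (carbon_g / 1_000_000) * usd_per_tonne

print(query_cost_usd(150, 400, 2.50, 10.00))  # ~0.004375 USD
print(offset_cost_usd(0.12))                  # ~3e-06 USD: a fraction of a cent
```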
6.2 EV Distance Equivalent
We express energy consumption as equivalent driving distance in a standard electric vehicle
(Nissan Leaf, 40 kWh battery, 149 miles / 240 km EPA range):
EV distance
ev_miles = (energy_Wh / 1,000) × 3.725
Where 3.725 miles/kWh is derived from the Nissan Leaf’s EPA-rated efficiency
(149 miles ÷ 40 kWh). This provides an intuitive sense of scale: a single text query
at roughly a third of a watt-hour equates to driving about two metres, while generating
a minute of AI video can equate to a far greater distance.
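The conversion as code, using only the constants stated above:

```python
# EV-distance conversion using the Nissan Leaf efficiency stated above
# (149 miles / 40 kWh = 3.725 miles per kWh).
def ev_miles(energy_wh: float) -> float:
    return energy_wh / 1_000 * 3.725

# A 0.34 Wh query equates to roughly 0.0013 miles, i.e. about two metres.
print(ev_miles(0.34))
```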
7. Carbon Savings Methodology
InferenceCarbon doesn't just estimate your carbon footprint — it tells you how much
carbon you saved (or added) compared to a reference baseline. This section explains
how we define that baseline, what counts as a "saving", and why we think the approach
is defensible.
7.1 The default baseline: GPT-4o at 267 words
Every carbon comparison in InferenceCarbon is measured against a reference baseline.
The default baseline is the same prompt sent to GPT-4o, generating a medium-length
(267-word) response. Users can customize the baseline model and response length
to match their own usage patterns.
In formula terms:
Baseline carbon
baseline (gCO₂e) = ((input_tokens + 356) / 1,000) × 0.21
Where input_tokens is your actual prompt re-tokenised at GPT-4o's rate
(~4 characters per token), 356 is the estimated output-token count for a
medium-length (267-word) response, and 0.21 is GPT-4o's CEAF-adjusted gCO₂e per 1,000 tokens.
Carbon saving
saving (gCO₂e) = baseline − actual
A positive saving means you chose a lower-carbon option. A
negative saving means your choice used more carbon than the baseline.
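The baseline and saving formulas as a sketch, using the constants from Section 7.1 (356 output tokens, 0.21 gCO₂e per 1k tokens):

```python
# Baseline-vs-actual saving, using the stated constants.
GPT4O_G_PER_1K = 0.21
BASELINE_OUTPUT_TOKENS = 356

def baseline_carbon_g(input_tokens: int) -> float:
    return ((input_tokens + BASELINE_OUTPUT_TOKENS) / 1_000) * GPT4O_G_PER_1K

def saving_g(input_tokens: int, actual_carbon_g: float) -> float:
    # Positive: you beat the baseline. Negative: you emitted more than it.
    return baseline_carbon_g(input_tokens) - actual_carbon_g

# A 100-token prompt answered by a model that emitted 0.05 g in total:
print(round(saving_g(100, 0.05), 4))  # ~0.0458 gCO2e saved
```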
7.2 Why GPT-4o?
We chose GPT-4o as the baseline model for three reasons:
- Market share: ChatGPT holds approximately 68% of the AI chatbot market (Similarweb, January 2026), and GPT-4o is its default model for both free and paid users. It is the single most widely used AI model in the world.
- Industry standard: GPT-4o is the de facto reference model used by researchers (Epoch AI, 2025), benchmarking platforms, and competing providers (Google, Anthropic, Meta) for performance and energy comparisons.
- Mid-range carbon intensity: At 0.21 gCO₂e per 1,000 tokens (CEAF-adjusted), GPT-4o sits in the middle of our model range (0.09–7.00). It is neither the cleanest nor the most carbon-intensive option, making it a fair yardstick rather than a cherry-picked extreme.
7.3 Why 267 words?
We use a medium-length (267-word, ~356-token) response as the baseline output length.
This is supported by published research on real-world AI conversations:
- LMSYS-Chat-1M (Zheng, Chiang et al., ICLR 2024) — the largest public dataset of real LLM conversations (1 million exchanges across 25 models) found an average model response of 214.5 tokens (~160 words).
- Epoch AI (February 2025) used 500 tokens (~375 words) as a "typical but somewhat pessimistic" estimate for energy calculations.
Our 267-word (356-token) baseline falls partway between these two reference points.
We believe this is the fairest choice: it avoids inflating savings (which a shorter
baseline would do) and avoids understating savings (which a longer baseline would do).
7.4 What counts as a saving?
Carbon savings come from three user decisions, each of which changes the actual
carbon footprint relative to the baseline:
| Decision | How it creates a saving | Example |
| --- | --- | --- |
| Model selection | Choosing a model with a lower carbon_per_1k_ceaf_adj value | Gemini 2.5 Flash-Lite (0.09) vs GPT-4o (0.21) = 57% saving |
| Response length | Requesting a shorter response reduces output tokens | Short (100 words) vs baseline (267 words) = fewer tokens processed |
| Prompt optimisation | Writing concise prompts reduces input tokens | A 20-word prompt vs a 200-word prompt saves input-side computation |
Important: The baseline uses your actual prompt, so prompt
optimisation lowers the baseline and the actual footprint together. The comparison
therefore isolates the effect of your model choice and response length, the two
decisions most directly within your control.
7.5 Honest limitations
We want to be transparent about the boundaries of this approach:
- The baseline is hypothetical. We estimate what would have happened if you'd used GPT-4o at 267 words. You may never have intended to use GPT-4o, so the "saving" is relative to a counterfactual, not a known fact.
- Output tokens are estimated. In the calculator, the baseline uses a fixed 356-token output estimate. When you use "Try It Live", we have your actual token count but still compare against the 356-token baseline. Real responses vary widely.
- Some model choices increase carbon. If you select a reasoning model like DeepSeek R1 App (7.00 gCO₂e/1k tokens), the "saving" is negative — you used more carbon than the baseline.
- The baseline will evolve. As market share shifts and new models emerge, the reference model may need updating. We'll document any changes here and explain the rationale.
- Savings inherit all uncertainty. Since both the baseline and actual estimates carry ±30% uncertainty, the savings figure has a combined uncertainty of roughly ±40%. Small differences between similar models should be interpreted with caution.
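The ±40% figure follows from adding the two ±30% uncertainties in quadrature, a common simplification that treats the baseline and actual errors as independent:

```python
import math

# Independent relative uncertainties combine in quadrature:
# sqrt(0.30^2 + 0.30^2) ~= 0.42, i.e. roughly +/-40%.
def combined_uncertainty(u_baseline: float, u_actual: float) -> float:
    return math.sqrt(u_baseline ** 2 + u_actual ** 2)

print(round(combined_uncertainty(0.30, 0.30), 2))  # 0.42
```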
7.6 Cumulative savings
On your dashboard, we sum individual query savings over time to show your total
carbon impact. This uses the same per-query formula applied to each tracked query
(from the browser extension, calculator, or Try It Live feature). Cumulative savings
are most meaningful when viewed as a trend rather than an absolute number.
A note on "savings" vs "reductions": InferenceCarbon shows how your
AI usage compares to a reference scenario. We deliberately use the word "saving" (vs
baseline) rather than "reduction" (vs your own past behavior) because we are comparing
against a hypothetical, not tracking a personal journey over time. Both framings have
value; ours is designed to be immediately actionable for every query.
8. Reference Papers
The methodology is based on the following InferenceCarbon reference papers (March 2026),
which provide provider-specific carbon intensity estimates:
- Estimating the Inference Carbon Intensity of Anthropic’s Claude Models — Claude Haiku/Sonnet/Opus families, multi-cloud (AWS+GCP), CIF tier 2.0
- Estimating the Inference Carbon Intensity of DeepSeek’s Models — V3/R1 models, dual-pathway (China app vs Azure), CIF tiers 3.0/1.5
- Estimating the Inference Carbon Intensity of Google Gemini Models — Gemini model family, 24/7 CFE data, CIF tier 1.0
- Estimating the Inference Carbon Intensity of Mistral’s Models — Small/Medium/Large 3 families, hosted on Azure/AWS, CIF tier 1.5
- Estimating the Inference Carbon Intensity of OpenAI ChatGPT Models — GPT-4o/4.1/5/o3 families, Azure infrastructure, CIF tier 1.5
We also publish modality-specific methodology papers covering audio, image, and video generation models.
9. Sources & References
Our methodology draws on the following published research:
- Luccioni, A.S., Viguier, S., & Ligozat, A.-L. (2023). "Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model." Journal of Machine Learning Research, 24(253), 1–15. Direct measurements of model energy per token — our primary baseline source.
- Luccioni, A.S. (2024). "Measuring the Energy Consumption of AI Image Generation." Direct energy measurements of Stable Diffusion at various resolutions and step counts.
- Patterson, D., et al. (2021, 2022). "Carbon Emissions and Large Neural Network Training." arXiv / IEEE Computer. Energy scaling laws, GPU utilisation patterns, and efficiency trends.
- Dodge, J., et al. (2022). "Measuring the Carbon Intensity of AI in Cloud Instances." FAccT 2022. Uncertainty quantification methods for ML carbon estimation.
- Patel, A., et al. (2024). "Splitwise: Efficient GPU Inference with Model Parallelism." ASPLOS 2024. GPU power draw at 60–80% TDP during inference workloads.
- Jegham, I., et al. (2025). "Towards Sustainable AI: A Comprehensive Framework for Green Large Language Models." arXiv:2505.09598v6. Independent energy measurements per token across major LLM providers — the primary anchor data source for all carbon estimates.
- Google (2025). "Measuring Environmental Impact of AI Inference." arXiv:2508.15734. Google's own measurement of 0.24 Wh per median Gemini prompt — cross-validates the Jegham framework.
- Altman, S. (June 2025). "The Gentle Singularity." CEO disclosure of 0.34 Wh average ChatGPT query — the anchor data point for all OpenAI estimates.
- Mistral AI / Carbone 4 (2025). "Mistral Large 2 Life Cycle Assessment." Provider-published, peer-reviewed LCA — the only independently audited per-token carbon figure in the industry (1.14 gCO₂e/400 tokens).
- Samsi, S., et al. (2023). "Scaling Large Language Models on Edge Accelerators." IEEE HPEC 2023. Energy per token estimates for large models.
- International Energy Agency (2023, 2025). "Emission Factors" / "Electricity Mid-Year Update 2025." Regional and national grid carbon intensity data (gCO₂e/kWh), including Chinese grid CIF forecasts.
- EPA eGRID (2024). "Emissions & Generation Resource Integrated Database." US grid carbon intensity data (370 gCO₂e/kWh) used for OpenAI and Anthropic gross estimates.
- Ember (2024). "Global Electricity Review." Chinese grid carbon intensity (581 gCO₂e/kWh) used for DeepSeek domestic estimates.
- GHG Protocol (2025). "Scope 2 Guidance: Location-Based and Market-Based Accounting." Framework for our dual-reporting approach (gross location-based vs. CEAF market-adjusted).
- Artificial Analysis (2026). "LLM Throughput Benchmarks." Third-party throughput measurements (tokens/second) used for scaling estimates across all providers.
- DeepSeek (2024, 2025). "DeepSeek-V3 Technical Report" / "DeepSeek-R1 Paper." Architecture details (MoE, 671B/37B active parameters) and training methodology.
- Zheng, L., Chiang, W., et al. (2024). "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." ICLR 2024. Average response length of 214.5 tokens across 1M real conversations — used to validate our 267-word baseline.
- Epoch AI (2025). "How Much Energy Does ChatGPT Use?" Uses 500 output tokens as a typical estimate for GPT-4o energy calculations — upper bound for our baseline validation.
Feedback
If you spot an error in our methodology, have access to better data, or want to suggest
improvements, we'd genuinely love to hear from you. Getting this right matters, and
we know we don't have all the answers.