LLMSTXT Tutorial: Supervising AI's Use of Your Content

A practical tutorial for all organisations and site owners (SMEs, e-commerce, media, NGOs, local authorities, institutions, bloggers, publishers). It explains what llmstxt is: a text file that frames how AI systems may use your content (training, summarisation, derivatives, retention, attribution). You'll see why to adopt it, how to write it, generate it automatically, host it, test it and complement it with robots.txt, meta tags and HTTP headers. The guide offers ready-to-use templates (opt-out, non-commercial, granularity by folder/agent), a table of common directives, Nginx/Apache examples and a plan for adapting your policy by GEO (countries/languages). If needed, targeted support can help frame the strategy quickly; for example, Lumineth can assist with GEO adaptation and consistency with your legal notices, without making you dependent on a service provider.


Introduction

In 2025, more and more AI tools are crawling and indexing the web to summarise pages, feed answer engines, train models or generate product sheets. Many publishers want to clarify their preferences: "OK for summarising and indexing, not for commercial training", "OK for extracts with attribution", "no retention beyond 7 days", and so on. The llmstxt file provides a simple, machine-readable way to express these preferences at the root of your site.

llmstxt is not a legal lock: it is a technical convention, comparable to robots.txt, that is gaining traction as serious players adopt it. Properly configured and documented, it reduces ambiguity, eases exchanges with AI providers and provides a dated record in the event of a dispute ("our preferences were published on such and such a date").


1) LLMSTXT in two words: principle, scope, limitations

Principle: host a UTF-8 text file (e.g. https://votre-domaine/llmstxt) listing your preferences for use by "AI agents" (crawlers, API clients, indexers): training, extraction/summarisation, derivative creation, data retention, attribution, licensing, etc. You can target all agents (*) or a named agent.
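
For illustration, a minimal file could look like this (directive names are detailed in section 3 and may vary slightly between implementations):

User-agent: *
Training: disallow
Summarization: allow
Attribution: require
Contact: legal@exemple.fr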

Scope: llmstxt complements robots.txt. robots.txt controls access and crawling; llmstxt describes the authorised uses of the content consulted. The two are usefully combined.

Limits: players may ignore it. Its value lies in its clarity, traceability and increasingly widespread adoption in the ecosystem.


2) Why adopt it (and for whom)

llmstxt is useful if you:

  • publish valuable content (articles, tutorials, photos, catalogues, datasheets, help docs);
  • want to authorise certain uses (summary, search) while refusing others (commercial training, derivatives without attribution);
  • have GEO constraints (EU/non-EU, country-specific) or content-type constraints (text yes, images no);
  • want a clear point of contact for AI players and a framed retention policy.

Benefits: less ambiguity, reduced legal friction, signal of seriousness to your users and partners.

Diagram: robots.txt for access, llmstxt for use; the two complement each other

3) Anatomy of an llmstxt (frequent directives)

The syntax remains readable (key: value pairs, User-agent sections). Here are the common directives:

Directive | Role | Example
User-agent | Targeted AI agent (* for all) | User-agent: *
Allow / Disallow | Allowed/forbidden paths | Disallow: /private/
Training | Training allowed/prohibited/restricted | Training: disallow
Derivatives | Derivative works (allow / non-commercial / disallow) | Derivatives: non-commercial
Summarization | Summarisation/extraction allowed | Summarization: allow
Retention | Retention period | Retention: 7d
Attribution | Requires source attribution | Attribution: require
License | Licence applied | License: CC BY-NC 4.0
Policy | Link to your Terms of Use | Policy: https://exemple.fr/conditions
Contact | AI/legal contact | Contact: legal@exemple.fr
Crawl-Delay / Rate-Limit | Crawl pacing / request quota | Crawl-Delay: 5; Rate-Limit: 1000/day
Sitemaps | Location(s) of sitemaps | Sitemaps: https://exemple.fr/sitemap.xml

Names may vary slightly between implementations. The important thing: remain explicit, stable, and document your choice in your Terms of Use.
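
Before publishing, you can sanity-check a file with a few lines of Python. This is an illustrative sketch, not an official parser: it assumes the simple syntax above (sections opened by User-agent, one key: value directive per line) and simply prints what it finds.

from pathlib import Path

def parse_llmstxt(text: str) -> list[dict]:
    """Group directives into one dict per User-agent section."""
    sections, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        # skip blanks, comments and lines without a key: value pair
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "user-agent":
            current = {"user-agent": value}
            sections.append(current)
        elif current is not None:
            current[key.lower()] = value
    return sections

for section in parse_llmstxt(Path("llmstxt").read_text(encoding="utf-8")):
    print(section)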


4) Ready-made templates

Adapt emails, URLs and paths to your context. Copy and paste then personalise.

4.1 Maximum opt-out (no training, minimal usage)

    
    
# llmstxt - Maximum opt-out
User-agent: *
Disallow: /
Training: disallow
Summarization: disallow
Derivatives: disallow
Retention: 0
Attribution: require
Policy: https://exemple.fr/conditions
Contact: legal@exemple.fr
Sitemaps: https://exemple.fr/sitemap.xml
    
  

4.2 Allow summary/search, deny commercial training

    
    
# llmstxt - Editorial use OK, no commercial training
User-agent: *
Allow: /
Training: non-commercial-only
Summarization: allow
Derivatives: allow-with-attribution
Retention: 7d
Attribution: require
License: CC BY-NC 4.0
Policy: https://exemple.fr/conditions
Contact: ai@exemple.fr
    
  

4.3 Policy by content type (text OK, images no)

    
    
# llmstxt - Text allowed (summary/indexing), images prohibited
User-agent: *
Allow: /articles/
Disallow: /media_lumineth/images_lumineth/
Training: disallow
Summarization: allow
Derivatives: allow-with-attribution
Retention: 14d
Attribution: require
Policy: https://exemple.fr/conditions
Contact: legal@exemple.fr
    
  

4.4 Agent-based targeting (allow one partner, block the rest)

    
    
# llmstxt - Policy by User-agent
User-agent: PartnerAI
Allow: /
Training: allow
Summarization: allow
Derivatives: allow-with-attribution
Retention: 30d
Attribution: require
Policy: https://exemple.fr/conditions
Contact: partner@exemple.fr
User-agent: *
Training: disallow
Summarization: allow
Derivatives: non-commercial
Retention: 7d
Attribution: require
    
  

5) Automatically generate your llmstxt

Two quick options: a minimal shell script and a Python generator that can be integrated into CI/CD.

5.1 Minimal bash script

    
    
#!/usr/bin/env bash
DOMAIN="exemple.fr"
CONTACT="legal@exemple.fr"
POLICY="https://exemple.fr/conditions"
cat > llmstxt <<EOF
# llmstxt generated on $(date -u +"%Y-%m-%dT%H:%M:%SZ")
User-agent: *
Allow: /
Training: non-commercial-only
Summarization: allow
Derivatives: allow-with-attribution
Retention: 14d
Attribution: require
Policy: $POLICY
Contact: $CONTACT
Sitemaps: https://$DOMAIN/sitemap.xml
EOF
echo "llmstxt file generated."
    
  
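Make the script executable with chmod +x and run it from the directory served as your web root, or call it from a deploy hook so the generation timestamp stays current.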

5.2 Python generator (CI/CD)

    
    
from datetime import datetime, timezone
CONFIG = {
  "user_agent": "*",
  "training": "non-commercial-only",
  "summarization": "allow",
  "derivatives": "allow-with-attribution",
  "retention": "14d",
  "attribution": "require",
  "policy": "https://exemple.fr/conditions",
  "contact": "legal@exemple.fr",
  "sitemaps": ["https://exemple.fr/sitemap.xml"]
}
def render_llmstxt(cfg: dict) -> str:
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [f"# llmstxt generated on {ts}",
             f"User-agent: {cfg['user_agent']}"]
    lines += [
        f"Training: {cfg['training']}",
        f"Summarization: {cfg['summarization']}",
        f"Derivatives: {cfg['derivatives']}",
        f"Retention: {cfg['retention']}",
        f"Attribution: {cfg['attribution']}",
        f"Policy: {cfg['policy']}",
        f"Contact: {cfg['contact']}",
    ]
    if cfg.get("sitemaps"):
        lines.append("Sitemaps: " + ", ".join(cfg["sitemaps"]))
    return "\n".join(lines) + "\n"

with open("llmstxt", "w", encoding="utf-8") as f:
    f.write(render_llmstxt(CONFIG))
print("llmstxt written.")
    
  

Add these scripts to your repository (e.g. /ops directory) and trigger generation on each policy/licence change via your pipeline.


6) Where to host & how to serve the file

Recommended locations:

  • https://votre-domaine/llmstxt (site root);
  • optional: variant under /.well-known/ (according to your internal conventions).

Serve it with Content-Type: text/plain; charset=utf-8 and a reasonable cache (e.g. 24 h).

6.1 Nginx

    
    
location = /llmstxt {
  default_type text/plain;
  charset utf-8;
  add_header Cache-Control "public, max-age=86400";
  try_files /llmstxt =404;
}
    
  

6.2 Apache (.htaccess)

    
    
<Files "llmstxt">
  ForceType text/plain
  AddDefaultCharset utf-8
  Header set Cache-Control "public, max-age=86400"
</Files>
    
  

Express test:

    
    
curl -I https://votre-domaine/llmstxt
# Expected: status 200 and Content-Type: text/plain; charset=utf-8
    
  
Pipeline: Git repository → CI/CD → web server → CDN to distribute llmstxt
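
For CI or monitoring, the curl check above can be scripted; here is a minimal sketch using only the Python standard library (the URL is a placeholder to adapt):

import urllib.request

URL = "https://votre-domaine/llmstxt"  # placeholder: adapt to your domain

def check_llmstxt(url: str) -> None:
    # urlopen raises on 4xx/5xx; we still verify status and media type
    with urllib.request.urlopen(url, timeout=10) as resp:
        ctype = resp.headers.get("Content-Type", "")
        assert resp.status == 200, f"unexpected status {resp.status}"
        assert ctype.startswith("text/plain"), f"unexpected Content-Type: {ctype}"
        print("OK:", resp.status, ctype)

check_llmstxt(URL)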

7) Complete llmstxt with robots.txt, meta and headers

Combine several coherent signals:

7.1 robots.txt excerpts (known training agents)

    
    
# robots.txt - examples
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Allow: /
    
  

7.2 Meta tags and HTTP headers

    
    
<!-- In <head>: document-side signals (variable support) -->
<meta name="robots" content="noai, noimageai">
    
  
    
    
# Apache - global header
Header set X-Robots-Tag "noai, noimageai"
    
  

The goal is not the "perfect" cover, but a coherent and documented set: robots.txt (access), llmstxt (usage), meta/headers (page-level preferences).
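
To confirm the three signals stay coherent after each deployment, a small sketch (standard library only; the domain is a placeholder) can fetch robots.txt, llmstxt and the homepage headers in one pass:

import urllib.request

BASE = "https://votre-domaine"  # placeholder: adapt to your domain

def fetch(path: str):
    """Return (status, headers, body) for a GET on BASE + path."""
    with urllib.request.urlopen(BASE + path, timeout=10) as resp:
        return resp.status, resp.headers, resp.read().decode("utf-8", "replace")

status, _, robots = fetch("/robots.txt")
print("robots.txt:", status, "| GPTBot mentioned:", "GPTBot" in robots)

status, _, llms = fetch("/llmstxt")
print("llmstxt:", status, "| has Training directive:",
      any(l.lower().startswith("training:") for l in llms.splitlines()))

status, headers, _ = fetch("/")
print("X-Robots-Tag on homepage:", headers.get("X-Robots-Tag", "absent"))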


8) GEO, languages and multi-domains: localise without overcomplicating

Depending on your markets, you may want to differentiate the policy by country or language. Three options:

  1. Single policy (recommended) for all countries: simple and robust.
  2. One policy per domain (e.g. example.fr vs example.com): publish an llmstxt specific to each TLD.
  3. Dynamic response by region (GeoIP/CDN): powerful but tricky (crawlers use varied IPs); document it clearly.

Nginx sketch example (to be adapted with a real GeoIP module):

    
    
# GEO PSEUDO-CONFIG - replace with your real GeoIP module
map $geoip2_country_code $is_eu {
  default 0;
  FR 1; BE 1; CH 0; DE 1; IT 1; ES 1;
}
location = /llmstxt {
  default_type text/plain;
  charset utf-8;
  # "if" is acceptable here because the block only returns
  if ($is_eu) {
    return 200 "User-agent: *\nTraining: disallow\nSummarization: allow\nPolicy: https://exemple.fr/conditions\n";
  }
  return 200 "User-agent: *\nTraining: non-commercial-only\nSummarization: allow\nPolicy: https://example.com/terms\n";
}
    
  

If you need a realistic GEO strategy (by domain/language) and alignment with your legal notices and consent flows, targeted support can save you time. For example, Lumineth can help you design this policy, document and deploy it, then leave you fully autonomous.

Schematic map: variants by TLD and GEO logic; documentation and version log

9) Testing, monitoring and proof

Check that the file is accessible and keep records:

  • Test over HTTP: curl -I → status 200 + correct Content-Type.
  • Monitor from multiple regions (uptime).
  • Log access to /llmstxt to identify which agents fetch it.
  • Version every change (Git), timestamped in UTC.
    
    
# Nginx: dedicated log for /llmstxt (log_format and map go in the http{} context)
log_format llm '$remote_addr - $time_local "$request" $status "$http_user_agent"';
map $request_uri $llm_log { default 0; /llmstxt 1; }
access_log /var/log/nginx/llm_access.log llm if=$llm_log;
    
  
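To see which agents actually read the file, a short sketch (assuming the llm log format declared above) can rank user agents in the dedicated log:

from collections import Counter

LOG = "/var/log/nginx/llm_access.log"  # path from the Nginx example above

counts = Counter()
with open(LOG, encoding="utf-8") as logfile:
    for line in logfile:
        # The user agent is the last quoted field in the "llm" format above
        parts = line.rstrip("\n").rsplit('"', 2)
        if len(parts) == 3:
            counts[parts[1]] += 1

for agent, hits in counts.most_common(10):
    print(f"{hits:6d}  {agent}")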

10) Governance: who decides, who updates

llmstxt touches both editorial and legal matters. To make it stick over time:

  • Designated manager (e.g. product, legal, communications).
  • Periodic review (every 6-12 months, or on any policy/licence change).
  • Change log (in footer and in Git).
  • Consistency with Terms of Use, consent banners, robots.txt.
    
    
# Change log (example)
# 2025-08-26T13:00:00Z - Added Derivatives: allow-with-attribution
# 2025-07-10T09:30:00Z - Training: disallow - editorial decision
    
  

Points to remember

  • llmstxt expresses your AI usage preferences (training, summarisation, derivatives, retention, attribution).
  • It complements robots.txt, meta tags and HTTP headers.
  • Start simple (opt-out/opt-in), then refine by folders, agents or licences.
  • Keep the file in text/plain at the root, log and version.
  • For GEO, favour simplicity (one policy per TLD); document any dynamic logic.
  • One-off support (e.g. Lumineth) can speed up scoping, testing and deployment, then leave you fully autonomous.

Conclusion

Publishing an llmstxt doesn't protect you from everything, but it clarifies your position and facilitates dialogue with the AI ecosystem. It's a simple, inexpensive building block, compatible with your existing mechanisms (robots.txt, meta/headers, legal notices). Start from a template, place it at the root, test its accessibility, then set up an update routine. If your business spans several countries/languages, a well thought-out GEO strategy will help you avoid pitfalls, and it can be put in place quickly with targeted support.


Appendix

Express implementation checklist:

  • Choose initial policy (opt-out, non-commercial, granularity).
  • Write the first version (template adapted to your case).
  • Publish to root, serve in text/plain, cache 24 h.
  • Align robots.txt, meta/headers and Terms of Use.
  • Activate an access log to /llmstxt and Git versioning.
  • Schedule a periodic review (see section 10); log changes in UTC.
