LLMSTXT Tutorial: Supervising AI's Use of Your Content

A practical tutorial for all organisations and site owners (SMEs, e-commerce, media, NGOs, local authorities, institutions, bloggers, publishers). It explains what llmstxt is: a text file that frames how AI systems may use your content (training, summarisation, derivatives, retention, attribution). You'll see why to adopt it, how to write it, generate it automatically, host it, test it and complement it with robots.txt, meta tags and HTTP headers. The guide offers ready-to-use templates (opt-out, non-commercial, granularity by folder/agent), a table of common directives, Nginx/Apache examples and a plan for adapting your policy by GEO (countries/languages). If needed, targeted support can help frame the strategy quickly; for example, Lumineth can assist with GEO adaptation and consistency with your legal notices, without making you dependent on a service provider.


Introduction

In 2025, more and more AI tools are crawling and indexing the web to summarise pages, feed answer engines, train models or generate product sheets. Many publishers want to clarify their preferences: "OK for summarising and indexing, not for commercial training", "OK for extracts with attribution", "no retention beyond 7 days", and so on. The llmstxt file provides a simple, machine-readable way to express these preferences at the root of your site.

llmstxt is not a legal lock: it is a technical convention, comparable to robots.txt, that is gaining traction as serious players adopt it. Properly configured and documented, it reduces ambiguity, eases exchanges with AI providers and provides a dated record in the event of a dispute ("our preferences were published on such and such a date").


1) LLMSTXT in two words: principle, scope, limitations

Principle: host a UTF-8 text file (e.g. https://votre-domaine/llmstxt) listing your preferences for use by "AI agents" (crawlers, API clients, indexers): training, extraction/summarisation, derivative creation, data retention, attribution, licensing, etc. You can target all agents (*) or a named agent.
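
For illustration, a minimal file could look like this (directive names are detailed in section 3 and may vary slightly between implementations):

User-agent: *
Training: disallow
Summarization: allow
Attribution: require
Contact: legal@exemple.fr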

Scope: llmstxt complements robots.txt. robots.txt controls access and crawling; llmstxt describes the authorised uses of the content consulted. The two are usefully combined.

Limits: players may ignore it. Its value lies in its clarity, traceability and increasingly widespread adoption in the ecosystem.


2) Why adopt it (and for whom)

llmstxt is useful if you:

  • publish valuable content (articles, tutorials, photos, catalogues, datasheets, help docs);
  • want to authorise certain uses (summary, search) while refusing others (commercial training, derivatives without attribution);
  • have GEO constraints (EU/non-EU, country-specific) or content-type constraints (text yes, images no);
  • want a clear point of contact for AI players and a framed retention policy.

Benefits: less ambiguity, reduced legal friction, signal of seriousness to your users and partners.

Diagram: robots.txt for access, llmstxt for use; the two complement each other

3) Anatomy of an llmstxt (frequent directives)

The syntax remains readable (key: value pairs, User-agent sections). Here are the common directives:

Directive | Role | Example
User-agent | Targeted AI agent (* for all) | User-agent: *
Allow / Disallow | Allowed/forbidden paths | Disallow: /private/
Training | Training allowed/prohibited/restricted | Training: disallow
Derivatives | Derivative works (allow / non-commercial / disallow) | Derivatives: non-commercial
Summarization | Summarisation/extraction allowed | Summarization: allow
Retention | Retention period | Retention: 7d
Attribution | Requires source attribution | Attribution: require
License | Licence applied | License: CC BY-NC 4.0
Policy | Link to your Terms of Use | Policy: https://exemple.fr/conditions
Contact | AI/legal contact | Contact: legal@exemple.fr
Crawl-Delay / Rate-Limit | Crawl pacing / request quota | Crawl-Delay: 5; Rate-Limit: 1000/day
Sitemaps | Location(s) of sitemaps | Sitemaps: https://exemple.fr/sitemap.xml

Names may vary slightly between implementations. The important thing: remain explicit, stable, and document your choice in your Terms of Use.
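
Before publishing, you can sanity-check a file with a few lines of Python. This is an illustrative sketch, not an official parser: it assumes the simple syntax above (sections opened by User-agent, one key: value directive per line) and simply prints what it finds.

from pathlib import Path

def parse_llmstxt(text: str) -> list[dict]:
    """Group directives into one dict per User-agent section."""
    sections, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        # skip blanks, comments and lines without a key: value pair
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "user-agent":
            current = {"user-agent": value}
            sections.append(current)
        elif current is not None:
            current[key.lower()] = value
    return sections

for section in parse_llmstxt(Path("llmstxt").read_text(encoding="utf-8")):
    print(section)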


4) Ready-made templates

Adapt emails, URLs and paths to your context. Copy and paste then personalise.

4.1 Maximum opt-out (no training, minimal usage)

    
    
# llmstxt - Maximum opt-out
User-agent: *
Disallow: /
Training: disallow
Summarization: disallow
Derivatives: disallow
Retention: 0
Attribution: require
Policy: https://exemple.fr/conditions
Contact: legal@exemple.fr
Sitemaps: https://exemple.fr/sitemap.xml
    
  

4.2 Allow summary/search, deny commercial training

    
    
# llmstxt - Editorial use OK, no commercial training
User-agent: *
Allow: /
Training: non-commercial-only
Summarization: allow
Derivatives: allow-with-attribution
Retention: 7d
Attribution: require
License: CC BY-NC 4.0
Policy: https://exemple.fr/conditions
Contact: ai@exemple.fr
    
  

4.3 Policy by content type (text OK, images no)

    
    
# llmstxt - Text allowed (summary/indexing), images prohibited
User-agent: *
Allow: /articles/
Disallow: /media_lumineth/images_lumineth/
Training: disallow
Summarization: allow
Derivatives: allow-with-attribution
Retention: 14d
Attribution: require
Policy: https://exemple.fr/conditions
Contact: legal@exemple.fr
    
  

4.4 Agent-based targeting (allow one partner, block the rest)

    
    
# llmstxt - Policy by User-agent
User-agent: PartnerAI
Allow: /
Training: allow
Summarization: allow
Derivatives: allow-with-attribution
Retention: 30d
Attribution: require
Policy: https://exemple.fr/conditions
Contact: partner@exemple.fr
User-agent: *
Training: disallow
Summarization: allow
Derivatives: non-commercial
Retention: 7d
Attribution: require
    
  

5) Automatically generate your llmstxt

Two quick options: a minimal shell script and a Python generator that can be integrated into CI/CD.

5.1 Minimal bash script

    
    
#!/usr/bin/env bash
DOMAIN="exemple.fr"
CONTACT="legal@exemple.fr"
POLICY="https://exemple.fr/conditions"
cat > llmstxt <<EOF
# llmstxt generated on $(date -u +"%Y-%m-%dT%H:%M:%SZ")
User-agent: *
Allow: /
Training: non-commercial-only
Summarization: allow
Derivatives: allow-with-attribution
Retention: 14d
Attribution: require
Policy: $POLICY
Contact: $CONTACT
Sitemaps: https://$DOMAIN/sitemap.xml
EOF
echo "llmstxt file generated."
    
  
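Make the script executable with chmod +x and run it from the directory served as your web root, or call it from a deploy hook so the generation timestamp stays current.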

5.2 Python generator (CI/CD)

    
    
from datetime import datetime, timezone
CONFIG = {
  "user_agent": "*",
  "training": "non-commercial-only",
  "summarization": "allow",
  "derivatives": "allow-with-attribution",
  "retention": "14d",
  "attribution": "require",
  "policy": "https://exemple.fr/conditions",
  "contact": "legal@exemple.fr",
  "sitemaps": ["https://exemple.fr/sitemap.xml"]
}
def render_llmstxt(cfg: dict) -> str:
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [f"# llmstxt generated on {ts}",
             f"User-agent: {cfg['user_agent']}"]
    lines += [
        f"Training: {cfg['training']}",
        f"Summarization: {cfg['summarization']}",
        f"Derivatives: {cfg['derivatives']}",
        f"Retention: {cfg['retention']}",
        f"Attribution: {cfg['attribution']}",
        f"Policy: {cfg['policy']}",
        f"Contact: {cfg['contact']}",
    ]
    if cfg.get("sitemaps"):
        lines.append("Sitemaps: " + ", ".join(cfg["sitemaps"]))
    return "\n".join(lines) + "\n"

with open("llmstxt", "w", encoding="utf-8") as f:
    f.write(render_llmstxt(CONFIG))
print("llmstxt written.")
    
  

Add these scripts to your repository (e.g. /ops directory) and trigger generation on each policy/licence change via your pipeline.


6) Where to host & how to serve the file

Recommended locations:

  • https://votre-domaine/llmstxt (site root);
  • optional: variant under /.well-known/ (according to your internal conventions).

Serve it with Content-Type: text/plain; charset=utf-8 and a reasonable cache (e.g. 24 h).

6.1 Nginx

    
    
location = /llmstxt {
  default_type text/plain;
  charset utf-8;
  add_header Cache-Control "public, max-age=86400";
  try_files /llmstxt =404;
}
    
  

6.2 Apache (.htaccess)

    
    
<Files "llmstxt">
  ForceType text/plain
  AddDefaultCharset utf-8
  Header set Cache-Control "public, max-age=86400"
</Files>
    
  

Express test:

    
    
curl -I https://votre-domaine/llmstxt
# Expected: status 200 and Content-Type: text/plain; charset=utf-8
    
  
Pipeline: Git repository → CI/CD → web server → CDN to distribute llmstxt
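
For CI or monitoring, the curl check above can be scripted; here is a minimal sketch using only the Python standard library (the URL is a placeholder to adapt):

import urllib.request

URL = "https://votre-domaine/llmstxt"  # placeholder: adapt to your domain

def check_llmstxt(url: str) -> None:
    # urlopen raises on 4xx/5xx; we still verify status and media type
    with urllib.request.urlopen(url, timeout=10) as resp:
        ctype = resp.headers.get("Content-Type", "")
        assert resp.status == 200, f"unexpected status {resp.status}"
        assert ctype.startswith("text/plain"), f"unexpected Content-Type: {ctype}"
        print("OK:", resp.status, ctype)

check_llmstxt(URL)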

7) Complete llmstxt with robots.txt, meta and headers

Combine several coherent signals:

7.1 robots.txt excerpts (known training agents)

    
    
# robots.txt - examples
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Allow: /
    
  

7.2 Meta tags and HTTP headers

    
    
<!-- In <head>: document-side signals (variable support) -->
<meta name="robots" content="noai, noimageai">
    
  
    
    
# Apache - global header
Header set X-Robots-Tag "noai, noimageai"
    
  

The goal is not the "perfect" cover, but a coherent and documented set: robots.txt (access), llmstxt (usage), meta/headers (page-level preferences).
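
To confirm the three signals stay coherent after each deployment, a small sketch (standard library only; the domain is a placeholder) can fetch robots.txt, llmstxt and the homepage headers in one pass:

import urllib.request

BASE = "https://votre-domaine"  # placeholder: adapt to your domain

def fetch(path: str):
    """Return (status, headers, body) for a GET on BASE + path."""
    with urllib.request.urlopen(BASE + path, timeout=10) as resp:
        return resp.status, resp.headers, resp.read().decode("utf-8", "replace")

status, _, robots = fetch("/robots.txt")
print("robots.txt:", status, "| GPTBot mentioned:", "GPTBot" in robots)

status, _, llms = fetch("/llmstxt")
print("llmstxt:", status, "| has Training directive:",
      any(l.lower().startswith("training:") for l in llms.splitlines()))

status, headers, _ = fetch("/")
print("X-Robots-Tag on homepage:", headers.get("X-Robots-Tag", "absent"))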


8) GEO, languages and multi-domains: localise without overcomplicating

Depending on your markets, you may want to differentiate the policy by country or language. Three options:

  1. Single policy (recommended) for all countries: simple and robust.
  2. One policy per domain (e.g. example.fr vs example.com): publish an llmstxt specific to each TLD.
  3. Dynamic response by region (GeoIP/CDN): powerful but tricky (crawlers use varied IPs); document it clearly.

Nginx sketch example (to be adapted with a real GeoIP module):

    
    
# GEO PSEUDO-CONFIG - replace with your real GeoIP module
map $geoip2_country_code $is_eu {
  default 0;
  FR 1; BE 1; CH 0; DE 1; IT 1; ES 1;
}
location = /llmstxt {
  default_type text/plain;
  charset utf-8;
  # "if" is acceptable here because the block only returns
  if ($is_eu) {
    return 200 "User-agent: *\nTraining: disallow\nSummarization: allow\nPolicy: https://exemple.fr/conditions\n";
  }
  return 200 "User-agent: *\nTraining: non-commercial-only\nSummarization: allow\nPolicy: https://example.com/terms\n";
}
    
  

If you need a realistic GEO strategy (by domain/language) and alignment with your legal notices and consent flows, targeted support can save you time. For example, Lumineth can help you design this policy, document and deploy it, then leave you fully autonomous.

Schematic map: variants by TLD and GEO logic; documentation and version log

9) Testing, monitoring and proof

Check that the file is accessible and keep records:

  • Test over HTTP: curl -I → status 200 + correct Content-Type.
  • Monitor from multiple regions (uptime).
  • Log access to /llmstxt to identify which agents fetch it.
  • Version every change (Git), timestamped in UTC.
    
    
# Nginx: dedicated log for /llmstxt (log_format and map go in the http{} context)
log_format llm '$remote_addr - $time_local "$request" $status "$http_user_agent"';
map $request_uri $llm_log { default 0; /llmstxt 1; }
access_log /var/log/nginx/llm_access.log llm if=$llm_log;
    
  
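To see which agents actually read the file, a short sketch (assuming the llm log format declared above) can rank user agents in the dedicated log:

from collections import Counter

LOG = "/var/log/nginx/llm_access.log"  # path from the Nginx example above

counts = Counter()
with open(LOG, encoding="utf-8") as logfile:
    for line in logfile:
        # The user agent is the last quoted field in the "llm" format above
        parts = line.rstrip("\n").rsplit('"', 2)
        if len(parts) == 3:
            counts[parts[1]] += 1

for agent, hits in counts.most_common(10):
    print(f"{hits:6d}  {agent}")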

10) Governance: who decides, who updates

llmstxt touches both editorial and legal matters. To make it stick over time:

  • Designated manager (e.g. product, legal, communications).
  • Periodic review (every 6-12 months, or on any policy/licence change).
  • Change log (in footer and in Git).
  • Consistency with Terms of Use, consent banners, robots.txt.
    
    
# Change log (example)
# 2025-08-26T13:00:00Z - Added Derivatives: allow-with-attribution
# 2025-07-10T09:30:00Z - Training: disallow - editorial decision
    
  

Points to remember

  • llmstxt expresses your AI usage preferences (training, summarisation, derivatives, retention, attribution).
  • It complements robots.txt, meta tags and HTTP headers.
  • Start simple (opt-out/opt-in), then refine by folders, agents or licences.
  • Keep the file in text/plain at the root, log and version.
  • For GEO, favour simplicity (one policy per TLD); document any dynamic logic.
  • One-off support (e.g. Lumineth) can speed up scoping, testing and deployment, then leave you fully autonomous.

Conclusion

Publishing an llmstxt doesn't protect you from everything, but it clarifies your position and facilitates dialogue with the AI ecosystem. It's a simple, inexpensive building block, compatible with your existing mechanisms (robots.txt, meta/headers, legal notices). Start from a template, place it at the root, test its accessibility, then set up an update routine. If your business spans several countries/languages, a well thought-out GEO strategy will help you avoid pitfalls, and it can be put in place quickly with targeted support.


Appendix

Express implementation checklist:

  • Choose initial policy (opt-out, non-commercial, granularity).
  • Write the first version (template adapted to your case).
  • Publish to root, serve in text/plain, cache 24 h.
  • Align robots.txt, meta/headers and Terms of Use.
  • Activate an access log to /llmstxt and Git versioning.
  • Schedule a periodic review (see section 10); log changes in UTC.
