AI Search Engines Only Cite HTML — New Research Confirms Markdown Is Invisible to LLMs
AI SEO

Apr 18, 2026

New research from OtterlyAI, published this April, has identified a technical blind spot that should immediately change how content teams publish: AI search engines — including ChatGPT, Google Gemini, and Perplexity — only cite HTML pages. Markdown (.md) files are invisible to these systems. They are not being indexed, processed, or cited in AI-generated answers. If any part of your content infrastructure publishes in Markdown format, you have a silent AI visibility gap that keyword optimisation, link building, and content quality improvements alone cannot fix.

What the Research Found

OtterlyAI tested AI system responses to identical content published in both HTML and Markdown formats across multiple query types and content categories. The finding was categorical: AI citation systems consistently selected HTML pages while ignoring structurally equivalent Markdown content. The gap was not a matter of degree — Markdown pages received zero citations regardless of content quality, topical relevance, domain authority, or backlink profile.

The underlying reason is architectural. AI crawlers and the citation processing systems built on top of them are designed around HTML’s semantic structure: heading elements (H1-H6), paragraph tags, list items, anchor elements, schema markup, and meta tags. These provide the machine-readable signals AI systems use to understand content hierarchy, extract answer candidates, and assess citation confidence.

Markdown’s simpler syntax — excellent for developer workflows and human readability — lacks this semantic richness. A Markdown heading and a Markdown paragraph look identical to a text renderer but are structurally ambiguous to an AI citation system expecting HTML’s explicit semantic scaffolding. Without those signals, AI systems cannot confidently extract, attribute, or cite the content — so they do not.
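To make the contrast concrete, here is a hypothetical fragment showing what the same content block looks like once served as semantic HTML (the heading `id` and the pricing sentence are invented for illustration):

```html
<!-- The Markdown source "## Pricing" is a heading only by typographic
     convention; served as HTML, the same content exposes explicit,
     machine-readable structure an AI citation system can parse: -->
<h2 id="pricing">Pricing</h2>
<p>Plans start at $10 per user per month.</p>
```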

Where Markdown Exposure Is Hiding in Your Content Stack

Markdown is far more prevalent in content infrastructure than most marketing and SEO teams realise. The following content categories carry the highest risk:

Technical documentation and developer portals. The majority of software documentation is written in Markdown using platforms such as Docusaurus, GitBook, MkDocs, and ReadTheDocs. For technology brands, this documentation is often the most authoritative content they publish — detailed, specific product knowledge that prospects consult during purchase evaluation. If it is in Markdown, it is invisible to the AI systems that prospects increasingly use to research and compare solutions.

GitHub-hosted content. README files, repository wikis, and GitHub Pages sites are Markdown by default. Tutorials, guides, open-source project documentation, and technical explainers published on GitHub — regardless of how authoritative or widely referenced they are — are not being cited by AI search systems.

Static site generators and Jamstack architectures. Jekyll, Hugo, Gatsby, and similar frameworks store content as Markdown that compiles to HTML at build time. Whether the final output is AI-crawlable depends on implementation specifics. Many Jamstack sites serve minimal HTML with content loaded via JavaScript — content that AI crawlers cannot access, even if the visual output looks correct in a browser.

Knowledge bases and internal wiki platforms. Notion, Confluence, and similar tools use Markdown-like syntax. Content published from these platforms to public URLs may not render as fully semantic HTML with the structural signals AI citation systems require.

The Content Format Hierarchy for AI Citation

The OtterlyAI finding sits within a broader pattern of how content formats perform in AI search. Based on current citation pattern research, the hierarchy breaks down as follows:

Highest citation probability: Semantic HTML with complete schema markup, proper heading hierarchy, explicit author attribution with structured data, FAQPage and HowTo schema, fast Core Web Vitals, and factual claims written as standalone, crawlable sentences in paragraph elements.

Strong citation probability: Clean HTML with proper title tags and meta descriptions, a logical heading structure, and responsive design that renders correctly across devices. Pages at this tier can earn citations even without schema markup, though structured data improves extraction confidence.

Moderate citation probability: Pages missing these fundamentals — JavaScript-heavy implementations where important content is absent from the initial HTML response, slow-loading pages, or pages lacking a clear heading hierarchy. These gaps make it harder for AI systems to determine what a page is about, and citation confidence drops accordingly.

Low to zero citation probability: Raw Markdown files, PDFs without proper text structure, fully client-rendered JavaScript applications whose content never appears in the served HTML, and pages blocked from AI crawlers by robots.txt — which prevents those systems from seeing the content at all.
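As a concrete illustration of the FAQPage schema referenced in the top tier, a minimal JSON-LD block of this kind (question and answer text invented for illustration) embeds directly in the page's HTML head or body:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "Does AI search cite Markdown files?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "No. Current AI citation systems select semantic HTML pages; raw Markdown files are not cited."
    }
  }]
}
</script>
```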

Five Actions to Audit and Fix Your Format Exposure

1. Conduct a Full Content Format Inventory

Map every location where your organisation publishes content: main website CMS, documentation platform, developer portal, GitHub presence, help centre, microsites, and campaign landing pages. For each, determine whether the final served format is semantic HTML or Markdown. Use View Source in your browser — not the browser inspector — to see what AI crawlers actually receive. The inspector shows rendered output; View Source shows what arrives over the wire.
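One way to operationalise the View Source check is a small script that counts semantic signals in the raw source a crawler would receive. This is an illustrative sketch using only the Python standard library; `semantic_signals` is our own name, and in practice you would feed it the body returned by an HTTP fetch of each URL in your inventory:

```python
from html.parser import HTMLParser

# Tags that carry the structural signals described above.
SEMANTIC_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6",
                 "p", "meta", "title", "article"}

class SignalCounter(HTMLParser):
    """Counts semantic HTML tags seen in a raw page source."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_TAGS:
            self.counts[tag] = self.counts.get(tag, 0) + 1

def semantic_signals(source: str) -> dict:
    """Return a tag -> count map for the semantic signals in `source`."""
    parser = SignalCounter()
    parser.feed(source)
    return parser.counts
```

A served Markdown file fed to the same function yields no signals at all, which is exactly the gap the audit is meant to surface.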

2. Prioritise High-Value Markdown Content for HTML Conversion

Convert commercially important content first: product documentation prospects consult during evaluation, how-to and tutorial content that establishes authority, comparison pages, and FAQ content you want cited as a primary source. Converting a single high-value page to properly structured HTML can recover citation value within weeks. The content quality already exists — the format change is what activates it.
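To show how mechanical the format change is, here is a deliberately minimal, hypothetical Markdown-to-HTML sketch covering only headings and paragraphs. A real migration would use a full converter (Pandoc, a documentation platform, or your CMS pipeline); this only illustrates the structural mapping AI crawlers expect:

```python
import html
import re

def md_to_semantic_html(md: str) -> str:
    """Tiny sketch: map Markdown headings and paragraphs to semantic HTML.
    Links, lists, code blocks, etc. are deliberately out of scope."""
    blocks = [b.strip() for b in md.split("\n\n") if b.strip()]
    out = []
    for block in blocks:
        m = re.match(r"(#{1,6})\s+(.*)", block)
        if m:
            level = len(m.group(1))  # number of '#' chars = heading level
            out.append(f"<h{level}>{html.escape(m.group(2))}</h{level}>")
        else:
            out.append(f"<p>{html.escape(block)}</p>")
    return "\n".join(out)
```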

3. Verify Your Static Site Generator’s HTML Output

If your site uses a static generator or a headless CMS, verify that Markdown content renders to fully semantic HTML. Check that content is present in the initial server HTML response without JavaScript execution required (View Source confirms this), that heading tags are properly nested from H1 downward, that meta tags contain unique values, and that schema markup is being generated at the page level.
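The heading-nesting check can also be scripted. The sketch below (stdlib only; `heading_issues` is our own name) flags pages that do not start with an H1 or that skip heading levels — both signals the verification step above looks for in the initial server response:

```python
from html.parser import HTMLParser
import re

class HeadingCollector(HTMLParser):
    """Records the level of every h1-h6 tag in document order."""
    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        m = re.fullmatch(r"h([1-6])", tag)
        if m:
            self.levels.append(int(m.group(1)))

def heading_issues(source: str) -> list:
    """Return human-readable problems with a page's heading hierarchy."""
    collector = HeadingCollector()
    collector.feed(source)
    issues = []
    if not collector.levels:
        issues.append("no headings found")
    elif collector.levels[0] != 1:
        issues.append("page does not start with an <h1>")
    for prev, cur in zip(collector.levels, collector.levels[1:]):
        if cur > prev + 1:  # e.g. h1 followed directly by h3
            issues.append(f"heading jumps from h{prev} to h{cur}")
    return issues
```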

4. Migrate Critical Documentation to a Proper HTML Platform

If your most authoritative technical or product content lives primarily on GitHub in Markdown, migrating it to a documentation platform that renders full semantic HTML is a high-ROI investment in AI visibility. The content quality already exists — the format migration is what unlocks citation potential. Docusaurus, properly configured GitBook, and custom documentation sites all provide the necessary semantic structure.

5. Audit robots.txt for Unintended AI Crawler Blocks

While addressing format issues, verify you are not blocking major AI crawlers in your robots.txt. GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, and Google-Extended are the primary crawlers to check. Many sites implemented broad bot-blocking rules during the initial AI crawler wave and are now inadvertently preventing citation by the most important AI search platforms.
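This check is easy to automate with the standard library's `urllib.robotparser`. The sketch below parses a robots.txt body and reports which of the four crawlers named above it blocks for a given URL; in production you would first fetch your live `/robots.txt` rather than pass a string:

```python
from urllib.robotparser import RobotFileParser

# The primary AI crawlers named above.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def blocked_ai_crawlers(robots_txt: str,
                        url: str = "https://example.com/") -> list:
    """Return the major AI crawlers that `robots_txt` blocks for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]
```

A broad legacy rule such as `User-agent: GPTBot` / `Disallow: /` shows up immediately, even when the site-wide `*` entry allows everything.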

A Note on LLMs.txt

The emerging LLMs.txt standard — a proposed file analogous to robots.txt, designed to guide AI crawler behaviour — is worth monitoring but does not require immediate action. Adoption is extremely low (under 0.015% of top sites as of early 2025), and genuine debate continues about whether the standard will achieve widespread adoption or primarily create manipulation risks. The practical priority for most teams remains ensuring existing content is in semantic HTML that current AI systems can reliably process. That foundational fix unlocks everything else.

Format Is Now a First-Class Visibility Signal

Search visibility has long depended on keyword strategy, link authority, and content quality. The OtterlyAI finding adds a fourth dimension that has always been implicit but is now explicitly critical: format. Content that is not in semantic HTML does not participate in AI search — regardless of quality, authority, or topical relevance.

Unlike building topical authority or earning editorial links — processes that take months — converting Markdown to properly structured HTML is a technical implementation task that can be completed in days, even for large content libraries. It is one of the fastest ROI investments available in AI search optimisation right now.

Get technical SEO and AI search intelligence that keeps you ahead at ejournalz.com.
