– Bharath DeepTech Forum

The recent launch of SearchGPT has highlighted a recurring problem in AI search engines: hallucinations. Even the popular search engine Perplexity struggles with it. A recent paper by a team of researchers from China, ‘HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems’, offers a potential solution. The paper examines the use of HTML in retrieval-augmented generation (RAG) systems, which aim to improve the performance of LLMs by supplying them with external knowledge. The authors argue that HTML preserves the structural and semantic information inherent in web pages better than plain text does.

Raw HTML documents, however, are often too long for LLM context windows. To address this, the authors propose a two-stage pruning algorithm that shortens the input context without losing key information. The first stage cleans out unnecessary HTML elements. The second applies a block-tree-based pruning approach that combines embedding-based and generative pruning to reduce the content further while retaining the important parts. The paper also discusses the potential use of Markdown, a lightweight markup language that formats plain text with special characters.

Together, these steps are intended to make knowledge integration more efficient and precise without sacrificing semantic depth or contextual richness. Using HTML adds complexity, but the approach removes unnecessary HTML blocks and retains only the most relevant parts of each document. The paper offers a promising step toward reducing hallucinations in AI search engines and could strengthen the knowledge capabilities of LLMs.
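To make the two-stage idea more concrete, here is a minimal Python sketch. It is not the authors' implementation. The choice of which tags count as noise, the helper names (`clean_html`, `split_blocks`, `prune_blocks`), and the sample page are all assumptions made for illustration. A simple word-overlap score stands in for the embedding-based pruning, a flat list of leaf blocks stands in for the block tree, and the generative pruning step is left out entirely.

```python
import html
import re
from html.parser import HTMLParser

DROP_TAGS = {"script", "style"}  # assumption: elements treated as noise
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # no closing tag


class HtmlCleaner(HTMLParser):
    """Stage 1: drop scripts, styles, comments, and all attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            # Collapse whitespace and re-escape the decoded text.
            self.out.append(html.escape(" ".join(data.split()), quote=False))


def clean_html(raw):
    parser = HtmlCleaner()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


def split_blocks(cleaned):
    """Flat stand-in for a block tree: non-nested p, li, and heading blocks."""
    pattern = r"<(p|li|h[1-6])>.*?</\1>"
    return [m.group(0) for m in re.finditer(pattern, cleaned, re.S)]


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def prune_blocks(blocks, query, keep):
    """Stage 2 (simplified): keep the `keep` blocks that best match the query."""
    q = words(query)
    scores = [len(q & words(re.sub(r"<[^>]+>", " ", b))) for b in blocks]
    # Rank by score, break ties by position, then restore document order.
    ranked = sorted(range(len(blocks)), key=lambda i: (-scores[i], i))
    return [blocks[i] for i in sorted(ranked[:keep])]


page = """<html><head><style>p{color:red}</style><script>track()</script></head>
<body><!-- nav --><h1 class="t">Solar panels</h1>
<p id="a">Solar panel efficiency depends on cell type.</p>
<p>Subscribe to our newsletter today.</p>
<ul><li>Monocrystalline cells reach higher efficiency.</li></ul></body></html>"""

cleaned = clean_html(page)
kept = prune_blocks(split_blocks(cleaned), "solar cell efficiency", keep=2)
print("\n".join(kept))
```

On this sample page the newsletter paragraph is dropped, while the heading and the paragraph that answers the query survive with their tags intact. A real system would replace the word-overlap score with an embedding model and would handle nested blocks, which the regular expression in `split_blocks` does not.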
