Using HTMLSemanticPreservingSplitter for Structured HTML Splitting
Description
The HTMLSemanticPreservingSplitter splits HTML content into manageable chunks while preserving the semantic structure of important elements such as tables, lists, and other complex HTML components. These elements are not split across chunks, which is crucial for maintaining context and readability.
At its heart, this splitter is designed to create contextually relevant chunks. General recursive splitting with HTMLHeaderTextSplitter can cause tables, lists, and other structured elements to be split in the middle, losing significant context and producing poor chunks.
IMPORTANT: max_chunk_size is not a hard upper bound on chunk size. The size calculation happens while the preserved content is not part of the chunk, which ensures that content is never split. When the preserved content is added back into the chunk, the chunk may exceed max_chunk_size. This is necessary to maintain the structure of the original document.
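One way to picture why the limit can be exceeded is a placeholder scheme: preserved spans are swapped for short tokens, the text is sized and split, and the spans are restored afterwards. The sketch below is a simplified illustration using only the standard library, not the library's actual implementation; the function name split_preserving, the @@n@@ placeholder format, and the use of textwrap for splitting are all assumptions made for the example.

```python
import re
import textwrap


def split_preserving(text, preserved_pattern, max_chunk_size):
    """Split text into chunks while keeping every regex match intact (illustrative only)."""
    preserved = {}

    def stash(match):
        # Hypothetical placeholder format; it contains no whitespace, so it is never broken.
        key = f"@@{len(preserved)}@@"
        preserved[key] = match.group(0)
        return key

    # 1. Swap each preserved span for a short placeholder.
    reduced = re.sub(preserved_pattern, stash, text, flags=re.DOTALL)
    # 2. Size the chunks while the preserved content is absent.
    chunks = textwrap.wrap(
        reduced,
        width=max_chunk_size,
        break_long_words=False,
        break_on_hyphens=False,
    )
    # 3. Restore the preserved content; a chunk may now exceed max_chunk_size.
    restored = []
    for chunk in chunks:
        for key, value in preserved.items():
            chunk = chunk.replace(key, value)
        restored.append(chunk)
    return restored


text = (
    "Intro text before the table. "
    "<table>Apples 10 Oranges 5 Bananas 50</table> "
    "Closing text after it."
)
chunks = split_preserving(text, r"<table>.*?</table>", max_chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The chunk that receives the table ends up longer than 30 characters, which mirrors the behavior described above: the limit is applied before the preserved content is put back.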
Usage Example: Preserving Tables and Lists
In this example, we demonstrate how the HTMLSemanticPreservingSplitter can preserve a table and a large list within an HTML document. The chunk size is set to 50 characters to illustrate how the splitter keeps these elements intact, even when they exceed the maximum defined chunk size.
from langchain_core.documents import Document
from langchain_text_splitters import HTMLSemanticPreservingSplitter
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section 1</h1>
<p>This section contains an important table and list that should not be split across chunks.</p>
<table>
<tr>
<th>Item</th>
<th>Quantity</th>
<th>Price</th>
</tr>
<tr>
<td>Apples</td>
<td>10</td>
<td>$1.00</td>
</tr>
<tr>
<td>Oranges</td>
<td>5</td>
<td>$0.50</td>
</tr>
<tr>
<td>Bananas</td>
<td>50</td>
<td>$1.50</td>
</tr>
</table>
<h2>Subsection 1.1</h2>
<p>Additional text in subsection 1.1 that is separated from the table and list.</p>
<p>Here is a detailed list:</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    max_chunk_size=50,
    elements_to_preserve=["table", "ul"],
)
documents = splitter.split_text(html_string)
print(documents)
"""
[
Document(metadata={'Header 1': 'Section 1'}, page_content='This section contains an important table and list'),
Document(metadata={'Header 1': 'Section 1'}, page_content='that should not be split across chunks.'),
Document(metadata={'Header 1': 'Section 1'}, page_content='Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50'),
Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='Additional text in subsection 1.1 that is'),
Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='separated from the table and list. Here is a'),
Document(metadata={'Header 2': 'Subsection 1.1'}, page_content="detailed list: Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")
]
"""
Explanation
In this example, the HTMLSemanticPreservingSplitter ensures that the entire table and the unordered list (<ul>) are preserved within their respective chunks. Even though the chunk size is set to 50 characters, the splitter recognizes that these elements should not be split and keeps them intact.
This is particularly important when dealing with data tables or lists, where splitting the content could lead to loss of context or confusion. The resulting Document objects retain the full structure of these elements, so the contextual relevance of the information is maintained.
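When tuning max_chunk_size, it can help to see which chunks ended up over the limit because they carry preserved content. The following is a minimal sketch under stated assumptions: Chunk is a hypothetical stand-in exposing only page_content and metadata (so the snippet runs without the library installed), oversized is a helper invented for this example, and the two sample strings are copied from the output shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document: only page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def oversized(chunks, max_chunk_size):
    """Return (index, length) for every chunk longer than max_chunk_size."""
    return [
        (index, len(chunk.page_content))
        for index, chunk in enumerate(chunks)
        if len(chunk.page_content) > max_chunk_size
    ]


chunks = [
    Chunk(
        "This section contains an important table and list",
        {"Header 1": "Section 1"},
    ),
    Chunk(
        "Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50",
        {"Header 1": "Section 1"},
    ),
]

report = oversized(chunks, max_chunk_size=50)
print(report)  # only the chunk holding the table text is reported
```

Because the helper reads only page_content, the same function should also work on the list returned by split_text.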
Example: Using a Custom Handler
The HTMLSemanticPreservingSplitter allows you to define custom handlers for specific HTML elements. Some platforms use custom HTML tags that BeautifulSoup does not parse natively; when this occurs, you can use custom handlers to add the formatting logic.
This is particularly useful for elements that require special processing, such as <iframe> tags. In this example, we create a custom handler for iframe tags that converts them into Markdown-like links.
# Render an <iframe> tag as a Markdown-like link built from its src attribute.
def custom_iframe_extractor(iframe_tag):
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"

splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    max_chunk_size=50,
    separators=["\n\n", "\n", ". "],
    elements_to_preserve=["table", "ul", "ol"],
    custom_handlers={"iframe": custom_iframe_extractor},
)
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section with Iframe</h1>
<iframe src="https://example.com/embed"></iframe>
<p>Some text after the iframe.</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
documents = splitter.split_text(html_string)
print(documents)
"""
[
Document(metadata={'Header 1': 'Section with Iframe'}, page_content='[iframe:https://example.com/embed](https://example.com/embed) Some text after the iframe'),
Document(metadata={'Header 1': 'Section with Iframe'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")
]
"""
Explanation
In this example, we defined a custom handler for iframe tags that converts them into Markdown-like links. When the splitter processes the HTML content, it uses this custom handler to transform the iframe tags while preserving other elements like tables and lists. The resulting Document objects show how the iframe is handled according to the custom logic you provided.
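Since the handler only calls .get on its argument, it can be checked in isolation before it is wired into the splitter. The sketch below repeats the handler from the example and feeds it a plain dict as a stand-in for the parsed tag; using a dict here is an assumption made for testing convenience, not how the splitter itself invokes the handler.

```python
def custom_iframe_extractor(iframe_tag):
    # Same handler as in the example: build a Markdown-like link from src.
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"


# A dict supports .get(key, default), so it can stand in for the tag here.
fake_tag = {"src": "https://example.com/embed"}
rendered = custom_iframe_extractor(fake_tag)
print(rendered)

# A tag without a src attribute falls back to the empty-string default.
missing_src = custom_iframe_extractor({})
print(missing_src)
```

Checking the missing-attribute case up front makes it easier to decide whether the handler should emit an empty link or skip such tags entirely.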
Important: When preserving items such as links, be careful not to include a bare "." in your separators, and do not leave separators blank. RecursiveCharacterTextSplitter splits on the full stop, which will cut links in half. Provide a separator list that uses ". " (a full stop followed by a space) instead, as in the example above.
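The effect described in this note can be seen with plain string splitting. This is only an illustration: str.split stands in for the separator logic and is not the splitter's own code, and the sample text is adapted from the example output above with one extra sentence added.

```python
text = (
    "[iframe:https://example.com/embed](https://example.com/embed) "
    "Some text after the iframe. More text follows."
)

# Splitting on a bare full stop also breaks inside "example.com".
bare = text.split(".")
print(bare)

# Splitting on a full stop followed by a space leaves the link untouched.
spaced = text.split(". ")
print(spaced)
```

With the bare separator the first piece ends in the middle of the URL, while the ". " separator yields one piece containing the whole link and one containing the trailing sentence.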
Conclusion
The HTMLSemanticPreservingSplitter is essential for splitting HTML content that includes structured elements like tables and lists, especially when it is critical to keep these elements intact. Its support for custom handlers on specific HTML tags also makes it a versatile tool for processing complex HTML documents. By using this splitter, you can ensure that your documents maintain their context and readability, even when split into smaller chunks.