The Journey of Building a Custom HTML Parser
A detailed account of developing the ZMarkupParser HTML to NSAttributedString rendering engine
This article covers the tokenization of HTML strings, normalization processes, the generation of an Abstract Syntax Tree, the application of the Visitor and Builder patterns, and some additional thoughts.
Continuation
Last year, I published an article titled “TL;DR on Implementing iOS NSAttributedString HTML Rendering,” which briefly introduced how to use XMLParser to parse HTML and convert it into NSAttributedString.Key. The structure and ideas presented in that article were quite scattered, as it was more of a record of the issues I encountered at the time, and I didn’t spend much time researching the topic.
Convert HTML String to NSAttributedString
Revisiting this topic, we need a way to convert HTML strings provided by the API into NSAttributedString and apply the corresponding styles for display in UITextView/UILabel.
For example, <b>Test<a>Link</a></b> should be displayed as a bold "Test" followed by a bold link "Link".
- Note 1: Using HTML as the rendering format passed between the data source and the app is not recommended: the HTML specification is too flexible, the app cannot support every HTML style, and there is no official HTML conversion rendering engine.
- Note 2: Starting from iOS 15, you can use the official native AttributedString to parse Markdown, or import the apple/swift-markdown Swift Package to parse Markdown.
- Note 3: Due to the large scale of our company’s project and the long-standing use of HTML as a medium, we cannot fully switch to Markdown or other markup languages at this time.
- Note 4: The HTML here is not meant to display an entire HTML webpage; it is used only as a lightweight, Markdown-like markup for styling strings. (To render a full page of complex HTML, including images and tables, you still need to load the HTML in a WebView.)
I strongly recommend using Markdown as the string rendering medium. If your project faces similar challenges and you have to use HTML without an elegant tool for converting to NSAttributedString, then please proceed with caution.
If you remember the previous article, you can jump directly to the ZhgChgLi / ZMarkupParser section.
NSAttributedString.DocumentType.html
Most methods found online for converting HTML to NSAttributedString involve directly using the options provided by NSAttributedString to render HTML. Here’s an example:
let htmlString = "<b>Test<a>Link</a></b>"
let data = htmlString.data(using: String.Encoding.utf8)!
let attributedOptions: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.utf8.rawValue
]
let attributedString = try! NSAttributedString(data: data, options: attributedOptions, documentAttributes: nil)
Problems with this approach:
- Poor performance: This method renders styles through the WebView Core and then switches back to the Main Thread for UI display; rendering over 300 characters takes about 0.03 seconds.
- Text loss: For example, marketing copy might use <Congratulation!>, which would be treated as an HTML tag and removed.
- Lack of customization: For instance, you cannot specify how bold an HTML bold tag should be in the resulting NSAttributedString.
- Random crashes starting from iOS 12 with no official solution.
- A significant number of crashes appeared in iOS 15, with tests showing that under low battery conditions, it crashes 100% of the time (fixed in iOS ≥ 15.2).
- Long strings cause crashes; testing shows that inputting strings longer than 54,600 characters results in a 100% crash (EXC_BAD_ACCESS).
The most painful issue for us remains the crashing problem. From the release of iOS 15 until the fix in 15.2, our app was consistently plagued by this issue. Data shows that from March 11, 2022, to June 8, 2022, it caused over 2.4K crashes, affecting more than 1.4K users.
This crash has existed since iOS 12; iOS 15 simply made it much worse. I also suspect that the fix in iOS 15.2 is only a patch and that Apple cannot eliminate the root cause.
The next issue is performance. As a markup language for string styles, it is heavily used in UILabel/UITextView throughout the app. As mentioned earlier, rendering a single label takes 0.03 seconds, and multiplying that across a list of UILabels/UITextViews can lead to noticeable lag in user interactions.
XMLParser
The second solution is to use XMLParser to parse the HTML into corresponding NSAttributedString keys and apply styles, as introduced in the previous article.
You can refer to the implementation of SwiftRichString and the content of the previous article for more details.
The previous article only explored the possibility of using XMLParser to parse HTML and perform corresponding conversions, completing an experimental implementation without designing it as a well-structured and extensible “tool.”
Problems with this approach:
- Zero fault tolerance: HTML such as <br>, <Congratulation!>, or <b>Bold<i>Bold+Italic</b>Italic</i> makes XMLParser throw an error, and the result is displayed as blank.
- When using XMLParser, the HTML string must strictly adhere to XML rules, unlike browsers or NSAttributedString.DocumentType.html, which tolerate some errors and still display the content.
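The zero fault tolerance is easy to reproduce. The following is a minimal sketch, not code from the previous article: the helper name isWellFormedXML and the wrapping <root> element are assumptions made for illustration. It only relies on XMLParser.parse() returning false when the input is not well-formed XML.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML // XMLParser lives in this module on Linux
#endif

/// Hypothetical helper: returns true only if `html` is well-formed XML
/// once it is wrapped in a single root element.
func isWellFormedXML(_ html: String) -> Bool {
    let wrapped = "<root>\(html)</root>"
    let parser = XMLParser(data: Data(wrapped.utf8))
    // parse() returns false as soon as the parser hits a fatal error
    return parser.parse()
}

print(isWellFormedXML("<b>Test<a>Link</a></b>"))              // properly nested
print(isWellFormedXML("<br>"))                                // missing close tag
print(isWellFormedXML("<b>Bold<i>Bold+Italic</b>Italic</i>")) // misaligned tags
```

Only the first input is accepted; the other two are rejected outright, so there is nothing left to render.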
Standing on the Shoulders of Giants
Neither of the above solutions perfectly and elegantly solves the HTML problem, so I began searching for existing solutions.
- johnxnguyen / Down only supports converting Markdown to Any (XML/NSAttributedString…) but does not support converting HTML.
- malcommac / SwiftRichString uses XMLParser under the hood, and tests show that it has the same zero fault tolerance issues as mentioned earlier.
- scinfu / SwiftSoup only supports HTML parsing (Selector) and does not support conversion to NSAttributedString as noted in this issue.
After searching extensively, I found that the results were similar to the projects mentioned above. There were no giants’ shoulders to stand on.
ZhgChgLi/ZMarkupParser
With no giants’ shoulders to rely on, I had to become the giant myself and developed a tool to convert HTML strings to NSAttributedString.
Developed entirely in Swift, it uses regex to extract HTML tags and tokenize the input, analyzes and corrects tag validity (fixing missing end tags and misaligned tags), converts the tokens into an abstract syntax tree, and finally uses the Visitor pattern to map HTML tags to abstract styles and produce the final NSAttributedString. The tool does not rely on any parser libraries.
Features
- Supports HTML rendering (to NSAttributedString), stripping HTML tags, and selector functionality.
- Higher performance than NSAttributedString.DocumentType.html.
- Automatically analyzes and corrects tag validity (fixing missing end tags and misaligned tags).
- Supports dynamic style settings from style="color:red...".
- Allows customization of style specifications, such as how bold a bold tag should be.
- Supports flexible extensibility for tags or custom tags and attributes.
For detailed introduction and installation instructions, please refer to this article: “ZMarkupParser HTML String to NSAttributedString Tool”.
You can git clone the project, open ZMarkupParser.xcworkspace, and select the ZMarkupParser-Demo target to build and run it for experimentation.
Technical Details
Next, I want to share the technical details regarding the development of this tool.
The image above illustrates the overall process. The following sections walk through each step with code snippets.
⚠️️️️️️ This article will simplify demo code, reduce abstraction, and focus on explaining the operational principles. For the final results, please refer to the project source code.
Code Implementation — Tokenization
a.k.a. parser, parsing
When it comes to HTML rendering, the most crucial aspect is the parsing stage. Previously, I used XMLParser to treat HTML as XML for parsing; however, it couldn’t overcome the fact that everyday HTML usage is not 100% XML-compliant, leading to parser errors and an inability to dynamically correct them.
After ruling out the use of XMLParser, the only option left in Swift was to use Regex for matching and parsing.
Initially, I thought I could directly use regex to extract “paired” HTML tags and recursively search for HTML tags layer by layer until completion. However, this approach does not address the nesting of HTML tags or the need for fault tolerance for misaligned tags. Therefore, we changed our strategy to extract “single” HTML tags, recording whether they are start tags, close tags, or self-closing tags, along with other string combinations to form the parsing result array.
The Tokenization Structure is as follows:
enum HTMLParsedResult {
    case start(StartItem) // <a>
    case close(CloseItem) // </a>
    case selfClosing(SelfClosingItem) // <br/>
    case rawString(NSAttributedString)
}

extension HTMLParsedResult {
    class SelfClosingItem {
        let tagName: String
        let tagAttributedString: NSAttributedString
        let attributes: [String: String]?

        init(tagName: String, tagAttributedString: NSAttributedString, attributes: [String : String]?) {
            self.tagName = tagName
            self.tagAttributedString = tagAttributedString
            self.attributes = attributes
        }
    }

    class StartItem {
        let tagName: String
        let tagAttributedString: NSAttributedString
        let attributes: [String: String]?
        // A Start Tag may be an abnormal HTML tag or plain text, e.g. <Congratulation!>.
        // If normalization finds it to be an isolated Start Tag, this is set to true.
        var isIsolated: Bool = false

        init(tagName: String, tagAttributedString: NSAttributedString, attributes: [String : String]?) {
            self.tagName = tagName
            self.tagAttributedString = tagAttributedString
            self.attributes = attributes
        }

        // Used by the automatic correction in the normalization step
        func convertToCloseParsedItem() -> CloseItem {
            return CloseItem(tagName: self.tagName)
        }

        // Used by the automatic correction in the normalization step
        func convertToSelfClosingParsedItem() -> SelfClosingItem {
            return SelfClosingItem(tagName: self.tagName, tagAttributedString: self.tagAttributedString, attributes: self.attributes)
        }
    }

    class CloseItem {
        let tagName: String

        init(tagName: String) {
            self.tagName = tagName
        }
    }
}
The regex used is as follows:
<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)
- closeTag: matches the / in </a>
- tagName: matches the a in <a> or </a>
- tagAttributes: matches href="https://zhgchg.li" style="color:red" in <a href="https://zhgchg.li" style="color:red">
- selfClosingTag: matches the / in <br/>
*This regex can still be optimized further; I will address that later.
The latter part of the article provides additional information about regex for those interested.
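To see what the named capture groups return, here is a small standalone sketch that runs the regex above over a short string. The TagMatch struct and the findTags(in:) function are hypothetical names introduced only for this illustration; they are not part of ZMarkupParser.

```swift
import Foundation

// Hypothetical value type used only in this sketch
struct TagMatch: Equatable {
    let name: String
    let isClose: Bool
    let isSelfClosing: Bool
    let attributes: String
}

// The same pattern as above, written as a raw string so no extra escaping is needed
let tagPattern = #"<(?:(?<closeTag>\/)?(?<tagName>[A-Za-z0-9]+)(?<tagAttributes>(?:\s*(\w+)\s*=\s*(["|']).*?\5)*)\s*(?<selfClosingTag>\/)?>)"#

func findTags(in html: String) -> [TagMatch] {
    guard let regex = try? NSRegularExpression(pattern: tagPattern) else { return [] }
    let fullRange = NSRange(html.startIndex..., in: html)

    return regex.matches(in: html, range: fullRange).map { match -> TagMatch in
        // A named group that did not take part in the match has no valid range
        func text(_ group: String) -> String {
            guard let range = Range(match.range(withName: group), in: html) else { return "" }
            return String(html[range])
        }
        return TagMatch(
            name: text("tagName").lowercased(),
            isClose: text("closeTag") == "/",
            isSelfClosing: text("selfClosingTag") == "/",
            attributes: text("tagAttributes").trimmingCharacters(in: .whitespaces)
        )
    }
}

for tag in findTags(in: "<a href=\"https://zhgchg.li\">Link</a><br/>") {
    print(tag)
}
```

Each match describes exactly one tag, which is what lets the tokenizer treat start tags, close tags, and self-closing tags independently instead of hunting for pairs.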
Putting it all together:
var tokenizationResult: [HTMLParsedResult] = []

// `pattern` is the regex shown above; `expressionOptions` are the NSRegularExpression.Options used by the project (see the source code)
let expression = try NSRegularExpression(pattern: pattern, options: expressionOptions)
let attributedString = NSAttributedString(string: "<a href=\"https://zhgchg.li\">Li<b>nk</a>Bold</b>")
let totalLength = attributedString.string.utf16.count // NSRange counts UTF-16 code units, so emoji are handled correctly
var lastMatch: NSTextCheckingResult?

// Start Tags Stack, First In Last Out (FILO)
// Used to check whether the HTML string needs the subsequent normalization step to fix misaligned tags or add self-closing tags
var stackStartItems: [HTMLParsedResult.StartItem] = []
var needFormatter: Bool = false

// Returns the text of a named capture group, or nil if the group is empty or did not take part in the match
func attributedSubstring(of match: NSTextCheckingResult, named name: String) -> NSAttributedString? {
    let range = match.range(withName: name)
    guard range.location != NSNotFound, range.length > 0 else { return nil }
    return attributedString.attributedSubstring(from: range)
}

expression.enumerateMatches(in: attributedString.string, range: NSMakeRange(0, totalLength)) { match, _, _ in
    guard let match = match else { return }

    // Collect the string between tags, or the string before the first tag
    // e.g. Test<a>Link</a>zzz<b>bold</b>Test2 -> Test, zzz
    let lastMatchEnd = lastMatch?.range.upperBound ?? 0
    let currentMatchStart = match.range.lowerBound
    if currentMatchStart > lastMatchEnd {
        let rawStringBetweenTag = attributedString.attributedSubstring(from: NSMakeRange(lastMatchEnd, (currentMatchStart - lastMatchEnd)))
        tokenizationResult.append(.rawString(rawStringBetweenTag))
    }

    // <a href="https://zhgchg.li">, </a>
    let matchAttributedString = attributedString.attributedSubstring(from: match.range)
    // a, a
    let matchTag = attributedSubstring(of: match, named: "tagName")?.string.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
    // false, true
    let matchIsEndTag = attributedSubstring(of: match, named: "closeTag")?.string.trimmingCharacters(in: .whitespacesAndNewlines) == "/"
    // href="https://zhgchg.li", nil
    // A second regex extracts the HTML attributes into [String: String]; please refer to the source code
    let matchTagAttributes = parseAttributes(attributedSubstring(of: match, named: "tagAttributes"))
    // false, false
    let matchIsSelfClosingTag = attributedSubstring(of: match, named: "selfClosingTag")?.string.trimmingCharacters(in: .whitespacesAndNewlines) == "/"

    if let matchTag = matchTag {
        if matchIsSelfClosingTag {
            // e.g. <br/>
            tokenizationResult.append(.selfClosing(.init(tagName: matchTag, tagAttributedString: matchAttributedString, attributes: matchTagAttributes)))
        } else if matchIsEndTag {
            // e.g. </a>
            // Look up the same tag name in the stack, searching from the top
            if let index = stackStartItems.lastIndex(where: { $0.tagName == matchTag }) {
                // If it is not on top of the stack, tags are misaligned or a close tag is missing
                if index != stackStartItems.count - 1 {
                    needFormatter = true
                }
                tokenizationResult.append(.close(.init(tagName: matchTag)))
                stackStartItems.remove(at: index)
            } else {
                // Extra close tag, e.g. </a>
                // Does not affect subsequent processing, so simply ignore it
            }
        } else {
            // e.g. <a>
            let startItem = HTMLParsedResult.StartItem(tagName: matchTag, tagAttributedString: matchAttributedString, attributes: matchTagAttributes)
            tokenizationResult.append(.start(startItem))
            // Push to the stack
            stackStartItems.append(startItem)
        }
    }

    lastMatch = match
}

// Collect the raw string at the end
// e.g. Test<a>Link</a>Test2 -> Test2
if let lastMatch = lastMatch {
    let currentIndex = lastMatch.range.upperBound
    if totalLength > currentIndex {
        // There is text left after the last tag
        let restString = attributedString.attributedSubstring(from: NSMakeRange(currentIndex, (totalLength - currentIndex)))
        tokenizationResult.append(.rawString(restString))
    }
} else {
    // lastMatch == nil means no tags were found; the whole input is plain text
    let restString = attributedString.attributedSubstring(from: NSMakeRange(0, totalLength))
    tokenizationResult.append(.rawString(restString))
}
Finally, check whether the stack is empty. If it is not, some Start Tags have no corresponding End Tags and should be marked as isolated Start Tags.
for stackStartItem in stackStartItems {
    stackStartItem.isIsolated = true
    needFormatter = true
}

print(tokenizationResult)
// [
//    .start("a",["href":"https://zhgchg.li"])
//    .rawString("Li")
//    .start("b",nil)
//    .rawString("nk")
//    .close("a")
//    .rawString("Bold")
//    .close("b")
// ]
The operation flow is shown in the image above.
In the end, we will obtain a Tokenization result array.
The corresponding implementation in the source code is HTMLStringToParsedResultProcessor.swift.
Normalization
Also known as Formatter, normalization.
The previous step produces a preliminary parsing result. If tokenization found that the input still needs correction, this step automatically fixes the HTML tag issues.
There are three types of HTML Tag issues:
- HTML Tag missing a Close Tag: for example, <br>
- Regular text being treated as an HTML Tag: for example, <Congratulation!>
- HTML Tag misalignment issues: for example, <a>Li<b>nk</a>Bold</b>
The correction method is quite simple; we need to traverse the elements of the Tokenization result and attempt to fill in the gaps.
The operation flow is shown in the image above.
var normalizationResult = tokenizationResult

// Start Tags Stack, First In Last Out (FILO)
var stackExpectedStartItems: [HTMLParsedResult.StartItem] = []
var itemIndex = 0
while itemIndex < normalizationResult.count {
    switch normalizationResult[itemIndex] {
    case .start(let item):
        if item.isIsolated {
            // Isolated Start Tag
            if WC3HTMLTagName(rawValue: item.tagName) == nil && (item.attributes?.isEmpty ?? true) {
                // Not a W3C-defined HTML tag and has no HTML attributes
                // (see the WC3HTMLTagName enum in the source code):
                // treat it as regular text that only looks like a tag and change it to a raw string
                normalizationResult[itemIndex] = .rawString(item.tagAttributedString)
            } else {
                // Otherwise, change it to a self-closing tag, e.g. <br> -> <br/>
                normalizationResult[itemIndex] = .selfClosing(item.convertToSelfClosingParsedItem())
            }
            itemIndex += 1
        } else {
            // Normal Start Tag, push to the stack
            stackExpectedStartItems.append(item)
            itemIndex += 1
        }
    case .close(let item):
        // Close Tag: find the Start Tags that were opened after its matching Start Tag and are still open
        // e.g. <a><u><b>[CurrentIndex]</b></u></a> -> gap 0
        // e.g. <a><u><b>[CurrentIndex]</a></u></b> -> gap b,u
        let reversedStackExpectedStartItems = Array(stackExpectedStartItems.reversed())
        guard let reversedStackExpectedStartItemsOccurredIndex = reversedStackExpectedStartItems.firstIndex(where: { $0.tagName == item.tagName }) else {
            itemIndex += 1
            continue
        }

        let reversedStackExpectedStartItemsOccurred = Array(reversedStackExpectedStartItems.prefix(upTo: reversedStackExpectedStartItemsOccurredIndex))

        // Gap 0 means the tag is not misaligned
        guard reversedStackExpectedStartItemsOccurred.count != 0 else {
            // It's a pair, pop
            stackExpectedStartItems.removeLast()
            itemIndex += 1
            continue
        }

        // Otherwise, close the gap tags before this Close Tag and reopen them after it
        // e.g. <a><u><b>[CurrentIndex]</a></u></b> ->
        //      <a><u><b>[CurrentIndex]</b></u></a><u><b></u></b>
        let stackExpectedStartItemsOccurred = Array(reversedStackExpectedStartItemsOccurred.reversed())
        let afterItems = stackExpectedStartItemsOccurred.map({ HTMLParsedResult.start($0) })
        let beforeItems = reversedStackExpectedStartItemsOccurred.map({ HTMLParsedResult.close($0.convertToCloseParsedItem()) })
        normalizationResult.insert(contentsOf: afterItems, at: normalizationResult.index(after: itemIndex))
        normalizationResult.insert(contentsOf: beforeItems, at: itemIndex)

        // Continue from the first reopened Start Tag so that it is pushed back onto the stack
        itemIndex = normalizationResult.index(after: itemIndex) + stackExpectedStartItemsOccurred.count

        // Remove the matched Start Tag and the gap tags from the stack;
        // the gap tags (e.g. u,b) are pushed again when the reopened Start Tags are visited
        stackExpectedStartItems.removeAll { startItem in
            return reversedStackExpectedStartItems.prefix(through: reversedStackExpectedStartItemsOccurredIndex).contains(where: { $0 === startItem })
        }
    case .selfClosing, .rawString:
        itemIndex += 1
    }
}

print(normalizationResult)
// [
//    .start("a",["href":"https://zhgchg.li"])
//    .rawString("Li")
//    .start("b",nil)
//    .rawString("nk")
//    .close("b")
//    .close("a")
//    .start("b",nil)
//    .rawString("Bold")
//    .close("b")
// ]
The corresponding implementation in the source code is HTMLParsedResultFormatterProcessor.swift.
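For readers who want to experiment with the misalignment fix in isolation, the idea can be condensed into a self-contained sketch. This is a simplification and not the project's implementation: the Token enum and the normalize function are hypothetical names, tokens carry only tag names, and isolated Start Tags (such as <br> or <Congratulation!>) are not handled here.

```swift
// Hypothetical, simplified token type used only in this sketch
enum Token: Equatable {
    case open(String)
    case close(String)
    case text(String)
}

/// Re-nests misaligned tags: when a close tag does not match the innermost
/// open tag, the tags opened in between are closed before it and reopened after it.
func normalize(_ tokens: [Token]) -> [Token] {
    var output: [Token] = []
    var openTags: [String] = [] // stack of currently open tag names

    for token in tokens {
        switch token {
        case .open(let name):
            openTags.append(name)
            output.append(token)
        case .text:
            output.append(token)
        case .close(let name):
            // Stray close tag with no matching open tag: drop it
            guard let index = openTags.lastIndex(of: name) else { continue }
            // Tags opened after `name` that are still open (outermost first)
            let gap = Array(openTags[(index + 1)...])
            // Close the gap tags innermost-first, then close `name` itself
            output.append(contentsOf: gap.reversed().map { Token.close($0) })
            output.append(.close(name))
            // Reopen the gap tags in their original order
            output.append(contentsOf: gap.map { Token.open($0) })
            openTags.removeSubrange(index...)
            openTags.append(contentsOf: gap)
        }
    }
    return output
}

// <a>Li<b>nk</a>Bold</b>
let misaligned: [Token] = [
    .open("a"), .text("Li"), .open("b"), .text("nk"),
    .close("a"), .text("Bold"), .close("b")
]
print(normalize(misaligned))
```

The result corresponds to <a>Li<b>nk</b></a><b>Bold</b>, the same shape as the normalization output printed above.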
Abstract Syntax Tree
Also known as AST, Abstract Tree.
After completing the data preprocessing with Tokenization & Normalization, we will now convert the results into an abstract tree 🌲.
As shown in the image above.
Converting to an abstract tree allows us to facilitate future operations and expansions, such as implementing Selector functionality or performing other transformations, like HTML to Markdown; or if we want to add Markdown to NSAttributedString in the future, we just need to implement Markdown’s Tokenization & Normalization to achieve it.
First, we define a Markup Protocol with Child & Parent properties to record information about leaves and branches:
protocol Markup: AnyObject {
    var parentMarkup: Markup? { get set }
    var childMarkups: [Markup] { get set }

    func appendChild(markup: Markup)
    func prependChild(markup: Markup)
    func accept<V: MarkupVisitor>(_ visitor: V) -> V.Result
}

extension Markup {
    func appendChild(markup: Markup) {
        markup.parentMarkup = self
        childMarkups.append(markup)
    }

    func prependChild(markup: Markup) {
        markup.parentMarkup = self
        childMarkups.insert(markup, at: 0)
    }
}
We also apply the Visitor pattern: each style is defined as its own element object, and different visitor strategies can then produce different results from the same tree.
protocol MarkupVisitor {
    associatedtype Result

    func visit(markup: Markup) -> Result

    func visit(_ markup: RootMarkup) -> Result
    func visit(_ markup: RawStringMarkup) -> Result

    func visit(_ markup: BoldMarkup) -> Result
    func visit(_ markup: LinkMarkup) -> Result
    //...
}

extension MarkupVisitor {
    func visit(markup: Markup) -> Result {
        return markup.accept(self)
    }
}
Basic Markup Nodes:
// Root Node
final class RootMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []

    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}

// Leaf Node
final class RawStringMarkup: Markup {
    let attributedString: NSAttributedString

    init(attributedString: NSAttributedString) {
        self.attributedString = attributedString
    }

    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []

    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}
Defining Markup Style Nodes:
// Branch Node:

// Link Style
final class LinkMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []

    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}

// Bold Style
final class BoldMarkup: Markup {
    weak var parentMarkup: Markup? = nil
    var childMarkups: [Markup] = []

    func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
        return visitor.visit(self)
    }
}
The corresponding implementation in the source code is Markup.
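To make the benefit of the Visitor pattern concrete, here is a reduced, self-contained sketch in which two different visitors walk the same tree and produce different results. All names in it (Node, NodeVisitor, RootNode, TextNode, BoldNode, PlainTextVisitor, OutlineVisitor) are hypothetical stand-ins for the Markup types above, chosen so that the block compiles on its own.

```swift
protocol NodeVisitor {
    associatedtype Result
    func visit(_ node: RootNode) -> Result
    func visit(_ node: TextNode) -> Result
    func visit(_ node: BoldNode) -> Result
}

protocol Node: AnyObject {
    var children: [Node] { get set }
    func accept<V: NodeVisitor>(_ visitor: V) -> V.Result
}

final class RootNode: Node {
    var children: [Node] = []
    func accept<V: NodeVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

final class TextNode: Node {
    let text: String
    var children: [Node] = []
    init(_ text: String) { self.text = text }
    func accept<V: NodeVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

final class BoldNode: Node {
    var children: [Node]
    init(_ children: [Node]) { self.children = children }
    func accept<V: NodeVisitor>(_ visitor: V) -> V.Result { visitor.visit(self) }
}

// Strategy 1: strip all styling and return the plain text
struct PlainTextVisitor: NodeVisitor {
    typealias Result = String
    func visit(_ node: RootNode) -> String { joinChildren(of: node) }
    func visit(_ node: TextNode) -> String { node.text }
    func visit(_ node: BoldNode) -> String { joinChildren(of: node) }
    private func joinChildren(of node: Node) -> String {
        node.children.map { $0.accept(self) }.joined()
    }
}

// Strategy 2: describe the tree structure for debugging
struct OutlineVisitor: NodeVisitor {
    typealias Result = String
    func visit(_ node: RootNode) -> String { "root(" + joinChildren(of: node) + ")" }
    func visit(_ node: TextNode) -> String { "text(" + node.text + ")" }
    func visit(_ node: BoldNode) -> String { "bold(" + joinChildren(of: node) + ")" }
    private func joinChildren(of node: Node) -> String {
        node.children.map { $0.accept(self) }.joined()
    }
}

let root = RootNode()
root.children = [TextNode("Li"), BoldNode([TextNode("nk")])]
print(root.accept(PlainTextVisitor()))
print(root.accept(OutlineVisitor()))
```

Adding a new output format only requires a new visitor; the node classes stay untouched, which is the property the real Markup tree relies on.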
Before converting to an abstract tree, we still need to…
MarkupComponent
Our tree nodes deliberately carry no data of their own, yet later steps need that data (for example, a LinkMarkup node needs its URL before it can be rendered). To bridge this, we define a container that pairs a tree node with its related data:
protocol MarkupComponent {
    associatedtype T
    var markup: Markup { get }
    var value: T { get }

    init(markup: Markup, value: T)
}

extension Sequence where Iterator.Element: MarkupComponent {
    func value(markup: Markup) -> Element.T? {
        return self.first(where:{ $0.markup === markup })?.value as? Element.T
    }
}
The corresponding implementation in the source code is MarkupComponent.
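The lookup is based on object identity (===) rather than equality. A minimal standalone sketch of that idea follows, using hypothetical SketchNode and SketchComponent types in place of Markup and MarkupComponent:

```swift
// Hypothetical stand-ins for a tree node and its data container
final class SketchNode {}

struct SketchComponent<Value> {
    let node: SketchNode
    let value: Value
}

extension Sequence {
    /// Finds the value stored for exactly this node instance, if any
    func value<Value>(for node: SketchNode) -> Value? where Element == SketchComponent<Value> {
        return first(where: { $0.node === node })?.value
    }
}

let linkNode = SketchNode()
let otherNode = SketchNode()
let components = [SketchComponent(node: linkNode, value: "https://zhgchg.li")]

let found: String? = components.value(for: linkNode)
let missing: String? = components.value(for: otherNode)
print(found ?? "nil", missing ?? "nil")
```

Because the data lives beside the tree rather than inside it, the same node types could be paired with a different kind of value without changing the nodes themselves.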
We could also declare Markup as Hashable and use a Dictionary [Markup: Any] to store the values directly, but then Markup could no longer be used as a regular type and would have to be written as any Markup.
HTMLTag & HTMLTagName & HTMLTagNameVisitor
We also abstract the HTML Tag Name part, allowing users to decide which Tags need to be processed, making future expansions easier. For example, the <strong> Tag Name can also correspond to BoldMarkup.
public protocol HTMLTagName {
    var string: String { get }
    func accept<V: HTMLTagNameVisitor>(_ visitor: V) -> V.Result
}

public struct A_HTMLTagName: HTMLTagName {
    public let string: String = WC3HTMLTagName.a.rawValue

    public init() {
    }

    public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
        return visitor.visit(self)
    }
}

public struct B_HTMLTagName: HTMLTagName {
    public let string: String = WC3HTMLTagName.b.rawValue

    public init() {
    }

    public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
        return visitor.visit(self)
    }
}
The corresponding implementation in the source code is HTMLTagNameVisitor.
The HTML tag name enum, based on the list in the W3C wiki, is defined in WC3HTMLTagName.swift.
HTMLTag is simply a container object, as we want to allow external specification of the styles corresponding to HTML Tags, so we declare a container to hold them together:
struct HTMLTag {
    let tagName: HTMLTagName
    let customStyle: MarkupStyle? // To be explained in the Render section later

    init(tagName: HTMLTagName, customStyle: MarkupStyle? = nil) {
        self.tagName = tagName
        self.customStyle = customStyle
    }
}
The corresponding implementation in the source code is HTMLTag.
HTMLTagNameToHTMLMarkupVisitor
struct HTMLTagNameToMarkupVisitor: HTMLTagNameVisitor {
    typealias Result = Markup

    let attributes: [String: String]?

    func visit(_ tagName: A_HTMLTagName) -> Result {
        return LinkMarkup()
    }

    func visit(_ tagName: B_HTMLTagName) -> Result {
        return BoldMarkup()
    }
    //...
}
The corresponding implementation in the source code is HTMLTagNameToHTMLMarkupVisitor.
Converting to an Abstract Tree with HTML Data
We need to convert the normalized HTML data results into an abstract tree. First, we declare a data structure for MarkupComponent that can hold HTML data:
struct HTMLElementMarkupComponent: MarkupComponent {
    struct HTMLElement {
        let tag: HTMLTag
        let tagAttributedString: NSAttributedString
        let attributes: [String: String]?
    }

    typealias T = HTMLElement

    let markup: Markup
    let value: HTMLElement

    init(markup: Markup, value: HTMLElement) {
        self.markup = markup
        self.value = value
    }
}
Converting to a Markup Abstract Tree:
// Simplified from the processor class: stored properties, the initializer, then the conversion loop
var htmlElementComponents: [HTMLElementMarkupComponent] = []
let rootMarkup = RootMarkup()
var currentMarkup: Markup = rootMarkup

let htmlTags: [String: HTMLTag]
init(htmlTags: [HTMLTag]) {
    self.htmlTags = Dictionary(uniqueKeysWithValues: htmlTags.map{ ($0.tagName.string, $0) })
}

// Start Tags Stack, used to make sure the correct tag is popped
// Normalization has already been done, so there should be no mismatches; this is only a safeguard
var stackExpectedStartItems: [HTMLParsedResult.StartItem] = []

// `from` is the normalized [HTMLParsedResult] array produced by the previous step
for thisItem in from {
    switch thisItem {
    case .start(let item):
        let visitor = HTMLTagNameToMarkupVisitor(attributes: item.attributes)
        let htmlTag = self.htmlTags[item.tagName] ?? HTMLTag(tagName: ExtendTagName(item.tagName))

        // Ask the Visitor for the corresponding Markup
        let markup = visitor.visit(tagName: htmlTag.tagName)

        // Add it as a child of the current branch,
        // then make it the current branch node
        htmlElementComponents.append(.init(markup: markup, value: .init(tag: htmlTag, tagAttributedString: item.tagAttributedString, attributes: item.attributes)))
        currentMarkup.appendChild(markup: markup)
        currentMarkup = markup

        stackExpectedStartItems.append(item)
    case .selfClosing(let item):
        // Add it directly as a leaf node of the current branch
        let visitor = HTMLTagNameToMarkupVisitor(attributes: item.attributes)
        let htmlTag = self.htmlTags[item.tagName] ?? HTMLTag(tagName: ExtendTagName(item.tagName))
        let markup = visitor.visit(tagName: htmlTag.tagName)
        htmlElementComponents.append(.init(markup: markup, value: .init(tag: htmlTag, tagAttributedString: item.tagAttributedString, attributes: item.attributes)))
        currentMarkup.appendChild(markup: markup)
    case .close(let item):
        if let lastTagName = stackExpectedStartItems.popLast()?.tagName,
           lastTagName == item.tagName {
            // Close Tag: go back up one level
            currentMarkup = currentMarkup.parentMarkup ?? currentMarkup
        }
    case .rawString(let attributedString):
        // Add it directly as a leaf node of the current branch
        currentMarkup.appendChild(markup: RawStringMarkup(attributedString: attributedString))
    }
}

// print(htmlElementComponents)
// [(markup: LinkMarkup, (tag: a, attributes: ["href":"https://zhgchg.li"]...)]
The operation result is shown in the image above.
The corresponding implementation in the source code is HTMLParsedResultToHTMLElementWithRootMarkupProcessor.swift.
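The core of the conversion is a "current node" pointer that moves down on a start tag and back up on a close tag. The sketch below shows only that mechanism on simplified data; Token, TreeNode, and buildTree are hypothetical names, and the visitor lookup and component bookkeeping from the code above are intentionally left out.

```swift
// Hypothetical, simplified token type used only in this sketch
enum Token {
    case open(String)
    case close(String)
    case text(String)
}

final class TreeNode {
    let name: String          // tag name, "root", or "#text"
    let text: String?         // only set for text leaves
    weak var parent: TreeNode?
    var children: [TreeNode] = []

    init(name: String, text: String? = nil) {
        self.name = name
        self.text = text
    }

    func append(_ child: TreeNode) {
        child.parent = self
        children.append(child)
    }

    /// Compact textual form of the subtree, e.g. a(Li,b(nk))
    var outline: String {
        if let text = text { return text }
        return name + "(" + children.map { $0.outline }.joined(separator: ",") + ")"
    }
}

/// Builds a tree from tokens that are assumed to be already normalized
func buildTree(from tokens: [Token]) -> TreeNode {
    let root = TreeNode(name: "root")
    var current = root

    for token in tokens {
        switch token {
        case .open(let name):
            // A start tag becomes a child of the current branch and the new current branch
            let node = TreeNode(name: name)
            current.append(node)
            current = node
        case .close(let name):
            // A matching close tag moves back up one level
            if current.name == name, let parent = current.parent {
                current = parent
            }
        case .text(let string):
            // Text is always a leaf of the current branch
            current.append(TreeNode(name: "#text", text: string))
        }
    }
    return root
}

// <a>Li<b>nk</b></a><b>Bold</b>
let tree = buildTree(from: [
    .open("a"), .text("Li"), .open("b"), .text("nk"), .close("b"), .close("a"),
    .open("b"), .text("Bold"), .close("b")
])
print(tree.outline)
```

The parent reference is weak, mirroring the weak parentMarkup in the Markup nodes, so parents and children do not retain each other.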
At this point, we have actually completed the Selector functionality 🎉
public class HTMLSelector: CustomStringConvertible {
    let markup: Markup
    let components: [HTMLElementMarkupComponent]

    init(markup: Markup, components: [HTMLElementMarkupComponent]) {
        self.markup = markup
        self.components = components
    }

    public func filter(_ htmlTagName: String) -> [HTMLSelector] {
        let result = markup.childMarkups.filter({ components.value(markup: $0)?.tag.tagName.isEqualTo(htmlTagName) ?? false })
        return result.map({ .init(markup: $0, components: components) })
    }

    //...
}
We can filter leaf node objects layer by layer.
The corresponding implementation in the source code is HTMLSelector.
Parser — HTML to MarkupStyle (Abstract of NSAttributedString.Key)
Next, we need to complete the conversion of HTML to MarkupStyle (NSAttributedString.Key).
NSAttributedString uses NSAttributedString.Key Attributes to set text styles. We abstract all fields of NSAttributedString.Key to MarkupStyle, MarkupStyleColor, MarkupStyleFont, and MarkupStyleParagraphStyle.
Purpose:
- The original Attributes data structure is [NSAttributedString.Key: Any?]. If we expose it directly, it becomes difficult to control the values users pass in, and a wrong value such as .font: 123 could lead to a crash.
- Styles need to be inheritable. For example, in <a><b>test</b></a>, the string "test" combines the link style with the bold style (bold + link); if we expose the Dictionary directly, it becomes difficult to manage inheritance rules.
- Encapsulate iOS/macOS (UIKit/AppKit) related objects.
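The inheritance rule can be pictured as "fill in whatever the child left unset". Below is a minimal sketch of that rule on a hypothetical three-field TextStyle struct; it is far smaller than the real style type and exists only to illustrate the fill-if-nil idea.

```swift
// Hypothetical, reduced style type used only in this sketch
struct TextStyle: Equatable {
    var isBold: Bool? = nil
    var link: String? = nil
    var colorHex: String? = nil

    /// Copies values from `parent` only into fields that are still nil,
    /// so values set on the child always win
    mutating func fillIfNil(from parent: TextStyle?) {
        guard let parent = parent else { return }
        isBold = isBold ?? parent.isBold
        link = link ?? parent.link
        colorHex = colorHex ?? parent.colorHex
    }
}

// <a><b>test</b></a>: the bold style inherits from the enclosing link style
let linkStyle = TextStyle(link: "https://zhgchg.li", colorHex: "#0000FF")
var boldStyle = TextStyle(isBold: true, colorHex: "#FF0000")
boldStyle.fillIfNil(from: linkStyle)
print(boldStyle)
```

After the call, boldStyle is bold, carries the link, and keeps its own color, because only the fields it left as nil were filled from the parent. Typed optional fields also rule out invalid values such as .font: 123 at compile time.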
MarkupStyle Struct
public struct MarkupStyle {
    public var font: MarkupStyleFont
    public var paragraphStyle: MarkupStyleParagraphStyle
    public var foregroundColor: MarkupStyleColor? = nil
    public var backgroundColor: MarkupStyleColor? = nil
    public var ligature: NSNumber? = nil
    public var kern: NSNumber? = nil
    public var tracking: NSNumber? = nil
    public var strikethroughStyle: NSUnderlineStyle? = nil
    public var underlineStyle: NSUnderlineStyle? = nil
    public var strokeColor: MarkupStyleColor? = nil
    public var strokeWidth: NSNumber? = nil
    public var shadow: NSShadow? = nil
    public var textEffect: String? = nil
    public var attachment: NSTextAttachment? = nil
    public var link: URL? = nil
    public var baselineOffset: NSNumber? = nil
    public var underlineColor: MarkupStyleColor? = nil
    public var strikethroughColor: MarkupStyleColor? = nil
    public var obliqueness: NSNumber? = nil
    public var expansion: NSNumber? = nil
    public var writingDirection: NSNumber? = nil
    public var verticalGlyphForm: NSNumber? = nil
    //...

    // Inherits from...
    // Default: If fields are nil, fill in from the provided object
    mutating func fillIfNil(from: MarkupStyle?) {
        guard let from = from else { return }

        var currentFont = self.font
        currentFont.fillIfNil(from: from.font)
        self.font = currentFont

        var currentParagraphStyle = self.paragraphStyle
        currentParagraphStyle.fillIfNil(from: from.paragraphStyle)
        self.paragraphStyle = currentParagraphStyle
        //..
    }

    // Convert MarkupStyle to [NSAttributedString.Key: Any]
    func render() -> [NSAttributedString.Key: Any] {
        var data: [NSAttributedString.Key: Any] = [:]

        if let font = font.getFont() {
            data[.font] = font
        }

        if let ligature = self.ligature {
            data[.ligature] = ligature
        }
        //...
        return data
    }
}

public struct MarkupStyleFont: MarkupStyleItem {
    public enum FontWeight {
        case style(FontWeightStyle)
        case rawValue(CGFloat)
    }
    public enum FontWeightStyle: String {
        case ultraLight, light, thin, regular, medium, semibold, bold, heavy, black
        // ...
    }

    public var size: CGFloat?
    public var weight: FontWeight?
    public var italic: Bool?
    //...
}

public struct MarkupStyleParagraphStyle: MarkupStyleItem {
    public var lineSpacing: CGFloat? = nil
    public var paragraphSpacing: CGFloat? = nil
    public var alignment: NSTextAlignment? = nil
    public var headIndent: CGFloat? = nil
    public var tailIndent: CGFloat? = nil
    public var firstLineHeadIndent: CGFloat? = nil
    public var minimumLineHeight: CGFloat? = nil
    public var maximumLineHeight: CGFloat? = nil
    public var lineBreakMode: NSLineBreakMode? = nil
    public var baseWritingDirection: NSWritingDirection? = nil
    public var lineHeightMultiple: CGFloat? = nil
    public var paragraphSpacingBefore: CGFloat? = nil
    public var hyphenationFactor: Float? = nil
    public var usesDefaultHyphenation: Bool? = nil
    public var tabStops: [NSTextTab]? = nil
    public var defaultTabInterval: CGFloat? = nil
    public var textLists: [NSTextList]? = nil
    public var allowsDefaultTighteningForTruncation: Bool? = nil
    public var lineBreakStrategy: NSParagraphStyle.LineBreakStrategy? = nil
    //...
}

public struct MarkupStyleColor {
    let red: Int
    let green: Int
    let blue: Int
    let alpha: CGFloat
//...
}
Corresponding to the original code’s MarkupStyle implementation
Additionally, refer to the W3C wiki, which lists corresponding color names and their RGB values: MarkupStyleColorName.swift
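To make the inheritance rule concrete, here is a minimal sketch of the fillIfNil idea on a cut-down style struct. InheritStyle and its three fields are assumptions made up for this illustration; the real MarkupStyle has many more fields plus nested font and paragraph items.

```swift
// Cut-down stand-in for MarkupStyle; the field names are illustrative only.
struct InheritStyle {
    var fontSize: Double? = nil
    var isBold: Bool? = nil
    var link: String? = nil

    // Keep every value that is already set; copy only the missing ones from `parent`.
    mutating func fillIfNil(from parent: InheritStyle?) {
        guard let parent = parent else { return }
        fontSize = fontSize ?? parent.fontSize
        isBold = isBold ?? parent.isBold
        link = link ?? parent.link
    }
}

// <a><b>test</b></a>: "test" starts from the bold style and inherits from the link.
let linkStyle = InheritStyle(fontSize: 13, link: "https://zhgchg.li")
var boldStyle = InheritStyle(fontSize: 15, isBold: true)
boldStyle.fillIfNil(from: linkStyle)
print(boldStyle)
```

The nearer style keeps its own font size, and only the missing link value is taken from the parent.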
HTMLTagStyleAttribute & HTMLTagStyleAttributeVisitor
I would like to mention these two objects because HTML tags can also be styled with inline CSS through the style attribute. We apply the same abstraction used for HTMLTagName to HTML Style Attributes.
For example, HTML might provide <a style="color:red;font-size:14px">RedLink</a>, indicating that this link should be styled in red with a font size of 14px.
public protocol HTMLTagStyleAttribute {
var styleName: String { get }
func accept<V: HTMLTagStyleAttributeVisitor>(_ visitor: V) -> V.Result
}
public protocol HTMLTagStyleAttributeVisitor {
associatedtype Result
func visit(styleAttribute: HTMLTagStyleAttribute) -> Result
func visit(_ styleAttribute: ColorHTMLTagStyleAttribute) -> Result
func visit(_ styleAttribute: FontSizeHTMLTagStyleAttribute) -> Result
//...
}
public extension HTMLTagStyleAttributeVisitor {
func visit(styleAttribute: HTMLTagStyleAttribute) -> Result {
return styleAttribute.accept(self)
}
}
Corresponding to the original code’s HTMLTagStyleAttribute implementation
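As a rough, self-contained illustration of the double dispatch these protocols rely on, the sketch below rebuilds the same shape with made-up names (SketchAttribute, SketchAttributeVisitor, DescribeVisitor). It is not the library's code, only the pattern: each attribute type calls the visit overload that matches its own type.

```swift
protocol SketchAttributeVisitor {
    associatedtype Result
    func visit(_ attribute: SketchColorAttribute) -> Result
    func visit(_ attribute: SketchFontSizeAttribute) -> Result
}

protocol SketchAttribute {
    var styleName: String { get }
    func accept<V: SketchAttributeVisitor>(_ visitor: V) -> V.Result
}

struct SketchColorAttribute: SketchAttribute {
    let styleName = "color"
    func accept<V: SketchAttributeVisitor>(_ visitor: V) -> V.Result {
        return visitor.visit(self) // resolves to the color overload
    }
}

struct SketchFontSizeAttribute: SketchAttribute {
    let styleName = "font-size"
    func accept<V: SketchAttributeVisitor>(_ visitor: V) -> V.Result {
        return visitor.visit(self) // resolves to the font-size overload
    }
}

// A visitor that only describes what it would do with the raw value.
struct DescribeVisitor: SketchAttributeVisitor {
    typealias Result = String
    let value: String
    func visit(_ attribute: SketchColorAttribute) -> String {
        return "text color = \(value)"
    }
    func visit(_ attribute: SketchFontSizeAttribute) -> String {
        return "font size = \(value)"
    }
}

// The caller only knows the protocol type; accept() picks the right overload.
let pairs: [(SketchAttribute, String)] = [
    (SketchColorAttribute(), "red"),
    (SketchFontSizeAttribute(), "14px"),
]
let described = pairs.map { pair in pair.0.accept(DescribeVisitor(value: pair.1)) }
print(described)
```

Adding a new operation then means writing one more visitor, without touching the attribute types.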
HTMLTagStyleAttributeToMarkupStyleVisitor
struct HTMLTagStyleAttributeToMarkupStyleVisitor: HTMLTagStyleAttributeVisitor {
typealias Result = MarkupStyle?
let value: String
func visit(_ styleAttribute: ColorHTMLTagStyleAttribute) -> Result {
// Use regex to extract Color Hex or map from HTML Pre-defined Color Name; see Source Code
guard let color = MarkupStyleColor(string: value) else { return nil }
return MarkupStyle(foregroundColor: color)
}
func visit(_ styleAttribute: FontSizeHTMLTagStyleAttribute) -> Result {
// Use regex to extract 10px -> 10; see Source Code
guard let size = self.convert(fromPX: value) else { return nil }
return MarkupStyle(font: MarkupStyleFont(size: CGFloat(size)))
}
// ...
}
Corresponding to the original code’s HTMLTagAttributeToMarkupStyleVisitor.swift implementation
The value passed to init is the attribute’s value; each visit overload converts it into the corresponding MarkupStyle field based on the attribute type being visited.
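The value handling hinted at in the comments above can be sketched as follows. This is a simplified assumption of mine, not the library's implementation: it uses plain string operations instead of the regular expressions the source mentions, handles only the 6-digit hex form, and the function names (parseStyleDeclarations, pixels, rgb) are hypothetical.

```swift
import Foundation

// Split "color:red;font-size:14px" into (name, value) pairs.
func parseStyleDeclarations(_ style: String) -> [(name: String, value: String)] {
    var result: [(name: String, value: String)] = []
    for declaration in style.split(separator: ";") {
        let parts = declaration.split(separator: ":", maxSplits: 1)
        guard parts.count == 2 else { continue }
        let name = parts[0].trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
        let value = parts[1].trimmingCharacters(in: .whitespacesAndNewlines)
        guard !name.isEmpty, !value.isEmpty else { continue }
        result.append((name: name, value: value))
    }
    return result
}

// "14px" -> 14; returns nil for anything that is not a plain px length.
func pixels(from value: String) -> Double? {
    let lowered = value.lowercased()
    guard lowered.hasSuffix("px") else { return nil }
    return Double(String(lowered.dropLast(2)))
}

// "#FF0000" -> (255, 0, 0); only the 6-digit form is handled in this sketch.
func rgb(fromHex value: String) -> (red: Int, green: Int, blue: Int)? {
    guard value.hasPrefix("#"), value.count == 7,
          let number = Int(value.dropFirst(), radix: 16) else { return nil }
    return (red: (number >> 16) & 0xFF, green: (number >> 8) & 0xFF, blue: number & 0xFF)
}

let declarations = parseStyleDeclarations("color:#FF0000; font-size:14px")
for declaration in declarations {
    print(declaration.name, "=>", declaration.value)
}
```

Returning nil for anything unrecognized mirrors the optional Result of the visitor: an unparseable value simply contributes no style.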
HTMLElementMarkupComponentMarkupStyleVisitor
After introducing the MarkupStyle object, we will convert the results from Normalization’s HTMLElementComponents into MarkupStyle.
// MarkupStyle Strategy
public enum MarkupStylePolicy {
case respectMarkupStyleFromCode // Prioritize styles from Code, filling in from HTML Style Attributes
case respectMarkupStyleFromHTMLStyleAttribute // Prioritize styles from HTML Style Attributes, filling in from Code
}
struct HTMLElementMarkupComponentMarkupStyleVisitor: MarkupVisitor {
typealias Result = MarkupStyle?
let policy: MarkupStylePolicy
let components: [HTMLElementMarkupComponent]
let styleAttributes: [HTMLTagStyleAttribute]
func visit(_ markup: BoldMarkup) -> Result {
// .bold is just a default style defined in MarkupStyle; see Source Code
return defaultVisit(components.value(markup: markup), defaultStyle: .bold)
}
func visit(_ markup: LinkMarkup) -> Result {
// .link is just a default style defined in MarkupStyle; see Source Code
var markupStyle = defaultVisit(components.value(markup: markup), defaultStyle: .link) ?? .link
// Retrieve the corresponding HtmlElement from HtmlElementComponents for LinkMarkup
// Look for the href parameter in the HtmlElement's attributes (the way HTML carries URL Strings)
if let href = components.value(markup: markup)?.attributes?["href"] as? String,
let url = URL(string: href) {
markupStyle.link = url
}
return markupStyle
}
// ...
}
extension HTMLElementMarkupComponentMarkupStyleVisitor {
// Retrieve the specified custom MarkupStyle from the HTMLTag container
private func customStyle(_ htmlElement: HTMLElementMarkupComponent.HTMLElement?) -> MarkupStyle? {
guard let customStyle = htmlElement?.tag.customStyle else {
return nil
}
return customStyle
}
// Default action
func defaultVisit(_ htmlElement: HTMLElementMarkupComponent.HTMLElement?, defaultStyle: MarkupStyle? = nil) -> Result {
var markupStyle: MarkupStyle? = customStyle(htmlElement) ?? defaultStyle
// Retrieve the corresponding HtmlElement for LinkMarkup from HtmlElementComponents
// Check if the HtmlElement's attributes contain a `Style` Attribute
guard let styleString = htmlElement?.attributes?["style"],
styleAttributes.count > 0 else {
// None
return markupStyle
}
// There are Style Attributes
// Split the Style Value string into an array
// font-size:14px;color:red -> ["font-size":"14px","color":"red"]
let styles = styleString.split(separator: ";").filter { $0.trimmingCharacters(in: .whitespacesAndNewlines) != "" }.map { $0.split(separator: ":") }
for style in styles {
guard style.count == 2 else {
continue
}
// e.g. font-size
let key = style[0].trimmingCharacters(in: .whitespacesAndNewlines)
// e.g. 14px
let value = style[1].trimmingCharacters(in: .whitespacesAndNewlines)
if let styleAttribute = styleAttributes.first(where: { $0.isEqualTo(styleName: key) }) {
// Use the previously mentioned HTMLTagStyleAttributeToMarkupStyleVisitor to convert back to MarkupStyle
let visitor = HTMLTagStyleAttributeToMarkupStyleVisitor(value: value)
if var thisMarkupStyle = visitor.visit(styleAttribute: styleAttribute) {
// If the Style Attribute has a value...
// Merge the previous MarkupStyle result
thisMarkupStyle.fillIfNil(from: markupStyle)
markupStyle = thisMarkupStyle
}
}
}
// If there is a default Style
if var defaultStyle = defaultStyle {
switch policy {
case .respectMarkupStyleFromHTMLStyleAttribute:
// Style Attribute MarkupStyle takes precedence, then
// Merge the defaultStyle result
markupStyle?.fillIfNil(from: defaultStyle)
case .respectMarkupStyleFromCode:
// defaultStyle takes precedence, then
// Merge the Style Attribute MarkupStyle result
defaultStyle.fillIfNil(from: markupStyle)
markupStyle = defaultStyle
}
}
return markupStyle
}
}
Corresponding to the original code’s HTMLElementMarkupComponentMarkupStyleVisitor implementation
We will define some default styles in MarkupStyle, which will be used when certain Markup does not have externally specified styles.
There are two style inheritance strategies:
- respectMarkupStyleFromCode: Use the default style as the primary; then fill in from the Style Attributes only the fields that are still unset, without overwriting existing values.
- respectMarkupStyleFromHTMLStyleAttribute: Use the Style Attributes as the primary; then fill in from the default style only the fields that are still unset, without overwriting existing values.
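The difference between the two strategies is only which side is the base and which side fills the gaps. A minimal sketch, assuming a two-field stand-in style (PolicyStyle, SketchPolicy, and resolve are hypothetical names, not the library's):

```swift
enum SketchPolicy {
    case respectStyleFromCode
    case respectStyleFromHTMLStyleAttribute
}

struct PolicyStyle: Equatable {
    var color: String? = nil
    var fontSize: Double? = nil

    // Fill only the fields that are still nil.
    mutating func fillIfNil(from other: PolicyStyle?) {
        guard let other = other else { return }
        color = color ?? other.color
        fontSize = fontSize ?? other.fontSize
    }
}

func resolve(codeStyle: PolicyStyle, attributeStyle: PolicyStyle, policy: SketchPolicy) -> PolicyStyle {
    switch policy {
    case .respectStyleFromCode:
        var result = codeStyle               // code wins
        result.fillIfNil(from: attributeStyle)
        return result
    case .respectStyleFromHTMLStyleAttribute:
        var result = attributeStyle          // HTML style attribute wins
        result.fillIfNil(from: codeStyle)
        return result
    }
}

// Code says: blue, 13pt. The HTML style attribute says: color:red (no size).
let codeStyle = PolicyStyle(color: "blue", fontSize: 13)
let attributeStyle = PolicyStyle(color: "red")
let fromCode = resolve(codeStyle: codeStyle, attributeStyle: attributeStyle, policy: .respectStyleFromCode)
let fromHTML = resolve(codeStyle: codeStyle, attributeStyle: attributeStyle, policy: .respectStyleFromHTMLStyleAttribute)
print(fromCode, fromHTML)
```

In both cases the font size comes from the code style, because the style attribute never set one; only the conflicting color changes with the policy.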
HTMLElementWithMarkupToMarkupStyleProcessor
This converts the normalization results into an AST & MarkupStyleComponent.
We declare a new MarkupComponent to store the corresponding MarkupStyle:
struct MarkupStyleComponent: MarkupComponent {
typealias T = MarkupStyle
let markup: Markup
let value: MarkupStyle
init(markup: Markup, value: MarkupStyle) {
self.markup = markup
self.value = value
}
}
A simple traversal of the Markup Tree & HTMLElementMarkupComponent structure:
let styleAttributes: [HTMLTagStyleAttribute]
let policy: MarkupStylePolicy
func process(from: (Markup, [HTMLElementMarkupComponent])) -> [MarkupStyleComponent] {
var components: [MarkupStyleComponent] = []
let visitor = HTMLElementMarkupComponentMarkupStyleVisitor(policy: policy, components: from.1, styleAttributes: styleAttributes)
walk(markup: from.0, visitor: visitor, components: &components)
return components
}
func walk(markup: Markup, visitor: HTMLElementMarkupComponentMarkupStyleVisitor, components: inout [MarkupStyleComponent]) {
if let markupStyle = visitor.visit(markup: markup) {
components.append(.init(markup: markup, value: markupStyle))
}
for markup in markup.childMarkups {
walk(markup: markup, visitor: visitor, components: &components)
}
}
// print(components)
// [(markup: LinkMarkup, value: MarkupStyle(link: https://zhgchg.li, color: .blue)),
//  (markup: BoldMarkup, value: MarkupStyle(font: .init(weight: .bold)))]
Corresponding to the original code’s HTMLElementWithMarkupToMarkupStyleProcessor.swift implementation
Render — Convert To NSAttributedString
Now that we have the abstract tree structure of HTML Tags and the corresponding MarkupStyle, the final step is to produce the final NSAttributedString rendering result.
MarkupNSAttributedStringVisitor
Visit markup to NSAttributedString
struct MarkupNSAttributedStringVisitor: MarkupVisitor {
typealias Result = NSAttributedString
let components: [MarkupStyleComponent]
// Root/base MarkupStyle, specified externally, e.g., can specify the size of the entire text
let rootStyle: MarkupStyle?
func visit(_ markup: RootMarkup) -> Result {
// Look down to the RawString object
return collectAttributedString(markup)
}
func visit(_ markup: RawStringMarkup) -> Result {
// Return Raw String
// Collect all MarkupStyles along the chain
// Apply Style to NSAttributedString
return applyMarkupStyle(markup.attributedString, with: collectMarkupStyle(markup))
}
func visit(_ markup: BoldMarkup) -> Result {
// Look down to the RawString object
return collectAttributedString(markup)
}
func visit(_ markup: LinkMarkup) -> Result {
// Look down to the RawString object
return collectAttributedString(markup)
}
// ...
}
private extension MarkupNSAttributedStringVisitor {
// Apply Style to NSAttributedString
func applyMarkupStyle(_ attributedString: NSAttributedString, with markupStyle: MarkupStyle?) -> NSAttributedString {
guard let markupStyle = markupStyle else { return attributedString }
let mutableAttributedString = NSMutableAttributedString(attributedString: attributedString)
mutableAttributedString.addAttributes(markupStyle.render(), range: NSMakeRange(0, mutableAttributedString.string.utf16.count))
return mutableAttributedString
}
func collectAttributedString(_ markup: Markup) -> NSMutableAttributedString {
// Collect from downstream
// Root -> Bold -> String("Bold")
// \
// > String("Test")
// Result: Bold Test
// Recursively visit and combine the final NSAttributedString layer by layer
return markup.childMarkups.compactMap({ visit(markup: $0) }).reduce(NSMutableAttributedString()) { partialResult, attributedString in
partialResult.append(attributedString)
return partialResult
}
}
func collectMarkupStyle(_ markup: Markup) -> MarkupStyle? {
// Collect from upstream
// String("Test") -> Bold -> Italic -> Root
// Result: style: Bold+Italic
// Recursively find the parent tag's markup style
// Then inherit styles layer by layer
var currentMarkup: Markup? = markup.parentMarkup
var currentStyle = components.value(markup: markup)
while let thisMarkup = currentMarkup {
guard let thisMarkupStyle = components.value(markup: thisMarkup) else {
currentMarkup = thisMarkup.parentMarkup
continue
}
if var thisCurrentStyle = currentStyle {
thisCurrentStyle.fillIfNil(from: thisMarkupStyle)
currentStyle = thisCurrentStyle
} else {
currentStyle = thisMarkupStyle
}
currentMarkup = thisMarkup.parentMarkup
}
if var currentStyle = currentStyle {
currentStyle.fillIfNil(from: rootStyle)
return currentStyle
} else {
return rootStyle
}
}
}
Corresponding to the original code’s MarkupNSAttributedStringVisitor.swift implementation
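The upward collection in collectMarkupStyle can be tried out in isolation. The sketch below is an assumption-heavy simplification: styles are plain [String: String] dictionaries, nodes are a made-up SketchNode class, and "fill if nil" becomes a dictionary merge that keeps existing keys.

```swift
final class SketchNode {
    let name: String
    weak var parent: SketchNode?
    private(set) var children: [SketchNode] = []

    init(_ name: String) { self.name = name }

    @discardableResult
    func add(_ child: SketchNode) -> SketchNode {
        child.parent = self
        children.append(child)
        return child
    }
}

typealias SketchAttributes = [String: String]

// Walk from the node up to the root; nearer nodes win, ancestors only fill gaps.
func collectStyle(for node: SketchNode,
                  styles: [ObjectIdentifier: SketchAttributes],
                  rootStyle: SketchAttributes) -> SketchAttributes {
    var result = styles[ObjectIdentifier(node)] ?? [:]
    var current = node.parent
    while let ancestor = current {
        if let ancestorStyle = styles[ObjectIdentifier(ancestor)] {
            result.merge(ancestorStyle) { existing, _ in existing }
        }
        current = ancestor.parent
    }
    // The externally supplied root style has the lowest priority.
    result.merge(rootStyle) { existing, _ in existing }
    return result
}

// Root -> Link -> Bold -> "test"
let root = SketchNode("root")
let link = root.add(SketchNode("a"))
let bold = link.add(SketchNode("b"))
let text = bold.add(SketchNode("test"))

let styles: [ObjectIdentifier: SketchAttributes] = [
    ObjectIdentifier(link): ["link": "https://zhgchg.li", "color": "blue"],
    ObjectIdentifier(bold): ["weight": "bold"],
]
let collected = collectStyle(for: text, styles: styles, rootStyle: ["size": "13", "color": "black"])
print(collected)
```

The link's blue color overrides the root's black because the link sits closer to the text node than the root style does.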
Ultimately, we can achieve:
Li{
NSColor = "Blue";
NSFont = "<UICTFont: 0x145d17600> font-family: \".SFUI-Regular\"; font-weight: normal; font-style: normal; font-size: 13.00pt";
NSLink = "https://zhgchg.li";
}nk{
NSColor = "Blue";
NSFont = "<UICTFont: 0x145d18710> font-family: \".SFUI-Semibold\"; font-weight: bold; font-style: normal; font-size: 13.00pt";
NSLink = "https://zhgchg.li";
}Bold{
NSFont = "<UICTFont: 0x145d18710> font-family: \".SFUI-Semibold\"; font-weight: bold; font-style: normal; font-size: 13.00pt";
}
🎉🎉🎉🎉 Completed 🎉🎉🎉🎉
We have now completed the entire process of converting an HTML String to NSAttributedString.
Stripper — Removing HTML Tags
Removing HTML Tags is relatively simple; it only requires:
func attributedString(_ markup: Markup) -> NSAttributedString {
if let rawStringMarkup = markup as? RawStringMarkup {
return rawStringMarkup.attributedString
} else {
return markup.childMarkups.compactMap({ attributedString($0) }).reduce(NSMutableAttributedString()) { partialResult, attributedString in
partialResult.append(attributedString)
return partialResult
}
}
}
Corresponding to the original code’s MarkupStripperProcessor.swift implementation
This is similar to Render, but purely returns the content after finding RawStringMarkup.
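The same idea on a toy tree, as a sketch that runs without the library (SketchMarkup and strip are hypothetical names, and plain String stands in for NSAttributedString):

```swift
enum SketchMarkup {
    case text(String)
    case element(tag: String, children: [SketchMarkup])
}

// Ignore every tag and concatenate the raw strings in document order.
func strip(_ markup: SketchMarkup) -> String {
    switch markup {
    case .text(let string):
        return string
    case .element(_, let children):
        return children.map(strip).joined()
    }
}

// <b>Test<a>Link</a></b>
let tree = SketchMarkup.element(tag: "b", children: [
    .text("Test"),
    .element(tag: "a", children: [.text("Link")]),
])
let stripped = strip(tree)
print(stripped)
```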
Extend — Dynamic Expansion
To cover HTML tags and style attributes beyond the built-in ones, an extension point was added so that new tags and style attributes can be registered dynamically from code.
public struct ExtendTagName: HTMLTagName {
public let string: String
public init(_ w3cHTMLTagName: WC3HTMLTagName) {
self.string = w3cHTMLTagName.rawValue
}
public init(_ string: String) {
self.string = string.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
}
public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagNameVisitor {
return visitor.visit(self)
}
}
// to
final class ExtendMarkup: Markup {
weak var parentMarkup: Markup? = nil
var childMarkups: [Markup] = []
func accept<V>(_ visitor: V) -> V.Result where V : MarkupVisitor {
return visitor.visit(self)
}
}
//----
public struct ExtendHTMLTagStyleAttribute: HTMLTagStyleAttribute {
public let styleName: String
public let render: ((String) -> (MarkupStyle?)) // Dynamic closure to change MarkupStyle
public init(styleName: String, render: @escaping ((String) -> (MarkupStyle?))) {
self.styleName = styleName
self.render = render
}
public func accept<V>(_ visitor: V) -> V.Result where V : HTMLTagStyleAttributeVisitor {
return visitor.visit(self)
}
}
ZHTMLParserBuilder
Finally, we use the Builder Pattern to allow external modules to quickly construct the objects needed for ZMarkupParser, while also managing access level control.
public final class ZHTMLParserBuilder {
private(set) var htmlTags: [HTMLTag] = []
private(set) var styleAttributes: [HTMLTagStyleAttribute] = []
private(set) var rootStyle: MarkupStyle?
private(set) var policy: MarkupStylePolicy = .respectMarkupStyleFromCode
public init() {
}
public static func initWithDefault() -> Self {
var builder = Self.init()
for htmlTagName in ZHTMLParserBuilder.htmlTagNames {
builder = builder.add(htmlTagName)
}
for styleAttribute in ZHTMLParserBuilder.styleAttributes {
builder = builder.add(styleAttribute)
}
return builder
}
public func set(_ htmlTagName: HTMLTagName, withCustomStyle markupStyle: MarkupStyle?) -> Self {
return self.add(htmlTagName, withCustomStyle: markupStyle)
}
public func add(_ htmlTagName: HTMLTagName, withCustomStyle markupStyle: MarkupStyle? = nil) -> Self {
// Only one instance of the same tagName can exist
htmlTags.removeAll { htmlTag in
return htmlTag.tagName.string == htmlTagName.string
}
htmlTags.append(HTMLTag(tagName: htmlTagName, customStyle: markupStyle))
return self
}
public func add(_ styleAttribute: HTMLTagStyleAttribute) -> Self {
styleAttributes.removeAll { thisStyleAttribute in
return thisStyleAttribute.styleName == styleAttribute.styleName
}
styleAttributes.append(styleAttribute)
return self
}
public func set(rootStyle: MarkupStyle) -> Self {
self.rootStyle = rootStyle
return self
}
public func set(policy: MarkupStylePolicy) -> Self {
self.policy = policy
return self
}
public func build() -> ZHTMLParser {
// ZHTMLParser init is only accessible internally; it cannot be directly initialized from outside
// It can only be initialized through ZHTMLParserBuilder
return ZHTMLParser(htmlTags: htmlTags, styleAttributes: styleAttributes, policy: policy, rootStyle: rootStyle)
}
}
Corresponding implementation in the original source code: ZHTMLParserBuilder.swift
The initWithDefault method adds all implemented HTMLTagName/Style Attributes by default.
public extension ZHTMLParserBuilder {
static var htmlTagNames: [HTMLTagName] {
return [
A_HTMLTagName(),
B_HTMLTagName(),
BR_HTMLTagName(),
DIV_HTMLTagName(),
HR_HTMLTagName(),
I_HTMLTagName(),
LI_HTMLTagName(),
OL_HTMLTagName(),
P_HTMLTagName(),
SPAN_HTMLTagName(),
STRONG_HTMLTagName(),
U_HTMLTagName(),
UL_HTMLTagName(),
DEL_HTMLTagName(),
TR_HTMLTagName(),
TD_HTMLTagName(),
TH_HTMLTagName(),
TABLE_HTMLTagName(),
IMG_HTMLTagName(handler: nil),
// ...
]
}
}
public extension ZHTMLParserBuilder {
static var styleAttributes: [HTMLTagStyleAttribute] {
return [
ColorHTMLTagStyleAttribute(),
BackgroundColorHTMLTagStyleAttribute(),
FontSizeHTMLTagStyleAttribute(),
FontWeightHTMLTagStyleAttribute(),
LineHeightHTMLTagStyleAttribute(),
WordSpacingHTMLTagStyleAttribute(),
// ...
]
}
}
The ZHTMLParser initializer is only accessible internally; it cannot be called directly from outside and can only be reached through ZHTMLParserBuilder.
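The two properties of this builder, "one entry per tag name" and "the product can only be created through the builder", can be reproduced in a few lines. This is a sketch with made-up types (SketchTag, SketchParser, SketchParserBuilder); fileprivate stands in for the internal init, since everything here lives in one file.

```swift
struct SketchTag: Equatable {
    let name: String
    let customStyle: String?
}

final class SketchParser {
    let tags: [SketchTag]
    // Not reachable from other files: callers have to go through the builder.
    fileprivate init(tags: [SketchTag]) {
        self.tags = tags
    }
}

final class SketchParserBuilder {
    private(set) var tags: [SketchTag] = []

    // Adding a tag name that already exists replaces the earlier entry.
    @discardableResult
    func add(_ name: String, withCustomStyle style: String? = nil) -> Self {
        let normalized = name.lowercased()
        tags.removeAll { $0.name == normalized }
        tags.append(SketchTag(name: normalized, customStyle: style))
        return self
    }

    func build() -> SketchParser {
        return SketchParser(tags: tags)
    }
}

let sketchParser = SketchParserBuilder()
    .add("b")
    .add("a")
    .add("B", withCustomStyle: "heavy") // replaces the plain "b" entry
    .build()
print(sketchParser.tags)
```

Returning Self from every setter is what makes the chained call style possible.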
ZHTMLParser encapsulates Render/Selector/Stripper operations:
public final class ZHTMLParser: ZMarkupParser {
let htmlTags: [HTMLTag]
let styleAttributes: [HTMLTagStyleAttribute]
let rootStyle: MarkupStyle?
internal init(...) {
}
// Retrieve link style attributes
public var linkTextAttributes: [NSAttributedString.Key: Any] {
// ...
}
public func selector(_ string: String) -> HTMLSelector {
// ...
}
public func selector(_ attributedString: NSAttributedString) -> HTMLSelector {
// ...
}
public func render(_ string: String) -> NSAttributedString {
// ...
}
// Allows rendering of NSAttributedString within nodes using HTMLSelector results
public func render(_ selector: HTMLSelector) -> NSAttributedString {
// ...
}
public func render(_ attributedString: NSAttributedString) -> NSAttributedString {
// ...
}
public func stripper(_ string: String) -> String {
// ...
}
public func stripper(_ attributedString: NSAttributedString) -> NSAttributedString {
// ...
}
// ...
}
Corresponding implementation in the original source code: ZHTMLParser.swift
UIKit Issues
The resulting NSAttributedString is most commonly displayed in a UITextView, but there are some important considerations:
- The link style in UITextView is determined uniformly by its linkTextAttributes setting; it ignores the NSAttributedString.Key settings, and individual links cannot be styled separately. Hence the need for the ZMarkupParser.linkTextAttributes property.
- Currently, UILabel has no way to change link styles, and since UILabel does not have NSTextStorage, loading NSTextAttachment images requires handling UILabel separately.
public extension UITextView {
    func setHtmlString(_ string: String, with parser: ZHTMLParser) {
        self.setHtmlString(NSAttributedString(string: string), with: parser)
    }
    func setHtmlString(_ string: NSAttributedString, with parser: ZHTMLParser) {
        self.attributedText = parser.render(string)
        self.linkTextAttributes = parser.linkTextAttributes
    }
}
public extension UILabel {
    func setHtmlString(_ string: String, with parser: ZHTMLParser) {
        self.setHtmlString(NSAttributedString(string: string), with: parser)
    }
    func setHtmlString(_ string: NSAttributedString, with parser: ZHTMLParser) {
        let attributedString = parser.render(string)
        // Register the label on every ZNSTextAttachment so it can refresh once the image loads
        attributedString.enumerateAttribute(NSAttributedString.Key.attachment, in: NSMakeRange(0, attributedString.string.utf16.count), options: []) { (value, effectiveRange, _) in
            guard let attachment = value as? ZNSTextAttachment else {
                return
            }
            attachment.register(self)
        }
        self.attributedText = attributedString
    }
}
Thus, we extended UIKit, allowing external modules to simply call setHtmlString() to complete the binding.
Complex Rendering Items — Item Lists
Here’s a record of the implementation regarding item lists.
In HTML, wrapping <li> in <ol> / <ul> represents an item list:
<ul>
<li>ItemA</li>
<li>ItemB</li>
<li>ItemC</li>
//...
</ul>
Using the parsing method mentioned earlier, we can retrieve the sibling list items in visit(_ markup: ListItemMarkup) to know the current list index (thanks to the conversion to AST).
func visit(_ markup: ListItemMarkup) -> Result {
    let siblingListItems = markup.parentMarkup?.childMarkups.filter({ $0 is ListItemMarkup }) ?? []
    let position = (siblingListItems.firstIndex(where: { $0 === markup }) ?? 0)
    // ... use position to build the list marker, then return the rendered string
}
NSParagraphStyle has an NSTextList object that can be used to display list items, but in practice the width of its whitespace cannot be customized (I personally feel the whitespace is too large). If there is whitespace between the bullet and the string, it can trigger a line break there, which may look a bit odd, as shown in the image below:
A better result can potentially be achieved by setting headIndent, firstLineHeadIndent, and NSTextTab, but testing revealed that if the string is too long or the size changes, it still does not present a perfect result.
Currently, we have only achieved an acceptable solution by manually inserting the list item marker string before the item string.
We only use NSTextList.MarkerFormat to generate the list item symbols, rather than using NSTextList directly.
For a list of supported list symbols, refer to: MarkupStyleList.swift
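The "insert the marker string manually" approach boils down to something like the sketch below. It is only an illustration on plain strings: the marker formats ("1." and "•") and the names SketchListKind and renderList are assumptions, whereas the library derives its markers from NSTextList.MarkerFormat.

```swift
enum SketchListKind {
    case ordered
    case unordered
}

// Prefix each item with a marker derived from its position among its siblings.
func renderList(_ items: [String], kind: SketchListKind) -> String {
    return items.enumerated().map { index, item -> String in
        let marker: String
        switch kind {
        case .ordered:
            marker = "\(index + 1)."
        case .unordered:
            marker = "•"
        }
        return "\(marker) \(item)"
    }.joined(separator: "\n")
}

let orderedList = renderList(["ItemA", "ItemB", "ItemC"], kind: .ordered)
let unorderedList = renderList(["ItemA", "ItemB"], kind: .unordered)
print(orderedList)
print(unorderedList)
```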
Final display result (<ol><li>):
Complex Rendering Items — Tables
Similar to the implementation of item lists, but for tables.
In HTML, <table> wraps <tr> table rows, which in turn wrap <td>/<th> table columns:
<table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
Testing revealed that the native NSAttributedString.DocumentType.html renders tables with NSTextBlock, an AppKit API that is public on macOS but private on iOS, which is how it can fully render HTML table styles and content.
A bit of a cheat! We cannot use private APIs 🥲
func visit(_ markup: TableColumnMarkup) -> Result {
let attributedString = collectAttributedString(markup)
let siblingColumns = markup.parentMarkup?.childMarkups.filter({ $0 is TableColumnMarkup }) ?? []
let position = (siblingColumns.firstIndex(where: { $0 === markup }) ?? 0)
// Optionally specify the desired width from the outside; can set to .max to avoid truncating the string
var maxLength: Int? = markup.fixedMaxLength
if maxLength == nil {
// If not specified, find the length of the string in the first row of the same column as the max length
if let tableRowMarkup = markup.parentMarkup as? TableRowMarkup,
let firstTableRow = tableRowMarkup.parentMarkup?.childMarkups.first(where: { $0 is TableRowMarkup }) as? TableRowMarkup {
let firstTableRowColumns = firstTableRow.childMarkups.filter({ $0 is TableColumnMarkup })
if firstTableRowColumns.indices.contains(position) {
let firstTableRowColumnAttributedString = collectAttributedString(firstTableRowColumns[position])
let length = firstTableRowColumnAttributedString.string.utf16.count
maxLength = length
}
}
}
if let maxLength = maxLength {
// If the column exceeds maxLength, truncate the string
if attributedString.string.utf16.count > maxLength {
attributedString.mutableString.setString(String(attributedString.string.prefix(maxLength))+"...")
} else {
attributedString.mutableString.setString(attributedString.string.padding(toLength: maxLength, withPad: " ", startingAt: 0))
}
}
if position < siblingColumns.count - 1 {
// Add whitespace as spacing; the external can specify how many spaces to use for spacing
attributedString.append(makeString(in: markup, string: String(repeating: " ", count: markup.spacing)))
}
return attributedString
}
func visit(_ markup: TableRowMarkup) -> Result {
let attributedString = collectAttributedString(markup)
attributedString.append(makeBreakLine(in: markup)) // Add a line break; see source code for details
return attributedString
}
func visit(_ markup: TableMarkup) -> Result {
let attributedString = collectAttributedString(markup)
attributedString.append(makeBreakLine(in: markup)) // Add a line break; see source code for details
attributedString.insert(makeBreakLine(in: markup), at: 0) // Add a line break; see source code for details
return attributedString
}
Final presentation effect as shown below:
Not perfect, but acceptable.
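The column logic above (use the first row as the width reference, truncate longer cells, pad shorter ones, then separate columns with spaces) can be reproduced on plain strings. This sketch is a simplification under my own assumptions: it counts Characters rather than UTF-16 units, and fit and renderTable are hypothetical helper names.

```swift
// Truncate with "..." when too long, otherwise pad with spaces up to `length`.
func fit(_ text: String, toLength length: Int) -> String {
    if text.count > length {
        return String(text.prefix(length)) + "..."
    }
    return text + String(repeating: " ", count: length - text.count)
}

// The first row decides each column's width; `spacing` spaces separate columns.
func renderTable(_ rows: [[String]], spacing: Int = 2) -> String {
    guard let header = rows.first else { return "" }
    let widths = header.map { $0.count }
    let gap = String(repeating: " ", count: spacing)
    return rows.map { row -> String in
        row.enumerated().map { index, cell -> String in
            index < widths.count ? fit(cell, toLength: widths[index]) : cell
        }.joined(separator: gap)
    }.joined(separator: "\n")
}

let table = renderTable([
    ["Company", "Contact"],
    ["Alfreds Futterkiste", "Maria"],
])
print(table)
```

With a proportional font the columns still will not line up exactly, which matches the "not perfect, but acceptable" result described above.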
Complex Rendering Items — Images
Finally, the biggest challenge is loading remote images into NSAttributedString.
In HTML, <img> represents an image:
<img src="https://user-images.githubusercontent.com/33706588/219608966-20e0c017-d05c-433a-9a52-091bc0cfd403.jpg" width="300" height="125"/>
You can specify the desired display size through the width / height HTML attributes.
Displaying images in NSAttributedString is more complex than expected, and there isn’t a perfect implementation. Previously, when working on UITextView text wrapping, I encountered some pitfalls, but after further research, I found that there still isn’t a perfect solution.
For now, we set aside the native NSTextAttachment’s problems with reuse and memory release, and only implement downloading images from remote sources, placing them into NSTextAttachment, and making sure the content updates automatically.
This series of operations has been broken down into another small project for easier optimization and reuse in other projects:
The main reference is the series of articles on Asynchronous NSTextAttachments, but I modified the final content update part (after downloading, the UI needs to refresh to display) and added Delegate/DataSource for external extensibility.
The operational flow and relationships are as follows:
- Declare a ZNSTextAttachmentable object that encapsulates the NSTextStorage object (which UITextView has) and the UILabel itself (since UILabel lacks NSTextStorage). Its only operation is to replace the attributedString at an NSRange (func replace(attachment: ZNSTextAttachment, to: ZResizableNSTextAttachment)).
- The principle is to first use ZNSTextAttachment to wrap the imageURL, PlaceholderImage, and the size information to be displayed, then directly show the placeholder image.
- When the system needs this image on the screen, it calls the image(forBounds…) method, at which point we start downloading the image data.
- The DataSource allows external customization of how to download or how to implement image cache policies; the default is a direct URLSession request for the image data.
- After downloading, a new ZResizableNSTextAttachment is created, and the logic for setting the custom image size is implemented in attachmentBounds(for…).
- Call the replace(attachment: ZNSTextAttachment, to: ZResizableNSTextAttachment) method to replace the ZNSTextAttachment at its position with the ZResizableNSTextAttachment.
- Emit a didLoad Delegate notification, allowing external connections if needed.
- Done.
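The steps above can be modelled without UIKit to see how the pieces relate. Everything in this sketch is a hypothetical stand-in (PlaceholderSketch, LoadedImageSketch, AttachmentHostSketch, TextStorageSketch), the download is a synchronous fake closure instead of a URLSession request, and the text is a simple array of pieces rather than an NSAttributedString.

```swift
// Hypothetical stand-ins for illustration only; not the ZNSTextAttachment API.
final class LoadedImageSketch {
    let bytes: [UInt8]
    init(bytes: [UInt8]) { self.bytes = bytes }
}

protocol AttachmentHostSketch: AnyObject {
    // Swap the placeholder for the loaded image at the same position.
    func replace(_ placeholder: PlaceholderSketch, with loaded: LoadedImageSketch)
}

final class PlaceholderSketch {
    let imageURL: String
    private weak var host: AttachmentHostSketch?
    private var didStartLoading = false

    init(imageURL: String) { self.imageURL = imageURL }

    func register(_ host: AttachmentHostSketch) { self.host = host }

    // Stands in for the moment the system asks for the image: load once, then notify the host.
    func willDisplay(download: (String) -> [UInt8]) {
        guard !didStartLoading else { return }
        didStartLoading = true
        let loaded = LoadedImageSketch(bytes: download(imageURL))
        host?.replace(self, with: loaded)
    }
}

final class TextStorageSketch: AttachmentHostSketch {
    enum Piece {
        case text(String)
        case placeholder(PlaceholderSketch)
        case image(LoadedImageSketch)
    }
    var pieces: [Piece] = []

    func replace(_ placeholder: PlaceholderSketch, with loaded: LoadedImageSketch) {
        pieces = pieces.map { piece -> Piece in
            if case .placeholder(let candidate) = piece, candidate === placeholder {
                return .image(loaded)
            }
            return piece
        }
    }
}

let storage = TextStorageSketch()
let placeholder = PlaceholderSketch(imageURL: "https://example.com/image.jpg")
placeholder.register(storage)
storage.pieces = [.text("Hello "), .placeholder(placeholder)]
placeholder.willDisplay { _ in [1, 2, 3] } // fake download result
```

Because the host performs the replacement itself, the same placeholder logic can serve both a text-storage-backed view and a label, which is the role the list above assigns to ZNSTextAttachmentable.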
For detailed code, refer to the Source Code.
NSLayoutManager.invalidateLayout(forCharacterRange: range, actualCharacterRange: nil) and NSLayoutManager.invalidateDisplay(forCharacterRange: range) are not used to refresh the UI because, in testing, the UI did not display the update correctly with them; since we already know the range, directly replacing that range of the NSAttributedString ensures the UI updates correctly.
The final display result is as follows:
<span style="color:red">こんにちは</span>こんにちはこんにちは <br />
<img src="https://user-images.githubusercontent.com/33706588/219608966-20e0c017-d05c-433a-9a52-091bc0cfd403.jpg"/>
Testing & Continuous Integration
This project not only involved writing unit tests but also established snapshot tests for integration testing, making it easier to compare the final NSAttributedString in a comprehensive manner.
The main functional logic is covered by unit tests, along with integration tests, resulting in a final Test Coverage of around 85%.
Snapshot Test
Directly import the framework:
import SnapshotTesting
// ...
func testShouldKeepNSAttributedString() {
let parser = ZHTMLParserBuilder.initWithDefault().build()
let textView = UITextView()
textView.frame.size.width = 390
textView.isScrollEnabled = false
textView.backgroundColor = .white
textView.setHtmlString("html string...", with: parser)
textView.layoutIfNeeded()
assertSnapshot(matching: textView, as: .image, record: false)
}
// ...
This directly compares the final result to ensure that adjustments made during integration do not cause any issues.
Codecov Test Coverage
Integrating with Codecov.io (free for public repositories) allows for evaluating test coverage. You just need to install the Codecov GitHub App and set it up.
Once the Codecov and GitHub repository are set up, you can also add a codecov.yml file in the root directory:
comment: # this is a top-level key
  layout: "reach, diff, flags, files"
  behavior: default
  require_changes: false # if true: only post the comment if coverage changes
  require_base: no # [yes :: must have a base report to post]
  require_head: yes # [yes :: must have a head report to post]
With this configuration, Codecov automatically posts a comment with the coverage results after CI runs on each submitted PR.
Continuous Integration
GitHub Action CI integration: ci.yml
name: CI
on:
  workflow_dispatch:
  pull_request:
    types: [opened, reopened]
  push:
    branches:
      - main
jobs:
  build:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: spm build and test
        run: |
          set -o pipefail
          xcodebuild test -workspace ZMarkupParser.xcworkspace -testPlan ZMarkupParser -scheme ZMarkupParser -enableCodeCoverage YES -resultBundlePath './scripts/TestResult.xcresult' -destination 'platform=iOS Simulator,name=iPhone 14,OS=16.1' build test | xcpretty
      - name: Codecov
        uses: codecov/codecov-action@v3.1.1
        with:
          xcode: true
          xcode_archive_path: './scripts/TestResult.xcresult'
This configuration runs the build and test when a PR is opened/reopened or when pushing to the main branch, and uploads the test coverage report to Codecov.
Regex
Every time I use regular expressions, I understand them a little better. I didn't use them extensively this time, but I initially wanted to extract paired HTML tags with a regex, which led me to research how to write one.
Here are some cheat sheet notes from what I learned this time:
- ?: lets ( ) group a sub-pattern without capturing it. e.g. (?:https?:\/\/)?(?:www\.)?example\.com matches the entire URL in https://www.example.com without separately capturing https:// and www.
- .+? performs a non-greedy match (returns the shortest possible match). e.g. <.+?> returns <a> and </a> in <a>test</a> instead of the entire string.
- (?=XYZ) is a lookahead: it asserts that the string XYZ follows without consuming it, so .+?(?=XYZ) matches any string until XYZ appears; note that [^XYZ] instead matches any single character other than X, Y, or Z. e.g. (?:__)(.+?(?=__))(?:__) (matches any string until __) will capture test from an input such as __test__.
- ?R recursively applies the whole pattern. e.g. \((?:[^()]|((?R)))+\) will match (simple) and (and(nested)) in (simple) (and(nested)), including the inner (nested).
- ?<GroupName> … \k<GroupName> names a group and later matches the same text that the named group matched. e.g. (?<tagName><a>).*(\k<tagName>)
- (?(X)yes|no) matches yes if the Xth group has a value (group names can also be used), otherwise matches no. Swift does not currently support this.
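A few of the rules above can be tried with NSRegularExpression. The following is a minimal sketch; the helper name matches(of:in:) and the sample inputs are assumptions made for illustration, and the paired-tag pattern is a simplified variant of the named-group example, not the pattern used in the project.

```swift
import Foundation

/// Returns every match of `pattern` in `text`; each match is an array of its
/// groups (index 0 is the whole match, groups that did not match become "").
func matches(of pattern: String, in text: String) -> [[String]] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: fullRange).map { result in
        (0..<result.numberOfRanges).map { index in
            Range(result.range(at: index), in: text).map { String(text[$0]) } ?? ""
        }
    }
}

// Non-greedy: each tag is matched on its own.
let tags = matches(of: "<.+?>", in: "<a>test</a>").map { $0[0] }

// Greedy, for comparison: the whole string is one match.
let greedy = matches(of: "<.+>", in: "<a>test</a>").map { $0[0] }

// Non-capturing groups plus lookahead: only the inner text is captured.
let emphasized = matches(of: "(?:__)(.+?(?=__))(?:__)", in: "__test__")

// Named group with a back-reference: the closing tag must repeat the opening name.
let paired = matches(of: "<(?<tagName>[a-z]+)>.*?</\\k<tagName>>",
                     in: "<b>bold</b><i>x</b>")

print(tags, greedy, emphasized, paired)
```

In the last call, the mismatched pair starting at <i> is skipped because no </i> follows it, which is the behavior one would want when extracting paired tags.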
Other good Regex resources:
- Swift Regex Quick Reference
- How Regular Expressions Work -> Refer to this when optimizing regex performance in this project.
- Regex Errors Leading to Infinite Searches, Causing Server Failures
- Regex101 for all regex rules
Swift Package Manager & CocoaPods
This was also my first time developing with SPM and CocoaPods… it was quite interesting. SPM is really convenient; however, if two projects depend on the same package, opening both projects simultaneously can lead to one of them not finding the package and failing to build.
ZMarkupParser has also been published to CocoaPods, but I haven’t tested it there since I use SPM. 😝
ChatGPT
In my actual development experience, I found it most useful for assisting with editing the README. I haven’t felt a significant impact during development; when asking mid-senior level questions, it often doesn’t provide accurate answers and sometimes gives incorrect ones (I encountered this when asking about regex rules, and the answers were not quite right). So, I ultimately returned to Google for accurate solutions.
As for asking it to write code: unless it is generating simple boilerplate objects, don’t expect it to complete an entire tool structure. (At least for now, it seems that Copilot might be more helpful for writing code.)
However, it can provide a general direction for knowledge gaps, allowing us to quickly understand how certain things should be done. Sometimes, when our grasp is too weak, it can be difficult to quickly locate the correct direction on Google, and that’s when ChatGPT becomes quite helpful.
Declaration
After more than three months of research and development, I am exhausted, but I want to clarify that this approach is merely a feasible result of my research; it may not be the best solution and may still have areas for optimization. This project is more of a stepping stone toward a perfect solution for converting a markup language to NSAttributedString. Contributions are very welcome; many aspects still need the power of the community to improve.
Contributing
Here are some areas I think could be improved as of now (2023/03/12), which I will document in the repo:
- Performance/algorithm optimization; although it’s faster and more stable than the native NSAttributedString.DocumentType.html, there is still room for improvement. I believe its performance is definitely not on par with XMLParser. I hope one day it can achieve the same performance while maintaining customization and automatic error correction.
- Support for more HTML tags and style attribute conversions.
- Further optimization of ZNSTextAttachment to implement reuse capabilities and release memory; may need to research CoreText.
- Support for Markdown parsing, as the underlying abstraction is not limited to HTML; thus, once the Markdown to Markup object is established, Markdown parsing can be completed. That’s why I named it ZMarkupParser instead of ZHTMLParser, hoping that one day it can also support Markdown to NSAttributedString.
- Support for Any to Any conversions, e.g., HTML to Markdown, Markdown to HTML, since we have the original AST tree (Markup object), so implementing conversions between any markup is possible.
- Implement CSS !important functionality to enhance the inheritance strategy of MarkupStyle.
- Strengthen HTML Selector functionality; currently, it only has the most basic filtering capabilities.
- So many more improvements; feel free to open an issue.
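To illustrate why the "Any to Any" item above is feasible once an AST exists, here is a rough sketch that renders one tree into two output formats. MarkupNode, markdown(from:), and html(from:) are hypothetical, simplified stand-ins invented for this example; they are not ZMarkupParser's actual Markup types or API, and escaping is omitted.

```swift
// Hypothetical, simplified markup tree (NOT ZMarkupParser's real Markup objects).
enum MarkupNode {
    case text(String)
    case bold([MarkupNode])
    case italic([MarkupNode])
    case link(href: String, children: [MarkupNode])
}

/// Walks the tree and renders it as Markdown.
func markdown(from nodes: [MarkupNode]) -> String {
    nodes.map { node -> String in
        switch node {
        case .text(let string):
            return string
        case .bold(let children):
            return "**" + markdown(from: children) + "**"
        case .italic(let children):
            return "_" + markdown(from: children) + "_"
        case .link(let href, let children):
            return "[" + markdown(from: children) + "](" + href + ")"
        }
    }.joined()
}

/// Walks the same tree and renders it as HTML.
func html(from nodes: [MarkupNode]) -> String {
    nodes.map { node -> String in
        switch node {
        case .text(let string):
            return string
        case .bold(let children):
            return "<b>" + html(from: children) + "</b>"
        case .italic(let children):
            return "<i>" + html(from: children) + "</i>"
        case .link(let href, let children):
            return "<a href=\"" + href + "\">" + html(from: children) + "</a>"
        }
    }.joined()
}

let tree: [MarkupNode] = [
    .bold([.text("Test"), .link(href: "https://example.com", children: [.text("Link")])])
]
print(markdown(from: tree))
print(html(from: tree))
```

Each output format is just another traversal of the same tree, so adding a new target would mean adding one more renderer rather than a new parser.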
Summary
This concludes all the technical details and my journey in developing ZMarkupParser. It took nearly three months of after-work and weekend time, with countless rounds of research and experimentation, writing tests, improving test coverage, and setting up CI; finally, I have a somewhat presentable result. I hope this tool helps those who face similar challenges, and I look forward to everyone working together to make it even better.
Currently, it is used in our company’s (pinkoi.com) iOS app, and I haven’t encountered any issues. 😄
Further Reading
- ZMarkupParser HTML String to NSAttributedString Converter Tool
- String Rendering
- Asynchronous NSTextAttachments
If you have any questions or feedback, feel free to contact me.
This article was first published on Medium ➡️ Click Here
Automatically converted and synchronized using ZMediumToMarkdown and Medium-to-jekyll-starter.